Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

I Pulled Data From 1.5 Million US Websites – What Data Would You Want To Know?

It started with a question: how do I spend $300 in free GCP credits, and how much can I do with it? I began by figuring out how to query the HTTP Archive, pulling CrUX data to correlate sites, and learning a bit about BigQuery along the way. I went from ~12 million total sites down to 1.5 million that I could verify were live and had enough data to classify/categorize, and then built a front end to access the highlights.

So far, I’ve been focused on identifying key business segments with missing opportunities, classic one click misses, some schema mapping for business type, and wondering why in the world any sane business owner would use Weebly.

What would YOU want to know?

submitted by /u/gillygangopolus

720M+ Public Images Indexed With Full EXIF/IPTC/XMP Metadata — Searchable Via REST API/Web [OC]

Sharing a dataset resource that may be useful for researchers, data scientists, and investigators.

Image-Meta has indexed the embedded metadata (EXIF/IPTC/XMP) from ~720 million publicly accessible images using ExifTool. The data is queryable via a REST API & web rather than a bulk download.

**What’s in the dataset:**

– Camera make, model, and serial number

– Author, copyright, rights, title, description

– GPS coordinates (where present; subject to strict TOS and a paid tier, not freely available to the public)

– Software chain

– Creation, modification, and index dates

– Filename and document ID

– Creation, Modify Date, Date Found

– Extra JSON supplemental metadata in full per image

**Potential research uses:**

– Camera device attribution studies

– Metadata privacy/leakage research

– Image provenance and disinformation analysis

– Geospatial studies using embedded GPS

– Timeline reconstruction of image publication

**Access:**

Available on the web or via REST API, with field-level boolean search and date ranges.

https://image-meta.com

API docs: https://image-meta.com/api-docs

submitted by /u/cstadler

Free JSON Dataset: 50 Traditional Recipes From 25 Countries (ingredients + Instructions)

I just released a free sample dataset of 50 traditional recipes from 25 countries.
Each recipe includes:
Ingredients
Step-by-step instructions
Prep time & cook time
Serving size
Format: JSON
The full dataset contains 1,925 recipes from 194 countries and is available on HuggingFace under the name:
“FoodieAtlas World Traditional Recipes Dataset”
Disclosure: I am the creator of this dataset.

submitted by /u/BayJeolog

Finance Database: 300,000+ Financial Instruments With Rich Metadata, Free And Queryable Via Python

Finding a clean, structured list of financial instruments has always been harder than it should be. Bloomberg sells it. Refinitiv sells it. Yahoo Finance gives you a search bar. If you want “all biotech companies listed in Germany” or “all fixed income ETFs from Vanguard” as a filterable dataset, you’re usually either scraping something or paying for a data vendor. I’ve spent the last few years building and maintaining a free alternative.

The Finance Database covers seven asset classes across 300,000+ symbols:

| Asset Class | Count | Dimensions |
|---|---|---|
| Equities | 160,869 | 11 sectors, 68 industries, 117 countries, 84 exchanges |
| Indices | 91,181 | 63 exchanges |
| Funds | 57,853 | 1,540 families, 74 categories |
| ETFs | 36,483 | 320 families, 51 categories |
| Cryptocurrencies | 3,367 | 351 base currencies |
| Currencies | 2,556 | 175 currency pairs |
| Money Markets | 1,367 | 2 exchanges |

Each equity record includes: symbol, name, currency, sector, industry group, industry, exchange, market, country, city, market cap tier, ISIN, CUSIP, FIGI, composite FIGI, share class FIGI, and website. ETFs and funds carry family, category group, and category instead of GICS-style classification. Every record has what you need to cross-reference against other data sources.

The data is an aggregation of publicly available sources – no paid API required to use the database itself. It is community-maintained, MIT-licensed, and lives on GitHub as CSV files you can open in Excel if that’s your preference.

The Python package gives you structured filtering and text search:

```python
# Install via: pip install financedatabase -U
import financedatabase as fd

equities = fd.Equities()

# All semiconductor companies in Taiwan on primary listings only
equities.select(country="Taiwan", industry="Semiconductors", only_primary_listing=True)

# Free-text search: robotics or automation companies on the Frankfurt exchange
equities.search(summary=["Robotics", "Automation"], index=".F")

# Explore what's available before filtering
fd.show_options("equities")
```

The show_options call is useful before you filter – it returns every distinct value per column without loading the full dataset, so you can scope your query without memory overhead.

For anyone doing universe construction for backtests or systematic strategies, the ISIN/FIGI coverage is the most practical part. You can pull a filtered symbol list here and pipe it directly into your price data provider.

The database is not a price or fundamentals source – that’s intentional. Metadata and categorization data are the hard part to get for free; for prices and fundamentals I’ve built a separate tool, the Finance Toolkit.

GitHub page: https://github.com/JerBouma/FinanceDatabase

submitted by /u/Traditional_Yogurt

I Processed The Entire ArXiv LaTeX Source Corpus (3M+ Papers) Into A Metadata-aligned Parquet Dataset To Save On S3 Egress Fees

I’ve spent the last few weeks working on a pipeline to solve a problem that has frustrated me (and likely other researchers) for a while: working with arXiv source files at scale.

If you have ever tried to analyze the LaTeX source code of arXiv papers, you have probably run into two major roadblocks:

  1. The Egress Tax: arXiv’s official bulk S3 bucket is configured as “requester-pays.” If you try to download the complete 5 TB corpus to any machine outside of the AWS us-east-1 region, you get hit with standard egress fees. At $0.09 per GB, a single full download can cost over $450 in bandwidth alone.
  2. Unpacking Pain: The raw S3 data is packaged as hundreds of nested .tar archives containing gzipped payloads of individual papers. Extracting these, parsing the inner LaTeX code, and matching the files with their JSON metadata snapshots is quite CPU-intensive and requires a lot of boilerplate ingestion code.
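The egress figure in the first point is simple arithmetic. Here is a quick sketch, assuming the flat $0.09 per GB rate and the 5 TB corpus size quoted above. Real AWS pricing is tiered and changes over time, so treat the result as a rough estimate; the function name is made up for illustration.

```python
# Rough egress estimate for a requester-pays download.
# Assumptions: a flat $0.09/GB rate and a 5 TB corpus, both taken from the post.
# Actual AWS pricing is tiered, so this is only a ballpark figure.

def egress_cost_usd(size_tb: float, rate_per_gb: float = 0.09, gb_per_tb: int = 1000) -> float:
    """Return the estimated transfer cost in US dollars."""
    return size_tb * gb_per_tb * rate_per_gb

decimal_cost = egress_cost_usd(5)                 # counting 1 TB as 1000 GB
binary_cost = egress_cost_usd(5, gb_per_tb=1024)  # counting 1 TB as 1024 GB

print(f"decimal TB: ${decimal_cost:.2f}")
print(f"binary TB:  ${binary_cost:.2f}")
```

Counting a terabyte as 1000 GB gives about $450, and counting it as 1024 GB gives about $461, which is consistent with the "over $450" figure.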

To make this easier, I built a pipeline that runs inside AWS us-east-1 (where transfer is free), pulls the raw source files, unpacks them, matches them with the official metadata, and bundles them into ready-to-query Parquet partitions.

What is inside:

Each row represents a single paper and contains both the official metadata and the parsed source files:

  • Core Metadata: id, title, authors, abstract, doi, categories, license, versions, etc.
  • latex (Large String): The parsed, compiled LaTeX source code from the paper. I wrote a parser to bundle the primary .tex, .bib, and .sty files into a single, readable Markdown-style tree structure.

Maintenance & Syncing:

  • Monthly Updates: I plan to sync the pipeline once a month to capture new uploads.
  • Resilient Syncing: I maintain an XML manifest file in the HuggingFace repository (arxiv_parquet_manifest.xml) that maps each Parquet partition to its size, MD5 checksum, and the raw S3 .tar source files used to generate it. This should make incremental syncing or troubleshooting much easier.
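Since the manifest lists an MD5 checksum per partition, a downloaded file can be checked before use. Below is a minimal sketch using only the standard library. The manifest's XML layout is not described in the post, so this code takes the expected checksum as a plain string rather than parsing the manifest, and the demo file is a throwaway stand-in, not a real partition.

```python
import hashlib
import os
import tempfile

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large partitions never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def partition_is_intact(path: str, expected_md5: str) -> bool:
    """Compare a local file's MD5 against the checksum listed for it."""
    return md5_of_file(path) == expected_md5.strip().lower()

# Demo with a temporary file standing in for a downloaded partition.
payload = b"stand-in bytes, not real Parquet data"
with tempfile.NamedTemporaryFile(delete=False, suffix=".parquet") as tmp:
    tmp.write(payload)
    demo_path = tmp.name

ok = partition_is_intact(demo_path, hashlib.md5(payload).hexdigest())
bad = partition_is_intact(demo_path, "0" * 32)  # deliberately wrong checksum
os.remove(demo_path)

print("matching checksum accepted:", ok)
print("wrong checksum accepted:", bad)
```

Reading in chunks matters here because individual partitions may be far larger than available RAM.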

If you are working on NLP, training LLMs on scientific text, analyzing citation networks, or doing sociolinguistic research, hopefully this saves you some time and cloud budget.

submitted by /u/Invicto_50

[OSS] Open Dataset: All 78 Tarot Card Meanings (upright + Reversed, Structured) With A Zenodo DOI

I built a clean, structured dataset of all 78 Rider-Waite tarot card meanings. Each entry has upright + reversed interpretations plus separate love / career / general context fields, so it’s usable for NLP, recommender experiments, or hobby projects.

Released open with a permanent DOI so it’s citable.

– Hugging Face: https://huggingface.co/datasets/Blacik/deckaura-tarot-card-meanings

– DOI (Zenodo): https://doi.org/10.5281/zenodo.19475329

Happy to take feedback on the schema or labeling. If anyone uses it in a project I’d love to see what you build.

submitted by /u/Dry_Issue282

Launch: Source Streams For Data Discovery

Hey there! I am the founder of Brickroad, a frontier AI lab building agentic infrastructure for data provisioning.

Super excited to share that source streaming is now live on Brickroad. Set your search parameters once, and your agent runs continuously, notifying you the moment a new data supplier comes online.

For those who rely on data to get a performance edge, the directories, the catalogs, the curated lists of “alternative data providers” — they are useful, but they are lagging indicators of alpha. A vendor only lands in one of these catalogs after they have built a website, hired a salesperson, and shopped themselves to enough buyers that an analyst notices. By then, the first ten funds, AI labs, and corporates have already signed contracts. The information edge has dissipated into consensus.

We launched the Information Frontier Agent to compress that lag. A Source Stream is a continuous, agent-driven feed of novel data suppliers that match a thesis you define. The agent runs in the background indefinitely, scanning the complete corpus of its resources to find new suppliers that fit your criteria. Every time it finds a new supplier, the agent notifies you and logs the source into your lead table.

It’s free to trial – we’d love your feedback.

submitted by /u/EmetResearch

Federal Contractor Violations Dataset [dataset][self-promotion]

I built a dataset joining USAspending federal contract awards to seven federal enforcement databases at the contractor level: OSHA, WHD, MSHA, EPA ECHO, NLRB, SEC, the UVA Corporate Prosecution Registry, and the SAM.gov debarment list. 5,557 contractors with documented violations, $3.19T in lifetime federal contracts, 758 OSHA-investigated fatalities.

The novel slice is the multi-agency overlap. Roughly 2000 contractors appear in 2+ federal enforcement databases. 500 in 3+. 70 in 4+. Topping the 4+ cohort by lifetime contract value: Raytheon ($68B, OSHA + WHD + NLRB + SEC + UVA), GE ($47B, same five), Merck, Microsoft, Austal USA, Marinette Marine.
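The overlap counts boil down to "how many distinct enforcement sources does each contractor appear in". Here is a small sketch of that tally, assuming the data can be flattened to (contractor, agency) pairs. The rows are placeholders, not records from the dataset, and the real column names may differ.

```python
# Toy (contractor, agency) pairs; the contractor names are placeholders.
rows = [
    ("contractor_a", "OSHA"), ("contractor_a", "WHD"), ("contractor_a", "NLRB"),
    ("contractor_b", "OSHA"), ("contractor_b", "OSHA"),  # repeat hits in one source count once
    ("contractor_c", "EPA ECHO"), ("contractor_c", "SEC"),
]

# Collect the distinct set of agencies per contractor.
agencies_by_contractor = {}
for contractor, agency in rows:
    agencies_by_contractor.setdefault(contractor, set()).add(agency)

def count_with_at_least(n: int) -> int:
    """Number of contractors appearing in at least n distinct sources."""
    return sum(1 for agencies in agencies_by_contractor.values() if len(agencies) >= n)

overlap = {n: count_with_at_least(n) for n in (2, 3, 4)}
print(overlap)
```

Using a set per contractor keeps multiple violations within one agency from inflating the overlap count.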

Hugging Face: https://huggingface.co/datasets/FastDOLz/Federal-Contractor-Violations-Dataset

Kaggle: https://www.kaggle.com/datasets/benturneroffice365/federal-contractor-violations-dataset

Zenodo DOI (all versions): https://doi.org/10.5281/zenodo.20777627

Methodology + limitations: https://www.fastdol.com/methodology

CC-BY-4.0.

Disclosure: I run FastDOL (https://www.fastdol.com), a federal workplace-enforcement search by employer, where this corpus comes from. Free for individual lookups; the dataset is one of several full extracts.

submitted by /u/chill-botulism

Inconsistency And Differences Among Fire Datasets From FDNY

Hello Friends,

I am interested in exploring data on fires that have happened in NYC for various spatiotemporal analyses. I came across the following datasets on open data platforms:

[Fire Incident Dispatch Data from NYC open data](https://data.cityofnewyork.us/Public-Safety/Fire-Incident-Dispatch-Data/8m42-w767/about_data)

[Incidents Responded to by Fire Companies (NYFIRS)](https://data.cityofnewyork.us/Public-Safety/Incidents-Responded-to-by-Fire-Companies/tm6d-hbzd/about_data)

[NFIRS](https://fema.hub.arcgis.com/search?collection=dataset&tags=nfirs)

What I noticed is that there are a lot of inconsistencies across these datasets, and the volume of data decreases dramatically from dispatch to NYFIRS and NFIRS.
Please share how you handle these datasets for more granular analysis.

submitted by /u/Usual-Cost-6848

FDA Novel Drug Approvals (2021–2024) + US Nonprofit Hospital Charity-care Reporting — Parquet/JSON/CSV, Public Domain

Disclosure: I’m the author of the open-source project (trove) that parses and repackages these. Original government sources are linked below; my bundles are at the end. MIT code, public-domain data, nothing paid.

Two public-domain US healthcare datasets that get cited constantly but are painful to use in raw form:

  1. FDA novel drug approvals, 2021–2024 — 218 drugs (192 CDER NMEs + 26 CBER cell & gene therapies). Each row: application number, sponsor, approval date, indication, regulatory center, and a deep link to the approval-package docs.

Original sources:

– CDER Novel Drug Approvals: https://www.fda.gov/drugs/development-approval-process-drugs/novel-drug-approvals-fda

– CBER Approved Cellular and Gene Therapy Products: https://www.fda.gov/vaccines-blood-biologics/cellular-gene-therapy-products/approved-cellular-and-gene-therapy-products

– Drugs@FDA: https://www.fda.gov/drugsatfda

  2. Nonprofit hospital charity-care reporting, TY2022: 1,295 nonprofit hospital systems, with CMS HCRIS Worksheet S-10 and IRS Form 990 Schedule H side by side. Both lines are meant to capture the cost of care for patients who couldn’t pay, but the rules diverge, so the two numbers often disagree. Each row also carries a CDC Social Vulnerability Index county percentile and a deep link to the 990 on ProPublica.

Original sources:

– CMS HCRIS (Hospital 2552-10 cost reports): https://www.cms.gov/data-research/statistics-trends-and-reports/cost-reports/hospital-2552-2010-form

– IRS Form 990 series XML downloads: https://www.irs.gov/charities-non-profits/form-990-series-downloads

– CDC Social Vulnerability Index 2022: https://www.atsdr.cdc.gov/place-health/php/svi/index.html

– ProPublica Nonprofit Explorer (where the 990 deep links point): https://projects.propublica.org/nonprofits/

What I added on top: parsing the raw formats (headerless 100k-row HCRIS CSVs, IRS bulk-XML ZIPs, hundreds of FDA PDF directories) into tidy Parquet/JSON/CSV, plus a CCN↔EIN crosswalk that joins the two hospital filings.

My packaged bundles + parsers (self-promo — I built this): https://github.com/cbetz/trove — browsable lookup at https://troveproject.com

Happy to answer questions about the parsing or add fields people want!

submitted by /u/scrapdog

Anti-bot / WAF Adoption Across The Top 1,000,000 Websites — Open Dataset (CC BY 4.0, ~1M Rows) [self-promotion]

I scanned the Tranco top 1,000,000 sites (June 2026) and recorded, per domain, which anti-bot/WAF vendor protects it and whether a plain request gets challenged. Releasing it as open data.

– 998,497 probed, 818,614 reachable

– Fields: domain, rank, reachable, protected, vendor(s), kind (waf/captcha/bot_management/…), difficulty band, block reason, enforcement, CAPTCHA type, final URL, status, probed_at — names only, no PII

– Plus a top-50k “deep-page census” (86,792 rows) with a page_type field (homepage vs product/listing/profile)

– License: CC BY 4.0

Headline: 53.5% of reachable sites run a managed anti-bot/WAF (Cloudflare ~45%), but only 9.8% actively challenged the request. The busiest sites run the least (top-1k 44% → long tail 54%).
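For anyone who wants to recompute a headline share from the JSONL, here is a minimal sketch. It assumes `reachable` and `protected` are stored as JSON booleans, which the field list suggests but does not confirm. The sample rows are made up and are not taken from the dataset.

```python
import io
import json

def protected_share(lines):
    """Return the fraction of reachable rows that are flagged as protected."""
    reachable = 0
    protected = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        row = json.loads(line)
        if row.get("reachable"):
            reachable += 1
            if row.get("protected"):
                protected += 1
    return protected / reachable if reachable else 0.0

# Made-up rows for illustration only.
sample = io.StringIO(
    '{"domain": "a.example", "reachable": true, "protected": true}\n'
    '{"domain": "b.example", "reachable": true, "protected": false}\n'
    '{"domain": "c.example", "reachable": false, "protected": false}\n'
    '{"domain": "d.example", "reachable": true, "protected": true}\n'
)
share = protected_share(sample)

# Reachability rate implied by the counts quoted in the post.
reachable_rate = 818_614 / 998_497

print(f"protected share of reachable sample rows: {share:.1%}")
print(f"reachable rate from the post's counts: {reachable_rate:.1%}")
```

The function accepts any iterable of lines, so a file opened with `gzip.open(path, "rt")` can be passed in directly for the gzipped release.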

Dataset (gzipped JSONL + sample + summary.json): https://github.com/Crawlora-org/anti-bot-adoption-index-data

Open-source detector CLI: go install github.com/Crawlora-org/crawlora-antibot@latest

submitted by /u/the_bigbang

Would You Be Interested In Daily Updated Fund Holdings?

Hey,

I’m planning to add broad support for daily updated fund holdings!

Problem: SEC N-PORT data lags behind a LOOOOONG time when it comes to fund holdings.

Solution: Funds actually release holdings with much more up-to-date information on their website. It’s just a huge hassle to actually fetch them reliably.

If I were to say that I have found a reliable way to pull this off for a large and expanding set of funds, would you be interested in that kind of data?

submitted by /u/Either_Door_5500

Using Kaggle’s International Football Dataset (1872–2026) For Live World Cup Elo Rankings

Built a site that uses the Kaggle international football results dataset to compute Elo ratings and championship probabilities for World Cup 2026 in real time.
Layered on top: AI-generated match reports combining live data with news sentiment via OpenRouter.
Site: skorradar.live — the methodology is explained in the About section. Curious if anyone has thoughts on improving the Elo calibration for tournament play vs. friendlies.
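For context on the calibration question, one common approach is to vary the K-factor by match type so that tournament games move ratings more than friendlies. The sketch below uses the standard Elo formulas. The K values are illustrative assumptions, not necessarily what the site uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a: float, rating_b: float, score_a: float, k: float):
    """Return both new ratings; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Hypothetical weights: tournament matches count more than friendlies.
K_BY_MATCH_TYPE = {"friendly": 20.0, "world_cup": 60.0}

# Two evenly rated teams, same result, different match types.
friendly = update(1500.0, 1500.0, 1.0, K_BY_MATCH_TYPE["friendly"])
world_cup = update(1500.0, 1500.0, 1.0, K_BY_MATCH_TYPE["world_cup"])

print("after friendly win: ", friendly)
print("after World Cup win:", world_cup)
```

With equal ratings the expected score is 0.5, so the winner gains half of K in each case: 10 points for the friendly and 30 for the World Cup match.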

submitted by /u/tremdem

Driver Drowsiness Datasets For South Asians?

Hi! Like my title states, I was wondering whether anyone has any good datasets of driver drowsiness, or just drowsiness in general, for South Asian people? Or Asians in general, actually, because my project caters to a smaller demographic in my country (Sri Lanka). It would also be a major advantage if any of you could help with datasets that have driver fatigue data in low-light conditions, or with people wearing glasses / sunglasses.

thank you! I’d really appreciate it 🙂

submitted by /u/Defiant-Ad3530

[Self-Promotion] [PAID] Free US, UK And Australian Robotics Data Samples

Disclosure: I work with a team that collects and licenses paid robotics training datasets.

I’ve been speaking with robotics teams about human demonstration data, and every team seems to evaluate it differently.

Some only need egocentric video, while others require synchronized wrist views, task labels, collection metadata and licensing documentation.

We currently have small evaluation samples from the US, UK and Australia, covering:

Egocentric demonstrations
Egocentric + two wrist views
Task and step labels
Country and collection metadata

The small evaluation samples are free, but the complete datasets and custom collection services are paid.

For teams working on robot manipulation or embodied AI, what do you normally check first?

Camera coverage, task diversity, collection country, metadata quality or licensing?

I’m mainly trying to understand what makes a sample genuinely useful before preparing more of them.

submitted by /u/WideAmbition1964

Is Anyone Here Interested In A ‘Filipino Recipe Dataset’ Containing 1,574 Recipes?

📊 Filipino Recipe Dataset: 1,574 Recipes

I've compiled a clean, structured dataset of Filipino recipes scraped from a top Filipino recipe site. Perfect for food tech startups, recipe apps, meal planners, nutrition analysis, or AI training data.

What's included:

• 1,574 recipes spanning 2009–2026
• Complete ingredients list with measurements (every recipe)
• Step-by-step cooking instructions (every recipe)
• Full nutritional data per serving: calories, protein, fat, carbs, fiber, sugar, sodium, etc. (97% of recipes)
• Prep time, cook time, total time
• YouTube video links (31% of recipes)
• User ratings and vote counts (28% of recipes)
• Categories, cuisines, and keywords
• High-resolution image URLs

Data format: Clean JSON, ready to import into any application or database.

Use cases:

- Build a Filipino recipe search engine or mobile app
- Train a recipe recommendation model
- Analyze Filipino cuisine nutrition trends
- Power a meal planning or grocery list tool
- Academic research on Southeast Asian food culture

DM me if interested. Can provide a sample file upon request.

submitted by /u/JonretsTheFriendly

Need Dataset For Photovoltaic Output

I am writing a thesis. For this I need a data set that includes the effects of environmental conditions on solar panel energy output: cloud cover, temperature, wind, precipitation, atmospheric pressure, etc.

If anyone knows where I can get a large data set with all of this, I’d appreciate it.

submitted by /u/Complex-Branch-4754

381 Model Merging Papers From ArXiv + Semantic Scholar; Quality-scored JSONL, Free

Sharing a dataset I built. Disclosure: this is my project. Free to download and use.

https://huggingface.co/datasets/fineset-io/model-merging-papers

Stats:

– 381 records, 2021–2026

– Sources: arXiv + Semantic Scholar, cross-referenced by arxiv_id and DOI

– quality_score: 0-1, citation-normalized

Fields: id, title, abstract, authors, categories, published_date, citation_count, quality_score, has_code, code_url, venue

The most-cited paper in the set is “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time” (1,565 citations, 2022); if you’re doing any merging work this is probably already in your reading list, but the rest of the dataset has 380 more.

109 papers have code repos; filter has_code=true if you want reproducible implementations.
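A minimal sketch of that filter over JSONL records, assuming `has_code` is stored as a JSON boolean and `quality_score` as a number. The three records are hypothetical and use only a subset of the listed fields.

```python
import json

# Hypothetical records; the ids, scores, and URLs are made up.
raw_lines = [
    '{"id": "paper-1", "quality_score": 0.91, "has_code": true, "code_url": "https://example.org/repo-1"}',
    '{"id": "paper-2", "quality_score": 0.40, "has_code": false, "code_url": null}',
    '{"id": "paper-3", "quality_score": 0.75, "has_code": true, "code_url": "https://example.org/repo-3"}',
]

records = [json.loads(line) for line in raw_lines]

# Keep only papers with a code repository, highest quality_score first.
with_code = sorted(
    (record for record in records if record.get("has_code")),
    key=lambda record: record["quality_score"],
    reverse=True,
)
ids = [record["id"] for record in with_code]

print(ids)
```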

Built with FineSet (fineset.io). Sign up free to get daily-refreshed datasets on your own topic.

submitted by /u/fineset-io

Tested Some Proxy Providers For City-level Geotargeting And Most Of Them Lied To Me

Just finished a few weeks of testing proxy providers for a project that needs accurate location data. I'm pulling localized pricing, so if the geo is wrong the whole thing is useless.
Short version: most of the advertised coverage numbers are pretty meaningless. I had requests that allegedly originated from certain cities actually come out of completely different areas. Not a little bit off, either: wrong-country off on a couple of them.
Across all of the providers I tested, ASN targeting was far more reliable than city targeting. If you need location accuracy, that’s probably where to start rather than trusting city-level claims.

One provider did truly better than the rest on consistency. Happy to chat through what I found if anyone has the same problem.

submitted by /u/Infinity-artist