Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

The ENTIRE Epstein Files Dataset Is Now Fully Viewable

So, I was going through Hugging Face and was wondering. Hmm, did someone upload the ENTIRE new Epstein Files? And I found out, nobody did. Nobody uploaded the complete ones and even worse, nobody processed them well..

So, Ladies and Gents, here is the full dataset, easily processable for everyone. If you want to recreate what I did, here is the GitHub: GitHub Link

What does the dataset include? Audio files, Videos, Images, PDF texts.. even excel files?

Questions? Just ask.
Compliments? Just give me some.
Love y’all ❤️

submitted by /u/itsnikity

Made A Fast Go Downloader For Massive Files (beats Aria2 By 1.4x)

Hey guys, we’re a couple of CS students who got annoyed with slow single-connection downloads, so we built Surge. Figured the datasets crowd might find it handy for scraping huge CSVs or image directories.

It’s a TUI download manager, but it also has a headless server mode which is perfect if you just want to leave it running on a VPS to pull data overnight.

  • It splits files and maximizes bandwidth by using parallel chunk downloading.
  • It is much more stable and faster than downloading through a browser like Chrome or Firefox!
  • You can use it remotely (over LAN for something like a home lab).
  • You can deploy it easily via Docker Compose.
  • We benched it against standard tools and it beat aria2c by about 1.38x, and was over 2x faster than wget.
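
For anyone curious what parallel chunk downloading looks like, here is a rough Python sketch of the general idea. It is not Surge's actual Go code. It assumes the server honors HTTP Range requests and that you already know the file size (for example from a Content-Length header); the function names and the default of 4 parts are made up for illustration.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size: int, parts: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into up to `parts` inclusive byte ranges."""
    if total_size <= 0 or parts <= 0:
        return []
    parts = min(parts, total_size)
    base, extra = divmod(total_size, parts)
    ranges, start = [], 0
    for i in range(parts):
        # The first `extra` ranges absorb the remainder, one byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def fetch_range(url: str, start: int, end: int) -> bytes:
    """Fetch one inclusive byte range (assumes the server honors Range)."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def parallel_download(url: str, total_size: int, parts: int = 4) -> bytes:
    """Download all ranges concurrently and reassemble them in order."""
    ranges = split_ranges(total_size, parts)
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        # pool.map yields results in input order, so the chunks line up.
        chunks = pool.map(lambda r: fetch_range(url, *r), ranges)
        return b"".join(chunks)
```

A real downloader would also stream each chunk to disk, retry failed ranges, and check that the server answered with 206 Partial Content rather than the whole file.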

Check it out if you want to speed up your data scraping pipelines.

GH: github.com/surge-downloader/surge

submitted by /u/SuperCoolPencil

I Analyzed 25M+ Public Records To Measure Racial Disparities In Sentencing, Traffic Stops, And Mortgage Lending Across The US

I built three investigations using only public government data:

Same Crime, Different Time — 1.3M federal sentencing records (USSC, 2002-2024). Black defendants receive 3.85 months longer sentences than white defendants for the same offense, controlling for offense type, criminal history, and other factors.

Same Stop, Different Outcome — 8.6M traffic stops across 18 states (Stanford Open Policing Project). Black and Hispanic drivers are searched at 2-4x the rate of white drivers, yet contraband is found less often.

Same Loan, Different Rate — 15.3M mortgage applications (HMDA, 2018-2023). Black borrowers pay 7.1 basis points more and Hispanic borrowers 9.7 basis points more in interest rate spread, even after OLS regression controls.
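
The traffic-stop comparison above boils down to two rates per group: how often stopped drivers are searched, and how often a search actually finds contraband. A minimal sketch of that arithmetic follows. It is not the project's code; the field names and the toy records are invented purely for illustration.

```python
from collections import defaultdict


def stop_rates(stops):
    """Return {group: (search_rate, hit_rate)} from an iterable of stop dicts.

    Each stop is a dict with hypothetical keys: 'group' (str),
    'searched' (bool), and 'contraband_found' (bool).
    """
    counts = defaultdict(lambda: {"stops": 0, "searches": 0, "hits": 0})
    for stop in stops:
        c = counts[stop["group"]]
        c["stops"] += 1
        if stop["searched"]:
            c["searches"] += 1
            if stop["contraband_found"]:
                c["hits"] += 1
    rates = {}
    for group, c in counts.items():
        search_rate = c["searches"] / c["stops"]
        # Hit rate is undefined for a group that was never searched.
        hit_rate = c["hits"] / c["searches"] if c["searches"] else None
        rates[group] = (search_rate, hit_rate)
    return rates


# Toy records, invented purely for illustration (not real data).
toy_stops = [
    {"group": "A", "searched": True, "contraband_found": True},
    {"group": "A", "searched": False, "contraband_found": False},
    {"group": "A", "searched": False, "contraband_found": False},
    {"group": "A", "searched": False, "contraband_found": False},
    {"group": "B", "searched": True, "contraband_found": False},
    {"group": "B", "searched": True, "contraband_found": True},
    {"group": "B", "searched": False, "contraband_found": False},
    {"group": "B", "searched": False, "contraband_found": False},
]
rates = stop_rates(toy_stops)
```

A higher search rate combined with a lower hit rate for one group is the pattern the post describes.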

All data is public, all code is open source, and the interactive sites are free:

• samecrimedifferenttime.org (http://samecrimedifferenttime.org/)

• samestopdifferentoutcome.org (http://samestopdifferentoutcome.org/)

• sameloandifferentrate.org (http://sameloandifferentrate.org/)

Happy to answer questions about methodology.

submitted by /u/justiceindexhub

How Do MTGTop8 / Tcdecks And Other Actually Get Their Decklists? (noob Here)

Hello guys,

I’m looking into building a small tournament/decklist aggregator (just a personal project, something simple-looking), and I’m curious about the data sourcing behind the big sites like MTGTop8, Tcdecks, Mtgdecks, Mtggoldfish and others.

I doubt these sites are manually updated by people typing in lists 24/7. So, can you help me understand how they work?

Where do these sites “pull” their lists from? Is there an API for tournament results (besides the official MTGO ones), or is it 100% web scraping?

Does a public archive/database of historical decklists (from years ago) exist, or is everyone just sitting on their own proprietary data?

Is there a standard way/format to programmatically receive updated decklists from smaller organizers?

If anyone has experience with MTG data engineering or knows of any open-source scrapers/repos, any help is really appreciated.

thank you guys

submitted by /u/Dariospinett

Alternatives To The UDC (Universal Decimal Classification) Knowledge Taxonomy

I’ve been looking for a general taxonomy with breadth and depth, somewhat similar to the Dewey-Decimal, or UDC taxonomies.

I can’t find an expression of the Dewey-Decimal (and tbh it’s probably fairly out of date now) and while the UDC offers a widely available 2,500-concept summary version, it doesn’t go down into enough detail for practical use. The master-reference file is ~70k in size, but costs >€350 a year to license.

Are there any openly available, broad and deep taxonomical datasets that I can easily download that are both reasonably well-accepted as standards, and which do a good job of defining a range of topics, themes or concepts I can use to help classify documents and other written resources?

One minute I might be looking at a document that provides technical specifications for a data-processing system, the next, a summary of some banking regulations around risk-management, or a write-up of the state of the art in AI technology. I’d like to be able to tag each of these different documents within a standard scheme of classifications.

submitted by /u/ResidentTicket1273

“Why Does Our Scraping Pipeline Break Every Two Weeks?”

Most enterprise teams consider only the costs of proxy APIs and cloud servers, overlooking the underlying issue.

Senior Data Engineers, who command salaries of $150,000 or more, spend up to 30% of their time addressing Cloudflare blocks and broken DOM selectors. From a capital allocation perspective, assigning top engineering talent to manage website layout changes is inefficient when web scraping is not your core product.

The solution is not to purchase better scraping tools, but to shift from building infrastructure to procuring outcomes.

Forward-thinking enterprises are adopting Fully Managed Data-as-a-Service. In practice, this approach offers the following benefits:

Engineers are no longer required to fix broken scripts. The managed partner employs autonomous AI agents to handle layout changes and anti-bot systems seamlessly.

Instead of purchasing code, you secure a contract. If a target site undergoes a complete redesign overnight, the partner’s AI adapts, ensuring your data is delivered on time.

Extraction costs are capped, allowing your engineering team to focus on developing features that drive revenue.

A more reliable data supply chain is needed, not just a better scraper.

Is your engineering team focused on building your core product, or are they managing broken pipelines?

Multiple solutions are available; choose the one that best fits your needs.

submitted by /u/3iraven22

Lowest Level Of Geospatial Demographic Dataset

Where can I get block-level demographic data that I can run through a clip analysis tool to clip just the area I want, without it suffering any “casualties” (that is, pulling in the full data from an adjoining block group or ZIP code just because a small part of that adjoining area falls inside my area of interest)?

PS: I’ve tried the Census Bureau and NHGIS, and neither gives me anything I like. The Census Bureau is near useless, btw. I don’t mind paying one of those broker websites that charge like $20, but which one is credible? Please help.

submitted by /u/owuraku_ababio

Trying To Work With NOAA Coastal Data. How Are People Navigating This?

I’ve been trying to get more familiar with NOAA coastal datasets for a research project, and honestly the hardest part hasn’t been modeling — it’s just figuring out what data exists and how to navigate it.

I was looking at stations near Long Beach because I wanted wave + wind data in the same area. That turned into a lot of bouncing between IOOS and NDBC pages, checking variable lists, figuring out which station measures what, etc. It felt surprisingly manual.

I eventually started exploring here:
https://aquaview.org/explore?c=IOOS_SENSORS%2CNDBC&lon=-118.2227&lat=33.7152&z=12.39

Seeing IOOS and NDBC stations together on a map made it much easier to understand what was available. Once I had the dataset IDs, I pulled the data programmatically through the STAC endpoint:
https://aquaview-sfeos-1025757962819.us-east1.run.app/api.html#/

From there I merged:

  • IOOS/CDIP wave data (significant wave height + periods)
  • Nearby NDBC wind observations

Resampled to hourly (2016–2025), added a couple lag features, and created a simple extreme-wave label (95th percentile threshold). The actual modeling was straightforward.
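
For reference, the hourly resampling, lag features, and 95th-percentile label can be sketched in plain Python roughly as below. This is not the exact code from the project: the function names and the choice of lags are illustrative, the lags are taken by position so they assume an hourly series with no gaps, and the threshold is computed over the whole series (for real modeling you would probably fit it on the training split only).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def resample_hourly(samples):
    """Average (timestamp, value) samples into hourly buckets, sorted by hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return [(hour, mean(vals)) for hour, vals in sorted(buckets.items())]


def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def build_rows(hourly, lags=(1, 2), q=95):
    """Attach lagged wave heights and an extreme-wave label to each hour."""
    heights = [h for _, h in hourly]
    threshold = percentile(heights, q)
    rows = []
    # Skip the first max(lags) hours, which have no complete lag history.
    for i in range(max(lags), len(hourly)):
        row = {"time": hourly[i][0], "height": heights[i]}
        for lag in lags:
            row[f"height_lag{lag}"] = heights[i - lag]
        row["extreme"] = heights[i] >= threshold
        rows.append(row)
    return rows
```

With pandas the same steps are typically `resample`, `shift`, and `quantile`, but the stdlib version keeps the logic visible.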

What I’m still trying to understand is: what’s the “normal” workflow people use for NOAA data? Are most people manually navigating portals? Are STAC-based approaches common outside satellite imagery?

Just trying to learn how others approach this. Would appreciate any insight.

submitted by /u/Signal_Sea9103

Epstein File Explorer Or How I Personally Released The Epstein Files

[OC] I built an automated pipeline to extract, visualize, and cross-reference 1 million+ pages from the Epstein document corpus

Over the past ~2 weeks I’ve been building an open-source tool to systematically analyze the Epstein Files — the massive trove of court documents, flight logs, emails, depositions, and financial records released across 12 volumes. The corpus contains 1,050,842 documents spanning 2.08 million pages.

Rather than manually reading through them, I built an 18-stage NLP/computer-vision pipeline that automatically:

Extracts and OCRs every PDF, detecting redacted regions on each page

Identifies 163,000+ named entities (people, organizations, places, dates, financial figures) totaling over 15 million mentions, then resolves aliases so “Jeffrey Epstein”, “JEFFREY EPSTEN”, and “Jeffrey Epstein*” all map to one canonical entry

Extracts events (meetings, travel, communications, financial transactions) with participants, dates, locations, and confidence scores

Detects 20,779 faces across document images and videos, clusters them into 8,559 identity groups, and matches 2,369 clusters against Wikipedia profile photos — automatically identifying Epstein, Maxwell, Prince Andrew, Clinton, and others

Finds redaction inconsistencies by comparing near-duplicate documents: out of 22 million near-duplicate pairs and 5.6 million redacted text snippets, it flagged 100 cases where text was redacted in one copy but left visible in another

Builds a searchable semantic index so you can search by meaning, not just keywords
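
To give a feel for the alias-resolution step, here is a minimal Python sketch that folds OCR variants of a name into one canonical entry using string similarity. It is an illustration only, not the pipeline's actual method: the greedy first-match strategy, the 0.9 threshold, and the function names are all assumptions, and the ASCII-only normalization would need work for names with accents.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop anything that is not a-z or whitespace, collapse spaces."""
    cleaned = re.sub(r"[^a-z\s]", "", name.casefold())
    return " ".join(cleaned.split())


def resolve_aliases(mentions, threshold=0.9):
    """Greedily map each mention to the first canonical name it closely matches."""
    canonical = []  # normalized names, in first-seen order
    mapping = {}
    for mention in mentions:
        key = normalize(mention)
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, key, c).ratio() >= threshold),
            None,
        )
        if match is None:
            # Nothing close enough yet: this mention starts a new entry.
            canonical.append(key)
            match = key
        mapping[mention] = match
    return mapping


mapping = resolve_aliases([
    "Jeffrey Epstein",
    "JEFFREY EPSTEN",
    "Jeffrey Epstein*",
    "Ghislaine Maxwell",
])
```

Pairwise comparison like this is quadratic in the number of distinct names, so at 163,000+ entities a real system would need blocking or an index to keep it tractable.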

The whole thing feeds into a web interface I built with Next.js. Here’s what each screenshot shows:

  1. Documents: The main corpus browser. 1,050,842 documents searchable by Bates number and filterable by volume.

  2. Search Results: Full-text semantic search. Searching “Ghislaine Maxwell” returns 8,253 documents with highlighted matches and entity tags.

  3. Document Viewer: Integrated PDF viewer with toggleable redaction and entity overlays. This is a forwarded email about the Maxwell Reddit account (r/maxwellhill) that went silent after her arrest.

  4. Entities: 163,289 extracted entities ranked by mention frequency. Jeffrey Epstein tops the list with over 1 million mentions across 400K+ documents.

  5. Relationship Network: Force-directed graph of entity co-occurrence across documents, color-coded by type (people, organizations, places, dates, groups).

  6. Document Timeline: Every document plotted by date, color-coded by volume. You can clearly see document activity clustered in the early 2000s.

  7. Face Clusters: Automated face detection and Wikipedia matching. The system found 2,770 face instances of Epstein, 457 of Maxwell, 61 of Prince Andrew, and 59 of Clinton, all matched automatically from document images.

  8. Redaction Inconsistencies: The pipeline compared 22 million near-duplicate document pairs and found 100 cases where redacted text in one document was left visible in another. Each inconsistency shows the revealed text, the redacted source, and the unredacted source side by side.

Tools: Python (spaCy, InsightFace, PyMuPDF, sentence-transformers, OpenAI API), Next.js, TypeScript, Tailwind CSS, S3

Source: github.com/doInfinitely/epsteinalysis

Data source: Publicly released Epstein court documents (EFTA volumes 1-12)

submitted by /u/lymn

Where Are You Buying High-quality/unique Datasets For Model Training? (Tired Of DIY Scraping & AI Sludge)

Hey everyone, I’m currently looking for high-quality, unique datasets for some model training, and I’ve hit a bit of a wall. Off-the-shelf datasets on Kaggle or HuggingFace are great for getting started, but they are too saturated for what I’m trying to build.

Historically, my go-to has been building a scraper to pull the data myself. But honestly, the “DIY tax” is getting exhausting.

Here are the main issues I’m running into with scraping my own training data right now:

  • The “Splinternet” Defenses: The open web feels closed. It seems like every target site now has enterprise CDNs checking for TLS fingerprinting and behavioral biometrics. If my headless browser mouse moves too robotically, I get blocked.
  • Maintenance Nightmares: I spend more time patching my scripts than training my models.
  • The “Dead Internet” Sludge: This is the biggest risk for model training. So much of the web is now just AI-generated garbage. If I just blanket-scrape, I’m feeding my models hallucinations and bot-farm reviews.

I was recently reading an article about the shift from using web scraping tools (like Puppeteer or Scrapy) to using automated web scraping companies (like Forage AI), and it resonated with me.

These managed providers supposedly use self-healing AI agents that automatically adapt to layout changes, spoof fingerprints at an industrial scale, and even run “hallucination detection” to filter out AI sludge before it hits your database. Basically, you just ask for the data, and they hand you a clean schema-validated JSON file or a direct feed into BigQuery.

So, my question for the community is: Where do you draw the line between “Build” and “Buy” for your training data?

  1. Do you have specific vendors or marketplaces you trust for buying high-quality, ready-made datasets?
  2. Has anyone moved away from DIY scraping and switched to these fully managed, AI-driven data extraction companies? Does the “self-healing” and anti-bot magic actually hold up in production?

Would love to hear how you are all handling data sourcing right now!

submitted by /u/3iraven22

Newly Published Big Kink Dataset + Explorer

https://www.austinwallace.ca/survey

Explore connections between kinks, build and compare demographic profiles, and ask your AI agent about the data using our MCP.
I’ve built a fully interactive explorer on top of Aella’s newly released Big Kink Survey dataset: https://aella.substack.com/p/heres-my-big-kink-survey-dataset

All of the data is local in your browser using DuckDB-WASM: a ~15k representative sample of a ~1mil dataset.

No monetization at all, just think this is cool data and want to give people tools to be able to explore it themselves. I’ve even built an MCP server if you want to get your LLM to answer a specific question about the data!

I have taken a graduate class in information visualization, but that was over a decade ago, and I would love any ideas people have to improve my site! My color palette is fairly colorblind safe (black/red/beige), so I do clear the lowest of bars 🙂

https://github.com/austeane/aella-survey-site

submitted by /u/austeane

Thinking Of Open Sourcing A 250k Tables Dataset, Would This Be Valuable?

I’ve been working on a company for about 3 years with my co-founder. Our original goal was to build an intelligent document processing tool because we tried building a research co-pilot but found the document processing services available were bad. We got kind of carried away and built a data engine pipeline that reads in any LaTeX, cleans it, and brings it to an intermediate representation where we can apply any augmentation (color, alignment, spacing). However, this has been a massive undertaking (~200k lines of Python code), and to this point we have focused mostly on tables (the full document is written but it’s not refined or ready for production).

Due to our burnout and need to hit the real world, we decided to train an image -> Word, Excel, and LaTeX converter using an architecture similar to Nougat. It outperformed (except on robustness) basically all table extraction models we’ve seen (and we’ve studied them all), but launching something that only extracts tables is not really a commercial product (it lacks focus). So hardly anyone used it.

We were looking into different use cases for the technology, but kept finding that it required the full document and meaningfully higher robustness to be commercially viable. Furthermore, we are good at focusing on one thing and doing it perfectly, and training a model + launching a website + marketing are a lot of things that split our focus. Not to mention that there is a lot of (well funded) competition in the space and we’re just a team of two.

Then we got to thinking: what if we sold our data? We have a pipeline that lets us create virtually any table (eventually document) with any kind of source data, which can be augmented via an LLM. Then, because we bring it into a form where we have control, we can apply programmatic augmentations of any kind to said tables and then go to any output ground truth format (Word, JSON, LaTeX, HTML, …). That is to say, we have complete control and can generate any kind of data someone would need to improve their model.

So, we were thinking of dropping 250k tables + a benchmark based on our synthetic data (and real world validation) to demonstrate our capability and hopefully get companies that have custom requirements that can pay us to generate the data their model lacks. We can also probe the weaknesses of existing models similar to a security researcher and then offer our data as a solution.

What do you think? Is dropping 250k highly diverse and perfectly annotated tables (with multiple ground truth formats) a good idea? Would that be something that’s valuable to people and could gain traction?

We’re trying to be quick about it (next month or two) so publishing a paper or going to a conference probably isn’t the best move.

submitted by /u/Says_Watt

[self-promotion] Dataset Search For Kaggle & Huggingface

We made a tool for searching datasets and calculating their influence on capabilities. It uses second-order loss functions, which makes the solution tractable across model architectures. It can be applied irrespective of domain and has already helped improve several models trained near convergence as well as more basic use cases.

The influence scores act as a prioritization in training. You are able to benchmark the search results in the app.
The research is based on peer-reviewed work.
We started with Huggingface and this weekend added Kaggle support.

I’m looking for feedback and potential improvements.

https://durinn-concept-explorer.azurewebsites.net/

Currently supported models are causal LMs, but we have research demonstrating good results for multimodal support.

submitted by /u/New-Mathematician645

I Built An Open Hebrew Wikipedia Sentences Corpus: 11M Sentences From 366K Articles, Cleaned And Deduplicated

Hey all,

I just released a dataset I’ve been working on: a sentence-level corpus extracted from the entire Hebrew Wikipedia. It’s up on HuggingFace now:

https://huggingface.co/datasets/tomron87/hebrew-wikipedia-sentences-corpus

Why this exists: Hebrew is seriously underrepresented in open NLP resources. If you’ve ever tried to find a clean, large-scale Hebrew sentence corpus for downstream tasks, you know the options are… limited. I wanted something usable for language modeling, sentence similarity, NER, text classification, and benchmarking embedding models, so I built it.

What’s in it:

  • ~11 million sentences from ~366,000 Hebrew Wikipedia articles
  • Crawled via the MediaWiki API (full article text, not dumps)
  • Cleaned and deduplicated (exact + near-duplicate removal)
  • Licensed under CC BY-SA 3.0 (same as Wikipedia)

Pipeline overview: Articles were fetched through the MediaWiki API, then run through a rule-based sentence splitter that handles Hebrew-specific abbreviations and edge cases. Deduplication was done at both the exact level (SHA-256 hashing) and near-duplicate level (MinHash).
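
As a rough illustration of the two deduplication levels (not the exact code behind the dataset), here is a stdlib-only Python sketch. Exact duplicates are dropped by SHA-256 digest, and near-duplicates are scored by comparing MinHash signatures. The character 3-gram shingles, the 64 hash functions, the salted BLAKE2b hashing, and the whitespace stripping before hashing are all assumptions made for the example.

```python
import hashlib


def exact_dedup(sentences):
    """Keep the first occurrence of each sentence, keyed by SHA-256 digest."""
    seen, unique = set(), []
    for s in sentences:
        digest = hashlib.sha256(s.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique


def shingles(text, k=3):
    """Character k-grams of the text (one shingle if the text is shorter than k)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_perm=64):
    """One minimum hash value per salted hash function, over the text's shingles."""
    grams = shingles(text)
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree (estimates Jaccard similarity)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

At 11M sentences you would not compare every pair of signatures; the usual companion technique is locality-sensitive hashing, which buckets signatures so that only likely near-duplicates get compared.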

I think this could be useful for anyone working on Hebrew NLP or multilingual models where Hebrew is one of the target languages. It’s also a decent foundation for building evaluation benchmarks.

I’d love feedback. If you see issues with the data quality, have ideas for additional metadata (POS tags, named entities, topic labels), or think of other use cases, I’m all ears. This is v1 and I want to make it better.

submitted by /u/tomron87

Videos From DFDC Dataset https://ai.meta.com/datasets/dfdc/

The official page no longer has an S3 link and just goes blank. The alternatives are already-extracted images, not the videos. I want the videos for a recent competition. Any help is highly appreciated. I already tried:
1. kaggle datasets download -d ashifurrahman34/dfdc-dataset (not videos)
2. kaggle datasets download -d fakecatcherai/dfdc-dataset (not videos)
3. kaggle competitions download -c deepfake-detection-challenge (throws a 401 error as the competition has ended)
4. kaggle competitions download -c deepfake-detection-challenge -f dfdc_train_part_0.zip
5. aws s3 sync s3://dmdf-v2 . --request-payer --region=us-east-1

submitted by /u/Illustrious_Coast_68

Looking For Real Transport & Logistics Document Datasets To Validate My Platform

Hi everyone,

I’ve been building a platform focused on automated processing of transport and logistics documents, and I’m now at the stage where I need real-world data to properly test and validate it.

The system already handles structured and unstructured data for common logistics documents, including (but not limited to):

  • CMR (Consignment Note)
  • Commercial Invoices
  • Delivery Notes / POD
  • Bills of Lading
  • Air Waybills
  • Packing Lists
  • Customs documents
  • Certificates of Origin
  • Dangerous Goods Declarations
  • Freight Bills / Freight Invoices
  • And other related transport / logistics paperwork

Right now I’ve only used synthetic and manually designed document samples following publicly available templates, which aren’t representative of the complexity and messiness of real operations. I’m specifically looking for:

  • Anonymized / redacted real document sets, or
  • Companies, freight forwarders, carriers, 3PLs, etc. who are open to a collaboration where I can run their existing documents through the platform in exchange for insights, automation prototypes, or custom integrations.

I’m happy to sign NDAs, follow strict data handling rules, and either work with fully anonymized PDFs/images or set up a secure environment depending on what’s feasible.

Questions:

  • Do you know of any public datasets with realistic logistics documents (PDFs, scans, etc.)?
  • Are there any companies or projects that share sample packs for research or validation purposes?
  • Would anyone here be interested in collaborating or running a small pilot using their historical docs?

Any pointers, contacts, or links to datasets would be hugely appreciated.

Thanks in advance!

submitted by /u/AcanthisittaNo6887