Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it all for? World domination?

ENRON Dataset Request Without Spam Message

Hi

I am meant to investigate the ENRON dataset for a study, but the file’s size and messiness prove to be a challenge. Via Reddit, Kaggle, and GitHub I have found ways that people have explored this dataset, mostly regarding fraudulent spam (I assume in order to delete it?) or scripts that allow investigation of specific employees (e.g. the CEOs who ended up in jail because of the scandal).
For instance here: Enron Fraud Email Dataset
Now, my question is whether anyone has a CLEAN version of the Enron dataset, i.e. free from spam, OR has cleaned the dataset so that you can look at how fraudulent requests were made, questionable favours were asked, etc.

Any advice in this direction would be very welcome: I am not very fluent in Python and coding, so this dataset is proving challenging to work with as a social science researcher.

Thank you so much

Talia
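
For what it’s worth, a minimal sketch of the kind of spam filtering being asked for, assuming the common Kaggle emails.csv export of the corpus; the file name, column names, and spam markers are all assumptions to adapt, not a tested recipe:

import pandas as pd

# Assumes the Kaggle "emails.csv" export with "file" and "message" columns;
# the spam markers below are illustrative heuristics, not a vetted list.
df = pd.read_csv("emails.csv")

spam_markers = ["viagra", "lottery", "click here", "unsubscribe"]

def looks_like_spam(message: str) -> bool:
    head = message[:500].lower()  # headers plus the start of the body
    return any(marker in head for marker in spam_markers)

clean = df[~df["message"].astype(str).map(looks_like_spam)]
clean.to_csv("enron_no_spam.csv", index=False)
print(f"kept {len(clean)} of {len(df)} messages")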

submitted by /u/Whynotjerrynben
[link] [comments]

Building A Multi-source Feminism Corpus (France–Québec) – Need Advice On APIs & Automation

Hi,

I’m prototyping a PhD project on feminist discourse in France & Québec. Goal: build a multi-source corpus (academic APIs, activist blogs, publishers, media feeds, Reddit testimonies).

Already tested:

  • Sources: OpenAlex, Crossref, HAL, OpenEdition, WordPress JSON, RSS feeds, GDELT, Reddit JSON, Gallica/BANQ.
  • Scripts: Google Apps Script + Python (Colab).

Main problems:

  1. APIs stop ~5 years back (need 10–20 yrs).
  2. Formats are all over (DOI, JSON, RSS, PDFs).
  3. Free automation without servers (Sheets + GitHub Actions?).

Looking for:

  • Examples of pipelines combining APIs/RSS/archives.
  • Tips on Pushshift/Wayback for historical Reddit/web (see the sketch below).
  • Open-source workflows for deduplication + archiving.
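
On the Pushshift/Wayback item: Pushshift access has been restricted since 2023, so the Wayback Machine’s CDX API is the more dependable route for historical blog content. A minimal sketch; the target domain is a placeholder:

import requests

# List archived snapshots of a site via the Wayback Machine's CDX API.
# The domain is a placeholder; swap in the blogs you actually care about.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example-feminist-blog.org/*",
        "output": "json",
        "from": "2005",
        "to": "2020",
        "filter": "statuscode:200",
        "collapse": "digest",  # skip snapshots whose content did not change
    },
    timeout=60,
)
rows = resp.json()
header, snapshots = rows[0], rows[1:]  # first row is the field names

for row in snapshots[:10]:
    timestamp, original = row[1], row[2]
    # Each snapshot can be fetched from the replay URL below.
    print(f"https://web.archive.org/web/{timestamp}/{original}")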

Any input (scripts, repos, past experience) 🙏.

submitted by /u/Commercial-Soil5974
[link] [comments]

Looking For Narrative-style EDiscovery Dataset For Research

Hey folks – I’m working on a research project around eDiscovery workflows and ran into a gap with the datasets that are publicly available.

Most of the “open” collections (like the EDRM Micro Dataset) are useful for testing parsers because they include many file types – Word, PDF, Excel, emails, images, even forensic images – but they don’t reflect how discovery actually feels. They’re kinda just random files thrown together, without a coherent story or links across documents.

What I’m looking for is closer to a realistic “mock case” dataset:
• A set of documents (emails, contracts, memos, reports, exhibits) that tell a narrative when read together (even if hidden in a large volume of files)
• Something that could be used to test workflows like chronology building, fact-mapping, or privilege review
• Public, demo, or teaching datasets are fine (real or synthetic)

I’ve checked Enron, EDRM, and RECAP, but those either don’t have narrative structure or aren’t really raw discovery.

Does anyone know of (preferably free and public):
• Law school teaching sets for eDiscovery classes
• Vendor demo/training corpora (Relativity, Everlaw, Exterro, etc.)
• Any academic or professional groups sharing narrative-style discovery corpora

Thanks in advance!

submitted by /u/darkprime140
[link] [comments]

Why Are People Still Reconciling Data Manually?

In my last project, the expectation was that we would manually reconcile all the CSV exports.
Some people actually did it by hand, for real… I think people are crazy.

Anyway, apart from the automation, I put together a short presentation because it annoys me to see people losing so much time reconciling data.

In the slides I walk through the areas I think are important to fix, and how to catch discrepancies systematically, instead of relying on guesswork!

Not fancy, but it could save us hours if I send the right message.
Before I hand it over to the team, I thought I’d share it here, curious if anyone has suggestions or finds it useful too.
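
As a taste of what catching discrepancies systematically can look like, a minimal pandas sketch that diffs two CSV exports on a shared key; the file names, the invoice_id key, and the amount column are placeholders, not anything from the slides:

import pandas as pd

# Diff two CSV exports on a shared key and surface the discrepancies.
left = pd.read_csv("system_a.csv")
right = pd.read_csv("system_b.csv")

merged = left.merge(
    right, on="invoice_id", how="outer",
    suffixes=("_a", "_b"), indicator=True,
)

# Rows present in only one of the two exports.
missing = merged[merged["_merge"] != "both"]

# Rows present in both exports but disagreeing on the amount.
both = merged[merged["_merge"] == "both"]
mismatched = both[both["amount_a"] != both["amount_b"]]

print(f"{len(missing)} unmatched rows, {len(mismatched)} amount mismatches")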

Ask for the link in the comments.

Let’s keep in touch

submitted by /u/Future-Gene-3181
[link] [comments]

I Built A Comprehensive SEC Financial Data Platform With 100M+ Datapoints + API Access – Feel Free To Try Out

Hi Fellows,

I’ve been working on Nomas Research – a platform that aggregates and processes SEC EDGAR data, which can be accessed through a UI (data visualization) or an API (returns JSON). Feel free to try it out.

Dataset Overview

Scale:

  • 15,000+ companies with complete fundamentals coverage
  • 100M+ fundamental datapoints from SEC XBRL filings
  • 9.7M+ insider trading records (non-derivative & derivative transactions)
  • 26.4M FTD entries (failure-to-deliver data)
  • 109.7M+ institutional holding records from Form 13F filings

Data Sources:

  • SEC EDGAR XBRL company facts (daily updates)
  • Form 3/4/5 insider trading filings
  • Form 13F institutional holdings
  • Failure-to-deliver (FTD) reports
  • Real-time SEC submission feeds
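
For anyone who wants to poke at the underlying source data directly, a minimal sketch against SEC EDGAR’s public XBRL company-facts endpoint (this hits the SEC itself, not the Nomas API; the CIK shown is Apple’s, and the SEC asks for a descriptive User-Agent header):

import requests

# Fetch XBRL company facts straight from SEC EDGAR, the raw source this
# platform aggregates. CIKs are zero-padded to 10 digits; this one is Apple.
cik = "0000320193"
resp = requests.get(
    f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json",
    headers={"User-Agent": "research-script your-email@example.com"},
    timeout=60,
)
resp.raise_for_status()
facts = resp.json()

concepts = facts["facts"]["us-gaap"]
print(facts["entityName"], "-", len(concepts), "us-gaap concepts on file")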

Not sure if I can post the link here: https://nomas.fyi

submitted by /u/ccnomas
[link] [comments]

Built An IDE For Web Scraping In Javascript — Introducing Crawbots

We’ve been working on a desktop app called Crawbots — an all-in-one IDE for web data extraction. It’s designed to simplify the scraping process, especially for developers working with Puppeteer, Playwright, or Selenium.

We’re aiming to make Crawbots powerful yet beginner-friendly, so junior devs can jump in without fighting boilerplate or complex setups.

Would appreciate any thoughts, questions, or brutal feedback.

submitted by /u/varvolta
[link] [comments]

I Have Access To A Large US Real Estate Dataset — Looking For Collaborations / Buyers Who Can Benefit

Hey everyone,

I’ve been working on compiling and structuring one of the largest real estate datasets covering the entire US. It includes 100M+ properties across all states — single-family, multi-family, land, and even mobile homes.

The data is structured like this:

Address | City | County | State | Zip

Listed Owner(s) | Mailing Address | Mailing City | Mailing State | Mailing County | Mailing Zip

Bedrooms | Bathrooms | Is Auction | Equity %

This dataset can be really valuable for:

  • Real estate investors looking for deal flow
  • Startups building proptech tools
  • Agencies doing market analysis
  • Data-driven lead generation in real estate

I’m exploring opportunities to collaborate or provide this data to those who could benefit most.

submitted by /u/GoldTea7698
[link] [comments]

I Started Learning Data Analysis, Almost 60-70% Completed. I’m Confused

I’m 25 years old, learning data analysis and getting ready for a job. I’ve learned MySQL, advanced Excel, and Power BI. Now I’m learning Python and also practicing on real data. In the next 2 months I’ll be job-ready. But I’m worried: will I get a job at all? I haven’t given a single interview yet, and I hear data analysts face very high competition.

I’m giving my 100% this time. I’ve never been as focused as I am now, and I’m really confused…

submitted by /u/Old-Investment-6969
[link] [comments]

I’m Interested In Buying A House Within The Next 24 Months. Are There Any Data Sets Where I Could Find House Prices And/or Mortgage Rates In My Area To Narrow Down The Best Places To Buy? I’d Be Interested In Splitting My City Up Into Sectors Or Neighborhoods To Help Narrow This Down

I’m interested in buying a house soon and would like to take a look at neighborhoods. My work is in the center of my city, so I could theoretically live anywhere in town and it would be conveniently located relative to work. I’d like to see what datasets exist that I could consider for this little data project.

submitted by /u/EEJams
[link] [comments]

Was Using The IBM SPSS Software For Data Processing, This Thing Sucks

Most of my thesis work requires me to collect data, spend 20 minutes joining different tables and cleaning them, manually upload everything into the SPSS software, and then dig through the menus until I hit the right multiple regression or correlational analysis model depending on the use case, and finally gather all the different graphs and plots to create a neat PDF. Is it really supposed to be this hard? I am new to this. How are y’all doing it?
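
One common answer to that question: script the join, clean, and model steps once so they rerun in seconds. A minimal sketch with pandas and statsmodels; the file names and columns are placeholders, not anything from the post:

import pandas as pd
import statsmodels.formula.api as smf

# Scripted version of the join -> clean -> regress workflow described above.
# File and column names are placeholders; substitute your thesis data.
surveys = pd.read_csv("surveys.csv")
demographics = pd.read_csv("demographics.csv")

df = surveys.merge(demographics, on="participant_id").dropna(
    subset=["score", "age", "income"]
)

# Multiple regression: score ~ age + income
model = smf.ols("score ~ age + income", data=df).fit()
print(model.summary())  # coefficients, p-values, R^2 in one shot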

submitted by /u/Worried_Analyst_
[link] [comments]

I Need Help With Scraping Redfin URLS

Hi everyone! I’m new to posting on Reddit, and I have almost no coding experience, so please bear with me haha. I’m currently trying to collect some data from for-sale property listings on Redfin (I have about 90 right now but will probably need a few hundred more). Specifically, I want the estimated monthly tax and homeowner insurance expense shown in their payment calculator. I already downloaded all of the data Redfin will give you and imported it into Google Sheets, but it doesn’t include this information. I then tried getting ChatGPT to write me a Google Sheets script that could scrape the URLs in my spreadsheet, but it didn’t work; it thinks it failed because the payment calculator is JavaScript rather than HTML, and only renders after the URL loads. I also tried ScrapeAPI, which gave me a JSON file that I imported into Google Drive, and then had ChatGPT write a script to merge the URLs, find the data, and put it on my spreadsheet, but to no avail. If anyone has any advice for me, it’d be a huge help. Thanks in advance!
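
Since the sticking point is JavaScript that only renders after the page loads, a headless browser is the usual fix. A minimal Playwright-for-Python sketch, assuming the URLs sit in a text file; the CSS selector is a placeholder you’d have to find in Redfin’s real markup, and check the site’s terms of service before scraping at volume:

from playwright.sync_api import sync_playwright

# Render JS-heavy listing pages in a headless browser, then read values that
# only exist after load. Install: pip install playwright
# Then download a browser: playwright install chromium
urls = open("listing_urls.txt").read().split()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    for url in urls:
        page.goto(url, wait_until="networkidle")
        # ".payment-calculator" is a placeholder selector; inspect the real
        # page to find the element holding the tax/insurance figures.
        calc = page.locator(".payment-calculator")
        print(url, calc.inner_text() if calc.count() else "not found")
    browser.close()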

submitted by /u/Interesting_Rent6155
[link] [comments]

Best Datasets For US 10DLC Phone Number Lookups?

Trying to build a really good phone number lookup tool. Currently I have NPA-NXX blocks with the block carrier, start date, and line type, plus the same mapped to ZIP codes, cities, and counties. Any other good ones I should include for local data? The more the merrier. Also willing to share the current datasets I have, as they’re a pain in the ass to find online.

submitted by /u/MiloCOOH
[link] [comments]

Seeking NCAA Division II Baseball Data API For Personal Project

Hey folks,

I’m kicking off a personal project digging into NCAA Division II baseball, and I’m hitting a wall trying to find good data sources. Hoping someone here might have some pointers!

I’m ideally looking for something that can provide:

  • Real-time or frequently updated game stats (play-by-play, box scores)
  • Seasonal player numbers (like batting averages or ERA)
  • Team standings and schedules

I’ve already poked around at the usual suspects (official NCAA stuff and the big sports data sites), but most seem to cover D1 or pro leagues much more heavily. I know scraping is always a fallback, but I wanted to see if anyone knows of a hidden-gem API or a solid dataset, free or cheap, before I go that route.

submitted by /u/Sharp_Network7139
[link] [comments]

A Clean, Combined Dataset Of All Academy Award (Oscar) Winners From 1928-Present.

Hello r/datasets, I was working on a data visualization project and had to compile and clean a dataset of all Oscar winners from various sources. I thought it might be useful to others, so I’m sharing it here.

Link to the CSV file: https://www.kaggle.com/datasets/unanimad/the-oscar-award?resource=download&select=the_oscar_award.csv

It includes columns for Year, Category, Nominee, and whether they won. It’s great for practicing data analysis and visualization. As an example of what you can do with it, I used a new AI tool I’m building (Datum Fuse) to quickly generate a visualization of the most awarded categories. You can see the chart here: https://www.reddit.com/r/dataisbeautiful/s/eEA6uNKWvi
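
As a quick start for anyone grabbing the file, a minimal sketch of the "most awarded categories" count described above; the column names are assumptions about the Kaggle export, so adjust them to the actual header row:

import pandas as pd

# Column names ("category", "winner") are assumptions about the Kaggle
# export; check the actual header row and adjust.
df = pd.read_csv("the_oscar_award.csv")

# "winner" may be stored as booleans or as "True"/"False" strings.
winners = df[df["winner"].astype(str).str.lower() == "true"]
print(winners["category"].value_counts().head(10))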

Hope you find the dataset useful!

submitted by /u/Bootes-sphere
[link] [comments]

Need Massive Collections Of Schemas For AI Training – Any Bulk Sources?

Looking for massive collections of schemas/datasets for AI training – mainly financial and e-commerce domains, but really I need vast quantities from all sectors. I need structured data formats that I can use to train models on things like transaction patterns, product recommendations, market analysis, etc. We’re talking thousands of different schema types here. Does anyone have good sources for bulk schema collections? Even pointers to where people typically find this stuff at scale would be helpful.

submitted by /u/Fragrant-Dog-3706
[link] [comments]

QUEENS: Python ETL + API For Making Energy Datasets Machine Readable

Hi all.

I’ve open-sourced QUEENS (QUEryable ENergy National Statistics), a Python toolchain for converting official statistics released as multi-sheet Excel files into a tidy, queryable dataset with a small REST API.

  • What it is: an ETL + API in one package. It ingests spreadsheets, normalizes headers/notes, reshapes to long format, writes to SQLite (RAW → PROD with versioning), and exposes a FastAPI for filtered queries. Exports to CSV/Parquet/XLSX are included.
  • Who it’s for: anyone who works with national/sectoral statistics that come as “human-first” Excel (multiple sheets, awkward headers, footnotes, year-on-columns, etc.).
  • Batteries included: it ships with an adapter for the UK’s DUKES (the official annual energy statistics compendium), but the design is collection-agnostic. You can point it at other national statistics by editing a few JSON configs and simple Excel “mapping templates” (no code changes required for many cases).

Key features

  • Robust Excel parsing (multi-sheet, inferred headers, optional transpose, note-tag removal).
  • Schema validation & type coercion; duplicate checks.
  • SQLite with versioning (RAW → staged PROD).
  • API: /data/{collection} and /metadata/{collection} with typed filters (eq, neq, lt, lte, gt, gte, like) and cursor pagination.
  • CLI & library: queens ingest, queens stage, queens export, or use import queens as q.

Install and CLI usage

pip install queens

# ingest selected tables
queens ingest dukes --table 1.1 --table 6.1

# ingest all tables in dukes
queens ingest dukes

# stage a snapshot of the data
queens stage dukes --as-of-date 2025-08-24

# launch the API service on localhost
queens serve
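
Once queens serve is up, the /data endpoint should be queryable like any FastAPI service. A sketch with requests; the port and the filter-parameter spelling are guesses based on the filter names listed above, so check the project’s docs for the real syntax:

import requests

# Port 8000 (the uvicorn default) and the "field=op:value" filter spelling
# are assumptions inferred from the post, not confirmed from the QUEENS docs.
resp = requests.get(
    "http://localhost:8000/data/dukes",
    params={"year": "gte:2010"},
    timeout=30,
)
resp.raise_for_status()
page = resp.json()
print(type(page), str(page)[:200])  # inspect the first page of results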

Why this might help r/datasets

  • Many official stats are published as Excel meant for people, not machines. QUEENS gives you a repeatable path to clean, typed, long-format data and a tiny API you can point tools at.
  • The approach generalizes beyond UK energy: the parsing/mapping layer is configurable, so you can adapt it to other national statistics that share the “Excel + multi-sheet + odd headers” pattern.

License: MIT
Happy to answer questions or help sketch an adapter for another dataset/collection.

submitted by /u/KaleidoscopeNo6551
[link] [comments]

Looking For A Dataset On Competitive Pokemon Battles(mostly VGC)

I’m looking for a dataset of Pokémon games (mostly VGC) containing the Pokémon brought to each game, their stats, and their moves, and, for each battle, the moves used, the secondary effects that occurred, and all the extra information the game gives you. I’m researching a versatile algorithm to calculate advantage, and I want to use Pokémon games to test it.

Thank you.

submitted by /u/Malice15
[link] [comments]

How Are You Ingesting Data Into Your Database?

Here’s the general path that I take:

API > Parquet File(s) > Uploaded to S3 > Copy Into (From External Stage) > Raw Table

It’s all orchestrated by Dagster with asset checks along the way. Raw data is never transformed until after it’s in the DB. I prefer using SQL over Python for cleaning data when possible.
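
A minimal sketch of the API → Parquet → S3 leg, with the warehouse step as a comment; the endpoint, bucket, stage, and table names are placeholders, and the COPY syntax shown is the Snowflake flavor that the "external stage" wording suggests:

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import requests

# API -> Parquet. The endpoint and field layout are placeholders.
records = requests.get("https://api.example.com/orders", timeout=60).json()
table = pa.Table.from_pandas(pd.DataFrame(records))
pq.write_table(table, "orders.parquet")

# Parquet -> S3. Bucket and key are placeholders.
boto3.client("s3").upload_file("orders.parquet", "my-raw-bucket", "raw/orders.parquet")

# S3 -> raw table, run in the warehouse (Snowflake-flavored COPY INTO):
#   COPY INTO raw.orders
#   FROM @raw_stage/raw/
#   FILE_FORMAT = (TYPE = PARQUET)
#   MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;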

submitted by /u/fruitstanddev
[link] [comments]