Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Tested Some Proxy Providers For City-level Geotargeting And Most Of Them Lied To Me

I’ve been testing a few proxy providers recently because I needed accurate city-level geotargeting for some campaigns and honestly, the results were pretty disappointing.

A lot of them claimed the IPs were from specific cities, but when I checked them through different tools and websites, the locations were either completely wrong or just showed nearby regions instead of the actual city.

Some even rotated to totally different locations after a few minutes.
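One way to make that kind of check repeatable is to log what each lookup tool reports over time and then score the results against the city the provider claimed. The sketch below is only an illustration: the function name, the observation format, and the sample values are all assumptions, and it does not call any real geolocation service.

```python
from collections import Counter

def score_geo_claim(claimed_city, observations):
    """Summarise how often lookup results matched the claimed city.

    observations: list of (minutes_since_start, source_name, reported_city).
    This tuple layout is a hypothetical format, not any tool's output.
    """
    claimed = claimed_city.strip().lower()
    cities = [city.strip().lower() for _, _, city in observations]
    matches = sum(1 for city in cities if city == claimed)
    return {
        # Share of lookups that agreed with the provider's claim.
        "match_rate": matches / len(cities) if cities else 0.0,
        # More than one entry here means sources disagreed or the IP rotated.
        "distinct_cities": sorted(set(cities)),
        "most_common": Counter(cities).most_common(1)[0][0] if cities else None,
    }

# Made-up observations for a proxy sold as "Austin".
sample = [
    (0, "tool_a", "Austin"),
    (0, "tool_b", "Round Rock"),
    (5, "tool_a", "Dallas"),
    (5, "tool_b", "Austin"),
]
report = score_geo_claim("Austin", sample)
print(report)
```

A low match rate together with several distinct cities would point to either inaccurate geolocation databases or a rotating exit IP; telling the two apart needs the timestamps.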

Is this normal with city-level targeting or have others faced the same thing?

submitted by /u/swaryapatil14

UK GDPR Small Business Q&A — 5,000 Synthetic Pairs With Article-level Citations [Synthetic]

Dataset for fine-tuning compliance assistants. Each pair includes:
– A practical SME-facing question (“Can I use pre-ticked consent boxes?”)
– An answer with specific UK GDPR article references, ICO guidance by name, and actionable steps
– Source metadata: which GDPR concepts were used, which generation strategy, timestamp

Generation method: questions via local Qwen 14B from a curated term bank, answers via DeepSeek API for factual reliability. JSON + Parquet, MIT license for the 1K sample.

This is a niche dataset — it’s not a benchmark contender, it’s for people building privacy tools for UK businesses. If you’re doing legal NLP or compliance RAG, might be useful.

Free sample: https://huggingface.co/datasets/Draeg82/uk-gdpr-small-business-qa

submitted by /u/a_serial_hobbyist_

Good Places To Find Dataset Customers?

Hello, so for the past year or so I have accumulated data from a lot of different stores and a few marketplaces. I have over 4m products with stock and price history. My question is: how legal is it to sell this data, and where can I do that? This could be huge for anyone trying to start a store (all data is based on European stores).

submitted by /u/Lanky_Grocery_511

So I Ran A Custom Pipeline On All 350k Fulton County Parcels. The “long-tenure” Math Is Actually Insane.

i’ve been messin around with some custom filter pipelines lately. basically i wanted to see where the real “exhaustion points” are in the fulton county residential universe. everyone keeps talking about a housing shortage but the data shows something else if you look at the “LTO” (long-tenure owner) signals.

i narrowed down the 350,000+ parcels to a working universe of about 72k investment properties. and yeah… the numbers are kinda weird.

The “Alpha” or whatever you want to call it:

  • The 20-Year Wall: I found 41,959 owners with an avg hold period of 19.7 years. That is basically an entire generation of equity just sitting there.
  • The Absentee Factor: 96.9% of these are absentee. about 6% are out-of-state. these people have literally zero emotional attachment to the dirt at this point. they probably haven’t even seen the houses since the pre-covid spike.
  • The “Gap”: there are about 7,567 properties where the appraisal is so far behind the market appreciation that the assets are just objectively under-managed.
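The post does not share its pipeline, so the following is only a guess at what a "long-tenure absentee" filter might look like. The field names (`last_sale_date`, `mailing_address`, `site_address`), the 20-year threshold, and the sample parcels are all assumptions for illustration, not the author's actual schema or rules.

```python
from datetime import date

def long_tenure_owners(parcels, as_of, min_years=20):
    """Keep parcels held at least min_years by an owner who lives elsewhere."""
    selected = []
    for p in parcels:
        # Hold period measured from the last recorded sale.
        years_held = (as_of - p["last_sale_date"]).days / 365.25
        # Crude absentee test: tax bill goes to a different address.
        absentee = p["mailing_address"] != p["site_address"]
        if years_held >= min_years and absentee:
            selected.append({**p, "years_held": round(years_held, 1)})
    return selected

# Made-up parcels, not real county records.
parcels = [
    {"parcel_id": "A", "last_sale_date": date(2000, 6, 1),
     "mailing_address": "PO Box 1", "site_address": "10 Elm St"},
    {"parcel_id": "B", "last_sale_date": date(2015, 3, 1),
     "mailing_address": "PO Box 2", "site_address": "12 Elm St"},
    {"parcel_id": "C", "last_sale_date": date(1998, 1, 1),
     "mailing_address": "14 Elm St", "site_address": "14 Elm St"},
]
result = long_tenure_owners(parcels, as_of=date(2025, 1, 1))
print([p["parcel_id"] for p in result])
```

A string comparison of addresses will misclassify owners whose mailing address is merely formatted differently, so a real pipeline would likely normalise addresses first.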

the south fulton logistics cluster is up like 114% in 3 years. Meanwhile, the North Fulton corridor has the highest density of these “Tier 1” owners who have held for 20+ years and are probably tired of dealing with tenants.

anyway. i’m just a data guy. but it feels like the market is ignoring a massive “tired landlord” wave that is about to hit. or maybe i’m just overthinking the etl results.

Has anyone actually closed anything in South Fulton lately? the appreciation numbers look like a glitch but i’ve triple checked the math.

submitted by /u/Silver-Tune-2792

What Valuable Professional Data Is Completely Locked Away From AI Companies?

Hi all,

Apologies beforehand if this is the wrong subreddit, let me know if you think there are better subreddits for this post.

I’m working on a project around proprietary data licensing for AI training and trying to identify data types that are genuinely inaccessible to AI labs, not because the data doesn’t exist, but because no one has figured out how to unlock it.

Specifically looking for data that is:

  • Created by domain experts as part of their daily work
  • Never published or shared outside the organization
  • Rich in human reasoning, not just structured outputs

Finance is my background so I’m especially curious about examples there, but all industries welcome.

What’s the most valuable “locked” professional data you’ve come across in your field – and who (if ya know) owns the rights to it?

submitted by /u/Manny_in_iceage

Needed Full Reddit Comment Trees For An NLP Dataset, Here’s What I Used

Was building a training corpus and kept hitting the official API’s 500 comment truncation limit. Found a gateway that recursively resolves full thread depth and has historical archive access which the official API just doesn’t have.

Endpoint I relied on most:

GET /submission/{id}/full

Returns the entire thread, no truncation. Only charges on 200 OK so failed requests don’t eat your credits. Sharing in case anyone else is doing similar dataset work — happy to share what I’m using if anyone’s interested.
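For corpus building, a full thread usually has to be flattened into one row per comment. The post does not show the response format, so the sketch below assumes a hypothetical shape where each comment is a dict with `id`, `body`, and a nested `replies` list; the real gateway's payload may differ.

```python
def flatten_thread(comments):
    """Turn a nested comment tree into flat rows, parents before children."""
    rows = []
    # Explicit stack instead of recursion, so very deep threads
    # cannot hit Python's recursion limit.
    stack = [(c, 0, None) for c in reversed(comments)]
    while stack:
        node, depth, parent_id = stack.pop()
        rows.append({
            "id": node["id"],
            "parent_id": parent_id,
            "depth": depth,
            "body": node.get("body", ""),
        })
        # Push children reversed so they pop in their original order.
        for child in reversed(node.get("replies", [])):
            stack.append((child, depth + 1, node["id"]))
    return rows

# Made-up thread in the assumed shape.
thread = [
    {"id": "a", "body": "top", "replies": [
        {"id": "b", "body": "reply", "replies": [
            {"id": "c", "body": "deep", "replies": []},
        ]},
    ]},
    {"id": "d", "body": "second"},
]
rows = flatten_thread(thread)
for row in rows:
    print(row["depth"], row["id"], row["parent_id"])
```

Keeping `parent_id` and `depth` on every row lets the tree be rebuilt later, which matters if the training task depends on reply context.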

submitted by /u/Ok-Direction-9618

Desperately Need Data For My Website Involving Human Detection Of LLMS (All Welcome)

The concept is simple: 4 Large Language Models, 1 prompt, and you’re either matched with a human or an LLM. It’s a Turing Test and I really need the data and have no way of getting it. I worked my ass off creating this website and I’d be forever grateful if you spent 5 minutes of your time to play a few rounds. Here’s the link: https://the-imitation-project.vercel.app/

submitted by /u/xxFEETLOVERxx

Metadata-only Index For AI Image Galleries, What Fields Would Make This Useful?

I am building a metadata-only index for AI image discovery packs and wanted feedback from people who actually use datasets.

Current shape:

  • one JSONL record per image
  • prompt fragments when available
  • source URL and creator/source attribution fields
  • safety labels
  • category/style tags
  • pack manifests for small curated image sets
  • no upstream image files included in the first pass
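To make the shape above concrete, here is a rough sketch of writing one such record as a JSONL line with a basic provenance check. The key names (`prompt_fragments`, `source_url`, `attribution`, `safety_labels`, `tags`) and the sample values are guesses based on the list, not the project's published schema.

```python
import json

# Fields treated as mandatory in this sketch; prompt fragments are
# optional because the post says they are included "when available".
REQUIRED_FIELDS = ("source_url", "attribution", "safety_labels", "tags")

def to_jsonl_line(record):
    """Serialise one image record, refusing records without provenance."""
    # Empty strings and empty lists count as missing here.
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # json.dumps escapes newlines, so the result is always a single line.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)

# Placeholder record; the URL and values are invented.
record = {
    "source_url": "https://example.com/image/123",
    "attribution": {"creator": "example-creator", "source": "example-site"},
    "safety_labels": ["general"],
    "tags": ["landscape", "watercolor"],
    "prompt_fragments": ["misty valley", "soft light"],
}
line = to_jsonl_line(record)
print(line)
```

Rejecting records at write time keeps attribution from being silently dropped, which is the failure the index is meant to avoid.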

Example manifest and records are here:
https://generatedgallery.com/index/manifest.json
https://generatedgallery.com/index/generated-gallery.sample.json

Protocol notes: https://generatedgallery.com/protocol

The use case is prompt research, moodboards, model eval sets, and image discovery where provenance does not get stripped away.

What fields would make this more useful before I publish a larger metadata-only dataset repo?

submitted by /u/Plane-Marionberry380

I Can Scrape/aggregate Pretty Much Any Fragmented Public Data. What Datasets Are Missing

I built a large-scale scraping system that can extract data from thousands of sources simultaneously, bypass anti-bot protection, and convert unstructured formats (PDFs, scanned docs, complex HTML) into clean structured datasets.

What public datasets should exist but don’t because:

  • Data is scattered across too many jurisdictions (every state/county has their own portal)
  • No one has aggregated it yet
  • It’s in PDFs or hard-to-parse formats
  • Sites actively block automated access

Not looking to sell—genuinely trying to understand what public data would be valuable if someone aggregated it. If there’s demand, I might build and release it.

submitted by /u/Sufficient-War-4020

[Tool] Built An API To Instantly Extract Any Public HTML Table Or Wikipedia Page Into A Clean JSON Data Matrix

Hey r/datasets,

I got tired of manually copying data tables or dealing with messy HTML structures when trying to feed data into my personal scripts and models.

To solve this, I built and hosted a lightweight cloud API that automatically scrapes public web pages, isolates the tables/data grids, and packages everything into an organized, nested JSON matrix.
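For readers curious what the table-isolation step involves, here is a minimal local sketch using only Python's standard library. It is not the hosted API's implementation or output format; the class and function names are made up, and nested tables, `rowspan`, and `colspan` are not handled.

```python
import json
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def extract_tables(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables

html = (
    "<table><tr><th>name</th><th>value</th></tr>"
    "<tr><td>a</td><td> 1 </td></tr></table>"
)
tables = extract_tables(html)
print(json.dumps(tables))
```

Real pages tend to need more than this (merged cells, tables built from `div` grids, JavaScript-rendered content), which is presumably where a hosted parser earns its keep.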

I wanted to share it here for anyone looking to automate their data gathering pipelines. I set up a free testing tier on RapidAPI that gives you 50 free requests a month to play around with it:

https://rapidapi.com/patcicci4/api/housing-and-wikipedia-data-scraper

Let me know if you test it out or have any feedback on extra features I should add to the parser!

submitted by /u/Cyclonefan444

ORKUT [text Only] Dataset, Created From Internet Archive Raw Data

So guys, I’m still uploading: about 150GB, about 1.1 billion replies, mostly from Brazilian users (pt-br).

Also take a look at https://github.com/rodrigosf672/orkut-pydataglobal2025 and https://snap.stanford.edu/data/com-Orkut.html

This one is just raw data for now. I will do ML analysis on it later; if anyone wants to write a paper together about it, DM me.

Anyway, it’s on HF: SalatielJordao/orkut-communities

submitted by /u/Grand-Prize1371

130 US Profession Profiles + 25 Deductively-generated Pain Bundles – Structured JSON, MIT, Regenerable

Open-source dataset of US professions. Two levels:

130 profession profiles in data/professions/us/profiles/. Each is a JSON with 7 sections – daily routine, regulations, tools, jargon, career levels + fears, community channels, labor market. All sourced from .gov, law.cornell.edu, BLS, and professional associations with source URLs attached to every fact. Built by running 7 targeted WebSearch queries per profession.

25 of those profiles also have generated pain bundles in data/professions/us/pains/. 8-15 inferred recurring pains per profession, each paired with a typed spec for the AI tool that would solve it (calculator with inputs/outputs/formula, checklist with steps and statutory refs, document template with variables, reference lookup keys, LLM advisor decision criteria). Generated by feeding the profile to Opus with a deductive system prompt – no web search at the generation step.

Sample of what comes out, from data/professions/us/pains/us-lawyers.json:

  • Billable Hours & Fee Calculation (calculator)
  • Statute of Limitations Lookup (reference)
  • IOLTA Trust Account Reconciliation (calculator)
  • Engagement Letter Drafting (template)
  • Court Filing Deadline Calculator (calculator)
  • … 8 more

And from data/professions/us/pains/us-auto-detailers.json:

  • Cost-plus detail job pricing calculator (calculator, includes 2026 IRS mileage rate)
  • EPA stormwater compliance checklist (checklist, $64,618/day Clean Water Act exposure)
  • California Car Wash Act registration + surety bond (checklist, Labor Code §§ 2050-2067)
  • Vehicle intake / pre-inspection form generator (template)
  • Quarterly self-employment tax estimator (calculator, 15.3% SE tax)
  • … 8 more

Each pain entry has: title, problem (2-3 sentences), affected segment, frequency, time_waste_h, money_risk_usd, source SCOPE section, skill_type, and a typed skill_spec matching the type. Schema docs in data/professions/us/_FORMAT.md.
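A quick sanity check for entries of this kind might look like the sketch below. It only tests the keys whose names appear verbatim or near-verbatim in the description above; the exact keys for "affected segment" and "source SCOPE section", the label used for the LLM advisor type, and the numeric typing of the two estimate fields are not stated in the post, so treat those parts as assumptions and defer to _FORMAT.md in the repo.

```python
# Keys taken from the field list in the post; spelling is assumed.
NAMED_KEYS = (
    "title", "problem", "frequency",
    "time_waste_h", "money_risk_usd",
    "skill_type", "skill_spec",
)
# Types that appear in the post's samples. The advisor type's label
# is not shown there, so it is left out and would be flagged below.
SEEN_SKILL_TYPES = {"calculator", "checklist", "template", "reference"}

def check_pain_entry(entry):
    """Return a list of human-readable problems; empty means it looks fine."""
    problems = [f"missing key: {k}" for k in NAMED_KEYS if k not in entry]
    if "skill_type" in entry and entry["skill_type"] not in SEEN_SKILL_TYPES:
        problems.append(f"unrecognised skill_type: {entry['skill_type']}")
    # Assumption: both estimates are stored as numbers.
    for k in ("time_waste_h", "money_risk_usd"):
        if k in entry and not isinstance(entry[k], (int, float)):
            problems.append(f"{k} should be numeric")
    return problems

# Invented entry; the values are placeholders, not data from the repo.
entry = {
    "title": "Example pain",
    "problem": "Placeholder description.",
    "frequency": "weekly",
    "time_waste_h": 2,
    "money_risk_usd": 500,
    "skill_type": "calculator",
    "skill_spec": {"inputs": [], "outputs": [], "formula": ""},
}
print(check_pain_entry(entry))
print(check_pain_entry({"title": "Incomplete", "skill_type": "oracle"}))
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole regenerated bundle in a single pass.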

Backstory: extending an MIT pain-mining repo I’d been running (court records based, B2B angle). Court records don’t have profession-level pain because professionals don’t litigate their own workflow tedium. Switched to web search for regulatory facts + offline LLM deduction for what’s painful given those facts.

Honest positioning: discovery dataset, not validated pain register. Pains are inferred from regulation + daily routine, not from real users complaining. Plausible starting points for customer-development interviews, not conclusions.

Both pipeline stages are in prompts/profession-scan/ so the dataset is fully regenerable. Country-aware – works for any country with adequate online regulatory data.

Repo: https://github.com/AyanbekDos/unfairgaps-os
Cleanest single file to open: https://github.com/AyanbekDos/unfairgaps-os/blob/main/data/professions/us/pains/us-auto-detailers.json

MIT. PRs welcome for the remaining 105 profiles or non-US countries.

submitted by /u/Ogretape

[self-promotion] Searchable Public Lead Service Line Inventory Records Across The US

I built a free searchable site for public lead service line inventory records:

https://leadserviceline.org/

It aggregates public records from state, city, utility, spreadsheet, and PDF sources into address, water system, city, and state lookup pages.

Caveat: the records are only as good as the public inventories they come from. They can be incomplete, outdated, or wrong, and this is not a water test or a replacement for checking with a local utility.

Right now it is a website. If there is demand, I would like to add an API or bulk data access so people can pull the data directly.

submitted by /u/maximooth