Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Need Advice: How To Collect 2k Company Contacts (specific Roles) Without Doing Everything Manually?

Hi everyone, I’m facing a problem and could really use some advice from people who’ve done this before or been in a similar situation.

I need to collect contact details for around 2,000 companies, but the tricky part is that I don’t need generic inboxes like info@ or support@. I specifically need contacts of responsible people (for example: Head of HR, HR Manager, CEO, Founder, or similar decision-makers). Doing this manually, company by company, feels almost impossible at this scale. I’m facing this challenge for the first time and don’t know where to start.

I’m open to: paid tools, APIs, semi-automated workflows, services you’ve personally used, or even outsourcing ideas (if that’s realistic).

My main questions: Is this realistically automatable? Are there tools/services that actually work for role-based contacts? What should I absolutely avoid (wasting money, getting banned, bad data, etc.)? I’d really appreciate any real-world experience, tool recommendations, or warnings. Thanks in advance 🙏

submitted by /u/grafieldas

Any (free) API Out There To Classify Domain Names?

Basically, I am looking for an API (free if possible) to classify whether a given domain name is listed for sale or developed. Google doesn’t return anything. I did come across the whoixml APIs, but they only offer a history API (which is pretty expensive); I tried it, but the data seemed pretty outdated. I need to process at least 1M domains monthly (happy to pay per request). I would appreciate some directions.

submitted by /u/ghad0265

I Need Free Audio Datasets For My Project

Need free, high-quality audio datasets for tasks like speech recognition, sound classification, or environmental noise analysis—ideally with labels, metadata, and permissive licenses (CC0 or similar). Does anyone have recommendations for sources beyond Hugging Face (Common Voice, AudioSet) or Kaggle? Bonus if they’re preprocessed or good for big data tools like Spark/Hadoop. Links, sizes, and usage tips appreciated.

submitted by /u/yobigp

Creating Datasets For Physical Activities, What Sensors?

Those of you collecting data for sports, hobbies, workouts, or other physical activities: what sensors are you using?

I’m currently using the Witmotion WT901 sensor, but I’d love to know what others are using.

Extra information: I’m finishing an iOS app for collecting phone data specifically for AI training data, with support for time syncing with external sensors. I’ll need this data for my own personal project. I’m trying to figure out whether I’m better off using a different sensor. The only concern is that some sensors have so little documentation that connecting to them through the app, reading the data, and syncing it with my phone data is an absolute pain. The Witmotion sensor took me forever to get working with the phone’s sensor data.

submitted by /u/programmerguineapigs

I’m Looking For A Very Large Spatial Dataset

I thought this would be easy to find, but it’s been difficult so far. All I’m looking for is:

  • At least 10,000 observations
  • Open-source (or at least free to access)
  • Each observation has two spatial coordinates (x and y or longitude/latitude)
  • Each observation has at least two numeric variables (one that can be used as an explanatory variable, and one as a response variable)
  • NOT temporal/time-based

Anyone know where else I can look? I haven’t been able to find anything on the UCI ML repository. I’m sifting through Kaggle now but there are so many options.

submitted by /u/Cold-Priority-2729

Looking For Blood Test Dataset Of Multiple Diseases

I’m new and testing things with LLM training. Should I look for datasets on individual diseases, or is there a way to find this particular kind of dataset? Someone mentioned using a synthetic dataset, but I’m not sure about it.

Will the LLM learn properly if, for example, one dataset has cholesterol values and another has liver-based values or something?

submitted by /u/SAY_GEX_895

2 Million Messy → Clean Addresses. What Would You Build With This?

Hello fellow developers,

I have a dataset containing 2 million complete Brazilian addresses, manually typed by real users. These addresses include abbreviations, typos, inconsistent formatting, and other common real-world issues.

For each raw address, I also have its fully corrected, standardized, and structured version.

Does anyone have ideas on what kind of solutions or products could be built with this data to solve real-world problems?

Thanks in advance for any insights!

submitted by /u/Hour-Dirt-8505

Datasets Where The Schema Actually Breaks Over Time?

I’m trying to get better at handling real-world data drift, not just loading clean CSVs once.

Are there public datasets where:

  • Fields get added/removed over time
  • Data types quietly change
  • Nulls suddenly spike for no obvious reason

Basically datasets that force you to add validation and monitoring instead of assuming everything stays the same.
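
For context, the three failure modes listed above (fields added or removed, types changing, nulls spiking) can be checked by comparing a simple profile of two batches. What follows is a minimal sketch, not something from the post: the names `profile` and `compare`, the treatment of empty strings as nulls, and the 0.2 null-rate threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def profile(rows):
    """Summarise one batch of dict rows: observed value types and null rate per field."""
    fields = {}
    for row in rows:
        for key, value in row.items():
            stats = fields.setdefault(key, {"types": Counter(), "nulls": 0})
            if value is None or value == "":  # assumption: empty string counts as null
                stats["nulls"] += 1
            else:
                stats["types"][type(value).__name__] += 1
    total = len(rows)
    return {
        key: {
            "types": set(stats["types"]),
            "null_rate": stats["nulls"] / total if total else 0.0,
        }
        for key, stats in fields.items()
    }


def compare(old, new, null_jump=0.2):
    """Return human-readable drift warnings between two profiles."""
    warnings = []
    for key in sorted(old.keys() - new.keys()):
        warnings.append(f"field removed: {key}")
    for key in sorted(new.keys() - old.keys()):
        warnings.append(f"field added: {key}")
    for key in sorted(old.keys() & new.keys()):
        old_types, new_types = old[key]["types"], new[key]["types"]
        # Only compare types when both batches had non-null values for the field.
        if old_types and new_types and old_types != new_types:
            warnings.append(f"type changed: {key} {sorted(old_types)} -> {sorted(new_types)}")
        if new[key]["null_rate"] - old[key]["null_rate"] > null_jump:
            warnings.append(f"null spike: {key}")
    return warnings
```

Running `compare(profile(last_week), profile(this_week))` on two snapshots of the same feed would give a list of warnings to log or alert on; a real monitor would likely also track value ranges and row counts.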

I’m less interested in size and more in realism.
APIs, government feeds, or long-running open datasets all welcome.

Would love examples + what broke for you when you used them.

submitted by /u/crowpng

6500 Hours Of Multi-person Action Video. Rights Cleared, 1080 30fps

Dataset Overview

  • Size: 6,500 hours / average clip length 25 minutes / 13 TB
  • Resolution: 1080p
  • Frame rate: 30fps
  • Format: MP4 (H.264)

I have a dataset I’ve gathered at my rage room business. We have 4 rooms with consistent camera and lighting. The camera angle is from the top corner of the room, a standard CCTV angle. Groups of 1-6 people. Full PPE for all subjects, so they are mostly anonymous, though some subjects take off the helmet at the end. All subjects have signed a talent release.

Activities: Physical actions including destruction, tool use, object interaction, coordination tasks

Objects: Various materials (glass, electronics, tools)

Scenarios: Both coordinated and chaotic multi-person behavior

Samples available

Looking to license

Open to feedback. I’m currently collecting more video every day and am willing to create custom datasets.

submitted by /u/DrHARDCOREy

Dataset For School Incident Classification

Hi everyone! I’m currently working on a school-related machine learning project where I’m trying to classify short incident reports written in free text. The goal is to help guidance counselors sort through reports more easily by grouping them based on the type of incident and how serious it might be.

I’m using a pretty simple approach (Naive Bayes) and focusing on things like bullying, harassment, misconduct, vandalism, and facility concerns, with labels like minor or major. The model is just meant to assist with organization and prioritization (all final decisions are still made by people).

Right now, I’m looking for a public, anonymized, or synthetic dataset with short complaint- or incident-style text that I can train the model on. It doesn’t have to be school-specific; anything similar (complaints, reports, misconduct descriptions, etc.) would be super helpful as long as it’s ethical to use.

Since this is an academic project, I can’t use real or identifiable student data, and everything will only be used for research.

If you know of any datasets, past projects, or even tools for generating realistic synthetic text, I’d really appreciate the help. Thanks in advance!

submitted by /u/Soggy_Macaron_5276

I Made A Free Tool To Extract Tables From Any Webpage (Wikipedia, Gov Sites, Etc.)

Made a quick tool and thought some might find it useful!

🔗 lection.app/tools/table-extractor

It does one thing: paste a URL, it finds all HTML tables on the page, and you can download them as CSV or JSON. No signup, no API key, just works.

Works great for:

  • Wikipedia data tables
  • Government/public data portals
  • Sports stats sites
  • Any page with HTML tables

Limitations: Won’t work on JavaScript-rendered tables (like React dashboards) since it fetches raw HTML. But for most static pages it works pretty well.
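
For readers curious what "fetch raw HTML, find the tables, export CSV" looks like in practice, here is a rough sketch using only the Python standard library. It is not the tool's actual implementation: the names `TableCollector` and `tables_to_csv` are made up for illustration, nested tables are not handled, and fetching the page is left out (it could be done with `urllib.request.urlopen`). Like the tool, it only sees tables present in the raw HTML.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows of cell text (no nested-table support)."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently open
        self._row = None    # cells of the row currently open
        self._cell = None   # text fragments of the cell currently open

    def _flush_cell(self):
        # Close the current cell, collapsing runs of whitespace to single spaces.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row is not None and self._table is not None:
            self._table.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._flush_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag == "tr":
            self._flush_row()
        elif tag == "table" and self._table is not None:
            self._flush_row()
            self.tables.append(self._table)
            self._table = None


def tables_to_csv(html):
    """Return one CSV string per table found in the given HTML text."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    results = []
    for table in parser.tables:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(table)
        results.append(buffer.getvalue())
    return results
```

A JavaScript-rendered table never appears in the text passed to `tables_to_csv`, which is why this approach shares the limitation described above.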

Let me know if you run into any issues or have suggestions!

submitted by /u/Unmoovable

I’m Looking For Help Creating A Dataset

Hi everyone! I would like to start a new research project and I would appreciate it a lot if anyone wanted to join! The project consists of taking high-quality scans of leaves. I know it sounds basic, but it can have a great impact in the field of natural sciences. It is very hard to find high-quality pictures of leaves online. High-quality scans can uncover the vein structure clearly, opening up a whole set of possibilities in research. If anyone is interested in collaborating, you can send me a DM 🙂

submitted by /u/MammothComposer7176

Open Dataset: 3,023 Enterprise AI Implementations With Analysis

I analyzed 3,023 enterprise AI use cases to understand what’s actually being deployed vs. vendor claims.

Key findings:

Technology maturity:

  • Copilots: 352 cases (production-ready)
  • Multimodal: 288 cases (vision + voice + text)
  • Reasoning models (e.g. o1/o3): 26 cases
  • Agentic AI: 224 cases (growing)

Vendor landscape:

Google published 996 cases (33% of dataset), Microsoft 755 (25%). These reflect marketing budgets, not market share.

OpenAI published only 151 cases but appears in 500 implementations (3.3x multiplier through Azure).

Breakthrough applications:

  • 4-hour bacterial diagnosis vs 5 days (Biofy)
  • 60x faster code review (cubic)
  • 200K gig workers filed taxes (ClearTax)

Limitations:

This shows what vendors publish, not:

  • Success rates (failures aren’t documented)
  • Total cost of ownership
  • Pilot vs production ratios

My take: Reasoning models show capability breakthroughs but minimal adoption. Multimodal is becoming table stakes. Stop chasing hype; look for measurable production deployments.

Full analysis on Substack.
Dataset (open source) on GitHub.

submitted by /u/abbas_ai

Seeing The Same File-level Data Issues Again And Again, Why Are These Still So Hard To Catch?

Over the last few weeks, I’ve seen multiple discussions and anecdotes around file-level data problems that pass basic validation but still cause downstream pain.

Things like:

  • placeholder values that silently propagate
  • zero-width or invisible characters
  • encoding or locale-specific quirks
  • delimiter and quoting inconsistencies
  • numeric values flipping to scientific notation
  • dates and timezones behaving “correctly” but wrong in context

What’s interesting is that many of these aren’t schema violations and don’t fail parsing. The file looks fine, loads fine, and only causes issues much later.
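
To make a few of these concrete, here is a small sketch of cell-level checks that run after a CSV has already parsed successfully. It only covers three of the issues listed (invisible characters, placeholder values, scientific notation), and the specific character set, placeholder list, and the function name `scan_csv` are assumptions for illustration rather than an established standard.

```python
import csv
import io
import re

# Assumed set of invisible characters worth flagging (zero-width space/joiners, word joiner, BOM).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
# Assumed placeholder vocabulary; real files will need their own list.
PLACEHOLDERS = {"n/a", "na", "null", "none", "tbd", "-", "?"}
# A number written in scientific notation, e.g. 1.2E+15.
SCI_NOTATION = re.compile(r"^[+-]?\d+(\.\d+)?[eE][+-]?\d+$")


def scan_csv(text):
    """Return ((row, column), issue) pairs for cells that parse fine but look suspicious."""
    findings = []
    reader = csv.reader(io.StringIO(text))
    for row_no, row in enumerate(reader, start=1):
        for col_no, cell in enumerate(row, start=1):
            where = (row_no, col_no)
            if any(ch in INVISIBLE for ch in cell):
                findings.append((where, "invisible character"))
            value = cell.strip()
            if value.lower() in PLACEHOLDERS:
                findings.append((where, "placeholder value"))
            if SCI_NOTATION.match(value):
                findings.append((where, "scientific notation"))
    return findings
```

Row and column numbers are 1-based and count the header as row 1. Whether a flagged cell is actually wrong depends on context (an ID column in scientific notation is a problem; a physics measurement may not be), so this kind of scan is better treated as a report than as a hard validation gate.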

A common pattern seems to be:

  • data comes from external teams or manual exports
  • files change subtly over time
  • validation focuses on structure, not behavior

Is this problem worth solving? I ask because I have repeatedly been trying to resolve it myself, at least to some extent.

One approach I’ve seen discussed is tackling these issues incrementally, case by case, rather than trying to “validate everything” upfront, but adoption itself seems hard, especially when data privacy and workflow friction are concerns.

For people working in data engineering or analytics:

Which file-level issues have caused the most real-world pain for you, despite the files being technically valid?

Curious what patterns others have noticed, and whether this is a real issue for everyone out there.

submitted by /u/PriorNervous1031

Is There A Flights API With Deep Links For Booking?

So over the last few weeks I was playing around with the Duffel API and Amadeus for flight booking. This is just for a random idea that I thought of, and while they work fine, in order to actually build it I would need to build the entire flow for booking, fetching, managing, checking in, payment, support, etc. Basically it’s several months’ worth of work for something that might not even work at all.

So I came across this Expedia documentation, which lets you build a link for searching flights, and then you get redirected to their website for booking and whatnot. I would love to have something like this, but in API format, because this only works if you actually open the website and browse the flights manually. Is there any such API?

submitted by /u/randomseller

Here’s A Dataset Of The Ratings Of All 7,072 Movies On IMDb With Over 25,000 Votes

Date of data: 12 January, 2026

Data: All 7,072 movies with over 25,000 votes (that’s the current vote threshold for the IMDb Top 250.)

Instructions: Download the .txt file, rename it to a .csv file, and you can open it in a spreadsheet program and play around with the figures.

Dropbox link.

(Note: you don’t need to sign in to Dropbox to download it. There’s a bypass button at the bottom of the screen.)

A list of the tab-separated columns:

  • Title

  • IMDb code

  • Year

  • 1 ratings

  • 2 ratings

  • 3 ratings

  • 4 ratings

  • 5 ratings

  • 6 ratings

  • 7 ratings

  • 8 ratings

  • 9 ratings

  • 10 ratings

  • Total number of ratings

  • Weighted Mean [the IMDb rating that is published on the website]

  • Arithmetic Mean [the unweighted IMDb rating calculated from the raw totals]

  • Difference of Means [the difference between the previous two columns]

  • Standard Deviation
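
The Arithmetic Mean and Standard Deviation columns can in principle be recomputed from the ten per-score vote counts. Below is a minimal sketch, assuming the population (not sample) standard deviation; the post does not say which was used, and `rating_stats` is a hypothetical helper name. The Weighted Mean cannot be reproduced this way, since it is the rating IMDb publishes rather than a value derived from the raw totals.

```python
import math


def rating_stats(counts):
    """Compute (total votes, arithmetic mean, population std dev) from vote counts.

    counts[0] is the number of 1-star votes, ..., counts[9] the number of 10-star votes.
    """
    if len(counts) != 10:
        raise ValueError("expected ten vote counts, one per score from 1 to 10")
    total = sum(counts)
    if total == 0:
        raise ValueError("no votes")
    # Each score is weighted by how many votes it received.
    mean = sum(score * n for score, n in enumerate(counts, start=1)) / total
    variance = sum(n * (score - mean) ** 2 for score, n in enumerate(counts, start=1)) / total
    return total, mean, math.sqrt(variance)


# Made-up counts for illustration only: ten votes of 5 and ten votes of 10.
total, mean, std = rating_stats([0, 0, 0, 0, 10, 0, 0, 0, 0, 10])
print(total, mean, std)  # 20 7.5 2.5
```

With the file loaded as tab-separated rows, the ten counts would be the fourth through thirteenth columns according to the list above.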

submitted by /u/RunDNA