Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

How Do You Actually Manage Reference Data In Your Organization?

I’m curious how this is handled in real life, beyond diagrams and “best practices”.

In your organization, how do you manage reference data like:

  • country codes
  • currencies
  • time zones
  • phone formats
  • legal entity identifiers
  • industry classifications

Concretely:

  • Where does this data live: ERP, CRM, BI, data warehouse, spreadsheets?
  • Who owns it: IT, the data team, the business, no one?
  • How do updates happen: manually, via scripts, through vendors, never?
  • What usually breaks when it’s wrong or outdated?

I’m especially interested in:

  • what feels annoying but accepted
  • what creates hidden work or recurring friction
  • what you’ve tried that didn’t really work

Not looking for textbook answers, just how it actually works in your org.

If you’re willing to share, even roughly, it would help a lot.

submitted by /u/anasharn

Massive 360 Image Dataset Uses? | PhotoSphereStudio

I’m the creator of https://maps.moomoo.me, which lets users upload 360 photos to specific coordinates, something that is no longer possible with the official Google apps. I recently started backing up the site’s images in case Google decides to sunset its Street View API, just as it already removed the Street View app, which is what prompted me to create this site.

I’ve also recently started scraping Google Maps to back up the older images that I never saved a copy of. Once I’m done I’ll have around 26,000 high-quality 360 photos, and I’m wondering whether this could be a valuable dataset.

submitted by /u/funny_b0t2

Tool For Generating LLM Datasets (just Launched)

hey yall

We’ve been doing a lot of fine-tuning and agentic stuff lately, and the part that kept slowing us down wasn’t the models but the dataset grind. Most of our time was spent just hacking datasets together instead of actually training anything.

So we built a tool to generate the training data for us, and just launched it. You describe the kind of dataset you want, optionally upload your sources, and it spits out examples in whatever schema you need. There’s a free tier if you wanna mess with it, no card needed. Curious how others here are handling dataset creation; always interested in seeing other workflows.

link: https://datasetlabs.ai

fyi we just launched so expect some bugs.

submitted by /u/Express_Seesaw_8418

Looking For Historical NIFTY 50 Constituent Weights (monthly) – Public Data Sources?

Hey folks,
I’m trying to track down historical NIFTY 50 constituent weights (ideally monthly, or even quarterly) going back as far as possible, preferably around 2000 onward.

I’m not looking for today’s weights or a current snapshot. I specifically need historical weights by constituent, preferably float-adjusted, in a machine-readable format (CSV / Excel / API).

If anyone knows:

  • a public dataset
  • an NSE data archive
  • an academic source
  • or even a paid source (that at least confirms the data exists)

please point me to it.

Even a clear answer like “this data isn’t publicly available and is only licensed via NSE/Bloomberg/etc.” would be helpful.

Thanks in advance

submitted by /u/Frosty-Article-9635

CCTV Weapon Detection: Rifles Vs Umbrellas (Synthetic)

Hi,

After I found this article a while ago, “Umbrella mistaken for assault rifle”, it seemed clear that we need more good data for training our detection models.

https://www.livenowfox.com/news/see-it-umbrella-mistaken-assault-rifle-sparks-mall-lockdown.amp

It’s now possible to generate this type of data synthetically, and that’s what I did: a fully synthetic but (hopefully) realistic CCTV dataset for rifles and umbrellas.

The dataset consists of balanced, synthetic images of rifles vs. umbrellas from overhead CCTV angles.

I have tried to make it high-quality: not high-resolution, perfect images, but realistic, usable CCTV-style images of people holding weapons and umbrellas.

I would welcome any feedback on the data:

– Are the images too “easy” for a well-trained object detection model?

– Is the diversity good?

– If anyone fine-tunes a model on the data, I would be happy to hear the results!

You can find the dataset here:

https://www.kaggle.com/datasets/simuletic/cctv-weapon-detection-rifles-vs-umbrellas

submitted by /u/MiserableDonkey1974

Vibe Scraping At Scale With AI Web Agents, Just Prompt => Get Data

Most of us have a list of URLs we need data from (government listings, local business info, PDF directories). Usually, that means hiring a freelancer or paying for an expensive, rigid SaaS.

I built rtrvr.ai to make “Vibe Scraping” a thing.

How it works:

  1. Upload a Google Sheet with your URLs.
  2. Type: “Find the email, phone number, and their top 3 services.”
  3. Watch the AI agents open 50+ browsers at once and fill your sheet in real-time.

It’s powered by a multi-agent system that can handle logins and even solve CAPTCHAs.

Cost: We engineered the cost down to $10/mo but you can bring your own Gemini key and proxies to use for nearly FREE. Compare that to the $200+/mo some lead gen tools charge.

Use the free browser extension for walled sites like LinkedIn or the cloud platform for scale.

Curious to hear whether this would make your dataset generation easier or whether it misses the mark.

submitted by /u/BodybuilderLost328

Dataset Request – US Domestic Flights And Domestic Water Usage

I am working on a project relating US domestic tourism to domestic water usage and infrastructure strain. My plan is to measure domestic travel using total daily arrivals at airports to find areas of heightened activity, then to focus on 2-3 high-traffic, 2-3 mid-traffic, and 2-3 low-traffic regions and compare their domestic water demand, so that I can relate the magnitude of infrastructure strain to tourism. Please let me know if you have any suggestions or can point me to suitable datasets. I am a high school student working on a personal project, and this is my first data analysis project, so any help would be appreciated.

Thank you!
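
The plan in this post could be sketched roughly as follows: rank airports by total arrivals, pick a few high, mid, and low traffic ones, then compute a Pearson correlation between arrivals and water demand. This is only a minimal sketch. `pick_tiers` and `pearson` are hypothetical helper names, not part of any dataset or library, and the inputs are whatever arrival and water figures end up being collected.

```python
import math
from statistics import fmean


def pick_tiers(arrivals, k=2):
    """Split airports into high/mid/low traffic tiers of k airports each.

    `arrivals` maps an airport code to its total arrivals (hypothetical input).
    """
    if k < 1 or len(arrivals) < 3 * k:
        raise ValueError("need at least 3 * k airports so the tiers do not overlap")
    # Busiest airports first.
    ranked = sorted(arrivals, key=arrivals.get, reverse=True)
    mid_start = (len(ranked) - k) // 2
    return {
        "high": ranked[:k],
        "mid": ranked[mid_start:mid_start + k],
        "low": ranked[-k:],
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)
```

A coefficient near 1 or -1 only says the two series move together. It does not show that tourism causes the water demand, so seasonal effects and resident population would still need to be accounted for.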

submitted by /u/GySgt_Gibbs

HELP: API-Football: Player ID Not Reliable Without Team/season Context — Is This Expected?

Hi all,

I’m currently using API-Football and I’m running into a fundamental issue with how player IDs and stats work, and I’m trying to understand if this is just how the API is designed or if I’m missing something.

The core problem is that a player ID is not sufficient on its own to reliably fetch stats.

In practice, player stats only resolve correctly when combined with team + competition + season, but the API treats player_id as if it’s globally usable. This leads to several issues:

  • Querying stats by player_id alone often returns empty or incomplete results
  • Historical seasons return nothing unless league and season are explicitly known up front
  • When a player transfers (especially mid-season), stats are split across teams and are easy to miss
  • The same player can appear under multiple IDs depending on search context

Because of this, you can’t safely persist just a player_id and query it later. You effectively need a compound key like (player_id, team_id, season, competition), which makes generic or long-term player tracking very brittle — especially if you don’t already know where the player was playing in a given season.
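
A minimal sketch of what that compound key could look like on the storage side, assuming stats are cached locally after being fetched. `StatKey`, `PlayerStatStore`, and the field names are hypothetical and do not come from API-Football itself; the point is only that each stat row is stored per (player, team, season, competition), with a secondary index that lets you find every context a player appeared in.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StatKey:
    """Hypothetical compound key: one stat row per player, team, season, competition."""
    player_id: int
    team_id: int
    season: int
    competition_id: int


class PlayerStatStore:
    """In-memory store keyed by StatKey, with a lookup index by player_id."""

    def __init__(self):
        self._rows = {}                       # StatKey -> dict of stat values
        self._by_player = defaultdict(set)    # player_id -> set of StatKey

    def put(self, key, stats):
        """Insert or overwrite the stats for one exact context."""
        self._rows[key] = dict(stats)
        self._by_player[key.player_id].add(key)

    def contexts(self, player_id):
        """Every (team, season, competition) context recorded for a player."""
        keys = self._by_player.get(player_id, ())
        return sorted(keys, key=lambda k: (k.season, k.competition_id, k.team_id))

    def season_total(self, player_id, season, field):
        """Sum one numeric stat across all teams and competitions in a season."""
        return sum(
            self._rows[k].get(field, 0)
            for k in self._by_player.get(player_id, ())
            if k.season == season
        )
```

With this shape, a mid-season transfer shows up as two rows for the same player and season, and `season_total` adds them together instead of silently dropping one.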

On top of that, stats tend to default to the “latest” season, competition filtering isn’t always clean, and aggressive caching feels mandatory due to rate limits.

My questions are:

  • Is this an expected limitation of API-Football?
  • Has anyone found a clean modeling strategy around this?
  • Or are there alternative APIs where player IDs are truly stable across seasons and clubs?

Any insights from people who’ve dealt with this would be hugely appreciated.

submitted by /u/Thin_Road_88

Looking For Data Set Of Medical Professionals Names And Education (a Bit More Info In The Post)

Hello,
I am looking for a dataset that includes medical professionals’ names and titles.

For example,

1) Medical conference registrations of some sort: I’m interested in how those people wrote their titles during registration. (I do not care about email addresses or any contact info.)

OR
2) LinkedIn profiles in which I can see how they wrote their names, with or without their professional title (e.g., John Doe M.D., Dr. John Doe, or just John Doe), with an option to cross-reference against their education (if public on the profile) to see whether they are actually medical professionals

Bonus: if there is gender information as well, but not required

I do not want or need any personal information that is related to contact, just trying to see how those people refer to themselves with or without their professional title

submitted by /u/psychic_shadow_lugia

Open-source CSV Analysis Helper For Exploring Datasets Quickly

Hi everyone, I’ve been working with a lot of awful CSV files lately. So, I put together a small open-source utility.

It’s under 200 lines but can scan a CSV and summarize patterns, show monotonicity and trend shifts, count inflection points, compute simple outlier signals, and provide tiny visualizations when needed.

It isn’t a replacement for pandas (or anything big), it’s just a lightweight helper for exploring datasets.
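
For readers wondering what checks of this kind involve, here is a rough standard-library sketch. It is not pattern-scope’s code or API: the function names are made up for illustration, counting direction changes is a simple stand-in for the inflection count, and the z-score rule is just one common choice of outlier signal.

```python
import csv
import io
import statistics


def read_numeric_column(text, column):
    """Parse CSV text and return one column as floats, skipping blank or non-numeric cells."""
    values = []
    for row in csv.DictReader(io.StringIO(text)):
        cell = (row.get(column) or "").strip()
        try:
            values.append(float(cell))
        except ValueError:
            continue
    return values


def monotonicity(values):
    """Classify a series as non-decreasing, non-increasing, or mixed."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(d >= 0 for d in diffs):
        return "non-decreasing"
    if all(d <= 0 for d in diffs):
        return "non-increasing"
    return "mixed"


def direction_changes(values):
    """Count how often the series switches between rising and falling (flat steps ignored)."""
    signs = [1 if b > a else -1 for a, b in zip(values, values[1:]) if b != a]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def zscore_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` population standard deviations from the mean."""
    if len(values) < 2:
        return []
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]
```

Note that a z-score rule is itself distorted by large outliers, so on short or heavy-tailed columns a median-based rule may behave better.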

Repo:
https://github.com/rjsabouhi/pattern-scope

PyPI:
https://pypi.org/project/pattern-scope/

pip install pattern-scope

Hopefully it’s helpful.

submitted by /u/RJSabouhi

Looking For Public Datasets (Text + Images + Voice + Heart Rate) For IT Professional Stress Detection Dataset For My University Research Project

Hey everyone, I’m a Computer Science major working on a healthcare-related machine learning project focused on training models (not LLMs) using multimodal medical data.

I’m looking for public/open-source datasets that include one or more of the following modalities:

  • Text: Emails and Jira comments written by stressed employees
  • Images: Labeled images of employees
  • Voice: Audio recordings of stressed employees
  • Physiological signals: Heart rate, ECG, PPG, EDA, or other wearable sensor data (preferably with stress/health labels)

If you know of datasets, repositories, or papers that release such data, I’d really appreciate links or pointers. Academic-access datasets are fine too.

Thanks in advance!

submitted by /u/ByteNinja2001

Looking For Anonymized Blood Test Reports

Hey, so I am a computer science major currently working on a healthcare-related, LLM-based system that can interpret medical reports.

As the title says, I am looking for datasets that contain blood test reports (CBC, lipid profile, LPD, etc.). It would be really great if anyone could provide a link to some public datasets or point me to open-source datasets that I might have missed.

submitted by /u/ayuzzzi

For Sale: 2.5M Android App Store Assets (Icons, Screenshots, Structured-Metadata) [paid]

I’m looking for potential buyers interested in a large-scale Android App Store dataset.

What’s included

  • ~2.5 million Android apps
  • High-quality app icons
  • App screenshots
  • Structured metadata (app titles, descriptions, categories, etc.)
  • Clean, well-organized format suitable for direct use in analytics, ML pipelines, or content systems
  • Covers a wide range of app categories

Possible use cases

  • App intelligence and market research
  • AI / ML training (computer vision, NLP, recommendation systems)
  • App discovery, comparison, or ranking platforms
  • UI / design trend analysis
  • Academic or commercial research

Why this may be useful

  • Large and scalable dataset
  • Consistent structure across assets
  • Saves significant time and cost compared to collecting and maintaining this data independently
  • Suitable for both enterprise and research use.

Commercial terms

  • Available as a one-time full or partial purchase.
  • Sample subset available for serious inquiries

If you’re working on a related product, research, or platform and this sounds relevant, feel free to comment or DM to discuss access, pricing, and technical details.

submitted by /u/ErikaUreka

Looking For Resources To Build A Good Game Theory Corpus.

Hey folks!
I’m trying to build a solid Game Theory dataset for learning and experimentation, and I’m looking for suggestions on where to source good material.

Anything works — books, blogs, lecture notes, papers, simulations, GitHub repos, etc.
If you’ve learned game theory from a resource you loved, I’d really appreciate the recommendation.

Thanks a lot! 🙂

submitted by /u/src2004__

[PAID] A Dataset Of Geopolitical Events And Cyberattacks

Hi everyone,

I’ve been working on a side project to create a dataset of geopolitical events and cyberattacks. I made two similar posts in other communities to get people’s feedback and I wanted to share the results with folks here!

Initially, the goal was to create datasets that would allow me to make geopolitical “predictions” (it is a very hard problem obviously, so I’ve been trying to find trends and patterns mostly). To that end, I’ve created a dataset that contains 5 types of events:

  • Cyberattacks
  • Military Offensives
  • Sanction announcements
  • Military aid announcements
  • International summits

The dataset spans events since 2015 and contains more than 390K press articles that correspond to more than 120K unique events.

The goal is to help individual developers/small teams in their projects at a very low cost. There are some costs on my end so I have to charge for larger downloads but I’m trying to keep the costs as minimal as possible.

Check it out and let me know your thoughts: https://rapidapi.com/user/nmk3

Thanks, looking forward to people’s feedback!

submitted by /u/Dizzy_Garden7295

Looking For Specific Type Of Dataset

Hi. I am working on an independent project that requires a South Asian face and age dataset (possibly with gender as well, though that is not the primary concern). I would like it to be concentrated on people of Indian, Pakistani, and Bangladeshi origin. I don’t want age groups (like baby, young, and old); I want actual numerical ages. Can anyone point me to a large dataset of this type? I have been unable to find anything so far.

submitted by /u/GasFearless1463

Wikidata Converted And Saved As Parquet Files

I don’t really know SPARQL, but I wanted to query Wikidata, so I converted the wikidata-truthy dataset to Parquet and uploaded it to Hugging Face. Maybe it can also be useful for others here.

submitted by /u/piebroo

Annotators/RLHF Folks: What’s The One Skill Signal Clients Actually Trust?

I’ve noticed two people can do similar annotation/RLHF/eval work, but one gets steady access to better projects and the other keeps hitting droughts.

I’m trying to map real signals that predict consistency and higher-quality projects (and not things that are “resume fluff”).

For people doing data labeling / RLHF / evaluation / safety reviews:

  • What are the top 3 signals that get you more work (speed, accuracy, domain expertise, writing quality, math, tool fluency, reliability, etc.)?
  • What do you wish you could prove about your work, but can’t easily? (quality, throughput, disagreement rate, escalation judgment, edge-case handling…)
  • If you’ve leveled up, what changed—skills, portfolio, workflow, specialization, networking, something else?

submitted by /u/bibbletrash

Built Something For Turning Websites Into Datasets With AI

I made a tool to turn websites into structured datasets using AI, mainly for cases where data only exists on web pages and not as APIs or downloads. The idea is to make it easier to repeatedly extract the same fields and build datasets over time without hand-maintaining scrapers.

I’m curious what kinds of datasets people here wish existed but are hard to create today, and whether an approach like this feels useful or too fragile for serious dataset work.

Disclaimer: I built this tool and am sharing it for feedback, not selling datasets.
It can be found by searching for Lection on the Chrome Web Store.

submitted by /u/MarketingJaded6157

Anyone Struggling To Find High-quality Non-English Training Data?

Working on a few local AI use cases and hitting the same wall: lack of clean, high-quality non-English data.

English datasets are everywhere, but once you go into local languages/dialects, quality drops fast—noisy labels, inconsistent formats, cultural gaps. Fine-tuning models for real-world local use becomes painful.

Curious from others building outside the US/EU bubble:

  • Where do you usually source non-English data?
  • What’s the biggest issue: quantity, quality, or context?
  • Have you paid for custom datasets before?

Feels like models are getting better faster than the data feeding them.

submitted by /u/Kind_Buyer8931