Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it all for? World domination?

Real Free Heavily Moderated Salary Data Not Locked Behind Paywalls And Accounts

What Do They Make is entirely privacy-first, heavily moderated salary data that is publicly accessible. There are no accounts, no logins, and no paywalls. Zero logs, no IP tracking, nothing identifiable.

Give as much or as little information as you wish, or doom scroll through the feed of others who have posted. Every submitter is issued a random code that they can use to modify or delete their submission at any time.

submitted by /u/whatdotheymake
[link] [comments]

Hello, Is Anyone Able To Help Me Access The EU RASFF Notifications Pre 2021 Spreadsheet

It should be publicly available, but every time I click download on the URL/spreadsheet it just refreshes the page instead. I feel like I've tried everything, and asking here is a last resort. I need this information for a paper I want to work on.

I believe it is the Excel sheet hinted at on this page: https://data.europa.eu/data/datasets/restored_rasff?locale=en

It would be a monumental help if anyone could help me download the Excel sheet, as I am seriously struggling and this would massively benefit my work.

Thank you in advance.

submitted by /u/afjecj
[link] [comments]

Are People Really Divided Into Groups Of “cat People” And “dog People” Or Are We Seeing More Of A Mixture Of Dogs And Cats Together? I Want To Test That Theory!

I am studying whether people mostly have dogs or cats, and I wonder how true the "cat person" / "dog person" phenomenon is. I need 50 data entries from individuals stating how many dogs and/or cats they have. Please comment below if you want to be part of my study and give the number of cats and/or dogs you own! Thank you! This is anonymous, and you will not have to give any personal information.

submitted by /u/nikiab94
[link] [comments]

20M+ Indian Court Cases – Structured Metadata, Citation Graphs, Vector Embeddings (API + Bulk Export)

I spent 6 years indexing Indian court cases from the Supreme Court, all 25 High Courts, and 14 Tribunals. Sharing because I haven’t seen a structured Indian legal dataset at this scale anywhere.

What’s in it:

– 20M+ cases with pdf, structured metadata (court, bench, date, parties, sections cited, acts referenced, case type, headnotes)

– Citation graph across the full corpus (which case cites, follows, distinguishes, or overrules which)

– 23,122 Indian Acts and Statutes (Central, State, Regulatory) with full text and amendment tracking

– Vector embeddings (Voyage AI, 1024d) for every case

– Bilingual legal translation pairs across 11 Indian languages (Hindi, Tamil, Telugu, Bangla, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Urdu) paired with English

For context: India has the world’s largest common law system.

40M+ pending cases. Court judgments are public domain under Indian law (no copyright on judicial decisions). But the raw data is scattered across 25+ different court websites, each with different formats, and many orders are scanned image PDFs with no searchable text.

Available as:

– REST API (sub-500ms hybrid semantic + keyword search)

– Bulk export (JSON / Parquet)

– Vector search via Qdrant

The bilingual legal translation pairs might be interesting for NLP researchers working on low-resource Indian languages. Legal text is formal register with precise terminology, which is hard to find in most Indian language corpora.

Details: vaquill ai

Happy to answer questions about the data collection process, schema, or coverage gaps.

submitted by /u/zriyansh
[link] [comments]

Looking For Datasets Of Handwritten Medical Prescriptions (doctor Handwriting → Text)

Hello,

I’m working on a machine learning project focused on handwriting recognition, specifically targeting handwritten medical prescriptions and converting them into readable English text.

I’ve already searched through Kaggle and other sources, but most datasets either don’t focus on prescriptions or don’t have a large enough dataset of handwritten text.

I’m looking for:

  • Datasets containing handwritten doctor prescriptions
  • Ideally but not necessarily w/ ground truth transcriptions (handwritten → typed text)
  • English-language data only
  • Properly anonymized / compliant with privacy standards (no PII)

If anyone knows of publicly available datasets or repositories (academic, government, or open-source), I’d really appreciate the help. Even partial datasets or related resources (e.g., general medical handwriting) would be useful.

Sorry for the trouble and thanks in advance!

submitted by /u/Carode143
[link] [comments]

Looking For A 10+ Year News Archive For Academic NLP/ML Research (Low Budget)

I’m looking for an archive covering roughly 10 years of news publications, ideally from reputable media outlets (or a widely used news website).

I plan to use the data for academic research, specifically for text analysis / machine learning. As a student, I have a limited budget and cannot afford expensive commercial databases (I can spend up to around $400).

Does anyone have experience with similar datasets or can recommend a suitable source?

submitted by /u/TemporaryNo5605
[link] [comments]

Looking For A Dataset For Clustering And PCA Project

Hi guys, I'm new to this data science world. I'm looking for a real-world dataset for a data science portfolio project focused on clustering and PCA (no classification labels required):

  • At least 4–10 numerical features
  • Preferably 500+ rows
  • Suitable for customer/user segmentation or behavioral clustering
  • Clean or moderately clean data
  • Must be publicly available

The goal is to apply dimensionality reduction (PCA) and clustering algorithms and interpret meaningful segments.
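Assuming scikit-learn and a synthetic stand-in for whatever dataset ends up being used, the pipeline itself can be sketched like this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for a real dataset: 500 rows, 6 numerical features,
# drawn from two shifted blobs so there is a segment structure to find
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(250, 6)),
    rng.normal(loc=4.0, scale=1.0, size=(250, 6)),
])

# Standardize before PCA so no single feature dominates the components
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 components for visualization and interpretation
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Cluster in the reduced space and inspect segment sizes
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_2d)
print(pca.explained_variance_ratio_)
print(np.bincount(labels))
```

Swapping the synthetic `X` for a real table of numerical features is the only change needed; the interesting work is then interpreting what each component and cluster means.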

Any suggestions for datasets that fit this use case would be highly appreciated. If you don't have a direct dataset recommendation, I would also be very grateful for ideas on where to look.

submitted by /u/persephone_y
[link] [comments]

How Do You Handle Semantic Differences When Integrating Data Across Organizations?

I’m working on a data integration problem in the railway/infrastructure domain and would really appreciate some input from people with experience in data engineering or system design.

We are integrating data from multiple railway companies. The challenge is that they often describe the same physical asset differently.

For example, two records may refer to essentially the same real-world object (a track), but:

– naming differs

– structure and attributes may differ

– IDs are not shared across systems

What we want to achieve:

– Automatically detect that these refer to the same type of object

– Map them to a unified model (something like an ontology layer)

– Ideally also match actual instances across systems (entity resolution)

What is the best-practice architecture for this kind of problem?

How much can realistically be automated vs. manually mapped?
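A common first pass that can be fully automated is name normalization plus token-based similarity, confirmed by attributes converted to common units; everything here (records, field names, thresholds) is a hypothetical sketch, not a reference architecture:

```python
import re

def tokens(name: str) -> set[str]:
    # Lowercase, drop hyphens inside codes, split on non-alphanumerics
    return set(re.findall(r"[a-z0-9]+", name.lower().replace("-", "")))

def name_similarity(a: str, b: str) -> float:
    # Token-set Jaccard similarity: robust to word order and punctuation
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Hypothetical records from two railway companies describing the same track
record_a = {"name": "Track 12-B (Main Line)", "length_m": 1240}
record_b = {"name": "main line track 12B", "length_km": 1.24}

# Score the names, then confirm with attributes converted to common units
score = name_similarity(record_a["name"], record_b["name"])
same_length = abs(record_a["length_m"] - record_b["length_km"] * 1000) < 1.0
is_match = score > 0.8 and same_length
print(score, is_match)
```

In practice this kind of rule-based matcher handles the bulk of easy pairs; the ambiguous remainder is where a shared ontology layer and human review usually come in.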

Thanks a lot!

submitted by /u/theophil93
[link] [comments]

Need Dataset For Trekking Data (Indian Treks)

I’m working on a personal project where I need structured data for Indian treks, specifically fields like:

  • trek name
  • location
  • difficulty
  • duration
  • highest altitude

So I wanted to ask:

  1. Does anyone know of a good dataset for Indian treks with these fields?
  2. Any tips for scraping sites more effectively?
  3. Is there a better data source or API I might be missing?

Appreciate any help

submitted by /u/Unable_Contest_4003
[link] [comments]

I Have Access To 500K Real US Whatsapp Numbers — Is There Any Legal Way To Monetize This?

I have access to a large dataset of around 500,000 active whatsapp phone numbers belonging to people based in New York.

These are real, valid contacts, but there is no prior relationship or opt-in from their side.

I’m trying to figure out what are the legal, ethical, and practical ways to turn something like this into a business or income stream.

Is there any legitimate way to monetize such a dataset? What industries or models could make use of this kind of data? How do companies usually convert raw contact data into revenue? What are the risks I should be aware of?

Looking for honest advice from people who understand data, marketing, or business.

What would you do in this situation?

submitted by /u/PsychologicalCat937
[link] [comments]

“Almost JSON” Is One Of The Most Annoying Model Failure Modes

Been thinking about this a lot lately.

A model can look great on extraction at first, then the second you try plugging it into a real pipeline, it starts doing all the little annoying things:
missing keys, drifting field names, guessing on bad input, or slipping back into prose.

That’s why I’ve been more interested in training fixed-key behavior and clean validation instead of just prompting harder for JSON.

Feels like “almost structured” output is basically useless once a parser is involved.
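A minimal sketch of the kind of strict gate I mean, with a hypothetical fixed schema: reject prose, missing keys, and key drift before anything reaches the downstream parser.

```python
import json

# Hypothetical fixed key set for an extraction task
REQUIRED_KEYS = {"invoice_id", "total", "currency"}

def parse_strict(raw: str) -> dict:
    """Parse model output; fail loudly on prose, missing keys, or key drift."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not JSON: {e}") from e
    if not isinstance(obj, dict):
        raise ValueError("top level must be an object")
    missing = REQUIRED_KEYS - obj.keys()
    extra = obj.keys() - REQUIRED_KEYS
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return obj

ok = parse_strict('{"invoice_id": "A-17", "total": 99.5, "currency": "EUR"}')

# Key drift: "amount" instead of "total" is rejected, not silently accepted
try:
    parse_strict('{"invoice_id": "A-17", "amount": 99.5, "currency": "EUR"}')
    drifted = False
except ValueError:
    drifted = True
print(ok, drifted)
```

The point is that the validator's rejections double as training signal: every failure it catches is a candidate row for the fixed-key behavior you want to train.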

Curious what breaks first for people here:
missing fields, key drift, bad validation, or prose creeping back in?

submitted by /u/JayPatel24_
[link] [comments]

Back Again With Another Training Problem I Keep Running Into While Building Dataset Slices For Smaller LLMs

Hey, I’m back with another one from the pile of model behaviors I’ve been trying to isolate and turn into trainable dataset slices.

This time the problem is reliable JSON extraction from financial-style documents.

I keep seeing the same pattern:

You can prompt a smaller/open model hard enough that it looks good in a demo.
It gives you JSON.
It extracts the right fields.
You think you're close.
Then real documents hit it, and the structure starts drifting.

That’s the part that keeps making me think this is not just a prompt problem.

It feels more like a training problem.

A lot of what I’m building right now is around this idea that model quality should be broken into very narrow behaviors and trained directly, instead of hoping a big prompt can hold everything together.

For this one, the behavior is basically:

Can the model stay schema-first, even when the input gets messy?

Not just:
“can it produce JSON once?”

But:

  • can it keep the same structure every time
  • can it make success and failure outputs equally predictable

One of the row patterns I’ve been looking at has this kind of training signal built into it:

{
  "sample_id": "lane_16_code_json_spec_mode_en_00000001",
  "assistant_response": "Design notes: - Storage: a local JSON file with explicit load and save steps. - Bad: vague return values. Good: consistent shapes for success and failure."
}

What I like about this kind of row is that it does not just show the model a format.

It teaches the rule:

  • vague output is bad
  • stable structured output is good

That feels especially relevant for stuff like:

  • financial statement extraction
  • invoice parsing

So this is one of the slices I’m working on right now while building out behavior-specific training data.

Curious how other people here think about this.

submitted by /u/JayPatel24_
[link] [comments]

Persistent Temporal Knowledge Graph Datasets

I'm working on a temporal knowledge graph (TKG) model for link prediction and graph generation. Basically, I have snapshots of a persistent knowledge graph over time, stored as (subject, relation, object) triplets, and I want to train the model to autoregressively predict the next graph over a sequence of timesteps. For training, it takes in the graph at timestep t and predicts the graph at timestep t+1.

Unfortunately, I’m running into a pretty severe issue: the model overfits almost immediately, and Hits@K stays basically random.
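Since "basically random" is the symptom, one cheap sanity check is to verify the Hits@K implementation itself against the random baseline; this is a minimal sketch with made-up entity counts:

```python
import numpy as np

def hits_at_k(scores: np.ndarray, true_idx: np.ndarray, k: int) -> float:
    """Fraction of queries whose true entity ranks in the top k.

    scores: (n_queries, n_entities) model scores, higher = better.
    true_idx: (n_queries,) index of the correct entity per query.
    """
    # Rank = 1 + number of entities scored strictly higher than the true one
    true_scores = scores[np.arange(len(true_idx)), true_idx]
    ranks = (scores > true_scores[:, None]).sum(axis=1) + 1
    return float((ranks <= k).mean())

# With uniform random scores over n entities, Hits@K should sit near k / n;
# if a trained model matches this number, it has learned nothing useful
rng = np.random.default_rng(0)
n_queries, n_entities = 2000, 500
scores = rng.random((n_queries, n_entities))
true_idx = rng.integers(0, n_entities, size=n_queries)
print(hits_at_k(scores, true_idx, k=10))
```

Comparing the trained model's Hits@K against this k/n floor makes "basically random" a precise claim rather than an impression.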

Current dataset:

I'm currently using wikidata12k, a pretty small dataset, which I think may be causing some of the issues. It gives me about 200 knowledge graphs, one for each year from 1800 to 2020, each with about 500 nodes.

I would actually love a bigger dataset, but it has to be in a persistent knowledge graph format, which means the graph changes slowly over time, and the graph at timestep t is similar to the graph at timestep t+1. This unfortunately rules out a lot of popular TKG datasets like ICEWS.

I’ve also looked at YAGO11k, but it suffers from the same lack of scale as wikidata12k.

I’ve made another post in r/learnmachinelearning with details about the architecture and other issues I’m facing, which you can check out if you want more details.

https://www.reddit.com/r/learnmachinelearning/comments/1sjl7ck/temporal_gnn_gat_pernode_lstm_overfitting/

Thank you so much for the help, and I’m happy to answer any additional questions

submitted by /u/Divine_Invictus
[link] [comments]

[self-promotion] Made A Website To Visualize Statsbomb Open Data – Feedback Highly Appreciated

https://chrischu-yc.github.io/sports-analytics/statsbomb_opendata_visualize/

Hi guys! I’m new to sports analytics and this is the first project that I’ve done. I’m still a university student and would be very interested to do something sports analytics related in the future. I’m a huge football (soccer), baseball and F1 fan.

Here I basically just took the free Statsbomb open data and built a website that shows all their matches, with tools like passing maps, team passing networks and xG plots available for every match in the database. Someone has probably done this before, and tbh it might not be the most useful thing, but it's still a cool way to dive into old matches and explore probably the best free API you can get in football today.

The most unique thing I made is a performance card for each player in every match, as I don't think I've seen something similar online for football (please correct me if I'm wrong). They're downloadable and give a quick summary of a player's performance in that game, with a match rating based on a scheme I designed myself. Sort of like a report card for players after the match.

Would love feedback from anyone and ideas on how to expand the website. Here's the link again: https://chrischu-yc.github.io/sports-analytics/statsbomb_opendata_visualize/. Also, if you want to check out my GitHub repository, it's here.

submitted by /u/ChrisC_13
[link] [comments]

AWS Web Hosting Costs Vs (?) A Proxy For Software Market Size

I want to visualize the relationship between declining web hosting costs and growth in the software market. Specifically, I’m looking for a metric that reasonably captures software market size or prosperity, potentially reflecting the impact of more accessible hosting.

I’ve already found historical data on AWS pricing, but I haven’t been able to locate consistent, long-term data on software market size going back to roughly 2006–2008. If anyone can point me to good datasets – or suggest a solid proxy – I’d appreciate it. Thank you all!

submitted by /u/Fun_Pen8596
[link] [comments]

Junior Data Scientist Looking For Real-world Datasets To Work On (free)

Hey guys,

I’m a junior Data Scientist and I’m trying to get more real experience working with actual datasets.

If you have any data you want to explore or just don’t know what to do with it (business data, school project, personal spreadsheet, anything really), I’d be happy to help out for free.

Even small or random projects are totally fine.

If you think I could help you or someone you know, just message me 👍

submitted by /u/Alternative_Air3221
[link] [comments]

[Self-promotion] A Daily LLM-powered Scraper That Structures E-commerce Promos Into Clean CSV/JSON/Parquet – Free On Kaggle

Hello everyone! We repurposed data from an old project into a Kaggle dataset. Happy to hear your thoughts and feedback.

What this is about:
Major US retailers run hundreds of promotions daily – but there’s no clean, structured source to track them over time. I built a pipeline that scrapes 5 major e-commerce sites daily and extracts every promo, coupon code, and deal into a structured format using GPT-4o-mini and Llama.

Covers Office Depot, Ulta, Home Depot, 1800Flowers, and Shutterfly (for now) – with discount type, value, expiration date, and source URL for every record.

A few things the data shows right now:

  • Office Depot dominates volume: 73 promos today vs 10 for Home Depot
  • Ulta and 1800Flowers both hit 50% as their max discount: beauty and flowers are aggressive
  • Only 4% of promos have coupon codes: most discounts are applied automatically at checkout
  • Home Depot ran 228 promos on April 8th: likely a flash sale event worth investigating
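For anyone curious how stats like the ones above fall out of the schema, here is a minimal sketch with made-up rows in the dataset's shape (the real data has discount type, value, expiration date, and source URL per record):

```python
import pandas as pd

# Hypothetical rows mimicking the dataset's shape: one record per promo per day
df = pd.DataFrame({
    "date": ["2025-04-08"] * 4,
    "retailer": ["Office Depot", "Office Depot", "Home Depot", "Ulta"],
    "discount_type": ["percent", "fixed", "percent", "percent"],
    "discount_value": [20, 10, 15, 50],
    "coupon_code": [None, "SAVE10", None, None],
})

# Promo volume per retailer, share of promos needing a coupon code,
# and max percentage discount per retailer
volume = df.groupby("retailer").size().sort_values(ascending=False)
code_share = df["coupon_code"].notna().mean()
max_discount = (
    df[df["discount_type"] == "percent"].groupby("retailer")["discount_value"].max()
)
print(volume)
print(code_share)
print(max_discount)
```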

You can find it here: https://www.kaggle.com/datasets/indext-data-lab-ai/promos-dataset

4,955+ records collected over 37 days and counting. Next update tomorrow morning

submitted by /u/KaiseyTayl
[link] [comments]

Data Set Preview – Cyber Security – RAG – Feedback Wanted Please – [Synthetic] (I Think)

here is the preview https://huggingface.co/datasets/Lucasautomatekc/Cybersecurity_RAG_Knowledge_Graph-25-Topics-75-Articles-200-Chunks

I am trying to see if this is something people actually want. I had an idea that somehow led to me looking into selling data sets. Total beginner here, so I'm seeing if there is a certain structure or format folks prefer. I have the data through my web pages; it's all clean and enterprise-ready for LLMs or whatever people need it for.

Honestly, I have no clue what I'm doing, so feedback would be appreciated even just to see if I'm going down the right path. Yes, this is a preview; I have the full set for sale, but again, I have no idea what I'm doing, LOL.

Somehow AI led me here. Depending on whether the content is actually sellable, I may never follow robots blindly again, or... I will make it my life mission to praise the bots!

Thanks all!

submitted by /u/Bitter_Produce_8153
[link] [comments]

[OC] 21 Years Of EU Fuel Prices Cleaned Into One Dataset — 106k Rows, All 27 Countries, Weekly Since 2005

I built and maintain this dataset — linking to my own Kaggle.

The European Commission publishes weekly pump prices for all 27 EU member states going back to January 2005 — but only as a messy 200-column Excel file with multiple header rows and prices per 1000 litres. I cleaned it into one flat CSV, one row per country per week per fuel type.
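The reshaping step can be sketched with pandas; the column names below are a hypothetical stand-in for the Commission's wide layout, not the actual headers:

```python
import pandas as pd

# Hypothetical slice of the wide layout: one column per country/fuel pair,
# with prices quoted per 1000 litres
wide = pd.DataFrame({
    "date": ["2005-01-03", "2005-01-10"],
    "DE_petrol95": [1104.0, 1111.0],
    "DE_diesel": [989.0, 994.0],
    "FR_petrol95": [1080.0, 1085.0],
})

# Melt to one row per country per week per fuel type
long = wide.melt(id_vars="date", var_name="country_fuel",
                 value_name="price_per_1000l")
long[["country", "fuel"]] = long["country_fuel"].str.split("_", n=1, expand=True)

# Convert to €/litre and keep a flat, analysis-ready shape
long["price_per_litre"] = long["price_per_1000l"] / 1000
long = long[["date", "country", "fuel", "price_per_litre"]]
print(long)
```

The real file additionally needs the multi-row headers collapsed before the melt, but the long format above is the target shape either way.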

Covers petrol 95, diesel, heating oil, fuel oil and LPG across all 27 EU member states + UK.

A few things the data shows right now:

  • Irish diesel is rising 5.7x faster during the Iran war than it did during the Ukraine war
  • Netherlands and Denmark are the most expensive countries for diesel at €2.58 and €2.56/litre
  • Malta is by far the cheapest at €1.21/litre — government price controls
  • The 2022 Ukraine war spike is visible across every country in the heatmap simultaneously

Free dataset on Kaggle | Analysis notebook | Hugging Face mirror

Updated every Wednesday when the EC releases new data. Next update April 15th.

Happy to pull specific country numbers if anyone wants them.

submitted by /u/Cool_Law_8915
[link] [comments]

Dataset Idea For Training Retrieval Judgment Instead Of Just Retrieval Itself

Been thinking about a failure mode that feels more like a dataset problem than a tooling problem:

the retrieval stack is available
the tool is wired
the docs are there

but the model still answers from memory on requests that clearly depend on current information.

So the issue is not always “bad search.”
A lot of the time it is the trigger decision:
when should the model actually check, and when should it not?

I’ve been looking at a Lane 07 style setup for this where the supervision signal is explicit:

  • needs_search: true when freshness matters
  • needs_search: false when model knowledge is enough

Example row:

{
  "sample_id": "lane_07_search_triggering_en_00000008",
  "needs_search": true,
  "assistant_response": "This is best answered with a quick lookup for current data. If you want me to verify it, I can."
}

What I like about this framing is that it does not just teach “retrieve more.”
It teaches both sides of the boundary:

  • when to trigger
  • when to hold back

That seems useful because bad gating hurts in both directions:

  • over-triggering adds latency and cost
  • under-triggering gives stale but confident answers
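As a baseline before training anything, the trigger decision can even be approximated with a keyword heuristic; the cue list below is an illustrative assumption, useful mainly for bootstrapping labels that a trained gate should then beat:

```python
import re

# Hypothetical baseline gate: flag queries whose answer likely depends on
# fresh data (prices, weather, "latest", explicit recent years)
FRESHNESS_CUES = re.compile(
    r"\b(today|latest|current|now|price of|stock|weather|this week|202[4-9])\b",
    re.IGNORECASE,
)

def needs_search(query: str) -> bool:
    return bool(FRESHNESS_CUES.search(query))

# Tiny labeled set in the needs_search format
labeled = [
    ("What is the current USD to EUR rate?", True),
    ("Explain how a hash table works.", False),
    ("Who won the race today?", True),
]
accuracy = sum(needs_search(q) == y for q, y in labeled) / len(labeled)
print(accuracy)
```

A heuristic like this over-triggers on incidental keyword hits, which is exactly the gap the richer labels and SFT rows are meant to close.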

I’m experimenting with dataset structures for this kind of retrieval judgment and I think it is an underrated training target compared with just improving retrieval quality itself.

Curious how others here would structure it:

  • binary needs_search
  • richer labels
  • classifier-style trigger data
  • conversational SFT rows
  • hybrid setup

Would love to hear if anyone else is working on datasets for this boundary.

submitted by /u/JayPatel24_
[link] [comments]

BankFocus Orbis Data Question For Studies

Hi,

Does anyone know how to get access to BankFocus (Orbis) data? I want to use bank-level data from different countries for my studies, but unfortunately my university doesn't provide access to it. Is there another way to do it?

Another question: if I do get access, is the data organized in a nice way like IMF or World Bank data?

Thanks,

submitted by /u/Elegant610
[link] [comments]

Dataset Of Geospatial Data For Active Industrial Sites In Russia?

Working on a GIS project mapping industrial facilities in Russia.

Looking for datasets covering active, large-scale industrial sites (operating, not abandoned), such as:

– manufacturing plants (factories)

– mining sites (open-pit and underground mines)

– metallurgy (steel mills, smelters)

– oil & gas infrastructure (refineries, GPPs, terminals)

– chemical / fertilizer plants

– power generation (thermal, hydro, and nuclear plants)

– major industrial zones / hubs

Need:

– points (addresses or lat/long) OR

– polygons / site boundaries

Bonus:

Regional data, industry classification, operator/company

Formats:

CSV, GeoJSON, SHP, etc.

Interested in publicly available datasets like:

– OSM data

– corporate reports

– aggregated/open GIS datasets

Prefer datasets with regional breakdown (especially Ural region), but country-wide data is also useful.
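For the OSM route, an Overpass query is a common starting point; the bounding box and tag selection below are illustrative assumptions (`landuse=industrial` and `man_made=works` are real OSM tags, but coverage and tagging consistency vary by region):

```python
import json
import urllib.parse
import urllib.request

# Bounding box roughly covering part of the Ural region
# (south, west, north, east) — adjust to the actual area of interest
bbox = "51.0,55.0,62.0,67.0"
query = f"""
[out:json][timeout:120];
(
  way["landuse"="industrial"]({bbox});
  relation["landuse"="industrial"]({bbox});
  node["man_made"="works"]({bbox});
);
out center tags;
"""

def fetch_industrial_sites(overpass_url="https://overpass-api.de/api/interpreter"):
    # POST the query; "out center" adds a centroid for ways/relations,
    # giving point coordinates even for polygon features
    data = urllib.parse.urlencode({"data": query}).encode()
    req = urllib.request.Request(overpass_url, data=data)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["elements"]

print(query.strip())
```

The returned elements carry OSM tags (name, operator, industry hints), and the geometry can be exported to GeoJSON/CSV; for polygons, fetching full geometry with `out geom;` instead of `out center tags;` is the usual variant.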

submitted by /u/Active-Hornet-9241
[link] [comments]

Looking For Feedback On A Conversational Speech Dataset (multilingual, Real Interactions)

We’ve been working on conversational speech datasets recently and wanted to share a sample to get feedback from this community.

This is focused on real conversational behaviour rather than clean, scripted dialogue.

What it includes:

  • multi-speaker conversations
  • natural interruptions and overlapping speech
  • code-switching (Hindi + English, Hinglish)
  • context-driven interactions (not isolated utterances)
  • speaker variability (accent, pace, fluency)

Languages covered in the sample:

  • Indian English
  • Hindi
  • Hinglish
  • Punjabi
  • Marwadi

We’ve tried to keep the structure usable for training and evaluation, with metadata around speakers, turns, and context.

Still early, and would genuinely appreciate feedback on:

  • dataset structure
  • missing edge cases
  • what would make this more useful in real pipelines

Happy to share access if anyone wants to take a closer look.

submitted by /u/Cautious-Today1710
[link] [comments]