Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Looking For Feedback On A Conversational Speech Dataset (multilingual, Real Interactions)

We’ve been working on conversational speech datasets recently and wanted to share a sample to get feedback from this community.

This is focused on real conversational behaviour rather than clean, scripted dialogue.

What it includes:

  • multi-speaker conversations
  • natural interruptions and overlapping speech
  • code-switching (Hindi + English, Hinglish)
  • context-driven interactions (not isolated utterances)
  • speaker variability (accent, pace, fluency)

Languages covered in the sample:

  • Indian English
  • Hindi
  • Hinglish
  • Punjabi
  • Marwadi

We’ve tried to keep the structure usable for training and evaluation, with metadata around speakers, turns, and context.
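For concreteness, here is a minimal sketch of what one utterance-level record and a structural check could look like; every field name below is an illustrative assumption, not the dataset's actual schema.

```python
# Hypothetical utterance-level record; all field names are invented for
# illustration, not taken from the dataset being described.
record = {
    "conversation_id": "conv_0001",
    "turn_index": 12,
    "speaker_id": "spk_03",
    "language": "hinglish",     # could be "hindi", "english", "hinglish", ...
    "start_sec": 34.2,
    "end_sec": 36.9,
    "overlaps_previous": True,  # flags overlapping / interrupted speech
    "text": "haan that works for me",
}

def validate(rec):
    """Minimal structural check a training/eval pipeline might run."""
    required = {"conversation_id", "turn_index", "speaker_id",
                "language", "start_sec", "end_sec", "text"}
    return required <= rec.keys() and rec["end_sec"] > rec["start_sec"]

print(validate(record))  # True
```

A check like this is the kind of thing feedback on "dataset structure" tends to converge on: explicit turn indices, per-utterance language tags, and an overlap flag rather than clean non-overlapping turns.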

Still early, and would genuinely appreciate feedback on:

  • dataset structure
  • missing edge cases
  • what would make this more useful in real pipelines

Happy to share access if anyone wants to take a closer look.

submitted by /u/Cautious-Today1710
[link] [comments]

Posting A Small Artifact From The Control Plane Built On Top Of Earlier ArXiv/patent Stage-1 Dataset Work.

WIP: my attempt at pulling useful information out of a large dataset.

High-level loop:

artifact -> frontier -> bounded claim -> probe family -> evidence bundle -> route -> reasoner -> confidence update -> next-evidence request -> validation -> frontier advance

What it is trying to do is take stage-1 corpus outputs and turn them into a controlled evidence loop: pick one bounded question, gather support or contradiction around it, decide what evidence to ask for next, and only move forward when the result holds up.
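The loop above can be sketched roughly as follows; every function here is a toy stand-in, since the real components behind this post are not public.

```python
# Toy stand-ins so the control flow runs end-to-end; placeholders only,
# not the author's actual system.
def pick_bounded_claim(artifact): return {"claim": artifact}
def next_probe(claim, confidence): return claim
def gather_evidence(probe): return {"support": 1, "contradict": 0}
def update_confidence(conf, bundle):
    return min(1.0, conf + 0.15 * (bundle["support"] - bundle["contradict"]))
def validate(claim, bundle): return bundle["contradict"] == 0
def advance_frontier(claim): return claim["claim"]

def evidence_loop(artifact, max_cycles=3, threshold=0.75):
    """artifact -> bounded claim -> probe -> evidence -> confidence -> validate."""
    claim = pick_bounded_claim(artifact)       # one bounded question
    confidence = 0.5
    for _ in range(max_cycles):
        probe = next_probe(claim, confidence)  # next-evidence request
        bundle = gather_evidence(probe)        # support / contradiction
        confidence = update_confidence(confidence, bundle)
        if confidence >= threshold and validate(claim, bundle):
            return advance_frontier(claim)     # frontier advance
    return None  # claim did not survive validation

print(evidence_loop("finding:406"))  # finding:406
```

The key property the sketch preserves is the gating: the frontier only advances when confidence crosses a threshold and validation holds, otherwise the loop asks for more evidence.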

This HF link is one visible slice of the larger system, not the whole thing. Here it shows repeated bridge and contrast structure plus one stable rare finding across follow-up validation.

Partial artifact; more at the link below:

"cluster_ids": [406],
"member_count": 1,
"observation_count": 3,
"source_bundle_ids": ["tier1_bundle_mixed_region_0410c17c2515bf49"],
"source_result_ids": ["finding:406"],
"primary_terms": ["anisotropy", "coherent", "high-confidence", "rare", "centered", "gmims-related", "text", "magnetic", "medium", "film"],
"supporting_results": [
  {
    "result_type": "finding",
    "cluster_ids": [406],
    "bundle_id": "tier1_bundle_mixed_region_0410c17c2515bf49",
    "summary": "Cluster 406 is a coherent, high-confidence rare cluster centered on GMIMS-related text about anisotropy, magnetic medium, and film/samples, with 22 rows and a strong mean probability of 0.814.",
    "confidence": 0.76608,
    "support_cycles": 3,
    "next_probe_name": "neighbor_cluster_comparison",
    "next_probe_cluster_ids": [406],
    "next_probe_reason": "The card lists centroid neighbors 285 and 282 with very high cosine similarity, making them the most direct follow-up for checking whether the GMIMS/anisotropy pattern extends to adjacent clusters."
  }
],

link to hf with artifacts : https://huggingface.co/datasets/cjc0013/reasoningovercorpusartifacts/tree/main

link to previous dataset post : https://old.reddit.com/r/datasets/comments/1sej8ro/fused_patent_arxiv_clustering_dataset_9m_raw_388m/

submitted by /u/Either_Pound1986

Dataset For Training When An LLM Should Retrieve Vs When It Should Answer From Memory

One failure mode I keep seeing in assistants with retrieval is this:

the search path exists
the tool is available
the orchestration is wired

but the model still answers from memory on requests that clearly depend on current information.

So the failure is not always retrieval quality itself.
A lot of the time it is the trigger decision.

That got me interested in treating this as a dataset problem rather than only a prompting or orchestration problem.

We’ve been working on a Lane 07 style dataset focused on search triggering, where the supervision target is the boundary between:

  • requests that should trigger retrieval
  • requests that should stay on general knowledge

Each row is built to teach that judgment explicitly.

Example row:

{ "sample_id": "lane_07_search_triggering_en_00000008", "needs_search": true, "assistant_response": "This is best answered with a quick lookup for current data. If you want me to verify it, I can." } 
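As a point of comparison for rows like that, here is a trivial keyword baseline for the trigger decision; the cue list is invented for illustration, and a trained gate should beat this floor precisely on the implicit-freshness cases discussed below.

```python
# Naive freshness-keyword baseline for needs_search; the cue list is an
# illustrative assumption, not part of the dataset described above.
FRESHNESS_CUES = {"latest", "today", "current", "price", "score",
                  "weather", "availability", "schedule", "status"}

def needs_search(request):
    """Return True if the request looks like it depends on fresh data."""
    words = set(request.lower().split())
    return bool(words & FRESHNESS_CUES)

print(needs_search("what is the latest CUDA release"))  # True
print(needs_search("explain how attention works"))      # False
```

A baseline like this over-relies on explicit freshness words, which is exactly the gap the dataset is trying to supervise: implicit cases (booking, availability, status) carry no such keyword.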

What I find important here is that the dataset is not just teaching “search more.”

It teaches both sides:

  • when retrieval is actually required
  • when retrieval is unnecessary and just adds latency / cost

That matters because bad gating hurts in both directions:

  • over-triggering makes the system slower and more expensive
  • under-triggering gives you stale but confident answers

For me, the interesting dataset question is:
how do you represent retrieval judgment as a trainable supervision signal instead of leaving it to prompt heuristics?

A few things I’m curious about from others working on datasets or fine-tuning:

  • Would you model this as binary needs_search, or something richer?
  • How much do you rely on explicit freshness words like “latest” vs implicit freshness cases like booking, availability, status, schedules?
  • Have you seen better results from classifier-style data, SFT conversational rows, or hybrid setups?

Would love to hear how others are structuring retrieval-trigger data, if you’re building similar datasets.

submitted by /u/JayPatel24_

[Self-promotion][Synthetic] I Built A 100K-row Sleep Health Dataset From Scratch – It Just Earned A Kaggle Silver Medal (7,800 Views, 1,700+ Downloads In 2 Weeks)

A few weeks ago I released a synthetic sleep health dataset on Kaggle and it took off faster than I expected. Sharing it here in case anyone finds it useful.

What’s in it:

– 100,000 records, 32 features, 3 prediction targets

– Sleep architecture: REM %, deep sleep %, latency, wake episodes

– Lifestyle: caffeine, alcohol, screen time, exercise, steps

– Psychological: stress score, chronotype, mental health condition

– Demographics: 12 occupations, 15 countries, ages 18-69

Three ML targets:

– cognitive_performance_score – regression (0–100)

– sleep_disorder_risk – multiclass (Healthy / Mild / Moderate / Severe)

– felt_rested – binary classification

One finding that surprised people:

Lawyers average 5.74 hrs of sleep and 7.3/10 stress. Retired individuals average 8.03 hrs and 2.6/10 stress. That 2.29-hour gap shows up clearly in every model – occupation is the strongest predictor of sleep health in the entire dataset.

All distributions are calibrated against CDC, Sleep Foundation, and Frontiers in Sleep research. Correlations match peer-reviewed values (e.g. stress vs quality r=-0.64).

Link in profile if you want to check it out. Happy to answer questions about how it was built.

submitted by /u/Mohan137

How Would I Go About Using The MultiAIGCD Dataset?

Hello all,

I’m sure that this is a noob question, but how would I go about finding this dataset so that I can use it? I’ve tried my usual googling around, but can’t seem to find a way to access the dataset itself, other than a few Python questions labeled as “TeX Source” in the top right-hand side of the webpage provided.

Alternatively, is there another dataset that anyone knows about that has heaps of Java source code written by AI?

Thanks!

submitted by /u/Deidreia

Irish Property Price Register 2010–2026 — 778k Residential Sales Cleaned Into One CSV [OC]

The Irish Property Price Register is public data but only accessible through a slow paginated search with no bulk download. I wrote a Python script to pull the entire register into one flat CSV.

778,508 rows covering every recorded residential sale in Ireland since 2010.

Columns: date_of_sale, address, county, eircode, price_eur, not_full_market_price, vat_exclusive, description, property_size
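With a file in that shape, a median-by-year check needs only the stdlib; the column names follow the list above, and the ISO date format and file name are assumptions.

```python
# Yearly national median price from the flat CSV; column names taken from
# the post, date format and file path are assumptions.
import csv
import statistics
from collections import defaultdict

def yearly_medians(rows):
    """rows: iterable of dicts with date_of_sale (ISO) and price_eur."""
    by_year = defaultdict(list)
    for row in rows:
        year = row["date_of_sale"][:4]  # assumes YYYY-MM-DD dates
        by_year[year].append(float(row["price_eur"]))
    return {year: statistics.median(v) for year, v in sorted(by_year.items())}

# Usage against the actual file (path assumed):
# with open("property_price_register.csv", newline="", encoding="utf-8") as f:
#     print(yearly_medians(csv.DictReader(f)))
```

Note the findings below quote medians, not means, which matters for a price distribution with a long right tail.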

Some findings from the data:

– National median went from €205k (2010) to €360k (2026)

– Laois prices rose 126% from 2010–2012 avg to 2020–2022 avg

– Dublin’s premium over rest of Ireland narrowed from 117% to 47%

– New builds went from 25% of market in 2010 to 24% in 2026, but now cost €45k more than second-hand on average

– COVID barely dented prices — volumes collapsed but median held

[Dataset](https://www.kaggle.com/datasets/fionnhughes/property-price-register)

[Analysis notebook](https://www.kaggle.com/code/fionnhughes/property-price-analysis)

submitted by /u/Cool_Law_8915

Cleaned Indian Liver Patient Dataset (ML Ready)


https://www.kaggle.com/datasets/shauryasrivastava01/liver-patient-dataset

• 583 patient records with real clinical biomarkers

• Binary classification (Liver Disease vs Healthy)

• Fully cleaned + preprocessed (no messy columns)

• Includes enzymes, bilirubin, proteins & demographic data

• Perfect for ML projects, EDA, and healthcare modeling

💡 Great for:

– Beginners learning classification

– Feature importance & SHAP analysis

– Bias & fairness studies in healthcare

🚀 Ready to plug into your ML pipeline!

submitted by /u/Direct-Jicama-4051

Global Trash And Debris (geo-tagged, Real-world Imagery)

Sharing an open dataset of real-world trash and debris with geo-tagged imagery across different environments.

Useful for:

  • Waste / debris detection models
  • Environmental monitoring
  • Urban cleanliness analysis
  • Smart city / cleanup planning

Dataset: https://huggingface.co/datasets/Outerview/global-trash-and-debris-index

Most existing waste datasets are small or staged — this is focused on real-world, in-the-wild data, which is still relatively limited in computer vision.

Would love feedback or ideas on how people would use this.

submitted by /u/Realistic-Ad-6157

14K+ Global Potholes And Fire Hydrants (Geotagged Imagery)

Sharing two open geotagged image datasets:

Each dataset includes ground-level imagery with location metadata (latitude/longitude), along with additional attributes depending on the source.

Data is compiled from a mix of our own collection efforts and open mapping datasets, with a focus on real-world, observable infrastructure.

Potential use cases:

  • computer vision training (object detection / classification)
  • infrastructure analysis
  • urban planning / maintenance modeling
  • geospatial ML

Happy to answer questions or expand coverage if useful.

submitted by /u/Realistic-Ad-6157

Speech AI Works In Demos… So Why Does It Break In Real Life?

Been looking closely at speech datasets lately, and something feels off.

Most of what’s used to train models is way too clean.

No interruptions.
No overlap.
Hardly any code-switching.

But that’s not how people actually speak, especially in India.

Real conversations are messy. People switch between Hindi and English mid-sentence, talk over each other, drop context, pick it back up.

Feels like models aren’t failing because of architecture, but because the data doesn’t reflect reality.

Curious how others here are dealing with this.
Are you seeing the same gap in real-world performance?

submitted by /u/Cautious-Today1710

Open-source Cannabis Price Index — Methodology, SQL, And Sample Data

We’ve been running a weekly price index for the U.S. online cannabis market since December 2025. Today we’re open-sourcing the methodology, the SQL used to compute the index, and a sample dataset.

The index tracks average effective prices, discount rates, and discount depth across subcategories (Pre-Rolls, Cartridges, Gummies, etc.) relative to a fixed baseline week. It’s a straightforward avg-price-over-baseline calculation at the (category, subcategory) grain.
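A Python translation of that calculation might look like this; the repo's SQL is authoritative, and the field names and 100-based scaling here are assumptions.

```python
# Avg-price-over-baseline index at the (category, subcategory) grain.
# Sketch only; the repo's SQL is the authoritative implementation.
from collections import defaultdict

def price_index(rows, baseline_week):
    """rows: iterable of (week, category, subcategory, effective_price)."""
    sums = defaultdict(lambda: [0.0, 0])
    for week, cat, sub, price in rows:
        cell = sums[(week, cat, sub)]
        cell[0] += price
        cell[1] += 1
    avg = {key: total / count for key, (total, count) in sums.items()}
    # Index each week's average against the same grain's baseline average.
    return {
        (week, cat, sub): 100.0 * v / avg[(baseline_week, cat, sub)]
        for (week, cat, sub), v in avg.items()
        if (baseline_week, cat, sub) in avg
    }

rows = [
    ("2025-12-01", "Flower", "Pre-Rolls", 10.0),
    ("2025-12-01", "Flower", "Pre-Rolls", 12.0),
    ("2025-12-08", "Flower", "Pre-Rolls", 9.9),
]
idx = price_index(rows, "2025-12-01")
print(round(idx[("2025-12-08", "Flower", "Pre-Rolls")], 2))  # 90.0
```

Keeping the baseline fixed per (category, subcategory) is what lets subcategories with very different absolute price levels sit on one comparable 100-based scale.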

Repo: https://github.com/TheoV823/cannabis-price-index

Live index with full data: https://cannabisdealsus.com/cannabis-price-index/

Happy to answer questions about the approach or limitations.

submitted by /u/theov666

Need To Tag ~30k Vendors As IT Vs Non-IT

Hi everyone,

I have a large xlsx vendor master list (~30k vendors).

Goal:

Add ONE column: “IT_Relevant” with values Yes / No.

Definition:

Yes = vendor provides software, hardware, IT services, consulting, cloud, infrastructure, etc.

No = clearly non‑IT (energy, hotel, law firm, logistics, etc.).

Accuracy does NOT need to be perfect – this is a first‑pass filter for sourcing analysis.

Question:

What is a practical way to do this at scale?

Can it be done easily? Basically, each company would need to be researched on the web to decide whether it is IT-relevant or not. ChatGPT cannot handle that much data.
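One practical first pass that avoids per-vendor web research entirely is a keyword match over the vendor name (plus any description column you have); the keyword list below is purely illustrative and will misclassify plenty, which the "does not need to be perfect" constraint allows.

```python
# First-pass IT_Relevant tagger over vendor names/descriptions.
# Keyword list is an illustrative assumption; tune it against a labeled
# sample before trusting it. Web research can then focus on the leftovers.
IT_KEYWORDS = {"software", "cloud", "saas", "hosting", "cyber", "analytics",
               "consulting", "infrastructure", "technologies", "systems",
               "digital", "network", "telecom"}

def tag_vendor(name, description=""):
    """Return 'Yes' / 'No' for the IT_Relevant column."""
    text = f"{name} {description}".lower()
    return "Yes" if any(k in text for k in IT_KEYWORDS) else "No"

print(tag_vendor("Acme Cloud Hosting GmbH"))  # Yes
print(tag_vendor("Grand Hotel Vienna"))       # No
```

For the xlsx itself, exporting to CSV (or reading with openpyxl/pandas) and applying this row by row handles 30k vendors in seconds; an LLM then only needs to review the ambiguous middle, not all 30k rows.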

Thank you for your help.

submitted by /u/Grindelwaldt

Fused Patent + ArXiv Clustering Dataset (9M Raw → 3.88M Release, BGE-large, Deterministic Quality Gating)

Dataset link: https://huggingface.co/datasets/cjc0013/ArvixFusedWithPatents (still uploading, please keep that in mind)

9,063,272 raw rows → 3,881,329 release rows (~20+ GB zipped)

I built a zero-touch technical clustering pipeline over a fused patent + arXiv corpus. The full run was deterministic end-to-end, with Postgres used as the control plane rather than notebook state.

This was not just “embed some text and cluster it.”

The pipeline handled shard-level ingest/normalization, chunk embeddings with BAAI/bge-large-en-v1.5 (1024-dim), clustering, reducer-tree merge, global assignment, BM25 artifact generation, and then a deterministic inspection/gating pass to decide what was actually release-worthy.

Full raw run output:

  • 91 label shards
  • 91 embedding shards
  • 91 chunk shards
  • 422 final clusters
  • 9,063,272 labeled rows

I did not treat the raw output as valid by default.

I ran deterministic inspection across all 422 clusters and split them into:

  • 147 coherent
  • 107 mixed
  • 168 metadata-heavy

For the release dataset, I kept only the coherent clusters and dropped the mixed + metadata-heavy ones entirely.
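The keep/drop step reduces to a simple deterministic filter over the inspection labels; the structure and labels below are toy stand-ins for the actual pipeline state.

```python
# Deterministic keep/drop: only clusters labeled "coherent" by the
# inspection pass survive into the release set. Toy data, illustrative only.
def release_subset(clusters):
    """clusters: dict cluster_id -> {"label": str, "rows": list}."""
    keep = {cid: c for cid, c in clusters.items() if c["label"] == "coherent"}
    retained_rows = sum(len(c["rows"]) for c in keep.values())
    return keep, retained_rows

clusters = {
    406: {"label": "coherent", "rows": [1, 2, 3]},
    285: {"label": "mixed", "rows": [4]},
    282: {"label": "metadata-heavy", "rows": [5, 6]},
}
keep, n_rows = release_subset(clusters)
print(sorted(keep), n_rows)  # [406] 3
```

Because the filter is a pure function of the inspection labels, the 147-cluster / 3,881,329-row release subset is reproducible from the raw run without any manual patching.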

Final release subset:

  • 147 clusters
  • 3,881,329 rows
  • 42.82% retention from the raw run
  • ~20+ GB zipped

I also generated deterministic cluster names from top terms as a lightweight inspection layer. Example release clusters looked like:

  • wireless communications / device
  • substrate / semiconductor / layer
  • chemistry / formula / alkyl
  • neural / data / network
  • vehicle / system / control
  • signal / data / circuit

A big reason for the drop was metadata leakage. Some clusters were being driven by ingestion/wrapper fields rather than actual technical content, so keeping everything would have made the dataset look cleaner than it really was.

The system was also built to survive long, failure-prone runs instead of assuming ideal conditions. It uses Postgres-backed task leasing, heartbeats, and stage state; resumable progress; reducer-tree staged unblocking; explicit timeout handling; and a descending batch ladder so memory failures downshift deterministically instead of killing the run outright.
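The descending batch ladder in particular is easy to sketch; the ladder values and the MemoryError trigger here are illustrative, not the actual implementation.

```python
# Descending batch ladder: on a memory failure, retry the same work at the
# next smaller batch size instead of killing the run. Illustrative sketch.
BATCH_LADDER = [4096, 2048, 1024, 512]

def run_with_ladder(work, ladder=BATCH_LADDER):
    for batch_size in ladder:       # downshift deterministically
        try:
            return work(batch_size)
        except MemoryError:
            continue                # drop to the next rung
    raise RuntimeError("smallest batch size still failed")

# Toy workload that only fits at batch sizes <= 1024.
def work(batch_size):
    if batch_size > 1024:
        raise MemoryError
    return f"done at {batch_size}"

print(run_with_ladder(work))  # done at 1024
```

The determinism matters: two runs hitting the same memory ceiling downshift to the same rung, so results stay reproducible across failure-prone hardware.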

I did not re-embed the corpus, hand-label clusters, manually patch results, or overwrite the original run. The release set is derived strictly from deterministic keep/drop logic after full pipeline completion.

The 147-cluster subset is the release-grade version.

submitted by /u/Either_Pound1986

Looking For MND Test Reports (NCS And EMG) For My Final Year Project. Reports Can Be Anonymized; We Only Need The Readings And Conclusion, And We Can Feature The Sender In Our Work

We are making an FYP in which we predict MND with an AI model, and we need datasets (anonymized works as well); it just has to be real patient data.

We are invited to many places to present our idea, and we can feature the ones who help us get this dataset.

thanks

submitted by /u/Character_Shirt_9216

I’ve Made A Dataset Of 1 Million Samples But Don’t Know What Price To Sell It For [PAID]

Hi, I’m Yug, 20(M).

I have started a startup providing language datasets for AI companies and startups.

So I have made a 1-million-sample Hinglish dataset, totally unique, scraped from publicly available sources, well cleaned and labelled. Now I want to sell it but don’t know what price to ask. If you are in this field, can you help me?

Here is the sample: { "id": 501212, "text": "bhai ye kaafi acha hai", "intent": "Appreciation", "emotion": "Happy", "toxicity": "Low", "sarcasm": "No", "language": "Hinglish" }

I also have uploaded 5k samples on my GitHub.

submitted by /u/UniqueProfessional81

I Couldn’t Find Structured Data On UK Planning Refusals, So I Extracted It From PDFs Myself. Here Is The Schema Sample.

Most UK planning data is trapped in local council PDFs… so if you’re trying to build AI or risk models for property, its a nightmare to parse why things actually get rejected.

I spent the last few weeks building an extraction pipeline that pulls out the exact policy breaches, original context & officer notes into a CSV. I also wrote a script to abstract all the PII to just postcodes for GDPR compliance.
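The PII-abstraction step could be approximated like this; the regex is a common community approximation of UK postcode shapes, not the official spec, and would need edge-case testing before production use.

```python
# Reduce a free-text address to just its UK postcode for GDPR purposes.
# The pattern is an approximation of UK postcode shapes (e.g. "LS1 4AP",
# "SW1A 1AA"), not the full official grammar.
import re

POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.I)

def abstract_to_postcode(address):
    """Return the postcode found in an address, or 'UNKNOWN'."""
    m = POSTCODE.search(address)
    return m.group(0).upper() if m else "UNKNOWN"

print(abstract_to_postcode("12 High Street, Leeds LS1 4AP"))  # LS1 4AP
print(abstract_to_postcode("no address given"))               # UNKNOWN
```

Keeping only the postcode preserves enough spatial signal for proptech/risk modeling while dropping house numbers and occupant-identifying detail.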

I put a 50 row sample of the schema up on Kaggle here: SAMPLE

If anyone here is working in proptech, data engineering or spatial modeling, I’d love your feedback on the schema before I pay to run the compute to scale this to 10,000+ rows… what columns am I missing?

submitted by /u/a_cold_floor