Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, i’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

AWS Web Hosting Costs Vs (?) A Proxy For Software Market Size

I want to visualize the relationship between declining web hosting costs and growth in the software market. Specifically, I’m looking for a metric that reasonably captures software market size or prosperity, potentially reflecting the impact of more accessible hosting.

I’ve already found historical data on AWS pricing, but I haven’t been able to locate consistent, long-term data on software market size going back to roughly 2006–2008. If anyone can point me to good datasets – or suggest a solid proxy – I’d appreciate it. Thank you all!

submitted by /u/Fun_Pen8596
[link] [comments]

Junior Data Scientist Looking For Real-world Datasets To Work On (free)

Hey guys,

I’m a junior Data Scientist and I’m trying to get more real experience working with actual datasets.

If you have any data you want to explore or just don’t know what to do with it (business data, school project, personal spreadsheet, anything really), I’d be happy to help out for free.

Even small or random projects are totally fine.

If you think I could help you or someone you know, just message me 👍

submitted by /u/Alternative_Air3221
[link] [comments]

[Self-promotion] A Daily LLM-powered Scraper That Structures E-commerce Promos Into Clean CSV/JSON/Parquet – Free On Kaggle

Hello, everyone, we repurposed data from an old project into a Kaggle dataset⬇️ Happy to hear your thoughts and feedback

What this is about:
Major US retailers run hundreds of promotions daily – but there’s no clean, structured source to track them over time. I built a pipeline that scrapes 5 major e-commerce sites daily and extracts every promo, coupon code, and deal into a structured format using GPT-4o-mini and Llama.

Covers Office Depot, Ulta, Home Depot, 1800Flowers, and Shutterfly (for now) – with discount type, value, expiration date, and source URL for every record.

A few things the data shows right now:

  • Office Depot dominates volume: 73 promos today vs 10 for Home Depot
  • Ulta and 1800Flowers both hit 50% as their max discount: beauty and flowers are aggressive
  • Only 4% of promos have coupon codes: most discounts are applied automatically at checkout
  • Home Depot ran 228 promos on April 8th: likely a flash sale event worth investigating

You can find it here: https://www.kaggle.com/datasets/indext-data-lab-ai/promos-dataset

4,955+ records collected over 37 days and counting. Next update tomorrow morning

submitted by /u/KaiseyTayl
[link] [comments]

Data Set Preview – Cyber Security – RAG – Feedback Wanted Please – [Synthetic] (i Think)

here is the preview https://huggingface.co/datasets/Lucasautomatekc/Cybersecurity_RAG_Knowledge_Graph-25-Topics-75-Articles-200-Chunks

I am trying to see if this is something people actually want – I had an idea that some how lead to me looking into selling data sets – total beginner so I’m seeing if there is a certain structure or format folks prefer… I have the data through my web pages, its all clean and enterprise ready for LLM or whatever people need it for…

Honestly – I have no clue what I’m doing, so feedback would be appreciated to even see if I’m going down the right path… yes this is a preview, I have the full set for sale but again have no idea what I’m dong LOL.

Some how AI lead me here, depending on if this the content is actually sellable, I may never follow robots blindly again, or… I will make it my life mission to praise the bots!

Thanks all!

submitted by /u/Bitter_Produce_8153
[link] [comments]

[OC] 21 Years Of EU Fuel Prices Cleaned Into One Dataset — 106k Rows, All 27 Countries, Weekly Since 2005

I built and maintain this dataset — linking to my own Kaggle.

The European Commission publishes weekly pump prices for all 27 EU member states going back to January 2005 — but only as a messy 200-column Excel file with multiple header rows and prices per 1000 litres. I cleaned it into one flat CSV, one row per country per week per fuel type.

Covers petrol 95, diesel, heating oil, fuel oil and LPG across all 27 EU member states + UK.

A few things the data shows right now:

  • Irish diesel is rising 5.7x faster during the Iran war than it did during the Ukraine war
  • Netherlands and Denmark are the most expensive countries for diesel at €2.58 and €2.56/litre
  • Malta is by far the cheapest at €1.21/litre — government price controls
  • The 2022 Ukraine war spike is visible across every country in the heatmap simultaneously

Free dataset on Kaggle | Analysis notebook | Hugging Face mirror

Updated every Wednesday when the EC releases new data. Next update April 15th.

Happy to pull specific country numbers if anyone wants them.

submitted by /u/Cool_Law_8915
[link] [comments]

Dataset Idea For Training Retrieval Judgment Instead Of Just Retrieval Itself

Been thinking about a failure mode that feels more like a dataset problem than a tooling problem:

the retrieval stack is available
the tool is wired
the docs are there

but the model still answers from memory on requests that clearly depend on current information.

So the issue is not always “bad search.”
A lot of the time it is the trigger decision:
when should the model actually check, and when should it not?

I’ve been looking at a Lane 07 style setup for this where the supervision signal is explicit:

  • needs_search: true when freshness matters
  • needs_search: false when model knowledge is enough

Example row:

{ "sample_id": "lane_07_search_triggering_en_00000008", "needs_search": true, "assistant_response": "This is best answered with a quick lookup for current data. If you want me to verify it, I can." } 

What I like about this framing is that it does not just teach “retrieve more.”
It teaches both sides of the boundary:

  • when to trigger
  • when to hold back

That seems useful because bad gating hurts in both directions:

  • over-triggering adds latency and cost
  • under-triggering gives stale but confident answers

I’m experimenting with dataset structures for this kind of retrieval judgment and I think it is an underrated training target compared with just improving retrieval quality itself.

Curious how others here would structure it:

  • binary needs_search
  • richer labels
  • classifier-style trigger data
  • conversational SFT rows
  • hybrid setup

Would love to hear if anyone else is working on datasets for this boundary.

submitted by /u/JayPatel24_
[link] [comments]

BankFocus Orbis Data Question For Studies

Hi,

Does anyone know how to access to BankFocus Orbis Data? I want to use banks level data from different countries for my studies but unfortunately my university doesn’t give access to it. Is there another way to do it?

And another question is if I access to it, is data ordered in a nice way like IMF or World Bank.

Thanks,

submitted by /u/Elegant610
[link] [comments]

Dataset Of Geospatial Data For Active Industrial Sites In Russia?

Working on a GIS project mapping industrial facilities in Russia.

Looking for datasets covering active, large-scale industrial sites (operating, not abandoned), such as:

– manufacturing plants / заводы

– mining sites (open-pit карьеры, underground шахты)

– металлургия (steel mills, smelters)

– oil & gas infrastructure (refineries, GPPs, terminals)

– chemical / fertilizer plants

– power generation (ТЭЦ, ГЭС, АЭС)

– major industrial zones / hubs

Need:

– points (addresses or lat/long) OR

– polygons / site boundaries

Bonus:

Regional data, industry classification, operator/company

Formats:

CSV, GeoJSON, SHP, etc.

Interested in publicly available datasets like:

– OSM data

– corporate reports

– aggregated/open GIS datasets

Prefer datasets with regional breakdown (especially Ural region), but country-wide data is also useful.

submitted by /u/Active-Hornet-9241
[link] [comments]

Looking For Feedback On A Conversational Speech Dataset (multilingual, Real Interactions)

We’ve been working on conversational speech datasets recently and wanted to share a sample to get feedback from this community.

This is focused on real conversational behaviour rather than clean, scripted dialogue.

What it includes:

  • multi-speaker conversations
  • natural interruptions and overlapping speech
  • code-switching (Hindi + English, Hinglish)
  • context-driven interactions (not isolated utterances)
  • speaker variability (accent, pace, fluency)

Languages covered in the sample:

  • Indian English
  • Hindi
  • Hinglish
  • Punjabi
  • Marwadi

We’ve tried to keep the structure usable for training and evaluation, with metadata around speakers, turns, and context.

Still early, and would genuinely appreciate feedback on:

  • dataset structure
  • missing edge cases
  • what would make this more useful in real pipelines

Happy to share access if anyone wants to take a closer look.

submitted by /u/Cautious-Today1710
[link] [comments]

Posting A Small Artifact From The Control Plane Built On Top Of Earlier ArXiv/patent Stage-1 Dataset Work.

WIP My attempt at pulling useful information out of a large dataset.

High-level loop:

artifact -> frontier -> bounded claim -> probe family -> evidence bundle -> route -> reasoner -> confidence update -> next-evidence request -> validation -> frontier advance

What it is trying to do is take stage-1 corpus outputs and turn them into a controlled evidence loop: pick one bounded question, gather support or contradiction around it, decide what evidence to ask for next, and only move forward when the result holds up.

This HF link is one visible slice of the larger system, not the whole thing. Here it shows repeated bridge and contrast structure plus one stable rare finding across follow-up validation.

partial artifact, more on the link below

“cluster_ids”: [ 406 ], “member_count”: 1, “observation_count”: 3, “source_bundle_ids”: [ “tier1_bundle_mixed_region_0410c17c2515bf49” ], “source_result_ids”: [ “finding:406” ], “primary_terms”: [ “anisotropy”, “coherent”, “high-confidence”, “rare”, “centered”, “gmims-related”, “text”, “magnetic”, “medium”, “film” ], “supporting_results”: [ { “result_type”: “finding”, “cluster_ids”: [ 406 ], “bundle_id”: “tier1_bundle_mixed_region_0410c17c2515bf49”, “summary”: “Cluster 406 is a coherent, high-confidence rare cluster centered on GMIMS-related text about anisotropy, magnetic medium, and film/samples, with 22 rows and a strong mean probability of 0.814.”, “confidence”: 0.76608, “support_cycles”: 3, “next_probe_name”: “neighbor_cluster_comparison”, “next_probe_cluster_ids”: [ 406 ], “next_probe_reason”: “The card lists centroid neighbors 285 and 282 with very high cosine similarity, making them the most direct follow-up for checking whether the GMIMS/anisotropy pattern extends to adjacent clusters.” } ],

link to hf with artifacts : https://huggingface.co/datasets/cjc0013/reasoningovercorpusartifacts/tree/main

link to previous dataset post : https://old.reddit.com/r/datasets/comments/1sej8ro/fused_patent_arxiv_clustering_dataset_9m_raw_388m/

submitted by /u/Either_Pound1986
[link] [comments]

Dataset For Training When An LLM Should Retrieve Vs When It Should Answer From Memory

One failure mode I keep seeing in assistants with retrieval is this:

the search path exists
the tool is available
the orchestration is wired

but the model still answers from memory on requests that clearly depend on current information.

So the failure is not always retrieval quality itself.
A lot of the time it is the trigger decision.

That got me interested in treating this as a dataset problem rather than only a prompting or orchestration problem.

We’ve been working on a Lane 07 style dataset focused on search triggering, where the supervision target is the boundary between:

  • requests that should trigger retrieval
  • requests that should stay on general knowledge

Each row is built to teach that judgment explicitly.

Example row:

{ "sample_id": "lane_07_search_triggering_en_00000008", "needs_search": true, "assistant_response": "This is best answered with a quick lookup for current data. If you want me to verify it, I can." } 

What I find important here is that the dataset is not just teaching “search more.”

It teaches both sides:

  • when retrieval is actually required
  • when retrieval is unnecessary and just adds latency / cost

That matters because bad gating hurts in both directions:

  • over-triggering makes the system slower and more expensive
  • under-triggering gives you stale but confident answers

For me, the interesting dataset question is:
how do you represent retrieval judgment as a trainable supervision signal instead of leaving it to prompt heuristics?

A few things I’m curious about from others working on datasets or fine-tuning:

  • Would you model this as binary needs_search, or something richer?
  • How much do you rely on explicit freshness words like “latest” vs implicit freshness cases like booking, availability, status, schedules?
  • Have you seen better results from classifier-style data, SFT conversational rows, or hybrid setups?

Would love to hear how others are structuring retrieval-trigger data, if you’re building similar datasets.

submitted by /u/JayPatel24_
[link] [comments]

[Slef-promotion][Synthetic] I Built A 100K-row Sleep Health Dataset From Scratch – It Just Earned A Kaggle Silver Medal (7,800 Views, 1,700+ Downloads In 2 Weeks)

A few weeks ago I released a synthetic sleep health dataset on Kaggle and it took off faster than I expected. Sharing it here in case anyone finds it useful.

What’s in it:

– 100,000 records, 32 features, 3 prediction targets

– Sleep architecture: REM %, deep sleep %, latency, wake episodes

– Lifestyle: caffeine, alcohol, screen time, exercise, steps

– Psychological: stress score, chronotype, mental health condition

– Demographics: 12 occupations, 15 countries, ages 18-69

Three ML targets:

– cognitive_performance_score- regression (0–100)

– sleep_disorder_risk – multiclass (Healthy / Mild / Moderate / Severe)

– felt_rested – binary classification

One finding that surprised people:

Lawyers average 5.74 hrs of sleep and 7.3/10 stress. Retired individuals average 8.03 hrs and 2.6/10 stress. That 2.13-hour gap shows up clearly in every model – occupation is the strongest predictor of sleep health in the entire dataset.

All distributions are calibrated against CDC, Sleep Foundation, and Frontiers in Sleep research. Correlations match peer-reviewed values (e.g. stress vs quality r=-0.64).

Link in profile if you want to check it out. Happy to answer questions about how it was built.

submitted by /u/Mohan137
[link] [comments]

How Would I Go About Using The MultiAIGCD Dataset?

Hello all,

I’m sure that this is a noob question, but how would I go about finding this dataset so that I can use it? I’ve tried my usual googling around, but can’t seem to find a way to access the dataset itself, other than for a few python questions labeled as “TeX Source” in the top right-hand side of the webpage provided.

Alternatively, is there another dataset that anyone knows about that has heaps of Java source code written by AI?

Thanks!

submitted by /u/Deidreia
[link] [comments]

Irish Property Price Register 2010–2026 — 778k Residential Sales Cleaned Into One CSV [OC]

The Irish Property Price Register is public data but only accessible

through a slow paginated search with no bulk download. I wrote a Python

script to pull the entire register into one flat CSV.

778,508 rows covering every recorded residential sale in Ireland since 2010.

Columns: date_of_sale, address, county, eircode, price_eur,

not_full_market_price, vat_exclusive, description, property_size

Some findings from the data:

– National median went from €205k (2010) to €360k (2026)

– Laois prices rose 126% from 2010–2012 avg to 2020–2022 avg

– Dublin’s premium over rest of Ireland narrowed from 117% to 47%

– New builds went from 25% of market in 2010 to 24% in 2026,

but now cost €45k more than second-hand on average

– COVID barely dented prices — volumes collapsed but median held

[Dataset](https://www.kaggle.com/datasets/fionnhughes/property-price-register)

[Analysis notebook](https://www.kaggle.com/code/fionnhughes/property-price-analysis)

submitted by /u/Cool_Law_8915
[link] [comments]

Cleaned Indian Liver Patient Dataset (ML Ready)

🔥 Cleaned Indian Liver Patient Dataset (ML Ready)

https://www.kaggle.com/datasets/shauryasrivastava01/liver-patient-dataset

• 583 patient records with real clinical biomarkers

• Binary classification (Liver Disease vs Healthy)

• Fully cleaned + preprocessed (no messy columns)

• Includes enzymes, bilirubin, proteins & demographic data

• Perfect for ML projects, EDA, and healthcare modeling

💡 Great for:

– Beginners learning classification

– Feature importance & SHAP analysis

– Bias & fairness studies in healthcare

🚀 Ready to plug into your ML pipeline!

submitted by /u/Direct-Jicama-4051
[link] [comments]

Global Trash And Debris (geo-tagged, Real-world Imagery)

Sharing an open dataset of real-world trash and debris with geo-tagged imagery across different environments.

Useful for:

  • Waste / debris detection models
  • Environmental monitoring
  • Urban cleanliness analysis
  • Smart city / cleanup planning

Dataset: https://huggingface.co/datasets/Outerview/global-trash-and-debris-index

Most existing waste datasets are small or staged — this is focused on real-world, in-the-wild data, which is still relatively limited in computer vision.

Would love feedback or ideas on how people would use this.

submitted by /u/Realistic-Ad-6157
[link] [comments]

14K+ Global Potholes And Fire Hydrants (Geotagged Imagery)

Sharing two open geotagged image datasets:

Each dataset includes ground-level imagery with location metadata (latitude/longitude), along with additional attributes depending on the source.

Data is compiled from a mix of our own collection efforts and open mapping datasets, with a focus on real-world, observable infrastructure.

Potential use cases:

  • computer vision training (object detection / classification)
  • infrastructure analysis
  • urban planning / maintenance modeling
  • geospatial ML

Happy to answer questions or expand coverage if useful.

submitted by /u/Realistic-Ad-6157
[link] [comments]