Category: Datatards
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
I need lots of models, graphs, and data sets relevant to the COVID-19 pandemic. To be more specific: I am trying to give a presentation for a class called “Models in Science” and I want to talk about how modeling the pandemic was effective and ineffective at spreading information and misinformation during the height of the pandemic.
submitted by /u/PsychologicalRock995
Hi everyone,
I’m looking for a suitable 3D point cloud dataset — or a CAD/mesh dataset from which I can sample point clouds — for a small research/report project.
The goal is to compare Topological Data Analysis (TDA) as a preprocessing / feature extraction method against more standard 3D point cloud preprocessing methods, under different perturbations such as:
- Gaussian jitter / noise
- random point deletion / subsampling
- small deformations
- scaling / rotations
- outliers or other synthetic corruptions
The comparison would be based on the classification accuracy of a downstream model after preprocessing.
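To make the perturbation settings concrete, here is a rough NumPy sketch of the corruptions I have in mind for an (N, 3) point cloud; the parameter values are just placeholders, not fixed choices.

```python
import numpy as np

def gaussian_jitter(points: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Add isotropic Gaussian noise to every point."""
    return points + np.random.normal(scale=sigma, size=points.shape)

def random_deletion(points: np.ndarray, keep_ratio: float = 0.7) -> np.ndarray:
    """Randomly drop a fraction of the points (subsampling)."""
    n_keep = max(1, int(len(points) * keep_ratio))
    idx = np.random.choice(len(points), n_keep, replace=False)
    return points[idx]

def random_rotation(points: np.ndarray) -> np.ndarray:
    """Apply a random 3D rotation built from a QR decomposition."""
    q, _ = np.linalg.qr(np.random.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1  # ensure a proper rotation, not a reflection
    return points @ q.T

def add_outliers(points: np.ndarray, n_outliers: int = 20, scale: float = 2.0) -> np.ndarray:
    """Append uniformly distributed outlier points around the shape's bounding box."""
    lo, hi = points.min(0) * scale, points.max(0) * scale
    outliers = np.random.uniform(lo, hi, size=(n_outliers, 3))
    return np.vstack([points, outliers])
```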
I do not necessarily need many classes. Even a binary classification dataset would be enough. What matters most is that the classes should differ in their topological structure, ideally in the number of holes / loops / cavities, so that TDA has a meaningful signal to detect.
For example, something like:
- sphere / ball-like objects vs torus / ring-like objects
- solid object vs object with a tunnel
- objects with different numbers of handles or holes
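If nothing ready-made exists, a pair like the ones above could even be generated synthetically. A quick NumPy sketch (my own parameter choices; the torus is sampled angle-uniformly, not area-uniformly, which is fine for a toy example) of two classes that differ in their loop structure:

```python
import numpy as np

def sample_sphere(n: int = 1024, radius: float = 1.0) -> np.ndarray:
    """Sample points uniformly on a sphere surface (no 1D loops)."""
    v = np.random.normal(size=(n, 3))
    return radius * v / np.linalg.norm(v, axis=1, keepdims=True)

def sample_torus(n: int = 1024, R: float = 1.0, r: float = 0.35) -> np.ndarray:
    """Sample points on a torus (two independent loops)."""
    theta = np.random.uniform(0, 2 * np.pi, n)  # around the central hole
    phi = np.random.uniform(0, 2 * np.pi, n)    # around the tube
    x = (R + r * np.cos(phi)) * np.cos(theta)
    y = (R + r * np.cos(phi)) * np.sin(theta)
    z = r * np.sin(phi)
    return np.stack([x, y, z], axis=1)

# Two toy classes, 600 samples each, matching the size requirement above.
class_a = [sample_sphere() for _ in range(600)]
class_b = [sample_torus() for _ in range(600)]
```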
Ideally, each class should contain many samples (600+), or the dataset should contain enough CAD/mesh models so that I can sample many point clouds from them.
Does anyone know of a dataset that fits this description? I would also appreciate suggestions for CAD repositories, synthetic dataset generators, or benchmark datasets where such class pairs could be extracted.
Thanks!
submitted by /u/generalbrain_damage
I’ve worked with common datasets from Kaggle and UCI, but I’m looking for more realistic data sources tied to actual business or operational problems.
I’m especially interested in datasets where analysis could answer questions like:
- Why sales dropped in a region
- Customer churn patterns
- Inventory or supply chain inefficiencies
- Pricing opportunities
- Marketing campaign performance
I’ve already explored Kaggle, UCI, and some open government portals.
For those who build portfolio projects or practice real analytics work:
- Where do you usually find more realistic datasets?
- How do you turn raw public data into a meaningful business problem statement?
- Any underrated sources (APIs, city data, company reports, scraped public data, etc.)?
Would appreciate hearing your process.
submitted by /u/silent-romeo57
The USDA Dr. Duke Database of Phytochemicals and Ethnobotany is one of the most comprehensive collections of plant–compound relationships in existence. Over 76,000 records. Decades of work. It includes notes on bioactivity, concentration ranges, and ethnobotanical uses for thousands of plant species.
The user interface hasn’t changed in about twenty years. There is no bulk export. The compounds have no standardized identifiers. SMILES strings do not exist. If your workflow requires PubChem CIDs, you have to start from scratch.
Every team working in the field of machine learning for natural products ultimately has to preprocess the same raw data independently. I know this because I’ve spoken with people who’ve done it, and the same problems came up every time.
So I rebuilt it.
The current version: 76,907 records. 9,098 unique compounds with PubChem CID mappings. SMILES via CID lookup. USPTO patent numbers starting in 2020. Intervention data from ClinicalTrials.gov. Classification of compounds into discrete phytochemicals, complex mixtures, substance classes, and generic ambiguities.
The most time-consuming part was not the data enrichment. It was the question of how to handle records where the compound name is ambiguous. RESIN has no CID. ALKALOID FRACTION has no CID. Assigning one would be incorrect. Leaving them without documentation explaining why they are zero leaves the next researcher in the dark. That is why I added a “compound_type” column that classifies each record and documents the classification logic.
The dataset underwent an external CID review this month. A chemistry consultant manually reviewed 13,206 compound assignments and compared them against PubChem, COCONUT, and InChI keys. One confirmed error was found and corrected. 1,534 records that previously had no CID were resolved by matching IUPAC names, reducing the number of zero-CID records by 8%.
The dataset is provided as Parquet and JSON. Queryable in less than five minutes using DuckDB.
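For example, a first pass with DuckDB’s Python client might look like this. The file name, and any column beyond compound_type, are assumptions on my part; the MANIFEST has the real schema.

```python
import duckdb

con = duckdb.connect()
# Hypothetical file name; see the MANIFEST in the repo for the real path and schema.
summary = con.execute("""
    SELECT compound_type, COUNT(*) AS n_records
    FROM read_parquet('duke_phytochemicals.parquet')
    GROUP BY compound_type
    ORDER BY n_records DESC
""").fetchdf()
print(summary)
```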
Available on HuggingFace (wirthal1990-tech/USDA-Phytochemical-Database-JSON). The GitHub repository (wirthal1990-tech/USDA-Phytochemical-Database-JSON) contains the complete MANIFEST and the methodology documentation.
submitted by /u/DoubleReception2962
Seen this catch teams right before a production push more than once.
Dev has clean data, a few rows, and almost no noise. The plan looks fine, the query runs fast, and nobody questions it. Then the same query hits production and suddenly it’s the main character in an incident review.
The missing index is the classic one. Scanning 150 rows costs nothing. No slowdown, no scary estimate, no reason to care. Then it scans five million rows and everyone starts looking at the execution plan like it betrayed them personally.
Plans can also change between dev and prod. The optimizer works with stats and row counts. Change the volume and you can get a different join strategy, different index usage, different order of operations. A plan that looks fine in dev can turn ugly once real row counts show up.
Test data is usually too clean too. No weird NULLs, no duplicate values, no old records, no boundary cases. Dev passes because the dataset was never rude enough to fail.
Same with joins. Small reference tables make everything look harmless. Real selectivity only shows up when both sides of the join have enough data to hurt.
The boring fix is better staging data. Some teams generate it manually. Some use tools like dbForge Data Generator to get production-like volume before testing queries there.
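If you roll your own, even a crude generator that bakes in volume, skewed join keys, NULLs, and old records gets you most of the way. A rough sketch; the table and columns are made up purely for illustration:

```python
import csv
import random
import uuid
from datetime import date, timedelta

# Hypothetical 'orders' table: 5M rows with skewed join keys, NULLs, old records,
# and boundary dates -- the stuff tiny dev datasets never have.
N_ROWS = 5_000_000
start = date(2015, 1, 1)

with open("orders_staging.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["order_id", "customer_id", "amount", "status", "created_at"])
    for _ in range(N_ROWS):
        # Pareto-ish skew: a handful of customers own most of the rows.
        customer = int(random.paretovariate(1.2)) % 50_000
        amount = round(random.expovariate(1 / 80), 2)
        status = random.choices(
            ["paid", "pending", "refunded", None], weights=[80, 15, 4, 1]
        )[0]
        created = start + timedelta(days=random.randint(0, 4000))
        w.writerow([str(uuid.uuid4()), customer, amount, status, created])
```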
The annoying part is the query wasn’t really broken in dev. The data just wasn’t big enough to tell the truth.
What usually breaks first for you in prod: missing indexes, bad estimates, or the plan flipping once real data hits?
submitted by /u/MissionFormal61
Was working on a project and needed Albanian government data in English. Spent a few weeks cleaning and translating it. Sharing it here in case anyone finds it useful.
Data includes:
– 399 health centers with contact details
– 2,289 approved medicines
– 1,654 treasury transactions
– 2,700+ schools
– Business registration stats 2023-2026
Available at albaniandata.com — free tier included. Happy to answer questions about the data or methodology.
submitted by /u/Massive-Two-8399
Hi everyone,
I’m sharing a metadata-only dataset of 7,000 news articles (extracted from a larger 700k core) designed specifically for NLP feature engineering and Media Intelligence. Instead of just standard sentiment (Positive/Negative), I’ve focused on “Narrative Alpha”: structural signals that quantify how a story is being told.
Why this is useful: If you’re building news classifiers, bias detectors, or financial sentiment models, standard text often isn’t enough. This set provides deterministic linguistic metrics you can’t get from a standard scrape.
What’s Inside (22 Columns):
- Structural Metrics: Passive Voice Ratio, Sentence/Word Counts.
- Narrative Signals: Hedging Rate (uncertainty cues), Claim Density per 1k words.
- Credibility & Alignment: Headline-Body Alignment Score, Primary Source Ratio (attribution).
- Traditional Labels: Topic, Political Orientation, Bias Strength, Credibility Level.
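To give a feel for what the narrative signals measure, here is a toy approximation of two of them. These are naive heuristics for illustration only, not the actual pipeline.

```python
import re

HEDGES = {"may", "might", "could", "reportedly", "allegedly", "suggests",
          "appears", "likely", "possibly", "according"}

def hedging_rate(text: str) -> float:
    """Fraction of sentences containing at least one hedge cue (toy heuristic)."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    hedged = sum(1 for s in sentences if HEDGES & set(s.lower().split()))
    return hedged / max(len(sentences), 1)

def claims_per_1k_words(text: str) -> float:
    """Sentences with a number or a reporting verb, per 1k words (toy heuristic)."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    claims = sum(1 for s in sentences
                 if re.search(r"\d", s)
                 or re.search(r"\b(said|announced|reported)\b", s.lower()))
    return 1000 * claims / max(len(words), 1)
```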
Technical Specs:
- Format: Tabular CSV (Clean, no text blobs to protect legal/copyright).
- Usability: 10.0/10.0 on Kaggle (fully documented columns).
- License: CC BY 4.0 (Open for research/commercial use).
Link: Kaggle
AMA about the methodology or the pipeline!
submitted by /u/Queasy_System9168
Been building this for a few months with my co-founder. Wanted to share here and get honest feedback.
DataPulse delivers ready-made datasets from Amazon, Temu, Zillow, LinkedIn, Airbnb, and 10 more sources: automated pipeline, no sales calls, public pricing.
The Temu one is interesting — we’re the only ready-made Temu product catalog on the market right now. Bright Data confirmed on their own page they only do it on a custom basis.
Pricing is $399-$899/mo per dataset vs Bright Data’s $50K-$100K/yr. Same data, fraction of the cost.
Also do custom requests — if you need a source that’s not in our catalog, any site, any fields, we’ll quote within 24 hours.
Free sample pull if anyone wants to test quality: no card needed, just fill out the form.
Genuinely open to feedback. What are we missing?
submitted by /u/Sufficient-War-4020
Two datasets that are hard to find ready-made:
**Temu — 50M+ products** (the only off-the-shelf one on the market)
`product_id, name, category, price_usd, discount_pct, rating, review_count, in_stock` + 8 more fields
**Amazon — 200M+ products**
`asin, title, brand, category, price, bsr_rank, rating, review_count` + 9 more fields
Weekly refresh. CSV, JSON, Parquet.
Drop a comment if you want a free sample.
submitted by /u/Flat_Telephone_4636
We’ve been building a public quality standard for AI training data — same idea as Moody’s for bonds — and the free audit tool is now open to anyone. No account needed.
What you get if you paste a HuggingFace dataset URL at https://labelsets.ai/rate
• A 19-dimension quality score (structural, annotation, training-fit, compliance)
• 7-oracle consensus across 5 algorithm families with Cohen + Fleiss κ agreement reporting
• 95% Wilson confidence intervals on rate-based dimensions
• 90% conformal prediction interval on downstream model F1 (Vovk 2005 / Romano 2019)
• Contamination flags against 40+ public evals — MMLU, HumanEval, GSM8K, MedQA, LegalBench, SQuAD, ARC, TruthfulQA, etc.
• An Ed25519-signed cert verifiable offline against our public key (fingerprint aa4c070af907e2ea)
Methodology paper is published open CC BY 4.0 (19 pages, peer-review ready) at labelsets.ai/paper — fork it, reimplement it, write a paper that disagrees with us.
The free /rate audit produces a JSON cert. The hosted PDF + permalink + embeddable badge are paid ($49 procurement / $149 pro), but the underlying score is the same.
Built deliberately so verification works at FedRAMP-restricted shops — public API at GET /api/verify-lqs-cert/:hash, no auth required, or run crypto.verify() against the Ed25519 public key locally.
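Roughly, local verification in Python looks like the sketch below (the key bytes and the exact cert field names here are placeholders, not the real schema; the `cryptography` package stands in for Node’s crypto.verify()):

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.exceptions import InvalidSignature

# Placeholder: substitute the published 32-byte public key
# (fingerprint aa4c070af907e2ea). Field names in the cert are illustrative.
PUBLIC_KEY_BYTES = bytes.fromhex("00" * 32)

def verify_cert(cert_json: str) -> bool:
    cert = json.loads(cert_json)
    signature = bytes.fromhex(cert["signature"])
    payload = json.dumps(cert["payload"], sort_keys=True,
                         separators=(",", ":")).encode()
    key = Ed25519PublicKey.from_public_bytes(PUBLIC_KEY_BYTES)
    try:
        key.verify(signature, payload)  # raises InvalidSignature on mismatch
        return True
    except InvalidSignature:
        return False
```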
Curious what people here think of the dimension list. Happy to defend any of the 19 or kill the ones that don’t carry weight.
submitted by /u/plomii
https://zenodo.org/records/19659891
Initial release of the curated European Union member states Indicators dataset (2026).
- 27 member states covered (current EU composition).
- 12 variables: id, country_name, iso_alpha3, capital, eu_accession_year, schengen_accession_year, is_schengen_member, latitude, longitude, landlocked, area_km2, population.
- Standardized Metadata: Includes ISO 3166-1 alpha-3 codes and geospatial centroids.
- Format: Available in CSV format, optimized for read.csv() in R and pandas.read_csv() in Python.
- Validation: Data integrity checked for missing values (specifically handling non-Schengen members).
- Metadata: Includes .zenodo.json for automatic archiving and CITATION.cff for GitHub integration.
- License: Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)
- https://github.com/lightbluetitan/european_union_indicators
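Quick start in Python (the file name is my assumption; check the Zenodo record or the GitHub repo for the exact one):

```python
import pandas as pd

# File name assumed; adjust to the name in the Zenodo record.
eu = pd.read_csv("european_union_indicators.csv")

# Using columns listed above: population, area_km2, country_name, landlocked.
eu["pop_density"] = eu["population"] / eu["area_km2"]
print(eu.sort_values("pop_density", ascending=False)
        [["country_name", "pop_density", "landlocked"]].head(10))
```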
submitted by /u/renzocrossi
Hello
I’m practicing astrology and need the dataset described below for my analysis.
I’m looking for a dataset of all the countries in the world with details like the date and time of independence, capital city, etc.
Please point me toward anything suitable.
TIA
submitted by /u/Divin3_Rudra
https://zenodo.org/records/19493935
⚽🏆 Initial release of the curated FIFA World Cup dataset (1930–2022).
- 22 editions covered (1930–2022)
- 10 variables: id, edition, year, host_country, host_continent, winner, second_place, third_place, fourth_place, total_teams
- Available in CSV and XLSX formats
- Validated in R and Python
- Licensed under CC BY 4.0
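A quick sanity check in Python (file name assumed; columns as listed above):

```python
import pandas as pd

# File name is an assumption; adjust to the name in the Zenodo record.
wc = pd.read_csv("fifa_world_cup.csv")
print(wc["winner"].value_counts())                     # titles per country
print(wc.groupby("host_continent")["year"].count())    # editions per host continent
```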
submitted by /u/renzocrossi
Getting access to job data is very annoying, and tbh, with everyone trying to use AI tools (OpenAI, Claude, Cursor, etc.) to help with job decisions, people are just going to start scraping job sites to find positions so they have structured job data to give them a competitive edge.
Imagine what this does for network administrators; they’ll hate their career pages.
I’m starting a version-controlled open data “pool” to facilitate sharing: jobdatapool.com
Disclaimer: this is promotional content to kick off open source conversion of my project.
submitted by /u/TacoTuesdayX
Hey everyone,
I’ve been working on a small side project and wanted to share it here in case it’s useful for others dealing with messy data.
It’s a no-code CSV pipeline tool, but the part I’ve been focusing on recently is a “data health” layer that tries to answer a simple question: how bad is this dataset before I start working on it?
For each dataset (and each column), it surfaces things like:
- % of missing values
- outliers
- skewness
- uniqueness
- data type consistency
You can also drill into individual columns to see why something looks off, instead of manually scanning or writing quick checks.
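For comparison, the pandas baseline I kept rewriting before building this looks roughly like the following (just the manual checks, not the tool’s code):

```python
import pandas as pd

def health_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: missing %, uniqueness, skewness, dtype."""
    numeric = df.select_dtypes("number")
    return pd.DataFrame({
        "missing_pct": df.isna().mean() * 100,
        "unique_ratio": df.nunique(dropna=True) / len(df),
        "skewness": numeric.skew(),          # NaN for non-numeric columns
        "dtype": df.dtypes.astype(str),
    })

# Usage: report = health_report(pd.read_csv("my_data.csv")); print(report)
```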
The general idea behind the tool is:
- every transformation creates a versioned snapshot
- you can go back to any previous step
- you don’t lose the original dataset
- everything is visual / no-code
I built it mostly because I kept repeating the same initial checks in pandas and wanted a faster way to get a feel for the data before doing anything serious.
Not trying to replace code-based workflows; it’s more about speeding up the early “what am I dealing with?” phase.
Curious how others approach this part of analysis, and whether something like this would actually fit into your workflow or just feel unnecessary.
submitted by /u/Woland96
What are some things or products you realise rich or famous people hide from the public to keep their true sources a secret?
I’m doing some research and am hoping to find some examples.
submitted by /u/Shot_Army8540
I’m a senior software engineer (Clojure, Python, Rust, TypeScript/JavaScript, etc.) who works with LLMs daily for real development work, mainly on side projects. I’ve been building tooling to capture and annotate these sessions — not just the final code, but the full multi-turn trajectory with per-step expert annotations: correctness, engineering quality rating, error taxonomy (wrong approach, bad idiom, overengineering, etc.), and how errors were recovered (model self-corrected, expert redirected, expert rewrote).
The closest existing thing I’m aware of is PRM800K for math reasoning, but nothing equivalent exists publicly for code. SWE-bench has pass/fail outcomes but no step-level human quality judgments. Here’s what I want to know:
- Is anyone actually buying this kind of data? I know Scale AI, Surge, etc. hire coders for annotation work, but is there demand for independently produced, expert-annotated trajectory datasets?
- Is the implicit signal from product usage (accepting/rejecting model outputs in tools like Copilot, Claude Code, Cursor) making explicit annotation redundant? Labs get millions of implicit preference signals for free from their users. Does manual expert annotation add something that’s worth paying for?
- Does niche language coverage (e.g., Clojure, Haskell) change the calculus? Underrepresented languages have less implicit data, but does that make expert trajectories in those languages more valuable, or is the buyer pool too small to matter in the first place?
- Am I stuck (i.e., probably better off) just contracting with annotation vendors directly? Rather than selling a dataset, should I be applying to Scale/Surge/DataAnnotation with this tooling and expertise? Or is the tooling unnecessary for those platforms too?
For context, each annotated session includes: the full transcript (readable + machine-parseable), git diffs tied to specific turns, structured YAML annotations with a documented rubric, and session metadata (model used, duration, complexity). I’m still working on the annotation schema, but it is “informed” by PRM800K, HelpSteer2, and UltraFeedback conventions.
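To make that concrete, one per-turn record currently looks roughly like this (schema still in flux; field names are illustrative, shown as a Python dict rather than the on-disk YAML):

```python
# Rough shape of one per-turn annotation; names and values are illustrative.
turn_annotation = {
    "turn": 7,
    "correctness": "incorrect",            # did the step do what was asked?
    "quality_rating": 2,                   # engineering quality, 1-5
    "error_taxonomy": ["wrong_approach", "overengineering"],
    "recovery": "expert_redirected",       # model_self_corrected | expert_redirected | expert_rewrote
    "git_diff": "diffs/session-042/turn-07.patch",
    "notes": "Introduced an unnecessary abstraction; redirected toward the existing helper.",
}
```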
I’m trying to figure out if this is a real product or if I’m building something the market doesn’t need. Honest feedback appreciated.
submitted by /u/emfuhsiss
German Job Market Dataset – 150K Jobs
Fresh scrape from Indeed.de (April 2026). Perfect for ML, research, or HR analytics.
📊 What you get:
– 38 fields: title, company, description, location, salary flags, apply counts, ratings
– CSV format (~455MB)
– 100% valid data, no duplicates
📥 Free sample (5k jobs): IN COMMENTS
💰 Price: 200 USD
🎯 Use for:
– Job market research
– ML training data
– Salary benchmarking
– Competitive intelligence
TG – gdataxxx
submitted by /u/dracariz
A SoundCloud uploader has been surfacing deleted and unreleased songs from various artists, claiming they originated from a “public database.”
The original filenames were retrieved by querying the SoundCloud GraphQL API, which reveals the metadata and original names of files exactly as they were first uploaded. These filenames point to a massive, static scrape of the Tencent Music (TME) ecosystem. While these files were likely on those servers at the time of the scrape, they no longer appear to be live on the platforms.
Identified File Fingerprints:
• M500000NZFuy3x21FU.mp3 (QQ Music)
• M500002Ci5OM2KR9ox.mp3 (QQ Music)
• M500002TYpVo39CS7k.mp3 (QQ Music)
• 3641760591.mp3 (Kuwo/NetEase)
• a4bb901691254386980571228fa86eb3.flac (Kugou)
The database includes high-quality FLAC files and tracks previously thought lost. It seems to be a historical server dump or a large-scale archival project.
Does anyone recognize these naming conventions or know of a historical TME server dump or static archive from these services?
submitted by /u/Connect_Software_702
Hey devs,
I’m building a developer API on top of SEC filings and just shipped a feature I want honest feedback on.
The problem
Financial data APIs give you numbers: revenue, margins, cash flow, ratios. Numbers don’t tell you how the business works, what the moats are, what management can actually pull, or where the whole thing breaks if it breaks.
That reasoning lives in three places today:
- Sell-side reports (paywalled, slow, one company at a time)
- An analyst’s head after reading the 10-K (doesn’t scale)
- Bloomberg and FactSet narrative fields (institutional pricing, not LLM-queryable)
If you’re building an investing tool or AI research assistant, you know the gap. LLMs are great at reasoning and terrible at reading 300-page filings without inventing numbers that were never in the document.
What I shipped
Pass in a ticker. Get back a structured economic model as JSON, classified from SEC filings and earnings materials. Seven components:
- Business model (revenue model, cost structure, unit economics, cash conversion, capital intensity)
- Competitive advantages (each moat classified by type, mechanism, persistence)
- Operating levers (what management can pull, mapped to KPIs)
- Flywheels (self-reinforcing loops, each step explicit)
- Strategic initiatives (stage, impact level, time horizon)
- Failure modes (structural risks, not generic market risks, with watch metrics)
- Offerings (every product line with revenue role, monetization, margin profile)
Every field is returned as clean JSON. Screenable, LLM-consumable, consistent across every US public company.
The part I actually want to talk about: the citation trail
Every field carries a sources array. Every source has the URL of the actual SEC filing, the section it came from, and the verbatim quote that justifies the claim. Every quote is machine-verified against the filing text at generation time.
If a number or claim can’t be traced to a filing, it doesn’t exist in the API.
Here’s one flywheel from NVIDIA’s model, not trimmed, this is the raw JSON:
```json
{
  "name": "Developer ecosystem → platform value → adoption loop",
  "loop": [
    "More developers using CUDA and software tools",
    "More applications optimized for NVIDIA platforms",
    "Higher platform value and broader adoption across end markets",
    "More developers using CUDA and software tools"
  ],
  "impact": "growth",
  "sources": [
    {
      "url": "https://www.sec.gov/Archives/edgar/data/1045810/000104581026000021/nvda-20260125.htm",
      "source": "10-K",
      "section": "Item 1, Business",
      "quote": "There are over 7.5 million developers worldwide using CUDA and our other software tools..."
    }
  ]
}
```
That url is live. A human auditor or your AI agent can open it and verify the quote exists at that exact section of the filing. Same shape on every moat, every failure mode, every operating lever.
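As a rough illustration (not part of the API itself), an agent could re-check a citation by fetching the filing URL from the sources entry and confirming the quoted text appears in it. A naive sketch; the quote trimming and HTML stripping here are deliberately crude:

```python
import re
import requests

def quote_appears_in_filing(url: str, quote: str) -> bool:
    """Fetch an SEC filing and check that a cited quote appears in its text.

    SEC EDGAR expects a descriptive User-Agent. Trailing ellipses are trimmed
    before matching. This is a naive substring check, not the provider's own
    verification logic.
    """
    headers = {"User-Agent": "citation-checker example@example.com"}
    html = requests.get(url, headers=headers, timeout=30).text
    text = re.sub(r"<[^>]+>", " ", html)          # strip tags
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    needle = re.sub(r"\s+", " ", quote.rstrip("…").rstrip(". "))
    return needle.lower() in text.lower()

source = {
    "url": "https://www.sec.gov/Archives/edgar/data/1045810/000104581026000021/nvda-20260125.htm",
    "quote": "There are over 7.5 million developers worldwide using CUDA and our other software tools...",
}
print(quote_appears_in_filing(source["url"], source["quote"]))
```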
Why I think the citation trail is the real feature, not the model
A flywheel on its own is an opinion. A flywheel with the 10-K quote next to every component is a defensible claim.
- AI agents stop hallucinating. Every answer grounds in a verbatim filing quote, not “I think Nvidia has a network effect.”
- Investors can defend a memo in a committee, every line linked to its 10-K.
- Compliance teams can verify whether a company’s narrative matches what the filing actually says.
I’ve never seen a provider ship this with per-field citations. That’s the bet.
How it compares
- Bloomberg and FactSet have qualitative fields, priced for institutions, not returned as LLM-consumable JSON, and no per-claim citation you can click.
- SimplyWall and retail tools show dashboards, not queryable structure.
- Polygon, FMP, EODHD, Intrinio ship numbers, zero structural interpretation.
- LLM-only approaches hallucinate without source grounding.
The wedge: every US public company, structured the same way, every field citeable, priced so a developer can actually afford it.
What I want feedback on
- If you’re building an investing tool, research agent, or screener, what’s the first concrete use case that comes to mind?
- Is the 7-component structure the right shape, or is some of it noise? (Flywheels is the one I’m least sure about, be honest.)
- Would the citation trail change your workflow, or is “trust me, it’s AI-generated” fine for what you’re building?
- What would you add or remove before this is a must-have in your stack?
Roast it if it’s a bad idea, that’s literally why I’m posting.
submitted by /u/Either_Door_5500