Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Working On Real-time Data From Brands, And Social Media

Working on browser-based agents that can fetch real-time digital content like posts, images, brand details and videos from social media and company websites using natural language queries, and turn it into structured data you can directly use.

The goal is to plug this into digital marketing workflows for things like trend tracking, content inspiration, competitor monitoring, and campaign research without manual browsing or scraping. Is this something people would be interested in?

submitted by /u/agentbrowser091

Natural Disasters Normalized For Cross Domain Comparisons

I’ve been building a program for the past couple months and it’s in good shape to share now.

The meat of it is earthquakes, volcanoes, tsunamis, hurricanes, tornadoes, currencies, the CIA Factbook, and the UN SDGs (plenty more coming). I’ve got all these datasets normalized to a loc-id system, so you can query across them really easily, and I’ve opened up the API lanes and made MCP tools. Some are paid datasets, and I’m using x402 for a few. Plenty are free though, so check it out!

www.daedalmap.com/agents

There’s a human-side app as well, which you can explore to see what it’s like. I’ve been building a research mode that lets users take a bounded set of data and ask questions of it.

submitted by /u/Xyver

Searching For A Tool To Generate A Dataset

Hi everyone,

I’m working on an anomaly detection project using logs from an all-in-one OpenStack deployment (Ansible-based). The logs come from multiple sources and are collected via Fluentd and sent to OpenSearch.

My main problem is that I don’t have a dataset, and I don’t have enough time to build one manually.

I’m considering running OpenStack for a full day to generate a large amount of logs, then using a tool to generate more data so that I end up with a large, good-quality dataset for anomaly detection.

Are there any tools or approaches that can help generate a good dataset from my own logs in this kind of setup? (The logs are in JSON Lines format!)

Thanks in advance!

submitted by /u/Substantial_Elk_2999

[Disclaimer – My Personal Project] Built This Advanced But Extremely Beginner Friendly Data Visualisation Tool. Please Share Your Thoughts

Hey everyone

I’m thrilled to share Polyform — the modern way to analyse and visualise data without the usual headaches.

Tired of juggling spreadsheets for editing and separate tools for charting? Polyform lets you edit data just like a familiar spreadsheet, while instantly visualising it across 24+ beautiful chart types at the same time — bar, line, pie, scatter, radar, heatmap, candlestick, waterfall, gauge, 3D surface, and many more.

Key highlights:

Change any value and watch your charts animate instantly — no refresh, no lag.

Connect multiple data sheets (e.g., sales + regions) and create combined visuals in one chart.

Sign in and start working immediately. Everything lives in the cloud.

Generate a shareable link — teammates can view or edit without signing up.

Export charts as PNG/JPG/PDF, data as CSV/Excel, or full dashboards.

Add rows/columns on the fly, custom color palettes, link locking for safety, and financial/KPI charts built-in.

Whether you’re a solo analyst spotting trends or a growing team needing fast insights, Polyform scales with you. From raw data to shareable, insightful dashboards in under a minute.

No plugins. No complex setup. Just powerful, real-time data storytelling.

Try it here: https://polyform-graphs.lovable.app

Would love your feedback: what’s the one chart type or workflow you wish existed in your current tools? What’s in here that can be improved?

submitted by /u/FOR_REAL_NOT_REAL

Where Do You Look For Reliable Datasets That Aren’t Behind Paywalls?

finding datasets isn’t that hard, but finding ones that are actually reliable, well-documented, and usable (without a paywall) is a different story.

obviously there are government portals, the World Bank etc., but even they’re pretty hit or miss depending on data structure and maintenance

where do you consistently go when you need solid datasets? not just a big list of datasets, but sources you actually trust for things like documentation, clear definitions / methodology, and reasonably up-to-date data: something you’d feel comfortable citing or building on?

Please drop links if you can, always looking to build a better mental list of go-to sources.

submitted by /u/Rude_Context_4844

[PAID] Built A Real-time Salary Dataset From Fortune 500 Workday Job Postings — 100% US Salary Coverage Because Of Pay Transparency Laws. Free Sample Available. [Disclosure: Our Product]

my co-founder and i have been building this for a few months and wanted to share here.

150K-300K active job postings refreshed weekly, 100% US salary coverage, 22 structured fields including salary_min, salary_max, job_category, remote_type, worker_type, requirements, and posted_date. companies include NVIDIA, Goldman Sachs, Walmart, Target, Disney, Pfizer, Boeing, Deloitte and 1,200+ others.

CSV or JSON, ready for R, Stata, or Python out of the box.

been getting interest from labor economists studying pay transparency laws and from HR analytics teams, so figured researchers here might find it useful too.

this dataset isn’t on our site yet — submit a custom data request at datapulse.skop.dev/custom-request and we’ll get back to you with a free sample within a few hours.

what fields are we missing?

submitted by /u/Sufficient-War-4020

Seeking IMDb Gendered Ratings (Raw Scores) Post-2018 For A Data Viz Project

I’m building a site that visualizes gender differences and similarities in movie ratings (screenshots: https://imgur.com/a/yEM5wUd). Currently I’m using a 2018 IMDb list of the top 200 movies rated by women, but it’s outdated and likely misses many highly men-favored films that didn’t make that specific list.

While IMDb displayed gendered ratings until early 2023, their official TSV datasets only provide the aggregate averageRating. I need the specific Male vs. Female raw ratings, not just a gendered rank.

Does anyone know of a dataset, archive, or scraper output from 2019–2023 that captured the demographics breakdown before the UI changes? I’ve checked the standard IMDb non-commercial sets, but the granularity isn’t there.

Thanks!

submitted by /u/HandToDirt

Nobody Asked For It, But I Still Built It.

As you can tell from all the titles and the tags, this is an NSFW manga dataset, with 500k+ entries covering manga ID, title, release date, and all the other metadata.

I haven’t updated it since March. No need to worry, though; I promise to update it more frequently. Also, the favorites count may have changed between when a title was posted and when it was scraped.

Feel free to use it in your personal data science projects. And tag me if you make something hilarious.

submitted by /u/banana_737

[Self-Promotion][Custom Dataset Infrastructure] Where Public Datasets Keep Falling Short For Production AI Systems

Over the past few months, we’ve been helping teams source highly specific datasets that public benchmarks consistently miss.

Some examples:

– Off-script voice agent conversations (interruptions, objections, mixed intent)

– Real human SaaS workflow screen recordings

– Industrial OCR edge cases (reflective packaging, degraded print)

– Computer vision long-tail failures (low-light, oblique angles, occlusion)

– Agent workflow regression scenarios (schema drift, retries, stale state)

Biggest takeaway:

For most production AI systems, the bottleneck usually isn’t the model.

It’s dataset coverage around messy real-world deployment conditions.

Public datasets are usually enough for demos.

Custom datasets are what close the gap to production reliability.

The more specialized the deployment environment becomes, the more valuable targeted data infrastructure becomes.

If you’re actively running into dataset gaps that public benchmarks aren’t solving, feel free to DM me with what you need. Always happy to compare notes or help scope solutions.

submitted by /u/Khade_G

Topological Data Analysis-friendly CAD/3D Point Cloud Dataset Request

Hi everyone,

I’m looking for a suitable 3D point cloud dataset — or a CAD/mesh dataset from which I can sample point clouds — for a small research/report project.

The goal is to compare Topological Data Analysis (TDA) as a preprocessing / feature extraction method against more standard 3D point cloud preprocessing methods, under different perturbations such as:

  • Gaussian jitter / noise
  • random point deletion / subsampling
  • small deformations
  • scaling / rotations
  • outliers or other synthetic corruptions

The comparison would be based on the classification accuracy of a downstream model after preprocessing.
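
For anyone wanting to reproduce this kind of robustness test, the perturbations listed above could be sketched roughly as follows. This is only a minimal standard-library illustration, not a reference implementation: the function names (`jitter`, `delete_points`, `scale`, `rotate_z`, `add_outliers`) and the parameter values are hypothetical, point clouds are assumed to be plain lists of `(x, y, z)` tuples, and "small deformations" are left out because the post does not say which deformation model is meant.

```python
import math
import random


def jitter(points, sigma, rng):
    """Add independent Gaussian noise (std. dev. sigma) to every coordinate."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]


def delete_points(points, keep_fraction, rng):
    """Keep a random subset of the cloud; at least one point always survives."""
    k = max(1, round(len(points) * keep_fraction))
    return rng.sample(points, k)


def scale(points, factor):
    """Scale the cloud uniformly about the origin."""
    return [tuple(c * factor for c in p) for p in points]


def rotate_z(points, angle):
    """Rotate the cloud by `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def add_outliers(points, count, half_width, rng):
    """Append `count` points drawn uniformly from the cube [-half_width, half_width]^3."""
    extra = [tuple(rng.uniform(-half_width, half_width) for _ in range(3))
             for _ in range(count)]
    return list(points) + extra


rng = random.Random(0)  # fixed seed so the corrupted copies are reproducible
cloud = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 0.0)]

# Chain several corruptions: drop half the points, jitter them, then add 3 outliers.
corrupted = add_outliers(jitter(delete_points(cloud, 0.5, rng), 0.01, rng), 3, 2.0, rng)
print(len(corrupted))
```

Passing an explicit `random.Random` instance keeps each corruption level reproducible, which matters when the same perturbed clouds have to be fed to both the TDA pipeline and the baseline preprocessing.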

I do not necessarily need many classes. Even a binary classification dataset would be enough. What matters most is that the classes should differ in their topological structure, ideally in the number of holes / loops / cavities, so that TDA has a meaningful signal to detect.

For example, something like:

  • sphere / ball-like objects vs torus / ring-like objects
  • solid object vs object with a tunnel
  • objects with different numbers of handles or holes

Ideally, each class should contain many samples (600+), or the dataset should contain enough CAD/mesh models so that I can sample many point clouds from them.
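
If no ready-made dataset turns up, the sphere-versus-torus pair mentioned above can also be generated synthetically. The sketch below is one possible way to do that with the standard library only; `sample_sphere`, `sample_torus`, and the radii are hypothetical choices for illustration. A sphere surface has no one-dimensional loops while a torus surface has two independent ones, so the two classes differ in exactly the kind of signal described.

```python
import math
import random


def sample_sphere(n, radius, rng):
    """Sample n points uniformly on a sphere surface by normalizing Gaussian vectors."""
    points = []
    while len(points) < n:
        x, y, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 1e-12:  # skip the (practically impossible) zero vector
            points.append((radius * x / norm, radius * y / norm, radius * z / norm))
    return points


def sample_torus(n, major_radius, minor_radius, rng):
    """Sample n points uniformly (by surface area) on a torus around the z-axis."""
    points = []
    while len(points) < n:
        theta = rng.uniform(0.0, 2.0 * math.pi)  # angle around the tube
        ring = major_radius + minor_radius * math.cos(theta)
        # The area element is proportional to `ring`, so accept theta with
        # probability ring / (major_radius + minor_radius) (rejection sampling).
        if rng.random() > ring / (major_radius + minor_radius):
            continue
        phi = rng.uniform(0.0, 2.0 * math.pi)  # angle around the central axis
        points.append((ring * math.cos(phi), ring * math.sin(phi),
                       minor_radius * math.sin(theta)))
    return points


rng = random.Random(42)
# 600 labeled clouds per class, 256 points each (label 0 = sphere, 1 = torus).
dataset = [(sample_sphere(256, 1.0, rng), 0) for _ in range(600)]
dataset += [(sample_torus(256, 1.0, 0.35, rng), 1) for _ in range(600)]
print(len(dataset))
```

Varying the radii per sample, or combining this with random rotations and scaling, would make the classes less trivially separable by simple geometric features.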

Does anyone know of a dataset that fits this description? I would also appreciate suggestions for CAD repositories, synthetic dataset generators, or benchmark datasets where such class pairs could be extracted.

Thanks!

submitted by /u/generalbrain_damage

Where Do You Find Real-world Datasets With Actual Business Problems To Solve?

I’ve worked with common datasets from Kaggle and UCI, but I’m looking for more realistic data sources tied to actual business or operational problems.

I’m especially interested in datasets where analysis could answer questions like:

  • Why sales dropped in a region
  • Customer churn patterns
  • Inventory or supply chain inefficiencies
  • Pricing opportunities
  • Marketing campaign performance

I’ve already explored Kaggle, UCI, and some open government portals.

For those who build portfolio projects or practice real analytics work:

  1. Where do you usually find more realistic datasets?
  2. How do you turn raw public data into a meaningful business problem statement?
  3. Any underrated sources (APIs, city data, company reports, scraped public data, etc.)?

Would appreciate hearing your process.

submitted by /u/silent-romeo57

The Dr. Duke Database Of Phytochemicals Contains 40 Years Of Data On Plant Compounds And Is Virtually Unusable For Machine Learning – I Rebuilt It

The USDA Dr. Duke Database of Phytochemicals and Ethnobotany is one of the most comprehensive collections of relationships between plant compounds in existence. Over 76,000 records. Decades of work. It includes notes on bioactivity, concentration ranges, and ethnobotanical uses for thousands of plant species.

The user interface hasn’t changed in about twenty years. There is no bulk export. The compounds have no standardized identifiers. SMILES strings do not exist. If your workflow requires PubChem CIDs, you have to start from scratch.

Every team working in the field of machine learning for natural products ultimately has to preprocess the same raw data independently. I know this because I’ve spoken with people who’ve done it, and the same problems came up every time.

So I rebuilt it.

The current version: 76,907 records. 9,098 unique compounds with PubChem CID mappings. SMILES via CID lookup. USPTO patent numbers starting in 2020. Intervention data from ClinicalTrials.gov. Classification of compounds into discrete phytochemicals, complex mixtures, substance classes, and generic ambiguities.

The most time-consuming part was not the data enrichment. It was the question of how to handle records where the compound name is ambiguous. RESIN has no CID. ALKALOID FRACTION has no CID. Assigning one would be incorrect. Leaving them without documentation explaining why they are zero leaves the next researcher in the dark. That is why I added a “compound_type” column that classifies each record and documents the classification logic.

The dataset underwent an external CID review this month. A chemistry consultant manually reviewed 13,206 compound assignments and compared them with PubChem, COCONUT, and InChI keys. One confirmed error was found and corrected. 1,534 previously zero-CIDs were resolved by matching them with IUPAC names. The number of zero-CIDs has decreased by 8%.

The dataset is provided as Parquet and JSON. Queryable in less than five minutes using DuckDB.

Available on HuggingFace (wirthal1990-tech/USDA-Phytochemical-Database-JSON). The GitHub repository (wirthal1990-tech/USDA-Phytochemical-Database-JSON) contains the complete MANIFEST and the methodology documentation.

submitted by /u/DoubleReception2962

Small Test Data Lies: Why Queries Look Fine In Dev And Break In Prod

Seen this catch teams right before a production push more than once.

Dev has clean data, a few rows, and almost no noise. The plan looks fine, the query runs fast, and nobody questions it. Then the same query hits production and suddenly it’s the main character in an incident review.

The missing index is the classic one. Scanning 150 rows costs nothing. No slowdown, no scary estimate, no reason to care. Then it scans five million rows and everyone starts looking at the execution plan like it betrayed them personally.

Plans can also change between dev and prod. The optimizer works with stats and row counts. Change the volume and you can get a different join strategy, different index usage, different order of operations. A plan that looks fine in dev can turn ugly once real row counts show up.

Test data is usually too clean too. No weird NULLs, no duplicate values, no old records, no boundary cases. Dev passes because the dataset was never rude enough to fail.

Same with joins. Small reference tables make everything look harmless. Real selectivity only shows up when both sides of the join have enough data to hurt.

The boring fix is better staging data. Some teams generate it manually. Some use tools like dbForge Data Generator to get production-like volume before testing queries there.
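
A rough way to see the effect without any commercial tool is to generate a larger, messier table yourself and compare plans. The sketch below uses Python's built-in `sqlite3` purely as a stand-in engine (the post does not name a database), and the `orders` table, its columns, the skew ratios, and the index name are all made up for illustration.

```python
import random
import sqlite3


def build_orders(conn, n_rows, rng):
    """Fill a hypothetical `orders` table with skewed keys, duplicates and NULLs."""
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
        "status TEXT, note TEXT)"
    )
    rows = []
    for _ in range(n_rows):
        # Skew: about half of all rows belong to just ten "heavy" customers.
        if rng.random() < 0.5:
            customer_id = rng.randint(1, 10)
        else:
            customer_id = rng.randint(11, 50_000)
        status = rng.choice(["new", "paid", "shipped", "refunded"])
        note = None if rng.random() < 0.2 else "n/a"  # roughly 20% NULLs
        rows.append((customer_id, status, note))
    conn.executemany(
        "INSERT INTO orders (customer_id, status, note) VALUES (?, ?, ?)", rows
    )


def plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    cursor = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[-1] for row in cursor)


QUERY = "SELECT * FROM orders WHERE customer_id = ?"

conn = sqlite3.connect(":memory:")
build_orders(conn, 50_000, random.Random(0))

before = plan(conn, QUERY, (7,))  # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(conn, QUERY, (7,))   # the planner can now search via the index

print(before)
print(after)
```

The exact plan wording differs between SQLite versions, and other engines rely on their own statistics, so treat this as a way to practice reading plans against realistic volume rather than a prediction of what production will do.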

The annoying part is the query wasn’t really broken in dev. The data just wasn’t big enough to tell the truth.

What usually breaks first for you in prod: missing indexes, bad estimates, or the plan flipping once real data hits?

submitted by /u/MissionFormal61

I Cleaned And Translated Albanian Government Data — Health Centers, Medicines, Treasury Spending (free Download)

Was working on a project and needed Albanian government data in English. Spent a few weeks cleaning and translating it. Sharing it here in case anyone finds it useful. Data includes:

– 399 health centers with contact details

– 2,289 approved medicines

– 1,654 treasury transactions

– 2,700+ schools

– Business registration stats 2023-2026

Available at albaniandata.com, free tier included. Happy to answer questions about the data or methodology.

submitted by /u/Massive-Two-8399

7,000 News Articles Metadata: 22 NLP Metrics For Narrative Alpha & Bias Analysis

Hi everyone,

I’m sharing a metadata-only dataset of 7,000 news articles (extracted from a larger 700k core) designed specifically for NLP feature engineering and Media Intelligence. Instead of just standard sentiment (Positive/Negative), I’ve focused on “Narrative Alpha”, structural signals that quantify how a story is being told.

Why this is useful: If you’re building news classifiers, bias detectors, or financial sentiment models, standard text often isn’t enough. This set provides deterministic linguistic metrics you can’t get from a standard scrape.

What’s Inside (22 Columns):

  • Structural Metrics: Passive Voice Ratio, Sentence/Word Counts.
  • Narrative Signals: Hedging Rate (uncertainty cues), Claim Density per 1k words.
  • Credibility & Alignment: Headline-Body Alignment Score, Primary Source Ratio (attribution).
  • Traditional Labels: Topic, Political Orientation, Bias Strength, Credibility Level.

Technical Specs:

  • Format: Tabular CSV (Clean, no text blobs to protect legal/copyright).
  • Usability: 10.0/10.0 on Kaggle (fully documented columns).
  • License: CC BY 4.0 (Open for research/commercial use).

Link: Kaggle

AMA about the methodology or the pipeline!

submitted by /u/Queasy_System9168

[PAID] We Built Ready-made E-commerce Datasets (Amazon, Temu, Zillow, LinkedIn) — 90% Cheaper Than Bright Data. Free Sample Available. Roast Us. [Disclosure: This Is Our Product]

Been building this for a few months with my co-founder. Wanted to share here and get honest feedback.

DataPulse delivers ready-made datasets from Amazon, Temu, Zillow, LinkedIn, Airbnb and 10 more sources: automated pipeline, no sales calls, public pricing.

The Temu one is interesting — we’re the only ready-made Temu product catalog on the market right now. Bright Data confirmed on their own page they only do it on a custom basis.

Pricing is $399-$899/mo per dataset vs Bright Data’s $50K-$100K/yr. Same data, fraction of the cost.

Also do custom requests — if you need a source that’s not in our catalog, any site, any fields, we’ll quote within 24 hours.

Free sample pull if anyone wants to test quality, no card needed, just fill out the form.

datapulse.skop.dev

Genuinely open to feedback. What are we missing?

submitted by /u/Sufficient-War-4020

[Self-promotion] [PAID] I Built This TEMU DATASET

Two datasets that are hard to find ready-made:

Temu: 50M+ products (the only off-the-shelf one on the market)

`product_id, name, category, price_usd, discount_pct, rating, review_count, in_stock` + 8 more fields

Amazon: 200M+ products

`asin, title, brand, category, price, bsr_rank, rating, review_count` + 9 more fields

Weekly refresh. CSV, JSON, Parquet.

Drop a comment if you want a free sample.

submitted by /u/Flat_Telephone_4636

Free Signed Quality Cert For Any HuggingFace Dataset — 19 Dimensions, Contamination Check Against 40+ Public Evals, Open Methodology [self-promotion]

We’ve been building a public quality standard for AI training data — same idea as Moody’s for bonds — and the free audit tool is now open to anyone. No account needed.

What you get if you paste a HuggingFace dataset URL at https://labelsets.ai/rate:

• A 19-dimension quality score (structural, annotation, training-fit, compliance)

• 7-oracle consensus across 5 algorithm families with Cohen + Fleiss κ agreement reporting

• 95% Wilson confidence intervals on rate-based dimensions

• 90% conformal prediction interval on downstream model F1 (Vovk 2005 / Romano 2019)

• Contamination flags against 40+ public evals: MMLU, HumanEval, GSM8K, MedQA, LegalBench, SQuAD, ARC, TruthfulQA, etc.

• An Ed25519-signed cert verifiable offline against our public key (fingerprint aa4c070af907e2ea)
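
For readers unfamiliar with the Wilson interval named in the list above, here is a minimal standard-library sketch of the textbook formula. It is not taken from the tool's code; the function name `wilson_interval` and the example counts are hypothetical, and how the tool defines each "rate-based dimension" is not specified in the post.

```python
import math


def wilson_interval(successes, trials, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Example: 12 mislabeled rows found in a hand-checked sample of 400.
low, high = wilson_interval(12, 400)
print(f"label error rate: {12 / 400:.3f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and remains usable when the observed rate is at or near zero, which is common for error rates measured on small audit samples.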

The methodology paper is published openly under CC BY 4.0 (19 pages, peer-review ready) at labelsets.ai/paper. Fork it, reimplement it, write a paper that disagrees with us.

The free /rate audit produces a JSON cert. The hosted PDF + permalink + embeddable badge are paid ($49 procurement / $149 pro), but the underlying score is the same.

Built deliberately so verification works at FedRAMP-restricted shops — public API at GET /api/verify-lqs-cert/:hash, no auth required, or run crypto.verify() against the Ed25519 public key locally.

Curious what people here think of the dimension list. Happy to defend any of the 19 or kill the ones that don’t carry weight.

submitted by /u/plomii

European Union Countries: A Curated Dataset On EU Members For Education And Data Science

https://zenodo.org/records/19659891
Initial release of the curated European Union member states Indicators dataset (2026).

  • 27 member states covered (current EU composition).
  • 12 variables: id, country_name, iso_alpha3, capital, eu_accession_year, schengen_accession_year, is_schengen_member, latitude, longitude, landlocked, area_km2, population.
  • Standardized Metadata: Includes ISO 3166-1 alpha-3 codes and geospatial centroids.
  • Format: Available in CSV format, optimized for read.csv() in R and pandas.read_csv() in Python.
  • Validation: Data integrity checked for missing values (specifically handling non-Schengen members).
  • Metadata: Includes .zenodo.json for automatic archiving and CITATION.cff for GitHub integration.
  • License: Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)
  • https://github.com/lightbluetitan/european_union_indicators
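
As a rough illustration of the missing-value handling mentioned above, the sketch below parses a CSV with the listed column names and checks that `schengen_accession_year` is empty exactly when `is_schengen_member` is false. Everything beyond the column names is an assumption: the two rows are hand-typed stand-ins rather than rows copied from the dataset, only a subset of the 12 columns is shown, and the real file's encoding of missing values and booleans (`NA`, `TRUE`/`FALSE`) may differ.

```python
import csv
import io

# Illustrative stand-in rows, NOT copied from the published file.
SAMPLE = """country_name,iso_alpha3,capital,eu_accession_year,schengen_accession_year,is_schengen_member
Estonia,EST,Tallinn,2004,2007,TRUE
Ireland,IRL,Dublin,1973,NA,FALSE
"""

MISSING = {"", "NA", "NaN", "null"}   # assumed spellings of a missing value
TRUTHY = {"true", "1", "yes"}         # assumed spellings of a true flag


def load_members(text):
    """Parse the CSV text into dicts with typed years and a boolean flag."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        year = raw["schengen_accession_year"].strip()
        rows.append({
            "country_name": raw["country_name"],
            "iso_alpha3": raw["iso_alpha3"],
            "eu_accession_year": int(raw["eu_accession_year"]),
            "schengen_accession_year": None if year in MISSING else int(year),
            "is_schengen_member": raw["is_schengen_member"].strip().lower() in TRUTHY,
        })
    return rows


def inconsistent(rows):
    """ISO codes where the membership flag and the accession year disagree."""
    return [r["iso_alpha3"] for r in rows
            if r["is_schengen_member"] != (r["schengen_accession_year"] is not None)]


members = load_members(SAMPLE)
print(inconsistent(members))
```

To run this against the real file, `load_members` could be given the contents of the downloaded CSV instead of `SAMPLE`, after adjusting `MISSING` and `TRUTHY` to whatever the file actually uses.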

submitted by /u/renzocrossi