Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

How To Determine A Value For A Question In A Survey

Hello,

I want to get some opinions and recommendations on statistical methods that could be used for my analysis.

The plan is to collect data through a survey and a database search. The results will be used as input and output for Data Envelopment Analysis (DEA). The target of the survey is a decision-making unit (DMU).

There are eight input items and two output items. The score for the input items will be based on the survey responses received. For output items, the score will be calculated using data from public databases.

Each item comprises questions with different types of answers. These include yes/no questions, questions where you select one of statements 1–5, and numerical questions. The number of questions for each item varies depending on its specific characteristics.

This is how I grade each answer and calculate the total score for each item.

Scoring answers:

Type A question: yes/no, YES is given score 3, NO is given score 1

Type B question: A score from 1 to 5 is given based on the score of the selected answer

Type C question: numerical. The number (n) is scored against the median (Q2) of all collected answers: if n < Q2, the score is 1; if n = Q2, the score is 2; and if n > Q2, the score is 3.

I then sum up the grades from all the questions in each item. The final score for an item is (total grade / max grade) × 5, so the highest possible score for an item is 5.
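For concreteness, here is a minimal Python sketch of this scoring scheme (the function names and data layout are just illustrative, not part of the survey design):

```python
import statistics

def score_type_a(answer: str) -> int:
    # Yes/no question: YES -> 3, NO -> 1
    return 3 if answer.strip().lower() == "yes" else 1

def score_type_b(selected: int) -> int:
    # Statement 1-5: the selected statement's number is the score
    return selected

def score_type_c(n: float, all_answers: list[float]) -> int:
    # Numerical question: compare n to the median (Q2) of all collected answers
    q2 = statistics.median(all_answers)
    if n < q2:
        return 1
    if n == q2:
        return 2
    return 3

def item_score(grades: list[int], max_grades: list[int]) -> float:
    # Final item score = total grade / max grade * 5
    return sum(grades) / sum(max_grades) * 5

# Example: one item with a type A, a type B, and a type C question
grades = [score_type_a("yes"), score_type_b(4), score_type_c(12, [8, 12, 20])]
print(item_score(grades, max_grades=[3, 5, 3]))  # scaled to a 0-5 range
```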

A radar chart for a DMU will be developed showing the scores of the 8 input items.
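And a short matplotlib sketch of that radar chart, assuming eight item scores on a 0-5 scale (the labels and values below are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder scores for the 8 input items of one DMU (0-5 scale)
labels = [f"Item {i}" for i in range(1, 9)]
scores = [4.1, 3.5, 2.8, 5.0, 3.2, 4.4, 1.9, 3.7]

# Angles for each axis, closing the polygon by repeating the first point
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 5)
plt.show()
```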

For the output items:

The data is derived from a public database. I classify the data from each DMU into one of four groups based on quality.

Group HHQ HQ LQ LLQ
DMU1 XX XX XX XX
DMU2 XX XX XX XX
DMU3 XX XX XX XX
Mean/median XX XX XX XX

For the scoring:

  1. Derive the frequency numbers from the database
  2. Calculate the median for each group
  3. Set the grade from 1 to 3 (same as for type C questions)
Group HHQ HQ LQ LLQ
DMU1 1 3 3 2
DMU2 3 2 2 3
DMU3 3 1 2 2

  4. Because I want to give different weights to each group so that data from the higher-quality groups contributes more to the total score, a multiplication factor depending on the group is applied to each grade, as follows:

Output1

Group   HHQ     HQ      LQ      LLQ     Output1 value
DMU1    1 * 5   3 * 3   3 * 2   2 * 1   = Sum / Max sum * 5
DMU2    3 * 5   2 * 3   2 * 2   3 * 1   = Sum / Max sum * 5
DMU3    3 * 5   1 * 3   2 * 2   2 * 1   = Sum / Max sum * 5
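A minimal sketch of that weighted output calculation, assuming group weights of 5/3/2/1 for HHQ/HQ/LQ/LLQ as in the table above:

```python
# Group weights as implied by the table (LLQ assumed to have weight 1)
WEIGHTS = {"HHQ": 5, "HQ": 3, "LQ": 2, "LLQ": 1}
MAX_GRADE = 3  # grades run from 1 to 3, as for type C questions

def output_value(grades: dict[str, int]) -> float:
    # Weighted sum of grades, scaled so the best possible DMU scores 5
    total = sum(WEIGHTS[g] * grade for g, grade in grades.items())
    max_total = sum(WEIGHTS[g] * MAX_GRADE for g in grades)
    return total / max_total * 5

print(output_value({"HHQ": 1, "HQ": 3, "LQ": 3, "LLQ": 2}))  # DMU1 from the table
```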

This is how I set the input and output values for each DMU.

Question:

  1. Is this kind of scoring acceptable, even when there are different types of questions for each input item?
  2. Is there a scientific method that can be applied here? For example, how should the score for each answer be set? I have found papers that use scoring in their surveys, but their questions are usually of the same type, producing the same type of answer (e.g. a Likert scale).

Any comments or advice would be appreciated, and if anyone can recommend any references, that would be awesome.

Thank you.
marlee

submitted by /u/Fast-Rise17
[link] [comments]

Looking For Wheat Yellow Rust Image Datasets For ML Project (with Metadata)

We’re undergraduate Machine Learning students working on a crop disease image generation project using CGANs, aimed at supporting global sustainability. 🌱

Right now, we’re looking for wheat images of yellow rust disease along with metadata like region, severity, and time range for model training and evaluation.

If you know of any public datasets, research projects, or institutional resources, or even just pointers on where to look, we’d really appreciate your guidance.

Thanks so much for your help! Any leads will be credited in our project.

submitted by /u/Plane_Race_840
[link] [comments]

Does A Corpus Of Archaic English Words Exist?

I have a large database/wordlist containing probably every English dictionary word plus many additional ones like brand names, but this naturally includes many words no longer in use. I need to cut down the size of the list, but since too many words have been added to start from scratch, my plan is to obtain a corpus of only archaic words and use these as negatives to remove from the main wordlist. Does such a corpus/wordlist exist anywhere in text form, even if it’s just one word per line? Thank you in advance, any help is much appreciated.
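If such a list turns up, the subtraction step itself is straightforward; a minimal sketch assuming two plain-text files with one word per line (filenames are placeholders):

```python
# Load the archaic-word list and remove those entries from the main wordlist
with open("archaic_words.txt", encoding="utf-8") as f:
    archaic = {line.strip().lower() for line in f if line.strip()}

with open("main_wordlist.txt", encoding="utf-8") as f:
    kept = [w for w in (line.strip() for line in f) if w and w.lower() not in archaic]

with open("trimmed_wordlist.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept))
```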

submitted by /u/SheffieldParadox
[link] [comments]

Looking For A Long-term Collaborator – Data Engineer / Backend Engineer (Automotive Data)

We are building an automotive vehicle check platform focused on the European market and we are looking for a long-term technical collaborator, not a one-off freelancer.

Our goal is to collect, structure, and expose automotive-related data that can be included in vehicle history / verification reports.

We are particularly interested in sourcing and integrating:

  • Vehicle recalls / technical campaigns / service recalls, using public sources such as RAPEX (EU Safety Gate)

  • Commercial use status (e.g. taxi, ride-hailing, fleet usage), where this can be inferred from public or correlatable data

  • Safety ratings, especially Euro NCAP (free source)

  • Any other publicly available or correlatable automotive data that adds real value to a vehicle check report

What we are looking for:

  • Experience with data extraction, web scraping, or data engineering

  • Ability to deliver structured data (JSON / database) and ideally expose it via API

  • Focus on data quality, reliability, and long-term maintainability

  • Interest in a long-term collaboration, not short-term gigs

Context:

  • European market focus

  • Product-oriented project with real-world usage

If this sounds interesting, feel free to comment or send a DM with a short intro and relevant experience.

submitted by /u/cauchyez
[link] [comments]

What Packaging And Terms Make A Dataset Truly “enterprise-friendly”?

I am trying to define what makes a dataset “enterprise-ready” versus just a dump of files. Regarding structure, do you generally prefer one monolithic archive or segmented collections with manifests? I’m also looking for best practices on taxonomy. How do you expect keywords and tags to be formatted for the easiest integration into your systems?

One of the biggest friction points seems to be legal clarity. What is the clearest way to express restrictions, such as allowed uses, no redistribution, or retention limits, so that engineers can understand them without needing a lawyer to parse the file every time?
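For concreteness, here is a sketch of the kind of machine-readable terms block I have in mind (all field names are invented; in practice this would ship as JSON or YAML alongside the data):

```python
# Sketch of a manifest entry with explicit, machine-checkable terms (field names invented)
manifest = {
    "dataset": "example-retail-images",
    "version": "3.1.0",
    "files": [
        {"path": "shards/part-0001.parquet", "sha256": "<checksum>", "rows": 250000},
    ],
    "taxonomy": {
        "keywords": ["retail", "product-images", "pricing"],
        "tags": {"modality": "image", "region": "EU"},
    },
    "terms": {
        "allowed_uses": ["internal-model-training", "internal-analytics"],
        "prohibited_uses": ["redistribution", "resale"],
        "retention": "delete within 24 months of contract end",
        "attribution_required": False,
    },
}

# An engineer (or a CI check) can gate a pipeline on the terms without reading the contract
assert "internal-model-training" in manifest["terms"]["allowed_uses"]
```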

If you have seen examples of “gold standard” dataset documentation that handles this perfectly, I would love to see them.

Thanks again guys for the help!

submitted by /u/Lost_Transportation1
[link] [comments]

Looking For Dataset For AI Interview / Behavioral Analysis (Johari Window)

Hi, I’m working on a university project building an AI-based interview system (technical + HR). I’m specifically looking for datasets related to interview questions, interview responses, or behavioral/self-awareness analysis that could be mapped to concepts like the Johari Window (Open/Blind/Hidden/Unknown).

Most public datasets I’ve found focus only on question generation, not behavioral or self-awareness labeling.
If anyone knows of relevant datasets, research papers, or even similar projects, I’d really appreciate pointers.

Thanks!

submitted by /u/Connect_Length6153
[link] [comments]

ScrapeGraphAI 100k: 100,000 Real-World Structured LLM Output Examples From Production Usage


Announcing ScrapeGraphAI 100k – a dataset of 100,000 real-world structured extraction examples from the open-source ScrapeGraphAI library:

https://huggingface.co/datasets/scrapegraphai/scrapegraphai-100k

What’s Inside:

This is raw production data – not synthetic, not toy problems. Derived from 9 million PostHog events collected from real users of ScrapeGraphAI during Q2-Q3 2025.

Every example includes:

– `prompt`: Actual user instructions sent to the LLM

– `schema`: JSON schema defining expected output structure

– `response`: What the LLM actually returned

– `content`: Source web content (markdown)

– `llm_model`: Which model was used (89% gpt-4o-mini)

– `source`: Source URL

– `execution_time`: Real timing data

– `response_is_valid`: Ground truth validation (avg 93% valid)

Schema Complexity Metrics:

– `schema_depth`: Nesting levels (typically 2-4, max ~7)

– `schema_keys`: Number of fields (typically 5-15, max 40+)

– `schema_elements`: Total structural pieces

– `schema_cyclomatic_complexity`: Branching complexity from `oneOf`, `anyOf`, etc.

– `schema_complexity_score`: Weighted aggregate difficulty metric

All metrics based on [SLOT: Structuring the Output of LLMs](https://arxiv.org/abs/2505.04016v1)

Data Quality:

Heavily balanced: Cleaned from 9M raw events to 100k diverse examples

Real-world distribution: Includes simple extractions and gnarly complex schemas

Validation annotations: `response_is_valid` field tells you when LLMs fail

Complexity correlation: More complex schemas = lower validation rates (thresholds identified)

Key Findings:

– 93% average validation rate across all schemas

– Complex schemas cause noticeable degradation (non-linear drop-off)

– Response size heavily correlates with execution time

– 90% of schemas have <20 keys and depth <5

– Top 10% contain the truly difficult extraction tasks

Use Cases:

– Fine-tuning models for structured data extraction

– Analyzing LLM failure patterns on complex schemas

– Understanding real-world schema complexity distribution

– Benchmarking extraction accuracy and speed

– Training models that handle edge cases better

– Studying correlation between schema complexity and output validity

The Real Story:

This dataset reflects actual open-source usage patterns – not pre-filtered or curated. You see the mess:

– Schema duplication (some schemas used millions of times)

– Diverse complexity levels (from simple price extraction to full articles)

– Real failure cases (7% of responses don’t match their schemas)

– Validation is syntactic only (semantically wrong but valid JSON passes)

Load It:

from datasets import load_dataset

dataset = load_dataset("scrapegraphai/sgai-100k")
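And a short, hedged example of slicing it by the fields listed above (assuming the split is named "train" and `response_is_valid` is a boolean):

```python
import statistics
from datasets import load_dataset

# Assumes the default split is named "train" and response_is_valid is a boolean
ds = load_dataset("scrapegraphai/sgai-100k", split="train")

# Overall validation rate and the failure cases worth inspecting
valid_rate = sum(ds["response_is_valid"]) / len(ds)
failures = ds.filter(lambda ex: not ex["response_is_valid"])
print(f"valid: {valid_rate:.1%}, failures: {len(failures)}")

# Validation rate within the most complex decile of schemas
threshold = statistics.quantiles(ds["schema_complexity_score"], n=10)[-1]
hard = ds.filter(lambda ex: ex["schema_complexity_score"] >= threshold)
print(f"hard-schema valid rate: {sum(hard['response_is_valid']) / len(hard):.1%}")
```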

This is the kind of dataset that’s actually useful for ML work – messy, real, and representative of actual problems people solve.

submitted by /u/Electrical-Signal858
[link] [comments]

Football (Soccer) Data – Players (without Game Analysis)

Hi,

Looking for a dataset/API that contains information about football players: their nationalities, the clubs they played at, their coaches, and their individual & team trophies.

Most of the APIs/datasets out there are oriented either toward game and tactical analysis or toward the transfer market, so I could not find a reliable data source.

I tried Transfermarkt data, but it has a lot of inaccuracies and limited history. I need something rather comprehensive.

Any tips?

submitted by /u/orm_the_stalker
[link] [comments]

I’m Trying To “Moneyball” US High Schools To See Which Ones Are Actually D1 Athlete Factories. Is There A Clean Dataset For This?

I’ve gone down a rabbit hole trying to analyze the “Athlete ROI” of different zip codes. Basically, I want to build a heatmap that shows which high schools are statistically over-performing at sending kids to college on athletic scholarships (specifically D1/D2 commits). My theory is that there are “hidden gem” public schools that produce just as many elite athletes as the $50k/year private academies, but the data is impossible to visualize because it’s all locked in individual profiles. I’ve looked at MaxPreps, 247Sports, and Rivals, but they are designed for tracking single players, not analyzing school output at scale.

The Question: Does anyone know of an aggregate dataset (or a paid API) that links:

  • High School Name / Zip Code
  • Total Commits per year (broken down by D1 vs D2 if possible)
  • Sport Category

I’m trying to avoid writing a scraper to crawl 20,000 school pages if a clean database already exists. Has anyone worked with recruitment data like this before?

submitted by /u/Dry-Town7979
[link] [comments]

[Project] FULL_EPSTEIN_INDEX: A Unified Archive Of House Oversight, FBI, DOJ Releases

TL;DR: I am aggregating all public releases regarding the Epstein estate (House Oversight docs, DOJ disclosures, flight logs, multimedia) into one repository. While I finish processing the data (OCR and Whisper transcription), I have opened my Dropbox for public access to the raw files.

This archive aims to be a unified resource for OSINT analysis and research. It expands on previous dumps by combining the recent November 2025 House Oversight releases with the DOJ’s “First Phase” declassification.

  • Note: I am still in the process of uploading some of the larger media files, so keep checking back. However, it currently contains ALL the raw PDFs from every source (FBI, House/Senate, DOJ, etc.), including the most recent (though heavily redacted) release.

To deter bot scraping, the Dropbox is password protected. The password is my GitHub username: theelderemo

I am currently running a pipeline to process these files to make them fully searchable:

OCR: Extracting high-fidelity text from the raw PDFs.

Transcription: Using OpenAI Whisper to generate transcripts for all audio and video evidence.

Once the processing is complete, the structured dataset will be hosted on Hugging Face, and I will be releasing a Gradio app to make searching the index user friendly.
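For anyone who wants to run a similar pass locally, here is a rough sketch of that pipeline, assuming pdf2image + pytesseract for the OCR step and the open-source openai-whisper package for transcription (paths and model size are placeholders):

```python
from pathlib import Path

import pytesseract
import whisper
from pdf2image import convert_from_path

def ocr_pdf(pdf_path: Path) -> str:
    # Render each page to an image, then OCR it
    pages = convert_from_path(str(pdf_path), dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

model = whisper.load_model("base")
Path("text").mkdir(exist_ok=True)
Path("transcripts").mkdir(exist_ok=True)

for pdf in Path("raw_pdfs").glob("*.pdf"):
    Path("text", pdf.stem + ".txt").write_text(ocr_pdf(pdf), encoding="utf-8")

for clip in Path("media").glob("*.mp4"):
    # Whisper handles audio and video containers via ffmpeg
    text = model.transcribe(str(clip))["text"]
    Path("transcripts", clip.stem + ".txt").write_text(text, encoding="utf-8")
```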

Please Watch or Star the GitHub repository. That is where I will post the updates, the link to the final Hugging Face dataset, and the search app once they are live.

Github Repo

Dropbox with all files

Original Repo for 20k Emails (this contains the november dataset and gradio search app)

Content warning: This repository contains graphic and highly sensitive material regarding sexual abuse, exploitation, and violence. It also contains unverified allegations. Discretion is strongly advised.

EDIT: Apparently subfolders are not being publicly shared for some reason, so only the top parent folder is shared in Dropbox. I’m cloning them to my Google Drive. Be patient with me, lol. I’ll update the Dropbox link to the Drive link once it’s done. It’s over 150 GB.

Here’s the link for the Google Drive.
It is being updated via a Colab script that clones my Dropbox to the Drive, so each refresh will show new folders/docs.

For now, here’s individual share links for each subfolder:

https://www.dropbox.com/scl/fo/mu2ebqnutbehj5ix063hi/AO_gd0QCu7dopIc5KulYqcs?rlkey=eoqzz5a8x9v1qsjotxmwax8ed&st=7d5tjzjq&dl=0

https://www.dropbox.com/scl/fo/lhdne8ebxvih4z9y83aqj/ACFUeplO_SCiCYF6PLVQTNE?rlkey=miisoobzylco8hzhc8yjtfbim&st=3k6uha26&dl=0

https://www.dropbox.com/scl/fo/xmgoirs4n1cjobpu45wgo/AH-YxKPuoecKz2cvrV24xtA?rlkey=6dmiuieavbifgucvtmhxg5oz2&st=fm2lceeb&dl=0

https://www.dropbox.com/scl/fo/nommub0xf7yw1uvnzzu6s/ACPTR-QCmzRj_-YXUFnONws?rlkey=zf0e1l0tggxagphvl8z0qj1j2&st=hlsvrqf8&dl=0

https://www.dropbox.com/scl/fo/q4sjrvwfemg3uwx63kgiz/AP_HvwExmO7YxYD32Nixvwg?rlkey=ygb0w2ardd1vud5tknr2xf6zv&st=y0pyxhv3&dl=0

https://www.dropbox.com/scl/fo/va3f0oraph91wljz2dhst/AFkaQGsAPDWad4U9gg8_8Ag?rlkey=hjkyqs6q9hqjttf8dvot6c5w4&st=vd1f6rk1&dl=0

https://www.dropbox.com/scl/fo/k3hwoqmax72un20ok70cy/AHmkB7YPXV_6xRLtDRNxPVQ?rlkey=7ak8w1dm2iyzvjxuqjxd5qsoo&st=uroug8x1&dl=0

submitted by /u/Ok-District-1330
[link] [comments]

Help Me Figure Out What To Do With This Massive Israeli Car Data File I Stumbled Upon

Okay, so here’s the deal – I somehow ended up with this massive file that’s got like a million lines of what looks like Israeli car data. It’s all separated by these pipe characters (|) and has Hebrew writing mixed in. From what I can tell by looking at it, it’s got stuff about different cars – models, years, engine info, all that – but written out in Hebrew. Kinda wild.

02263039|0650|P|ñåáàøå éôï|0226|GP3ELCC|XV|XV|1.6 PREMIUM|5|14|2016|FB 16

02258339|0650|P|ñåáàøå éôï|0247|GP7ELUC|XV|XV|2.0I|5|14|2016|FB 20

02279939|0650|P|ñåáàøå éôï|0253|SJ5DL7C|FORESTER|FORESTER|2.0XS|5|14|2017|FB 20

02247639|0650|P|ñåáàøå éôï|0243|GP7ELTC|XV|XV|2.0 PREMIUM|5|14|2016|FB 20

01851239|0650|P|ñåáàøå éôï|0228|GP7ELUC|XV|XV|2.0I|1|14|2017|FB 20

What I’ve Figured Out:

  • Pipe-delimited format
  • Column 4: Hebrew vehicle descriptions (decodes to makes/models like Honda CR-V, Seat, BMW)
  • Column 12: Year (1999-2017+)
  • Column 13: Engine codes (G4LC = Hyundai/Kia 1.4L, etc.)
  • Columns 10-11: Likely cylinders and engine displacement
  • ISO-8859-8 encoding for Hebrew
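Given the pipe delimiter and ISO-8859-8 encoding listed above, here is a minimal pandas loading sketch (the filename and column names are guesses, not confirmed):

```python
import pandas as pd

# Guessed column names; columns 1-3 and 5-9 are still unidentified
cols = ["reg_id", "col2", "col3", "make_hebrew", "col5", "model_code",
        "model", "model2", "trim", "col10", "col11", "year", "engine_code"]

df = pd.read_csv("vehicles.txt", sep="|", header=None, names=cols,
                 encoding="iso-8859-8", dtype=str)

print(df["year"].value_counts().head())
print(df["make_hebrew"].unique()[:10])  # Hebrew text decodes correctly with this encoding
```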

Questions for the Community:

  1. Does anyone recognize this specific data format or structure?
  2. What industries would find this data most valuable?
  3. Any creative but legitimate applications for this type of automotive dataset?
  4. What are the best ways to process/enhance this data?
  5. Any Israeli-specific considerations I should know about?
  6. Has anyone worked with similar automotive data commercially?
  7. What might the other columns represent (1-3, 5-9)?

I have technical skills (Python, SQL, APIs) to work with this but need domain knowledge about what’s actually valuable here and how to properly interpret the structure.

Not looking to share the full dataset publicly, but happy to provide more samples if helpful for analysis. Interested in legitimate applications and technical insights.

Thanks for any help!

submitted by /u/Only1_abdou
[link] [comments]

Out Of Curiosity, How Much Would Be Worth This Mortgages Dataset?

In my past job, and I want to be as vague as possible, there was a need for data manipulation/migration/backup numerous times over the course of about two years. There were almost no safety standards in place for handling the data. I couldn’t believe some of the tasks management assigned me; for example, I was supposed to back the data up on my local machine temporarily, etc.

I don’t want to go more into detail and possibly get anyone (myself included) in trouble.

I was just curious: how much would the data be worth on the open (and possibly black) market? I had no intention of betraying anyone, but I have wondered this for a couple of years now, just being in awe of how much management was risking by trusting several people without having any protocols in place. I am pretty sure our contracts had no clauses about leaking data, etc.

The data contained about 5,000-7,000 mortgage records over a span of 5-7 years and their entire screening process (a very complex data model): the applicants’ health reports based on their medical records, their specific and verified assets and liquidity, verified income, liabilities, property information, banking information, and contact information. Anything that would be required in a mortgage screening process was basically in the dataset. Lots of sensitive and personal information.

I don’t want to specify the country exactly, you may consider it was either USA, UK, or Canada.

And just to clarify, I would never do anything illegal with the data as I appreciated the people and had no intention of going to jail.

submitted by /u/John200xw
[link] [comments]

Weekly Pricing Snapshots For 500+ Online Brands (Free, MIT Licensed)

I’ve been working on a dataset that captures weekly pricing behavior from online brand storefronts.

What it is:

– Weekly snapshots of pricing data from 500+ DTC and e-commerce brands

– Structured schema: current price, original price, discount percentage, category

– Historical comparability (same schema across all snapshots)

– MIT licensed

What it’s for:

– Pricing analysis and benchmarking

– Market research on e-commerce behavior

– Academic research on retail pricing dynamics

– Building models that need consistent pricing signals

What it’s not:

– A product catalog (it’s behavioral data, not inventory)

– Real-time (weekly cadence, not live feeds)

– Complete (consistent sample > exhaustive coverage)

The repo has full documentation on methodology, schema, and limitations. First data release is coming soon.
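Once the first snapshots land, a typical analysis could look like the sketch below (the filename and column names are illustrative of the schema described above; see the repo docs for the exact format):

```python
import pandas as pd

# Hypothetical snapshot file and column names; check the repo documentation for the real schema
df = pd.read_csv("snapshot_2025-11-24.csv")

# Average advertised discount by category for this week
by_category = df.groupby("category")["discount_percentage"].mean().sort_values(ascending=False)
print(by_category.head(10))

# Share of SKUs currently discounted, per brand
df["on_sale"] = df["current_price"] < df["original_price"]
print(df.groupby("brand")["on_sale"].mean().sort_values(ascending=False).head(10))
```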

GitHub: https://github.com/mranderson01901234/online-brand-pricing-snapshots

Source and full methodology: https://projectblueprint.io/datasets

submitted by /u/operastudio
[link] [comments]

Esports DFS Dataset: CS2 Match Stats + Player Game Logs + Prop Outcomes (hit/miss)

I built an esports DFS dataset/API pipeline and I’m releasing a sample dataset from it.

What’s inside (CS2):

  • Fixtures (upcoming + completed, any date)
  • Box scores + per-player match stats
  • Player game logs
  • Prop outcomes grading (hit/miss/push)
  • Player images + team logos (media fields included)

Trimmed JSON:

{
  "sport": "cs2",
  "fixture_id": "fix_144592",
  "event_time": "2025-11-30T10:00:00Z",
  "competition": "DraculaN #4: Open Qualifier",
  "team1": "Mousquetaires",
  "team2": "Young Ninjas",
  "metadata": { "format": "bestOf3", "maps": ["Inferno", "Mirage", "Nuke"] }
}

Disclosure: I run KashRock (the API behind this).

If you’re building a bot/dashboard/model, comment “key” and I’ll send access.

submitted by /u/Apprehensive_Ice8314
[link] [comments]

How Does Your Organization Find Outsourcing Vendors For Data Labeling?

I’m the founder of a data labeling platform startup based in a Southeast Asian country. Since the beginning, we’ve worked with two major clients from the public sector (locally), providing both a self-hosted end-to-end solution and data labeling services. Their requirements are often broad and sometimes very niche (e.g., geographical data, medical data, etc.). Many times, these requirements don’t follow standardized contracts—for example, they might request non-Hugging Face-compatible outputs or even Excel files instead of JSON due to security concerns.

While we’ve been profitable and stable, we’re looking to pivot into the international market in the long term (B2B focus) rather than remaining exclusively in B2G.

Because of the strict requirements from government clients, our data labeling team is highly skilled. For context, our project leads include ex-team leaders from big tech companies, and we enforce a rigorous QA process. This has made us unaffordable within our local market, so we’re hoping to expand internationally.

However, after spending around $10,000 on a local agency to run paid ads, we didn’t generate useful leads or convert any users. I understand that our product is challenging to market, but I’d like to hear from others who have faced similar issues.

If your organization needs a data labeling vendor, where do you typically look? Google? LinkedIn? Word of mouth?

submitted by /u/not_apply_yet
[link] [comments]

Embeddings For The Wikipedia Link Graph

Hi, I am looking for embeddings of the links in English Wikipedia pages; the version I currently have is more than a year out of date and only includes a limited number of entity types.

Does anyone here have experience using these or training their own? Training looks like it would be quite expensive, so I want to make sure I’ve explored all other options first.
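For the train-your-own route, this is the kind of approach I’d sketch out: random walks over the link graph fed to gensim’s Word2Vec (the edge-list file and hyperparameters are placeholders, and the full English Wikipedia graph would need far more compute than this toy setup):

```python
import random

import networkx as nx
from gensim.models import Word2Vec

# Load a directed link graph: one "source_title<TAB>target_title" pair per line (placeholder file)
g = nx.read_edgelist("wikipedia_links.tsv", delimiter="\t", create_using=nx.DiGraph)

def random_walk(node, length=40):
    # Follow outgoing links at random, stopping at dead ends
    walk = [node]
    for _ in range(length - 1):
        neighbors = list(g.successors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

# Treat walks as "sentences" of page titles and train skip-gram embeddings
walks = [random_walk(n) for n in g.nodes for _ in range(5)]
model = Word2Vec(walks, vector_size=128, window=5, min_count=1, sg=1, workers=4)
model.wv.save("wiki_link_embeddings.kv")
```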

submitted by /u/Useful-Pride1035
[link] [comments]

Be Honest: Which AI Tool Do You Actually Use Daily?

I’m genuinely curious about the AI tools people actually use every day. There are thousands of AI products out there, but there’s a big gap between the tools people talk about and the ones they truly rely on in their daily workflow.

So here’s my question:

If you used an AI tool today:

What did you use it for? What made it stick?

For example, I use Supaboard every single day to help with my analytics and reporting work. Before Supaboard, I depended heavily on my tech team for this. What made Supaboard “sticky” for me is that it lets me do work I was already doing, just faster and without the back-and-forth.

I also use the latest version of ChatGPT daily for writing, ideation, quick research, and thinking through problems.

What makes it stick is how naturally it fits into my workflow, it’s fast, flexible, and helps me move from idea to execution without friction.

I’m not looking for promo links or marketing pitches, just genuine recommendations for tools you personally find useful and would confidently recommend to others.

Thanks in advance!

submitted by /u/Ok-Friendship-9286
[link] [comments]

DataSetIQ Python Library – Millions Of Datasets In Pandas

Sharing datasetiq v0.1.2 – a lightweight Python library that makes fetching and analyzing global macro data super simple.

It pulls from trusted sources like FRED, IMF, World Bank, OECD, BLS, and more, delivering data as clean pandas DataFrames with built-in caching, async support, and easy configuration.

### What My Project Does

datasetiq is a lightweight Python library that lets you fetch and work with millions of global economic time series from trusted sources like FRED, IMF, World Bank, OECD, BLS, US Census, and more. It returns clean pandas DataFrames instantly, with built-in caching, async support, and simple configuration, which makes it well suited to macro analysis, econometrics, or quick prototyping in Jupyter.

Python is central here: the library is built on pandas for seamless data handling, async for efficient batch requests, and integrates with plotting tools like matplotlib/seaborn.

### Target Audience

Primarily aimed at economists, data analysts, researchers, macro hedge funds, central banks, and anyone doing data-driven macro work. It’s production-ready (with caching and error handling) but also great for hobbyists or students exploring economic datasets. Free tier available for personal use.

### Comparison

Unlike general API wrappers (e.g., fredapi or pandas-datareader), datasetiq unifies multiple sources (FRED + IMF + World Bank + 9+ others) under one simple interface, adds smart caching to avoid rate limits, and focuses on macro/global intelligence with pandas-first design. It’s more specialized than broad data tools like yfinance or quandl, but easier to use for time-series heavy workflows.

### Quick Example

import datasetiq as iq

# Set your API key (one-time setup)
iq.set_api_key("your_api_key_here")

# Get data as pandas DataFrame
df = iq.get("FRED/CPIAUCSL")

# Display first few rows
print(df.head())

# Basic analysis
latest = df.iloc[-1]
print(f"Latest CPI: {latest['value']} on {latest['date']}")

# Calculate year-over-year inflation
df['yoy_inflation'] = df['value'].pct_change(12) * 100
print(df.tail())

Links & Resources

submitted by /u/dsptl
[link] [comments]