Need A Huge Data Set Related To Gambling For My Data Analytics For Economists Final Project.

Can someone please help me? I cannot find anything online. I need a big dataset, ideally one that includes the months as well. Any leads or links would be helpful, and if anyone has a Statista membership, could you please help me get it from there?

submitted by /u/OppositeJury2310

Is There A Practical Standard For Documenting Web-scraped Datasets?

Every dataset repo has its own README style – some list sources, others list fields, almost none explain the extraction process. I’m thinking scraped data deserves its own metadata standard: crawl date, frequency, robots.txt compliance, schema history, coverage ratio. But no one seems to agree on how deep to go. How would you design a reproducible, lightweight standard for scraped data documentation, something between a bare-minimum CSV and an academic paper appendix?
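A minimal sketch of what such a lightweight record could look like, assuming a JSON sidecar file shipped next to the data. The class name `ScrapeMetadata`, its field names, and the example URL are hypothetical and not part of any existing standard; only the five items listed in the post (crawl date, frequency, robots.txt compliance, schema history, coverage ratio) come from the text.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ScrapeMetadata:
    """Hypothetical sidecar record describing one release of a scraped dataset."""
    source_url: str
    crawl_date: str             # ISO 8601 date of the crawl
    crawl_frequency: str        # e.g. "daily", "weekly", "one-off"
    robots_txt_respected: bool  # whether the crawler honoured robots.txt
    records_collected: int
    records_expected: int       # best estimate of what the source holds
    schema_history: list = field(default_factory=list)

    @property
    def coverage_ratio(self) -> float:
        """Share of the expected records that were actually collected."""
        if self.records_expected <= 0:
            return 0.0
        return self.records_collected / self.records_expected

    def to_json(self) -> str:
        """Serialise all fields plus the derived coverage ratio."""
        payload = asdict(self)
        payload["coverage_ratio"] = round(self.coverage_ratio, 4)
        return json.dumps(payload, indent=2, sort_keys=True)


# Illustrative values only.
meta = ScrapeMetadata(
    source_url="https://example.com/listings",
    crawl_date="2025-01-15",
    crawl_frequency="daily",
    robots_txt_respected=True,
    records_collected=9500,
    records_expected=10000,
    schema_history=[
        {"version": 1, "date": "2025-01-01", "change": "initial schema"},
    ],
)
print(meta.to_json())
```

Keeping the record as one flat JSON object makes it easy to diff between releases, which is one way to track schema history without writing a full appendix.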

submitted by /u/Vivid_Stock5288

I’ve Built An Automatic Data Cleaning Application. Looking For MESSY Spreadsheets To Clean/test.

Hello everyone!

I’m a data analyst/software developer. I’ve built data cleaning, processing, and analysis software, but I need datasets to clean so I can test it thoroughly.

I’ve used AI-generated datasets, which work great at first but fill up with random, hallucinated data after a while.

I’ve used datasets from Kaggle, but most of them are pretty clean.

I’m looking for any datasets in any industry to test the cleaning process. Preferably datasets that take a long time to clean and process before doing the data analysis.

CSV and xlsx file types. Anything helps! 🙂 Thanks

submitted by /u/spicytree21

Looking For Pickleball Data For School Project.

I checked Kaggle, but it does not have any scoring data or win/loss data.

I am looking for data about matches played and their results, including wins, losses, and points for and against.

submitted by /u/ikeiscoding

Looking For Housing Price Dataset To Do Regression Analysis For School

Hi all, I’m looking through Kaggle for a housing dataset with at least 20 columns of data, and I can’t find any that look good and have over 20 columns. Do you know of one off the top of your head by any chance, or could you at least find one quickly?

I’m looking for one with attributes like roof replaced x years ago, garage size measured in cars, square footage, etc. Anything that might change the value of a house. The one I’ve got now has only 13 columns of data, which will work, but I would like to find a better one.

submitted by /u/labor_anoymous

What Your Data Provider Won’t Tell You: A Practical Guide To Data Quality Evaluation

Hey everyone!

Coresignal here. We know Reddit is not the place for marketing fluff, so we will keep this simple.

We are hosting a free webinar on evaluating B2B datasets, and we thought some people in this community might find the topic useful. Data quality gets thrown around a lot, but the “how to evaluate it” part usually stays vague. Our goal is to make that part clearer.

What the session is about

Our data analyst will walk through a practical 6-step framework that anyone can use to check the quality of external datasets. It is not tied to our product. It is more of a general methodology.

He will cover things like:

How to check data integrity in a structured way
How to compare dataset freshness
How to assess whether profiles are valid or outdated
What to look for in metadata if you care about long-term reliability

When and where

December 2 (Tuesday)
11 AM EST (New York)
Live, 45 minutes + Q&A

Why we are doing it

A lot of teams rely on third-party data and end up discovering issues only after integrating it. We want to help people avoid those situations by giving a straightforward checklist they can run through before committing to any provider.

If this sounds relevant to your work, you can save a spot here:
https://coresignal.com/webinar/

Happy to answer questions if anyone has them.

submitted by /u/Coresignal

Looking For A Piracy Dataset On Games

My university requires me to do a data analysis capstone project, and I have decided to build a hypothesis about a country’s piracy level based on GDP per capita: the prices these games are sold for put them out of reach of most people, and the prices are unfair relative to GDP per capita. Do comment on what you think, and if you have a better idea, do enlighten me. Also, please suggest a dataset for this, because I can’t find anything that’s publicly available.

submitted by /u/NecessaryBig2035

[Offer] Glassdoor MSCI Companies Job Review Dataset (2145 Companies, 1.31GB) – Preview Available

Hi everyone,

I’m offering a structured dataset of employee job reviews for MSCI index companies, built from public job review platforms (e.g. Glassdoor).

I’m sharing a free preview sample, and the full dataset (1.31 GB) is available on request.

🗂 Dataset Overview

Coverage: 2,145 MSCI-listed companies

Size: ~1.31 GB

Content: Company-level job reviews, including:

Overall rating information

Job titles and review dates

Free-text review content (pros/cons, comments, etc., where available)

Timeframe: Recent data (latest version at time of collection)

The data is cleaned and structured for analytics and modeling (CSV / similar tabular format).

🔧 Potential Use Cases

HR & people analytics – benchmarking employee satisfaction across MSCI companies

NLP / LLM training – sentiment analysis, aspect-based opinion mining, topic clustering

Market & equity research – linking employee sentiment to performance, risk, or ESG signals

Academic / research projects – labor studies, organizational behavior, etc.

📥 Preview & Full Access

I’m happy to provide a small preview sample so you can check structure and suitability for your use case.

If you’re interested in the full version of this dataset, please contact me directly:

📧 a.corradini0215@gmail.com

We can discuss:

Use case (research vs. commercial)

Licensing / usage terms

Pricing and any customization (e.g., specific sectors, time ranges)

⚖️ Notes

Please ensure that any use of the dataset complies with your local laws, your organization’s policies, and the terms of the original review platforms. I’m happy to clarify the structure and collection approach if needed.

Thanks, and feel free to ask questions here or by email if you want more details about fields, schema, or example rows.

submitted by /u/Crafty_Beach_3733

[PAID] I Spent Months Scraping 140+ Low-cap Solana Memecoins From Launch (10s Intervals), Dataset Just Published!

Disclosure: This is my own dataset. Access is gated.

Hey everyone,

I’ve been working on a dataset since September, and finally published it on Hugging Face.

I’ve traded (well.. gambled) with Solana memecoins for almost 3 years now, and discovered an incredible number of factors at play when trying to determine if a coin was worth buying.

I’d dabble mostly in low market cap coins, while keeping the vast majority of my crypto assets in mid-high cap coins, Bitcoin for example. It was upsetting seeing new narratives with high price potential go straight to 0, and I finally decided to start approaching this emotional game logically.

I ended up building a web scraper to both constantly scrape new coin data as coins were deployed, and make API calls for each coin’s social data, rugcheck data, and tons of other tokenomics at the same time.

The dataset includes a large number of features per token snapshot (snapshots are at most 10 seconds apart), such as:

market cap
volume
holders
top 10 holder %
bot holding estimates
dev wallet behavior
social links
linked website scraping analysis (title, HTML, reputation, etc.)
rugcheck scores
up to hundreds of other features

In total I collected thousands of coins’ chart histories, and filtered this number down to 140+ clean charts, each with nearly 300 data points on average.

With some quick exploratory analysis, I was able to spot smaller patterns, such as how the presence of social links could correlate with a higher market cap ATH. I’m a data engineer, not a data scientist yet. I’m sure those with formal ML backgrounds could find much deeper patterns and predictive signals in this dataset than I can.

For the full dataset description/structure/charts/and examples, see the Hugging Face Dataset Card.

submitted by /u/wtfmase

Dataset For Building A Database On Cinema Management

Hello,

I am a computer science student working on a project to build a database for managing a cinema. Do you know where I could find datasets covering one single French cinema (Pathé, UDC, CGR…), please?

Thank you for your help!

submitted by /u/Ok_Type_7221

Times Higher Education World University Rankings Dataset (2011-2026) – 44K Records, CSV/JSON, Python Scraper Included

I’ve created a comprehensive dataset of Times Higher Education World University Rankings spanning 16 years (2011-2026).

📊 Dataset Details:

44,000+ records from 2,750+ universities worldwide
16 years of historical data (2011-2026)
Dual format: Clean CSV files + Full JSON backups
Two data types: Rankings scores AND key statistics (enrollment, staff ratios, international students, etc.)

📈 What’s included:

Overall scores and individual metrics (teaching, research, citations, industry, international outlook)
Student demographics and institutional statistics
Year-over-year trends ready for analysis

🔧 Python scraper included:

The repo includes a fast, reliable Python scraper that:

Uses direct API calls (no browser automation)
Fetches all data in 5-10 minutes
Requires only requests and pandas

💡 Use cases:

Academic research on higher education trends
Data visualization projects
Institutional benchmarking
ML model training
University comparison tools

GitHub: https://github.com/c3nk/THE-World-University-Rankings

The scraper respects THE’s public API endpoints and is designed for educational/research purposes. All data is sourced from Times Higher Education’s official rankings.

Feel free to fork, star, or suggest improvements!

submitted by /u/cenkK

Rest Api To Dataset Just A Few Prompts Away

Hey folks, senior data engineer and dlthub cofounder here (dlt = oss python library for data integration)

Most datasets are behind REST APIs. We created a system that lets you vibe-code a REST API connector (Python dict based, looks like config, easy to review), including LLM context, a debug app, and easy ways to explore your data.

We describe it as our “LLM native” workflow. Your end product is a resilient, self-healing, production-grade pipeline. We created 8800+ contexts to facilitate this generation, but it also works without them to a lesser degree. Our next step, early next year, is to generate running code.

Blog tutorial with video: https://dlthub.com/blog/workspace-video-tutorial

Once you have created this pipeline, you can access it via what we call the dataset interface https://dlthub.com/docs/general-usage/dataset-access/dataset which is a runtime-agnostic way to query your data (meaning we spin up a DuckDB on the fly if you load to files, but if you load to a db we use that)

More education opportunities from us (data engineering courses): https://dlthub.learnworlds.com/

Hope this was useful, feedback welcome.

submitted by /u/Thinker_Assignment

University Statistics Report Confusion

I am doing a statistics report but I am really struggling. The task is this: Describe GPA variable numerically and graphically. Interpret your findings in the context. I understand all the basic concepts such as spread, variability, centre, etc., but how do I word it in the report, and in what order? Here is what I have written so far for the image posted (I split it into numerical and graphical summary).

The mean GPA of students is 3.158, indicating that the average student has a GPA close to 3.2, with a standard deviation of 0.398. This indicates that GPAs typically fall within about 0.4 points of the mean. The median is 3.2, which is slightly higher than the mean, suggesting a slight skew to the left. With Q1 at 2.9 and Q3 at 3.4, the middle 50% of students have GPAs between these values (an IQR of 0.5), suggesting there is little variation between student GPAs. The minimum GPA is 2 and the maximum is 4. Using the 1.5×IQR rule to determine potential outliers, the lower boundary is 2.15 and the upper boundary is 4.15. The minimum of 2 falls below the lower boundary, indicating potential low outliers, which helps explain why the mean is slightly lower than the median.
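The boundary arithmetic in the paragraph above can be checked with a short sketch. The quartiles, minimum, and maximum are the ones quoted in the draft; the function name `iqr_fences` is just an illustrative helper, not something from the assignment.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier boundaries for the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Values quoted in the draft report.
q1, q3 = 2.9, 3.4
gpa_min, gpa_max = 2.0, 4.0

lower, upper = iqr_fences(q1, q3)
print(f"IQR = {q3 - q1:.2f}, lower = {lower:.2f}, upper = {upper:.2f}")

# Any observed extreme outside the boundaries is a potential outlier.
flagged = [x for x in (gpa_min, gpa_max) if x < lower or x > upper]
print("Potential outliers among min/max:", flagged)
```

With these inputs the boundaries come out at 2.15 and 4.15, so only the minimum of 2 is flagged, which matches the interpretation in the draft.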

Because GPA is a continuous variable, a histogram is appropriate to show the distribution. The histogram shows a unimodal distribution that is mostly symmetrical with a slight left skew, indicating a cluster of higher GPAs and relatively few lower GPAs.

Here is what is asked of us when describing a single categorical variable: Demonstrates precision in summarising and interpreting quantitative and categorical variables. Justifies choice of graphs/statistics. Interprets findings critically within the report narrative, showing awareness of variable type and distributional meaning.

submitted by /u/Sad-Beautiful-7945

What’s Your Preferred Way To Store Incremental Updates For Large Datasets?

I’m maintaining a dataset that changes daily. Full refreshes are too heavy; diffs get messy. I’ve tried append-only logs, versioned tables, even storing compressed deltas. Each approach costs something: readability, reproducibility, or storage. If you manage big evolving datasets, how do you structure yesterday + today without rewriting history or duplicating half your records?
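One option among the versioned-table approaches mentioned above is to keep a validity interval per record version and only write a new version when a content hash changes. The sketch below is an in-memory illustration under that assumption; the function names, the `valid_from`/`valid_to` fields, and the sample rows are all hypothetical.

```python
import hashlib
import json


def row_hash(row):
    """Stable fingerprint of a record's content (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()


def apply_daily_load(history, today_rows, load_date):
    """Append a version only for records whose content changed.

    history: list of version dicts (key, row, hash, valid_from, valid_to)
    today_rows: mapping of record key -> current row content
    """
    current = {v["key"]: v for v in history if v["valid_to"] is None}
    for key, row in today_rows.items():
        h = row_hash(row)
        old = current.get(key)
        if old is not None and old["hash"] == h:
            continue  # unchanged record: nothing is stored
        if old is not None:
            old["valid_to"] = load_date  # close the superseded version
        history.append({"key": key, "row": row, "hash": h,
                        "valid_from": load_date, "valid_to": None})
    # Records missing from today's load are treated as deleted.
    for key, old in current.items():
        if key not in today_rows:
            old["valid_to"] = load_date


def as_of(history, date):
    """Reconstruct the dataset as it looked on a given ISO date string."""
    return {v["key"]: v["row"] for v in history
            if v["valid_from"] <= date
            and (v["valid_to"] is None or date < v["valid_to"])}


history = []
apply_daily_load(history, {"a": {"price": 10}, "b": {"price": 5}}, "2025-01-01")
apply_daily_load(history, {"a": {"price": 12}, "b": {"price": 5}}, "2025-01-02")
print(len(history))                  # stored versions, not days x records
print(as_of(history, "2025-01-01"))
print(as_of(history, "2025-01-02"))
```

The only field that is ever updated in place is `valid_to` on a superseded version; row contents are never rewritten, so any past day can still be reconstructed.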

submitted by /u/Vivid_Stock5288

Exploring The Public “Epstein Files” Dataset Using A Log Analytics Engine (interactive Demo)

I’ve been experimenting with different ways to explore large text corpora, and ended up trying something a bit unusual.

I took the public “Epstein Files” dataset (~25k documents/emails released as part of a House Oversight Committee dump) and ingested all of it into a log analytics platform (LogZilla). Each document is treated like a log event with metadata tags (Doc Year, Doc Month, People, Orgs, Locations, Themes, Content Flags, etc).

The idea was to see whether a log/event engine could be used as a sort of structured document explorer. It turns out it works surprisingly well: dashboards, top-K breakdowns, entity co-occurrence, temporal patterns, and AI-assisted summaries all become easy to generate once everything is normalized.
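To make the "document as log event" idea concrete, here is a small generic sketch of top-K breakdowns and tag co-occurrence over tagged events. It does not use LogZilla or its API; the event structure and the tag values are made-up placeholders.

```python
from collections import Counter
from itertools import combinations

# Each document becomes one event carrying a set of metadata tags (placeholder data).
events = [
    {"doc_year": 2005, "tags": {"travel", "finance"}},
    {"doc_year": 2005, "tags": {"travel", "legal"}},
    {"doc_year": 2006, "tags": {"travel", "finance", "legal"}},
]


def top_k_tags(events, k):
    """Most frequent tags across all events, as (tag, count) pairs."""
    counts = Counter(tag for event in events for tag in event["tags"])
    return counts.most_common(k)


def co_occurrence(events):
    """Count how often each unordered pair of tags appears on the same event."""
    pairs = Counter()
    for event in events:
        # Sorting makes (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(event["tags"]), 2):
            pairs[(a, b)] += 1
    return pairs


print(top_k_tags(events, 1))
print(co_occurrence(events).most_common())
```

Once every document is reduced to a flat set of tags like this, temporal breakdowns are the same counting exercise grouped by `doc_year`.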

If anyone wants to explore the dataset through this interface, here’s the temporary demo instance:

https://epstein.bro-do-you-even-log.com
login: reddit / reddit

A few notes for anyone trying it:

Set the time filter to “Last 7 Days.”
I ingested the dataset a few days ago, so “Today” won’t return anything. Actual document dates are stored in the Doc Year/Month/Day tags.
It’s a test box and may be reset daily, so don’t rely on persistence.
The AI component won’t answer explicit or graphic queries, but it handles general analytical prompts (patterns, tag combinations, temporal comparisons, clustering, etc).
This isn’t a production environment; dashboards or queries may break if a lot of people hit it at once.

Some of the patterns it surfaced:

unusual “Friday” concentration in documents tagged with travel
entity co-occurrence clusters across people/locations/themes
shifts in terminology across document years
small but interesting gaps in metadata density in certain periods
relationships that only emerge when combining multiple tag fields

This is not connected to LogZilla (the company) in any way. It is just a personal experiment in treating a document corpus as a log stream to see what kind of structure falls out.

If anyone here works with document data, embeddings, search layers, metadata tagging, etc, I’d be curious to see what would happen if I throw it in there.

Also, I don’t know how the system will respond to hundreds of sessions logged in as the same user, so expect some weirdness. And pls be kind, it’s just a test box.

submitted by /u/meccaleccahimeccahi

Searching For Dataset Of Night Road Wildlife Animals

Hello, I am searching for larger (more than just 300 images) annotated datasets of animals and their silhouettes on or beside the road at night, so that I can train an ML model on them.

submitted by /u/liudasbar

[Synthetic] Created A 3-million Instance Dataset To Equip ML Models To Trade Better In Blackswan Events.

So I recently wrapped up a project where I trained an RL model to backtest on 3 years of synthetic stock data, and it generated 45% returns overall in real-market backtesting.

I decided to push it a lil further and include black swan events. Now, the dataset I used is too big for Kaggle, but the second dataset is available here.

I’m working on a smaller version of the model to release soon, but I’m looking for some feedback here on the dataset construction.

submitted by /u/Legitimate_Monk_318

AI Company Suno Spends Tens Of Millions On Compute But Nearly Nothing On Data

Paywalled article https://www.billboard.com/pro/suno-creates-spotify-catalog-music-two-weeks-pitch-deck/

submitted by /u/cavedave

Discussion About Creating Structured, AI-ready Data/knowledge Datasets For AI Tools, Workflows, …

I’m working on a project that turns raw, unstructured data into structured, AI-ready data in the form of a Dataset, which can then be used by AI tools or queried directly.

What I’m trying to understand is how everyone is handling this unstructured data to make it “understandable”, with proper context, so AI tools can work with it.

Also, what are your current setbacks and pain points when creating Datasets?

Where do you currently store your data? On a local device(s) or already using a cloud based solution?

What would it take for you to trust your data/knowledge to a platform, which would help you structure this data and make it AI-ready?

If you could, would you monetize it, or keep it private for your own use only?

If there would be a marketplace, with different Datasets available, would you consider buying access to these Datasets?

When it comes to LLMs, do you have specific ones that you’d use?

I’m not trying to promote or sell anything, just trying to understand how the community here thinks about Datasets, data/knowledge, …

submitted by /u/Udbovc

Bulk Earnings Call Transcripts Of 4,500 Companies Over The Last 20 Years [PAID]

Created a dataset of company transcripts on Snowflake. Transcripts are broken down by person and paragraph. You can use an LLM to summarize them or do equity research with the dataset.

The earnings call transcripts of AAPL are free to use. Let me know if you would like to see any other company!

https://app.snowflake.com/marketplace/listing/GZTYZ40XYU5

submitted by /u/fruitstanddev

We Built A Synthetic Proteomics Engine That Expands Real Datasets Without Breaking The Biology. Sharing Some Validation Results

Hey, let me start with the problem: proteomics datasets, especially the exosome datasets used in cancer research, are often small, expensive to produce, and hard to share. Because of that, a lot of analysis and ML work ends up limited by sample size instead of ideas.

At Synarch Labs we kept running into this issue, so we built something practical: a synthetic proteomics engine that can expand real datasets while keeping the underlying biology intact. The model learns the structure of the original samples and generates new ones that follow the same statistical and biological behavior.

We tested it on a breast cancer exosome dataset (PXD038553). The original data had just twenty samples across control, tumor, and metastasis groups. We expanded it about fifteen times and ran several checks to see if the synthetic data still behaved like the real one.

Global patterns held up. Log-intensity distributions matched closely. Quantile-quantile plots stayed near the identity line even when jumping from twenty to three hundred samples. Group proportions stayed stable, which matters when a dataset is already slightly imbalanced.

We then looked at deeper structure. Variance profiles were nearly identical between original and synthetic data. Group means followed the identity line with very small deviations. Kolmogorov–Smirnov tests showed that most protein-level distributions stayed within acceptable similarity ranges. We added a few example proteins so people can see how the density curves look side by side.
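For readers unfamiliar with the Kolmogorov–Smirnov check mentioned above, here is a minimal pure-Python sketch of the two-sample KS statistic for one protein. It is an illustration of the general technique, not the pipeline used in the post; the intensity values are invented, and no acceptance threshold is implied.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is always reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Invented log-intensities for a single protein.
original = [20.1, 20.4, 20.9, 21.3, 21.8]
synthetic = [20.0, 20.3, 20.5, 20.8, 21.0, 21.2, 21.4, 21.7, 21.9, 22.0]

print(f"KS statistic: {ks_statistic(original, synthetic):.3f}")
```

A statistic near 0 means the two distributions are hard to tell apart and a value near 1 means they barely overlap. With only twenty original samples, per-protein tests like this have limited power, so they are best read alongside the variance and mean comparisons described above.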

After that, we checked biological consistency. Control, tumor, and metastasis groups preserved their original signatures even after augmentation. The overall shapes of their distributions remained realistic, and the synthetic samples stayed within biological ranges instead of drifting into weird or noisy patterns.

Synthetic proteomics like this can help when datasets are too small for proper analysis but researchers still need more data for exploration, reproducibility checks, or early ML experiments. It also avoids patient-level privacy issues while keeping the biological signal intact.

We’re sharing these results to get feedback from people who work in proteomics, exosomes, omics ML, or synthetic data. If there’s interest, we can share a small synthetic subset for testing. We’re still refining the approach, so critiques and suggestions are welcome.

submitted by /u/Odd-Disk-975

I Scraped And Cleaned 50,000+ Career Discussion Threads From R/AskEngineers And R/EngineeringStudents. Here Is The Tool I Used.

I couldn’t find a good dataset that mapped the “Skills Gap” between university and industry, so I built a local scraper to create one.

The Data:

Volume: ~52,000 threads.
Fields: Title, Body, Top Comments, Sentiment.
Focus: Keywords relating to “Exams” vs “Workplace Tools”.

I built the extractor (ORION) to run locally so I wouldn’t get IP banned. It uses requests and smart rate-limiting.
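The post does not show how ORION's rate-limiting works, so the following is only a generic sketch of one common approach: enforcing a minimum interval between consecutive requests. The class name `MinIntervalLimiter` is hypothetical, and the actual HTTP call is left as a placeholder comment so the example stays self-contained.

```python
import time


class MinIntervalLimiter:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep if needed and return how many seconds were slept."""
        slept = 0.0
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last = now
        return slept


# Short interval for demonstration; a real crawl would use a much larger one.
limiter = MinIntervalLimiter(min_interval=0.05)
for page in range(3):
    limiter.wait()
    # Placeholder: this is where a request such as requests.get(url) would go.
    print(f"fetching page {page}")
```

A fixed minimum interval is the simplest policy; backing off further when the server answers with HTTP 429 would be a natural extension.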

You can grab the tool and the extraction logic here: https://mrweeb0.github.io/ORION-tool-showcase/

Feel free to fork it if you want to scrape other career subreddits (like Nursing or CS).

submitted by /u/No-Associate-6068

[question] Statistics About Evaluating A Group

submitted by /u/Few_Relationship_454

5,082 Email Threads Extracted From Epstein Files

I have processed the Epstein Files dataset and extracted 5,082 email threads with 16,447 individual messages. I used an LLM (xAI Grok 4.1 Fast via OpenRouter API) to parse the OCR’d text and extract structured email data.

Dataset available here: https://huggingface.co/datasets/notesbymuneeb/epstein-emails

submitted by /u/muneebdev

What’s The Best Way To Capture Change Over Time In Scraped Data?

I’m working on a dataset of daily price movements across thousands of products. The data’s clean but flat. Without a timeline, it’s hard to analyze trends. I’ve tried storing deltas, snapshots, and event logs; each one adds bloat. What’s your preferred model for time-aware datasets? Versioned tables? Append-only logs? Or something hybrid that stays queryable without eating storage?
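One hybrid that keeps storage down is a change-only log: a row is written only when a product's price differs from the last stored value, and a point-in-time lookup finds the most recent change on or before a date. The sketch below assumes observations arrive in date order per product; the class name `PriceHistory` and the sample data are illustrative.

```python
from bisect import bisect_right
from collections import defaultdict


class PriceHistory:
    """Change-only price log with point-in-time lookup."""

    def __init__(self):
        self._dates = defaultdict(list)   # product_id -> ascending ISO dates
        self._prices = defaultdict(list)  # product_id -> price at each change

    def record(self, product_id, date, price):
        """Store the observation only if the price changed. Returns True if stored.

        Assumes dates arrive in ascending order for each product.
        """
        prices = self._prices[product_id]
        if prices and prices[-1] == price:
            return False  # same as the last stored price: skip
        self._dates[product_id].append(date)
        prices.append(price)
        return True

    def price_on(self, product_id, date):
        """Price in effect on `date`, or None if nothing was recorded yet."""
        dates = self._dates.get(product_id, [])
        i = bisect_right(dates, date)  # number of changes on or before `date`
        return self._prices[product_id][i - 1] if i else None

    def stored_rows(self):
        return sum(len(p) for p in self._prices.values())


log = PriceHistory()
log.record("p1", "2025-01-01", 9.99)
log.record("p1", "2025-01-02", 9.99)   # unchanged, not stored
log.record("p1", "2025-01-03", 8.49)

print(log.stored_rows())
print(log.price_on("p1", "2025-01-02"))
print(log.price_on("p1", "2024-12-31"))
```

The trade-off is that a skipped row no longer proves the product was observed that day, so a separate, much smaller crawl log is needed if observation gaps matter for the analysis.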

submitted by /u/Vivid_Stock5288

Looking For Third-party UK Company Data Providers

I’m looking for websites that offer free UK company lookups, that don’t use the gov.uk domain.

I’m not looking for ones like Endole, or Company Check.

submitted by /u/plaguedbyfoibles

Where To Get Labelled CBC Datasets For Machine Learning?

Hi there, I am working on a machine learning project to detect Primary Adrenal Insufficiency (Addison’s disease) based on blood sample data. Does anyone know where to get free CBC datasets for Addison’s patients, or any CBC datasets labelled with the disease?

submitted by /u/KaitoKid417

Rate This Dataset In The Drive Link …

I generated the dataset in 10 minutes … Please give ratings …

submitted by /u/Quirky-Ad-3072

I Can Generate Unlimited, World-class Synthetic Datasets On Demand – 100% Custom, Cleaner Than Most Real-world Data, Any Domain

Throwaway for obvious reasons, but I’ve spent the last 18 months quietly perfecting a pipeline that spits out synthetic data that consistently beats public benchmarks and even most private datasets in quality.

What I can do right now (literally same-day delivery in most cases):

Any domain: medical (EHR, radiology reports, MIMIC-like), legal, financial (LOBs, transactions, KYC), code, multilingual text, tabular, time-series, images + captions, instruction-following, agent trajectories, you name it

Scale: 10k–10M+ samples, whatever you need

submitted by /u/Quirky-Ad-3072

Where Do I Get A Good Dataset For Practicing

data analytics #data

submitted by /u/PirateMugiwara_luffy

Category: Datatards
