submitted by /u/Lewoniewski
Category: Datatards
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
I made an IPL dataset from the official IPL website. Check it out and upvote if you like it.
https://www.kaggle.com/datasets/robin5024/ipl-pointtable-2008-2025
submitted by /u/Mr_Writer_206
Hey all, I spent some time organizing the Epstein files to make transparency a little clearer. I need to tighten the data for organizations and people a bit more, but hopefully this is helpful for research in the interim.
submitted by /u/Vaughnatri
So I need footage of people walking hight for a graduation project, but it seems that this is hard data to get. I need advice on how to get it, or on what you would do if you were in my place. Thank you.
submitted by /u/mohamed_hi
I’m preparing a public dataset built from open retail listings. It includes timestamp, country, source URL, and field descriptions. But is there something more that shared datasets must have? Maybe sample size, crawl frequency, error rate? I’m trying to make it genuinely useful, not just another CSV dump.
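One way to package the extras mentioned above is a small machine-readable manifest shipped next to the CSV. The following is only a sketch: the class name `DatasetManifest`, the field names, and every value are hypothetical placeholders, not a standard and not taken from the actual dataset.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DatasetManifest:
    """Hypothetical manifest describing a shared dataset."""
    name: str
    row_count: int            # sample size
    crawl_frequency: str      # e.g. "daily" or "hourly"
    error_rate: float         # share of rows that failed validation
    fields: dict = field(default_factory=dict)  # column name -> description


# Placeholder values for illustration only.
manifest = DatasetManifest(
    name="retail-listings-example",
    row_count=0,
    crawl_frequency="daily",
    error_rate=0.0,
    fields={
        "timestamp": "UTC time the listing was fetched",
        "country": "Country of the listing",
        "source_url": "URL the record was collected from",
    },
)

# Serialize so the manifest can be published alongside the data file.
manifest_json = json.dumps(asdict(manifest), indent=2)
print(manifest_json)
```

Keeping this in a separate JSON file means readers can check sample size and crawl cadence without opening the data itself.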
submitted by /u/Vivid_Stock5288
Each dataset includes
- What technologies were detected (e.g. WordPress 4.5.3)
- The domain it was found on
- The page it was found on
- The IP address associated with the page
- Who owns the IP address
- The geolocation for that IP address
- The URLs found on the page
- The meta description tags for that page
- The size of the HTTP response
- What protocol was used to fulfill the HTTP request
- The date the page was crawled
September 2025: https://www.dropbox.com/scl/fi/0zsph3y6xnfgcibizjos1/sept_2025_jumbo_sample.zip?rlkey=ozmekjx1klshfp8r1y66xdtvx&e=2&st=izkt62t6&dl=0
You can find the full version of the October 2025 dataset here: https://versiondb.io
I hope you guys like it.
submitted by /u/Upper-Character-6743
Hi, I have a large cohort that I’m exploring characteristics for. However, it will only generate partial results due to its large size. For example, I have one million patients in my cohort, and I want to look at an outcome before and after an index event (e.g. homicide rate before and after an event). However, instead of showing me numbers for ALL 1 million patients, it only generates them from a base of about 500,000. Is there a way to get complete numbers for the actual one-million-patient cohort?
submitted by /u/iamnotaman2000
Curious about LLM prediction performance, I built this tool to create an auditable, transparent record of LLM (GPT-Models) stock forecasts. I know LLMs aren’t designed for predictions, but I think a huge amount of data can give more insights into their capabilities.
The core idea is methodology transparency:
Baseline: The app ignores the price at the request time. It uses the Actual Closing Price (T_0) as the non-negotiable baseline for all subsequent sequential trend checks.
Tracking: Accuracy is measured against both the Overall Trend (start to finish) and sequential Micro-Trends (step-by-step). Test a tracking run and critique the methodology:
https://glassballai.sumotrainer.com/main
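The baseline and tracking rules described above can be sketched roughly as follows. This is not the app's actual code: the function names are hypothetical, and it assumes a micro-trend "hit" means the predicted step direction matches the actual step direction, with the first step measured from the actual closing price T_0.

```python
from typing import List, Tuple


def direction(start: float, end: float) -> int:
    """Return +1 for a rise, -1 for a fall, 0 for no change."""
    return (end > start) - (end < start)


def score_forecast(
    t0_close: float, predicted: List[float], actual: List[float]
) -> Tuple[bool, float]:
    """Score a forecast against the overall trend and step-by-step micro-trends.

    t0_close is the actual closing price used as the baseline (T_0).
    Returns (overall_hit, micro_trend_accuracy).
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal length")

    # Overall trend: baseline to final step, predicted vs. actual.
    overall_hit = direction(t0_close, predicted[-1]) == direction(t0_close, actual[-1])

    # Micro-trends: compare each step's direction; both series start at T_0.
    hits = 0
    prev_pred, prev_actual = t0_close, t0_close
    for pred, act in zip(predicted, actual):
        if direction(prev_pred, pred) == direction(prev_actual, act):
            hits += 1
        prev_pred, prev_actual = pred, act

    return overall_hit, hits / len(predicted)
```

Anchoring both series at the same T_0 keeps the first micro-trend comparable even if the price at request time differed from the close.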
Seeking Feedback: Does using the Closing Price as the sequential baseline feel robust for this type of analysis?
Are there any other key input parameters (like specific news volume or market sentiment) I should be tracking?
(Note: This is an MVP on a temporary URL and is not financial advice.)
submitted by /u/aufgeblobt
Long story short, I can provide Betradar odds and historical odds (with timestamps). If you need them, you can DM me.
Coverage
soccer
Tennis
Basketball
Am. Football
Baseball
Boxing
MMA
The historical odds tracker essentially stores all odds changes in a match’s upcoming, live, and ended states on a second-by-second and millisecond-by-millisecond basis. An example chart is shown in the image.
Without historical odds, our coverage totals 58 sports:
"configured_sports": { "count": 58, "names": [ "novelties", "american_football", "baseball", "soccer", "tennis", "basketball", "cs2", "mma", "dota2", "f1", "golf", "ice_hockey", "valorant", "volleyball", "lol", "darts", "rugby_union", "boxing", "cricket", "ecricket", "table_tennis", "aussie_rules", "motor_sport", "aoe", "aov", "badminton", "cod", "cs2_duels", "dota2_duels", "ebasketballbots", "efootballbots", "esports", "efootball", "fifa", "fortnite", "futsal", "halo", "handball", "hearthstone", "kog", "ml", "nascar_camping_world_truck", "nascar_cup_series", "nascar_xfinity_series", "ebasketball", "nba2k", "nhl", "overwatch", "pubg", "pubg_mobile", "r6", "rocketleague", "squash", "sc1", "sc2", "stock_car_racing", "w3", "wr" ] }
submitted by /u/apalexxy
Hey! I am working on a project to make it easy for anyone to ask questions about data and want to use fun / interesting datasets to make the tool more appealing to folks and to help them understand how it works!
I am looking for quality datasets on specific topics, particularly Sports, Culture, and Politics.
Would anyone like to collaborate?
I am happy to pay for help on this 🙂
As you might know, it’s not as straightforward as taking Kaggle datasets (or a similar source) and just hosting them. These datasets are rarely complete or comprehensive.
You can check out the tool here to get a better idea!
DM me or comment here 🫡
submitted by /u/XavierPladevall
Looking for a B2C US list with a tilt toward finance, business and investing. Which websites delivered decent quality for you, and how was support and replacements? Real experiences wanted.
submitted by /u/Real_Jay_Dee
Hi, I’m building a model to generate step-by-step pencil portrait tutorials from a face photo. I need a small, high-quality dataset of face photo → 8 progressive sketch frames (or vector stroke sequences for faces). Ideally: 50–500 identities, neutral pose, consistent pose across steps, and cumulative stroke frames or stroke-ordered vector drawings.
If you have existing photo↔sketch data (CUFS, person-face-sketch data etc.) and are open to: (a) sharing vector/stroke info, or (b) helping infer stroke order for progressive frames, please reply or DM me. Will provide credit and/or co-authorship for contributors. Happy to pay for high-quality artist contributions (10–100 high-quality tutorials).
submitted by /u/Dizzy_Level455
I compiled and structured a global automotive specifications dataset covering more than 12,000 vehicle variants from over 100 brands, model years 1990–2025.
Each record includes:
- Brand, model, year, trim
- Engine specifications (fuel type, cylinders, power, torque, displacement)
- Dimensions (length, width, height, wheelbase, weight)
- Performance data (0–100 km/h, top speed, CO₂ emissions, fuel consumption)
- Price, warranty, maintenance, total cost per km
- Feature list (safety, comfort, convenience)
Available in CSV, JSON, and SQL formats. Useful for developers, researchers, and AI or data analysis projects.
GitHub (sample, details and structure): https://github.com/vbalagovic/cars-dataset
submitted by /u/Ok_Cucumber_131
I’m collecting data to analyze prices or rankings. Do you run scrapes at fixed intervals (daily/hourly), or trigger them on changes (like detected updates)? I’m exploring event-driven scraping but I’m not sure if it’s overengineering for most datasets. How do you handle temporal accuracy?
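A middle ground between the two approaches is to poll on a fixed schedule but store a new row only when the content actually changed, keeping first-seen and last-seen times for each version. Here is a minimal sketch of that idea; the function names and the in-memory `history` dict are hypothetical stand-ins for a real fetcher and database.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(payload: str) -> str:
    """Stable fingerprint of the scraped content."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record_if_changed(history: dict, key: str, payload: str, observed_at: datetime) -> bool:
    """Append a new version for `key` only if the content differs from the last one.

    Returns True when a new version was stored, False when only last_seen moved.
    """
    digest = content_hash(payload)
    versions = history.setdefault(key, [])
    if versions and versions[-1]["hash"] == digest:
        # Unchanged: extend the interval during which this version was observed.
        versions[-1]["last_seen"] = observed_at
        return False
    versions.append(
        {"hash": digest, "payload": payload, "first_seen": observed_at, "last_seen": observed_at}
    )
    return True


# Usage with made-up observations of one hypothetical product page.
history = {}
t1 = datetime(2025, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2025, 1, 2, tzinfo=timezone.utc)
t3 = datetime(2025, 1, 3, tzinfo=timezone.utc)
results = [
    record_if_changed(history, "sku-1", "price=10", t1),
    record_if_changed(history, "sku-1", "price=10", t2),
    record_if_changed(history, "sku-1", "price=12", t3),
]
```

With this layout, the real change time is bounded between the previous version's `last_seen` and the new version's `first_seen`, so temporal accuracy is limited by the polling interval rather than guessed.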
submitted by /u/Vivid_Stock5288
Introducing JFLEG-JA, a new Japanese language error correction benchmark with 1,335 sentences, each paired with 4 high-quality human corrections.
Inspired by the English JFLEG dataset, this dataset covers diverse error types, including particle mistakes, kanji mix-ups, and incorrect contextual usage of verbs, adjectives, and literary techniques.
You can use this for evaluating LLMs, few-shot learning, error analysis, or fine-tuning correction systems.
submitted by /u/Ok_Employee_6418
I’m looking for a free source of cannabis genomic data from recent years.
submitted by /u/zynbobguey
Hello,
I’ve been building a platform that reconstructs and displays SEC-filed financial statements (www.freefinancials.com). The backend is working well, but I’m now working through a data-standardization challenge.
Some companies report the same financial concept using different XBRL tags across periods. For example, one year they might use us-gaap:SalesRevenueNet, and the next year they switch to us-gaap:Revenues. This results in duplicated rows for what should be the same line item (e.g., “Revenue”).
Does anyone have experience normalizing or mapping XBRL tags across filings so that concept names remain consistent across periods and across companies? Any guidance, best practices, or resources would be greatly appreciated.
Thanks!
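One common starting point for the tag problem described above is a hand-maintained mapping from raw XBRL tags to a canonical line item, applied before rows are built. The sketch below assumes facts arrive as `(tag, period, value)` tuples; the mapping table, function name, and sample values are hypothetical, and only the two tags named in the post are included.

```python
from collections import defaultdict

# Hypothetical mapping table; extend it per concept as new tag variants appear.
CANONICAL_CONCEPT = {
    "us-gaap:SalesRevenueNet": "Revenue",
    "us-gaap:Revenues": "Revenue",
}


def normalize_facts(facts):
    """Group (tag, period, value) facts under one canonical concept per line item.

    Unmapped tags keep their original name. If two tags map to the same
    concept for the same period, the first value seen is kept.
    """
    merged = defaultdict(dict)
    for tag, period, value in facts:
        concept = CANONICAL_CONCEPT.get(tag, tag)
        merged[concept].setdefault(period, value)
    return {concept: dict(periods) for concept, periods in merged.items()}


# Made-up values for illustration only.
facts = [
    ("us-gaap:SalesRevenueNet", "FY2022", 100.0),
    ("us-gaap:Revenues", "FY2023", 120.0),
]
rows = normalize_facts(facts)
```

This collapses the two tags into a single "Revenue" row spanning both periods; a real pipeline would also need a rule for periods where both tags are reported with different values.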
submitted by /u/Ok-Access5317
Hi, I previously built a project for a hackathon and needed some open jobs data so I built some aggregators. You can find it in the readme.
submitted by /u/Own_Relationship9794
Hi guys, I need good dataset sources for my data analyst capstone project.
submitted by /u/ConcentrateMain1862
Sharing my processed archive of 100+ real estate + census metrics, broken down by zip code and date. I don’t want to promote, but I built it for a fun (and free) data visualization tool that’s linked in my profile. I’ve had a few people ask me for this data, since real estate data (at the zip code level) is really large and hard to process.
It took many hours to clean and process the data, but it has:
– home values going back to 2005 (broken down by home size)
– Rents per home size, dating 5 years back
– Many relevant census data points since 2009 I believe
– Home listing counts (+ listing prices, price cuts, price increases, etc.)
– Section 8 profitability per home size + various Section 8 metrics
– All in all about 120 metrics IIRC
It’s a tad abridged at <1 GB; the raw data is about 80 GB, but it’s gone through heavy processing (rounding, removing irrelevant columns, etc.). I have a larger dataset that’s about 5 GB with more data points, and I can share that later if anybody is interested.
Link to data: https://www.prop-metrics.com/about#download-data
submitted by /u/maps_can_be_fun
Hey r/datasets, if you’re into training AI that actually works in the messy real world, buckle up. An 18-year-old founder just dropped Egocentric-10K, a massive open-source dataset that’s basically a goldmine for embodied AI. What’s in it?
- 10K+ hours of first-person video from 2,138 factory workers worldwide.
- 1.08 billion frames at 30fps/1080p, captured via sneaky head cams (no staging, pure chaos).
- Super dense on hand actions: grabbing tools, assembling parts, troubleshooting—way better visibility than lab fakes.
- Total size: 16.4 TB of MP4s + JSON metadata, streamed via Hugging Face for easy access.
Why does this matter? Current robots suck at dynamic tasks because datasets are tiny or too “perfect.” This one’s raw, scalable, and licensed Apache 2.0, so it is free for researchers to train imitation learning models. Could mean safer factories, smarter home bots, or even AI surgeons that mimic pros. Eddy Xu (Build AI) announced it on X yesterday.
Grab it here: https://huggingface.co/datasets/builddotai/Egocentric-10K
submitted by /u/NotSuper-man
Hello, I was wondering if anyone might have any good ideas about how to go about getting data like this. I have already tried the Bureau of Transportation Statistics DB1B and T-100 data, but they don’t have anything on the intermediate stops of the itineraries.
So is there some other way to get data on which passengers at an airport are simply connecting on an itinerary that includes a connection (self-connections obviously excluded), and which passengers are originating or terminating at the airport?
Any help and ideas would be greatly appreciated. Thanks!
submitted by /u/Vyksendiyes
High-Quality USA Data Available — Fresh & Verified ✅
Hey everyone, I have access to fresh, high-quality USA data available in bulk. Packages start from 10,000 numbers and up. The data is clean, updated, and perfect for anyone who needs verified contact datasets.
🔹 Flexible quantities 🔹 Fast delivery 🔹 Reliable source
If you’re interested or need more details, feel free to DM me anytime.
Thanks!
submitted by /u/Alphaboi123
I scraped the top 100 products in a few categories daily for 30 days and got this chunky dataset with rank histories, prices, and reviews. What do I go after first? Maybe trend analysis, price elasticity, or review manipulation patterns. If you had this data, how would you start working on it?
submitted by /u/Vivid_Stock5288
Hey everyone,
I’ve got two big lists of songs that I need to compare:
• List 1: 3,509 songs
• List 2: 3,402 songs
Most of the songs appear in both lists, but I need to find which songs are in List 1 but not in List 2.
I’ve tried running it through ChatGPT, but I don’t have Pro, so I’m limited.
If someone can do this for me, I’d be willing to pay.
CSV files: https://drive.google.com/drive/folders/1VxLHnw9lfGhB-yOoZv_mcwNTGcrTF0dS
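A comparison like this is a set difference and needs only Python's standard library. The sketch below assumes both CSVs have header columns named `title` and `artist`; those names are a guess, since the files' actual layout is not shown, so adjust `columns` to match.

```python
import csv
import io


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences still match."""
    return " ".join(text.casefold().split())


def load_keys(csv_file, columns):
    """Map a normalized key tuple to the original row for every row in the CSV."""
    reader = csv.DictReader(csv_file)
    return {tuple(normalize(row[c]) for c in columns): row for row in reader}


def only_in_first(first_csv, second_csv, columns=("title", "artist")):
    """Return rows whose key appears in the first CSV but not in the second."""
    first = load_keys(first_csv, columns)
    second = load_keys(second_csv, columns)
    return [row for key, row in first.items() if key not in second]


# Tiny in-memory example; for real files pass open("list1.csv", newline="") instead.
list1 = io.StringIO("title,artist\nHey Jude,The Beatles\nYesterday,The Beatles\n")
list2 = io.StringIO("title,artist\nhey  jude,the beatles\n")
missing = only_in_first(list1, list2)
print([row["title"] for row in missing])
```

Exact matching after normalization will still miss variants such as "feat." credits or remaster suffixes, so the output is worth a quick manual check.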
submitted by /u/Vidwiz_