submitted by /u/brave_w0ts0n
Category: Datatards
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
A few days after the Nov 12th 2025 Epstein email dump went public, I pulled all the individual text files together, cleaned them, removed duplicates, and converted everything into a single standardized .jsonl dataset.
No PDFs, no images — this is text-only. The raw dump wasn’t structured: filenames were random, topics weren’t grouped, and keyword search barely worked. Names weren’t consistent, related passages didn’t use the same vocabulary, and there was no way to browse by theme.
So I built a structured version:
- merged everything into one JSONL file
- each line = one JSON object (9966 total entries)
- cleaned formatting + removed noise
- chunked text properly
- grouped the dataset into clusters (topic-based)
- added BM25 keyword search
- added simple topic-term extraction
- added entity search
- made a lightweight explorer UI on HuggingFace
🔗 HuggingFace explorer + dataset:
https://huggingface.co/spaces/cjc0013/epstein-semantic-explorer
JSONL structure (one entry per line):
{"id": 123, "cluster": 47, "text": "…"}
What you can do in the explorer:
- Browse clusters by topic
- Run BM25 keyword search
- Search entities (names/places/orgs)
- View cluster summaries
- See top terms
- Upload your own JSONL to reuse the explorer for any dataset
This is not commentary — just a structured dataset + tools for anyone who wants to analyze the dump more efficiently.
Please let me know if you encounter any errors. I will answer any questions about the dataset’s construction.
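For anyone who wants to work with the file outside the explorer, here is a minimal Python sketch of loading the JSONL, grouping entries by cluster, and ranking them with a textbook Okapi BM25 score. It assumes only the `id`/`cluster`/`text` fields shown above; the sample records, the whitespace tokenizer, the helper names, and the `k1`/`b` values are illustrative assumptions, not the explorer's actual implementation.

```python
import json
import math
from collections import Counter, defaultdict

# Stand-in for lines of the real .jsonl file (hypothetical sample records).
SAMPLE_LINES = [
    '{"id": 1, "cluster": 0, "text": "meeting schedule for monday"}',
    '{"id": 2, "cluster": 0, "text": "updated meeting agenda attached"}',
    '{"id": 3, "cluster": 1, "text": "dinner invitation for friday"}',
]

def load_jsonl(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

def group_by_cluster(entries):
    """Map cluster id -> list of entry ids."""
    clusters = defaultdict(list)
    for entry in entries:
        clusters[entry["cluster"]].append(entry["id"])
    return dict(clusters)

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real tools do more).
    return text.lower().split()

def bm25_search(entries, query, k1=1.5, b=0.75):
    """Return (id, score) pairs with score > 0, best first (Okapi BM25)."""
    docs = [tokenize(e["text"]) for e in entries]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many entries each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    results = []
    for entry, doc in zip(entries, docs):
        counts = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        if score > 0:
            results.append((entry["id"], score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

entries = load_jsonl(SAMPLE_LINES)
print(group_by_cluster(entries))
print(bm25_search(entries, "meeting agenda"))
```

To run it on the real dataset, pass an open file handle to `load_jsonl` instead of `SAMPLE_LINES`; iterating a text file yields one line at a time.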
submitted by /u/Either_Pound1986
Hey there! I’m wondering if there is a publicly available dataset on cancer statistics among European nations, similar to SEER in the US. Thanks!
submitted by /u/Stud_Muffin15
Hello, I’m looking for a dataset with a count response variable to apply Poisson regression models. I found the well-known Bike Sharing dataset, but it has been used by many people, so I ruled it out. While searching, I found another dataset, the Seoul Bike Sharing Demand dataset. It’s better in the sense that it hasn’t been used as much, but it’s not as good as the first one.
So I have the following question: could someone share a dataset suitable for Poisson regression, i.e., one with a count response variable that can be used as the dependent variable in the model? It doesn’t need to be related to bike sharing, but if it is, that would be even better for me.
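When comparing candidate datasets, one rough screen is whether the count response is close to equidispersed, since a Poisson model assumes the variance equals the mean. The sketch below computes the variance-to-mean ratio for a made-up list of counts; the numbers and the `dispersion_ratio` helper are purely illustrative and are not taken from either bike-sharing dataset.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; near 1 suggests Poisson-like counts."""
    return variance(counts) / mean(counts)

hourly_rentals = [3, 5, 4, 6, 2, 5, 4, 3]  # hypothetical count response
ratio = dispersion_ratio(hourly_rentals)
print(f"variance/mean = {ratio:.2f}")
```

A ratio well above 1 points to overdispersion, where a negative binomial or quasi-Poisson model is often a better fit. This is an unconditional check that ignores covariates, so it is only a first filter.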
submitted by /u/Yaguil23
I’ve processed all the text and image files (~25,000 document pages/emails) within the individual folders released last Friday into a two-column text file. I used Google’s Tesseract OCR library to convert JPG to text.
You can download it here: https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K
For each document, I’ve included the full path to the original Google Drive folder from the House Oversight Committee so you can link and verify contents. In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation).
submitted by /u/tensonaut
I’ve built a dataset of 100 million domains ranked by web authority and am releasing it publicly under the MIT license.
Dataset: https://github.com/WebsiteLaunches/top-100-million-domains
Stats:
- 100M domains ranked by authority
- Updated monthly (last: Nov 15, 2025)
- MIT licensed (free for any use)
- Multiple size tiers: 1K, 10K, 100K, 1M, 10M, 100M
- CSV format, simple ranked lists
Methodology: Rankings based on Common Crawl web graph analysis, domain age, traffic patterns, and site quality metrics from Website Launches data. Domains ordered from highest to lowest authority.
Potential uses:
- ML training data for domain/web classification
- SEO and competitive research
- Web graph analysis
- Domain investment research
- Large-scale web studies
Free and open. Feedback welcome.
submitted by /u/antiochIst
I’ve been building datasets from retail and job sites for a while. The hardest part isn’t crawling; it’s standardizing. Product specs, company names, job levels: nothing matches cleanly. Even after cleaning, every new source breaks the schema again. For those who publish datasets: how do you maintain consistency without rewriting your schema every month?
submitted by /u/Vivid_Stock5288
I’m working on creating a BI business geared specifically toward small supply chain businesses, but I need access to real-world supply chain databases to create some examples and practice on. I would love some guidance on this!
submitted by /u/DiabeticDays
BYO model. Regenerations won’t be pixel-perfect, and that’s OK.
submitted by /u/fukijama
I’m in a sex and gender class for school, and we have to interview a bunch of people for a paper and look at the differences in people’s perspectives based on their backgrounds. If you feel comfortable sharing a bit about yourself and answering any or all of these questions, I would greatly appreciate it. I will also message you if I quote you in my paper!
SLO 1: Define sex, gender, and gender identity and explain the relationship between these concepts.
- How are the concepts of sex, gender, and gender identity defined in psychology and sociology, how do they relate to each other, and why do you think these terms are misunderstood?
- Is it possible to be rid of gendered stereotypes, something that has occurred for centuries? How do we as a society have an impact on this negative perception?
- What does gender mean to you personally, and how do you think your experiences have shaped that understanding?
- Can you describe how you understand the differences between sex, gender, and gender identity, and how these aspects of identity have influenced your experiences or the way you see others?
- How do you think understanding the difference between sex and gender can help promote inclusion and equality? How do you think not understanding it affects a public or professional setting?
submitted by /u/lil_bag_a_fritos
I made an IPL dataset from the official IPL website. Check it out and upvote if you like it.
https://www.kaggle.com/datasets/robin5024/ipl-pointtable-2008-2025
submitted by /u/Mr_Writer_206
Hey all, I spent some time organizing the Epstein files to make transparency a little clearer. I need to tighten the data for organizations and people a bit more, but hopefully this is helpful for research in the interim.
submitted by /u/Vaughnatri
I need footage of people walking hight for a graduation project, but this seems to be hard data to get. I’d appreciate advice on how to get it, or on what you would do if you were in my place. Thank you.
submitted by /u/mohamed_hi
I’m preparing a public dataset built from open retail listings. It includes timestamp, country, source URL, and field descriptions. But is there something more that shared datasets must have? Maybe sample size, crawl frequency, error rate? I’m trying to make it genuinely useful, not just another CSV dump.
submitted by /u/Vivid_Stock5288
Each dataset includes:
- What technologies were detected (e.g. WordPress 4.5.3)
- The domain it was found on
- The page it was found on
- The IP address associated with the page
- Who owns the IP address
- The geolocation for that IP address
- The URLs found on the page
- The meta description tags for that page
- The size of the HTTP response
- What protocol was used to fulfill the HTTP request
- The date the page was crawled
September 2025: https://www.dropbox.com/scl/fi/0zsph3y6xnfgcibizjos1/sept_2025_jumbo_sample.zip?rlkey=ozmekjx1klshfp8r1y66xdtvx&e=2&st=izkt62t6&dl=0
You can find the full version of the October 2025 dataset here: https://versiondb.io
I hope you guys like it.
submitted by /u/Upper-Character-6743
Hi, I have a large cohort whose characteristics I’m exploring. However, it only generates partial results because of the cohort’s size. For example, I have one million patients in my cohort, and I want to look at an outcome before and after an index event (e.g., homicide rate before and after an event). Instead of showing me numbers for ALL 1 million patients, it only generates them from a base of about 500,000, roughly half. Is there a way to get complete numbers for the actual one-million-patient cohort?
submitted by /u/iamnotaman2000
Curious about LLM prediction performance, I built this tool to create an auditable, transparent record of LLM (GPT-Models) stock forecasts. I know LLMs aren’t designed for predictions, but I think a huge amount of data can give more insights into their capabilities.
The core idea is methodology transparency:
Baseline: The app ignores the price at the request time. It uses the Actual Closing Price (T_0) as the non-negotiable baseline for all subsequent sequential trend checks.
Tracking: Accuracy is measured against both the Overall Trend (start to finish) and sequential Micro-Trends (step-by-step). Test a tracking run and critique the methodology:
https://glassballai.sumotrainer.com/main
Seeking Feedback: Does using the Closing Price as the sequential baseline feel robust for this type of analysis?
Are there any other key input parameters (like specific news volume or market sentiment) I should be tracking?
(Note: This is an MVP on a temporary URL and is not financial advice.)
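To make the two accuracy measures concrete, here is a rough Python sketch of how an overall-trend check and sequential micro-trend checks against the T_0 closing-price baseline could be computed. The function names, the up/down/flat direction rule, and the sample prices are assumptions for illustration; this is one reading of the methodology described above, not the app's actual code.

```python
def direction(change):
    """Classify a price change as +1 (up), -1 (down) or 0 (flat)."""
    return (change > 0) - (change < 0)

def overall_trend_hit(t0_close, predicted, actual):
    """True if predicted and actual end on the same side of the T_0 close."""
    return direction(predicted[-1] - t0_close) == direction(actual[-1] - t0_close)

def micro_trend_accuracy(t0_close, predicted, actual):
    """Share of steps where the predicted step-to-step direction matches the actual one."""
    # Both paths start from the same actual closing price, not the request-time price.
    pred_path = [t0_close] + list(predicted)
    act_path = [t0_close] + list(actual)
    hits = sum(
        direction(pred_path[i + 1] - pred_path[i]) == direction(act_path[i + 1] - act_path[i])
        for i in range(len(predicted))
    )
    return hits / len(predicted)

t0_close = 100.0                   # hypothetical actual closing price (T_0)
predicted = [101.0, 103.0, 102.0]  # hypothetical LLM forecasts
actual = [100.5, 99.0, 101.5]      # hypothetical realized closes

print(overall_trend_hit(t0_close, predicted, actual))
print(micro_trend_accuracy(t0_close, predicted, actual))
```

The sketch assumes `predicted` and `actual` have the same length and are aligned step by step. Separating the two measures shows how a forecast can get the overall direction right while missing most of the intermediate moves.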
submitted by /u/aufgeblobt
Long story short: I can provide Betradar odds and historical odds (with timestamps). DM me if you need them.
Coverage
Soccer
Tennis
Basketball
Am. Football
Baseball
Boxing
MMA
The historical odds tracker stores all odds changes across a match’s upcoming, live, and ended states, at second-by-second and millisecond-by-millisecond resolution. An example chart is shown in the image.
Without historical odds, our coverage totals 58 sports:
"configured_sports": { "count": 58, "names": [ "novelties", "american_football", "baseball", "soccer", "tennis", "basketball", "cs2", "mma", "dota2", "f1", "golf", "ice_hockey", "valorant", "volleyball", "lol", "darts", "rugby_union", "boxing", "cricket", "ecricket", "table_tennis", "aussie_rules", "motor_sport", "aoe", "aov", "badminton", "cod", "cs2_duels", "dota2_duels", "ebasketballbots", "efootballbots", "esports", "efootball", "fifa", "fortnite", "futsal", "halo", "handball", "hearthstone", "kog", "ml", "nascar_camping_world_truck", "nascar_cup_series", "nascar_xfinity_series", "ebasketball", "nba2k", "nhl", "overwatch", "pubg", "pubg_mobile", "r6", "rocketleague", "squash", "sc1", "sc2", "stock_car_racing", "w3", "wr" ] }
submitted by /u/apalexxy
Hey! I am working on a project to make it easy for anyone to ask questions about data and want to use fun / interesting datasets to make the tool more appealing to folks and to help them understand how it works!
I am looking for quality datasets on specific topics, particularly sports, culture, and politics.
Would anyone like to collaborate?
I am happy to pay for help on this 🙂
As you might know, it’s not as straightforward as taking Kaggle datasets (or a similar source) and just hosting them. These datasets are rarely complete or comprehensive.
You can check out the tool here to get a better idea!
DM me or comment here 🫡
submitted by /u/XavierPladevall
Looking for a B2C US list with a tilt toward finance, business and investing. Which websites delivered decent quality for you, and how was support and replacements? Real experiences wanted.
submitted by /u/Real_Jay_Dee
Hi, I’m building a model to generate step-by-step pencil portrait tutorials from a face photo. I need a small, high-quality dataset of face photo → 8 progressive sketch frames (or vector stroke sequences for faces). Ideally: 50–500 identities, neutral pose, consistent pose across steps, and cumulative stroke frames or stroke-ordered vector drawings.
If you have existing photo↔sketch data (CUFS, person-face-sketch data etc.) and are open to: (a) sharing vector/stroke info, or (b) helping infer stroke order for progressive frames, please reply or DM me. Will provide credit and/or co-authorship for contributors. Happy to pay for high-quality artist contributions (10–100 high-quality tutorials).
submitted by /u/Dizzy_Level455