Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

[Project] FULL_EPSTEIN_INDEX: A Unified Archive Of House Oversight, FBI, DOJ Releases

TL;DR: I am aggregating all public releases regarding the Epstein estate (House Oversight docs, DOJ disclosures, flight logs, multimedia) into one repository. While I finish processing the data (OCR and Whisper transcription), I have opened my Dropbox for public access to the raw files.

This archive aims to be a unified resource for OSINT analysis and research. It expands on previous dumps by combining the recent November 2025 House Oversight releases with the DOJ’s “First Phase” declassification.

  • Note: I am still uploading some of the larger media files, so keep checking back. However, it currently contains ALL the raw PDFs from every source (FBI, House/Senate, DOJ, etc.), including the most recent (though heavily redacted) release.

To keep bots from scraping it, the Dropbox is password protected. The password is my GitHub username, theelderemo.

I am currently running a pipeline to process these files to make them fully searchable:

OCR: Extracting high-fidelity text from the raw PDFs.

Transcription: Using OpenAI Whisper to generate transcripts for all audio and video evidence.
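As a rough sketch of how the two steps above could be split up, the snippet below routes each file to an OCR queue or a transcription queue by extension. The extension sets and the `route` function are assumptions made for illustration; the post does not say how the actual pipeline decides.

```python
from pathlib import PurePath

# Hypothetical extension sets; the real archive may contain other formats.
OCR_EXTS = {".pdf"}
TRANSCRIBE_EXTS = {".mp3", ".wav", ".m4a", ".mp4", ".mov"}


def route(filename):
    """Return which processing step a file would go to, based on its extension."""
    ext = PurePath(filename).suffix.lower()
    if ext in OCR_EXTS:
        return "ocr"
    if ext in TRANSCRIBE_EXTS:
        return "transcribe"
    return "skip"


if __name__ == "__main__":
    for name in ["release_01.PDF", "interview.mp4", "readme.txt"]:
        print(name, "->", route(name))
```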

Once the processing is complete, the structured dataset will be hosted on Hugging Face, and I will be releasing a Gradio app to make searching the index user friendly.

Please Watch or Star the GitHub repository. That is where I will post the updates, the link to the final Hugging Face dataset, and the search app once they are live.

Github Repo

Dropbox with all files

Original Repo for 20k Emails (this contains the November dataset and Gradio search app)

Content warning: This repository contains graphic and highly sensitive material regarding sexual abuse, exploitation, and violence. It also contains unverified allegations. Discretion is strongly advised.

EDIT: Apparently subfolders are not being publicly shared for some reason, so only the top parent folder is shared in Dropbox. I’m cloning them to my Google Drive. Be patient with me, lol. I’ll update the Dropbox link to the Drive link once it’s done. It’s over 150 GB.

Here’s the link for the Google Drive
It is being updated by a Colab script that clones my Dropbox to the Drive, so each refresh will have new folders/docs.

For now, here’s individual share links for each subfolder:

https://www.dropbox.com/scl/fo/mu2ebqnutbehj5ix063hi/AO_gd0QCu7dopIc5KulYqcs?rlkey=eoqzz5a8x9v1qsjotxmwax8ed&st=7d5tjzjq&dl=0

https://www.dropbox.com/scl/fo/lhdne8ebxvih4z9y83aqj/ACFUeplO_SCiCYF6PLVQTNE?rlkey=miisoobzylco8hzhc8yjtfbim&st=3k6uha26&dl=0

https://www.dropbox.com/scl/fo/xmgoirs4n1cjobpu45wgo/AH-YxKPuoecKz2cvrV24xtA?rlkey=6dmiuieavbifgucvtmhxg5oz2&st=fm2lceeb&dl=0

https://www.dropbox.com/scl/fo/nommub0xf7yw1uvnzzu6s/ACPTR-QCmzRj_-YXUFnONws?rlkey=zf0e1l0tggxagphvl8z0qj1j2&st=hlsvrqf8&dl=0

https://www.dropbox.com/scl/fo/q4sjrvwfemg3uwx63kgiz/AP_HvwExmO7YxYD32Nixvwg?rlkey=ygb0w2ardd1vud5tknr2xf6zv&st=y0pyxhv3&dl=0

https://www.dropbox.com/scl/fo/va3f0oraph91wljz2dhst/AFkaQGsAPDWad4U9gg8_8Ag?rlkey=hjkyqs6q9hqjttf8dvot6c5w4&st=vd1f6rk1&dl=0

https://www.dropbox.com/scl/fo/k3hwoqmax72un20ok70cy/AHmkB7YPXV_6xRLtDRNxPVQ?rlkey=7ak8w1dm2iyzvjxuqjxd5qsoo&st=uroug8x1&dl=0

submitted by /u/Ok-District-1330

Help Me Figure Out What To Do With This Massive Israeli Car Data File I Stumbled Upon

Okay, so here’s the deal – I somehow ended up with this massive file that’s got like a million lines of what looks like Israeli car data. It’s all separated by these pipe characters (|) and has Hebrew writing mixed in. From what I can tell by looking at it, it’s got stuff about different cars – models, years, engine info, all that – but written out in Hebrew. Kinda wild.

02263039|0650|P|ñåáàøå éôï|0226|GP3ELCC|XV|XV|1.6 PREMIUM|5|14|2016|FB 16

02258339|0650|P|ñåáàøå éôï|0247|GP7ELUC|XV|XV|2.0I|5|14|2016|FB 20

02279939|0650|P|ñåáàøå éôï|0253|SJ5DL7C|FORESTER|FORESTER|2.0XS|5|14|2017|FB 20

02247639|0650|P|ñåáàøå éôï|0243|GP7ELTC|XV|XV|2.0 PREMIUM|5|14|2016|FB 20

01851239|0650|P|ñåáàøå éôï|0228|GP7ELUC|XV|XV|2.0I|1|14|2017|FB 20

What I’ve Figured Out:

  • Pipe-delimited format
  • Column 4: Hebrew vehicle descriptions (decodes to makes/models like Honda CR-V, Seat, BMW)
  • Column 12: Year (1999-2017+)
  • Column 13: Engine codes (G4LC = Hyundai/Kia 1.4L, etc.)
  • Columns 10-11: Likely cylinders and engine displacement
  • ISO-8859-8 encoding for Hebrew
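The findings above can be sketched in a few lines of Python. The sample rows look garbled because ISO-8859-8 bytes were displayed as Latin-1; reversing that turns `ñåáàøå éôï` into `סובארו יפן` ("Subaru Japan"). The field names (`description_he`, `engine_code`, and so on) are made-up labels for the columns described in the post, not an official schema. When reading the real file, opening it with `encoding="iso-8859-8"` would avoid the round trip entirely.

```python
import csv
import io

# Two rows copied from the sample above (mojibake left as posted).
SAMPLE = (
    "02263039|0650|P|ñåáàøå éôï|0226|GP3ELCC|XV|XV|1.6 PREMIUM|5|14|2016|FB 16\n"
    "02279939|0650|P|ñåáàøå éôï|0253|SJ5DL7C|FORESTER|FORESTER|2.0XS|5|14|2017|FB 20\n"
)


def fix_hebrew(text):
    """Undo the Latin-1 misreading of ISO-8859-8 bytes."""
    return text.encode("latin-1").decode("iso-8859-8")


def parse_rows(raw):
    """Split pipe-delimited lines into dicts (field names are hypothetical)."""
    reader = csv.reader(io.StringIO(raw), delimiter="|", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) < 13:
            continue  # skip blank or short lines
        rows.append({
            "record_id": fields[0],
            "description_he": fix_hebrew(fields[3]),  # column 4
            "model": fields[6],                       # column 7
            "trim": fields[8],                        # column 9
            "year": int(fields[11]),                  # column 12
            "engine_code": fields[12],                # column 13
        })
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row)
```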

Questions for the Community:

  1. Does anyone recognize this specific data format or structure?
  2. What industries would find this data most valuable?
  3. Any creative but legitimate applications for this type of automotive dataset?
  4. What are the best ways to process/enhance this data?
  5. Any Israeli-specific considerations I should know about?
  6. Has anyone worked with similar automotive data commercially?
  7. What might the other columns represent (1-3, 5-9)?

I have technical skills (Python, SQL, APIs) to work with this but need domain knowledge about what’s actually valuable here and how to properly interpret the structure.

Not looking to share the full dataset publicly, but happy to provide more samples if helpful for analysis. Interested in legitimate applications and technical insights.

Thanks for any help!

submitted by /u/Only1_abdou

Out Of Curiosity, How Much Would This Mortgage Dataset Be Worth?

In my past job, and I want to be as vague as possible, there was a need for data manipulation/migration/backup numerous times over the course of, say, 2 years. There were almost no safety standards in place for handling the data. I couldn’t believe some of the tasks management assigned me; for example, I was supposed to back the data up on my local machine temporarily.

I don’t want to go more into detail and possibly get anyone (myself included) in trouble.

I was just curious: how much would the data be worth on the open (and possibly black) market? I had no intention of betraying anyone, but I have wondered about this for a couple of years now, in awe of how much management was risking by trusting several people without any protocols in place. I am pretty sure our contracts had no clauses about leaking data, etc.

The data contained about 5,000-7,000 mortgage records spanning 5-7 years, including the entire screening process (a very complex data model): the applicants’ health reports based on their medical records, their specific and verified assets and liquidity, verified income, liabilities, property information, banking information, and contact information. Anything that would be required in a mortgage screening process was basically in the dataset. Lots of sensitive and personal information.

I don’t want to specify the country exactly, you may consider it was either USA, UK, or Canada.

And just to clarify, I would never do anything illegal with the data as I appreciated the people and had no intention of going to jail.

submitted by /u/John200xw

Weekly Pricing Snapshots For 500+ Online Brands (Free, MIT Licensed)

I’ve been working on a dataset that captures weekly pricing behavior from online brand storefronts.

What it is:

– Weekly snapshots of pricing data from 500+ DTC and e-commerce brands

– Structured schema: current price, original price, discount percentage, category

– Historical comparability (same schema across all snapshots)

– MIT licensed
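To make the schema concrete, here is a minimal sketch of what one snapshot record might look like. The class and field names are assumptions based on the four schema items listed above (the repo documentation is the authority on the real layout), and the discount is derived from the two prices purely as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSnapshot:
    """One product observation in one weekly snapshot (hypothetical layout)."""
    brand: str
    category: str
    snapshot_week: str      # e.g. an ISO week label; format is an assumption
    current_price: float
    original_price: float

    @property
    def discount_pct(self):
        """Discount relative to the original price, in percent."""
        if self.original_price <= 0:
            return 0.0
        saved = self.original_price - self.current_price
        return round(saved / self.original_price * 100, 2)


if __name__ == "__main__":
    row = PriceSnapshot("ExampleBrand", "apparel", "2025-W01", 80.0, 100.0)
    print(row.discount_pct)
```

Keeping the same fields in every snapshot is what makes week-over-week comparison straightforward.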

What it’s for:

– Pricing analysis and benchmarking

– Market research on e-commerce behavior

– Academic research on retail pricing dynamics

– Building models that need consistent pricing signals

What it’s not:

– A product catalog (it’s behavioral data, not inventory)

– Real-time (weekly cadence, not live feeds)

– Complete (consistent sample > exhaustive coverage)

The repo has full documentation on methodology, schema, and limitations. First data release is coming soon.

GitHub: https://github.com/mranderson01901234/online-brand-pricing-snapshots

Source and full methodology: https://projectblueprint.io/datasets

submitted by /u/operastudio

Esports DFS Dataset: CS2 Match Stats + Player Game Logs + Prop Outcomes (hit/miss)

I built an esports DFS dataset/API pipeline and I’m releasing a sample dataset from it.

What’s inside (CS2):

  • Fixtures (upcoming + completed, any date)
  • Box scores + per-player match stats
  • Player game logs
  • Prop outcomes grading (hit/miss/push)
  • Player images + team logos (media fields included)
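For anyone unfamiliar with prop grading, a minimal sketch of the hit/miss/push idea follows. It assumes the common over/under convention (a result equal to the line is a push); `grade_prop` and its parameters are hypothetical names, and the API's own grading rules may differ.

```python
def grade_prop(line, actual, side="over"):
    """Grade a prop as 'hit', 'miss', or 'push' under over/under rules."""
    if side not in ("over", "under"):
        raise ValueError("side must be 'over' or 'under'")
    if actual == line:
        return "push"
    went_over = actual > line
    return "hit" if went_over == (side == "over") else "miss"


if __name__ == "__main__":
    print(grade_prop(18.5, 21))            # over 18.5 kills, player got 21
    print(grade_prop(18.5, 21, "under"))
    print(grade_prop(20, 20))
```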

Trimmed JSON:

{
  "sport": "cs2",
  "fixture_id": "fix_144592",
  "event_time": "2025-11-30T10:00:00Z",
  "competition": "DraculaN #4: Open Qualifier",
  "team1": "Mousquetaires",
  "team2": "Young Ninjas",
  "metadata": { "format": "bestOf3", "maps": ["Inferno", "Mirage", "Nuke"] }
}
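A quick sketch of reading a fixture like this in Python, using only the standard library. The record is the trimmed sample above retyped with straight quotes; nothing here calls the actual API.

```python
import json
from datetime import datetime

RAW = """
{
  "sport": "cs2",
  "fixture_id": "fix_144592",
  "event_time": "2025-11-30T10:00:00Z",
  "competition": "DraculaN #4: Open Qualifier",
  "team1": "Mousquetaires",
  "team2": "Young Ninjas",
  "metadata": { "format": "bestOf3", "maps": ["Inferno", "Mirage", "Nuke"] }
}
"""

fixture = json.loads(RAW)

# Older Python versions do not accept a trailing "Z" in fromisoformat,
# so swap it for an explicit UTC offset first.
event_time = datetime.fromisoformat(fixture["event_time"].replace("Z", "+00:00"))

if __name__ == "__main__":
    print(fixture["team1"], "vs", fixture["team2"])
    print(event_time.isoformat(), fixture["metadata"]["maps"])
```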

Disclosure: I run KashRock (the API behind this).

If you’re building a bot/dashboard/model, comment “key” and I’ll send access.

submitted by /u/Apprehensive_Ice8314

How Does Your Organization Find Outsourcing Vendors For Data Labeling?

I’m the founder of a data labeling platform startup based in a Southeast Asian country. Since the beginning, we’ve worked with two major clients from the public sector (locally), providing both a self-hosted end-to-end solution and data labeling services. Their requirements are often broad and sometimes very niche (e.g., geographical data, medical data, etc.). Many times, these requirements don’t follow standardized contracts—for example, they might request non-Hugging Face-compatible outputs or even Excel files instead of JSON due to security concerns.

While we’ve been profitable and stable, we’re looking to pivot into the international market in the long term (B2B focus) rather than remaining exclusively in B2G.

Because of the strict requirements from government clients, our data labeling team is highly skilled. For context, our project leads include ex-team leaders from big tech companies, and we enforce a rigorous QA process. This has made us unaffordable within our local market, so we’re hoping to expand internationally.

However, after spending around $10,000 on a local agency to run paid ads, we didn’t generate useful leads or convert any users. I understand that our product is challenging to market, but I’d like to hear from others who have faced similar issues.

If your organization needs a data labeling vendor, where do you typically look? Google? LinkedIn? Word of mouth?

submitted by /u/not_apply_yet

Embeddings For The Wikipedia Link Graph

Hi, I am looking for embeddings of the links in English Wikipedia pages. The version I currently have is more than a year out of date and only includes a limited number of entity types.

Does anyone here have experience using these or training their own? Training looks like it would be quite expensive, so I want to make sure I’ve explored all other options first.

submitted by /u/Useful-Pride1035

Be Honest: Which AI Tool Do You Actually Use Daily?

I’m genuinely curious about the AI tools people actually use every day. There are thousands of AI products out there, but there’s a big gap between the tools people talk about and the ones they truly rely on in their daily workflow.

So here’s my question:

If you used an AI tool today:

What did you use it for? What made it stick?

For example, I use Supaboard every single day to help with my analytics and reporting work. Before Supaboard, I depended heavily on my tech team for this. What made Supaboard “sticky” for me is that it lets me do work I was already doing, just faster and without the back-and-forth.

I also use the latest version of ChatGPT daily for writing, ideation, quick research, and thinking through problems.

What makes it stick is how naturally it fits into my workflow: it’s fast, flexible, and helps me move from idea to execution without friction.

I’m not looking for promo links or marketing pitches, just genuine recommendations for tools you personally find useful and would confidently recommend to others.

Thanks in advance!

submitted by /u/Ok-Friendship-9286

DataSetIQ Python Library – Millions Of Datasets In Pandas

Sharing datasetiq v0.1.2 – a lightweight Python library that makes fetching and analyzing global macro data super simple.

It pulls from trusted sources like FRED, IMF, World Bank, OECD, BLS, and more, delivering data as clean pandas DataFrames with built-in caching, async support, and easy configuration.

### What My Project Does

datasetiq is a lightweight Python library that lets you fetch and work with millions of global economic time series from trusted sources like FRED, IMF, World Bank, OECD, BLS, US Census, and more. It returns clean pandas DataFrames instantly, with built-in caching, async support, and simple configuration. It is perfect for macro analysis, econometrics, or quick prototyping in Jupyter.

Python is central here: the library is built on pandas for seamless data handling, async for efficient batch requests, and integrates with plotting tools like matplotlib/seaborn.

### Target Audience

Primarily aimed at economists, data analysts, researchers, macro hedge funds, central banks, and anyone doing data-driven macro work. It’s production-ready (with caching and error handling) but also great for hobbyists or students exploring economic datasets. Free tier available for personal use.

### Comparison

Unlike general API wrappers (e.g., fredapi or pandas-datareader), datasetiq unifies multiple sources (FRED + IMF + World Bank + 9+ others) under one simple interface, adds smart caching to avoid rate limits, and focuses on macro/global intelligence with pandas-first design. It’s more specialized than broad data tools like yfinance or quandl, but easier to use for time-series heavy workflows.

### Quick Example

    import datasetiq as iq

    # Set your API key (one-time setup)
    iq.set_api_key("your_api_key_here")

    # Get data as pandas DataFrame
    df = iq.get("FRED/CPIAUCSL")

    # Display first few rows
    print(df.head())

    # Basic analysis
    latest = df.iloc[-1]
    print(f"Latest CPI: {latest['value']} on {latest['date']}")

    # Calculate year-over-year inflation
    df['yoy_inflation'] = df['value'].pct_change(12) * 100
    print(df.tail())
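For readers who want to see what the year-over-year step computes, here is a dependency-free sketch of the same arithmetic: each monthly value is compared with the value 12 periods earlier, `(v_t / v_{t-12} - 1) * 100`. The input series is synthetic and only for illustration; it is not real CPI data, and the function name is made up.

```python
def yoy_percent_change(values, periods=12):
    """Percent change versus the value `periods` steps earlier.

    The first `periods` entries have no earlier value to compare with,
    so they are returned as None.
    """
    out = []
    for i, value in enumerate(values):
        if i < periods:
            out.append(None)
        else:
            previous = values[i - periods]
            out.append((value / previous - 1.0) * 100.0)
    return out


if __name__ == "__main__":
    synthetic_index = [100.0 + i for i in range(14)]  # 100, 101, ..., 113
    for month, change in enumerate(yoy_percent_change(synthetic_index)):
        print(month, change)
```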

Links & Resources

submitted by /u/dsptl

Sales Analysis Yearly Report- Help A Newbie

Hello all, hope everyone is doing well.

I just started a new job and have a sales report coming up. Is there anyone who’s into sales data who can tell me what metrics and visuals I can add to get more out of this kind of data? I have done some analysis and want some input from experts. The data is transaction-level and covers 1 year.

Thank you in advance

submitted by /u/Afraid-Sound5502

[Dataset] Multi-Asset Market Signals Dataset For ML (leakage-safe, Research-grade)

I’ve released a research-grade financial dataset designed for machine learning and quantitative research, with a strong focus on preventing lookahead bias.

The dataset includes:

– Multi-asset daily price data

– Technical indicators (momentum, volatility, trend, volume)

– Macroeconomic features aligned by release dates

– Risk metrics (drawdowns, VaR, beta, tail risk)

– Strictly forward-looking targets at multiple horizons

All features are computed using only information available at the time, and macro data is aligned using publication dates to ensure temporal integrity.
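One common way to get this kind of alignment is an "as-of" lookup: for each trading date, use the most recent macro value whose release date is on or before that date. The sketch below illustrates the idea with the standard library; it is not taken from the dataset's pipeline, the numbers are invented, and treating a same-day release as already known is an assumption (a stricter setup would require the release to be before the trading date).

```python
from bisect import bisect_right
from datetime import date


def latest_known(releases, as_of):
    """Return the last value released on or before `as_of`, else None.

    `releases` is a list of (release_date, value) pairs sorted by release_date.
    """
    release_dates = [released for released, _ in releases]
    idx = bisect_right(release_dates, as_of)
    return releases[idx - 1][1] if idx else None


if __name__ == "__main__":
    # Invented figures, keyed by publication date rather than reference month.
    releases = [(date(2024, 1, 11), 3.4), (date(2024, 2, 13), 3.1)]
    for day in (date(2024, 1, 5), date(2024, 2, 12), date(2024, 2, 13)):
        print(day, latest_known(releases, day))
```

Keying on the publication date means a model trained on a given day never sees a figure that had not been released yet.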

The dataset follows a layered structure (raw → processed → aggregated), with full traceability and reproducible pipelines. A baseline, leakage-safe modeling notebook is included to demonstrate correct usage.

The dataset is publicly available here:

Kaggle link:

https://www.kaggle.com/datasets/DIKKAT_LINKI_BURAYA_YAPISTIR

Feedback and suggestions are very welcome.

submitted by /u/subcomandante_65

Github Top Projects From 2013 To 2025 (423,098 Entries)

Introducing the github-top-projects dataset: A comprehensive dataset of 423,098 GitHub trending repository entries spanning 12+ years (August 2013 – November 2025).

This dataset tracks the evolution of GitHub’s trending repositories over time, offering insights into software development trends across programming languages and domains.

submitted by /u/Ok_Employee_6418

KashRock API Is In Public Beta — Normalized Player Props + DFS + Esports + Odds (looking For Testers)

Disclosure: I’m the developer of KashRock (this is my project).

I’m sharing a normalized sports betting markets dataset/API that unifies player props, main markets, esports props, and traditional odds across multiple books (DFS + sportsbooks). The core value is canonicalization: one stat key, one player name, consistent IDs across books (so merging/joining across sources is straightforward). Some records also include bet links.
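To illustrate what canonicalization means in practice, here is a tiny sketch that maps book-specific stat labels onto one key so records from different sources can be joined. The alias table and function name are invented for this example and are not KashRock's actual scheme.

```python
# Hypothetical alias table: each book's label -> one canonical stat key.
STAT_ALIASES = {
    "pts": "points",
    "points": "points",
    "player points": "points",
    "reb": "rebounds",
    "rebounds": "rebounds",
    "total rebounds": "rebounds",
}


def canonical_stat(label):
    """Normalize case and whitespace, then look up the canonical key."""
    key = " ".join(label.lower().split())
    return STAT_ALIASES.get(key)


if __name__ == "__main__":
    for raw in ["PTS", "Player  Points", "Total Rebounds", "steals"]:
        print(repr(raw), "->", canonical_stat(raw))
```

Unknown labels come back as `None`, which makes unmapped edge cases easy to spot before a join.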

What’s included

  • Player props + main markets
  • Esports props
  • Traditional odds
  • DFS books (PrizePicks, Underdog, ParlayPlay, etc.)
  • Sportsbooks (bet365, Pinnacle, Hard Rock, Bovada, and more)

What I want feedback on (from dataset users)

  • Schema/field naming (what you’d change to make it easier to use)
  • Missing identifiers you need for joins (event/team/player IDs)
  • Any normalization edge cases you want covered

Docs / access: https://api.kashrock.com/docs#/

submitted by /u/Apprehensive_Ice8314

How Do I Scrape Data From A Subreddit?

Hey everyone, I am new to the subreddit here but I have looked and looked and am not able to find a straight answer.

I am a master’s student who needs the data from a particular subreddit (r/antiwork). Part of it is available on Kaggle, but I need the latest posts as well. I know there have been some changes to the Reddit API rules, and Pushshift is no longer available… Is there a way I can get more data?

I am using R and have tried using the RedditExtractoR package but that only gives me about 250 posts at once. Any tips would be really helpful. Thank you!

submitted by /u/Legitimate-Bite4801

Any Recs For Solid Data Analysis Tools That Don’t Leak My Info?

I’m hunting for tools to help crunch data without the manual headache. What are you guys actually using for deep analysis, especially for mixing messy Excel sheets with PDFs?

Edit: I’ve messed around with a few—ChatGPT is decent for basic formulas, and Infinisynapse has been a game changer. It’s pretty sick because it handles cross-source analysis locally on my machine, so I can scrape web data straight into my DB without worrying about privacy leaks.

submitted by /u/MongWonP

How Do You Decide When A Messy Dataset Is “good Enough” To Start Modeling?

Lately I’ve been jumping between different public datasets for a side project, and I keep running into the same question: at what point do you stop cleaning and start analyzing?

Some datasets are obviously noisy – duplicated IDs, half-missing columns, weird timestamp formats, etc. My usual workflow is pretty standard: Pandas profiling → a few sanity checks in a notebook → light exploratory visualizations → then I try to build a baseline model or summary. But I’ve noticed a pattern: I often spend way too long chasing “perfect structure” before I actually begin the real work.
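For context, the kind of sanity check meant here can be as small as the sketch below: count duplicated IDs and the share of missing values per column. It works on plain dicts so it runs without pandas; the function name, the sample rows, and the choice to treat empty strings as missing are all assumptions for illustration.

```python
from collections import Counter


def quick_profile(rows, id_field):
    """Return (duplicated ids, fraction of missing values per column)."""
    if not rows:
        return [], {}
    id_counts = Counter(row.get(id_field) for row in rows)
    duplicate_ids = sorted(
        key for key, n in id_counts.items() if key is not None and n > 1
    )
    columns = sorted({col for row in rows for col in row})
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, "")) / len(rows)
        for col in columns
    }
    return duplicate_ids, missing


if __name__ == "__main__":
    sample = [
        {"id": 1, "ts": "2024-01-01"},
        {"id": 1, "ts": ""},
        {"id": 2},
    ]
    print(quick_profile(sample, "id"))
```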

I tried changing the process a bit. I started treating the early phase more like a rehearsal. I’d talk through my reasoning out loud, use GPT or Claude to sanity-check assumptions, and occasionally run mock explanations with the Beyz coding assistant to see if my logic held up when spoken. This helped me catch weak spots in my cleaning decisions much faster. But I’m still unsure where other people draw the line.
How do you decide:

  • when the cleaning is “good enough”?
  • when to switch from preprocessing to actual modeling?
  • what level of missingness/noise is acceptable before you discard or rebuild a dataset?

Would love to hear how others approach this, especially for messy real-world datasets where there’s no official schema to lean on. TIA!

submitted by /u/jinxxx6-6

Seeking Tips For A Paid Dataset Of Twitter (X) High-follower Count Contact Info / Emails

I operate the Unofficial Twitter (X) Discord with 3400 members, and in 2026 we plan to begin hosting guest speakers with large followings to share their content strategy, tools they use etc.

I’m looking for a paid index or database of verified emails and Twitter profiles to automate the invitation process. Tweetscraper has a conversion rate of 10% contact emails which is a start. Bright Data has profile data and PII like real names but no contact information.

Any tips for other paid or free solutions are greatly appreciated!

submitted by /u/Alan-Foster

I Did My First Project: Spotify Trends And Popularity Analysis

This is my first data analysis project, and I know it’s far from perfect.

I’m still learning, so there are definitely mistakes, gaps, or things that could have been done better — whether it’s in data cleaning, SQL queries, insights, or the dashboard design.

I’d genuinely appreciate it if you could take a look and point out anything that’s wrong or can be improved.
Even small feedback helps a lot at this stage.

I’m sharing this to learn, not to show off — so please feel free to be honest and direct.
Thanks in advance to anyone who takes the time to review it 🙏

github : https://github.com/1prinnce/Spotify-Trends-Popularity-Analysis

submitted by /u/1prinnce