Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it all for? World domination?

Zero-Touch Pipeline + Explorer For A Subset Of The Epstein-Related DOJ PDF Release (Hashed, Restart-Safe, Source-Path Traceable)

I ran an end-to-end preprocess on a subset of the Epstein-related files from the DOJ PDF release I downloaded (not claiming completeness). The goal is corpus exploration + provenance, not “truth,” and not perfect extraction.

Explorer: https://huggingface.co/spaces/cjc0013/epstein-corpus-explorer

Raw dataset artifacts (so you can validate / build your own tooling): https://huggingface.co/datasets/cjc0013/epsteindataset/tree/main


What I did

1) Ingest + hashing (deterministic identity)

  • Input: /content/TEXT (directory)
  • Files hashed: 331,655
  • Everything is hashed so runs have a stable identity and you can detect changes.
  • Every chunk includes a source_file path so you can map a chunk back to the exact file you downloaded (i.e., your local DOJ dump on disk). This is for auditability.
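The post doesn’t include code, but the hashing step is simple enough to sketch. A minimal version of deterministic content hashing along these lines; SHA-256 and the manifest layout are assumptions, not the author’s actual code:

```python
# Hedged sketch: deterministic file hashing for stable run identity.
import hashlib
from pathlib import Path

def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# A sorted walk keeps the manifest stable across runs, so changed or added
# files are detectable by diffing manifests.
manifest = {str(p): hash_file(p)
            for p in sorted(Path("/content/TEXT").rglob("*")) if p.is_file()}
```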

2) Text extraction from PDFs (NO OCR)

I did not run OCR.

Reason: the PDFs had selectable/highlightable text, so there’s already a text layer. OCR would mostly add noise.

Caveat: extraction still isn’t perfect because redactions can disrupt the PDF text layer, even when text is highlightable. So you may see:

  • missing spans
  • duplicated fragments
  • out-of-order text
  • odd tokens where redaction overlays cut across lines

I kept extraction as close to “normal” as possible (no reconstruction / no guessing redacted content). This is meant for exploration, not as an authoritative transcript.
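The post doesn’t name the extractor. A sketch of plain text-layer extraction with pypdf (an assumed library choice) that, like the pipeline, makes no attempt to reconstruct redacted content:

```python
# Text-layer extraction only, no OCR; pypdf is an assumption.
from pypdf import PdfReader

def extract_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    # Redaction overlays can still garble this layer; nothing is reconstructed.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```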

3) Chunking

  • Output chunks: 489,734
  • Stored with stable IDs + ordering + source path provenance.
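Chunk sizes and the ID scheme aren’t specified in the post; a hypothetical sketch of chunking with stable IDs, ordering, and source-path provenance:

```python
# Illustrative only: fixed-size chunks, deterministic chunk IDs, source path kept.
import hashlib

def chunk_doc(doc_id: str, source_file: str, text: str,
              size: int = 1000, overlap: int = 100) -> list[dict]:
    chunks = []
    for order_index, start in enumerate(range(0, len(text), size - overlap)):
        body = text[start:start + size]
        if not body:
            break
        chunks.append({
            # Hashing (doc_id, order_index) gives a stable, rerun-safe chunk_id.
            "chunk_id": hashlib.sha256(f"{doc_id}:{order_index}".encode()).hexdigest()[:16],
            "order_index": order_index,
            "doc_id": doc_id,
            "source_file": source_file,
            "text": body,
        })
    return chunks
```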

4) Embeddings

  • Model: BAAI/bge-large-en-v1.5
  • embeddings.npy shape (489,734, 1024) float32
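A sketch of how an array with that shape could be produced with sentence-transformers (the actual loader and batch size are assumptions):

```python
# Assumes sentence-transformers; bge-large-en-v1.5 outputs 1024-dim vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

texts = ["example chunk one", "example chunk two"]  # stand-in for the 489,734 chunks
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
vectors = model.encode(texts, batch_size=64, normalize_embeddings=True)
np.save("embeddings.npy", np.asarray(vectors, dtype=np.float32))  # (n_chunks, 1024)
```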

5) BM25 artifacts

  • bm25_stats.parquet
  • bm25_vocab.parquet
  • Full BM25 index object skipped at this scale (chunk_count > 50k), but vocab/stats are written.
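The artifact schemas aren’t documented in the post; one plausible way to write vocab/stats without materializing a full index object (tokenizer and column names are guesses):

```python
# Corpus-level BM25 statistics only; schema is an assumption.
import math
from collections import Counter
import pandas as pd

texts = ["example chunk one", "example chunk two"]   # stand-in corpus
docs = [t.lower().split() for t in texts]            # naive tokenizer, for illustration
n_docs = len(docs)

doc_freq = Counter(term for doc in docs for term in set(doc))
vocab = pd.DataFrame({"term": list(doc_freq), "doc_freq": list(doc_freq.values())})
vocab["idf"] = [math.log(1 + (n_docs - df + 0.5) / (df + 0.5)) for df in vocab["doc_freq"]]
vocab.to_parquet("bm25_vocab.parquet")

pd.DataFrame({"doc_len": [len(d) for d in docs]}).to_parquet("bm25_stats.parquet")
```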

6) Clustering (scale-aware)

HDBSCAN at ~490k points can take a very long time and is largely CPU-bound, so at large N the pipeline auto-switches to:

  • PCA → 64 dims
  • MiniBatchKMeans

This completed cleanly.
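The post doesn’t give a cluster count; a sketch of the PCA → MiniBatchKMeans path with illustrative parameters:

```python
# Scale-aware fallback: 1024-dim embeddings -> PCA(64) -> MiniBatchKMeans.
# n_clusters and batch_size are illustrative assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

embeddings = np.load("embeddings.npy")            # (489_734, 1024) float32
reduced = PCA(n_components=64).fit_transform(embeddings)
labels = MiniBatchKMeans(n_clusters=256, batch_size=4096,
                         random_state=0).fit_predict(reduced)
```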

7) Restart-safe / resume

If the runtime dies or I stop it, rerunning reuses valid artifacts (chunks/BM25/embeddings) instead of redoing multi-hour work.
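In spirit, the resume check is just “skip a stage if its artifact already exists and validates”; a minimal sketch, with file names mirroring the outputs listed below:

```python
# Minimal resume logic; a real pipeline would also verify stored content hashes
# against the manifest, not just file existence.
import os

STAGES = ["chunks.parquet", "bm25_vocab.parquet", "embeddings.npy",
          "cluster_labels.parquet"]

for artifact in STAGES:
    if os.path.exists(artifact):
        print(f"reusing {artifact}")
    else:
        print(f"rebuilding {artifact}")  # only the missing stage reruns
```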


Outputs produced

  • chunks.parquet (chunk_id, order_index, doc_id, source_file, text)
  • embeddings.npy
  • cluster_labels.parquet (chunk_id, cluster_id, cluster_prob)
  • bm25_stats.parquet
  • bm25_vocab.parquet
  • fused_chunks.jsonl
  • preprocess_report.json

Quick note on “quality” / bugs

I’m not a data scientist and I’m not claiming this is bug-free — including the Hugging Face explorer itself. That’s why I’m also publishing the raw artifacts so anyone can audit the pipeline outputs, rebuild the index, or run their own analysis from scratch: https://huggingface.co/datasets/cjc0013/epsteindataset/tree/main


What this is / isn’t

  • Not claiming perfect extraction (redactions can corrupt the text layer even without OCR).
  • Not claiming completeness (subset only).
  • Is deterministic + hashed + traceable back to source file locations for auditing.

submitted by /u/Either_Pound1986

Public APIs For Monthly CPI (Consumer Price Index) For All Countries?

Hi everyone,

I’m building a small CLI tool and I’m looking for public (or at least well-documented) APIs that provide monthly CPI / inflation data for as many countries as possible.

Requirements / details:

  • Coverage: ideally global (all or most countries)
  • Frequency: monthly (not just annual)
  • Data type:
    • CPI index level (e.g. 2015 = 100), not only inflation % YoY
    • Headline CPI is fine; bonus if core CPI is also available
  • Access:
    • Public or free tier available
    • REST / JSON preferred
  • Nice to have:
    • Country codes mapping (ISO / IMF / WB)
    • Reasonable uptime / stability
    • Historical depth (10–20+ years if possible)

One use case for the CLI tool: select a country, specify a past year, enter a nominal budget amount for that year, then query a provider’s API for the CPI data described above and compute the real value of that budget today.
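Whichever provider ends up answering the question, the budget computation itself reduces to a ratio of index levels; a tiny sketch with made-up numbers:

```python
# real value today = past nominal amount * (CPI_now / CPI_then); values made up.
def real_value(nominal_past: float, cpi_past: float, cpi_now: float) -> float:
    return nominal_past * (cpi_now / cpi_past)

# 1,000 units when the index stood at 92.4, revalued with the index at 118.7:
print(round(real_value(1000.0, cpi_past=92.4, cpi_now=118.7), 2))  # 1284.63
```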

Are there reliable data providers or APIs (public or freemium) that expose monthly CPI data globally?

Thanks!

submitted by /u/D3vil0p

How Do You Flag Low-effort Responses That Aren’t Bots?

Bot detection is relatively straightforward these days (honeypots, timestamps, etc.). But I’m struggling with a different data quality issue: The “Bored Human.”

These are real people who technically pass the bot checks but select “C” for every answer or type “good” in every text box just to finish.

When cleaning a new dataset, what are your heuristics for flagging these? Do you look for standard deviation in their answers (straight-lining), or do you rely on minimum character counts for open text?
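For concreteness, a hedged pandas sketch of both heuristics; column names and thresholds are placeholders, not a recommendation:

```python
import pandas as pd

df = pd.DataFrame({
    "q1": [3, 3, 1],
    "q2": [3, 3, 5],
    "q3": [3, 3, 2],
    "open_text": ["good", "The checkout flow confused me because...", "good"],
})

likert = df[["q1", "q2", "q3"]]
df["straight_liner"] = likert.std(axis=1) == 0           # zero variance = straight-lining
df["low_effort_text"] = df["open_text"].str.len() < 10   # minimum character count
df["flagged"] = df["straight_liner"] | df["low_effort_text"]
print(df[["straight_liner", "low_effort_text", "flagged"]])
```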

submitted by /u/EnergyBrilliant540

[PAID] FragDB: 119K Fragrances, 7.2K Brands, 2.7K Perfumers — Free Sample On GitHub & Kaggle

Disclosure: I’m the creator of FragDB. The sample is free and MIT licensed. The full database is a paid product.

I’m releasing a structured fragrance database with a free sample for the community.

What’s in the database

File            Records    Fields
fragrances.csv  119,000+   28
brands.csv      7,200+     10
perfumers.csv   2,700+     11

Data highlights

Fragrances include:

  • Notes pyramid (top/mid/base layers with ingredient names)
  • Accords with strength percentages (woody:100, amber:85, etc.)
  • Community ratings (19.8M total votes)
  • Longevity & sillage votes (9.3M and 10.1M respectively)
  • Season suitability (winter/spring/summer/fall percentages)
  • “People also like” recommendations

Brands include:

  • Country of origin
  • Parent company (LVMH, Kering, etc.)
  • Logo URLs
  • Official websites

Perfumers include:

  • Professional status (Master Perfumer, etc.)
  • Current and previous employers
  • Education background
  • Biography

Technical specs

  • Format: Pipe-delimited CSV
  • Encoding: UTF-8
  • Relational structure via IDs (fragrances → brands, fragrances → perfumers)
  • Year range: 1533–2026

Free sample

The sample includes 10 fragrances (Chanel, Dior, Tom Ford, YSL, etc.) with matching brands and perfumers — enough to test your pipelines and see the data quality.

Links

Quick start

```python
import pandas as pd

fragrances = pd.read_csv("fragrances.csv", sep="|")
brands = pd.read_csv("brands.csv", sep="|")
perfumers = pd.read_csv("perfumers.csv", sep="|")

# Join tables: the brand field embeds the brand ID after a semicolon
fragrances["brand_id"] = fragrances["brand"].str.split(";").str[1]
df = fragrances.merge(brands, left_on="brand_id", right_on="id",
                      suffixes=("", "_brand"))  # keeps fragrance "name", renames brand's to "name_brand"

print(df[["name", "name_brand", "country", "rating"]])
```

Happy to answer any questions about the data structure.

submitted by /u/FragDBnet

Music Listening Data – Data From ~500k Users

Hi everyone, I released this dataset on Kaggle a couple months ago and thought it’d be appreciated here.

This dataset has the top 50 artists, tracks, and albums for each user, along with playcounts and MusicBrainz IDs. All data is anonymized, of course. It’s super interesting for analyzing listening patterns.

I made a notebook that creates a sort of “listening map” of the most popular artists, but there’s so much more that can be done with the data. LMK what you guys think!

submitted by /u/Agile_Mortgage_2013

Bamboo Filing Cabinet: Vietnam Elections (open, Source-linked Datasets + Site)

TL;DR: Open, source-linked Vietnam election datasets (starting with NA15-2021) with reproducible pipelines + GitHub Pages site; seeking source hunters and devs.

Hi all,

I want to share Vietnam Elections, a project I’ve been working on to make Vietnam election data more accessible, archived, and fully sourced.

The code for both the site and the data is on GitHub. The pipeline is provenance-first: raw sources → scripts → JSON exports, and every factual field links back to a source URL with retrieval timestamps.
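To make “every factual field links back to a source URL” concrete, a record in such an export might look like the following; field names and URL are hypothetical illustrations, not the project’s actual schema:

```python
# Hypothetical provenance-first export record; not the real schema.
record = {
    "candidate": "Nguyen Van A",
    "constituency_unit": "Hanoi, Unit 1",
    "votes": 123456,
    "provenance": {
        "source_url": "https://example.gov.vn/results/unit-1",  # placeholder URL
        "retrieved_at": "2021-06-10T08:00:00Z",
    },
}
```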

Data access: the exported datasets live in public/data/ within the repo.

If anyone has been interested in this data before, I think you may have been stymied by the lack of English-language information, slow or buggy websites, and data soft-hidden behind PDFs.

So far I’ve mapped out the 2021 National Assembly XV election in anticipation of the coming 2026 Vietnamese legislative election. Even with only one election, there are already a bunch of interesting stats. For example, did you know that in 2021:

  1. …the smallest gap between a winner and a loser in a constituency was only 197 votes, representing a 0.16% gap?
  2. …8 people born in 1990 or later won a seat, with 7 of them being women?
  3. …2 candidates only had middle school education?
  4. …1 person won, but was not confirmed?

I’m looking for contributors or anyone interested in building this project as I want to map out all the elections in Vietnam’s history, primarily:

  1. Source hunters (no coding): help find official/public source pages or PDFs (candidate lists, results tables, constituency/unit docs) — even just one link helps.
  2. Devs: help automate collection + parsing (HTML/PDF → structured tables), validation, and reproducible builds.

For corrections or contributions, the best starting points are the GitHub Issues or the anonymous form.

You might ask, “what is this Bamboo Filing Cabinet?” It’s the umbrella GitHub organization (org page here) I created to store and make accessible Vietnam-related datasets. It’s aiming to be community-run, not affiliated with any government agency, and focuses on provenance-first, reproducible, neutral datasets with transparent change history. If you have ideas for other Vietnam-related datasets that would fit under this umbrella, please reach out.

submitted by /u/thanhoangviet1996

30,000 Human CAPTCHA Interactions: Mouse Trajectories, Telemetry, And Solutions

Just released the largest open-source behavioral dataset for CAPTCHA research on Hugging Face. Most existing datasets only provide the solution labels (image/text); this dataset includes the full cursor telemetry.

Specs:

  • 30,000+ verified human sessions.
  • Features: Path curvature, accelerations, micro-corrections, and timing.
  • Tasks: Drag mechanics and high-precision object tracking (harder than current production standards).
  • Source: Verified human interactions (3 world records broken for scale/participants).

Ideal for training behavioral biometric models, red-teaming anti-bot systems, or researching human-computer interaction (HCI) patterns.
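As a starting point for working with the telemetry, a sketch of deriving the listed features from raw (t, x, y) samples; the dataset’s actual column layout is an assumption here:

```python
import numpy as np

# Five hypothetical telemetry samples: time (s) and cursor position (px).
t = np.array([0.00, 0.02, 0.04, 0.06, 0.08])
x = np.array([10.0, 14.0, 19.0, 23.0, 26.0])
y = np.array([50.0, 52.0, 55.0, 59.0, 64.0])

vx, vy = np.gradient(x, t), np.gradient(y, t)      # velocity components
ax, ay = np.gradient(vx, t), np.gradient(vy, t)    # acceleration components
speed = np.hypot(vx, vy)
# Signed path curvature: (vx*ay - vy*ax) / speed**3
curvature = (vx * ay - vy * ax) / np.clip(speed, 1e-9, None) ** 3
```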

Dataset: https://huggingface.co/datasets/Capycap-AI/CaptchaSolve30k

submitted by /u/SilverWheat

Tons Of Clean Econ/finance Datasets That Are Quite Messy In Their Original Form

FetchSeries (https://www.fetchseries.com) provides a clean and fast way to access lots of open/free datasets that are quite messy when downloaded from their original sources. Think data that lives on government websites, spread across dozens of Excel files in inconsistent formats (e.g., the CFTC’s COT reports, the regional Feds’ manufacturing surveys, port and air traffic data).

submitted by /u/mtaboga

Where To Find Traffic Data For A Specific Road?

Hello there,

I have a personal project on my mind to investigate an issue that has been plaguing my town for decades through solid data analysis.

Specifically, I am interested in extracting the traffic data of a specific local road (not a highway or motorway) to create a traffic time series and also look into the nature of traffic jams at different hours of the day.

Is there any service that lets me extract this data from Google Maps or other sources?

I am not in the US.

submitted by /u/Trollercoaster101

Sitting On High-End GPU Resources That I Have Not Been Able To Put To Work

Some months ago we decided to do some heavy data processing. We had just learned about cloud LLMs and open-source models, so with excitement we got a decent amount of cloud credits with access to high-end GPUs like the B200, H200, and H100 (and of course everything below these). It turns out we did not need all of those resources; worse, there was a better way to do the job, and we switched to it. Since then the cloud credits have been sitting idle. I don’t have much time or anything important to use them on, and I’m trying to figure out whether and how I can put them to work.

Any ideas how I can utilize these and make something of them?

submitted by /u/TelevisionHot468

Data Center Geolocation Data In The US

Long time lurker here

Curious to know if anyone has pointers to data center location data. I keep hearing about data center clusters having an impact on a million things, e.g. Northern Virginia has a cluster, but where are they on the map? Which ones are operational? Which are under construction?

Early stage discovery so any pointers are helpful

submitted by /u/leobenjamin80

HELP! Does Anyone Have A Way To Download The Qilin Watermelon Dataset For Free? I’m A Super Broke High School Student.

I want to make a machine learning algorithm that takes in an audio clip of tapping a watermelon and outputs the ripeness/how good the watermelon is. I need training data, and the Qilin Watermelon dataset is perfect. However, I’m a super broke high school student. If anyone already has the zip file and can provide a free download link, or has another applicable dataset, I would really appreciate it.
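Once the data is in hand, a hedged sketch of the tap-audio → ripeness idea, assuming librosa; the file names and labels here are placeholders:

```python
# Sketch only: MFCC summary features per tap clip, simple classifier on top.
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tap_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)  # one fixed-length vector per clip

# "tap_ripe.wav" / "tap_unripe.wav" are hypothetical placeholder files.
X = np.stack([tap_features(p) for p in ["tap_ripe.wav", "tap_unripe.wav"]])
clf = RandomForestClassifier(random_state=0).fit(X, ["ripe", "unripe"])
```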

submitted by /u/ComfortableMenu1114

Independent Weekly Cannabis Price Index (consumer Prices) – Looking For Methodological Feedback

I’ve been building an independent weekly cannabis price index focused on consumer retail prices, not revenue or licensing data. Most cannabis market reporting tracks sales, licenses, or company performance. I couldn’t find a public dataset that consistently tracks what consumers actually pay week to week, so I started aggregating prices from public online retail listings and publishing a fixed-baseline index.

High-level approach:

  • Weekly index with a fixed baseline
  • Category-level aggregation (CBD, THC, etc.)
  • No merchant or product promotion
  • Transparent, public methodology
  • Intended as a complementary signal to macro market reports

Methodology and latest index are public here:
https://cannabisdealsus.com/cannabis-price-index/
https://cannabisdealsus.com/cannabis-price-index/methodology/

I’m mainly posting to get methodological feedback:

  • Does this approach seem sound for tracking consumer price movement?
  • Any obvious biases or gaps you’d expect from this type of data source?
  • Anything you’d want clarified if you were citing something like this?

Not selling anything and not looking for promotion — genuinely interested in critique.
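For reviewers, the fixed-baseline arithmetic as described reduces to a ratio against the baseline week; numbers and categories below are made up for illustration:

```python
# Fixed-baseline index: 100 * (mean category price this week / baseline week mean).
import pandas as pd

prices = pd.DataFrame({
    "week":     ["2024-01-01", "2024-01-01", "2024-06-03", "2024-06-03"],
    "category": ["THC", "CBD", "THC", "CBD"],
    "price":    [10.0, 8.0, 9.0, 8.4],
})

weekly = prices.groupby(["week", "category"])["price"].mean().unstack()
index = 100 * weekly / weekly.iloc[0]   # first week is the fixed baseline (= 100)
print(index.round(1))
```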

submitted by /u/theov666