Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Bamboo Filing Cabinet: Vietnam Elections (open, Source-linked Datasets + Site)

TL;DR: Open, source-linked Vietnam election datasets (starting with NA15-2021) with reproducible pipelines + GitHub Pages site; seeking source hunters and devs.

Hi all,

I want to share Vietnam Elections, a project I’ve been working on to make Vietnam election data more accessible, archived, and fully sourced.

The code for both the site and the data is on GitHub. The pipeline is provenance-first: raw sources → scripts → JSON exports, and every factual field links back to a source URL with retrieval timestamps.
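As a rough illustration of what a provenance-first export could look like, here is a minimal Python sketch. The field names (`value`, `source_url`, `retrieved_at`), the helpers `sourced` and `missing_provenance`, and the example URL are all hypothetical; the project's actual schema may differ.

```python
import json
from datetime import datetime, timezone


def sourced(value, source_url, retrieved_at=None):
    """Wrap a factual value with the source it came from (hypothetical schema)."""
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return {
        "value": value,
        "source_url": source_url,
        "retrieved_at": retrieved_at.isoformat(),
    }


def missing_provenance(record):
    """Return the names of fields that lack a source URL or retrieval timestamp."""
    return [
        name
        for name, field in record.items()
        if not (
            isinstance(field, dict)
            and field.get("source_url")
            and field.get("retrieved_at")
        )
    ]


# Placeholder data: the URL and values below are made up for illustration.
fetched = datetime(2025, 1, 1, tzinfo=timezone.utc)
candidate = {
    "name": sourced("Example Candidate", "https://example.org/candidates.pdf", fetched),
    "birth_year": sourced(1990, "https://example.org/candidates.pdf", fetched),
}

print(json.dumps(candidate, ensure_ascii=False, indent=2))
print(missing_provenance(candidate))
```

A check like `missing_provenance` could run as part of a build so that an export fails if any field arrives without a source.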

Data access: the exported datasets live in public/data/ within the repo.

If anyone has been interested in this data before, I think you may have been stymied by the lack of English-language information, slow or buggy websites, and data soft-hidden behind PDFs.

So far I’ve mapped out the 2021 National Assembly XV election in anticipation of the coming 2026 Vietnamese legislative election. Even with only one election, there are already a bunch of interesting stats. For example, did you know that in 2021:

  1. …the smallest gap between a winner and a loser in a constituency was only 197 votes, representing a 0.16% gap?
  2. …8 people born in 1990 or later won a seat, with 7 of them being women?
  3. …2 candidates had only a middle school education?
  4. …1 person won, but was not confirmed?
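A stat like the smallest winner-loser gap can be derived from per-constituency vote counts. The sketch below is a guess at how it could be computed, not the project's actual script: it assumes multi-seat constituencies, made-up vote counts, and that the percentage is taken over valid votes (the post does not say which denominator it uses).

```python
def winner_loser_gap(votes, seats):
    """Gap between the last-placed winner and the best-placed loser.

    votes: mapping of candidate name -> vote count
    seats: number of seats filled in the constituency
    """
    ranked = sorted(votes.values(), reverse=True)
    if len(ranked) <= seats:
        raise ValueError("need more candidates than seats")
    return ranked[seats - 1] - ranked[seats]


# Made-up numbers for illustration only, not real results.
constituencies = {
    "Unit A": {
        "seats": 3,
        "valid_votes": 120_000,
        "votes": {"c1": 95_000, "c2": 88_000, "c3": 70_200, "c4": 70_000, "c5": 41_000},
    },
    "Unit B": {
        "seats": 2,
        "valid_votes": 150_000,
        "votes": {"c1": 110_000, "c2": 90_000, "c3": 60_000, "c4": 35_000},
    },
}

gaps = {
    name: winner_loser_gap(c["votes"], c["seats"])
    for name, c in constituencies.items()
}
closest = min(gaps, key=gaps.get)
# Assumption: the share is measured against valid votes in that constituency.
share = gaps[closest] / constituencies[closest]["valid_votes"] * 100
print(f"{closest}: gap of {gaps[closest]} votes ({share:.2f}% of valid votes)")
```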

I’m looking for contributors or anyone interested in building this project as I want to map out all the elections in Vietnam’s history, primarily:

  1. Source hunters (no coding): help find official/public source pages or PDFs (candidate lists, results tables, constituency/unit docs) — even just one link helps.
  2. Devs: help automate collection + parsing (HTML/PDF → structured tables), validation, and reproducible builds.

For corrections or contributions, it’s best to start with either GitHub Issues or the anonymous form.

You might ask, “what is this Bamboo Filing Cabinet?” It’s the umbrella GitHub organization (org page here) I created to store and make accessible Vietnam-related datasets. It’s aiming to be community-run, not affiliated with any government agency, and focuses on provenance-first, reproducible, neutral datasets with transparent change history. If you have ideas for other Vietnam-related datasets that would fit under this umbrella, please reach out.

submitted by /u/thanhoangviet1996

30,000 Human CAPTCHA Interactions: Mouse Trajectories, Telemetry, And Solutions

Just released the largest open-source behavioral dataset for CAPTCHA research on Hugging Face. Most existing datasets only provide the solution labels (image/text); this dataset includes the full cursor telemetry.

Specs:

  • 30,000+ verified human sessions.
  • Features: Path curvature, accelerations, micro-corrections, and timing.
  • Tasks: Drag mechanics and high-precision object tracking (harder than current production standards).
  • Source: Verified human interactions (3 world records broken for scale/participants).
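Features such as speed, acceleration, and turning behavior can be derived from raw cursor samples. The sketch below assumes a trajectory is a list of `(t_seconds, x, y)` tuples; that layout and the function name `kinematics` are assumptions, since the post does not describe the dataset's actual columns.

```python
import math


def kinematics(points):
    """Compute per-step speed, acceleration, and turning angle.

    points: list of (t_seconds, x, y) samples, strictly increasing in t.
    """
    speeds, headings = [], []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        dt = t1 - t0
        dx, dy = x1 - x0, y1 - y0
        speeds.append(math.hypot(dx, dy) / dt)
        headings.append(math.atan2(dy, dx))

    # Acceleration: change in speed between consecutive segments,
    # divided by the time between the two segment midpoints.
    accels = []
    for i in range(1, len(speeds)):
        mid_prev = (points[i - 1][0] + points[i][0]) / 2
        mid_next = (points[i][0] + points[i + 1][0]) / 2
        accels.append((speeds[i] - speeds[i - 1]) / (mid_next - mid_prev))

    # Turning angle: heading change wrapped to [-pi, pi], a simple
    # stand-in for path curvature.
    turns = []
    for h0, h1 in zip(headings, headings[1:]):
        d = h1 - h0
        turns.append(math.atan2(math.sin(d), math.cos(d)))
    return speeds, accels, turns


# Tiny made-up trajectory: (time in seconds, x, y).
path = [(0.0, 0, 0), (0.5, 3, 4), (1.0, 3, 14)]
speeds, accels, turns = kinematics(path)
print(speeds, accels, turns)
```

Summary statistics over these lists (means, maxima, counts of sign changes) are one common way to turn a trajectory into a fixed-length feature vector.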

Ideal for training behavioral biometric models, red-teaming anti-bot systems, or researching human-computer interaction (HCI) patterns.

Dataset: https://huggingface.co/datasets/Capycap-AI/CaptchaSolve30k

submitted by /u/SilverWheat

Tons Of Clean Econ/finance Datasets That Are Quite Messy In Their Original Form

FetchSeries (https://www.fetchseries.com) provides a clean and fast way to access lots of open/free datasets that are quite messy when downloaded from their original sources. Think of data on government websites spread across dozens of Excel files, often in inconsistent formats (e.g., the CFTC’s COT reports, regional Fed manufacturing surveys, port and air traffic data).

submitted by /u/mtaboga

Where To Find Traffic Data For A Specific Road?

Hello there,

I have a personal project on my mind to investigate an issue that has been plaguing my town for decades through solid data analysis.

Specifically, I am interested in extracting the traffic data of a specific local road (not a highway or motorway) to create a traffic time series and to look into the nature of traffic jams at different hours of the day.

Is there any service that allows extracting this data from Google Maps or other sources?

I am not in the US.

submitted by /u/Trollercoaster101

Sitting On High-End GPU Resources That I Have Not Been Able To Put To Work

Some months ago we decided to do some heavy data processing. We had just learned about cloud LLMs and open-source models, so with excitement we got a decent amount of cloud credits with access to high-end GPUs such as the B200, H200, and H100, and of course anything below these. It turned out we did not need all of these resources, and even worse, there was a better way to do the job, so we switched to it. Since then the cloud credits have been sitting idle. I don’t have much time or anything important to do with them, and I’m trying to figure out whether and how I can put them to work.

Any ideas on how I can use them and make something off them?

submitted by /u/TelevisionHot468

Data Center Geolocation Data In The US

Long-time lurker here.

Curious to know if anyone has pointers for data center location data. I keep hearing that data center clusters have an impact on a million things; e.g., Northern Virginia has a cluster, but where exactly are they on the map? Which ones are operational? Which are under construction?

Early-stage discovery, so any pointers are helpful.

submitted by /u/leobenjamin80

HELP! Does Anyone Have A Way To Download The Qilin Watermelon Dataset For Free? I’m A Super Broke High School Student.

I want to make a machine learning algorithm that takes in an audio clip of someone tapping a watermelon and outputs the ripeness (how good the watermelon is). I need training data, and the Qilin Watermelon dataset is perfect. However, I’m a super broke high school student. If anyone already has the zip file and can provide a free download link, or has another applicable dataset, I would really appreciate it.
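While looking for data, the feature-extraction side can already be prototyped. One plausible input feature for a tap sound is its dominant frequency; whether that actually predicts ripeness is an untested assumption here. The sketch uses a naive DFT from the standard library on a synthetic decaying tone that stands in for a real recording.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        acc = sum(
            s * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(samples)
        )
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n


# Synthetic stand-in for a tap recording: a decaying 200 Hz tone.
rate = 2000
clip = [
    math.exp(-i / 100) * math.sin(2 * math.pi * 200 * i / rate)
    for i in range(400)
]
print(dominant_frequency(clip, rate))
```

The naive DFT is quadratic in clip length, so for real recordings an FFT routine would be the practical choice; the logic stays the same.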

submitted by /u/ComfortableMenu1114

Independent Weekly Cannabis Price Index (consumer Prices) – Looking For Methodological Feedback

I’ve been building an independent weekly cannabis price index focused on consumer retail prices, not revenue or licensing data. Most cannabis market reporting tracks sales, licenses, or company performance. I couldn’t find a public dataset that consistently tracks what consumers actually pay week to week, so I started aggregating prices from public online retail listings and publishing a fixed-baseline index.

High-level approach:

  • Weekly index with a fixed baseline
  • Category-level aggregation (CBD, THC, etc.)
  • No merchant or product promotion
  • Transparent, public methodology
  • Intended as a complementary signal to macro market reports

Methodology and latest index are public here:

  • https://cannabisdealsus.com/cannabis-price-index/
  • https://cannabisdealsus.com/cannabis-price-index/methodology/

I’m mainly posting to get methodological feedback:

  • Does this approach seem sound for tracking consumer price movement?
  • Any obvious biases or gaps you’d expect from this type of data source?
  • Anything you’d want clarified if you were citing something like this?

Not selling anything and not looking for promotion; genuinely interested in critique.
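For readers unfamiliar with fixed-baseline indexes, the idea can be sketched in a few lines of Python. This is an illustration only: using the weekly median as the aggregate, the function name `fixed_base_index`, and all prices are assumptions, and the linked methodology may aggregate differently.

```python
from statistics import median


def fixed_base_index(weekly_prices, base_week):
    """Index each week's median price against the base week (base = 100)."""
    base = median(weekly_prices[base_week])
    return {
        week: round(median(prices) / base * 100, 1)
        for week, prices in weekly_prices.items()
    }


# Made-up listing prices (USD) for one category, for illustration only.
thc_prices = {
    "2025-W01": [20.0, 25.0, 30.0],
    "2025-W02": [22.0, 26.0, 31.0],
    "2025-W03": [18.0, 24.0, 29.0],
}
print(fixed_base_index(thc_prices, "2025-W01"))
```

Because the baseline never moves, a change in which listings are sampled each week shows up as a price change, which is one of the biases worth asking about.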

submitted by /u/theov666

Looking For Dataset On Menopausal Subjective Cognitive Decline (Academic Use) Post

Hi everyone,

I’m working on an academic project focused on Subjective Cognitive Decline (SCD) in menopausal women, using machine learning and explainable AI techniques.

While reviewing prior work, I found the paper “Clinical-Grade Hybrid Machine Learning Framework for Post-Menopausal subjective cognitive decline” particularly helpful. The hybrid ML approach and the focus on post-menopausal sleep-related health conditions closely align with the direction of my research.

Project overview (brief):

  • Machine learning–based risk prediction for cognitive issues in menopausal women
  • Use of Explainable AI (e.g., SHAP) to interpret contributing factors
  • Intended strictly for academic and educational purposes
  • Fully anonymous; no personally identifiable information is collected or stored
  • Goal is awareness and early screening support, not clinical diagnosis

submitted by /u/Small-Day-8755

Emotions Dataset: 14K Texts Tagged With 7 Emotions (NLP / Classification)

About Dataset –

https://www.kaggle.com/datasets/prashanthan24/synthetic-emotions-dataset-14k-texts-7-emotions

Overview
High-quality synthetic dataset with 13,970 text samples labeled across 7 emotions (Anger, Happiness, Sad, Surprise, Hate, Love and Fun). Generated using Mistral-7B for diverse, realistic emotion expressions in short-to-medium texts. Ideal for benchmarking NLP models like RNNs, BERT, or LLMs in multi-class emotion detection.

Sample
Text: “John clenched his fists, his face turning red as he paced back and forth in the room. His eyes flashed with frustration as he muttered under his breath about the latest setback at work.”

Emotion: Anger

Key Stats

  • Rows: 13970
  • Columns: text, emotion
  • Emotions: 7 balanced classes
  • Generator: Mistral-7B (synthetic, no PII/privacy risks)
  • Format: CSV (easy import to Kaggle notebooks)

Use Cases

  • Train/fine-tune emotion classifiers (e.g., DistilBERT, LSTM)
  • Compare traditional ML vs. LLMs (zero-shot/few-shot)
  • Augment real datasets for imbalanced classes
  • Educational projects in NLP/sentiment analysis

Notes
Fully synthetic; labels were auto-generated via LLM prompting for consistency. Check for duplicates/biases before heavy use. Pairs well with emotion notebooks!
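A quick duplicate and class-balance check could look like the sketch below. It relies on the `text` and `emotion` columns listed under Key Stats; the inline CSV is a made-up stand-in for the real file, and treating texts as duplicates after lowercasing and trimming is an assumption about what counts as a duplicate.

```python
import csv
import io
from collections import Counter


def audit(rows):
    """Count exact duplicate texts and samples per emotion label."""
    texts = Counter(r["text"].strip().lower() for r in rows)
    duplicates = sum(n - 1 for n in texts.values() if n > 1)
    labels = Counter(r["emotion"] for r in rows)
    return duplicates, labels


# Inline stand-in for the real CSV; swap in open(<path>, newline="") for the download.
sample_csv = io.StringIO(
    "text,emotion\n"
    "I can't believe this happened!,Surprise\n"
    "i can't believe this happened!,Surprise\n"
    "What a wonderful day.,Happiness\n"
)
rows = list(csv.DictReader(sample_csv))
duplicates, labels = audit(rows)
print(duplicates, dict(labels))
```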

submitted by /u/prashanthpavi

Any Good Sources Of Free Verbatim / Open-text Datasets?

Hi all,

I’m trying to track down free / open datasets that contain real human open-ended responses for testing and research. I have tried generating them with AI, but the results just don’t capture the nuance of a real market research project.

If anyone knows of good public sources, I’d really appreciate being pointed in the right direction.

Thanks!

submitted by /u/472826

Best Way To Pull Twitter/X Data At Scale Without Getting Rate Limited To Death?

Been trying to build a dataset of tweets for a research project (analyzing discourse patterns around specific topics) and the official X API is basically unusable unless you want to drop $5k+/month for reasonable limits.

I’ve tried a few different approaches:

  • Official API → rate limits killed me immediately
  • Manual scraping → got my IP banned within a day
  • Some random npm packages → half of them are broken now

Found a breakdown comparing different methods and it actually explained why most DIY scrapers fail (anti-bot stuff has gotten way more aggressive lately). Makes sense why so many tools just stopped working after Elon’s changes.

Anyone here working with Twitter data regularly? What’s actually reliable right now? Need something that can pull ~50k tweets/day without constant babysitting.

Not trying to do anything shady – just need public tweet text, timestamps, and basic engagement metrics for academic analysis.

submitted by /u/Technical_Fee4829

I Am Looking To Buy Instagram Influencer Data.

Are you sitting on a compiled Instagram creator database with depth beyond just handles?

I’m looking to buy a dataset outright that includes:

  • Instagram handle
  • District / city
  • State
  • Phone number
  • Email

Creator range: nano / micro influencers
Geo focus: South India

This is a clean purchase, not rev-share, not scraping on demand, not ongoing work.
If you already have the data, we can close quickly.

If interested, DM with:

  • Approx record count
  • Fields available
  • Price expectation

Only reaching out to people with ready data at this depth.

submitted by /u/mined_it