Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Independent Weekly Cannabis Price Index (consumer Prices) – Looking For Methodological Feedback

I’ve been building an independent weekly cannabis price index focused on consumer retail prices, not revenue or licensing data. Most cannabis market reporting tracks sales, licenses, or company performance. I couldn’t find a public dataset that consistently tracks what consumers actually pay week to week, so I started aggregating prices from public online retail listings and publishing a fixed-baseline index.

High-level approach:

  • Weekly index with a fixed baseline
  • Category-level aggregation (CBD, THC, etc.)
  • No merchant or product promotion
  • Transparent, public methodology
  • Intended as a complementary signal to macro market reports

Methodology and latest index are public here:

https://cannabisdealsus.com/cannabis-price-index/
https://cannabisdealsus.com/cannabis-price-index/methodology/

I’m mainly posting to get methodological feedback:

  • Does this approach seem sound for tracking consumer price movement?
  • Any obvious biases or gaps you’d expect from this type of data source?
  • Anything you’d want clarified if you were citing something like this?

Not selling anything and not looking for promotion; genuinely interested in critique.
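For readers unfamiliar with the term, a fixed-baseline index can be sketched roughly as below. This is a generic illustration, not the poster's actual methodology (that is documented at the links above): it assumes the index is the mean listing price per week divided by the mean price in a chosen base week, scaled to 100. The function name, the week labels, and the prices are all made up for the example.

```python
from statistics import mean


def fixed_baseline_index(weekly_prices, base_week):
    """Return {week: index value}, with the base week scaled to 100."""
    base = mean(weekly_prices[base_week])
    return {
        week: round(100 * mean(prices) / base, 2)
        for week, prices in weekly_prices.items()
    }


# Hypothetical listing prices (USD) for one category, keyed by week label.
cbd_prices = {
    "2025-W01": [20.0, 30.0, 40.0],  # mean 30.0, used as the baseline
    "2025-W02": [21.0, 33.0, 42.0],  # mean 32.0
    "2025-W03": [18.0, 27.0, 36.0],  # mean 27.0
}

index = fixed_baseline_index(cbd_prices, base_week="2025-W01")
print(index)
```

A value above 100 means that week's average price is higher than in the base week. A real index would also have to decide how to handle listings that appear or disappear between weeks, which this sketch ignores.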

submitted by /u/theov666

Looking For Dataset On Menopausal Subjective Cognitive Decline (Academic Use) Post

Hi everyone,

I’m working on an academic project focused on Subjective Cognitive Decline (SCD) in menopausal women, using machine learning and explainable AI techniques.

While reviewing prior work, I found the paper “Clinical-Grade Hybrid Machine Learning Framework for Post-Menopausal subjective cognitive decline” particularly helpful. The hybrid ML approach and the focus on post-menopausal sleep-related health conditions closely align with the direction of my research.

Project overview (brief):

Machine learning–based risk prediction for cognitive issues in menopausal women

Use of Explainable AI (e.g., SHAP) to interpret contributing factors

Intended strictly for academic and educational purposes

Fully anonymous — no personally identifiable information is collected or stored

Goal is awareness and early screening support, not clinical diagnosis

submitted by /u/Small-Day-8755

Emotions Dataset: 14K Texts Tagged With 7 Emotions (NLP / Classification)

About Dataset –

https://www.kaggle.com/datasets/prashanthan24/synthetic-emotions-dataset-14k-texts-7-emotions

Overview
High-quality synthetic dataset with 13,970 text samples labeled across 7 emotions (Anger, Happiness, Sad, Surprise, Hate, Love and Fun). Generated using Mistral-7B for diverse, realistic emotion expressions in short-to-medium texts. Ideal for benchmarking NLP models like RNNs, BERT, or LLMs in multi-class emotion detection.

Sample
Text: “John clenched his fists, his face turning red as he paced back and forth in the room. His eyes flashed with frustration as he muttered under his breath about the latest setback at work.”

Emotion: Anger

Key Stats

  • Rows: 13970
  • Columns: text, emotion
  • Emotions: 7 balanced classes
  • Generator: Mistral-7B (synthetic, no PII/privacy risks)
  • Format: CSV (easy import to Kaggle notebooks)

Use Cases

  • Train/fine-tune emotion classifiers (e.g., DistilBERT, LSTM)
  • Compare traditional ML vs. LLMs (zero-shot/few-shot)
  • Augment real datasets for imbalanced classes
  • Educational projects in NLP/sentiment analysis

Notes
Fully synthetic: labels auto-generated via LLM prompting for consistency. Check for duplicates/biases before heavy use. Pairs well with emotion notebooks!
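The duplicate and balance check suggested in the notes could look something like this. It is a minimal sketch using only the standard library; the inline rows are invented stand-ins for the real CSV, and only the column names (text, emotion) come from the post.

```python
import csv
import io
from collections import Counter

# Tiny inline stand-in for the real file; with the Kaggle CSV you would
# open the downloaded file instead of using io.StringIO.
sample_csv = io.StringIO(
    "text,emotion\n"
    "I can't believe they did this to me!,Anger\n"
    "What a lovely surprise party.,Surprise\n"
    "I can't believe they did this to me!,Anger\n"
    "We laughed all afternoon.,Fun\n"
)

rows = list(csv.DictReader(sample_csv))

# Class balance: how many samples carry each label.
label_counts = Counter(row["emotion"] for row in rows)

# Exact duplicates after trimming whitespace and lowercasing the text.
text_counts = Counter(row["text"].strip().lower() for row in rows)
duplicates = {text: n for text, n in text_counts.items() if n > 1}

print(label_counts)
print(duplicates)
```

This only catches exact matches after normalization; near-duplicates, which are common in LLM-generated text, would need a fuzzier comparison.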

submitted by /u/prashanthpavi

Any Good Sources Of Free Verbatim / Open-text Datasets?

Hi all,

I’m trying to track down free / open datasets that contain real human open-ended responses for testing and research. I have tried generating them with AI, but the results just don’t capture the nuance of a real market research project.

If anyone knows of good public sources, I’d really appreciate being pointed in the right direction.

Thanks!

submitted by /u/472826

Best Way To Pull Twitter/X Data At Scale Without Getting Rate Limited To Death?

Been trying to build a dataset of tweets for a research project (analyzing discourse patterns around specific topics) and the official X API is basically unusable unless you want to drop $5k+/month for reasonable limits.

I’ve tried a few different approaches:

  • Official API → rate limits killed me immediately
  • Manual scraping → got my IP banned within a day
  • Some random npm packages → half of them are broken now

Found a breakdown comparing different methods and it actually explained why most DIY scrapers fail (anti-bot stuff has gotten way more aggressive lately). Makes sense why so many tools just stopped working after Elon’s changes.

Anyone here working with Twitter data regularly? What’s actually reliable right now? Need something that can pull ~50k tweets/day without constant babysitting.

Not trying to do anything shady – just need public tweet text, timestamps, and basic engagement metrics for academic analysis.

submitted by /u/Technical_Fee4829

I Am Looking To Buy Instagram Influencer Data.

Are you sitting on a compiled Instagram creator database with depth beyond just handles?

I’m looking to buy a dataset outright that includes:

  • Instagram handle
  • District / city
  • State
  • Phone number
  • Email

Creator range: nano / micro influencers
Geo focus: South India

This is a clean purchase, not rev-share, not scraping on demand, not ongoing work.
If you already have the data, we can close quickly.

If interested, DM with:

  • Approx record count
  • Fields available
  • Price expectation

Only reaching out to people with ready data at this depth.

submitted by /u/mined_it

I Fine-tuned LLaMA 3.2 1B As A Brazilian Address Parser: Looking For Honest Feedback

Recently, I posted here on Reddit asking for ideas on what I could build with a dataset of ~2 million pairs of messy/clean Brazilian addresses. A few kind folks shared some great suggestions, and one idea that really stood out was building an address parser.

That pushed me into the world of LLM fine-tuning for the first time.

I decided to partially fine-tune LLaMA 3.2 1B, focusing specifically on address normalization and field extraction (address, complement, neighborhood, city, state, country, coordinates, etc.). Surprisingly, the early results look quite promising.

To properly evaluate it, I also built a small API to:

  • Run inference tests
  • Perform post-inference validation
  • Compute a confidence score based on consistency checks (postal code, city/state match, field presence, etc.)
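One possible shape for a consistency-based confidence score is a weighted share of passed checks, sketched below. The post does not describe the actual formula, so the function name and the weights here are assumptions and this will not reproduce the poster's numbers; only the field names are taken from the example response further down.

```python
def confidence_score(validation, weights=None):
    """Weighted share of passed checks, in [0, 1]. Weights are hypothetical."""
    weights = weights or {
        "postal_code_valid": 4,
        "city_match": 3,
        "address_found": 2,
        "city_found": 1,
    }
    postal = validation.get("postal_code_validation", {})
    fields = validation.get("field_validation", {})
    # Missing keys count as failed checks.
    checks = {
        "postal_code_valid": postal.get("is_valid", False),
        "city_match": postal.get("city_match", False),
        "address_found": fields.get("address_found", False),
        "city_found": fields.get("city_found", False),
    }
    total = sum(weights.values())
    passed = sum(w for name, w in weights.items() if checks[name])
    return round(passed / total, 2)


all_checks_pass = {
    "postal_code_validation": {"is_valid": True, "city_match": True},
    "field_validation": {"address_found": True, "city_found": True},
}
no_postal_code = {
    "postal_code_validation": {"is_valid": False},
    "field_validation": {"address_found": True, "city_found": True},
}

print(confidence_score(all_checks_pass))
print(confidence_score(no_postal_code))
```

A weighted sum keeps the score easy to explain, but it treats the checks as independent, which may not hold when a missing postal code also rules out the city match.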

Below is an example request body and the corresponding response.

Request

{
  "inputs": [
    "quadra -42.93386179 quadra arse 102 alameda 12 a, 5045 77023-582 brasil -21.26567258 palmas",
    "torre -43.02525939 bela vista 5 brasil minas gerais são joão do paraíso beco do pôr do sol, 4289 -19.14142529"
  ]
}

Response

[
  {
    "address": "Quadra Arse 102 Alameda 12 A, 5045",
    "complement": "quadra",
    "city": "Palmas",
    "country": "Brasil",
    "postal_code": "77023-582",
    "latitude": "-21.26567258",
    "longitude": "-42.93386179",
    "confidence": 1.0,
    "validation": {
      "postal_code_validation": {
        "is_valid": true,
        "found_in_input": true,
        "city_match": true
      },
      "field_validation": {
        "address_found": true,
        "complement_found": true,
        "neighborhood_found": false,
        "city_found": true,
        "state_found": false,
        "country_found": true
      }
    }
  },
  {
    "address": "Beco Do Pôr Do Sol, 4289",
    "complement": "torre",
    "neighborhood": "Bela Vista 5",
    "city": "São João Do Paraíso",
    "state": "Minas Gerais",
    "country": "Brasil",
    "latitude": "-19.14142529",
    "longitude": "-43.02525939",
    "confidence": 0.92,
    "validation": {
      "postal_code_validation": {
        "is_valid": false
      },
      "field_validation": {
        "address_found": true,
        "complement_found": true,
        "neighborhood_found": true,
        "city_found": true,
        "state_found": true,
        "country_found": true,
        "city_in_state": false,
        "neighborhood_in_city": false
      }
    }
  }
]

I’d really appreciate honest feedback from people more experienced with:

  • Fine-tuning small LLMs
  • Address parsing / entity extraction
  • Post-inference validation strategies
  • Confidence scoring approaches

Does this look like a reasonable direction for a 1B model?
Anything you’d improve architecturally or evaluation-wise?

Thanks in advance — this project has been a great learning experience so far 🙏

submitted by /u/Hour-Dirt-8505

[FREE DATASET] 67K+ Domains With Technology Fingerprints

This dataset contains information on what technologies were found on domains that were crawled in December 2025.

A few common use cases for this type of data

  • You’re a developer who has built a particular solution for a client, and you want to replicate your success by finding more leads based on that client’s profile. For example, find me all electrical wholesalers using WordPress that have a `.com.au` domain.
  • You’re performing market research and you want to see who is already paying for your competitors. For example, find me all companies using my competitor’s product who are also paying for enterprise technologies (indicates high technology expenditure).
  • You’re a security researcher who is evaluating the impact of your findings. For example, give me all sites running a particular version of a WordPress plugin.
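As a rough illustration of the first use case, a lookup like "WordPress sites on a `.com.au` domain" might be written as below. The record layout (a domain string plus a list of technology names) is an assumption made for this sketch, not the documented schema of the download, and the domains are invented.

```python
# Hypothetical records; the real file's field names and structure may differ.
records = [
    {"domain": "wholesaler-a.example.com.au", "technologies": ["WordPress", "WooCommerce"]},
    {"domain": "wholesaler-b.example.com.au", "technologies": ["Shopify"]},
    {"domain": "wholesaler-c.example.com", "technologies": ["WordPress"]},
]


def find_domains(records, technology, suffix):
    """Return domains that end with `suffix` and list `technology`."""
    wanted = technology.lower()
    return [
        record["domain"]
        for record in records
        if record["domain"].endswith(suffix)
        and any(name.lower() == wanted for name in record["technologies"])
    ]


matches = find_domains(records, technology="WordPress", suffix=".com.au")
print(matches)
```

Filtering by industry (the "electrical wholesalers" part) would need an extra field or an external lookup that this sketch does not assume.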

The 67K domain dataset can be found here: https://www.dropbox.com/scl/fi/d4l0gby5b5wqxn52k556z/sample_dec_2025.zip?rlkey=zfqwxtyh4j0ki2acxv014ibnr&e=1&st=xdcahaqm&dl=0

The full 5M+ domains can be purchased for 99 USD at: https://versiondb.io/

VersionDB’s WordPress catalogue can be found here: https://versiondb.io/technologies/wordpress/

Enjoy!

submitted by /u/Upper-Character-6743

A Workflow For Generating Labeled Object-detection Datasets Without Manual Annotation (experiment / Feedback Wanted)

I’m experimenting with using prompt-based object detection (open-vocabulary / vision-language models) as a way to auto-generate training datasets for downstream models like YOLO.

Instead of fixed classes, the detector takes any text prompt (e.g. “white Toyota Corolla”, “people wearing safety helmets”, “parked cars near sidewalks”) and outputs bounding boxes. Those detections are then exported as YOLO-format annotations to train a specialized model.
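The export step could be sketched as follows. A YOLO label file has one line per object: a class index, then the box center and size, all normalized by the image dimensions. The sketch assumes the detector returns pixel corner coordinates (x_min, y_min, x_max, y_max), which is a guess about the tool's output rather than something the post states; the function name is made up.

```python
def to_yolo_line(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    # YOLO stores the box center and size as fractions of the image size.
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# One hypothetical detection in a 1000x800 image, mapped to class index 0.
line = to_yolo_line(0, (100, 200, 300, 400), image_width=1000, image_height=800)
print(line)  # 0 0.200000 0.375000 0.200000 0.250000
```

Each text prompt would need a fixed class index, and a confidence threshold before export is probably worth adding so that weak detections do not end up as training labels.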

Observations so far:

  • Detection quality is surprisingly high for many niche or fine-grained prompts
  • Works well as a bootstrapping or data expansion step
  • Inference is expensive and not suitable for real-time use. This is strictly a dataset creation / offline pipeline idea

I’m trying to evaluate:

  • How usable these auto-generated labels are in practice
  • Where they fail compared to human-labeled data
  • Whether people would trust this for pretraining or rapid prototyping

Demo / tool I’m using for the experiment (please don’t abuse it; it will crash if bombarded with requests):

Detect Anything

I’m mainly looking for feedback, edge cases, and similar projects. If you’ve tried similar approaches before, I’d be very interested to hear what worked (or didn’t).

submitted by /u/eyasu6464

Snipper: An Open-source Chart Scraper And OCR Text+table Data Gathering Tool [self-promotion]

I was a heavy automeris.io (WebPlotDigitizer) user until the v5 version. Somewhat inspired by it, I’ve been working on a combined chart snipper and OCR text+table sampler. It is desktop rather than web-based and built using Python, Tesseract, and OpenCV. MIT licensed. Some instructions to get started are in the readme.

Chart snipping should be somewhat familiar to automeris.io users but it starts with a screengrab. The tool is currently interactive but I’m thinking about more automated workflows. IMO the line detection is a bit easier to manage than it is in automeris with just a sequence of clicks but you can also drag individual points around. Still adding features and support for more chart types, better x-axis date handling etc. The Tkinter GUI has some limitations (e.g., hi-res screen support is a bit flaky) but is cross-platform and a Python built-in. Requests welcome.

submitted by /u/foldedcard

[PAID] Global Car Specs & Features Dataset (1990-2025) – 12,000 Variants, 100+ Brands

I compiled and structured a global automotive specifications dataset covering more than 12,000 vehicle variants from over 100 brands, model years 1990-2025.

Each record includes:

  • Brand, model, year, trim
  • Engine specifications (fuel type, cylinders, power, torque, displacement)
  • Dimensions (length, width, height, wheelbase, weight)
  • Performance data (0-100 km/h, top speed, CO2 emissions, fuel consumption)
  • Price, warranty, maintenance, total cost per km
  • Feature list (safety, comfort, convenience)

Available in CSV, JSON, and SQL formats. Useful for developers, researchers, and AI or data analysis projects.

GitHub (sample, details and structure):

https://github.com/vbalagovic/cars-dataset

submitted by /u/Ok_Cucumber_131

Harris County (TX) Parcel-level Real Estate Dataset

Clean, analysis-ready Harris County (TX) parcel-level real estate dataset.
Fully documented, GIS-ready, delivered in Parquet format.
Perfect for analytics, GIS, and data science workflows.

#realestate #HarrisCounty #Texas #GIS #parceldata #dataset #Parquet #opendata #HCAD #propertyrecords #datascience #analytics #geospatial

submitted by /u/ThorImagery

Looking For Advice On Pricing And Selling Smart Home Telemetry Data (EU)

Hi guys,

We’re a young company based in Europe and collect a significant amount of telemetry data from smart home devices in residential houses (e.g. temperature, energy consumption, usage patterns).

We believe this data could be valuable for companies across multiple industries (energy, proptech, insurance, analytics, etc.). However, we’re still quite new to the data monetization topic and are trying to better understand:

  • How to price such data (typical models, benchmarks, CPMs, subscriptions, etc.)
  • Who the realistic buyers might be
  • What transaction volumes or market sizes to expect
  • Where data like this is usually sold (marketplaces, direct sales, partnerships)

Where would you recommend starting to learn about this? Are there resources, communities, marketplaces, or frameworks you’ve found useful? First-hand experiences are especially welcome.

Thanks a lot for any help!

submitted by /u/Intelligent_Offer954

[Self-Release] 65 Hours Of Kenyan/Filipino English Dialogue | Split-Track WebRTC | VAD-Segmented

Hi all,

I’m the Co-founder of Datai. We are releasing a 65-hour dataset of spontaneous, two-speaker dialogues focused on Kenyan (KE) and Filipino (PH) English accents.

We built this to solve a specific internal problem: standard datasets (like LibriSpeech) are too clean. We needed data that reflects WebRTC/VoIP acoustics and non-Western prosody.

We are releasing this batch on Hugging Face for the community to use for ASR benchmarking, accent robustness testing, or diarization experiments.

The Specs:

  • Total Duration: ~65 hours (Full dataset is 800+ hours)
  • Speakers: >150 (Majority Kenyan interviewees, ~15 Filipino interviewers)
  • Topic: Natural, unscripted day-to-day life conversations.
  • Audio Quality: Recorded via WebRTC in Opus 48kHz, transcoded to pcm_s16le.
  • Structure: Split-track (Stereo). Each speaker is on a separate track.

Processing & Segmentation: We processed the raw streams using silero-vad to chunk audio into 1 to 30-second segments.

File/Metadata Structure: We’ve structured the filenames to help with parsing: ROOM-ID_TRACK-ID_START-MS_END-MS

  • ROOM-ID: Unique identifier for the conversation session.
  • TRACK-ID: The specific audio track (usually one speaker per track).
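Parsing that filename pattern might look like the sketch below. The example filenames and the `.wav` extension are invented, and the code assumes the last two underscore-separated fields are integer millisecond offsets. Splitting from the right keeps a ROOM-ID that itself contains underscores intact, though a TRACK-ID with underscores would still break it.

```python
from collections import defaultdict
from pathlib import Path
from typing import NamedTuple


class Segment(NamedTuple):
    room_id: str
    track_id: str
    start_ms: int
    end_ms: int


def parse_segment_name(filename):
    """Parse ROOM-ID_TRACK-ID_START-MS_END-MS, ignoring any file extension."""
    room_id, track_id, start_ms, end_ms = Path(filename).stem.rsplit("_", 3)
    return Segment(room_id, track_id, int(start_ms), int(end_ms))


# Hypothetical filenames following the documented pattern.
names = [
    "room42_trackA_1500_9200.wav",
    "room42_trackA_10000_12500.wav",
    "room42_trackB_0_4000.wav",
]

# Total speech per (room, track) pair, in milliseconds. Tracks are keyed
# together with their room because track IDs are session-specific.
durations = defaultdict(int)
for name in names:
    segment = parse_segment_name(name)
    durations[(segment.room_id, segment.track_id)] += segment.end_ms - segment.start_ms

print(dict(durations))
```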

Technical Caveat (the edge case): Since this is real-world WebRTC data, we are transparent about the dirt in the data: If a speaker drops connection and rejoins, they may appear on a new TRACK-ID within the same ROOM-ID. We are clustering these in v2, but for now, treat Track IDs as session-specific rather than global speaker identities.

Access: The dataset is hosted on Hugging Face (gated to prevent bots/abuse, but I approve manual requests quickly).

Link is in the comments.

submitted by /u/Downtown_Valuable_44

I Put Together A Dataset That Might Be Useful For Researchers

I’ve been working on a side project and ended up compiling a dataset that may be useful beyond what I originally needed it for, so I’m considering releasing it publicly.

At a high level, the dataset contains:

  • structured records collected over a multi-year period
  • consistent timestamps and identifiers
  • minimal preprocessing (basic cleaning + deduplication only)

It’s not tied to a specific paper or product, more something that could support exploratory analysis, modeling, or benchmarking, depending on the use case.

Before publishing, I wanted to sanity-check with this community:

  • what details do you usually look for to judge dataset quality?
  • is light preprocessing preferred, or raw + processed versions?
  • anything that would immediately make this more usable for research?

Happy to share more specifics if there’s interest, and open to feedback before release.

submitted by /u/crowpng

How Can I Learn DS/DA From Scratch To Stand Out In The Highly Competitive Market?

Hello, I am currently studying data analytics and data science, and I want to focus on one of the two. But given the high competition in the market and the negative impact of artificial intelligence on the field, should I start, or choose another field? What exactly do I need to know and learn to stand out in the DA/DS fields and find a job more easily? There is a lot of information on the Internet, so I can’t find a clear learning path. Recommendations from professionals in this field are very important to me. Is it worth studying this field, and if so, how? Thank you very much.

submitted by /u/No_Staff_7246

Looking For S&P 500 (GICS Information Technology Sector) Dataset: Revenue, Net Income & R&D Expenses (Excel/CSV)

Hi everyone,

I’m a master’s student working on academic research and I’m looking for a compiled dataset for S&P 500 companies that includes:

– Revenue

– Net Income (profit)

– R&D expenses (I know some companies don’t report them)

Ideally:

– Annual data

– Multiple years (e.g. 2010–2024, but flexible)

– Excel or CSV format

This is strictly for non-commercial, academic use (master’s thesis).

If anyone already has this dataset (e.g. from Compustat / Capital IQ / Bloomberg) and is willing to share or point me in the right direction, I’d really appreciate it.

Thanks a lot!

submitted by /u/SuddenBookkeeper6351