SUBIT‑64 Spec V0.9.0 — The First Stable Release. A New Foundation For Information Theory

submitted by /u/MainPuzzleheaded8880

Looking For Wheat Disease Datasets!!!

We need a dataset that contains disease images, labels, a description of each disease, and remedies. If possible, please point us to some resources. Thanks in advance.

submitted by /u/Total-Narwhal-3263

Curated AI VC Firm List For Early-stage Founders

Hand-verified investors backing AI and machine learning companies.

https://aivclist.com

submitted by /u/project_startups

Independent Weekly Cannabis Price Index (consumer Prices) – Looking For Methodological Feedback

I’ve been building an independent weekly cannabis price index focused on consumer retail prices, not revenue or licensing data. Most cannabis market reporting tracks sales, licenses, or company performance. I couldn’t find a public dataset that consistently tracks what consumers actually pay week to week, so I started aggregating prices from public online retail listings and publishing a fixed-baseline index.

High-level approach:

Weekly index with a fixed baseline
Category-level aggregation (CBD, THC, etc.)
No merchant or product promotion
Transparent, public methodology
Intended as a complementary signal to macro market reports

Methodology and latest index are public here:

https://cannabisdealsus.com/cannabis-price-index/
https://cannabisdealsus.com/cannabis-price-index/methodology/

I’m mainly posting to get methodological feedback:

Does this approach seem sound for tracking consumer price movement?
Any obvious biases or gaps you’d expect from this type of data source?
Anything you’d want clarified if you were citing something like this?

Not selling anything and not looking for promotion; genuinely interested in critique.
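
As a rough illustration of the fixed-baseline idea described in this post, here is a minimal Python sketch. It is not the site's actual methodology: the use of the median as the weekly aggregate, the baseline value of 100, the function name, and the sample prices are all assumptions made for the example.

```python
from statistics import median


def fixed_baseline_index(weekly_prices, base_week, base_value=100.0):
    """Return {week: index} relative to the aggregate price in base_week.

    weekly_prices maps a week label to a list of observed prices for one
    category. The median is an assumed aggregate, picked here because it
    is less sensitive to outlier listings than the mean.
    """
    base = median(weekly_prices[base_week])
    return {
        week: round(base_value * median(prices) / base, 2)
        for week, prices in weekly_prices.items()
    }


# Made-up prices for a single hypothetical category.
sample = {
    "2025-W01": [20.0, 25.0, 30.0],
    "2025-W02": [22.0, 27.5, 30.0],
    "2025-W03": [18.0, 24.0, 29.0],
}
index = fixed_baseline_index(sample, base_week="2025-W01")
print(index)
```

Because the baseline week never changes, every later value reads directly as "percent of the baseline price", which is what makes week-to-week comparisons straightforward. One thing a reviewer might probe is how the index behaves when the set of listings changes between weeks.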

submitted by /u/theov666

Looking For Dataset On Menopausal Subjective Cognitive Decline

submitted by /u/Small-Day-8755

Looking For Dataset On Menopausal Subjective Cognitive Decline (Academic Use) Post

Hi everyone,

I’m working on an academic project focused on Subjective Cognitive Decline (SCD) in menopausal women, using machine learning and explainable AI techniques.

While reviewing prior work, I found the paper “Clinical-Grade Hybrid Machine Learning Framework for Post-Menopausal subjective cognitive decline” particularly helpful. The hybrid ML approach and the focus on post-menopausal sleep-related health conditions closely align with the direction of my research.

Project overview (brief):

Machine learning–based risk prediction for cognitive issues in menopausal women

Use of Explainable AI (e.g., SHAP) to interpret contributing factors

Intended strictly for academic and educational purposes

Fully anonymous — no personally identifiable information is collected or stored

Goal is awareness and early screening support, not clinical diagnosis

submitted by /u/Small-Day-8755

A European Database Of Ecological Restoration

submitted by /u/cavedave

Emotions Dataset: 14K Texts Tagged With 7 Emotions (NLP / Classification)

About Dataset –

https://www.kaggle.com/datasets/prashanthan24/synthetic-emotions-dataset-14k-texts-7-emotions

Overview
High-quality synthetic dataset with 13,970 text samples labeled across 7 emotions (Anger, Happiness, Sad, Surprise, Hate, Love and Fun). Generated using Mistral-7B for diverse, realistic emotion expressions in short-to-medium texts. Ideal for benchmarking NLP models like RNNs, BERT, or LLMs in multi-class emotion detection.

Sample
Text: “John clenched his fists, his face turning red as he paced back and forth in the room. His eyes flashed with frustration as he muttered under his breath about the latest setback at work.”

Emotion: Anger

Key Stats

Rows: 13970
Columns: text, emotion
Emotions: 7 balanced classes
Generator: Mistral-7B (synthetic, no PII/privacy risks)
Format: CSV (easy import to Kaggle notebooks)

Use Cases

Train/fine-tune emotion classifiers (e.g., DistilBERT, LSTM)
Compare traditional ML vs. LLMs (zero-shot/few-shot)
Augment real datasets for imbalanced classes
Educational projects in NLP/sentiment analysis

Notes
Fully synthetic: labels auto-generated via LLM prompting for consistency. Check for duplicates/biases before heavy use. Pairs well with emotion notebooks!
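
The duplicate and bias check suggested in the notes could start with something like the following sketch. It only assumes the two columns listed in the post (text, emotion); the helper name, the normalization (trim plus lowercase), and the tiny in-memory sample are illustrative assumptions, not part of the dataset.

```python
import csv
import io
from collections import Counter


def audit_rows(rows):
    """Count repeated texts (after trimming and lowercasing) and label totals."""
    seen = set()
    duplicates = 0
    labels = Counter()
    for row in rows:
        key = row["text"].strip().lower()
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        labels[row["emotion"]] += 1
    return duplicates, labels


# Tiny in-memory stand-in for the real CSV; pass an opened file instead.
sample_csv = io.StringIO(
    "text,emotion\n"
    "I can't believe it!,Surprise\n"
    "i can't believe it!,Surprise\n"
    "What a lovely day.,Happiness\n"
)
duplicates, labels = audit_rows(csv.DictReader(sample_csv))
print(duplicates, dict(labels))
```

Exact-match checks like this only catch the easy cases. Near-duplicates, which are common in LLM-generated text, would need fuzzy matching or embedding similarity.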

submitted by /u/prashanthpavi

Any Good Sources Of Free Verbatim / Open-text Datasets?

Hi all,

I’m trying to track down free / open datasets that contain real human open-ended responses (open ends) for testing and research. I have tried generating them with AI, but the results just don’t capture the nuance of a real market research project.

If anyone knows of good public sources, I’d really appreciate being pointed in the right direction.

Thanks!

submitted by /u/472826

Best Way To Pull Twitter/X Data At Scale Without Getting Rate Limited To Death?

Been trying to build a dataset of tweets for a research project (analyzing discourse patterns around specific topics) and the official X API is basically unusable unless you want to drop $5k+/month for reasonable limits.

I’ve tried a few different approaches:

Official API → rate limits killed me immediately
Manual scraping → got my IP banned within a day
Some random npm packages → half of them are broken now

Found a breakdown comparing different methods and it actually explained why most DIY scrapers fail (anti-bot stuff has gotten way more aggressive lately). Makes sense why so many tools just stopped working after Elon’s changes.

Anyone here working with Twitter data regularly? What’s actually reliable right now? Need something that can pull ~50k tweets/day without constant babysitting.

Not trying to do anything shady – just need public tweet text, timestamps, and basic engagement metrics for academic analysis.

submitted by /u/Technical_Fee4829

I Am Looking To Buy Instagram Influencer Data.

Are you sitting on a compiled Instagram creator database with depth beyond just handles?

I’m looking to buy a dataset outright that includes:

Instagram handle
District / city
State
Phone number
Email

Creator range: nano / micro influencers
Geo focus: South India

This is a clean purchase, not rev-share, not scraping on demand, not ongoing work.
If you already have the data, we can close quickly.

If interested, DM with:

Approx record count
Fields available
Price expectation

Only reaching out to people with ready data at this depth.

submitted by /u/mined_it

How To Get DFDC Dataset Access ?? Is The Website Working???

I’m working on a deepfake research paper and trying to get access to the DFDC dataset, but for some reason the official DFDC website isn’t working. Is it because I haven’t been granted access? Is there any other way I can get my hands on the dataset?

submitted by /u/Ok_Concert6723

Where Do People Find Usable Social Interaction Datasets These Days?

I’m interested in datasets around engagement, posts, comments, or interaction graphs.

A lot of older datasets seem outdated or incomplete now.
Are there still good public sources, or is most social data now API/scrape-only?

submitted by /u/crowpng

I Fine-tuned LLaMA 3.2 1B Brazilian Address Parser — Looking For Honest Feedback

Recently, I posted here on Reddit asking for ideas on what I could build with a dataset of ~2 million pairs of messy/clean Brazilian addresses. A few kind folks shared some great suggestions, and one idea that really stood out was building an address parser.

That pushed me into the world of LLM fine-tuning for the first time.

I decided to partially fine-tune LLaMA 3.2 1B, focusing specifically on address normalization and field extraction (address, complement, neighborhood, city, state, country, coordinates, etc.). Surprisingly, the early results look quite promising.

To properly evaluate it, I also built a small API to:

Run inference tests
Perform post-inference validation
Compute a confidence score based on consistency checks (postal code, city/state match, field presence, etc.)

Below is an example request body and the corresponding response.

Request

{
  "inputs": [
    "quadra -42.93386179 quadra arse 102 alameda 12 a, 5045 77023-582 brasil -21.26567258 palmas",
    "torre -43.02525939 bela vista 5 brasil minas gerais são joão do paraíso beco do pôr do sol, 4289 -19.14142529"
  ]
}

Response

[
  {
    "address": "Quadra Arse 102 Alameda 12 A, 5045",
    "complement": "quadra",
    "city": "Palmas",
    "country": "Brasil",
    "postal_code": "77023-582",
    "latitude": "-21.26567258",
    "longitude": "-42.93386179",
    "confidence": 1.0,
    "validation": {
      "postal_code_validation": {
        "is_valid": true,
        "found_in_input": true,
        "city_match": true
      },
      "field_validation": {
        "address_found": true,
        "complement_found": true,
        "neighborhood_found": false,
        "city_found": true,
        "state_found": false,
        "country_found": true
      }
    }
  },
  {
    "address": "Beco Do Pôr Do Sol, 4289",
    "complement": "torre",
    "neighborhood": "Bela Vista 5",
    "city": "São João Do Paraíso",
    "state": "Minas Gerais",
    "country": "Brasil",
    "latitude": "-19.14142529",
    "longitude": "-43.02525939",
    "confidence": 0.92,
    "validation": {
      "postal_code_validation": {
        "is_valid": false
      },
      "field_validation": {
        "address_found": true,
        "complement_found": true,
        "neighborhood_found": true,
        "city_found": true,
        "state_found": true,
        "country_found": true,
        "city_in_state": false,
        "neighborhood_in_city": false
      }
    }
  }
]
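
The post does not say how the confidence score is computed. Purely as an illustration of scoring from consistency checks, the sketch below takes the fraction of boolean checks in a validation object that passed. The formula and function name are assumptions, and it will not reproduce the 1.0 and 0.92 values shown in the response.

```python
def confidence_from_validation(validation):
    """Hypothetical score: the fraction of boolean checks that are True."""
    checks = [
        value
        for group in validation.values()
        for value in group.values()
        if isinstance(value, bool)
    ]
    return round(sum(checks) / len(checks), 2) if checks else 0.0


# Validation object copied from the second result in the response.
validation = {
    "postal_code_validation": {"is_valid": False},
    "field_validation": {
        "address_found": True,
        "complement_found": True,
        "neighborhood_found": True,
        "city_found": True,
        "state_found": True,
        "country_found": True,
        "city_in_state": False,
        "neighborhood_in_city": False,
    },
}
score = confidence_from_validation(validation)
print(score)
```

An unweighted fraction treats a missing complement the same as a city that does not belong to its state. A weighted variant, with cross-field checks counting more than field presence, would probably track real parsing errors more closely.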

I’d really appreciate honest feedback from people more experienced with:

Fine-tuning small LLMs
Address parsing / entity extraction
Post-inference validation strategies
Confidence scoring approaches

Does this look like a reasonable direction for a 1B model?
Anything you’d improve architecturally or evaluation-wise?

Thanks in advance — this project has been a great learning experience so far 🙏

submitted by /u/Hour-Dirt-8505

I Scraped 48k Court Filings To Find Verified B2B Ideas. Here Are 3 Niches Bleeding Money Right Now (Steal These)

submitted by /u/Ogretribe

[FREE DATASET] 67K+ Domains With Technology Fingerprints

This dataset contains information on what technologies were found on domains that were crawled in December 2025.

A few common use cases for this type of data

You’re a developer who has built a particular solution for a client, and you want to replicate your success by finding more leads based on that client’s profile. For example, find me all electrical wholesalers using WordPress that have a `.com.au` domain.
You’re performing market research and you want to see who is already paying your competitors. For example, find me all companies using my competitor’s product who are also paying for enterprise technologies (which indicates high technology expenditure).
You’re a security researcher who is evaluating the impact of your findings. For example, give me all sites running a particular version of a WordPress plugin.

The 67K domain dataset can be found here: https://www.dropbox.com/scl/fi/d4l0gby5b5wqxn52k556z/sample_dec_2025.zip?rlkey=zfqwxtyh4j0ki2acxv014ibnr&e=1&st=xdcahaqm&dl=0

The full 5M+ domains can be purchased for 99 USD at: https://versiondb.io/

VersionDB’s WordPress catalogue can be found here: https://versiondb.io/technologies/wordpress/

Enjoy!

submitted by /u/Upper-Character-6743

A Workflow For Generating Labeled Object-detection Datasets Without Manual Annotation (experiment / Feedback Wanted)

I’m experimenting with using prompt-based object detection (open-vocabulary / vision-language models) as a way to auto-generate training datasets for downstream models like YOLO.

Instead of fixed classes, the detector takes any text prompt (e.g. “white Toyota Corolla”, “people wearing safety helmets”, “parked cars near sidewalks”) and outputs bounding boxes. Those detections are then exported as YOLO-format annotations to train a specialized model.
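
For readers unfamiliar with the export step, a YOLO label file holds one line per object: a class index followed by the box centre, width, and height, all normalized by the image size. The sketch below shows that conversion. It assumes the detector returns pixel corner coordinates (x_min, y_min, x_max, y_max), which the post does not specify, and the function name and class index 0 are placeholders.

```python
def to_yolo_line(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical detection on a 640x480 image.
line = to_yolo_line(0, (100, 200, 300, 400), image_width=640, image_height=480)
print(line)  # 0 0.312500 0.625000 0.312500 0.416667
```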

Observations so far:

Detection quality is surprisingly high for many niche or fine-grained prompts
Works well as a bootstrapping or data expansion step
Inference is expensive and not suitable for real-time use. This is strictly a dataset creation / offline pipeline idea

I’m trying to evaluate:

How usable these auto-generated labels are in practice
Where they fail compared to human-labeled data
Whether people would trust this for pretraining or rapid prototyping

Demo / tool I’m using for the experiment (don’t abuse it; it will crash if bombarded with requests):

Detect Anything

I’m mainly looking for feedback, edge cases, and similar projects. If you’ve tried similar approaches before, I’d be very interested to hear what worked (or didn’t).

submitted by /u/eyasu6464

Snipper: An Open-source Chart Scraper And OCR Text+table Data Gathering Tool [self-promotion]

I was a heavy automeris.io (WebPlotDigitizer) user until the v5 version. Somewhat inspired by it, I’ve been working on a combined chart snipper and OCR text+table sampler. Desktop rather than web-based and built using Python, Tesseract, and OpenCV. MIT licensed. Some instructions to get started in the README.

Chart snipping should be somewhat familiar to automeris.io users but it starts with a screengrab. The tool is currently interactive but I’m thinking about more automated workflows. IMO the line detection is a bit easier to manage than it is in automeris with just a sequence of clicks but you can also drag individual points around. Still adding features and support for more chart types, better x-axis date handling etc. The Tkinter GUI has some limitations (e.g., hi-res screen support is a bit flaky) but is cross-platform and a Python built-in. Requests welcome.

submitted by /u/foldedcard

Where Can I Buy High Quality/unique Datasets For Model Training?

I am looking for platforms with listings of commercial/proprietary datasets. Any recommendations where to find them?

submitted by /u/Latter-Gift630

Track Any Topic Across The Internet And Get Aggregated, Ranked Results From Multiple Sources In One Place

submitted by /u/MickolasJae

[PAID] Global Car Specs & Features Dataset (1990-2025) – 12,000 Variants, 100+ Brands

I compiled and structured a global automotive specifications dataset covering more than 12,000 vehicle variants from over 100 brands, model years 1990-2025.

Each record includes:

Brand, model, year, trim
Engine specifications (fuel type, cylinders, power, torque, displacement)
Dimensions (length, width, height, wheelbase, weight)
Performance data (0-100 km/h, top speed, CO2 emissions, fuel consumption)
Price, warranty, maintenance, total cost per km
Feature list (safety, comfort, convenience)

Available in CSV, JSON, and SQL formats. Useful for developers, researchers, and AI or data analysis projects.

GitHub (sample, details and structure):

https://github.com/vbalagovic/cars-dataset

submitted by /u/Ok_Cucumber_131

Harris County (TX) Parcel-level Real Estate Dataset

Clean, analysis-ready Harris County (TX) parcel-level real estate dataset.
Fully documented, GIS-ready, delivered in Parquet format.
Perfect for analytics, GIS, and data science workflows.

#realestate #HarrisCounty #Texas #GIS #parceldata #dataset #Parquet #opendata #HCAD #propertyrecords #datascience #analytics #geospatial

submitted by /u/ThorImagery

Looking For Advice On Pricing And Selling Smart Home Telemetry Data (EU)

Hi guys,

We’re a young company based in Europe and collect a significant amount of telemetry data from smart home devices in residential houses (e.g. temperature, energy consumption, usage patterns).

We believe this data could be valuable for companies across multiple industries (energy, proptech, insurance, analytics, etc.). However, we’re still quite new to the data monetization topic and are trying to better understand:

How to price such data (typical models, benchmarks, CPMs, subscriptions, etc.)
Who the realistic buyers might be
What transaction volumes or market sizes to expect
Where data like this is usually sold (marketplaces, direct sales, partnerships)

Where would you recommend starting to learn about this? Are there resources, communities, marketplaces, or frameworks you’ve found useful? First-hand experiences are especially welcome.

Thanks a lot for any help!

submitted by /u/Intelligent_Offer954

[Self-Release] 65 Hours Of Kenyan/Filipino English Dialogue | Split-Track WebRTC | VAD-Segmented

Hi all,

I’m the Co-founder of Datai. We are releasing a 65-hour dataset of spontaneous, two-speaker dialogues focused on Kenyan (KE) and Filipino (PH) English accents.

We built this to solve a specific internal problem: standard datasets (like LibriSpeech) are too clean. We needed data that reflects WebRTC/VoIP acoustics and non-Western prosody.

We are releasing this batch on Hugging Face for the community to use for ASR benchmarking, accent robustness testing, or diarization experiments.

The Specs:

Total Duration: ~65 hours (Full dataset is 800+ hours)
Speakers: >150 (Majority Kenyan interviewees, ~15 Filipino interviewers)
Topic: Natural, unscripted day-to-day life conversations.
Audio Quality: Recorded via WebRTC in Opus 48kHz, transcoded to pcm_s16le.
Structure: Split-track (Stereo). Each speaker is on a separate track.

Processing & Segmentation: We processed the raw streams using silero-vad to chunk audio into 1 to 30-second segments.

File/Metadata Structure: We’ve structured the filenames to help with parsing: ROOM-ID_TRACK-ID_START-MS_END-MS

ROOM-ID: Unique identifier for the conversation session.
TRACK-ID: The specific audio track (usually one speaker per track).
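
A minimal parsing sketch for the filename pattern above. The post does not show a real filename, so the example name is hypothetical, and the code assumes the file extension has already been stripped and that TRACK-ID contains no underscores.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    room_id: str
    track_id: str
    start_ms: int
    end_ms: int


def parse_segment_name(stem):
    """Split ROOM-ID_TRACK-ID_START-MS_END-MS into typed fields.

    Splitting from the right keeps a ROOM-ID that itself contains
    underscores intact; TRACK-ID is assumed to contain none.
    """
    room_id, track_id, start_ms, end_ms = stem.rsplit("_", 3)
    return Segment(room_id, track_id, int(start_ms), int(end_ms))


# Hypothetical name; real identifiers will look different.
seg = parse_segment_name("room42_trackA_15000_27350")
duration_s = (seg.end_ms - seg.start_ms) / 1000
print(seg, duration_s)
```

Grouping segments by (room_id, track_id) rather than by track_id alone fits the guidance to treat track IDs as session-specific.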

Technical Caveat (the edge case): Since this is real-world WebRTC data, we are transparent about the dirt in the data: If a speaker drops connection and rejoins, they may appear on a new TRACK-ID within the same ROOM-ID. We are clustering these in v2, but for now, treat Track IDs as session-specific rather than global speaker identities.

Access: The dataset is hosted on Hugging Face (gated to prevent bots/abuse, but I approve manual requests quickly).

Link is in the comments.

submitted by /u/Downtown_Valuable_44

I Put Together A Dataset That Might Be Useful For Researchers

I’ve been working on a side project and ended up compiling a dataset that may be useful beyond what I originally needed it for, so I’m considering releasing it publicly.

At a high level, the dataset contains:

structured records collected over a multi-year period
consistent timestamps and identifiers
minimal preprocessing (basic cleaning + deduplication only)

It’s not tied to a specific paper or product, more something that could support exploratory analysis, modeling, or benchmarking, depending on the use case.

Before publishing, I wanted to sanity-check with this community:

what details do you usually look for to judge dataset quality?
is light preprocessing preferred, or raw + processed versions?
anything that would immediately make this more usable for research?

Happy to share more specifics if there’s interest, and open to feedback before release.

submitted by /u/crowpng

Looking For CPAs In The USA – Available To Purchase Or How To Scrape?

Does anyone have access to current lists of CPAs in the US? Or ideas on the best way to scrape this information?

submitted by /u/jeremydy

How Can I Learn DS/DA From Scratch To Stand Out In The Highly Competitive Market?

Hello, I am currently studying data analytics and data science, and I want to focus on one of these two fields. Given the high competition in the market and the negative impact of artificial intelligence on the field, should I start, or choose another field? What exactly do I need to know and learn to stand out in the DA/DS job market and find a job more easily? There is so much information on the Internet that I can’t find a clear learning path. Recommendations from professionals in this field are very important to me. Is it worth studying this field, and if so, how? Thank you very much.

submitted by /u/No_Staff_7246

Looking For S&P 500 (GICS Information Technology Sector) Dataset: Revenue, Net Income & R&D Expenses (Excel/CSV)

Hi everyone,

I’m a master’s student working on academic research and I’m looking for a compiled dataset for S&P 500 companies that includes:

– Revenue

– Net Income (profit)

– R&D expenses (I know some companies don’t report them)

Ideally:

– Annual data

– Multiple years (e.g. 2010–2024, but flexible)

– Excel or CSV format

This is strictly for non-commercial, academic use (master’s thesis).

If anyone already has this dataset (e.g. from Compustat / Capital IQ / Bloomberg) and is willing to share or point me in the right direction, I’d really appreciate it.

Thanks a lot!

submitted by /u/SuddenBookkeeper6351

Built A Multi-Source Knowledge Discovery API (arXiv, GitHub, YouTube, Kaggle) — Looking For Feedback

Please support this project with a donation ❤️. Thank you!

submitted by /u/Appropriate_West_879

[Dataset] An Open-source Image-prompt Dataset

Sharing a new open-source (Apache 2.0) image-prompt dataset. Lunara Aesthetic is an image dataset generated using our sub-10B diffusion mixture architecture, then curated, verified, and refined by humans to emphasize aesthetic and stylistic quality.

https://huggingface.co/datasets/moonworks/lunara-aesthetic

submitted by /u/paper-crow

Category: Datatards
