Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Best Way To Market & Price 280k Cannabis Consumer Records (80% NY State)?

I’ve got a cleaned, permissioned dataset from a prior cannabis retail business: ~278–282k consumer profiles with purchase history (SKUs bought, frequency, spend bands), product preferences, timestamps, and opt-in/consent records.

Geographic split: ~80% of profiles are from New York State, ~20% from other U.S. states (with compliant, adult-use purchase history). All profiles granted permission for their data to be used/sold when collected.

I’m looking for real-world advice on:

  1. Where to list/sell: reputable data marketplaces or brokers (LiveRamp, Snowflake, AvocaData, direct brokers)?
  2. Buyer types: who actually pays for this kind of cannabis purchase-behavior data (brands, MSOs, dispensaries, distributors, ad platforms, analysts)?
  3. Compliance checks: what proof of consent, CCPA/CPRA, NY State privacy compliance, opt-out mechanisms, and audit trails do buyers need to see?
  4. Data format: hashed identifiers vs. plaintext PII, sample rows, schema, enrichment. What do buyers prefer?
  5. Pricing ballpark: per-profile, per-record, or subscription models you’ve seen for transactional consumer datasets in a regulated industry?
  6. State-specific issues: given that most data is NY-based, are there particular ad/marketing restrictions I should disclose?

What I can provide to vetted buyers right away:

• Schema + 100-row sample (no PII in public sample).

• Consent logs (timestamps and collection language).

• Basic enrichment (ZIP, age bands, spend tiers).

• Delivery via hashed identifiers (SHA256/HMAC) or raw CSV depending on buyer preference.

• NDA + data use agreement and proof of secure hosting (S3/private transfer).
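For reference, the two hashed-identifier options in the list above could look roughly like this. It is a minimal sketch, assuming the identifier being hashed is an email address; the function names and the normalization rule are illustrative and not part of any buyer's or marketplace's specification.

```python
import hashlib
import hmac


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so equivalent addresses hash identically."""
    return email.strip().lower()


def sha256_id(email: str) -> str:
    """Plain SHA-256 digest: anyone holding the same email can reproduce it."""
    return hashlib.sha256(normalize_email(email).encode("utf-8")).hexdigest()


def hmac_id(email: str, secret_key: bytes) -> str:
    """Keyed HMAC-SHA256 digest: reproducible only by holders of secret_key."""
    message = normalize_email(email).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()
```

The practical difference is matchability: a plain SHA-256 of an email can be joined against any other list hashed the same way, while an HMAC digest can only be matched by a party that has been given the key.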

Would love to hear from anyone who has bought or sold similar datasets: specific marketplaces, broker contacts, or pricing ranges you’d recommend. Also open to intros to compliance/legal shops that pre-audit datasets for data buyers; I know that speeds up the sales process and boosts valuation.

Thanks! I want to do this cleanly and legally, especially with the NY-heavy dataset. DM or comment if you’ve got leads.

submitted by /u/Fun_Ad7909

Help Downloading MOLA In-Car Dataset (File Too Large To Download Due To Limits)

Hi everyone,

I’m currently working on a project related to violent action detection in in-vehicle scenarios, and I came across the paper “AI-based Monitoring Violent Action Detection Data for In-Vehicle Scenarios” by Nelson Rodrigues. The paper uses the MOLA In-Car dataset, and the link to the dataset is available.

The issue is that I’m not able to download the dataset because of a file size restriction (around 100 MB limit on my end). I’ve tried multiple times but the download either fails or gets blocked.

Could anyone here help me with:

  • A mirror/alternative download source, or
  • A way to bypass this size restriction, or
  • If someone has already downloaded it, guidance on how I could access it?
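One possible workaround for a per-transfer size limit is to fetch the file in pieces with HTTP Range requests and stitch the pieces together locally. The sketch below assumes the server honours Range requests (not verified for this repository) and that the total file size is already known, for example from the dataset page; `download_in_chunks` and its parameters are hypothetical names.

```python
import urllib.request
from typing import Iterator, Tuple


def byte_ranges(total_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte offsets that cover total_size bytes."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end
        start = end + 1


def download_in_chunks(url: str, dest: str, total_size: int,
                       chunk_size: int = 50 * 1024 * 1024) -> None:
    """Fetch url piece by piece and write the pieces to dest in order."""
    with open(dest, "wb") as out:
        for start, end in byte_ranges(total_size, chunk_size):
            request = urllib.request.Request(
                url, headers={"Range": f"bytes={start}-{end}"}
            )
            with urllib.request.urlopen(request) as response:
                # 206 Partial Content means the server honoured the range.
                if response.status != 206:
                    raise RuntimeError(
                        f"Range request not honoured (status {response.status})"
                    )
                out.write(response.read())
```

With the default 50 MB chunk size, each individual request stays under a limit of roughly 100 MB.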

This is strictly for academic research use. Any help or pointers would be hugely appreciated 🙏

Thanks in advance!

This is the link to the website: https://datarepositorium.uminho.pt/dataset.xhtml?persistentId=doi:10.34622/datarepositorium/1S8QVP

Please help me, guys.

submitted by /u/Scared-Material4044

Real Estate Data API [PAID] Questions

I’ve built an API called AlyProp that delivers 70+ data points per property (ownership, valuation, taxes, zoning, comps, etc.) pulled from public records.

Right now, my pricing looks like this:

  • $29.99 → 1,000 property lookups (~3¢ each)
  • $100 → 10,000 property lookups (~1¢ each)

Since it costs me about 1¢ per property to provide, I’m trying to figure out the best way to position it:

  • Do analysts/developers prefer smaller tiers (like $5–10/month), or do you only work with bulk datasets?
  • Does anyone who works with or sells data sell through APIs, or is it only bulk datasets? Should I transition to selling entire datasets?
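A quick check of the unit economics behind those two tiers, as a sketch that assumes the roughly 1¢ cost is flat per lookup and ignores fixed costs; the tier labels and the `tier_economics` helper are made up for illustration.

```python
from decimal import Decimal

COST_PER_LOOKUP = Decimal("0.01")  # "about 1¢ per property", as stated above

# (price in dollars, number of lookups); the labels are hypothetical
TIERS = {
    "starter": (Decimal("29.99"), 1_000),
    "bulk": (Decimal("100"), 10_000),
}


def tier_economics(price: Decimal, lookups: int,
                   cost: Decimal = COST_PER_LOOKUP):
    """Return (price per lookup, gross margin in dollars) for one tier."""
    unit_price = price / lookups
    margin = price - cost * lookups
    return unit_price, margin


for name, (price, lookups) in TIERS.items():
    unit_price, margin = tier_economics(price, lookups)
    print(f"{name}: ${unit_price} per lookup, ${margin} gross margin")
```

Under that assumption the $29.99 tier leaves $19.99 after costs, while the $100 tier only breaks even, since 10,000 lookups at 1¢ each cost exactly $100.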

submitted by /u/AlyProp

Open Dataset: 40M GitHub Repositories (2015–mid-Jul 2025) + 1M Sample + Quickstart Notebook

I made an open dataset of 40M GitHub repositories.

I have been playing with GitHub data for a long time, and I noticed there are almost no public full dumps with repository metadata: BigQuery gives ~3M with trimmed fields, and the GitHub API hits rate limits fast. So I collected what I was missing and decided to share; maybe it will make someone’s life easier. The write-up explains the details.

How I built it (short): GH Archive → joined events → extracted repository metadata. The snapshot covers 2015 → mid-July 2025.
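To give a feel for the "events → repository metadata" step, here is a generic sketch of folding newline-delimited event JSON into one record per repository. It is not the pipeline used for this dataset (the write-up covers that); it only assumes that each event carries `repo.id`, `repo.name`, and an ISO 8601 `created_at`, and the output field names are invented for the example.

```python
import json
from typing import Dict, Iterable


def latest_repo_records(lines: Iterable[str]) -> Dict[int, dict]:
    """Fold event JSON lines into one summary record per repository id."""
    repos: Dict[int, dict] = {}
    for line in lines:
        event = json.loads(line)
        repo = event["repo"]
        seen = repos.get(repo["id"])
        if seen is None:
            repos[repo["id"]] = {
                "name": repo["name"],
                "last_event_at": event["created_at"],
                "event_count": 1,
            }
            continue
        seen["event_count"] += 1
        # ISO 8601 UTC timestamps sort correctly as plain strings.
        if event["created_at"] >= seen["last_event_at"]:
            seen["name"] = repo["name"]  # keep the most recent name (renames)
            seen["last_event_at"] = event["created_at"]
    return repos
```

Richer fields such as stars or license would have to come from event payloads, which this sketch does not touch.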

What’s inside

  • 40M repos in full + a 1M sample for a quick try;
  • fields: language, stars, forks, license, short description, description language, open issues, last PR index at snapshot date, size, created_at, etc.;
  • “alive” data with gaps, categorical/numeric features, dates and short text — good for EDA and teaching;
  • a Jupyter notebook for quick start (basic plots).

Links

Who may find it useful
Students, teachers, juniors — for mini-research, visualizations, search/cluster experiments. Feedback is welcome.

submitted by /u/Fabulous_Pollution10

Looking For Methodology To Handle Legal Text Data Worth 13 GB

I have collected 13 GB of legal text data (court transcripts and law books), and I want to make it usable for LLM training and benchmarking. I am looking for a methodology to curate this data. If any of you are aware of GitHub repos or libraries that could be helpful, pointers would be much appreciated.
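As a starting point, a common first curation pass is text normalization followed by exact deduplication. The sketch below shows only that one step under simple assumptions (documents already extracted to plain strings, a hypothetical `min_chars` threshold); it is not a complete curation methodology and does not cover near-duplicate detection or quality filtering.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def dedupe(docs: Iterable[str], min_chars: int = 200) -> Iterator[str]:
    """Yield normalized documents, skipping very short ones and exact repeats."""
    seen = set()
    for doc in docs:
        clean = normalize(doc)
        if len(clean) < min_chars:
            continue
        # Hash a case-folded copy so case-only variants count as duplicates.
        digest = hashlib.sha256(clean.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield clean
```

Storing only digests keeps memory use modest even for a corpus of this size, because the full text of earlier documents is never retained.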

Also, if there are any research papers that could help with this, please suggest them. I am hoping to submit this work to a conference or journal.

Thank you in advance for your responses.

submitted by /u/Fit-Musician-8969

Transcripts For All Apple September Keynotes?

I’d like to get the transcripts for all Apple Keynotes (the September ones) since 1998. I was hoping to play with this dataset and get fun data nuggets.

But I can only find the transcripts for the last three (as they were auto-generated on YouTube). The other videos are on YouTube, but without transcripts.

I can’t believe they are not stored somewhere on the Internet… does anyone have any tip or suggestion?

submitted by /u/TypeUnique8960

Help Us Build A Heart Sound Dataset (Normal & Abnormal)

Dear all,

I am conducting a personal research project focused on testing a system for heart sound analysis. To properly evaluate this system, I am seeking volunteers to provide short recordings of their heart sounds via phone.

Eligibility

  • Participants must be 18 years or older.
  • Participation is voluntary and can be withdrawn at any time.

What is needed

  • Two categories of recordings:
    • 🫀 Normal heart sounds
    • 💔 Murmur/abnormal heart sounds (murmur, extra_systole, extra_heart_sound)
  • Recording device: your smartphone microphone (no stethoscope required).
  • Duration: approximately 10–15 seconds.
  1. Place the phone close to your chest (apical area of the heart); see the linked instructions.
  2. Record for 10–15 seconds.
  3. Save the file (WAV or MP3 preferred, but any common format is acceptable).
  4. Label the recording as normal or abnormal (if abnormal, specify whether it is murmur, extra_systole, or extra_heart_sound).
  5. Upload the recording at the given link.

Thank you!

submitted by /u/Comprehensive-Rest90

Where And How To Sell Small Synthetic Datasets?

I’m curious: is there a marketplace where individuals can sell small synthetic datasets (500–1,000 lines)? For example, synthetic datasets about emotional nuance in text, annotated by emotion, intensity, tone, register, and context, and hand-checked by a mental health practitioner? And can anyone sell datasets, or do you have to be a developer to know what you’re doing/selling? Thank you in advance for your help!

submitted by /u/True-magic-22

Seeking Open Public Medical Datasets For LLM Finetuning

Good evening, community. This is my first post; if I break a rule, please let me know.

I’m working on MedeX v25.8.3, a clinical assistant aimed at professional use with an educational mode. I’m looking for public, open medical datasets for finetuning.

Ideal traits: clear licenses, solid annotations, documented pipelines, population diversity, common formats (CSV/JSON/DICOM), and standard benchmarks/splits.

Disclosure: I’m the developer of MedeX. I’ll add the repo in the first comment if the sub allows.

submitted by /u/DeepRatAI

Looking For (US R1) Longitudinal Faculty Dataset

I’m looking for pointers to one or more datasets that have some or all of the following data:

  • Faculty name (tenure track only)
  • Current professional title/designation
  • Department employed
  • Name of the university/academic employer
  • Degree-granting department and institution (PhD, Masters, and undergraduate degrees, as applicable)
  • Year of degree (PhD, Masters, and undergraduate degrees)
  • Current employment start year
  • Other academic employment history (e.g., department, start and end dates of previous post-PhD employment)

It would be really nice if longitudinal data (every academic year) were also available for these items. In addition, data about non-tenure-track faculty appointments would also be nice, but is not necessary.

I’m looking for something similar (but expanded in terms of scope) to the dataset used in this paper.

I’m aware that AARC could be a potential data source but I’ve been told it’s not trivial to get data access through them, so looking for alternatives.

Alternatively, would also appreciate if anyone can point me to ways to scrape (at least some of) this data from university directories.

Thanks in advance!

submitted by /u/Timely-Ad2743

Free [Synthetic] Datasets For AI Model Tuning [self-promotion]

I run a synthetic data platform called DataCreator AI that helps AI professionals and businesses generate customized datasets.

Along with these capabilities, we offer a section called Community Datasets where we post datasets for free.

Some of the current free datasets we have are:

  • A dataset to perform Direct Preference Optimization to reduce sycophancy of LLMs.
  • A dataset that contains structured multi-turn conversations between patients and customer service agents at hospitals.
  • A dataset with a collection of random facts from various topics like biology and astronomy.
  • Classification and Question-Answer Datasets.

Your feedback would be of huge help to me to come up with more useful datasets. If you have any specific dataset ideas, please let me know in the comments so that we can put up more of them.

submitted by /u/Routine-Sound8735

Complete Powerball & Mega Millions Draw + Winners Dataset

I’m working on a data project and need a more complete dataset for Powerball and Mega Millions than what’s usually available on sites like lotteryusa or state lottery pages.

Most public datasets just have the draw date and winning numbers, but I need all the columns, specifically things like:

  • Draw date & draw number
  • Winning numbers + Powerball/Mega Ball
  • Power Play / Megaplier multiplier
  • Jackpot amount (annuity & cash value)
  • Number of winners by tier (match 5, 4+PB, etc.)
  • Power Play winners by tier
  • State-by-state winner breakdown (if available)
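Written out as a record type, the columns listed above might look like the following. This is only a sketch of a target schema; every field name here is hypothetical and does not correspond to any official lottery export.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class DrawResult:
    """One draw of Powerball or Mega Millions with its full results table."""
    game: str                       # "powerball" or "megamillions"
    draw_date: date
    draw_number: Optional[int]      # may be missing for older draws
    winning_numbers: List[int]      # the main numbers
    special_ball: int               # Powerball or Mega Ball
    multiplier: Optional[int]       # Power Play / Megaplier, if drawn
    jackpot_annuity: Optional[int]  # dollars
    jackpot_cash: Optional[int]     # dollars
    winners_by_tier: Dict[str, int] = field(default_factory=dict)
    multiplier_winners_by_tier: Dict[str, int] = field(default_factory=dict)
    winners_by_state: Dict[str, int] = field(default_factory=dict)
```

Keeping the tier and state breakdowns as dictionaries lets one type cover draws where those tables were never published.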

Basically, the full official results table that the lotteries publish after each draw, not just the numbers themselves.

I haven’t been able to find a historical dataset with all of this.

Does anyone know if this exists publicly, or will I need to scrape it directly from Powerball.com / MegaMillions.com (or individual state sites)? If scraping is the way to go, I’d love any tips on best practices for this since the data spans back to the ’90s.

submitted by /u/b2bdemand

Requesting Supply Chain Dataset For Academic Research

I am conducting academic research on supplier evaluation and selection using machine learning as part of my postgraduate work. For this, I am seeking access to supplier-related datasets that include features such as unit price, product availability, order quantities, revenue generated, stock levels, lead times, shipping times, shipping costs, shipping carriers, supplier location, production volumes, manufacturing lead times, manufacturing costs, defect rates, transportation modes, and overall procurement costs. The data will be used strictly for academic purposes, and any confidential or sensitive information will be anonymized. Access to such data would greatly enhance the reliability of my research and contribute to building a practical decision-support framework for procurement systems.
If these features are not all available, any related dataset will do. Please, I really need the dataset.

submitted by /u/BackgroundFar8017

Survey For A Data Marketplace | For Anyone Looking To Earn From Data

I’m in the process of developing a marketplace to sell data because I feel like there is no simple marketplace to facilitate selling data, especially for subscriptions, and I really wanted the community’s opinions. If you have data, are interested in selling data, etc., an entry would be appreciated. The survey has been checked by the mods, and emails are not collected.

Here is the link: https://forms.gle/xNp7a7vEEioa7vrE8

submitted by /u/daviddosm8

New Analyst Building A Portfolio While Job Hunting: What Datasets Actually Show Real-world Skill?

I’m a new data analyst trying to land my first full-time role, and I’m building a portfolio and practicing for interviews as I apply. I’ve done the usual polished datasets (Titanic/clean Kaggle stuff), but I feel like they don’t reflect the messy, business-question-driven work I’d actually do on the job.

I’m looking for public datasets that let me tell an end-to-end story: define a question, model/clean in SQL, analyze in Python, and finish with a dashboard. Ideally something with seasonality, joins across sources, and a clear decision or KPI impact.

Datasets I’m considering:

  • NYC TLC trips + NOAA weather to explain demand, tipping, or surge patterns
  • US DOT On-Time Performance (BTS) to analyze delay drivers and build a simple ETA model
  • City 311 requests to prioritize service backlogs and forecast hotspots
  • Yelp Open Dataset to tie reviews to price range/location and detect “menu creep” or churn risk
  • CMS Hospital Compare (or Medicare samples) to compare quality metrics vs readmission rates

For presentation, is a repository containing a clear README (business question, data sources, and decisions), EDA/modeling notebooks, a SQL folder for transformations, and a deployed Tableau/Looker Studio link enough? Or do you prefer a short write-up per project with charts embedded and code linked at the end?

On the interview side, I’ve been rehearsing a crisp portfolio walkthrough with Beyz interview assistant, but I still need stronger datasets to build around. If you hire analysts, what makes you actually open a portfolio and keep reading?

Last thing, are certificates like DataCamp’s worth the time/money for someone without a formal DS degree, or would you rather see 2–3 focused, shippable projects that answer a business question? Any dataset recommendations or examples would be hugely appreciated.

submitted by /u/Various_Candidate325