Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Where To Purchase Licensed Videos For AI Training?

Hey everyone,

I’m looking to purchase licensed video datasets (ideally at scale, hundreds of thousands of hours) to use for AI training. The main requirements are:

  • Licensed for AI training.
  • 720p or higher quality
  • Preferably with metadata or annotations, but raw videos could also work.
  • Vertical mandatory.
  • Large volume availability (500k hours++)

So far I’ve come across platforms like Troveo and Protege, but I’m trying to compare alternatives and find the best pricing options for high volume.

Does anyone here have experience buying licensed videos for AI training? Any vendors, platforms, or marketplaces you’d recommend (or avoid)?

Thanks a lot in advance!

submitted by /u/Mariolotus
[link] [comments]

Stuck On Extracting Structured Data From Charts/graphs — OCR Not Working Well

Hi everyone,

I’m currently stuck on a client project where I need to extract structured data (values, labels, etc.) from charts and graphs. Since it’s client data, I cannot use LLM-based solutions (e.g., GPT-4V, Gemini, etc.) due to compliance/privacy constraints.

So far, I’ve tried:

  • pytesseract
  • PaddleOCR
  • EasyOCR

While they work decently for text regions, they perform poorly on chart data (e.g., bar heights, scatter plots, line graphs).

I’m aware that tools like Ollama models could be used for image → text, but running them will increase the cost of the instance, so I’d like to explore lighter or open-source alternatives first.

Has anyone worked on a similar chart-to-data extraction pipeline? Are there recommended computer vision approaches, open-source libraries, or model architectures (CNN/ViT, specialized chart parsers, etc.) that can handle this more robustly?

Any suggestions, research papers, or libraries would be super helpful 🙏

Thanks!
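One classical, non-LLM route is to detect the marks (bars, points) with ordinary image processing and then convert their pixel coordinates to data values, using two axis ticks read by OCR as calibration points. The sketch below covers only that calibration step and assumes a linear axis. The function name and all pixel/tick numbers are made up for illustration, and the mark-detection step is not shown.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Build a pixel -> data-value function from two known axis ticks.

    Assumes a linear (not log) axis. pixel_* are pixel coordinates of two
    tick marks, value_* are the numbers OCR read next to them.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration ticks must be at different pixels")

    def to_value(pixel):
        # Linear interpolation/extrapolation between the two ticks.
        return value_a + (pixel - pixel_a) * (value_b - value_a) / (pixel_b - pixel_a)

    return to_value


# Hypothetical y-axis: tick "0" sits at pixel row 400, tick "100" at row 100
# (image rows grow downward, so larger values have smaller row numbers).
y_to_value = make_axis_mapper(400, 0.0, 100, 100.0)

# Hypothetical pixel rows of three detected bar tops.
bar_top_rows = [250, 175, 340]
bar_values = [y_to_value(row) for row in bar_top_rows]
print(bar_values)
```

The same mapper works for the x-axis of a scatter or line plot, so OCR is only needed for the tick labels, which is the part the OCR tools above already handle decently.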

submitted by /u/Fit-Soup9023
[link] [comments]

API To Find The Right Amazon Categories For A Product From Title And Description. Feedback Appreciated

I’m new to the SaaS/API world and decided to build something over the weekend, so I built an API that lets you submit a product title and an optional description and returns the relevant Amazon categories. Is this something you guys use or need? If yes, what do you look for in such an API? I’m just playing with it so far and put a version of it out there: https://rapidapi.com/textclf-textclf-default/api/amazoncategoryfinder

Let me know what you think. Your feedback is greatly appreciated.

submitted by /u/textclf
[link] [comments]

What’s The Most Comprehensive Medical Dataset You’ve Used That Includes EHRs, Physician Dictation, And Imaging (CT, MRI, X-ray)? How Well Did It Cover Diverse Patient Demographics And Geographic Regions?

I’m exploring truly multimodal medical datasets that combine all three elements:

  • Structured EHR data
  • Physician dictation (audio or transcripts)
  • Medical imaging (CT, MRI, X-ray)

Looking for real-world experience—especially around:

  • Whether the dataset was diverse in terms of age, gender, ethnicity, and geographic representation
  • If modality coverage felt balanced or skewed toward one type
  • Practical strengths or limitations you encountered in using such datasets

Any specific dataset names, project insights, or lessons learned would be hugely appreciated!

submitted by /u/Selmakiley
[link] [comments]

[Synthetic] Multilingual Customer Support Chat Logs – English, Spanish, French (Free, Privacy-Safe, Created With MOSTLY AI)

Hi everyone,

I’m sharing a synthetic dataset of customer support chat logs, available in English, Spanish, and Multilingual.
Disclaimer: I work at MOSTLY AI, the platform used to generate this dataset.

About the dataset:

  • Fully synthetic (no real customer data, privacy-safe)
  • Includes realistic support conversations, agent notes, satisfaction scores, and more
  • Useful for NLP, chatbot training, sentiment analysis, and multilingual AI projects

Original source:

Download links:

How it was made:
I used natural language instructions with the MOSTLY AI Assistant to add new columns and generate multilingual samples.
The dataset is free to use and designed for easy experimentation. For example, you can add more columns and rows on demand, and fine-tune it according to your specific needs.

Let me know if you have feedback or ideas for further improvements!

submitted by /u/ZealousidealCard4582
[link] [comments]

Looking For Research Partners Who Need Synthetic Tabular Datasets

Hi all,

I’m looking to partner with researchers/teams who need support creating synthetic tabular datasets — realistic, privacy-compliant (HIPAA/GDPR) and tailored to research needs.

This is an excellent tool for expanding small samples, ensuring data safety for machine learning and artificial intelligence prototyping, and supporting academic or applied research.

If you or your group could use this kind of support, let’s connect!

I’m also interested in participating in initiatives aimed at promoting health and biomedical research. I possess expertise in developing high-quality, privacy-preserving synthetic datasets that can be utilized for educational purposes. I would be more than willing to contribute my skills and knowledge to these efforts, even if it means providing my services for free.

submitted by /u/Adrian2vp
[link] [comments]

[Request] Looking For Datasets Of 2D Point Sequences For Shape Approximation

I’ve been working on a library that approximates geometric shapes (circle, ellipse, triangle, square, pentagon, hexagon, oriented bounding box) from a sequence of 2D points.

  • Given a list of (x, y) points, it tries to fit the best-matching shape.
  • Example use case: hand-drawn sketches, geometric recognition, shape fitting in graphics/vision tasks.

I’d like to test and improve the library using real-world or benchmark datasets. Ideally something like:

  • Point sequences or stroke data (like hand-drawn shapes).
  • Annotated datasets where the intended shape is known.
  • Noisy samples that simulate real drawing or sensor data.

Library for context: https://github.com/sarimmehdi/Compose-Shape-Fitter

Does anyone know of existing datasets I could use for this?
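For anyone who wants to sanity-check candidate datasets, here is a minimal sketch of one standard way to fit a circle to a point sequence: an algebraic least-squares fit. This is not taken from the linked library and may differ from how it works; the function names are made up for illustration.

```python
import math


def solve3(matrix, rhs):
    """Solve a 3x3 linear system with Gauss-Jordan elimination and partial pivoting."""
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate input (e.g. collinear points)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                for k in range(col, n + 1):
                    a[r][k] -= factor * a[col][k]
    return [a[i][n] / a[i][i] for i in range(n)]


def fit_circle(points):
    """Least-squares fit of x^2 + y^2 = 2*cx*x + 2*cy*y + c; returns (cx, cy, r)."""
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for x, y in points:
        row = (2.0 * x, 2.0 * y, 1.0)
        target = x * x + y * y
        for i in range(3):
            atb[i] += row[i] * target
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    cx, cy, c = solve3(ata, atb)
    # c = r^2 - cx^2 - cy^2; clamp to guard against tiny negative round-off.
    radius = math.sqrt(max(c + cx * cx + cy * cy, 0.0))
    return cx, cy, radius


# Twelve exact samples from a circle centred at (2, -1) with radius 3.
samples = [(2 + 3 * math.cos(k * math.pi / 6), -1 + 3 * math.sin(k * math.pi / 6))
           for k in range(12)]
print(fit_circle(samples))
```

Adding random jitter to `samples` is a quick way to produce the kind of noisy test input described above while waiting for a real stroke dataset.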

submitted by /u/zimmer550king
[link] [comments]

Haether. Coding Data Set Api, Made By An Ai Model

Basically I’m trying to create a huge data set (probably about 1T tokens of good-quality code). Disclaimer: this code will be generated by Qwen3-Coder 480B, which I’ll run locally (yes, I can do that). The data set will cover a lot of programming languages; I’ll probably include every one I can. For API requests, you will be able to specify the programming language and the type of code (debugging, algorithms, library usage, or snippets). After the API request, you will get a JSON file with what you asked for, chosen at random, and you will not get the same code twice. If you do need the same code again, you can send a reset request with your API key, which will clear the record of what has already been served for that query.
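The request/reset behaviour described above could be sketched roughly as follows. This is an in-memory illustration only, not the actual Haether implementation; the class name, method names, and response fields are all hypothetical.

```python
import random


class SnippetServer:
    """Toy model of the described API: random picks, no repeats, explicit reset."""

    def __init__(self, snippets, seed=None):
        # snippets maps (language, kind) -> list of code strings.
        self._snippets = snippets
        # Tracks which indices each API key has already received per query.
        self._served = {}
        self._rng = random.Random(seed)

    def request(self, api_key, language, kind):
        pool = self._snippets.get((language, kind), [])
        served = self._served.setdefault((api_key, language, kind), set())
        remaining = [i for i in range(len(pool)) if i not in served]
        if not remaining:
            return {"error": "nothing new left for this query; send a reset"}
        index = self._rng.choice(remaining)
        served.add(index)
        return {"language": language, "type": kind, "code": pool[index]}

    def reset(self, api_key, language, kind):
        # Forget what this key has seen, so earlier snippets can come back.
        self._served.pop((api_key, language, kind), None)


server = SnippetServer({("python", "snippets"): ["print(1)", "print(2)"]}, seed=0)
print(server.request("demo-key", "python", "snippets"))
print(server.request("demo-key", "python", "snippets"))
print(server.request("demo-key", "python", "snippets"))  # pool exhausted
server.reset("demo-key", "python", "snippets")
print(server.request("demo-key", "python", "snippets"))
```

At the scale mentioned in the post, the per-key "already served" sets would need persistent storage rather than a Python dict, but the bookkeeping stays the same.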

submitted by /u/CurtissYT
[link] [comments]

Dataset Of 120,000+ Products With Barcodes (EAN-13), Normalized Descriptions, And CSV Format For Retail, Kiosks, Supermarkets, And E-commerce In Argentina/LatAm

Hi everyone,

A while back I started a project that began as something very small: a database of products with barcodes for kiosks and small businesses in Argentina. At one point it was stolen from me and people started reselling it on MercadoLibre, so I decided to rebuild everything from scratch, this time with scraping, description normalization, and a bit of AI to sort out the categories.

Today I have a dataset of more than 120,000 products that includes real EAN-13 barcodes, normalized descriptions, and basic categories (I’m currently looking into how I can use AI to classify everything by category and subcategory). I have it in CSV format and I’m using it in a web search tool I built, but the database itself can serve different purposes: bulk-loading catalogs into POS, inventory, or e-commerce systems, or even training NLP models for consumer packaged goods.
An example of what each record looks like:

7790070410120, Arroz Gallo Oro 1kg

7790895000860, Coca Cola Regular 1.5L

7791234567890, Shampoo Sedal Ceramidas 400ml

What I’d like to know is whether a dataset like this could also be useful outside Argentina or LatAm. Do you think it could be useful to the community in general? What would you add to make it more useful, for example prices, a more detailed category hierarchy, brands, etc.?

If anyone is interested, I can share a reduced CSV of 500 rows for you to try out.

Thanks for reading, and I’m open to feedback.
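For anyone loading a CSV like this into a POS or catalog system, a common first sanity check is the EAN-13 check digit. The sketch below implements the standard EAN-13 rule (weights 1 and 3 alternating over the first twelve digits); the function names are made up, and the barcode used in the example is a widely cited test value rather than a row from this dataset.

```python
def ean13_check_digit(first12: str) -> int:
    """Compute the 13th digit from the first twelve digits of an EAN-13 code."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    # Positions 1, 3, 5, ... (from the left) weigh 1; positions 2, 4, 6, ... weigh 3.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """True when the code has 13 digits and its last digit matches the checksum."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    return ean13_check_digit(code[:12]) == int(code[12])


print(is_valid_ean13("4006381333931"))  # valid example code
print(is_valid_ean13("4006381333932"))  # same code with a wrong last digit
```

Running every row through `is_valid_ean13` before import is a cheap way to catch typos or truncated codes introduced during scraping.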

submitted by /u/Tricky-Birthday-176
[link] [comments]

Looking For Time-series Waveform Data With Repeatable Peaks And Troughs (systole/diastole–like) For Labeling Project

Hi everyone, I’m working on a research project where I need a time-series dataset structured similarly to the waveform attached—basically a signal with repeatable cycles marked by distinct peaks and troughs (like systolic and diastolic phases). There may also be false positives or noise in the signal.

I’m not necessarily looking for physiological heartbeat data—just any dataset that behaves similarly enough to allow me to prototype my labeling pipeline (e.g., finding cycles, handling noise artifacts).

Key requirements:

  • Time-series data with clear, repeated peaks and dips (like systole & diastole).
  • Presence of noise or spurious peaks for robustness testing.
  • Ideally available in a simple, accessible format (e.g., CSV).

If you know of any open-source datasets (Kaggle, UCI, PhysioNet, or others) that fit the bill, please share! A second-best option for more general signals (not biological) is also welcome if they mimic this structure.

I’d love to get started ASAP—thanks so much in advance!
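Until a real dataset turns up, a synthetic signal is enough to prototype the cycle-finding part. The sketch below builds a clean periodic waveform with one artificial bump and finds peaks with a simple height and minimum-distance rule. It is only a stand-in under those assumptions (the function name, period, and thresholds are arbitrary), not a substitute for a proper peak detector on real data.

```python
import math


def find_peaks(signal, min_height, min_distance):
    """Return indices of local maxima that are tall enough and far enough apart."""
    candidates = [
        i for i in range(1, len(signal) - 1)
        if signal[i] > signal[i - 1]
        and signal[i] >= signal[i + 1]
        and signal[i] >= min_height
    ]
    kept = []
    # Visit the tallest candidates first so a smaller nearby bump never wins.
    for i in sorted(candidates, key=lambda idx: signal[idx], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)


# Synthetic waveform: one cycle every 40 samples, 10 cycles in total.
signal = [math.sin(2 * math.pi * i / 40) for i in range(400)]
signal[30] += 0.3  # small spurious bump inside a trough

peaks = find_peaks(signal, min_height=0.5, min_distance=20)
print(peaks)
```

Troughs can be found the same way by negating the signal, and adding random noise to `signal` gives a quick robustness test for the labeling pipeline.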

submitted by /u/xpmoonlight1
[link] [comments]

Kijiji And Facebook Automatic Poster Script

Hi!

Does anyone know how to post ads automatically, or have a script for it? I’ve made an app where I take photos of car tires, input some info, and it creates a full ad. I just want to post that on Kijiji and Facebook automatically, because I don’t want to do it by hand for 100+ sets. Kijiji doesn’t have an open API, and I’ve been getting blocked by HTTPS and all of Kijiji’s protections. I haven’t tried Facebook yet, but I’m not a seasoned coder and ChatGPT hasn’t helped me at all.

submitted by /u/YoghurtFinal1845
[link] [comments]

I Need To Pull Data On All Of Count Von Count’s Tweets

Okay, so we’re talking about the Twitter feed of the Sesame Street character Count Von Count (https://x.com/CountVonCount). On May 2, 2012, he tweeted simply “One!” (https://x.com/CountVonCount/status/197685573325029379), and over the past 13 years he has made it to “Five thousand three hundred twenty-eight!” I need the date and time that each tweet was posted, plus how many likes and retweets each post had.

This contains some interesting data. For example, each tweet was originally just posted randomly (no pattern to the time), and then at some point tweets began to be scheduled x hours in advance (the minutes past the hour are noticeably identical for a while, until the poster forgot to schedule any and needed to start with a new random time). Also, the likes and retweets are mostly a simple function of how many followers the account had at the time they were posted, with some exceptions. There have been situations where someone retweeted a certain number when it became newsworthy (for instance, on election night 2020 someone retweeted the number of electoral votes Joe Biden had when he clinched the presidency, and that got the tweet a bunch of likes). And the round numbers and the funny numbers (69 and 420) show higher than expected “like” numbers.

I was collecting data by hand, but I realized that by not getting it all at once I might be skewing the data. I have used Selenium before to scrape data from websites, but I don’t know if that will work for x.com. I also don’t want to pay for API key usage for anything so frivolous. Does anyone have any ideas?
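However the tweets end up being collected, each one still has to be turned back into an integer before it can be joined with likes and timestamps. A minimal sketch of that parsing step is below. It assumes the tweet text is an English number in words followed by “!”, as in the two examples quoted above; the function name is made up, and the collection step itself is not covered.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_int(text: str) -> int:
    """Convert e.g. 'Five thousand three hundred twenty-eight!' to 5328."""
    cleaned = text.lower()
    for ch in "-!,.":
        cleaned = cleaned.replace(ch, " ")
    total = 0      # completed groups (thousands, millions)
    current = 0    # group currently being built (0..999)
    for word in cleaned.split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word in SCALES:
            total += current * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unrecognised word: {word!r}")
    return total + current


print(words_to_int("One!"))
print(words_to_int("Five thousand three hundred twenty-eight!"))
```

With the number recovered, checks like “are the minutes past the hour identical for a run of consecutive numbers?” become simple group-by operations on the timestamp column.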

submitted by /u/ConsistentAmount4
[link] [comments]

I Have Created A Massive Crypto Backtesting Dataset

I was trying to find high quality crypto datasets for backtesting and all the ones I found were very expensive or poor quality.

So I decided to get all the data myself and build my own dataset. The only expenses were storage and running the scripts for many days.

Anyway, I now have data going back to 2017 and around 3000 pairs. I’m thinking of selling it but I’m not sure where I should start, so I thought I’d start here. If this is not the right place for it, it would be very helpful if you could let me know some good places. I think I can sell it for much less than the bigger players.

Here’s how I’m thinking I’ll sell:

Single Pair data: $10
Top 200 Pairs: $50
All data (~3000 pairs): $500

No subscription. No API. Just a link to download the full data in one go.

Note: When I say pairs I mean like ETH/BTC pair or BTC/SOL pair etc.

submitted by /u/Amazing-Sky-504
[link] [comments]

📸 New Dataset: MMP-2K — A Benchmark For Macro Photography Image Quality Assessment (IQA)

Hi everyone,

We just released MMP-2K, the first large-scale benchmark dataset for Macro Photography Image Quality Assessment (IQA). (PLEASE GIVE US A STAR ON GITHUB)

What’s inside:

  • ✅ 2,000 macro photos (captured under diverse settings)
  • ✅ Human MOS (Mean Opinion Score) quality ratings
  • ✅ Multi-dimensional distortion labels (blur, noise, color, artifacts, etc.)

Why it matters:

  • Current state-of-the-art IQA models perform well on natural images, but collapse on macro photography.
  • MMP-2K reveals new challenges for IQA and opens a new research frontier.

Resources:

I’d love to hear your thoughts:
👉 How would you approach IQA for macro photos?
👉 Do you think existing deep IQA models can adapt to this domain?

Thanks, and happy to answer any questions!

submitted by /u/Equivalent_Use_3762
[link] [comments]

Update On An Earlier Post About 300 Million RSS Feeds

Hi all, I heard back from a couple of companies, and effectively all of them, including ones like Everbridge, said “Thanks, xxx, I don’t think we’d be able to effectively consume that volume of RSS feeds at this time. If things change in the future, Xxx or I will reach out.” The thing is, I don’t have the infrastructure to handle this data at all. Would anyone want this data? If I put it up on Kaggle or HF, would anyone make something of it? I’m debating putting the data on Kaggle or taking suggestions for an open source project. Any help would be appreciated.

submitted by /u/Horror-Tower2571
[link] [comments]

Real Estate Data (Rents By Bedroom, Home Prices, Etc) Broken Down By Zip Code

Went through the hassle of compiling data from nearly every free (and some paid) real estate resource to have (probably) the most comprehensive dataset of its kind. Currently it’s being displayed in a tool I built, but the goal is to make this data free and accessible to anybody who wants it.

For most of the zip codes in the USA (about 25k, accounting for ~90% of the population), I have:

  1. home prices (average, median, valuation) — broken down by bedroom
  2. rent prices — by bedroom
  3. listing counts, days on market, etc, y/y%
  4. mortgage data (originations, first lien, second lien, debt to income, etc.)
  5. affordability metrics, mortgage cost
  6. basic demographics (age, college, poverty, race / ethnicity)

Once you’re in the dashboard and select a given area (e.g., Chicago metro), there’s a table view in the bottom left corner where you can download an export of the data for that metro.

I’m working on setting up an S3 bucket to host the data (including the historical datasets too), but wanted to give a preview (and open myself up to any comments / requests) before I start including it there.

submitted by /u/prop-metrics
[link] [comments]

Labeling 10k Sentences Manually Vs Letting The Model Pick The Useful Ones 😂 (uni Project On Smarter Text Labeling)

Hey everyone, I’m doing a university research project on making text labeling less painful.
Instead of labeling everything, we’re testing an Active Learning strategy that picks the most useful items next.
I’d love to ask 5 quick questions from anyone who has labeled or managed datasets:
– What makes labeling worth it?
– What slows you down?
– What’s a big “don’t do”?
– Any dataset/privacy rules you’ve faced?
– How much can you label per week without burning out?

Totally academic, no tools or sales. Just trying to reflect real labeling experiences.
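For readers who have not seen Active Learning before, one common selection rule is uncertainty sampling: label the items the current model is least sure about. The post does not say which strategy the project uses, so the sketch below is only an illustration of that one rule, with made-up function names and made-up model probabilities.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability list (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_most_uncertain(predictions, k):
    """Return the k item ids whose predicted distribution has the highest entropy.

    predictions maps item id -> list of class probabilities from the current model.
    """
    ranked = sorted(predictions, key=lambda item_id: entropy(predictions[item_id]),
                    reverse=True)
    return ranked[:k]


# Hypothetical model outputs for three unlabeled sentences (two classes).
predictions = {
    "s1": [0.90, 0.10],  # model is fairly confident
    "s2": [0.55, 0.45],  # model is close to guessing
    "s3": [0.70, 0.30],
}
to_label = pick_most_uncertain(predictions, k=2)
print(to_label)
```

After the chosen items are labeled, the model is retrained and the ranking is recomputed, which is what lets the loop skip sentences the model already handles well.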

submitted by /u/vihanga2001
[link] [comments]

Open Sourced A CLI That Turns PDFs And Docs Into Fine Tuning Datasets Now With Multi File Support

Repo: https://github.com/Datalore-ai/datalore-localgen-cli

Hi everyone,

During my internship I built a small terminal tool that could generate fine-tuning datasets from real-world data using deep research. I later open sourced it and recently built a version that works fully offline on local files like PDFs, DOCX, TXT, or even JPGs.

I shared this update a few days ago and it was really cool to see the response. It got around 50 stars and so many thoughtful suggestions. Really grateful to everyone who checked it out.

One suggestion that came up a lot was whether it can handle multiple files at once, so I integrated that. Now you can just point it at a directory path and it will process everything inside: extract text, find relevant parts with semantic search, apply your schema or instructions, and output a clean dataset.

Another common request was around privacy like supporting local LLMs such as Ollama instead of relying only on external APIs. That is definitely something we want to explore next.

We are two students juggling college with this side project, so sorry for the slow updates, but every piece of feedback has been super motivating. Since it is open source, contributions are very welcome, and if anyone wants to jump in we would be really, really grateful.

submitted by /u/Interesting-Area6418
[link] [comments]

Google Maps Scraping For Large Dataset

So I want to scrape every business name registered on Google in an entire city or state, but scraping it directly through Selenium does not seem like a good idea, even with proxies. Is there any dataset like this for a city like Delhi, so that I don’t need to scrape the entirety of Google Maps? I need it to train a model for text classification. Is there any viable way I can do this?

submitted by /u/Existing_Pay8831
[link] [comments]

I Scraped 1.2M+ US Jobs (Here Are The Stats)

A huge chunk of US jobs never reach job boards at all; they sit only on internal career pages.

So I built an AI crawler that goes straight to the source: 70k+ corporate websites.
It collects and cleans the data automatically. Here’s what I found (US only):

Function | Open Roles
Software Development | 171,789
Marketing & Sales | 183,143
Health & Pharma | 192,426
Retail & Consumer Goods | 127,782
Engineering, Manufacturing & Environment | 134,912
Operations, Logistics, Procurement | 98,370
Finance & Accounting | 101,166
Business & Strategy | 47,076
Data & AI | 18,239
Creative & Design | 11,472
Hardware, Systems & Electronics | 30,112
Legal, HR & Administration | 42,845
Public & Education | 26,826
Hospitality, Travel & Tourism | 46,121
Beauty & Wellness | 7,597
Real Estate | 15,405

You can explore and apply to all these jobs for free here: laboro.co

submitted by /u/Elieroos
[link] [comments]

Looking For Dataset On “ease Of Remembering Numbers”

Hi everyone,

I’m working on a project where I need a dataset that contains numbers (like 4–8 digit sequences, phone numbers, PINs, etc.) along with some measure of how easy they are to remember.

For example, numbers like 1234 or 7777 are obviously easier to recall than something like 9274, but I need structured data where each number has a “memorability” score (human-rated or algorithmically assigned).

I’ve been searching, but I haven’t found any existing dataset that directly covers this. Before I go ahead and build a synthetic dataset (based on repetition, patterns, palindromes, chunking, etc.), I wanted to check:

  • Does such a dataset already exist in psychology, telecom, or cognitive science research?
  • If not, has anyone here worked on generating similar “memorability” metrics for numbers?
  • Any tips on crowdsourcing this kind of data (e.g., survey setups)?

Any leads or references would be super helpful.

Thanks in advance!
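As a starting point for the synthetic route, here is a rough sketch of an algorithmically assigned score built from three of the cues listed above (repetition, sequential patterns, palindromes). The weights are arbitrary placeholders, the function name is made up, chunking is not modelled, and nothing here is validated against human recall data.

```python
def memorability(number: str) -> float:
    """Heuristic score in [0, 1]; higher means 'probably easier to remember'."""
    if not number or not (number.isascii() and number.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    digits = [int(c) for c in number]
    n = len(digits)
    if n == 1:
        return 1.0

    # Repetition: 1.0 when every digit is the same, 0.0 when all are distinct.
    distinct = len(set(digits))
    repetition = 1.0 - (distinct - 1) / (n - 1)

    # Sequential pattern: strictly ascending or descending by 1 (1234, 8765).
    steps = [b - a for a, b in zip(digits, digits[1:])]
    sequential = 1.0 if all(s == 1 for s in steps) or all(s == -1 for s in steps) else 0.0

    # Palindrome: reads the same in both directions (1221, 7777).
    palindrome = 1.0 if digits == digits[::-1] else 0.0

    # Arbitrary placeholder weights; these would need tuning on human ratings.
    return 0.4 * repetition + 0.4 * sequential + 0.2 * palindrome


for example in ["1234", "7777", "1221", "9274"]:
    print(example, round(memorability(example), 3))
```

A score like this could label a synthetic dataset cheaply, and a small crowdsourced recall survey could then be used to check or refit the weights.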

submitted by /u/abel_maireg
[link] [comments]