Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

Data For Marketing Campaigns Or Audience Insights Practice?

My background is in insights and market research. I'm currently job hunting and I'm seeing a lot of roles in audience insights and marketing research, which I don't have direct experience in. I was thinking about doing some small projects to include in my applications to show I have transferable skills, but I'm struggling to find open-source data to work with. Does anyone have any suggestions? Thanks so much.

submitted by /u/belledamesans-merci

Data Of Mileage/breakdown For Vehicles?

Howdy folks,

I'm based in the States. I'm wondering if anyone knows of data out there that shows at what mileage particular car models tend to need services or have breakdowns, and what those services or repairs tend to be.

I'm looking at this retrospectively: I'm not trying to predict or project what services will be needed at future mileage, but looking for something that would actually SHOW at what mileage a particular model has PREVIOUSLY received particular services, repairs, or breakdowns.

Does anyone know if anything like this exists or is available?

submitted by /u/WhatsTheAnswerDude

Datasets For Training A 2D Virtual Try-On Model (TryOnDiffusion)

Hi everyone,

I’m currently working on training a 2D virtual try-on model, specifically something along the lines of TryOnDiffusion, and I’m looking for datasets that can be used for this purpose.

Does anyone know of any datasets suitable for training virtual try-on models that allow commercial use? Alternatively, are there datasets that can be temporarily leased for training purposes? If not, I’d also be interested in datasets available for purchase.

Any recommendations or insights would be greatly appreciated!

Thanks in advance!

submitted by /u/Straight-Piccolo5722

Create A Database With Historical Soccer Results

I would like to create a database with historical soccer results and odds. Since I have no programming experience, I had thought about Excel or Google Sheets. The question is: how do I get the data? I have heard of web scraping and of using an API. There are some on RapidAPI, e.g. from Sofascore, but they have limits in the free version. I imagine a record like this: country, league, date, season, round, home team, away team, home goals, away goals, half-time home/away goals, 1X2 odds, home/away Elo.

ChatGPT suggested Google Sheets and using Google Apps Script there for the API, but I just can't get to grips with the endpoints. Furthermore, I want the results from the last day or days to be fetched automatically or on command, as well as upcoming games with odds for the next 7 days.

How can I implement this? What ideas do you have? Thanks a lot.
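One possible starting point, a sketch rather than a definitive answer: football-data.co.uk publishes free historical results and odds as plain CSV files, so no API key or scraping is needed. The sketch below assumes its published column layout (FTHG/FTAG = full-time home/away goals, HTHG/HTAG = half-time goals, B365H/B365D/B365A = 1X2 odds) and uses the Premier League 2024/25 file as an example:

```python
# Minimal sketch: pull one season of Premier League results + odds into a
# spreadsheet-ready CSV using pandas. Column names follow the layout
# football-data.co.uk documents for its files.
import pandas as pd

# 2425 = season 2024/25, E0 = English Premier League
url = "https://www.football-data.co.uk/mmz4281/2425/E0.csv"
df = pd.read_csv(url)

cols = ["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG",
        "HTHG", "HTAG", "B365H", "B365D", "B365A"]
df = df[cols]

# Save for import into Excel or Google Sheets
df.to_csv("premier_league_2425.csv", index=False)
print(df.head())
```

Elo ratings aren't in those files; clubelo.com offers a free API for those. A small script like this, run on a schedule (cron, or a time-driven Apps Script trigger), could cover the automatic daily update.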

submitted by /u/PokerMurray

Where Can I Find Data? Working On Econometrics Paper

I’m working on an econometrics paper for my college course. I am aiming to reproduce the results of the following paper:

Incentives, time use and BMI: The roles of eating, grazing and goods by Daniel S. Hamermesh

I want to reproduce these results with more modern and accurate measures in mind (rather than BMI), but I am having trouble finding the data. I'd appreciate any help you guys can offer.

submitted by /u/seventydaily

Synthetic Infant Detection Dataset In Cribs

I've been doing a lot of work on building computer vision models to track infants in cribs since becoming a parent. Recently I've started making models and datasets that are more generalized and not just for my kid. It turns out this is pretty difficult, since there aren't many datasets for tracking infants in cribs.

I made a first attempt at producing a synthetic dataset that can be used to bootstrap a model. The idea is that you'd either supplement the synthetic data with a small subset of real data or use something like transfer learning. The dataset was made using path tracing, so it looks a bit better than some of the other synthetic infant datasets I've seen (links in my GitHub repo).
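As a sketch of that bootstrap idea, assuming Ultralytics YOLO as the detector (just one choice among many) and placeholder dataset YAML paths:

```python
# Bootstrap sketch: pretrain a detector on the synthetic crib images, then
# fine-tune on a small set of real labeled images (transfer learning).
from ultralytics import YOLO

# Stage 1: start from a COCO-pretrained checkpoint, train on synthetic data
model = YOLO("yolov8n.pt")
model.train(data="synthetic_infants.yaml", epochs=100, imgsz=640)

# Stage 2: fine-tune on the small real dataset with a lower learning rate
model.train(data="real_infants.yaml", epochs=30, imgsz=640, lr0=1e-4)
```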

Relevant Links:

https://github.com/tay10r/infant-detection-dataset
https://www.kaggle.com/datasets/tay10r/synthetic-infant-dataset

It’ll be a week or so before the full dataset is done rendering (10k images). I’m traveling over the weekend so I was only able to upload a subset of the dataset (a little over 100 images).

Currently I use a model trained on about 2,000 labeled images of my kid to analyze sleep patterns. I'm hoping this dataset, perhaps after a few improvements, will help produce more general models for this type of work. I'm curious to know if anyone else finds this interesting or practical. Let me know what you think!

submitted by /u/taylorcholberton

You Can Now Train Your Own Reasoning Model With Just 5GB VRAM

Hey amazing people! First post here! Today, I’m excited to announce that you can now train your own reasoning model with just 5GB VRAM for Qwen2.5 (1.5B) using our open-source project Unsloth: https://github.com/unslothai/unsloth

GRPO is the algorithm behind DeepSeek-R1 and how it was trained. You need a dataset with about 500 rows of question-answer pairs plus a reward function, and you can then start the whole process!
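To make the reward-function part concrete, here is a toy sketch (my own illustration, not Unsloth's code). It assumes the calling convention used by TRL's GRPOTrainer, where the batch of completions and any extra dataset columns (here a hypothetical answer column) are passed as keyword arguments:

```python
# Toy reward function for GRPO: score each sampled completion against the
# reference answer from the dataset. The trainer calls this once per batch.
def correctness_reward(completions, answer, **kwargs):
    rewards = []
    for completion, gold in zip(completions, answer):
        # In conversational format, each completion is a list of messages;
        # grab the generated text.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        if text.strip() == str(gold).strip():
            rewards.append(2.0)   # exact match: full credit
        elif str(gold).strip() in text:
            rewards.append(0.5)   # answer buried somewhere in the output
        else:
            rewards.append(0.0)
    return rewards
```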

This allows any open LLM, like Llama, Mistral, Phi, etc., to be converted into a reasoning model with a chain-of-thought process. The best part about GRPO is that it doesn't matter much if you train a small model rather than a large one: the small model trains faster, so you can fit in more training in the same time, and the end result will be very similar! You can also leave GRPO training running in the background of your PC while you do other things!

Due to our newly added Efficient GRPO algorithm, this enables 10x longer context lengths while using 90% less VRAM vs. every other GRPO LoRA/QLoRA (fine-tuning) implementation, with 0 loss in accuracy. With a standard GRPO setup, Llama 3.1 (8B) training at 20K context length demands 510.8GB of VRAM, but Unsloth's 90% VRAM reduction brings the requirement down to just 54.3GB in the same setup. We leverage our gradient checkpointing algorithm, which we released a while ago; it smartly offloads intermediate activations to system RAM asynchronously while being only 1% slower. This alone shaves a whopping 372GB of VRAM, since we need num_generations = 8. We can reduce memory usage even further through intermediate gradient accumulation. Use our GRPO notebook with 10x longer context on Google's free GPUs: Llama 3.1 (8B) Colab notebook (GRPO.ipynb).
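For context, here is a minimal sketch of what the overall setup can look like with Unsloth plus TRL's GRPOTrainer, reusing the toy reward function above. The model choice, hyperparameters, and GSM8K as the question-answer dataset are all placeholder assumptions, not the notebook's exact settings:

```python
# Sketch of a GRPO run: 4-bit Qwen2.5 (1.5B) with LoRA adapters via Unsloth,
# trained with TRL's GRPOTrainer on question-answer pairs.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,     # 4-bit quantization keeps VRAM low
    fast_inference=True,   # vLLM-backed generation for the GRPO rollouts
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Question-answer pairs; GRPOTrainer expects a "prompt" column
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {
    "prompt": [{"role": "user", "content": x["question"]}],
    "answer": x["answer"],
})

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],   # sketched earlier
    args=GRPOConfig(
        num_generations=8,               # samples scored per prompt
        max_prompt_length=256,
        max_completion_length=512,
        learning_rate=5e-6,
        max_steps=250,
        output_dir="outputs",
    ),
    train_dataset=dataset,
)
trainer.train()
```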

Blog with more details on the algorithm, the maths behind GRPO, issues we found, and more: https://unsloth.ai/blog/grpo

GRPO VRAM Breakdown:

Metric                                  | Unsloth           | TRL + FA2
Training Memory Cost (GB)               | 42GB              | 414GB
GRPO Memory Cost (GB)                   | 9.8GB             | 78.3GB
Inference Cost (GB)                     | 0GB               | 16GB
Inference KV Cache for 20K context (GB) | 2.5GB             | 2.5GB
Total Memory Usage                      | 54.3GB (90% less) | 510.8GB

Also, we spent a lot of time on our guide (with pics) covering everything about GRPO + reward functions/verifiers, so I'd highly recommend you read it: docs.unsloth.ai/basics/reasoning

Thank you so so much for reading! 😀

submitted by /u/yoracale

Looking For Hinge Data From Users Of The App

I am a journalism student looking for Hinge datasets to analyze dating patterns. Hinge lets users export their personal data including likes sent and received, matches, conversations, etc. If someone has a dataset of multiple users or is willing to share their own data please let me know. If sharing personal data, I could anonymize your name in my findings if you prefer. Thanks in advance!

submitted by /u/cappingaf

Looking For Well-structured Datasets On D2C Brand Directories And Product Discovery

I’m exploring how people discover D2C brands and want to improve search/filtering experiences in large directories. To do this, I’m looking for well-structured datasets related to:

- D2C brand directories (with categories, tags, or attributes)
- E-commerce product databases with metadata
- Consumer search behavior for brands/products

If you know of any publicly available datasets that could help, I’d love to hear about them! Also, if you have tips on structuring datasets for better discoverability, feel free to share.

Thanks in advance!

submitted by /u/Mobile_Candidate_926

Looking For A Dataset That Scrapes Newly Posted ICE/Police Job Postings By State So That I Can Visualize The Trend Over Time?

Hello,

I'm looking for help finding or building a dataset that captures new ICE/police job postings by state. My hypothesis is that we are going to see an increase in the number of these openings over the year, and I'm keen on tracking the trend – I think it may be a useful leading barometer.

Does anyone know of a database that already tracks job listings by industry and state at a granular enough level to be useful here?

If not, maybe we start with California, Texas, Arizona, Florida, and New York?

I am completely new to this but am interested in seeing this trend so any help is appreciated.
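For the federal (ICE) side, one hedged starting point is the official USAJOBS search API. A minimal sketch, assuming its documented header scheme and Keyword parameter (the key and email are placeholders, and local police postings would still need scraping of individual job boards):

```python
# Count current ICE-related postings via the USAJOBS search API.
# Requires a free API key from developer.usajobs.gov.
import requests

headers = {
    "Host": "data.usajobs.gov",
    "User-Agent": "your.email@example.com",   # placeholder contact email
    "Authorization-Key": "YOUR_API_KEY",      # placeholder key
}
params = {
    "Keyword": "Immigration and Customs Enforcement",
    "ResultsPerPage": 500,
}

resp = requests.get("https://data.usajobs.gov/api/search",
                    headers=headers, params=params, timeout=30)
items = resp.json()["SearchResult"]["SearchResultItems"]

# Logging this count with a date stamp each day would build the trend line
print(f"{len(items)} postings returned")
```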

submitted by /u/Powder9

Historic Temperature Per Location, Hourly Granularity

I am a real weather geek, and I am looking for historical temperature data (preferably via an easy-to-use API) per location at hourly granularity.

I'd like to run the queries in scripts (e.g. Python) and visualize the data.

Reason for hourly: I'd like to know the highest, lowest, and average temperature, where the average is the proper mean over all hourly readings, not (Tmax+Tmin)/2. I'd also like to plot average temperature profiles for different locations.

Weather Underground has exactly that, but there is no API that's free for the end user, and the data is only reachable by manually clicking through it. In the past I have exported data via the clipboard, but that gets too exhausting once the dataset exceeds a few days or locations.
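One concrete possibility, sketched under the assumption that Open-Meteo's free historical archive API fits the bill (it serves hourly temperature_2m readings without an API key; the coordinates and dates below are placeholders):

```python
# Fetch hourly temperatures for one location and compare the proper hourly
# mean against the (Tmax + Tmin) / 2 approximation.
import requests

URL = "https://archive-api.open-meteo.com/v1/archive"
params = {
    "latitude": 52.52,          # placeholder: Berlin
    "longitude": 13.41,
    "start_date": "2024-01-01",
    "end_date": "2024-01-31",
    "hourly": "temperature_2m",
    "timezone": "auto",
}

data = requests.get(URL, params=params, timeout=30).json()
temps = data["hourly"]["temperature_2m"]

hourly_mean = sum(temps) / len(temps)
minmax_mean = (max(temps) + min(temps)) / 2
print(f"hourly mean:   {hourly_mean:.2f} °C")
print(f"(Tmax+Tmin)/2: {minmax_mean:.2f} °C")
```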

submitted by /u/segdy

Intimate Partner Violence Across U.S. States – Longitudinal Dataset For A 5yr Timeframe

Hi!!

Can anyone PLEASE PLEASE PRETTY PLEASE give me links or database suggestions for a research paper on "How do firearm prohibition and relinquishment laws for individuals with a history of domestic violence impact female firearm-related fatalities?" Any 5-year range is perfectly good, but preferably in the 21st century, covering all 50 states and recording firearm-related deaths of women perpetrated by intimate partners!!

This will really, really help my teammates and me! It's for our master's, and we are trying to get a good study out there!! THANK YOU

submitted by /u/Puzzleheaded_Cup8780

Dataset Needed – S&P 500 Constituents With Daily Prices

I want to run backtests on a momentum investing strategy.

So I'm looking for a dataset with a daily list of S&P 500 constituents, their price for each day, and any relevant events (such as stock splits or company mergers/splits). I bought such a dataset (1963-2014) in 2014 for $49, but the company that sold it to me is no longer in business.

Preferably usable in Node.js; my Python is a bit rusty.

submitted by /u/SaltBat6229

Dataset Needed – Child Welfare (Child Abuse Investigations And Foster Care Cases)

Hi all,

I am a current Social Work PhD student interested in the child welfare system (investigations of abuse/neglect and foster care), especially the experiences of the caseworkers themselves. I need a dataset to analyze for one of my courses and am in the process of requesting restricted data from the US Department of Health and Human Services' Children's Bureau. With everything going on, I am getting a little nervous that it may be pulled from the site or my request denied, so I'd like to have a backup. Is anyone aware of any public datasets focusing on the child welfare system that I could look at?

I am looking for a dataset from 2019 or later.

Thank you in advance for your help!!

submitted by /u/ssdgm23

Can Someone Help Me Find The Source Of This Data?

Hey! The IMF Global Financial Stability Report (October 2024) has a graph that would really help me in my studies; the only problem is that I can't find the source of the data.

The specific graph is "3.2.1 Cyberattacks (number per year; finance and insurance sector share in percent)", located on page 99 of said report. The IMF lists AIAAC and the University of Maryland Center as sources, but they don't really have anything on their websites.

I would really appreciate it if someone could help me with this!

submitted by /u/nagybotond