Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Do You Know Where I Can Access Twitch Stream-level Historical Data For Free?

Hello everyone, I hope you’re doing okay.

The thing is that for a project at uni I want to access historical data on daily streams, and get, for example, info about the time and date of the stream, channel, content, average viewers, stream duration, etc. What I need is something like this (but for this page I have to pay):

https://streamscharts.com/streams?sortBy=avg_concurrent_viewers&time=30-days

Does anyone know any alternatives to get this kind of data for free?

Thank you in advance! Any help is appreciated.

submitted by /u/Valuable-Interest921

Can I Fine-Tune An LLM Like GPT-4o To Parse Data Into A JSON Format From Partially Structured PDFs?

I am working on a project that relies heavily on pattern matching and regexes to extract and give structure to data that the company relies on. This data is extracted from PDFs that are partially structured, but here and there something will break because of a weird character or some edge case that is not taken care of. Because of this, there is a chance that our current parsing engine might miss something in the PDFs.

I have been thinking about this a lot and have tested GPT-4o as is by uploading PDF attachments, and I have observed that it is pretty good at parsing the information we need. Ever since, I have been planning to build something new that relies on LLMs, such as the ones from OpenAI, instead of pattern recognition.

My question is: can I train an OpenAI model or another model to parse the information I need from these PDFs and make it output purely the JSON structure that I want? That way I could use OpenAI’s API and integrate it into our backend services to do all of the work. Do you guys think this is possible?

If fine-tuning is not possible, what is the best way of going about building something like this?
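
Whichever model ends up doing the parsing, the backend would probably want to check each reply before trusting it. Below is a minimal sketch of that check, assuming the model's reply arrives as a plain string. The field names in `REQUIRED_FIELDS` are hypothetical placeholders, not the real schema, and the actual API call is deliberately left out.

```python
import json
import re

# Hypothetical field names; the real schema depends on the PDFs in question.
REQUIRED_FIELDS = {"invoice_id": str, "total": (int, float)}


def extract_json(raw_reply: str) -> dict:
    """Pull the JSON object out of a model reply and parse it."""
    # Models sometimes wrap JSON in extra prose, so grab the outermost {...}.
    match = re.search(r"\{.*\}", raw_reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))


def validate(record: dict, schema: dict = REQUIRED_FIELDS) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems
```

A record that fails validation could be retried or routed back to the existing regex engine, so the LLM path never silently drops a field.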

submitted by /u/captain-ass-smasher

Any Datasets In The Cardiology Domain To Begin A Project?

Hello everyone. Context: I have a medical background and I want to enter the deep learning/machine learning world. I have already acquired some prerequisites, such as Python programming and machine learning and deep learning theory. I want to create a project in cardiology, but I don’t know which free datasets exist in this domain. I have looked at it from many angles, such as radiology, pharmacology, biology, etc.

Question: Do you have any suggestions for free datasets I can use for my project? Thanks, all.

submitted by /u/Sane_pharma

Customer Segmentation But With Ground Truth Labels

Hello, as the title states, I am looking for customer segmentation datasets with segment labels, since I want to benchmark different methods. In truth, any variable (such as satisfaction) will be fine as long as it has more than 2 categories.

I’ve looked all around Kaggle and UCI but I cannot find any; all these datasets contain no labels. Do you guys have any suggestions? Thanks

submitted by /u/Grand_Comparison2081

Q: Fine-tuning Coding LLMs On Git[hub] Histories Rather Than Just Final Code?

I run a small software company creating traditional C++ desktop apps for font & graphic design work. We have 10+ years of Git histories of our apps.

What open “coding” LLMs are out there that weren’t just trained on final code but on Git histories (commits & pull requests) and GitHub stuff (PR discussions, issues, etc.)?

What dataset formats for such data would be advisable to use?

I’d like to fine-tune a coding LLM to privately assist in our software development, ideally not just on the current state of the code but on its evolution.

I have a “feeling” that this would be much better. 🙂
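
To make the format question concrete, here is a rough sketch of one possible layout, not an established standard: one JSONL record per commit, with the commit subject as the prompt and the patch as the completion. The record keys (`commit`, `prompt`, `completion`) are assumptions for illustration, and a real fine-tuning toolkit may expect different names.

```python
import json
import subprocess

RS, US = "\x1e", "\x1f"  # record / unit separators, unlikely to appear in source code


def read_git_log(repo_path: str) -> str:
    """Dump every commit with its patch, using separators that are easy to split on."""
    fmt = "--format=%x1e%H%x1f%s%x1f"  # hash, subject, then the patch follows
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--patch", fmt],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def log_to_records(log_text: str) -> list:
    """Turn the separator-delimited log into one dict per commit."""
    records = []
    for chunk in log_text.split(RS):
        if not chunk.strip():
            continue
        commit_hash, subject, diff = chunk.split(US, 2)
        records.append(
            {"commit": commit_hash, "prompt": subject, "completion": diff.strip()}
        )
    return records


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)
```

Something like `to_jsonl(log_to_records(read_git_log(".")))` would produce the file contents. PR discussions and issues are not in the Git history itself and would have to be exported separately.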

submitted by /u/Minimum_Art_2263

[Request] Need Workout Images Dataset

Greetings! I’m working on a project that requires me to annotate people in different workout postures. I’ll need workout images of individual people whose bodies are either 1) on the ground (Crunches, Russian Twist, etc.) or on any flat surface like a gym bench (Bench Press), or 2) parallel to the ground (Push-Up, Mountain Climbers, etc.).

I’ve already found two for Push-Ups on Roboflow, but the rest have been a pain to find.

Please suggest datasets where I can find such images.

submitted by /u/tekinayor

Soccer Corner Odds Dataset For Betting

Hello everyone,

I am looking for a website, API, or database that contains historical data on corner odds. I have found some databases online, but they all offer only limited odds values covering a few specific betting ranges: for example, fewer than 9, 10-12, and more than 13 (Betfair’s free historic data service). I am looking for a database that includes odds for over, exactly, and under for each corner value across a large range (4 to 18 corners), as I have built a betting model based on these types of odds. I just need a good database to test the model.

submitted by /u/Fun-Associate-6139

Where Can I Find Audio Datasets For Automotive Engines

Hi there. For my graduation project I’m going to build a DL/ML model that recognizes the sounds of mechanical failures in a passenger car. For example, when a bearing is going bad you hear a specific noise that is familiar to mechanics but not to the average user. I’ve searched Kaggle, UCI, and many other sites and still found no results. Can anyone give me a clue where I can find this data?

submitted by /u/Frequent-Newspaper40

Dataset For Background Music / Sound Effects

I want to build a library of background music and sound effects, label them into categories/sub-categories, and create a properly indexed dataset.

I am willing to structure it myself, but so far I haven’t been able to find a good, reliable data source that offers this music and these sound effects under a Creative Commons license (free to use). Any help will be greatly appreciated.

submitted by /u/jayesh_f33l

Needing Data For Pornhub Analysis From X-present. Machine Learning Project.

Hello everyone,

I’m planning to compile data from Pornhub to conduct an analysis that explores the relationship between pornography consumption across different generations and its potential links to issues such as addiction, depression, and other related concerns. My goal is to identify patterns that might contribute to a solution for porn addiction. I’ll be participating in a hackathon in 21 days, and I need .csv files for this data analysis. Does anyone know if Pornhub provides such data?

submitted by /u/02Mellow

Historical Sports Bet Odds Past 2020?

Hi all, I’m doing some research on ML and AI and I’m trying to find a historical sports betting odds API. I’ve checked previous threads, and although some do list resources, they weren’t quite what I needed.

I’m trying to find an API (preferably; a spreadsheet will work if one isn’t available) for historic betting odds for different sports. I’m using https://the-odds-api.com currently, and it has the data I need, just not for the full date range.

Looking for something that goes back to 2019, but also if possible, back to 2011 would be great.

Let me know. Thanks!

submitted by /u/Legal-Yam-235

Data Set For All S&P 500 Company Ratios From 2020-2023

Not sure if I am in the right place, but I’m hoping someone can lead me in the right direction at least.

I am a masters student looking to do a research paper on how data science can be used to find undervalued stocks.

The specific ratios I am looking for are: P/E Ratio, P/B Ratio, PEG ratio, Dividend yield, Debt to equity, Return on assets, Return on equity, EPS, EV/EBITDA, and Free cash flow.

Would also be nice to know the stock price and ticker symbol

An example:

AAPL 2020
PRICE: X
P/E Ratio: x
P/B Ratio: X
PEG ratio: x
Dividend yield: x
Debt to equity: x
Return on assets: x
Return on equity: x
EPS: x
EV/EBITDA: x
Free cash flow: x

Then the next year after:

AAPL 2021
PRICE: X
P/E Ratio: x
P/B Ratio: X
PEG ratio: x
Dividend yield: x
Debt to equity: x
Return on assets: x
Return on equity: x
EPS: x
EV/EBITDA: x
Free cash flow: x

Then 2022 and so on till the year 2023.

I am not a coder, but I have tried extensively to make a program using ChatGPT and Gemini to scrape the data from multiple sources. I was able to get a list of everything I was looking for for the year 2024 using yfinance in Python, but I was not able to get the historical data using yfinance. I have tried my hand at scraping the data from EDGAR as well, but as I said, I am not a coder and could not figure it out. I would be willing to pay $10-50 for the dataset from a website too, but I could not find one that was easy to use and had all the info I was looking for. (I did find one, I believe, but they wanted $1800 for it.) I’m willing to get on a phone call or Discord call if that helps.
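
If only raw fundamentals turn up for the historical years, several of the ratios listed above can be derived from them. The sketch below uses the textbook definitions. The `Fundamentals` field names are hypothetical, and PEG, EV/EBITDA, and free cash flow are left out because they need extra inputs (a growth estimate, enterprise value, cash-flow statement items).

```python
from dataclasses import dataclass


@dataclass
class Fundamentals:
    """One company-year of raw inputs (field names are illustrative only)."""
    ticker: str
    year: int
    price: float
    net_income: float
    shares_outstanding: float
    total_equity: float
    total_assets: float
    total_debt: float
    dividends_per_share: float


def ratios(f: Fundamentals) -> dict:
    """Compute per-year valuation ratios from raw fundamentals."""
    eps = f.net_income / f.shares_outstanding
    book_value_per_share = f.total_equity / f.shares_outstanding
    return {
        "ticker": f.ticker,
        "year": f.year,
        "price": f.price,
        "eps": eps,
        "pe": f.price / eps,
        "pb": f.price / book_value_per_share,
        "dividend_yield": f.dividends_per_share / f.price,
        "debt_to_equity": f.total_debt / f.total_equity,
        "roa": f.net_income / f.total_assets,
        "roe": f.net_income / f.total_equity,
    }
```

Calling `ratios` once per ticker and year gives one row each, which matches the AAPL 2020 / AAPL 2021 layout described above.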

submitted by /u/SadPhone8067

Searching For Nepali Handwritten Word Datasets.

I’ve been searching for datasets that primarily focus on Nepali handwritten words or documents, but so far I’ve only found resources related to numerals and characters. Handwritten documents in Devanagari script would also come in handy. Can someone help me get this dataset?

I’ve already checked platforms like Kaggle, Zenodo, and other usual sources but haven’t had much luck. Does anyone here know where I might find such a dataset, or could point me in the right direction?

Any help or advice would be greatly appreciated!

submitted by /u/East-Scarcity-6357