Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Where Can I Find Audio Datasets For Automotive Engines

Hi there, I’m working on a graduation project: a DL/ML model that recognizes the sounds of mechanical failures in a passenger car. For example, when a bearing is going bad you hear a specific noise that is familiar to mechanics but not to the average user. I’ve searched Kaggle, UCL, and many other sites, but still no results. Can anyone give me a clue where I can find this data?

submitted by /u/Frequent-Newspaper40

Dataset For Background Music / Sound Effects

I want to build a library of background music and sound effects, label them by category and sub-category, and create a properly indexed dataset.

I am willing to structure it myself, but so far I haven’t been able to find a good, reliable data source that offers music and sound effects under a Creative Commons license (free to use). Any help will be greatly appreciated.

submitted by /u/jayesh_f33l

Needing Data For Pornhub Analysis From X-present. Machine Learning Project.

Hello everyone,

I’m planning to compile data from Pornhub to conduct an analysis that explores the relationship between pornography consumption across different generations and its potential links to issues such as addiction, depression, and other related concerns. My goal is to identify patterns that might contribute to a solution for porn addiction. I’ll be participating in a hackathon in 21 days, and I need .csv files for this data analysis. Does anyone know if Pornhub provides such data?

submitted by /u/02Mellow

Historical Sports Bet Odds Past 2020?

Hi all, I’m doing some research on ML and AI and I’m trying to find a historical sports betting odds API. I’ve checked previous threads, and although some do list resources, they weren’t quite what I needed.

I’m trying to find an API (preferably; a spreadsheet will work if one isn’t available) for historical betting odds for different sports. I’m currently using https://the-odds-api.com, and it has the data I need, just not for the full date range.

Looking for something that goes back to 2019, but also if possible, back to 2011 would be great.

Let me know. Thanks!

submitted by /u/Legal-Yam-235

Data Set For All S&P 500 Company Ratios From 2020-2023

Not sure if I am in the right place, but I’m hoping someone can at least point me in the right direction.

I am a master’s student looking to do a research paper on how data science can be used to find undervalued stocks.

The specific ratios I am looking for are: P/E ratio, P/B ratio, PEG ratio, dividend yield, debt to equity, return on assets, return on equity, EPS, EV/EBITDA, and free cash flow.

Would also be nice to know the stock price and ticker symbol

An example: AAPL 2020, Price: x, P/E ratio: x, P/B ratio: x, PEG ratio: x, Dividend yield: x, Debt to equity: x, Return on assets: x, Return on equity: x, EPS: x, EV/EBITDA: x, Free cash flow: x

Then the next year after:

AAPL 2021, Price: x, P/E ratio: x, P/B ratio: x, PEG ratio: x, Dividend yield: x, Debt to equity: x, Return on assets: x, Return on equity: x, EPS: x, EV/EBITDA: x, Free cash flow: x

Then 2022 and so on till the year 2023.

I am not a coder, but I have tried extensively to make a program using ChatGPT and Gemini to scrape the data from multiple sources. I was able to get a list of everything I was looking for for the year 2024 using yfinance in Python, but was not able to get the historical data with it. I have also tried scraping the data from EDGAR, but as I said, I am not a coder and could not figure it out. I would be willing to pay $10-50 for the dataset from a website, but could not find one that was easy to use and had all the info I was looking for. (I believe I did find one, but they wanted $1800 for it.) I am willing to get on a phone call or Discord call if that helps.
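If the raw yearly fundamentals can be obtained from some source, the ratios listed above could be derived rather than scraped one by one. The following is only a minimal sketch under that assumption: the `Fundamentals` field names, the `compute_ratios` helper, and the sample numbers are all hypothetical, and it uses common textbook definitions (for example, debt to equity as total debt over shareholders’ equity), which some data providers define differently. It does not guard against zero or negative earnings.

```python
from dataclasses import dataclass


@dataclass
class Fundamentals:
    """Raw yearly figures for one company (all field names are hypothetical)."""
    ticker: str
    year: int
    price: float                 # share price at fiscal year end
    shares_outstanding: float
    net_income: float
    eps_growth_pct: float        # expected annual EPS growth, in percent
    dividends_per_share: float
    total_debt: float
    cash: float
    total_equity: float
    total_assets: float
    ebitda: float
    operating_cash_flow: float
    capital_expenditures: float


def compute_ratios(f: Fundamentals) -> dict:
    """Derive the requested ratios using common textbook definitions."""
    eps = f.net_income / f.shares_outstanding
    pe = f.price / eps
    book_value_per_share = f.total_equity / f.shares_outstanding
    market_cap = f.price * f.shares_outstanding
    enterprise_value = market_cap + f.total_debt - f.cash
    return {
        "ticker": f.ticker,
        "year": f.year,
        "price": f.price,
        "pe": pe,
        "pb": f.price / book_value_per_share,
        "peg": pe / f.eps_growth_pct,
        "dividend_yield": f.dividends_per_share / f.price,
        "debt_to_equity": f.total_debt / f.total_equity,
        "return_on_assets": f.net_income / f.total_assets,
        "return_on_equity": f.net_income / f.total_equity,
        "eps": eps,
        "ev_to_ebitda": enterprise_value / f.ebitda,
        "free_cash_flow": f.operating_cash_flow - f.capital_expenditures,
    }


# Made-up numbers for a fictional ticker, purely to show the output shape.
example = Fundamentals(
    ticker="XYZ", year=2020, price=100.0, shares_outstanding=1000.0,
    net_income=5000.0, eps_growth_pct=10.0, dividends_per_share=2.0,
    total_debt=20000.0, cash=5000.0, total_equity=40000.0,
    total_assets=80000.0, ebitda=10000.0,
    operating_cash_flow=8000.0, capital_expenditures=3000.0,
)
ratios = compute_ratios(example)
```

Calling `compute_ratios` once per ticker and year would give one row per company-year, matching the layout in the example above.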

submitted by /u/SadPhone8067

Searching For Nepali Handwritten Word Datasets.

I’ve been searching for datasets that primarily focus on Nepali handwritten words or documents, but so far I’ve only found resources related to numerals and characters. Handwritten documents in Devanagari script would also come in handy. Can someone help me find such a dataset?

I’ve already checked platforms like Kaggle, Zenodo, and other usual sources but haven’t had much luck. Does anyone here know where I might find such a dataset, or could point me in the right direction?

Any help or advice would be greatly appreciated!

submitted by /u/East-Scarcity-6357

Need Dataset For X-Ray Images Of Fractures

Hi, we’re working on a medical imaging project for fracture detection in X-ray images, performing segmentation and then classification of fractures in an X-ray. So far we’ve struggled to find good datasets, and I was hoping for some suggestions or resources where I can find annotated X-ray images of fractures.

submitted by /u/wajahatsatti018

The Big Porn Dataset – Over 20 Million Video URLs

The Big Porn Dataset is the largest and most comprehensive collection of adult content available on the web. With 23,686,411 video URLs, it possibly exceeds every other porn dataset.

I got quite a lot of feedback. I’ve removed unnecessary tags (some I couldn’t include due to the size of the dataset) and added others.

Use Cases

Since many people said my previous dataset was a “useless dataset”, I will include Use Cases for each column.

Website – Analyze which website has the most videos, analyze trends based on the website.

URL – Webscrape the URLs to obtain metadata from the models or scrape comments (“https://pornhub.com/comment/show?id={video_id}}&limit=10&popular=1&what=video”). 😉

Title – Train an LLM to generate your own titles. See below.

Tags – Analyze the tags based on platform, which ones appear the most, etc.

Upload Date – Analyze preferences based on upload date.

Video ID – Useful for webscraping comments, etc.

Large Language Model

I have trained a Large Language Model on all English titles. I won’t publish it, but I’ll show you examples of what you can do with The Big Porn Dataset.

Generated titles:

F…ing My Stepmom While She Talks Dirty Ho.ny Latina Slu..y Girl Wants Ha..core An.l S.x Solo teen p…y play B.g t.t teen gets f….d hard S.xy E..ny Girlfriend

(I censored them because… no.)

Note: This dataset contains sensitive content and is intended solely for research and educational purposes. 😉 Please ensure compliance with all relevant regulations and guidelines when using this data. Use responsibly. 😊

More information on Huggingface and Twitter:

https://huggingface.co/datasets/Nikity/Big-Porn

https://x.com/itsnikity

submitted by /u/itsnikity

Launched An Amazon Product Search API

Hey everyone,

I’ve just published a new API on RapidAPI for searching Amazon products, and I’d love to get your feedback. If you’re working on any e-commerce, market analysis, or comparison projects, this could be a helpful tool for you.

What it does:

Real-time Product Search: Fetch detailed Amazon product information based on keywords, categories, or ASINs.

Comprehensive Data: Access pricing, availability, ratings, and more across various product categories.

Why I built it:

I noticed a gap in easy access to Amazon’s massive product catalog for smaller developers and side projects, so I decided to create this API to fill that gap. It’s designed to be straightforward and developer-friendly, aiming to save time and effort when integrating Amazon product data.

Thanks for taking the time to check this out!

I’m excited to hear what this community thinks.

submitted by /u/Affectionate-Olive80

Seeking SVG Dataset For Image Retrieval (CBIR)

I’m working on a project involving Content-Based Image Retrieval (CBIR) and I’m specifically looking for datasets in SVG format. Most datasets I’ve found are in raster formats (like JPG or PNG), but I need scalable vector graphics for my experiments. Has anyone come across an SVG dataset suitable for CBIR? Any suggestions or research papers on SVG-based image retrieval would be greatly appreciated!

submitted by /u/Ornery-Vacation-5632

Periodically Updated Dataset Of All Public Repositories On GitHub With Their Description

Does it exist? I am aware of GitHub Archive on BigQuery, and presumably it could be used to get this dataset, but it would be really inefficient because GitHub Archive contains all “events” on GitHub, like git pushes, commits, issues, etc. I would need to read the entire dataset to get all the public repositories.

There is another dataset on BigQuery, publicly hosted by Google, containing all packages on PyPI, Maven, npm, etc., but I also need repositories which are not necessarily packages.
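To make the inefficiency concrete, here is a rough sketch of what deduplicating repositories from an event stream could look like. It assumes each event is one JSON object per line with a `repo` field holding `id` and `name`; the `unique_repos` helper and the sample events are hypothetical, and the sketch only collects repository names, not descriptions.

```python
import json


def unique_repos(event_lines):
    """Collect one entry per repository from a stream of JSON event lines."""
    repos = {}
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        repo = event.get("repo") or {}
        if "id" in repo and "name" in repo:
            # Later events overwrite earlier ones, so a rename keeps the newest name.
            repos[repo["id"]] = repo["name"]
    return repos


# Hypothetical sample events; a real scan would stream every archived event.
sample_events = [
    '{"type": "PushEvent", "repo": {"id": 1, "name": "alice/demo"}}',
    '{"type": "IssuesEvent", "repo": {"id": 1, "name": "alice/demo"}}',
    '{"type": "WatchEvent", "repo": {"id": 2, "name": "bob/tool"}}',
]
repos = unique_repos(sample_events)
```

Every event has to be read once even though most of them repeat a repository already seen, which is why a full scan is needed just to list the repositories.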

Any help is appreciated.

submitted by /u/GullibleEngineer4

Coordinate System For NREL Wind Resource Database

I’m working with geospatial windspeed data from the NREL Wind Resource Database, but it’s not clear what coordinate reference system is being used. I found on their GitHub that they use a “modified Lambert-conic” system, but none of the various Lambert-conic EPSGs or PROJ strings I’ve found online seem to be correct.

Does anyone know how I can find out the exact CRS they used? Thanks 🙂
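One possible reason no EPSG code matches is that a custom Lambert conformal conic projection is fully determined by its own parameters (standard parallels, origin, and the sphere or ellipsoid), which need not correspond to any registered code. The sketch below is a plain spherical forward projection using the standard textbook formulas. The `lcc_forward` name, the parameter values, and the sphere radius are all assumptions for illustration and are not taken from NREL.

```python
import math


def lcc_forward(lat_deg, lon_deg, lat_1, lat_2, lat_0, lon_0, radius=6_370_000.0):
    """Spherical Lambert conformal conic: (lat, lon) in degrees -> (x, y) in metres.

    lat_1, lat_2: standard parallels; lat_0, lon_0: projection origin.
    The default radius is an assumed sphere, not a documented value.
    """
    phi, lam = math.radians(lat_deg), math.radians(lon_deg)
    phi1, phi2 = math.radians(lat_1), math.radians(lat_2)
    phi0, lam0 = math.radians(lat_0), math.radians(lon_0)

    def t(p):
        return math.tan(math.pi / 4 + p / 2)

    # Cone constant; with a single standard parallel it reduces to sin(phi1).
    if math.isclose(phi1, phi2):
        n = math.sin(phi1)
    else:
        n = math.log(math.cos(phi1) / math.cos(phi2)) / math.log(t(phi2) / t(phi1))

    f = math.cos(phi1) * t(phi1) ** n / n
    rho = radius * f / t(phi) ** n
    rho0 = radius * f / t(phi0) ** n
    theta = n * (lam - lam0)
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)


# Hypothetical parameters, chosen only to exercise the function.
params = dict(lat_1=30.0, lat_2=60.0, lat_0=40.0, lon_0=-97.0)
origin_xy = lcc_forward(40.0, -97.0, **params)
north_xy = lcc_forward(45.0, -97.0, **params)
```

Changing any one parameter, including the radius, shifts every projected coordinate, so comparing a few known points against candidate parameter sets is one way to narrow down which definition a dataset uses.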

submitted by /u/Broseph729

Pornhub Dataset: Over 700K Video Urls And More!

The Pornhub Dataset provides a comprehensive collection of data sourced from ph, encompassing various details from MANYYY videos available on the platform. The file consists of 742,133 lines of videos.

This dataset covers a diverse array of languages, with the video titles indicating 53 different languages.

Note: This dataset contains sensitive content and is intended solely for research and educational purposes. 😉 Please ensure compliance with all relevant regulations and guidelines when using this data. Use responsibly. 😊

Pornhub Dataset ❤️

submitted by /u/itsnikity

Calling AI Engineers: Offer To Build A Dataset From Scratch For Fine Tuning LLMs

Hi there,

I’m the Co-Founder of a startup specialised in creating custom datasets for AI.

We are currently growing and willing to invest in a few datasets that we will offer to the AI community. Up to 3 datasets will be built and made available on HuggingFace over the coming months.

So I thought I would ask the community: what dataset do you think is difficult to find and would help your LLM fine-tuning use cases? Our clients often ask us for coding datasets (e.g. prompts and responses about how to develop in C++), but this could be anything.

Let me know your thoughts!

Cheers.

submitted by /u/Any-Adagio-6174

[REQUEST] Dataset Of Archaeological Site Photos Before (and After) Excavation

Hi all,

I’m working on a project to develop a system for detecting potential archaeological sites from photos. To train this system, I’m looking for a dataset of photos of archaeological sites taken before and after excavation.

The idea is to have a dataset that shows the visual changes in the landscape and terrain before and after an archaeological dig. This could help the model learn to recognize visual cues and patterns that indicate the presence of buried archaeological features.

Thank you

submitted by /u/AdEmpty878

Mouse Tracking For Bot Detection In CAPTCHA Systems

Purpose:

We are seeking a comprehensive dataset that includes mouse movement data for the purpose of distinguishing between human users and automated bots in web-based CAPTCHA systems. The goal is to develop and refine machine learning models that can accurately identify bot-like behavior based on mouse interaction patterns, enhancing the security and effectiveness of CAPTCHA systems.

Dataset Requirements:

Mouse Movement Data: Raw data capturing mouse coordinates, velocity, acceleration, and direction changes as users interact with a web page.

Click Event Data: Records of click positions, timing, and frequency to analyze the decision-making process and interaction speed.

Human vs. Bot Interaction: Clear distinction between data generated by human users and data generated by automated scripts (bots). This will allow for supervised learning and model training.

Time-Series Data: Sequential data capturing the timestamp of each mouse event to analyze the flow and pattern of movements.

Behavioral Biometrics: Data capturing user-specific behaviors that might indicate human-like randomness or bot-like precision in interactions.

Variety of Interactions: Diverse interaction scenarios, including different types of CAPTCHA challenges (e.g., image recognition, text entry) and general web browsing activities.
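As a rough illustration of the requirements above, velocity, acceleration, and direction changes could be derived from raw timestamped coordinates along these lines. This is a minimal sketch: the `mouse_features` helper, the `(t, x, y)` sample layout, and the 30-degree turn threshold are all assumptions made for the example.

```python
import math


def mouse_features(samples, turn_threshold_deg=30.0):
    """Derive simple motion features from (t_seconds, x, y) samples sorted by time."""
    speeds = []      # speed in pixels per second for each segment
    midpoints = []   # midpoint time of each segment, used for acceleration
    headings = []    # direction of travel in radians for segments with movement
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        dx, dy = x1 - x0, y1 - y0
        speeds.append(math.hypot(dx, dy) / dt)
        midpoints.append((t0 + t1) / 2)
        if dx != 0 or dy != 0:
            headings.append(math.atan2(dy, dx))

    # Acceleration as the change in speed between consecutive segments.
    accelerations = [
        (v1 - v0) / (m1 - m0)
        for (v0, m0), (v1, m1) in zip(zip(speeds, midpoints),
                                      zip(speeds[1:], midpoints[1:]))
    ]

    # Count heading changes larger than the threshold, wrapping to [-pi, pi).
    threshold = math.radians(turn_threshold_deg)
    direction_changes = 0
    for h0, h1 in zip(headings, headings[1:]):
        delta = (h1 - h0 + math.pi) % (2 * math.pi) - math.pi
        if abs(delta) > threshold:
            direction_changes += 1

    return {
        "speeds": speeds,
        "accelerations": accelerations,
        "direction_changes": direction_changes,
        "mean_speed": sum(speeds) / len(speeds) if speeds else 0.0,
    }


# Hypothetical trace: two steps to the right, then one step down the page.
trace = [(0.0, 0, 0), (0.5, 10, 0), (1.0, 20, 0), (1.5, 20, 10)]
features = mouse_features(trace)
```

Features like these would be computed per session and paired with a human or bot label, which is what the supervised setup described above relies on.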

submitted by /u/RareNeedleworker832

Popular Data Sets Bringing Down My Resume?

TL;DR: should I avoid popular data set topics, just specific popular data sets, or neither?

I’ve heard that using common, popular, or “basic” data sets for your projects looks bad on the resume.

Idk if this means I should avoid specific popular data sets (e.g. a Twitter set from Kaggle), or avoid all data sets on a popular topic (e.g. all Twitter sets, whether or not they are from Kaggle).

I have 2 projects on my resume. One is a sentiment analysis using hotel reviews. I don’t think the specific data set is very popular, but I’m worried that the general topic of sentiment analysis on travel reviews might be too popular a topic for a resume project, according to some.

Does my project qualify as too popular/basic to show to recruiters?

For context, I am a new grad with little relevant work experience. I figured that having a project that is very “basic” but well-made is better than a lack of projects.

submitted by /u/Pomegranate6077