Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Need Dataset For X-Ray Images Of Fractures

Hi, we’re working on a medical imaging project for fracture detection in X-ray images, performing segmentation and then classification of fractures in an X-ray. So far we’ve struggled to find good datasets, and I was hoping for some suggestions or resources where I can find annotated X-ray images of fractures.

submitted by /u/wajahatsatti018

The Big Porn Dataset – Over 20 Million Video URLs

The Big Porn Dataset is the largest and most comprehensive collection of adult content available on the web. With 23,686,411 video URLs, it possibly exceeds every other porn dataset.

I got quite a lot of feedback. I’ve removed unnecessary tags (some I couldn’t include due to the size of the dataset) and added others.

Use Cases

Since many people said my previous dataset was a “useless dataset”, I will include Use Cases for each column.

Website – Analyze which website has the most videos, or analyze trends by website.

URL – Web-scrape the URLs to obtain metadata about the models, or scrape comments (“https://pornhub.com/comment/show?id={video_id}&limit=10&popular=1&what=video”). 😉

Title – Train an LLM to generate your own titles. See below.

Tags – Analyze the tags by platform, which ones appear the most, etc.

Upload Date – Analyze preferences based on upload date.

Video ID – Useful for web-scraping comments, etc.

Large Language Model

I have trained a Large Language Model on all English titles. I won’t publish it, but I’ll show you examples of what you can do with The Big Porn Dataset.

Generated titles:

F…ing My Stepmom While She Talks Dirty Ho.ny Latina Slu..y Girl Wants Ha..core An.l S.x Solo teen p…y play B.g t.t teen gets f….d hard S.xy E..ny Girlfriend

(I censored them because… no.)

Note: This dataset contains sensitive content and is intended solely for research and educational purposes. 😉 Please ensure compliance with all relevant regulations and guidelines when using this data. Use responsibly. 😊

More information on Hugging Face and Twitter:

https://huggingface.co/datasets/Nikity/Big-Porn

https://x.com/itsnikity

submitted by /u/itsnikity

Launched An Amazon Product Search API

Hey everyone,

I’ve just published a new API on RapidAPI for searching Amazon products, and I’d love to get your feedback. If you’re working on any e-commerce, market analysis, or comparison projects, this could be a helpful tool for you.

What it does:

Real-time Product Search: Fetch detailed Amazon product information based on keywords, categories, or ASINs.

Comprehensive Data: Access pricing, availability, ratings, and more across various product categories.

Why I built it:

I noticed a gap in easy access to Amazon’s massive product catalog for smaller developers and side projects, so I decided to create this API to fill that gap. It’s designed to be straightforward and developer-friendly, aiming to save time and effort when integrating Amazon product data.

Thanks for taking the time to check this out!

I’m excited to hear what this community thinks.

submitted by /u/Affectionate-Olive80

Seeking SVG Dataset For Image Retrieval (CBIR)

I’m working on a project involving Content-Based Image Retrieval (CBIR) and I’m specifically looking for datasets in SVG format. Most datasets I’ve found are in raster formats (like JPG or PNG), but I need scalable vector graphics for my experiments. Has anyone come across an SVG dataset suitable for CBIR? Any suggestions or research papers on SVG-based image retrieval would be greatly appreciated!

submitted by /u/Ornery-Vacation-5632

Periodically Updated Dataset Of All Public Repositories On GitHub With Their Description

Does it exist? I am aware of GitHub Archive on BigQuery, and presumably it could be used to build this dataset, but that would be really inefficient: GitHub Archive contains all “events” on GitHub (pushes, commits, issues, etc.), so I would need to read the entire dataset just to get the list of public repositories.

There is another dataset on BigQuery, publicly hosted by Google, containing all packages on PyPI, Maven, npm, etc., but I also need repositories that are not necessarily packages.

Any help is appreciated.

submitted by /u/GullibleEngineer4
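
One possible route, offered as a sketch rather than a tested pipeline: the GitHub REST API has a "List public repositories" endpoint (GET /repositories) that walks public repositories in creation order and takes a `since` repository ID, and each entry carries a `description` field. The helper names below (`parse_page`, `fetch_page`, `iter_public_repos`) are made up for illustration, and rate limits are not handled, so a full crawl would need a token and a lot of patience.

```python
import json
import urllib.request
from typing import Iterator, Optional

# GitHub REST API endpoint that lists public repositories in creation order.
API_URL = "https://api.github.com/repositories"

Row = tuple[int, str, Optional[str]]


def parse_page(payload: str) -> list[Row]:
    """Extract (id, full_name, description) from one JSON page of results."""
    return [(r["id"], r["full_name"], r.get("description")) for r in json.loads(payload)]


def fetch_page(since: int, token: Optional[str] = None) -> str:
    """Fetch the page of public repositories whose ID is greater than `since`."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # anonymous requests are heavily rate limited
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(f"{API_URL}?since={since}", headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def iter_public_repos(since: int = 0, max_pages: int = 1,
                      token: Optional[str] = None) -> Iterator[Row]:
    """Yield repositories page by page, resuming from the last ID seen."""
    for _ in range(max_pages):
        rows = parse_page(fetch_page(since, token))
        if not rows:
            return
        yield from rows
        since = rows[-1][0]  # the next page starts after the highest ID so far
```

Storing the last ID seen would let a periodic job pick up only newly created repositories. Descriptions edited after the first crawl would not be refreshed this way, so that part would still need a separate pass.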

Coordinate System For NREL Wind Resource Database

I’m working with geospatial windspeed data from the NREL Wind Resource Database, but it’s not clear what coordinate reference system is being used. I found on their GitHub that they use a “modified Lambert-conic” system, but none of the various Lambert-conic EPSGs or PROJ strings I’ve found online seem to be correct.

Does anyone know how I can find out the exact CRS they used? Thanks 🙂

submitted by /u/Broseph729
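
One way to narrow this down, as a sketch under assumptions: if a few grid cells can be paired with both their latitude/longitude and their projected x/y, candidate parameter sets can be scored by how well they reproduce those pairs. The code below implements the standard spherical Lambert conformal conic forward formulas. The spherical earth is a guess (weather-model output is often produced on a sphere; WRF, for instance, uses a 6,370 km radius), and whether that is what "modified" means for this database is not confirmed. The function names and the control-point layout are hypothetical.

```python
import math


def lcc_forward(lat, lon, lat_1, lat_2, lat_0, lon_0, radius):
    """Spherical Lambert conformal conic: degrees in, projected metres out."""
    phi, lam = math.radians(lat), math.radians(lon)
    phi1, phi2 = math.radians(lat_1), math.radians(lat_2)
    phi0, lam0 = math.radians(lat_0), math.radians(lon_0)

    def t(p):
        return math.tan(math.pi / 4 + p / 2)

    if math.isclose(phi1, phi2):  # single standard parallel
        n = math.sin(phi1)
    else:
        n = math.log(math.cos(phi1) / math.cos(phi2)) / math.log(t(phi2) / t(phi1))
    f = math.cos(phi1) * t(phi1) ** n / n
    rho = radius * f / t(phi) ** n
    rho0 = radius * f / t(phi0) ** n
    theta = n * (lam - lam0)
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)


def rms_error(params, control_points):
    """RMS distance (metres) between projected lat/lon and the known x/y."""
    total = 0.0
    for lat, lon, x, y in control_points:
        px, py = lcc_forward(lat, lon, **params)
        total += (px - x) ** 2 + (py - y) ** 2
    return math.sqrt(total / len(control_points))


def best_candidate(candidates, control_points):
    """Name of the candidate parameter set with the smallest RMS error."""
    return min(candidates, key=lambda name: rms_error(candidates[name], control_points))
```

Each candidate is a dict with the keys `lat_1`, `lat_2`, `lat_0`, `lon_0`, and `radius`. A candidate that is right should give an RMS error far below the grid spacing, while a wrong radius or standard parallel tends to show up as an error that grows with distance from the projection origin.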

Pornhub Dataset: Over 700K Video URLs And More!

The Pornhub Dataset provides a comprehensive collection of data sourced from Pornhub, encompassing various details from MANYYY videos available on the platform. The file consists of 742,133 lines of video entries.

This dataset spans a diverse array of languages: the video titles indicate 53 different languages.

Note: This dataset contains sensitive content and is intended solely for research and educational purposes. 😉 Please ensure compliance with all relevant regulations and guidelines when using this data. Use responsibly. 😊

Pornhub Dataset ❤️

submitted by /u/itsnikity

Calling AI Engineers: Offer To Build A Dataset From Scratch For Fine Tuning LLMs

Hi there,

I’m the Co-Founder of a startup specialised in creating custom datasets for AI.

We are currently growing and are willing to invest in a few datasets that we will offer to the AI community. Up to 3 datasets will be built and made available on Hugging Face over the coming months.

So I thought I would ask the community: what dataset do you think is difficult to find and would help your LLM fine-tuning use cases? Our clients often ask us for coding datasets (e.g. prompts and responses about how to develop in C++), but this could be anything.

Let me know your thoughts!

Cheers.

submitted by /u/Any-Adagio-6174

[REQUEST] Dataset Of Archaeological Site Photos Before (and After) Excavation

Hi all,

I’m working on a project to develop a system for detecting potential archaeological sites from photos. To train this system, I’m looking for a dataset of photos of archaeological sites taken before and after excavation.

The idea is to have a dataset that shows the visual changes in the landscape and terrain before and after an archaeological dig. This could help the model learn to recognize visual cues and patterns that indicate the presence of buried archaeological features.

Thank you

submitted by /u/AdEmpty878

Mouse Tracking For Bot Detection In CAPTCHA Systems

Purpose:

We are seeking a comprehensive dataset that includes mouse movement data for the purpose of distinguishing between human users and automated bots in web-based CAPTCHA systems. The goal is to develop and refine machine learning models that can accurately identify bot-like behavior based on mouse interaction patterns, enhancing the security and effectiveness of CAPTCHA systems.

Dataset Requirements:

Mouse Movement Data: Raw data capturing mouse coordinates, velocity, acceleration, and direction changes as users interact with a web page.

Click Event Data: Records of click positions, timing, and frequency to analyze the decision-making process and interaction speed.

Human vs. Bot Interaction: Clear distinction between data generated by human users and data generated by automated scripts (bots). This will allow for supervised learning and model training.

Time-Series Data: Sequential data capturing the timestamp of each mouse event to analyze the flow and pattern of movements.

Behavioral Biometrics: Data capturing user-specific behaviors that might indicate human-like randomness or bot-like precision in interactions.

Variety of Interactions: Diverse interaction scenarios, including different types of CAPTCHA challenges (e.g., image recognition, text entry) and general web browsing activities.

submitted by /u/RareNeedleworker832
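
To make the movement requirements above concrete, here is a minimal sketch of how velocity, acceleration, and direction changes could be derived from raw timestamped coordinates. The `MouseEvent` layout, the feature names, and the 0.5 radian turn threshold are assumptions for illustration, not part of any existing dataset.

```python
import math
from dataclasses import dataclass


@dataclass
class MouseEvent:
    t: float  # timestamp in seconds
    x: float  # pointer position in pixels
    y: float


def _wrap(angle: float) -> float:
    """Map an angle difference into [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def movement_features(events: list[MouseEvent], turn_threshold: float = 0.5) -> dict:
    """Summarise speed, acceleration, and heading changes of one trajectory."""
    speeds, headings, steps = [], [], []
    for a, b in zip(events, events[1:]):
        dt = b.t - a.t
        if dt <= 0:  # skip duplicate or out-of-order timestamps
            continue
        dx, dy = b.x - a.x, b.y - a.y
        speeds.append(math.hypot(dx, dy) / dt)  # pixels per second
        headings.append(math.atan2(dy, dx))     # direction of travel in radians
        steps.append(dt)
    # Acceleration is the change in speed between consecutive segments.
    accels = [(s2 - s1) / dt for s1, s2, dt in zip(speeds, speeds[1:], steps[1:])]
    # A "direction change" is a heading jump larger than the threshold.
    turns = [abs(_wrap(h2 - h1)) for h1, h2 in zip(headings, headings[1:])]
    return {
        "mean_speed": sum(speeds) / len(speeds) if speeds else 0.0,
        "max_abs_accel": max((abs(a) for a in accels), default=0.0),
        "direction_changes": sum(1 for turn in turns if turn > turn_threshold),
    }
```

Features like these would be computed per session and fed to a classifier together with the human/bot label. Perfectly constant speed and zero heading jitter are the sort of bot-like precision the requirements mention.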

Popular Data Sets Bringing Down My Resume?

Tldr: should I avoid popular data set topics, just specific popular data sets, or neither?

I’ve heard that using common, popular, or “basic” data sets for your projects looks bad on the resume.

Idk if this means I should avoid specific popular data sets (ex/ a twitter set from Kaggle), or avoid all data sets of a popular topic (ex/ all twitter sets, whether or not from Kaggle)

I have 2 projects on my resume. One is a sentiment analysis using hotel reviews. I don’t think the specific data set is very popular, but I’m worried that the general topic of sentiment analysis on travel reviews might be too popular a topic for a resume project, according to some.

Does my project qualify as too popular/basic to show to recruiters?

For context, I am a new grad with little relevant work experience. I figured that having a project that is very “basic” but well-made is better than a lack of projects.

submitted by /u/Pomegranate6077

Business Transformation Assets And Artefacts

🚀 Business Transformation Assets Sale: Premium Guides & Reference Materials 🚀

Unlock the secrets behind successful business transformations with exclusive assets from top-tier firms like Accenture, JPMorgan Chase, EY, PwC, Deloitte, and KPMG!

📂 What’s Included? Business Transformation Assets for 18 Key Business Functions:

Commerce, Cyber, Data & Analytics, Finance, Global Business Service, Human Resources, Information Technology, Internal Audit, Legal, Marketing, Procurement, Resilience, Risk, Sales, Service, Service Management Framework, Supply Chain Management, Sustainability

📊 Assets Provided:

Target Operating Models, Guides, Reference Materials (Process Taxonomies, Maturity Model Scale, etc.), Engagement Artefacts

🔧 Supported Technological Platforms:

Tech Agnostic, Ivalua, Coupa, SAP, Salesforce, Workday, Microsoft, ServiceNow, Okta

🌟 Why Buy?

Lifetime Access: One-time purchase with lifetime access to a Google Drive containing all the assets.

Comprehensive Coverage: All the tools and guides you need to revolutionize your business across multiple functions.

Proven Success: Backed by the methodologies and frameworks from leading consultancy firms.

Price: 0.05 BTC

PM if interested

submitted by /u/OrganicGoo

Constrained Faces With Ages Datasets

Hello,

I’m looking for datasets that contain faces of people along with their ages. Ideally the photos should be constrained, like passport photos for instance, and should cover a wide range of ages, from 10 or even lower to at least 40. I would also be really interested in constrained videos instead of simple photos. Do you have any suggestions?

Thanks.

submitted by /u/bastmed

Dream Data Set? Mine Would Be Local Traffic Data

every time i drive i find myself wondering what kind of data goes into decisions like stoplight vs stop sign, roundabout, etc. Or like how much collective time is wasted due to an accident. as a kid i used to think about how if an accident caused a 30 minute delay for 500 cars, that was collectively 250 hours of waste. never knew what to do with that data, lol. but anyway yeah i’ve always wanted to get access to data like this.

anyone got any other dream data sets? or even just something that’s super inaccessible if it does technically exist

submitted by /u/bhousecjs
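
For what it's worth, the back-of-the-envelope math in the post checks out. A tiny sketch (the function name is made up, and it counts vehicles rather than people, so passengers are ignored):

```python
def collective_delay_hours(delay_minutes: float, vehicles: int) -> float:
    """Total time lost across all delayed vehicles, in hours."""
    return delay_minutes * vehicles / 60


print(collective_delay_hours(30, 500))  # 250.0
```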

How To Compare Two Data Sets From The Same Time And Proximate Location

Hi there, this is my first post, so I’m not sure if this is the sub for it.

So I am working on a weather dataset (taken from Stats Can: https://climate.weather.gc.ca/index_e.html). The dataset I am working with has some missing values that I wish to fill using another dataset from a similar location. For this I found two other datasets from similar locations, but both report slightly different numbers (as expected).

I wanna figure out if these differences are significant enough for me to not choose these datasets. How do I go about this? Do I use a t-test individually on each column? Or ANOVA?

submitted by /u/Nepoleon_bone_apart
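
One angle on this, as a sketch and not a definitive recipe: because the stations report at the same timestamps, the readings are paired, so a paired comparison on the differences may fit better than an independent t-test or ANOVA. For gap filling, the size of the disagreement (bias and RMSE) may matter more than whether it is statistically significant. The function below assumes two equal-length lists aligned by timestamp, with `None` for missing values; the names are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_summary(target, candidate):
    """Compare two aligned series on the timestamps where both have a value."""
    diffs = [c - t for t, c in zip(target, candidate)
             if t is not None and c is not None]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two overlapping observations")
    bias = mean(diffs)                            # systematic offset
    rmse = math.sqrt(mean(d * d for d in diffs))  # typical size of disagreement
    sd = stdev(diffs)
    # Paired t statistic on the differences (n - 1 degrees of freedom).
    if sd > 0:
        t_stat = bias / (sd / math.sqrt(n))
    else:
        t_stat = 0.0 if bias == 0 else math.copysign(math.inf, bias)
    return {"n": n, "bias": bias, "rmse": rmse, "t_stat": t_stat}
```

Running this once per column for each candidate station gives a like-for-like comparison. Note that daily weather values are autocorrelated, so the t statistic will tend to overstate significance; treating it as a rough guide alongside bias and RMSE is probably safer.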

Looking For Researchers And Members Of AI Development Teams

We are looking for researchers and members of AI development teams who are at least 18 years old with 2+ years in the software development field to take an anonymous survey in support of my research at the University of Maine. This may take 20-30 minutes and will survey your viewpoints on the challenges posed by the future development of AI systems in your industry. If you would like to participate, please read the following recruitment page before continuing to the survey. Upon completion of the survey, you can be entered in a raffle for a $25 Amazon gift card.

https://docs.google.com/document/d/1Jsry_aQXIkz5ImF-Xq_QZtYRKX3YsY1_AJwVTSA9fsA/edit

submitted by /u/wildercb