Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

[Dataset] 19,762 Garbage Images In 10 Classes For AI And Sustainability

Hi everyone,

I’ve just released a new version of the Garbage Classification V2 Dataset on Kaggle. This dataset contains 19,762 high-quality images categorized into 10 classes of common waste items:

- Metal: 1020
- Glass: 3061
- Biological: 997
- Paper: 1680
- Battery: 944
- Trash: 947
- Cardboard: 1825
- Shoes: 1977
- Clothes: 5327
- Plastic: 1984

Key Features:

- Diverse Categories: Covers common household waste items.
- Balanced Distribution: Suitable for robust ML model training.
- Real-World Applications: Ideal for AI-based waste management, recycling programs, and educational tools.

🔗 Dataset Link: Garbage Classification V2

This dataset has already been featured in the research paper, “Managing Household Waste Through Transfer Learning.” Let me know how you’d use this in your projects or research. Your feedback is always welcome!
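For anyone who wants to get started quickly, here is a minimal PyTorch loading sketch. It assumes the Kaggle archive extracts into one folder per class, which is the usual layout for this kind of dataset but worth verifying after download.

```python
# Minimal loading sketch. Assumes the archive extracts to one folder per class
# (garbage-dataset/metal, garbage-dataset/glass, ...) - verify after download.
import torch
from torchvision import datasets, transforms

DATA_DIR = "garbage-dataset"  # adjust to your extraction path

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder(DATA_DIR, transform=transform)
print(dataset.classes)   # the 10 class names, inferred from folder names
print(len(dataset))      # should be roughly 19,762 images

loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)  # e.g. [32, 3, 224, 224] and [32]
```

From there the images drop straight into any torchvision classifier or transfer-learning pipeline.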

submitted by /u/Downtown_Bag8166

Need Images Of Human Arms For Dataset

Hey! I am in the process of creating a dataset for detecting human skin/arms from a close range.

I have gathered about 500 images and drawn polygons around the arms. I did this by taking photos of my own arms and asking my friends to take similar pictures, but I think I still need about 500 more. Is there any way I could get more similar images quickly?

I'm open to posting job ads; is there a place to ask for images of this sort?

I have attached an Imgur album of the kind of images I'm looking for. Thanks for reading!

Note: I have already scoured the stock images on Google, as well as gone through every "arm"-related dataset on Roboflow.

https://imgur.com/a/arm-XZGHgTP - here are the reference images

submitted by /u/blur69xd

[Dataset] Testing The “Pinnacle EV Betting” Theory: FanDuel Vs Pinnacle NFL Line Accuracy (2020-2023)

Dataset Referenced: https://github.com/bentodd1/FanDuelVsPinnacle/blob/master/line_comparison.csv

Background: While building smartbet.name, I noticed many betting sites claim you can do +EV (positive expected value) betting by following Pinnacle's lines. I decided to test this by comparing Pinnacle and FanDuel NFL lines, with surprising results.

Key Findings:

- Dataset: 1,039 NFL games (2020-2023)
- Lines from both books captured week before games
- FanDuel showed better predictive accuracy

Results Breakdown:

Line Accuracy:
- Identical predictions: 457 games (43.98%)
- FanDuel more accurate: 302 games (29.07%)
- Pinnacle more accurate: 280 games (26.95%)

Average Absolute Error:
- Pinnacle: 9.51 points
- FanDuel: 9.05 points

Average Hours Before Game:
- Pinnacle: 88.1 hours
- FanDuel: 58.0 hours

Dataset Access:

- Full Dataset: line_comparison.csv
- Analysis Code: Jupyter Notebook

Methodology: The exact analysis can be seen in the Jupyter notebook. I created the database while building smartbet.name.
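For anyone who wants to sanity-check the headline numbers without opening the notebook, a rough sketch against the CSV could look like the following. The column names used here (pinnacle_spread, fanduel_spread, actual_margin) are hypothetical placeholders, so check the real header row and sign conventions first.

```python
# Rough re-computation sketch against line_comparison.csv. Column names are
# hypothetical placeholders - inspect the real file before running.
import pandas as pd

df = pd.read_csv("line_comparison.csv")

df["pinnacle_error"] = (df["pinnacle_spread"] - df["actual_margin"]).abs()
df["fanduel_error"] = (df["fanduel_spread"] - df["actual_margin"]).abs()

print("Average absolute error")
print("  Pinnacle:", round(df["pinnacle_error"].mean(), 2))
print("  FanDuel: ", round(df["fanduel_error"].mean(), 2))

identical = (df["pinnacle_error"] == df["fanduel_error"]).sum()
fanduel_better = (df["fanduel_error"] < df["pinnacle_error"]).sum()
pinnacle_better = (df["pinnacle_error"] < df["fanduel_error"]).sum()
print("Identical / FanDuel better / Pinnacle better:",
      identical, fanduel_better, pinnacle_better)
```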

These findings challenge conventional wisdom about Pinnacle’s supposed edge in market efficiency.

submitted by /u/bentodd1

Help Finding Data: Measure Of Tourism

Hi guys, I'm doing my dissertation on the effect of precipitation on different aspects of tourism within Ireland, and I'm really struggling to find the dataset I need. I'm looking for any measure of tourism, e.g. visitor numbers, hotel occupancy, or estimated tourist expenditure (anything at this point), that spans about 10 years, is monthly, and has a regional scope within Ireland (Dublin, west coast, east coast, etc.). I've been searching for a while now and have a few datasets, but nothing perfect. Please let me know if you have any tips or even know of a dataset which may help. Thanks!

submitted by /u/MessBig6240

Looking For Prescription Data Of Medicine In Different Countries

The Netherlands publishes the amount of each drug prescribed and dispensed in a given time period (https://www.gipdatabank.nl/). For a small comparison of which drugs are used in which country, I need the same data from other countries (at least the G20 countries).

I've had some rough battles with the NHS site, for example, but can't really find the data organized the same way, by ATC code. Any pointers on where to look?

submitted by /u/Koopabro

Choosing One Financial Institution Over Other Ones

Hi! I would appreciate any help in advance! The question we would like to answer is:

Why do consumers choose one financial institution over another for mortgage loans? Factors to consider include interest rates, fees, reputation, trust, loan terms, customer service, approval speed, product offerings, convenience, recommendations, financial stability, and special offers.

Therefore I need datasets that explicitly capture the consumer's side of choosing (or not choosing) an institution. One I found interesting is the HMDA dataset, which includes a class of applicants who were approved for a loan but did not accept it. It's interesting, but it doesn't say much that's new, and the factors aren't significantly different from those for applicants who accepted the loan or were denied. I was wondering if there are other datasets that capture the consumer's point of view and the factors that influence their decisions? Anything that might expand my perspective, basically. Thanks!

submitted by /u/Responsible-Ice-874

Ecommerce Product Dataset With Image URLs

Hey everyone!

I’ve recently put together a free repository of ecommerce product datasets—it’s publicly available at https://github.com/octaprice/ecommerce-product-dataset.

Currently, there are only two datasets (both from Amazon's bird food category, each with around 1,800 products), which include attributes like product category, price, brand name, reviews, and product image URLs.

The information in the dataset can be especially useful for anyone doing machine learning or data science work: price prediction, product categorization, or image analysis.
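As a quick-look sketch (not the repo's official usage), something like the following loads one of the CSVs and pulls a few product images; the filename and the price/image_url column names are assumptions, so check the actual files first.

```python
# Quick-look sketch for one of the CSVs in the repo. The filename and the
# "price" / "image_url" column names are assumptions - check the actual files.
import pandas as pd
import requests

df = pd.read_csv("amazon_bird_food.csv")
print(df.shape)
print(df.columns.tolist())

# Download the first few product images for inspection
for idx, row in df.head(3).iterrows():
    resp = requests.get(row["image_url"], timeout=10)
    resp.raise_for_status()
    fname = f"product_{idx}.jpg"
    with open(fname, "wb") as f:
        f.write(resp.content)
    print("saved", fname, "price:", row["price"])
```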

The plan is to add more datasets on a regular basis.

I’d love to hear your thoughts on which websites or product categories you’d find interesting for the next releases.

I can pretty much collect data from any site (within reason!), so feel free to drop some ideas. Also, let me know if there are any additional fields/attributes you think would be valuable to include for research or analysis.

Thanks in advance for any feedback, and I look forward to hearing your suggestions!

submitted by /u/LessBadger4273

Help Needed To Build A Database Of Attractions Across India 🌏🇮🇳

Hi everyone,

I'm working on a project to create a comprehensive database of tourist attractions across India—everything from iconic landmarks to hidden gems. My goal is to make travel easier and more personalized for travelers. I won't resell the data, but I do plan to use it in trip-planning software for commercial purposes.

I need columns like location details (city, state), coordinates, and images.

My Challenges:

- Scraping data: I've considered scraping websites, but I'm not sure of the legality or technical challenges.
- Using APIs: Google Maps API is great but expensive for the scale I need. Are there any free or low-cost alternatives?
- Collaborative sources: Is there any open-source or community-driven data for Indian attractions?

I've tried scraping OSM but didn't get useful results; a lot of the data needs extensive verification to be usable.
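One possible angle, in case it helps: querying OSM through the Overpass API rather than scraping pages tends to give cleaner, structured results. A rough sketch is below; the tourism=attraction filter and the single country-wide query are assumptions to tune (OSM also tags attractions as historic=*, leisure=park, and so on, and a nationwide query may need to be split by state).

```python
# Rough sketch: pull attraction nodes for India from OpenStreetMap via the
# Overpass API. Free but rate-limited; cache responses and be gentle.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"
query = """
[out:json][timeout:180];
area["ISO3166-1"="IN"][admin_level=2]->.india;
node["tourism"="attraction"](area.india);
out body;
"""

resp = requests.post(OVERPASS_URL, data={"data": query}, timeout=300)
resp.raise_for_status()
elements = resp.json()["elements"]

rows = [
    {"name": el.get("tags", {}).get("name"), "lat": el["lat"], "lon": el["lon"]}
    for el in elements
]
print(len(rows), "attraction nodes; first few:", rows[:3])
```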

submitted by /u/Ravishkumar2005

Access To Endometriosis Dataset For My Thesis

Hello everyone,

I'm currently working on my bachelor's thesis, which focuses on the non-invasive diagnosis of endometriosis using biomarkers like microRNAs and machine learning. My goal is to reproduce existing studies and analyze their methodologies.

For this, I am looking for datasets from endometriosis patients (e.g., miRNA sequencing data from blood, saliva, or tissue samples) that are either publicly available or can be accessed upon request. Does anyone have experience with this or know where I could find such datasets? I've checked GEO and reached out to the authors of a relevant paper (still waiting for a response).

If anyone has tips on where to find such datasets or has experience with similar projects, I’d be incredibly grateful for your guidance!

Thank you so much in advance!

submitted by /u/Various-Cry-228

NCAA Tournament Dataset – Worth Anything?

I have a clean dataset with the last 20+ years of NCAA tournament games (round, seeds, result, score), along with ~100 traditional and advanced team stats from multiple public sources, as they stood pre-tournament. I've done a lot of feature engineering and can add those metrics in too (e.g., a team's 3-pt % against the opponent's 3-pt defense, both raw and normalized using different strength-of-schedule approaches).
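To illustrate the kind of matchup feature just described, here is a rough sketch with hypothetical column names (team_3p_pct, opp_def_3p_pct, team_sos, opp_sos); the real dataset's headers will differ.

```python
# Illustration of a matchup-differential feature. Column names are hypothetical
# placeholders, not the dataset's actual headers.
import pandas as pd

games = pd.read_csv("tournament_games.csv")  # assumed: one row per team-game

# Raw edge: a team's 3-pt shooting vs. what its opponent's defense allows
games["shooting_edge_raw"] = games["team_3p_pct"] - games["opp_def_3p_pct"]

# Normalized variant: weight the edge by relative strength of schedule
games["shooting_edge_sos"] = games["shooting_edge_raw"] * (
    games["team_sos"] / games["opp_sos"]
)

print(games[["shooting_edge_raw", "shooting_edge_sos"]].describe())
```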

It’s nothing crazy extensive (no player stats, injuries, trends) but it’s cleaner and more comprehensive than anything I’ve found available for free download / scraping.

I put the scripts together a few years ago with non-trivial code effort and manual QC (name formatting etc). It wouldn’t be particularly difficult to reproduce for a decent programmer. I’m sure AI has made that type of process more accessible but it’d still take some time for most.

Having never sold a dataset, is there any value here? I'm not expecting much, but the work is already done.

I've started the process of including regular-season games (stats as of game time) if that would help, but I probably won't finish without understanding the value. Same for game lines / betting info, but only if the dataset is useless without them; they're messier to pull.

submitted by /u/yourfinepettingduck

Long Shot- Sitemaps For Every Website Out There?

Does anyone know of a dataset (free or paid) which contains the sitemaps of all the websites on the web?

Yes, I know that tens of millions of websites update their sitemaps daily. I know that not every website has a sitemap. I know that a decent chunk (10-20% by volume) will be p*rn. I know that this data takes up a lot of space (250-350 TB based on my calculations).

The closest dataset I'm familiar with is Common Crawl, but they capture 10% of the web at best, and they focus more on full pages and less on sitemaps.

I know the odds of this being available are pretty slim, but I wanted to see if anyone has come across a huge sitemap list like this before.

P.S. I have a 1.5 PB homelab and have the means to store all of this data as well as process it. So it might be a non-standard request, but I'm asking for real reasons, not as a hypothetical.
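For a much smaller do-it-yourself version: most sites that publish a sitemap advertise it in robots.txt, so discovery is a one-request job per domain. A minimal sketch, leaving rate limiting, retries, and the domain list to you:

```python
# Discover sitemap URLs from a domain's robots.txt. Politeness handling and
# the domain list are omitted; this only shows the discovery step.
import requests

def sitemaps_from_robots(domain: str) -> list[str]:
    """Return the Sitemap: URLs declared in a domain's robots.txt, if any."""
    try:
        resp = requests.get(f"https://{domain}/robots.txt", timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return []
    return [
        line.split(":", 1)[1].strip()
        for line in resp.text.splitlines()
        if line.lower().startswith("sitemap:")
    ]

for domain in ["wikipedia.org", "python.org"]:
    print(domain, sitemaps_from_robots(domain))
```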

submitted by /u/9302462

🚀 Content Extractor With Vision LLM – Open Source Project

I'm excited to share Content Extractor with Vision LLM, an open-source Python tool that extracts content from documents (PDF, DOCX, PPTX), describes embedded images using Vision Language Models, and saves the results in clean Markdown files.

This is an evolving project, and I’d love your feedback, suggestions, and contributions to make it even better!

✨ Key Features

- Multi-format support: Extract text and images from PDF, DOCX, PPTX.
- Advanced image description: Choose from local models (Ollama's llama3.2-vision) or cloud models (OpenAI GPT-4 Vision).
- Two PDF processing modes:
  - Text + Images: Extract text and embedded images.
  - Page as Image: Preserve complex layouts with high-resolution page images.
- Markdown outputs: Text and image descriptions are neatly formatted.
- CLI interface: Simple command-line interface for specifying input/output folders and file types.
- Modular & extensible: Built with SOLID principles for easy customization.
- Detailed logging: Logs all operations with timestamps.

🛠️ Tech Stack

- Programming: Python 3.12
- Document processing: PyMuPDF, python-docx, python-pptx
- Vision Language Models: Ollama llama3.2-vision, OpenAI GPT-4 Vision

📦 Installation

Clone the repo and install dependencies using Poetry. System dependencies like LibreOffice and poppler are required for processing specific file types.

Detailed setup instructions: GitHub Repo

🚀 How to Use

1. Clone the repo and install dependencies.
2. Start the Ollama server: ollama serve
3. Pull the llama3.2-vision model: ollama pull llama3.2-vision
4. Run the tool: poetry run python main.py --source /path/to/source --output /path/to/output --type pdf
5. Review the results in clean Markdown format, including extracted text and image descriptions.

💡 Why Share?

This is a work in progress, and I’d love your input to:

- Improve features and functionality
- Test with different use cases
- Compare image descriptions from models
- Suggest new ideas or report bugs

📂 Repo & Contribution

GitHub: Content Extractor with Vision LLM

Feel free to open issues, create pull requests, or fork the repo for your own projects.

🤝 Let’s Collaborate!

This tool has a lot of potential, and with your help, it can become a robust library for document content extraction and image analysis. Let me know your thoughts, ideas, or any issues you encounter!

Looking forward to your feedback, contributions, and testing results.

submitted by /u/Electrical-Two9833

Data Hunt: Reports Made To California Child Protective Services By Quarter-Year

Greetings.

I’ve been searching for days, seeking high and low, for a dataset matching what I described in the title.

From what I’ve found, there is a wealth of information for counts pertaining to number of children with 1 or more allegations, but not much for counts and/or totals for allegations themselves.

The best resource seems to be the California Child Welfare Indicators Project. In the report index I linked, you’ll see two reports that I found (at first) to be the most promising. Under the Fundamentals heading, there’s Allegations: Child Maltreatment Allegations – Child Count. It’s close, but because they’re again counting children and not allegations, I can’t use it. The other report, under CWS Rates, is Allegation Rates: Child Maltreatment Allegation Rates. It seems so close, but when I look at the options under Report Output, they list the rates (obviously), the total child population, and children with allegations. Looking at the descriptions for the data, it appears I can’t even infer the totals using the incidence rates, but I may be wrong.

Lastly, the report I was most excited about is found under Process Measures; the one labeled 2B. It's titled "Referrals by Time to Investigation," and I thought that, since every report to CPS requires a response, this was what I was looking for. Alas, this report only totals allegations that are deemed worthy of an in-person investigation.

So, here I am seeking the help of the Dataset community. Does anyone have any recommendations where I might look to find total reports made to CPS? Have I already found it among the reports listed at the CCWIP and just don’t realize it?

Should I reach out to them and just ask for the data?

I appreciate any help the community can provide.

Many thanks.

submitted by /u/Wiredawn

Where Can I Get The Employment Dataset By City Worldwide?

Hi, I am searching for open data with which I can analyze what kinds of jobs are more prevalent in each city worldwide (e.g., more software engineer jobs in London than Paris, more cleaner jobs in Seoul than London, etc.). Does anyone have an idea where I can get this type of data? I found a dataset of ~1.3M LinkedIn job openings on Kaggle, but it seems to contain information only from Canada, the United States, and the United Kingdom.

submitted by /u/No-Search4434

2025 NCAA Basketball API Giveaway – Real-time & Post-game Data

Hey Reddit! 👋

Happy New Year! To kick off 2025, we’re giving away 90 days of free access to our NCAA Basketball API to the first 20 people who sign up by Friday, January 10. This isn’t a sales pitch—there’s no commitment, no credit card required—just an opportunity for those of you who love building, experimenting, and exploring with sports data.

Here’s what you’ll get for all conferences:

- Real-time game stats
- Post-game stats
- Season aggregates

Curious about the API? You can check out the full documentation here: API Documentation.

We know there are tons of creative developers, analysts, and data enthusiasts here on Reddit who can do amazing things with access to this kind of data, and we’d love to see what you come up with. Whether you’re building an app, testing a project, or just curious to explore, this is for you.

If you're interested, join our Discord to sign up. Spots are limited to the first 20, so don't wait too long!

We’re really excited to see how you’ll use this. If you have any questions, feel free to ask in the comments or DM us.

submitted by /u/rollinginsights

Do You Have Any Real-world Datasets For Photovoltaic Systems

Hello everyone… May I ask if anyone has any real-world datasets for photovoltaic systems? I am going to use this for a school research project on the effectiveness of machine-learning-based photovoltaic systems for predictive maintenance. I currently use synthetic data; however, I am not that confident in its validity, and it might be the reason we get cooked in our defense…



Need Help And Opinions Regarding The Synthetic Data We Used In A School Research Study

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    precision_recall_curve,
)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
data = {
    "Temperature (°C)": np.random.uniform(15, 45, 1000),      # Ambient temperature
    "Irradiance (W/m²)": np.random.uniform(100, 1200, 1000),  # Solar irradiance
    "Voltage (V)": np.random.uniform(280, 400, 1000),         # Voltage output
    "Current (A)": np.random.uniform(4, 12, 1000),            # Current output
}

# Create DataFrame
df = pd.DataFrame(data)
df["Power (W)"] = df["Voltage (V)"] * df["Current (A)"]
df["Fault"] = np.where((df["Power (W)"] < 2000) | (df["Voltage (V)"] < 320), 1, 0)  # Fault criteria

# Preprocess data
features = ["Temperature (°C)", "Irradiance (W/m²)", "Voltage (V)", "Current (A)"]
target = "Fault"
X = df[features]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build ANN model
model = Sequential([
    Dense(128, input_dim=X_train_scaled.shape[1], activation="relu"),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),  # Sigmoid for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping
early_stopping = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

# Train ANN model
history = model.fit(
    X_train_scaled,
    y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=1,
    callbacks=[early_stopping],
)

# Evaluate model
y_pred = (model.predict(X_test_scaled) > 0.5).astype("int32")
print("ANN Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot(cmap="Blues")
plt.title("Confusion Matrix (ANN)")
plt.show()

# Precision-recall curve
y_scores = model.predict(X_test_scaled).ravel()
precision, recall, _ = precision_recall_curve(y_test, y_scores)
plt.plot(recall, precision, marker=".", label="ANN")
plt.title("Precision-Recall Curve")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()

# Plot training history
plt.plot(history.history["accuracy"], label="Train Accuracy")
plt.plot(history.history["val_accuracy"], label="Validation Accuracy")
plt.title("Training and Validation Accuracy (ANN)")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```

Does the synthetic data generated in this code, particularly the ranges for temperature, irradiance, voltage, and current, as well as the fault definition criteria, realistically reflect the operational parameters and fault conditions of photovoltaic systems? Could someone with expertise in photovoltaic system analysis validate whether this data and fault classification logic are appropriate and credible for use in a school research project? (Our research studies the effectiveness of machine-learning-based photovoltaic systems for predictive maintenance.)

I tried to use real-world data for this research; however, with limited time and resources, I think synthetic data is the best option.
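One way to make this kind of synthetic data more physically plausible is to derive power from irradiance and temperature instead of sampling voltage and current independently, and to inject faults explicitly. The sketch below does that; the array area, efficiency, temperature coefficient, and NOCT values are assumed, typical numbers rather than measurements from any real system.

```python
# Sketch of a more physically grounded generator: power follows irradiance with
# an efficiency + temperature-derating model, and faults are injected as
# degraded output. All module parameters below are assumed, typical values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

ambient_temp = rng.uniform(15, 45, n)        # °C
irradiance = rng.uniform(100, 1200, n)       # W/m²

ARRAY_AREA = 15.0       # m² of panel area (assumed)
EFFICIENCY_STC = 0.18   # 18% module efficiency at STC (assumed)
TEMP_COEFF = 0.004      # fractional power loss per °C above 25 °C (assumed)
NOCT = 45.0             # nominal operating cell temperature, °C (assumed)

# Standard NOCT approximation for cell temperature
cell_temp = ambient_temp + irradiance * (NOCT - 20.0) / 800.0

healthy_power = (
    irradiance * ARRAY_AREA * EFFICIENCY_STC
    * (1.0 - TEMP_COEFF * (cell_temp - 25.0))
)

# Inject faults: ~15% of samples produce only a fraction of expected power
fault = rng.random(n) < 0.15
degradation = rng.uniform(0.3, 0.7, n)
power = np.where(fault, healthy_power * degradation, healthy_power)
power += rng.normal(0, 50, n)                # measurement noise

df = pd.DataFrame({
    "Temperature (°C)": ambient_temp,
    "Irradiance (W/m²)": irradiance,
    "Power (W)": power,
    "Fault": fault.astype(int),
})
print(df.head())
print(df["Fault"].mean())  # fraction of faulty samples
```

With this setup the fault label reflects a real deviation from what the operating conditions predict, which is closer to how predictive-maintenance labels arise in practice than an arbitrary power/voltage threshold.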

