Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Need Help And Opinions Regarding A School Research Study We Conducted

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, ConfusionMatrixDisplay, precision_recall_curve
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
data = {
    "Temperature (°C)": np.random.uniform(15, 45, 1000),      # Ambient temperature
    "Irradiance (W/m²)": np.random.uniform(100, 1200, 1000),  # Solar irradiance
    "Voltage (V)": np.random.uniform(280, 400, 1000),         # Voltage output
    "Current (A)": np.random.uniform(4, 12, 1000),            # Current output
}

# Create DataFrame
df = pd.DataFrame(data)
df["Power (W)"] = df["Voltage (V)"] * df["Current (A)"]
df["Fault"] = np.where((df["Power (W)"] < 2000) | (df["Voltage (V)"] < 320), 1, 0)  # Fault criteria

# Preprocess data
features = ["Temperature (°C)", "Irradiance (W/m²)", "Voltage (V)", "Current (A)"]
target = "Fault"
X = df[features]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (fit on the training split only)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build ANN model
model = Sequential([
    Dense(128, input_dim=X_train_scaled.shape[1], activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')  # Sigmoid for binary classification
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train ANN model
history = model.fit(
    X_train_scaled, y_train,
    epochs=50, batch_size=32, validation_split=0.2,
    verbose=1, callbacks=[early_stopping]
)

# Evaluate model
y_pred = (model.predict(X_test_scaled) > 0.5).astype("int32")
print("ANN Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot(cmap="Blues")
plt.title("Confusion Matrix (ANN)")
plt.show()

# Precision-Recall Curve
y_scores = model.predict(X_test_scaled).ravel()
precision, recall, _ = precision_recall_curve(y_test, y_scores)
plt.plot(recall, precision, marker='.', label="ANN")
plt.title("Precision-Recall Curve")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()

# Plot training history
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title("Training and Validation Accuracy (ANN)")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

Does the synthetic data generated in this code, particularly the ranges for temperature, irradiance, voltage, and current, as well as the fault definition criteria, realistically reflect the operational parameters and fault conditions of photovoltaic systems? Could someone with expertise in photovoltaic system analysis validate whether this data and fault classification logic are appropriate and credible for use in a school research project? (Our research studies the effectiveness of machine learning-based photovoltaic systems for predictive maintenance.)

I tried to use real-world data for this research; however, with limited time and resources, I think synthetic data would be the best option.


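
One thing that can be checked without photovoltaic expertise is what the fault rule in the post above implies statistically. The sketch below uses only the ranges and the rule from the posted code (voltage uniform on 280 to 400 V, current uniform on 4 to 12 A, fault when power < 2000 W or voltage < 320 V) and compares a simulated fault rate with a closed-form one. The function names are illustrative, not part of the original script.

```python
import math
import random


def is_fault(voltage: float, current: float) -> bool:
    """Fault rule copied from the post: low power or low voltage."""
    return voltage * current < 2000 or voltage < 320


def simulated_fault_rate(n: int = 100_000, seed: int = 42) -> float:
    """Monte Carlo estimate of the share of samples labelled as faults."""
    rng = random.Random(seed)
    faults = 0
    for _ in range(n):
        voltage = rng.uniform(280, 400)
        current = rng.uniform(4, 12)
        faults += is_fault(voltage, current)
    return faults / n


def analytic_fault_rate() -> float:
    """Closed-form fault rate for the same uniform ranges."""
    # Every sample with V < 320 is a fault.
    low_voltage = (320 - 280) / (400 - 280)
    # For V in [320, 400] a fault needs I < 2000 / V, which lies inside [4, 12],
    # so P(fault | V) = (2000 / V - 4) / 8. Integrate that over V.
    integral = 2000 * math.log(400 / 320) - 4 * (400 - 320)
    low_power = integral / (8 * (400 - 280))
    return low_voltage + low_power


print(f"simulated: {simulated_fault_rate():.3f}")
print(f"analytic:  {analytic_fault_rate():.3f}")
```

Two observations follow directly from the posted code rather than from domain knowledge: the label is a deterministic function of voltage and current alone, and temperature and irradiance are drawn independently of everything else, so they carry no information about the label. A classifier trained on this data is therefore learning the labelling rule back, which is worth stating when reporting its accuracy.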

Does Anyone Have Real-World Datasets For Photovoltaic Systems?

May I ask if anyone has any real-world datasets about photovoltaic systems? I am going to use them for a school research project about the effectiveness of machine learning-based photovoltaic systems for predictive maintenance. I currently use synthetic data, but I am not that confident in its validity. Any recommendations, suggestions, and opinions are highly encouraged.

submitted by /u/Interesting-Peak7420

Request For Before And After Database

I’m on the lookout for a dataset that contains individual-level data with measurements taken both before and after an event, intervention, or change. It doesn’t have to be from a specific field; I’m open to anything in areas like healthcare, economics, education, or social studies.

Ideally, the dataset would include a variety of individual characteristics, such as age, income, education, or health status, along with outcome variables measured at both time points so I can analyze changes over time.

It would be great if the dataset is publicly available or easy to access, and it should preferably have enough data points to support statistical analysis. If you know of any databases, repositories, or specific studies that match this description, I’d really appreciate it if you could share them or point me in the right direction.

Thanks so much in advance for your help! 😊

submitted by /u/New_Campaign_6516

Need A High Quality / High Granularity Data On Wealth (not Income!) Distribution In The United States, Over A Period Of Time If Possible But Present-day Only Would Be Appreciated As Well.

I’m looking specifically for granularity in terms of wealth percentage. There are tons of datasets that go something like top .1%/1%/10%/50%/90% or so, but I’d really need something that goes AT LEAST by individual percent (as in top 1%, 2%, 3%, 4%, all the way down to the bottom 99%), if not fractions of a percent as well. Or any dataset I’d be able to calculate those statistics from.

Thank you in advance! Any leads towards such a data set would be greatly appreciated!

submitted by /u/Showy_Boneyard
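
For the last option in the request above (a dataset the statistics can be calculated from), here is a minimal sketch of how per-percentile wealth shares could be derived from household-level records. The input format is an assumption: each record is a `(wealth, weight)` pair, where `weight` is the number of households the record represents. The function name is hypothetical.

```python
def wealth_share_by_percentile(records, bins=100):
    """Share of total wealth held by each population slice, richest first.

    records: iterable of (wealth, weight) pairs (assumed input format).
    Returns a list of length `bins`; index 0 is the top 1/bins of households.
    """
    rows = sorted(records, key=lambda r: r[0], reverse=True)
    total_weight = sum(weight for _, weight in rows)
    total_wealth = sum(wealth * weight for wealth, weight in rows)
    bin_width = total_weight / bins

    shares = [0.0] * bins
    position = 0.0  # cumulative weight already assigned to bins
    for wealth, weight in rows:
        remaining = weight
        while remaining > 1e-12:
            index = min(int(position / bin_width + 1e-9), bins - 1)
            if index < bins - 1:
                # Split a record that straddles a bin boundary.
                room = (index + 1) * bin_width - position
                take = min(remaining, room)
            else:
                take = remaining
            shares[index] += wealth * take
            position += take
            remaining -= take
    return [share / total_wealth for share in shares]


# Toy data: 100 equally weighted households with wealth 1..100.
toy = [(float(w), 1.0) for w in range(1, 101)]
by_percent = wealth_share_by_percentile(toy, bins=100)
by_decile = wealth_share_by_percentile(toy, bins=10)
print(f"top 1% share:  {by_percent[0]:.4f}")
print(f"top 10% share: {by_decile[0]:.4f}")
```

Passing a larger `bins` value (for example 1000) gives tenths of a percent, provided the underlying data has enough records at the top for those slices to be meaningful.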

How Do You Do A Sample Size Calculation?

How do you calculate sample size based on odds ratios and confidence intervals?

In SPSS, you can calculate sample size based on the test you are using. I am using one-way ANOVA, which asks for a standard deviation and a mean, but all the previous articles report odds ratios and CIs. So how do I calculate the sample size?

submitted by /u/AV0902
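
One common way to get from an odds ratio to a sample size, sketched here under stated assumptions: the outcome is binary, two equally sized groups are compared, and the baseline proportion `p1` in the reference group is known or can be estimated from the earlier articles. The odds ratio is converted into a second proportion, and the standard normal-approximation formula for comparing two proportions is applied. This is not the SPSS procedure from the post and it does not cover one-way ANOVA; the function names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def p2_from_odds_ratio(p1: float, odds_ratio: float) -> float:
    """Proportion in the comparison group implied by p1 and the odds ratio."""
    odds2 = odds_ratio * p1 / (1 - p1)
    return odds2 / (1 + odds2)


def n_per_group(p1: float, odds_ratio: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for a two-sided two-proportion test."""
    p2 = p2_from_odds_ratio(p1, odds_ratio)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Example inputs (assumed, not from the post): 20% baseline, odds ratio of 2.
print(n_per_group(p1=0.20, odds_ratio=2.0))
```

The result is sensitive to `p1`, so it is worth running the calculation for a few plausible baseline values rather than a single one.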

Recipes / Food / Dish DataSet With Name, Ingredients, Recipe And Precise Region Of The Dish

Hello,

I’ve been looking for a couple of hours and can’t find a dataset that provides 5k+ dishes/recipes with the name, the ingredients, the description, and the precise region of origin, e.g. Pizza Margherita would be Napoli.

I’m not sure I’ve found all the dataset websites yet. If you have any info or advice on finding something similar, or a way to scrape a website that includes this information, I’m up for it.

Thanks

submitted by /u/MambaRealMVP

How To Combine A Time Series Dataset And An Image Dataset

I have two datasets that relate to each other. The first dataset consists of images in one column, plus the timestamp and the voltage level at that time. The second dataset contains the weather forecast, solar irradiance, and other features (10+). The second dataset has one record for every 30 minutes of each day for 3 years, while the images are pictures of the sky taken every minute of the day. I need help finding the right way to combine these datasets into one, so that I can later train a machine/deep learning model on it whose output is a forecast of the voltage level based on the features.

I have never dealt with time series datasets before, so I am asking about the correct way to do this. Any recommendations are appreciated.

submitted by /u/throw55500m
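
One simple way to line up the two sources described above is to round each per-minute image timestamp down to its 30-minute slot and attach the weather record for that slot. The sketch below does this with the standard library only. The column names (`image_path`, `voltage`, `irradiance`, `cloud_cover`) and the sample values are made up for illustration, and it assumes the weather timestamps fall exactly on the hour and half hour.

```python
from datetime import datetime


def floor_to_half_hour(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 30-minute slot."""
    return ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)


def join_images_to_weather(image_rows, weather_rows):
    """Attach the matching 30-minute weather record to each image row."""
    weather_by_slot = {row["timestamp"]: row for row in weather_rows}
    joined = []
    for image in image_rows:
        slot = floor_to_half_hour(image["timestamp"])
        weather = weather_by_slot.get(slot)
        if weather is None:
            continue  # no weather record for this slot; skip instead of guessing
        features = {k: v for k, v in weather.items() if k != "timestamp"}
        joined.append({**image, "slot": slot, **features})
    return joined


# Hypothetical sample rows.
images = [
    {"timestamp": datetime(2024, 6, 1, 10, 0), "image_path": "sky_1000.jpg", "voltage": 351.2},
    {"timestamp": datetime(2024, 6, 1, 10, 17), "image_path": "sky_1017.jpg", "voltage": 349.8},
    {"timestamp": datetime(2024, 6, 1, 10, 30), "image_path": "sky_1030.jpg", "voltage": 355.0},
    {"timestamp": datetime(2024, 6, 1, 10, 59), "image_path": "sky_1059.jpg", "voltage": 352.4},
    {"timestamp": datetime(2024, 6, 1, 11, 5), "image_path": "sky_1105.jpg", "voltage": 347.1},
]
weather = [
    {"timestamp": datetime(2024, 6, 1, 10, 0), "irradiance": 480.0, "cloud_cover": 0.2},
    {"timestamp": datetime(2024, 6, 1, 10, 30), "irradiance": 510.0, "cloud_cover": 0.1},
]

rows = join_images_to_weather(images, weather)
for row in rows:
    print(row["timestamp"], row["slot"], row["irradiance"], row["voltage"])
```

If the data is already in pandas, `pandas.merge_asof` with `direction="backward"` performs a similar nearest-earlier-key join. Whichever join is used, splitting training and test data by time rather than at random avoids near-duplicate neighbouring minutes ending up on both sides of the split.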

Acquiring “Real World” Synthetic Data Sets Out Of Stripe, Hubspot, Salesforce, Shopify, Etc.

Hi all:

We’re building an exploratory data tool, and we’re hoping to simulate a data warehouse that has data from common tools, like Stripe and Hubspot. The data would be “fake” but simulate the real world.

Does anyone have any clever ideas on how to acquire data sets which are “real world” like this?

The closest thing I can think of is someone using a data synthesizer like gretel.ai or a competitor on a real world data set and being willing to share it.

Thanks,

submitted by /u/thelionofverdun

Advice Needed: Best Way To Access Real Estate Data For Free Tool Development

Hi,

I’m working on developing a free tool to help homeowners and buyers better navigate the real estate market. To make this tool effective, I need access to the following data:

Dates homes were listed and sold

Home features (e.g., square footage, lot size, number of bedrooms/bathrooms, etc.)

Information about homes currently on the market

I initially hoped to use the Zillow API, but unfortunately, they’re not granting access. Are there any other free or low-cost data sources or APIs that you’d recommend for accessing this type of information?

Your insights and suggestions would mean a lot. Thanks in advance for your help!

submitted by /u/Ykohn

The Biggest Free & Open Football Results & Stats Dataset

Hello!

I want to point out a dataset I created that includes tens of thousands of historical football (soccer) matches and can be used to better understand the game or to train machine learning models. I am putting it up for free as an open resource. As of now, it is the biggest openly and freely available football match results, stats, and odds dataset in the world, with most of the data derived from Football-Data.co.uk:

https://github.com/xgabora/Club-Football-Match-Data-2000-2025

submitted by /u/AdkoSokdA

Swedish Conversation/dialog Datasets

I’ve been looking for datasets consisting of chats, conversations, or dialogues in Swedish, but it has been tough finding Swedish datasets. The closest solutions I have come up with are:

Building a program to record and transcribe conversations from my daily life at home.

Scraping Reddit comments or Discord chats.

Downloading subtitles from movies.

The issue with movie subtitles is that, without the context of the movie, the lines often feel disconnected or lack a proper flow. Anyone have better ideas or resources for Swedish conversational datasets?

I am trying to build an intention/text classification model. Do you have any ideas what I could/should do or where to search?

For those wondering, I am trying to build a simple Swedish NLP model as a hobby project.

Happy New Year!!

submitted by /u/Wallido17

NBA Historical Dataset: Box Scores, Player Stats, And Game Data (1949–Present) 🚀

Hi everyone,

I’m excited to share a dataset I’ve been working on for a while, now available for free on Kaggle! This comprehensive dataset includes detailed historical NBA data, meticulously collected and updated daily. Here’s what it offers:

Player Box Scores: Statistics for every player in every game since 1949.

Team Box Scores: Complete team performance stats for every game.

Game Details: Information like home/away teams, winners, and even attendance and arena data (where available).

Player Biographies: Heights, weights, and positions for all players in NBA history.

Team Histories: Franchise movements, name changes, and more.

Current Schedule: Up-to-date game times and locations for the 2024-2025 season.

I was inspired by Wyatt Walsh’s basketball dataset, which focuses on play-by-play data, but I wanted to create something focused on player-level box scores. This makes it perfect for:

Fantasy Basketball Enthusiasts: Analyze player trends and performance for better drafting and team-building strategies.

Sports Analysts: Gain insights into long-term player or team trends.

Data Scientists & ML Enthusiasts: Use it for machine learning models, predictions, and visualizations.

Casual NBA Fans: Dive deep into the stats of your favorite players and teams.

The dataset is packaged as a .sql file for database users, and .csv files for ease of access. It’s updated daily with the latest game results to keep everything current.

If you’re interested, check it out here: https://www.kaggle.com/datasets/eoinamoore/historical-nba-data-and-player-box-scores/

I’d love to hear your feedback, suggestions, or see any cool insights you derive from it! Let me know what you think, and feel free to share this with anyone who might find it useful.

Cheers.

submitted by /u/Low-Assistance-325

Normalized Database Dataset For Data Modeling

I’m interested in doing some data modeling on normalized database datasets. E-commerce, financial, really anything would probably be fine. I would like some sort of referential integrity so that foreign keys match up to primary keys.

Looking for recommendations.

I’ve already played with TPCH, looking for other suggestions.

submitted by /u/drunk_goat
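
Whatever dataset ends up being suggested for the request above, the referential-integrity requirement can be checked after loading it. The sketch below uses Python's built-in `sqlite3` module with two hypothetical tables (`customers` and `orders`, invented for illustration). It turns on foreign-key enforcement and counts child rows whose key has no matching parent.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory schema with one foreign-key relationship."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    conn.executescript(
        """
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            total_cents INTEGER NOT NULL
        );
        """
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 2500), (11, 1, 990), (12, 2, 4200)],
    )
    conn.commit()
    return conn


def orphan_count(conn: sqlite3.Connection) -> int:
    """Count orders whose customer_id has no matching customers row."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM orders AS o
        LEFT JOIN customers AS c ON c.customer_id = o.customer_id
        WHERE c.customer_id IS NULL
        """
    ).fetchone()
    return row[0]


conn = build_demo_db()
print("orphaned orders:", orphan_count(conn))
```

The same `LEFT JOIN ... IS NULL` query works on CSV exports loaded into tables without declared constraints, which is how many public datasets are shipped.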

Seeking Dataset: Private Company Valuations & Exit Multiples (Deal-Level & Industry Benchmarks)

Hi everyone,

I’m on the hunt for datasets or sources that offer insights into private company valuations, particularly exit multiples and benchmark data.

Here’s what I’m ideally looking for:

Exit multiples (e.g., revenue multiples, EBITDA multiples) on a deal-by-deal basis as well as industry-wide benchmarks.

Data on geography-specific valuation metrics or benchmarks.

Industry breakdowns to identify trends in specific sectors.

Datasets or reports that cover private equity exits or M&A activity trends.

If you’re aware of any resources that provide a solid level of granularity, I’d be incredibly grateful for the help!

So far, I’ve explored platforms like PitchBook and CB Insights, but I’m curious if anyone knows of more detailed alternatives or supplementary datasets.

Likewise, if there are any public datasets, or even specific reports (e.g., whitepapers, academic studies, or proprietary research) that can provide similar insights, please send them my way.

Thank you in advance for any suggestions or pointers!

submitted by /u/Global-Departure3046

How To Generate Text Dataset Using LLama 3.1? [Synthetic]

So I am working on my semester mini-project. It’s titled “Indianism Detection in Texts Using Machine Learning” (yeah, I just randomly made it up during idea submissions). Now the problem is, there’s no such dataset for this in the entire world. To counter this, I came up with a pipeline to convert a normal (correct) English phrase into English with Indianisms using my local LLama 3.1 and then save both the correct and converted sentences into a dataset with labels, respectively.

I also created a simple pipeline for it (a kind of constitutional AI) but can’t seem to get any good responses. Could anyone suggest something better? (I’m 6 days away from the project submission deadline.)

I explained the current pipeline in this GitHub repo’s README. Check it out:
https://github.com/iamDyeus/Synthetica

submitted by /u/dyeusyt

I’m Working On A Tool That Allows Anyone To Create Any Dataset They Want With Just Titles

I work full-time at a startup where I collect structured data with LLMs, and wanted to create a tool that does this for everyone. The idea is to eventually create a luxury system that can create any dataset you want with unique data points, no matter how large, and hallucination free. If you’re interested in a tool like this, check out the website I just made to collect signups.

batchdata.ai

submitted by /u/D4isyy

Looking For Annual Datasets Of Any Kind For African Cities

Hi guys,

I am writing a paper on the changes in vulnerability of African cities, and I’ve had a problem finding data. I am looking for annual indicators of any kind (going back at least 30 years), although economic or environmental ones are needed most. While it is not difficult to find such data for African countries, finding it for African cities is borderline impossible. The only resource I found was Global Data Lab, which is kind of the perfect example of what I am looking for:

example

Again, any data in this form is appreciated though I’m aware how hard it is to find.

submitted by /u/Used-Ad1876

Our 3D Traffic Light And Sign Dataset Is Available On Kaggle

If you have a lot of free time during the holiday season and want to play with 3D traffic light and sign detection, our new Kaggle dataset is what you need!

The dataset consists of accurate and temporally consistent 3D bounding box annotations for traffic lights and signs, effective up to a range of 200 meters.

https://www.kaggle.com/datasets/tamasmatuszka/aimotive-3d-traffic-light-and-sign-dataset

submitted by /u/MatuszkaT

Does Anyone Know Where To Find A Dataset With Website Traffic Data?

Hi everyone,

I’m looking for some data to practice analyzing website performance. Specifically, I’d like information on metrics like time spent on page, number of pages viewed, and similar stats. My goal is to do some basic analysis—nothing too advanced.

Ideally, I’d love to work with e-commerce website data, but if that’s not available, data from any type of website would be great!

Does anyone know where I can find datasets like this?

submitted by /u/Pedro17f

🚗 Open-Source Car Dataset For Price Prediction! 📊

Hi everyone! 👋

We’re excited to share a dataset we’ve been working on that could be helpful for anyone interested in exploring machine learning and data analysis.

🔍 Why Use This Dataset?

Perfect for beginner-friendly ML projects.

Ideal for experimenting with algorithms like linear regression, decision trees, or neural networks.

Great for data visualization to identify trends in car pricing.

🚀 How to Get the Dataset

The dataset is hosted on https://www.kaggle.com/datasets/qubdidata/auto-market-dataset/data.

🛠️ Example Use Cases

Building a car price prediction model.

Analyzing the relationship between features like mileage and price.

Comparing the performance of ML models on this dataset.

🤝 Community Collaboration

This is an open-source project, so feel free to:

Contribute additional data points or clean the dataset.

Share your analysis or models built using the data.

Provide feedback to improve the dataset.

Let’s make this a valuable resource for the community! 🚗✨

Looking forward to seeing what you create. If you have any questions or suggestions, drop them in the comments below. 👇

submitted by /u/Qubdi

I’ve Collected A Dataset Of 1M+ App Store And Play Store Entries – Anyone Interested?

Hey everyone,

For my personal research, I’ve compiled a dataset containing over a million entries from both the App Store and Play Store. It includes details about apps, and I thought it might be useful for others working in related fields like app development, market analysis, or tech trends.

If anyone here is interested in using it for your own research or projects, let me know! Happy to discuss the details.

Cheers!

submitted by /u/26th_Official