Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for datasets. Not that it isn't interesting; I'm interested. Maybe they know where the chicks are. But what do they need it all for? World domination?

Light Pollution Dataset For Data Visualization

I would like to obtain a usable dataset on light pollution: tracking the increase in brightness over time in United States cities. I have not been able to locate a suitable one. There are lots of maps and visualizations, but not a dataset I can work with myself in Python and R. Any recommendations or leads are appreciated. Thanks!

submitted by /u/SupremoSpider
[link] [comments]

Need Ideas For Data Science School Project

My friend and I are looking for a fun dataset to use for our end-of-year project. The goal is to build a random forest and then use it to make predictions about unseen instances.

We aren’t entirely sure where to look for data sets or what we want to do, so all recommendations are welcome! Thanks in advance!
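Whatever dataset you land on, the pipeline itself is short. A minimal sketch with scikit-learn, using the built-in iris dataset purely as a stand-in until you pick a real one:

```python
# Train a random forest, then predict held-out ("unseen") instances.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # swap in your own dataset here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)  # predictions on instances the model never saw
print(f"held-out accuracy: {accuracy_score(y_test, preds):.2f}")
```

The `train_test_split` is the part that matters for your project goal: it carves out the "unseen instances" before training so the accuracy number is honest.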

submitted by /u/DeltaShadow4
[link] [comments]

Need Help Opening A Massive .dbo (45GB) — Any Advice?

Hey everyone! I’ve got this gigantic file, ePCR.dbo.MedicalRecord, sitting at a whopping 45.4 GB, and I’m stumped on how to open it. 😅 I tried using DBeaver, but I keep hitting an OutOfMemoryError, even after bumping up the memory settings. It seems like it’s way too big for DBeaver to handle.

Does anyone have any experience with these kinds of files or know any tricks for working with huge .dbo files? Ideally, I’d like to export the data to a CSV so I can actually dig into it, but I’m open to any advice or tool suggestions. I’m not even 100% sure what program originally created this file, so I’m working with limited info here.
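Since the originating program is unknown, one memory-safe first step is to sniff the file's magic bytes: read only the first few bytes and compare them against known signatures. A sketch below, where the magic-number table covers just a few common container formats and a tiny sample file stands in for the real 45 GB one:

```python
import os
import tempfile

# A few common magic numbers; the real file may match none of these.
MAGICS = {
    b"SQLite format 3\x00": "SQLite database",
    b"PK\x03\x04": "ZIP archive",
    b"\x1f\x8b": "gzip-compressed data",
    b"\xd0\xcf\x11\xe0": "Microsoft compound file",
}

def sniff(path):
    """Read only the first 16 bytes -- safe even for a 45 GB file."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, name in MAGICS.items():
        if header.startswith(magic):
            return name
    return f"unknown (first bytes: {header.hex()})"

# Demo on a tiny sample file; a real run would point at ePCR.dbo.MedicalRecord.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"SQLite format 3\x00" + b"\x00" * 16)
demo = sniff(tmp.name)
os.remove(tmp.name)
print(demo)  # -> SQLite database
```

If the hex dump doesn't match anything, pasting the first bytes into a search engine often identifies the format, and from there a streaming export to CSV becomes a tooling question rather than a memory one.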

Image: File Properties

Any help would be awesome — thanks in advance! 🙏

submitted by /u/alb53
[link] [comments]

PhysioNet Account Registration And Other Sources Of EHR Dataset

Hi, I’m developing a project in which I need electronic health record (EHR) data to identify high-risk patients for early intervention.

I learned about MIMIC-IV and am trying to create an account on PhysioNet, but I can't proceed: no account creation/activation email ever arrives, even in the spam folder. Has anyone had the same issue? Any way to resolve it?

Also, are there any other sources of EHR datasets around? Preferably ones that include lab data and patient clinical notes to assist with early prediction, among other things.

Any help and suggestions are much appreciated.

submitted by /u/Radiant_Blue_Eyes
[link] [comments]

How To Avoid Your LLM Leaking Sensitive Data

Hello, dataset community! I wanted to share a project my team has been working on — access control for RAG (a native capability of our authorization solution). I thought it would make sense to share it here and get your feedback.

Most architectures centralize data, making it hard to segregate what specific AI models can access. Loading corporate data into a central vector store and using it alongside an LLM effectively gives anyone interacting with the AI agent root access to the entire dataset. That can lead to privacy violations and compliance issues.

Here’s what Cerbos does (our permission-aware data filtering):

When a user asks an AI chatbot a question, our solution, Cerbos, enforces existing permission policies to check that the user is allowed to invoke the agent. Before retrieval, Cerbos creates a query plan that defines which conditions must be applied when fetching data, so that only records the user can access, based on their role, department, region, or other attributes, are returned. Cerbos then provides an authorization filter that limits what is fetched from your vector database or other data stores. Only the allowed information reaches the LLM to generate a response, keeping it relevant and fully compliant with the user's permissions.
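The flow above can be sketched generically. This is not the Cerbos API; the document store, attributes, and filter logic below are all made up for illustration of the filter-before-retrieval idea:

```python
# Permission-aware retrieval sketch: derive a filter from user attributes,
# apply it BEFORE results reach the LLM, never after.
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    department: str
    region: str

CORPUS = [
    Doc("Q3 EU payroll summary", "finance", "eu"),
    Doc("US sales playbook", "sales", "us"),
    Doc("EU sales pipeline notes", "sales", "eu"),
]

def query_plan(user):
    # In a real deployment these conditions would come from the policy
    # engine's query plan; they are hard-coded here.
    return lambda d: (d.department == user["department"]
                      and d.region == user["region"])

def retrieve(user, docs):
    allowed = filter(query_plan(user), docs)
    return [d.text for d in allowed]  # only these snippets go to the LLM

print(retrieve({"department": "sales", "region": "eu"}, CORPUS))
# -> ['EU sales pipeline notes']
```

The key property is that the unauthorized documents are never fetched, so there is nothing for the LLM to leak.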

PS. You could use our open source authorization solution, Cerbos PDP, to see this use case in action. And here’s our documentation.

Would love to get your thoughts and feedback on this, if you have a moment.

submitted by /u/diggVSredditt
[link] [comments]

Ticker-Linked Finance Datasets (HuggingFace)

GitHub Repository

News Sentiment: Ticker-matched and theme-matched news sentiment datasets.
Price Breakout: Daily predictions for price breakouts of U.S. equities.
Insider Flow Prediction: Features insider trading metrics for machine learning models.
Institutional Trading: Insights into institutional investments and strategies.
Lobbying Data: Ticker-matched corporate lobbying data.
Short Selling: Short-selling datasets for risk analysis.
Wikipedia Views: Daily views and trends of large firms on Wikipedia.
Pharma Clinical Trials: Clinical trial data with success predictions.
Factor Signals: Traditional and alternative financial factors for modeling.
Financial Ratios: 80+ ratios from financial statements and market data.
Government Contracts: Data on contracts awarded to publicly traded companies.
Corporate Risks: Bankruptcy predictions for U.S. publicly traded stocks.
Global Risks: Daily updates on global risk perceptions.
CFPB Complaints: Consumer financial complaints data linked to tickers.
Risk Indicators: Corporate risk scores derived from events.
Traffic Agencies: Government website traffic data.
Earnings Surprise: Earnings announcements and estimates leading up to announcements.
Bankruptcy: Predictions for Chapter 7 and Chapter 11 bankruptcies in U.S. stocks.

We just launched an open investment data initiative. For academic users, these datasets are free to download from Hugging Face.

All of our datasets will be progressively made available for free at a 6-month lag for all research purposes.

Sov.ai plans to have 100+ investment datasets by the end of 2026 as part of our standard $285 plan. That means, for example, a ticker-linked patent dataset that would otherwise cost $6,000 per month is delivered for the equivalent of about $6 a month.

submitted by /u/OppositeMidnight
[link] [comments]

Annotated Dataset For Explaining The Reason In AI Vs Real Image Detection

I am currently working on a problem statement in which I need to classify images as real or AI-generated and then give an explanation for the classification. The first part is quite easy; for the second part I found some research papers, but none of them link to an annotated dataset for fine-tuning a model. Can anyone help me find datasets with good annotations for this purpose?

SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model (they mention a dataset on page 4 but didn’t give any links)

submitted by /u/Background-Trainer37
[link] [comments]

Drone And Fighter Dataset For Yolov5 Model

I am trying to build an AI model using YOLOv5n that will detect drones and fighter aircraft, but I can't seem to find a good dataset for both classes, or even for one of them. I've searched in a lot of places, but every dataset I've tried gives low accuracy, and my target accuracy is 95%. Does anyone know of such a dataset, or has anyone worked on a similar project? 🥹
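For what it's worth, once you do find (or merge) datasets for the two classes, the YOLOv5 dataset config itself is minimal. A placeholder sketch, with made-up paths, following the `data.yaml` layout YOLOv5 expects:

```yaml
# data.yaml -- hypothetical two-class dataset layout, paths are placeholders
train: datasets/aircraft/images/train
val: datasets/aircraft/images/val

nc: 2                        # number of classes
names: ["drone", "fighter"]  # class 0, class 1
```

Each image also needs a matching `.txt` label file with one `class x_center y_center width height` line per object, all normalized to 0-1. Merging a drone-only dataset with a fighter-only dataset mostly means remapping class indices in those label files.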

submitted by /u/donia_00
[link] [comments]

[Dataset Request] What’s The Best Way To Collect Political News Articles From Several Sources?

Hey folks! I’m a software engineer working on a data science personal project and could use some help collecting my data.

I want to collect a database of political news articles, specifically related to US politics, from several sources (e.g. NPR, CNN, AP News, Reuters), ideally from the past 30 days or so for my initial POC.

I’ve done some research on several news APIs; a lot of them don’t categorize articles, and many offer only search or today’s headlines rather than a stream of all articles from a particular source. My ideal API would return all political articles from a given set of news sources.
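One route that sidesteps the API limitations: many outlets publish per-section RSS feeds (e.g. a politics feed), which you can poll and accumulate over your 30-day window. A stdlib-only sketch of the parsing step, where the XML string stands in for a fetched feed (the feed URL and fields would depend on the outlet):

```python
import xml.etree.ElementTree as ET

# Stand-in for the body returned by fetching an outlet's politics RSS feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>Senate passes budget bill</title>
        <link>https://example.com/a</link>
        <pubDate>Mon, 06 Jan 2025 12:00:00 GMT</pubDate></item>
  <item><title>Committee hearing scheduled</title>
        <link>https://example.com/b</link>
        <pubDate>Tue, 07 Jan 2025 09:00:00 GMT</pubDate></item>
</channel></rss>"""

def parse_items(feed_xml):
    """Extract title/link/date from each RSS <item>."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]

articles = parse_items(SAMPLE_FEED)
print(len(articles))  # -> 2
```

Polling each source's feed daily and deduplicating on the link field builds the "stream of all articles per source" that the commercial APIs don't offer, though RSS only reaches back a short window, so you'd accumulate the 30 days going forward.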

I wanted to reach out to see if there’s any existing datasets that I could use, or if there’s any other advice that folks had. Thank you!

submitted by /u/The-FrozenHearth
[link] [comments]

Looking For Facebook Ads Performance Data

Hey Datasets,

I was wondering if anyone has an idea where I could find a medium to large-sized dataset (10,000 – 100,000 ads) on Facebook ads performance.

I’m looking for data with details like:

start date, end date, category, campaign objective, used budget, reach, impressions, clicks, target country, target audience age, target audience gender, target audience interest

I know there’s the Facebook ads API, but it doesn’t allow access to this data unless the ads are your own.

Any help or suggestions would be appreciated. Thanks!

submitted by /u/Equivalent-Bear-4329
[link] [comments]

Vibration Signals W/ Tachometer Datasets?

Hey everyone. I am a mech engineer student currently doing some work on order tracking of vibration signals for predictive maintenance of low RPM machines. To optimize my order tracking algorithm, I’m in dire need of a dataset that consists of:

vibration signals (displacement, velocity or acceleration) of bearings, gears or other cyclostationary elements

the tachometer signal of a rotating shaft, either stationary or non-stationary conditions are fine

the machine in question spins at low RPMs, preferably <120 RPM

The last point is not obligatory; as long as a dataset has the tacho signals, it'll help. If you know of anything, I'd deeply appreciate it!
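In case it helps frame the request: the core of computed order tracking is using the tacho pulses to map time to shaft angle, then resampling the vibration signal onto a uniform angle grid so orders stay aligned even when RPM drifts. A sketch on synthetic data (all signals below are fabricated; a real dataset would supply `vib` and the pulse times):

```python
import numpy as np

fs = 1000.0                        # sample rate, Hz
t = np.arange(0, 10, 1 / fs)       # 10 s record

# Synthetic shaft speeding up from 1 Hz to 2 Hz (60 -> 120 RPM).
inst_freq = 1.0 + 0.1 * t
angle = 2 * np.pi * np.cumsum(inst_freq) / fs   # shaft angle vs time, rad

# Vibration locked to the 3rd shaft order, plus noise.
rng = np.random.default_rng(0)
vib = np.sin(3 * angle) + 0.1 * rng.standard_normal(t.size)

# Tacho: one pulse per revolution -> (time, angle) pairs at multiples of 2*pi.
rev = np.arange(0, angle[-1], 2 * np.pi)
pulse_times = np.interp(rev, angle, t)

# Angular resampling: estimate angle from the tacho, then interpolate the
# vibration signal onto a uniform angle grid (128 samples per revolution).
angle_from_tacho = np.interp(t, pulse_times, rev)
uniform_angle = np.arange(0, rev[-1], 2 * np.pi / 128)
vib_angular = np.interp(uniform_angle, angle_from_tacho, vib)

print(vib_angular.shape)
```

At <120 RPM the tacho resolution matters a lot, which is exactly why datasets without the tacho channel are of limited use here.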

submitted by /u/_smoothendoplasmic
[link] [comments]

Invoice Dataset With Varying Template

Could anyone in the group please guide me on how, and where, to find a dataset of invoices in different styles from different organizations, with each organization generating a different kind of invoice? All the invoices need to be in PDF format.

submitted by /u/HotSignature492
[link] [comments]

[self-promotion] A Tool For Finding & Using Open Data

Recently I built a dataset of hundreds of millions of tables, crawled from the Internet and open data providers, to train an AI tabular foundation model. Searching through the datasets is super difficult, b/c off-the-shelf tech just doesn’t exist for searching through messy tables at that scale.

So I’ve been working on this side project, Gini. It has subsets of FRED and data.gov; I’m trying to keep the data manageably small so I can iterate faster, while still being interesting. I picked a random time slice from data.gov, so there’s some bias towards Pennsylvania and Virginia. But if it looks worthwhile, I can easily backfill a lot more datasets.

Currently it does a table-level hybrid search, and each result has customizable visualizations of the dataset (this is hit-or-miss, it’s just a proof-of-concept).

I’ve also built column-level vector indexes with some custom embedding models I’ve made. They're not surfaced in the UI yet; the UX is difficult. But they let me rank results by “joinability”, and I’ll add that to the UI this week. Then you could start from one table (your own or a dataset you found via search) and find tables to join with it. This could be “enrichment” data, joining together different years of the same dataset, etc.
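A toy version of that joinability ranking, with a crude character-trigram vector standing in for the learned column embeddings (the real system's embeddings and scoring are its own; this just illustrates the ranking idea):

```python
import math
from collections import Counter

def embed(values):
    """Bag of character trigrams over a column's values."""
    grams = Counter()
    for v in values:
        s = f"^{v}$"
        grams.update(s[i:i + 3] for i in range(len(s) - 2))
    return grams

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

my_column = ["AAPL", "MSFT", "GOOG"]
candidates = {
    "tickers": ["AAPL", "TSLA", "MSFT"],
    "zipcodes": ["19104", "22030", "15213"],
}

# Rank candidate columns by similarity to the query column.
q = embed(my_column)
ranked = sorted(candidates,
                key=lambda name: cosine(q, embed(candidates[name])),
                reverse=True)
print(ranked)  # -> ['tickers', 'zipcodes']
```

Learned embeddings replace the trigram hack at scale, but the ranking loop is the same shape: embed once per column, then score candidates against the query column.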

Eventually I’d like to be able to find, clean, prep, join, and build up nice visualizations just by clicking around in the UI.

Anyway, if this looks promising, let me know and I’ll keep building. Or tell me why I should give up!

https://app.ginidata.com/

Fun tech details: I run a data pipeline that crawls and extracts tables from lots of formats (CSVs, HTML, LaTeX, PDFs, digging inside zip/tar/gzip files, etc.) into a standard format, post-processes the tables to clean them up, classify them, and extract metadata, then generates embeddings and indexes them. I have lots of other data sources already implemented; for example, I’ve already extracted tables from all research papers on arXiv so that you can search research tables from papers.

(I don’t make any money from this and I’m paying for this myself. I’d like to find a sustainable business model, but “charging for search” is not something I’m interested in…)

submitted by /u/9us
[link] [comments]

[Dataset Request] Looking For Animal Behavior Detection Dataset With Bounding Boxes

Hi everyone, I’m a college student working on an animal behavior detection and monitoring project. I’m specifically looking for datasets that include:

Photos/videos of animals
Bounding box annotations
Behavior labels/classifications

Most datasets I’ve found either have just the images/videos without bounding boxes, or have bounding boxes but no behavior labels. I need both for my project. For example, I’m looking for data where:

Animals are marked with bounding boxes
Their behaviors are labeled (e.g., eating, running, sleeping, hunting)
Preferably with temporal annotations for videos

Has anyone worked with such datasets or can point me in the right direction? Any suggestions would be greatly appreciated! Thanks in advance!
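If no ready-made dataset turns up, one pragmatic route is to take an existing detection dataset and add the behavior labels yourself. There is no single standard for this; a COCO-style record extended with a hypothetical `behavior` attribute might look like:

```python
import json

# One annotation: a bounding box plus custom behavior/temporal attributes.
annotation = {
    "image_id": 42,
    "category_id": 1,                     # e.g. "deer" in your category list
    "bbox": [120.0, 56.0, 210.0, 180.0],  # [x, y, width, height], pixels
    "attributes": {
        "behavior": "grazing",            # custom field, not core COCO
        "frame_range": [300, 450],        # temporal span, for video clips
    },
}
print(json.dumps(annotation, indent=2))
```

Keeping the box fields in standard COCO form means existing detection tooling still reads the file, while your behavior classifier consumes the extra `attributes` block.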

submitted by /u/Suspicious-Twist9647
[link] [comments]

Need Help Finding A Voice Or Speech Dataset With The Following Criteria

I need a voice dataset for research where each person speaks the same sentence or word in x different locations with varying noise.

Example: Person 1 says “hello” in different locations: one with no background noise, and locations with background noise 1, 2, 3, … x (e.g., in a car, a park, an office).

In short, I need n persons, each with x voice recordings spoken in different noisy locations.

I found one database which is VALID Database: https://web.archive.org/web/20170719171736/http://ee.ucd.ie:80/validdb/datasets.html

106 Subjects

1 studio and 4 office condition recordings for each subject, uttering the sentence “Joe Took Father’s Green Shoebench Out”

But I’m not able to download it. Please help me find a suitable dataset. Thanks in advance!

submitted by /u/arg05r
[link] [comments]

Requesting National Inpatient Sample Data From HCUP

I just submitted an order for nationwide NIS data; however, since I am trying to get student pricing, I had to submit an email verifying my current enrollment. I got an auto-response saying they’ll get back to me in 5-7 business days, which is really incompatible with my timeline. But I suspect I could get a quicker response, since I’m just seeking a standard approval (not asking a question).

I’m wondering if anyone else can offer insight into how long it took to actually receive the data, and perhaps suggestions for alternative datasets I could use (I’m looking for discharge-level data that includes information like hospital zip code). I also wouldn’t mind advice on working with the data. I’m planning on converting it to a format suitable for SQL querying (I know this is unusual, but I’m working within the constraints of essentially a class project).
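On the SQL conversion: the stdlib alone can stream a discharge CSV into SQLite without ever holding the whole file in memory. A sketch below; the column names are made up, since the real NIS files have their own layout:

```python
import csv
import io
import sqlite3

# Stand-in for a discharge-level CSV file (real NIS columns differ).
sample_csv = io.StringIO(
    "discharge_id,hospital_zip,total_charges\n"
    "1,19104,12500\n"
    "2,22030,8300\n"
)

con = sqlite3.connect(":memory:")  # use a file path for a persistent DB
con.execute(
    "CREATE TABLE discharges "
    "(discharge_id INTEGER, hospital_zip TEXT, total_charges REAL)"
)

# DictReader yields one row at a time, so this streams even for huge files.
reader = csv.DictReader(sample_csv)
con.executemany(
    "INSERT INTO discharges VALUES "
    "(:discharge_id, :hospital_zip, :total_charges)",
    reader,
)

rows = con.execute(
    "SELECT COUNT(*), MAX(total_charges) FROM discharges"
).fetchone()
print(rows)  # -> (2, 12500.0)
```

Declaring `hospital_zip` as TEXT matters: zip codes with leading zeros get mangled if stored as numbers.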

submitted by /u/Naur_Regrets
[link] [comments]

🌟 Open Investment Datasets: Free And Growing On GitHub/Huggingface

Hey r/datasets community!

I’m thrilled to share an exciting new resource for all you data enthusiasts, researchers, and finance aficionados out there. https://github.com/sovai-research/open-investment-datasets

🔍 What’s New?

Sov.ai has just launched the Open Investment Data Initiative! We’re building the industry’s first open-source investment datasets tailored for rigorous research and innovative projects. Whether you’re into AI, ML, quantitative finance, or just love diving deep into financial data, this is for you.

📅 Free Access with a 6-Month Lag

All our 20 datasets will be available for free with a 6-month lag for non-commercial research purposes. This means you can access high-quality, ticker-linked data without breaking the bank. For commercial use, we offer a subscription plan that makes premium data affordable (more on that below).

📈 What We Offer

By the end of 2026, Sov.ai aims to provide 100+ investment datasets, including but not limited to:

📰 News Sentiment: Ticker-matched and theme-matched sentiment analysis from various news sources.
📈 Price Breakout Predictions: Daily updates predicting upward price movements for US equities.
🔍 Insider Flow Prediction: Over 60 insider trading features ideal for machine learning models.
💼 Institutional Trading: In-depth analysis of institutional investment behaviors and strategies.
📢 Lobbying Data: Detailed data on corporate lobbying activities, linked to specific tickers.
💊 Pharma Clinical Trials: Unique dataset tagging clinical trials with predicted success outcomes.
⚠️ Corporate Risks: Bankruptcy predictions (Chapter 7 & 11) for over 13,000 US publicly traded stocks.
…and many more!

🤝 Get Involved!

We’re looking for firms and individuals to join us as co-architects or sponsors on this journey. Your support can help us expand our offerings and maintain the quality of our data. Interested? Reach out to us here or connect via our LinkedIn, GitHub, and Hugging Face profiles.

🧪 Example Use Cases

Here’s how easy it is to get started with our datasets using the Hugging Face datasets library:

```python
from datasets import load_dataset

# Example: Load News Sentiment Dataset
df_news_sentiment = load_dataset("sovai/news_sentiment", split="train").to_pandas()

# Example: Load Price Breakout Dataset
df_price_breakout = load_dataset("sovai/price_breakout", split="train").to_pandas()

# Add more datasets as needed...
```

submitted by /u/OppositeMidnight
[link] [comments]

Need Help On Extracting The NIHSS From The MIMIC-III Dataset

Hey guys, I am currently working on a project about the use of machine learning for stroke rehabilitation, and I want to extract information, like the NIHSS score, from medical datasets. I found an article where someone already did that and even provides the code on GitHub. But my problem is, I don't know where to point the code at the MIMIC-III dataset (I already have it), which consists of several .csv files, so that it runs correctly. There is no README or any other file that explains how to run the code or prepare the dataset. Maybe someone has done this before or can help me with it.

Link to the Article: https://physionet.org/content/stroke-scale-mimic-iii/1.0.0/

Link to the Github repo: https://github.com/huangxiaoshuo/NIHSS_IE
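While you wait on the repo, the core extraction step amounts to pulling NIHSS mentions out of free-text notes (in MIMIC-III, the `TEXT` column of NOTEEVENTS.csv). A stripped-down regex sketch on a fabricated note, just to show the shape of that step; the real repo's pipeline is surely more elaborate:

```python
import re

# Fabricated note text standing in for a row of NOTEEVENTS.csv.
note = ("Pt with acute ischemic stroke. NIHSS 14 on arrival, "
        "improved to NIHSS: 8 at 24h.")

# Match "NIHSS", optional colon, then a 1-2 digit score.
pattern = re.compile(r"NIHSS\s*:?\s*(\d{1,2})", re.IGNORECASE)
scores = [int(m.group(1)) for m in pattern.finditer(note)]
print(scores)  # -> [14, 8]
```

Reading NOTEEVENTS.csv in chunks (e.g. `pandas.read_csv(..., chunksize=10000)`) and applying a pattern like this per chunk keeps memory manageable while you work out how the repo expects its inputs.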

(sorry for the bad language i am not an english native speaker)

submitted by /u/yasin_dlw
[link] [comments]

Scraped Every Parcel In United States

Hey everyone, my co-worker and I are software engineers, and we were working on a side project that required parcel data for all of the United States. We quickly saw that it was super expensive to get access to this data, so we naively thought we would scrape it ourselves over the next month. Well, anyway, here we are 10 months later. We created an API so other people could access it much more cheaply. I would love for you all to check it out: https://www.realie.ai/data-api. There is a free tier, and you can pull 500 records per call on it, meaning you should still be able to get quite a bit of data to review. If you need a higher limit, message me for a promo code.

Would love any feedback, so we can make it better for people needing this property data. Also happy to transfer to S3 bucket for anyone working on projects that require access to the whole dataset.

Our next challenge is making these scripts run automatically every month without breaking the bank. We are thinking Azure Functions? Would love any input if people have other suggestions. Thanks!

submitted by /u/Equivalent-Size3252
[link] [comments]