Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

[PAID] Magazines Dataset, Economist, Vanity Fair, The Atlantic And More

Magazines dataset of all the past issues of the following magazines:

Economist (1997 to current issue)
The Atlantic (1857 to current issue)
Vanity Fair (1913 to current issue)
MIT Technology Review (1997 to current issue)
TIME (1923 to current issue)

There are a few more magazines in the pipeline (The New Yorker, NY Times Mag and others), which will be added.

Format: Data is available in JSON and EPUB format. PDFs can be generated on demand.

NOTE: Vanity Fair shut down in 1936 and relaunched in 1983, so data between these dates isn’t available for it.

If you have any questions or want to buy, please DM me.

submitted by /u/waqarHocain

Selling Preprocessed And Cleaned Job Description Dataset (Latest LinkedIn And Indeed STEM Postings From US). The Dataset Contains Both Uncleaned And Preprocessed Data For AI Training. Please Let Me Know If Anyone Would Like It, I’m Trying To Raise Some Money For My Startup. Thanks!!!

Hey!

I have around 700K lines of job descriptions processed for AI and ML training. The processing involves extracting just the requirements and responsibilities, splitting them into individual lines, correcting all grammatical mistakes, extracting keywords into software skills and experience, classifying the job description, and adding an H1B filter to it.

The dataset is from LinkedIn and Indeed. I scrape and process around 15K postings every day. I also have uncleaned, purely scraped data that comes to 60K every day. They are all STEM jobs in the US.
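The splitting and keyword-extraction steps described above can be sketched roughly as follows. This is a minimal illustration, not the poster’s actual pipeline: the `SKILL_KEYWORDS` vocabulary, the function names, and the choice of delimiters are all assumptions made for the example.

```python
import re

# Hypothetical keyword vocabulary; a real pipeline would use a far larger list.
SKILL_KEYWORDS = {"python", "sql", "aws", "docker", "excel"}


def split_requirements(text):
    """Split a job description into individual requirement lines."""
    # Break on newlines, semicolons, and bullet characters.
    parts = re.split(r"[\n;•]+", text)
    cleaned = []
    for part in parts:
        # Drop surrounding whitespace and leading list markers such as "-" or "*".
        line = part.strip(" \t-*")
        if line:
            cleaned.append(line)
    return cleaned


def extract_skills(line):
    """Return the known skill keywords mentioned in a line, sorted."""
    tokens = set(re.findall(r"[a-z+#]+", line.lower()))
    return sorted(tokens & SKILL_KEYWORDS)


def process(text):
    """Produce one record per requirement line with its matched skills."""
    return [
        {"line": line, "skills": extract_skills(line)}
        for line in split_requirements(text)
    ]
```

Calling `process()` on a raw description returns one dictionary per line, which maps naturally onto the one-requirement-per-row layout the post describes. Grammar correction and job classification are not covered by this sketch.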

I have attached an example of both datasets with this. You can find them here.

I’m trying to raise around $2000 for my startup and this would help me a lot. However, there’s no pressure. I’m not trying to solicit, just trying to sell a good dataset.

Let me know if anyone has any questions, and please no hate.

Thanks!

submitted by /u/assassinator444

Thanks For The Support! New API To Bypass Cloudflare Turnstile Is Live

A few months ago, I launched my cheap scraping API, and I’m happy to share that 79 users are already using it! 🙌

I’ve received lots of requests asking for an API to bypass Cloudflare Turnstile, and I’m excited to announce that it’s now live! 🎉

Plus, the new API supports custom headers, giving you more flexibility for your scraping needs.

Thanks a ton for all the support!

Let me know if you have any feedback or further requests!

submitted by /u/Affectionate-Olive80

[Research] Mushroom Description Dataset

Hi

As my final year uni project, I am building an app that will attempt to classify wild mushrooms, and I would like to build a ‘page’ with an image of the mushroom and some basic info like genus and edibility. Does anyone know of any such dataset?

For context, I have an AI model which is trained with Mushroom Observer’s Machine Learning dataset. I tried to use their Name/Descriptions CSV, but it is clunky and does not contain images.

Thanks for any help

submitted by /u/Gostinker

Need A Data Set That Uses Social Media

Hi, I am currently working on a project which focuses on the influence that social media has on cryptocurrency price fluctuations. Does anyone know where I might be able to find a dataset to help me with this, or a way to collect data from social media myself? Thanks

submitted by /u/GeorgeW427

Grocery Price API V2 In The Works – Which Stores Should We Add Next?

Hey r/datasets!

A few months back, I launched a Grocery Price API, and I just wanted to start by saying a big thank you to everyone who subscribed and supported it early on. 🙏

The response has been amazing!

Based on feedback, I’m now diving into V2 to add more stores and make the API even more comprehensive.

I’d love your input:

What are the top grocery stores you’d like to see included?

Whether it’s big national chains or popular local spots, drop your suggestions below!

Thanks again, and I’m excited to keep building this with the community’s needs in mind!

submitted by /u/Affectionate-Olive80

Light Pollution Dataset For Data Visualization

I would like to obtain a usable dataset on light pollution that tracks the increase in brightness in United States cities. I have not been able to locate a suitable dataset. There are lots of maps and visualizations, but not a dataset I can work with myself in Python and R. Any recommendations and leads are appreciated. Thanks!

submitted by /u/SupremoSpider

Need Ideas For Data Science School Project

My friend and I are looking for a fun dataset to use for our end of year project. The goal is to make a random forest and then use that to make predictions about unseen instances.
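For anyone unfamiliar with the idea, the fit-then-predict loop can be sketched in plain Python as below. This is a deliberately simplified stand-in, not a full random forest: each "tree" is a one-split decision stump trained on a bootstrap sample with one randomly chosen feature, and the function names are made up for illustration. For a real project, a library implementation such as scikit-learn’s `RandomForestClassifier` would be the usual choice.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Find the single threshold on one feature that misclassifies the fewest rows."""
    values = sorted({row[feature] for row in rows})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # Only one distinct value in the sample: always predict the majority label.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict(forest, row):
    """Classify an unseen row by majority vote over all stumps."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]
```

The two ingredients that make the ensemble "random" are the bootstrap resampling of rows and the random choice of feature per tree. Any tabular dataset with a label column and a held-out set of unseen rows fits this pattern.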

We aren’t entirely sure where to look for data sets or what we want to do, so all recommendations are welcome! Thanks in advance!

submitted by /u/DeltaShadow4

Need Help Opening A Massive .dbo (45GB) — Any Advice?

Hey everyone! I’ve got this gigantic file, ePCR.dbo.MedicalRecord, sitting at a whopping 45.4 GB, and I’m stumped on how to open it. 😅 I tried using DBeaver, but I keep hitting an OutOfMemoryError, even after bumping up the memory settings. It seems like it’s way too big for DBeaver to handle.

Does anyone have any experience with these kinds of files or know any tricks for working with huge .dbo files? Ideally, I’d like to export the data to a CSV so I can actually dig into it, but I’m open to any advice or tool suggestions. I’m not even 100% sure what program originally created this file, so I’m working with limited info here.
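One low-risk first step is to look at the start of the file without loading all of it, since most GUI tools run out of memory by trying to read everything at once. The sketch below only reads a small sample and a few lines. The helper names are made up for illustration, and it does not assume any particular file format. (The name `ePCR.dbo.MedicalRecord` resembles a SQL Server database.schema.table identifier, so the file may be a table export rather than a distinct ".dbo" format, but that is a guess.)

```python
import itertools


def peek_bytes(path, n=4096):
    """Read only the first n bytes of a file, regardless of its total size."""
    with open(path, "rb") as f:
        return f.read(n)


def looks_like_text(sample):
    """Rough heuristic: mostly printable ASCII and no NUL bytes.

    UTF-16 text also contains NUL bytes, so a False result is not conclusive.
    """
    if not sample or b"\x00" in sample:
        return False
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in sample)
    return printable / len(sample) > 0.95


def head_lines(path, count=5, encoding="utf-8"):
    """Return the first few lines by streaming, never holding the whole file."""
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, count)]
```

If `looks_like_text(peek_bytes(path))` is true, `head_lines(path)` should reveal the delimiter and column layout, and the file can then be converted to CSV line by line. If the sample is binary, the first bytes are still useful for identifying which program wrote the file.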

Image: File Properties

Any help would be awesome — thanks in advance! 🙏

submitted by /u/alb53

PhysioNet Account Registration And Other Sources Of EHR Dataset

Hi, I’m developing a project in which I need electronic health record (EHR) data to identify high-risk patients for early intervention.

I learned about MIMIC-IV and am trying to create an account on PhysioNet, but I can’t proceed because no account creation or activation email has arrived, not even in the spam folder. Has anyone had the same issue? Is there any way to resolve this?

Are there any other sources of EHR datasets? Preferably ones that include lab data and patient clinical notes to assist with early prediction, among other things.

Any help and suggestions are much appreciated.

submitted by /u/Radiant_Blue_Eyes

How To Avoid Your LLM Leaking Sensitive Data

Hello, dataset community! I wanted to share a project my team has been working on — access control for RAG (a native capability of our authorization solution). I thought it would make sense to share it here and get your feedback.

Most architectures centralize data, making it hard to segregate the specific data that AI models can access. Loading corporate data into a central vector store and using it alongside an LLM gives anyone interacting with the AI agent root access to the entire dataset. That can lead to privacy violations and compliance issues.

Here’s what Cerbos does (our permission-aware data filtering):

When a user asks an AI chatbot a question, our solution, Cerbos, enforces existing permission policies to ensure the user has permission to invoke the agent. Before any data is retrieved, Cerbos creates a query plan that defines which conditions must be applied when fetching data, so that only the records the user can access based on their role, department, region, or other attributes are returned. Cerbos then provides an authorization filter that limits the information fetched from your vector database or other data stores. The LLM uses only the allowed information to generate a response, keeping it relevant and fully compliant with user permissions.
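The general pattern of turning user attributes into conditions and applying them before retrieval can be sketched in plain Python as below. This is not the Cerbos API. The `User` class, the `build_filter` policy, and the record layout are hypothetical, chosen only to illustrate filtering before anything reaches the LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    role: str
    department: str
    region: str


def build_filter(user):
    """Turn user attributes into conditions a record must satisfy.

    Hypothetical policy: admins see everything, everyone else is limited
    to records from their own department and region.
    """
    if user.role == "admin":
        return {}
    return {"department": user.department, "region": user.region}


def retrieve(records, conditions):
    """Return only the records whose metadata matches every condition."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in conditions.items())
    ]
```

In a real deployment the conditions would typically be passed to the vector database as a metadata filter on the similarity query, so disallowed records are never fetched at all rather than being dropped afterwards.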

PS. You could use our open source authorization solution, Cerbos PDP, to see this use case in action. And here’s our documentation.

Would love to get your thoughts and feedback on this, if you have a moment.

submitted by /u/diggVSredditt

Ticker-Linked Finance Datasets (HuggingFace)

GitHub Repository

News Sentiment: Ticker-matched and theme-matched news sentiment datasets.
Price Breakout: Daily predictions for price breakouts of U.S. equities.
Insider Flow Prediction: Features insider trading metrics for machine learning models.
Institutional Trading: Insights into institutional investments and strategies.
Lobbying Data: Ticker-matched corporate lobbying data.
Short Selling: Short-selling datasets for risk analysis.
Wikipedia Views: Daily views and trends of large firms on Wikipedia.
Pharma Clinical Trials: Clinical trial data with success predictions.
Factor Signals: Traditional and alternative financial factors for modeling.
Financial Ratios: 80+ ratios from financial statements and market data.
Government Contracts: Data on contracts awarded to publicly traded companies.
Corporate Risks: Bankruptcy predictions for U.S. publicly traded stocks.
Global Risks: Daily updates on global risk perceptions.
CFPB Complaints: Consumer financial complaints data linked to tickers.
Risk Indicators: Corporate risk scores derived from events.
Traffic Agencies: Government website traffic data.
Earnings Surprise: Earnings announcements and estimates leading up to announcements.
Bankruptcy: Predictions for Chapter 7 and Chapter 11 bankruptcies in U.S. stocks.

We just launched an open investment data initiative. For academic users, these datasets are free to download from Hugging Face.

All of our datasets will be progressively made available for free at a 6-month lag for all research purposes.

Sov.ai plans on having 100+ investment datasets by the end of 2026 as part of our standard $285 plan. This implies that we will deliver a ticker-linked patent dataset that would otherwise cost $6,000 per month for the equivalent of $6 a month.

submitted by /u/OppositeMidnight

Annotated Dataset For Explaining The Reason In AI Vs Real Image Detection

I am currently working on a problem statement in which I need to classify images as real or AI-generated and then explain the classification. The first part is quite easy. For the second part I found some research papers, but none of them link to an annotated dataset for fine-tuning a model. Can anyone help me find datasets that have good annotations for this purpose?

SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model (they mention a dataset on page 4 but didn’t give any links)

submitted by /u/Background-Trainer37

Drone And Fighter Dataset For YOLOv5 Model

I am trying to build an AI model using YOLOv5n that detects drones and fighter aircraft, but I can’t find a good dataset for both classes, or even for one of them. I’ve searched in a lot of places, but the datasets I find always give low accuracy, and my target accuracy is 95%. Does anyone know of a suitable dataset, or has anyone worked on a similar project? Please 🥹

submitted by /u/donia_00

[Dataset Request] What’s The Best Way To Collect Political News Articles From Several Sources?

Hey folks! I’m a software engineer working on a data science personal project and could use some help collecting my data.

I want to collect a database of political news articles specifically related to US politics from several sources (e.g., NPR, CNN, AP News, Reuters) ideally from the past 30 days or so for my initial POC.

I’ve done some research on several news APIs. A lot of them don’t categorize articles, and a lot of them offer search or today’s headlines but not a stream of all articles from a particular source. If I had my ideal API, I could get all political articles from a particular set of news sources.

I wanted to reach out to see if there are any existing datasets that I could use, or if folks have any other advice. Thank you!
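One API-free angle is to poll each source’s RSS feeds, where they exist, and filter locally. The sketch below assumes standard RSS 2.0 `<item>` elements with `title`, `link`, and `pubDate` fields. The keyword list and function names are hypothetical, and no feed URLs are included because none appear in the post.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical keyword list; tune it or replace it with a classifier.
POLITICS_TERMS = ("congress", "senate", "election", "white house")


def parse_items(rss_xml):
    """Extract title, link, and publication time from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        published = parsedate_to_datetime(pub) if pub else None
        if published is not None and published.tzinfo is None:
            # Assumption: treat dates without an offset as UTC.
            published = published.replace(tzinfo=timezone.utc)
        items.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "published": published,
            }
        )
    return items


def recent_political(items, now, days=30):
    """Keep items from the last `days` days whose title mentions a politics term.

    `now` must be a timezone-aware datetime.
    """
    cutoff = now - timedelta(days=days)
    keep = []
    for it in items:
        if it["published"] is None or it["published"] < cutoff:
            continue
        if any(term in it["title"].lower() for term in POLITICS_TERMS):
            keep.append(it)
    return keep
```

Feeds usually expose only the most recent articles, so building a 30-day window this way means polling on a schedule and de-duplicating by link. Title keyword matching is crude and will miss articles, so treat it as a first pass.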

submitted by /u/The-FrozenHearth

Looking For Facebook Ads Performance Data

Hey Datasets,

I was wondering if anyone has an idea where I could find a medium to large-sized dataset (10,000 – 100,000 ads) on Facebook ads performance.

I’m looking for data with details like:

start date, end date, category, campaign objective, used budget, reach, impressions, clicks, target country, target audience age, target audience gender, target audience interest

I know there’s the Facebook ads API, but it doesn’t allow access to this data unless the ads are your own.

Any help or suggestions would be appreciated. Thanks!

submitted by /u/Equivalent-Bear-4329