Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

Are There Any Substance Abuse Datasets?

Hey folks! I need to find some textual data (conversations and messages) about substance use.
e.g. “Smoking crack hits me with an intense wave of euphoria.”, “I enjoy doing cocaine”, etc.

I've been trying to find such data but have failed so far. What I've discovered mostly relates to datasets on individual addicts or on specific drugs, but none of them matches the requirement above.

I would really appreciate it if you could suggest a dataset from any repository (Kaggle, Hugging Face, or anywhere else) that could help me.

submitted by /u/Kian5658

Looking For Global Political Tension Data

Hi all, I’m doing a research project on global conflicts and in particular the cyber impact. I am looking for a dataset which I can use to create a matrix of which countries have ‘political issues’ with each other.
I can find a lot of information on the major conflicts, but getting outside the top 10 gets a bit challenging.

Has anyone seen any data I could use to summarise global political tensions by country?
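Whatever source you end up with, one way to structure it: reduce event-level records (one row per reported interaction between two countries, with a negativity score) down to a symmetric country-by-country tension matrix. A minimal pandas sketch; all country pairs and scores below are invented:

```python
import pandas as pd

# Toy event records: one row per reported interaction between
# two countries, with an invented "tension" score.
events = pd.DataFrame({
    "country_a": ["USA", "USA", "CHN", "RUS", "RUS", "IND"],
    "country_b": ["CHN", "RUS", "RUS", "UKR", "UKR", "PAK"],
    "tension":   [0.6,   0.8,   0.3,   0.95,  0.9,   0.7],
})

# Average the score per pair, then mirror it so the matrix is
# symmetric (tension is treated as undirected here).
pair_mean = events.groupby(["country_a", "country_b"])["tension"].mean().reset_index()
mirrored = pair_mean.rename(columns={"country_a": "country_b", "country_b": "country_a"})
both = pd.concat([pair_mean, mirrored], ignore_index=True)

matrix = both.pivot_table(index="country_a", columns="country_b",
                          values="tension", fill_value=0.0)
print(matrix.round(2))
```

A real source would only change how the `events` frame is built; the groupby-and-pivot stays the same.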

submitted by /u/fred_t_d

Search For A Cool Dataset For Learning Analysis With Python

Hey, I have to write a paper about applied data analysis, and for that I'm searching for an interesting dataset. Interestingly, I can't think of any data myself; I tried random Google searches but haven't found any cool data so far. I think the one prerequisite my professor set (he wants to learn something new from the analysis) made me weirdly judge all datasets as 'unworthy', if you know what I mean.

Are there any cool datasets from which my professor, who has a background in data science, could learn something? (Optionally, it would be nice if they were fun to work with and not a literal pain to normalize, but yeah, just optional xD)

submitted by /u/matth_l

Where Can I Find A Company’s Financial Data FOR FREE? (if It’s Legally Possible)

I'm trying my best to find a company's financial data for my research: the financial statements for Profit and Loss, the Cash Flow Statement, and the Balance Sheet. I already found one source, but it requires me to pay $100 first. I'm just curious if there's any website you can point me to so I don't have to spend that much (or can maybe get it for free). Thanks…

submitted by /u/C0deit-Michael

Looking For A YOLO/Darknet-compatible Dataset That Can Be Used To Scan An Image/video And Identify Specific Body Parts

Hey all,

I’m working on a number of devices where I’d like to use machine learning and live video to identify specific parts of the human body.

This is a sex-positive project, and therefore rather than have a classifier that censors anything it thinks might be nudity, I’m looking for a dataset that will enable me to identify nipples, penises, vaginas, and other potentially erogenous zones on people of all genders, colours, and body shapes.

It feels like it should be possible, but I'm new to creating/training models and not sure where to start, so I figure standing on the shoulders of others is probably a good place to begin!
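For context on what "YOLO/Darknet-compatible" means in practice: a detection dataset is a folder of images plus one label file per image, where each line reads `class x_center y_center width height` with coordinates normalized to 0–1, tied together by a small YAML config. A sketch of what that config might look like for this project; the paths and class names are invented placeholders:

```yaml
# Ultralytics-style dataset config (paths and classes are illustrative)
path: datasets/bodyparts   # dataset root
train: images/train        # training images, with labels/train/*.txt alongside
val: images/val

names:
  0: nipple
  1: penis
  2: vulva
  3: buttocks
```

Any dataset you find (or annotate yourself with a tool like CVAT or Label Studio) just needs to be exported into this images-plus-label-files layout to train a YOLO model on it.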

submitted by /u/No-Art1323

Song Dataset With Mood/Vibe Parameters

I have an idea for a personal project and I could use some help finding a dataset.

Project:

I would like to make a playlist generator where I can specify different moods at different points of time in the playlist, something along the lines of 1h Chill, 1h Pop, 1h Dance. Obviously I would like much more refinement than I showed in the example. My thought was that I could find paths between different song types so that the genre transitions are smooth.

Maybe this already exists?

Dataset:

What I am looking for is a large dataset with the obvious main parameters (name, artist, year, etc.) but also things like popularity, danceability, singability, nostalgia factor, high vs. low energy, happiness, tempo, and more.

Does a dataset like this exist? I also thought it could be possible to use sentiment analysis on the lyrics to generate some of these parameters.

Let me know if you have any ideas
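As a sketch of the path idea: given per-song mood features (the names below mirror Spotify-style audio features, and every value is invented), you can drift a target mood linearly from one anchor to the next and greedily pick the closest unused song at each step:

```python
import math

# Toy catalogue: (title, danceability, energy, valence) — values invented.
songs = [
    ("Slow Tide",   0.30, 0.20, 0.40),
    ("Night Drive", 0.55, 0.45, 0.50),
    ("Neon Pop",    0.75, 0.70, 0.80),
    ("Floor Fill",  0.90, 0.95, 0.85),
]

def dist(features, target):
    """Euclidean distance between a song's features and a target mood."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(features, target)))

def playlist_path(songs, start_mood, end_mood, length):
    """Greedily pick songs whose features track a mood that drifts
    linearly from start_mood to end_mood — one crude way to get
    smooth transitions between playlist segments."""
    remaining = list(songs)
    picks = []
    for i in range(length):
        w = i / max(length - 1, 1)
        mood = tuple(s + w * (e - s) for s, e in zip(start_mood, end_mood))
        best = min(remaining, key=lambda song: dist(song[1:], mood))
        remaining.remove(best)
        picks.append(best[0])
    return picks

print(playlist_path(songs, (0.3, 0.2, 0.4), (0.9, 0.95, 0.85), 4))
```

Chaining several (start, end) mood pairs back to back would give the 1h Chill → 1h Pop → 1h Dance structure described above.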

submitted by /u/hindenboat

Is There A Dataset Listing Death/birth Dates?

Is there a dataset that contains both the birth and death dates of real people?

This may be a bit of a morbid topic, but I’ve been talking to my wife about people dying close to their birthdays, and since I tend to do silly projects as a way to keep my knowledge alive, I figured an analysis of this data might tell us something (preferably that there’s no correlation lol).

However, all government databases I found only provide aggregated data, such as death and birth rates, unfortunately. I know this may involve some data security and privacy concerns, but I would really just need these two linked dates to do the analysis, no names or anything.
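For whenever the data does turn up, the analysis itself is small: compute each person's distance in days from their death date to the nearest occurrence of their birthday, then compare the distribution against uniform. A sketch on synthetic dates (the birth year is chosen non-leap to dodge the Feb 29 edge case):

```python
import datetime
import random

def days_to_nearest_birthday(birth, death):
    """Days from the death date to the closest occurrence of the
    person's birthday (previous, same, or next year)."""
    candidates = [birth.replace(year=death.year + k) for k in (-1, 0, 1)]
    return min(abs((death - c).days) for c in candidates)

# Null-hypothesis simulation: independent uniform birth/death days.
# 1950 is not a leap year, so no Feb 29 births to trip up replace().
random.seed(0)
distances = []
for _ in range(2000):
    birth = datetime.date(1950, 1, 1) + datetime.timedelta(days=random.randrange(365))
    death = datetime.date(2020, 1, 1) + datetime.timedelta(days=random.randrange(365))
    distances.append(days_to_nearest_birthday(birth, death))

# With no birthday effect, the distance is roughly uniform on 0..182,
# so the mean should sit near 91.
print(sum(distances) / len(distances))
```

Running the same function over real linked dates and comparing the histogram to this flat baseline would directly test the "dying near your birthday" effect.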

If anyone has access to a structure like this, or perhaps an API that can make this data available, I would be very grateful. I promise to bring this complete study to reddit as soon as I finish it.

submitted by /u/alchamiwa

Dataset With Categorical And Numerical Variables Both

Hi, I'm looking for a dataset with at least three numerical variables and two categorical variables. It should be easy enough to find, but I'm having trouble finding any that match the requirements. Any suggestions for resources where I can look?

The dataset is for a project; we aren't allowed to use built-in or made-up data, or data from places like Kaggle, etc.

submitted by /u/viridiancityy

[self-promotion] Giving Back To The Datasets Community With Some Free Data!

Hey guys,

I just wanted to share our project called Potarix (https://potarix.com/). It’s an AI-powered web scraping/data extraction tool that can pull data from any website. You can use it at (https://app.potarix.com).

I wanted to give back to this community, so we've given everyone who signs up $5 of credits. Scraping each page uses $0.10 of your credits, and you are not charged for unsuccessful scrapes! That should let you get data from 50 web pages.

So far, we’ve used this project (with some added features) to help clients:

Scrape betting data from the NFL, NBA, and NCAA.
Scrape all the Google reviews for each business in San Francisco.
Scrape business contact information on Google Maps for every single business in the Houston area.

Looking ahead, we've built some things in-house that we'd love to include in the SaaS platform shortly. We've built functionality to click, type, scroll, etc. on the page. AI also tends to be wrong sometimes, so we created a tweakable script in the backend to control the agent's actions; that way, you're in control and can bring the script to 100% accuracy. We've also seen people battling to build infrastructure for their large-scale scraping projects. We want to let folks autonomously set up parallelization and choose the infra for their project so everything is scraped as quickly and efficiently as possible from the SaaS.

If any of these future features sound interesting, feel free to book some time, and we can discuss how we can help you with these now!

submitted by /u/youngkilog

Multi-Source Rich Social Media Dataset – A Full Month Of Global Chatter!

Hey, data enthusiasts and web scraping aficionados!
We’re thrilled to share a massive new social media dataset that just dropped on Hugging Face! 🚀

Access the Data:

👉Exorde Social Media One Month 2024

What’s Inside?

Scale: 270 million posts collected over one month (Nov 14 – Dec 13, 2024)
Methodology: total sampling of the web, statistical capture of all topics
Sources: 6,000+ platforms including Reddit, Twitter, BlueSky, YouTube, Mastodon, Lemmy, and more
Rich annotations: original text, metadata, emotions, sentiment, top keywords, and themes
Multi-language: covers 122 languages with translated keywords
Unique features: English top keywords, allowing super-quick statistics and trends/time-series analytics!
Source: at Exorde Labs, we are processing ~4 billion posts per year, or 10–12 million every 24 hrs.

Why This Dataset Rocks

This is a goldmine for:

Trend analysis across platforms
Sentiment/emotion research (algo trading, OSINT, disinfo detection)
NLP at scale (language models, embeddings, clustering)
Studying information spread & cross-platform discourse
Detecting emerging memes/topics
Building ML models for text classification

Whether you’re a startup, data scientist, ML engineer, or just a curious dev, this dataset has something for everyone. It’s perfect for both serious research and fun side projects. Do you have questions or cool ideas for using the data? Drop them below.

We’re processing over 300 million items monthly at Exorde Labs—and we’re excited to support open research with this Xmas gift 🎁. Let us know your ideas or questions below—let’s build something awesome together!

Happy data crunching!

Exorde Labs Team – A unique network of smart nodes collecting data like never before

submitted by /u/Exorde_Mathias

Looking For Fraud Detection Datasets

I am writing a book chapter on fraud detection using machine learning. I found that most of the current research is rather hard to apply for someone actually building models: every paper likes to highlight the lack of good datasets, but no one provides a collection of good datasets that readers of their paper can use.

I think that if I include some good datasets for people to train their models on in my chapter, then that will be a very good contribution from my side.

Do you know any good datasets that are used for this, or where I can look for such datasets?

I am honestly clueless when it comes to collecting and finding good datasets for industry grade applications, and I will be really grateful for any help that I get🙏🙏

submitted by /u/mystic-aditya

NFL Data Help For Expected Hypothetical Completion Probability

Currently trying to predict the 2025 Super Bowl winner for a college final presentation. I'm trying to use Expected Hypothetical Completion Probability (EHCP) from the 2019 Big Data Bowl to see which teams best optimize their playbook for EHCP, and whether there is a correlation between that and how often they win/complete, but I'm having trouble finding a data source.

The EHCP metric requires two main types of data:

1. Play-by-Play Data:

Includes high-level information like down, distance, time remaining, score differential, and whether the pass was completed.

2. Player Tracking Data:

Tracks the location of players and the ball during each play.

Key elements:

Receiver and defender positions.
Ball location during the pass.
Receiver separation, speed, and direction.

I was directed to pff.com and https://nextgenstats.nfl.com/ so far, but I am having trouble finding complete datasets for exactly what I need. Anything helps, so please let me know!
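While hunting for tracking data, the modeling half can be prototyped on synthetic plays. EHCP is built on top of a completion-probability model; a toy stand-in with scikit-learn, where every feature, coefficient, and play below is invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000

# Synthetic stand-ins for the features listed above (all invented):
air_yards  = rng.uniform(0, 40, n)   # depth of target
separation = rng.uniform(0, 8, n)    # receiver-defender gap, yards
down       = rng.integers(1, 5, n)   # 1st through 4th down

# Invented ground truth: more separation helps, deeper throws hurt.
logit = 1.0 + 0.5 * separation - 0.08 * air_yards
completed = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X = np.column_stack([air_yards, separation, down])
model = LogisticRegression(max_iter=1000).fit(X, completed)

# Estimated completion probability for a hypothetical 15-yard throw
# with 3 yards of separation on 3rd down.
print(model.predict_proba([[15.0, 3.0, 3]])[0, 1])
```

Swapping the synthetic arrays for real play-by-play and tracking columns gives the completion model; the "hypothetical" part of EHCP then comes from scoring receivers the ball was *not* thrown to with the same model.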

submitted by /u/B2_CROPFARMER

Institutional Data Initiative Plans To Release A Dataset “5 Times That Of Books3” In Early 2025

https://institutionaldatainitiative.org/

https://www.wired.com/story/harvard-ai-training-dataset-openai-microsoft/

Harvard University announced Thursday it’s releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright… with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries… In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it’s open to forming similar collaborations down the line.

submitted by /u/furrypony2718

Looking For Additional US National Pollutant & Animal Movement Datasets

Looking to do some analyses on animal movement in relation to pollutants and anthropogenic landscape features. I have a few datasets/sites collected already, but I'm wondering if I'm missing anything. In particular, I'm looking for higher-resolution data on lead, other cognition-impairing or mutagenic substances, and rodenticides.

Datasets below in case they're of use to anyone:

Animal Movement:

Movebank: https://www.movebank.org/cms/movebank-main

Animal Telemetry Network: https://portal.atn.ioos.us/#map

Pollutants:

Enviroatlas: https://enviroatlas.epa.gov/enviroatlas/interactivemap/

Uranium mines: https://andthewest.stanford.edu/2020/uranium-mine-sites-in-the-united-states/

Oil Refineries: https://atlas.eia.gov/datasets/eia::petroleum-refineries-1/explore?location=33.922439%2C-118.375771%2C10.55

Superfund sites: https://www.epa.gov/superfund/search-superfund-sites-where-you-live

PFAS: https://www.ewg.org/interactive-maps/pfas_contamination/map/

Heavy Metals: https://www.sciencedirect.com/science/article/pii/S0048969724011112

ATTAINS water inventory: https://www.epa.gov/waterdata/get-data-access-public-attains-data
NATA/AQS air quality: https://aqs.epa.gov/aqsweb/documents/data_api.html#annual
Toxic release: https://www.epa.gov/toxics-release-inventory-tri-program

submitted by /u/latrans_canis_

Can We Automate Data Quality Assessment Process For Small Datasets?

Recently, my friend and I have been thinking of working on a side project (for our portfolios) to automate data quality assessment for the small tabular datasets you often find on Kaggle.

We acknowledge that such a tool can't be 100% accurate, but it can definitely help non-technical and technical people alike get started with their datasets. We aim to have a platform where the user uploads a dataset, and the system identifies anomalies and suggests different ways to fix each one (e.g. imputing a missing value, fixing an email that doesn't follow the email pattern, etc.).

I would love to discuss the project further and get your thoughts on it. We have been researching similar projects and found Cocoon: it proceeds column by column, and for each column it has a series of anomalies to fix using an LLM. But we want to use statistical methods for numerical columns and bring in an LLM only when it's needed. Can anyone help?
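For the statistical side, the classic starting point for numeric columns is Tukey's IQR fence, with regex checks for pattern-typed columns like emails. A minimal sketch of both checks on an invented toy upload:

```python
import pandas as pd

# Toy upload with two common problem types (values invented).
df = pd.DataFrame({
    "age":   [25, 31, 28, 27, 240, 30],  # 240 looks like a typo
    "email": ["a@x.com", "b@x.com", "not-an-email",
              "c@y.org", "d@y.org", "e@z.net"],
})

def numeric_outliers(series, k=1.5):
    """Tukey's IQR fence: values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

def invalid_emails(series):
    """Rows that don't look like user@domain.tld."""
    return series[~series.str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")]

print(numeric_outliers(df["age"]))     # flags 240
print(invalid_emails(df["email"]))     # flags "not-an-email"
```

Checks like these can run on every column first, with the flagged cells (rather than whole columns) handed to an LLM only when a rule-based fix isn't obvious.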

submitted by /u/Better_Resource_4765