Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Need Free-Text Data. Willing To Pay.

I’m looking for large free-text data sets to train a model that will identify and redact sensitive data. Would be awesome if they were already annotated/labeled. Some entity types I’m interested in:

Location, email, name, CC, CVV, Exp, date, product, username, password, passport #, time.

Anything helps.
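For anyone unsure what "annotated/labeled" would mean here, one common layout is the raw text plus a list of character spans, each tagged with an entity type. The sketch below is only an illustration of that idea: the `Span` class, the `redact` helper, and the example sentence are all made up for this note and do not come from any particular dataset or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # index of the first character of the entity
    end: int     # index one past the last character
    label: str   # entity type, e.g. "EMAIL" or "NAME"


def redact(text: str, spans: list[Span]) -> str:
    """Replace each labeled span with a [LABEL] placeholder."""
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        out.append(text[cursor:span.start])
        out.append(f"[{span.label}]")
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


# Made-up example sentence, not taken from any real dataset.
example_text = "Contact Jane Doe at jane@example.com before 5 pm."
example_spans = [
    Span(8, 16, "NAME"),
    Span(20, 36, "EMAIL"),
    Span(44, 48, "TIME"),
]

print(redact(example_text, example_spans))
```

A dataset in roughly this shape gives a model both the input (the text) and the target (which characters to mask and why).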

submitted by /u/tombenom

Does Anyone Know Where I Can Find Data For J1 League (Japanese Soccer)?

Hi guys,

I’m a college student and I’m interested in writing a paper about the J1 League. I would love to look into the impact of nationality on playing time but I’m having a hard time gathering data. I know jleague.co and transfermarkt exist, but I can’t seem to find a way to download any statistics from either website.

Would anyone know where I can download data from the J1 League?

submitted by /u/useless_brownie

Having Trouble Loading A Dataset Into Google Colab

I am trying to load an OpenNeuro dataset into Google Colab to train a model. Based on the website, the dataset size is said to be 13.46 GB, which can definitely be accommodated by the free version of Colab which usually has around 50 GB of free disk space. I first attempted to download it using AWS CLI by running

!pip install awscli
!aws s3 sync --no-sign-request s3://openneuro.org/ds003949 ds003949-download/

But the process terminated as Colab ran out of disk space.

I then attempted to download with openneuro-py and to shrink my download range to just the derivatives folder.

!pip install openneuro-py
!openneuro-py download --dataset=ds003949 --include=derivatives/*

Again, I ran out of disk space before the download finished.

I am new to OpenNeuro so I don’t know how their datasets work exactly, or how to get the “true” dataset size. (I tried loading a smaller 6 GB dataset into Colab with the above methods, and the dataset size did match what was stated on the website.) I have minimal storage on my local hardware so I would like to try getting it loaded into Colab first before I attempt that route.

Would appreciate some help or advice on what I did wrong from anyone with experience working with OpenNeuro or neuroimaging data. Thanks!
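One way to narrow this down is to measure what the Colab disk actually has free and how much the partial download has already written. The sketch below uses only the Python standard library. The helper names (`free_gb`, `dir_size_gb`, `fits`) and the 20% headroom are assumptions made for illustration, and it does not explain why the download is larger than the listed size; it only makes the numbers visible.

```python
import os
import shutil


def free_gb(path: str = ".") -> float:
    """Free space, in GB (1 GB = 1e9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def dir_size_gb(path: str) -> float:
    """Total size, in GB, of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total / 1e9


def fits(required_gb: float, path: str = ".", headroom: float = 1.2) -> bool:
    """True if `required_gb` plus a safety margin fits in the free space."""
    return required_gb * headroom <= free_gb(path)


print(f"free space: {free_gb():.1f} GB")
print(f"13.46 GB fits with headroom: {fits(13.46)}")
```

Calling `dir_size_gb("ds003949-download")` after a failed run would show how far past the advertised 13.46 GB the download got before the disk filled up.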

submitted by /u/botsunny

Dataset Of Outgroup Vs Ingroup/neutral Questions

I’m looking for datasets containing questions that people ask to “opponents” along with questions that they ask to other people in similar situations. Examples of what I’m looking for include lawyers asking questions to their own witnesses and cross-examining other witnesses, politicians in hearings asking questions to supporters of different political parties, and detectives asking for information from suspects and from each other. I’d like to analyze any changes people make in asking questions to their “opponents” vs other people as a baseline.

submitted by /u/geartrains

Dataset For Benchmarking Recruiting Software For Bias

Hey all! I was doing some research on companies offering AI solutions for recruiting. I remember seeing a company mentioning that they were benchmarking their algorithm’s results to make sure there was no bias (as it relates to diversity) using some public dataset.

Unfortunately, I forgot to save the link and have been having trouble remembering what that dataset was. I would greatly appreciate it if you could tell me what the dataset could have been.

Thanks!

submitted by /u/opposity

Need Boarding School Or Stay Over Camp (ideally) Data For Funnel Analysis

I apologize in advance for the vague request, but I need to build a Tableau dashboard and present it for an interview. Unfortunately I wasn’t given any firm requirements or data when I asked, except that it needs to support funnel analysis. My Google searches for data haven’t been successful either. The data would ideally deal with maximizing capacity at a boarding school or stay over camp, but it doesn’t have to as long as the data support funnel analysis. I’m still pretty new in BI, so I’m not sure which data would best facilitate this. Thanks in advance for any help!
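For reference, data "supports funnel analysis" when it gives a count of people at each ordered stage, so that drop-off between stages can be computed. The sketch below shows that arithmetic. The stage names and counts are invented purely for illustration and do not come from any real school or camp.

```python
def funnel_rates(stages: list[tuple[str, int]]) -> list[dict]:
    """Step-to-step and overall conversion for an ordered funnel."""
    if not stages:
        return []
    top = stages[0][1]
    rows = []
    prev = None
    for name, count in stages:
        if prev is None:
            step_rate = None  # the first stage has nothing to convert from
        else:
            step_rate = count / prev if prev else 0.0
        rows.append({
            "stage": name,
            "count": count,
            "step_rate": step_rate,
            "overall_rate": count / top if top else 0.0,
        })
        prev = count
    return rows


# Hypothetical stage names and counts, invented purely for illustration.
example_funnel = [
    ("inquiry", 1000),
    ("application", 400),
    ("accepted", 200),
    ("enrolled", 150),
]

for row in funnel_rates(example_funnel):
    print(row)
```

Any dataset with a per-person status or a timestamp per stage can be aggregated into this shape, which is also roughly what a Tableau funnel chart expects.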

submitted by /u/skittles_grabber

Blood Transfusion Service Center Dataset

This dataset from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan, explores blood donation behavior as a classification problem. Collected every three months from 748 randomly selected donors, it includes attributes like recency, frequency, monetary value, and time. The dataset is ideal for studying and predicting blood donation behavior, pretty cool for classification tasks focused on understanding influencing factors.

You can find it here: https://sellagen.com/item/650207244d7ce7e8220cbec5

submitted by /u/nobilis_rex_

Looking For Dataset For Autism Rates

Has anyone come across any datasets dealing with autism rates? I want to work on a personal project since I am close to the subject of autism, but I have not come across any large data sets.

Specifically, it would be nice if the information is broken down by year, country, etc., and shows how it is progressing.

submitted by /u/aerost0rm

Why Do So Many Publicly Available Datasets Open In Such Inconvenient/unusable Formats?

Trying to just view the CDC datasets, and the only format it seems to open in is a text document. Why!?!? I can’t tell a single thing that’s going on, not even the variables being measured, because it just looks like blocks of text arranged haphazardly in the Notepad app.

Some other datasets from GitHub contain EDF files and text files again, which are also super inconvenient.

Like where is the option for csv or spreadsheet, or basically anything that’s readily viewable and understandable? Why isn’t that the default? I was expecting that viewing the data files would be the easier part of trying to write a research paper, but no

Also if anyone knows how to get this CDC dataset into a viewable format, please let me know! Thanks
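If the file turns out to be fixed-width text (a guess, since the post doesn't say which dataset it is), each variable sits at fixed character positions on every line, and the positions are normally listed in the dataset's documentation or layout file. A minimal sketch of turning such a file into CSV with the Python standard library follows. The `LAYOUT` column names and positions are hypothetical placeholders and would need to be replaced with the real ones.

```python
import csv
import io

# Hypothetical layout: (column name, start, end) with 0-based, end-exclusive
# positions. The real positions come from the dataset's documentation.
LAYOUT = [
    ("year", 0, 4),
    ("state", 4, 6),
    ("age", 6, 9),
]


def fixed_width_to_rows(lines, layout):
    """Slice each text line into named fields according to `layout`."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        yield {name: line[start:end].strip() for name, start, end in layout}


def to_csv(lines, layout) -> str:
    """Return CSV text with a header row built from the layout names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[name for name, _, _ in layout],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(fixed_width_to_rows(lines, layout))
    return buffer.getvalue()


# Two made-up lines in the hypothetical layout above.
print(to_csv(["2020CA 34\n", "2021NY101\n"], LAYOUT))
```

With a real file, passing an open file handle as `lines` and writing the returned text to a `.csv` file gives something a spreadsheet can open directly.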

submitted by /u/Classic-Asparagus

[self-promotion] Free Company Dataset (±17M Records)

BigPicture.io, the company I work for, has just released the latest version of their open-source company dataset, and it’s now available for download. I’ve been on Reddit for a while now, and think that this community might find it useful.

Check it out here: https://docs.bigpicture.io/docs/free-datasets/companies/

You need to sign up first, as we’ve had problems with bots and an AWS bill one month that nearly killed us.

Please feel free to provide your feedback/suggestions as we’re always aiming to improve our services.

submitted by /u/master_in_something

Any List Of All Agencies Submitting/not Submitting Reports To The FBI’s UCR Or NIBRS?

Looking for just a list that contains two kinds of information about the FBI’s uniform crime reports (UCR) or the newer NIBRS (the incident-based reporting system, can’t remember what it stands for):

Which agencies (e.g., police departments, etc.) contributed data to the UCR and/or NIBRS
Which agencies did NOT do that (e.g., last year)

I’m hunting around the FBI’s UCR website looking for this and haven’t found it yet. Anyone have this info?

submitted by /u/bobbyfiend

Data MarketPlace, Is It A Good Idea?

I think the current iteration of the data marketplace sucks. You have to know a specific place, where you want to get your data from. The variety of data sets available in a specific platform also varies so much. Also, it is incredibly difficult for a non-technical person to get their hands on the data. If a business user wants to access data they have to jump through a lot of hoops to download the data. Is it a good idea to start a marketplace that solves all these problems? Did anyone try to do this before?

submitted by /u/Responsible_Bell_772

Looking For Group Competition Dataset With Varying Team Composition From A Limited Pool Of Individuals

I’m looking for a dataset of sports, game, or video game events with two teams of multiple players (ideally 5 to 10) facing each other, where each team’s lineup is a different combination drawn from a limited pool of players. And of course the final score/outcome of the event.

For example, if 23 players had played 100 games of Counter-Strike together: who is playing, what is each team’s composition (not always the same 5 dudes facing the other same 5 dudes), and what is the result, plus maybe how long it lasted?
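To make the shape of the data concrete, one possible layout is one record per game, holding the two lineups, the score, and an optional duration. The sketch below is just one way to write that down. The `Match` class, the `lineup_row` helper, and the player IDs are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Match:
    team_a: frozenset               # player IDs on the first team
    team_b: frozenset               # player IDs on the second team
    score_a: int
    score_b: int
    minutes: Optional[float] = None  # how long the game lasted, if known

    def __post_init__(self):
        if self.team_a & self.team_b:
            raise ValueError("a player cannot be on both teams")


def lineup_row(match: Match, pool: list) -> list:
    """One value per player in the pool: +1 team A, -1 team B, 0 sitting out."""
    return [
        1 if player in match.team_a else -1 if player in match.team_b else 0
        for player in pool
    ]


# Made-up pool and game, for illustration only.
pool = ["p1", "p2", "p3", "p4", "p5"]
game = Match(frozenset({"p1", "p2"}), frozenset({"p3", "p5"}), 16, 12, 41.0)
print(lineup_row(game, pool))
```

A table with one such row per game, plus the score columns, would capture who played with whom and how each game ended.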

All I can find are datasets with teams of fixed or barely varying composition, like the European football dataset, or broad results without individual differentiation of the team members, like League of Legends ranked games datasets.

Doesn’t have to be highly skilled players. It could be the dataset of one’s kid’s football games at recess.

Any idea if such a dataset exists? I’m currently trying to make my own by recording my own practice games, but at the rate of once a week this will take forever.

submitted by /u/Heliantine

Looking For An Incomplete Dataset That Should Be Messy Or Contain Various Data Quality Issues.

Hello, Reddit community,

I’m working on a project that focuses on query-oriented data cleaning with human expert involvement, and I’m in search of a suitable dataset to support this research. The dataset should ideally contain messy or incomplete data.

If you know of any relevant datasets or sources where I can find such data, I would greatly appreciate your assistance. Additionally, if you have any suggestions or insights on where to look for datasets with data quality issues, please feel free to share them.

Thank you in advance for your help and suggestions!

submitted by /u/thelifeofZ080

Looking For Gaza Bomb Locations & Times. Any Data Out There?

I’m looking for a dataset that has geolocation coordinates (e.g., latitude & longitude) for bombs dropped on Gaza, especially in the past few weeks, but older, as well. Ideally, I’d like a column with location and a matched column with date/time, and any other information is gravy.

Any ideas? I’ve been searching online, trying to follow sources back for reports in WaPo, Reuters, Axios, AP, etc., but they all seem to lead to dead ends (e.g., proprietary data not shared online).

submitted by /u/bobbyfiend

Trying To Get Database Of All Homes With A Heat Pump

Hello, so I am trying to do some real estate-related research, and am particularly trying to understand the types of buildings and locations that are most likely to have houses with certain “green” and sustainability-related features, such as certain energy-efficient appliances. I do not intend for this to be a discussion about the overall sustainability and performance of heat pumps, but I am trying to find a way to obtain a database of as many houses as I can across the US that have a heat pump, or just within California. The whole US would be great, but I am most interested in California for the moment. This is real estate-related because heat pumps are just a hot topic in general in the eco-friendly home space.

I know there are certain data sources like the RECS data sets that have stats on heat pump adoption, but these values are only at the census division level. I want to see how heat pump homes are distributed much more locally and granularly so that I can understand which cities, regions, districts, neighborhoods, climate zones, etc. have higher clusters of heat pumps installed than others. Additionally, I want to understand the types of homes that have heat pumps, so that I can see if there are any trends to take note of.

I at first thought this idea was absurd and this data was just unobtainable, but then it was suggested that I take a look at Zillow’s API, which can be used to pull real estate home data that (sometimes) includes the HVAC system of a home. So I am wondering if maybe I could actually leverage this to get a read on where heat pump households are located within California. I am also wondering if there are other data sources I could use for this, such as construction permit databases or tax assessor databases, where I could filter results for houses where a permit was taken out for a heat pump installation.

The idea would be to match all these data points to an address, so that I can map out heat pump homes across the state with GIS. Does this sound reasonable? Would anyone here perhaps have any suggestions on how I could approach this research challenge? Thank you!
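As a rough sketch of the permit-database idea: if a city or county publishes permits as a CSV with a free-text description, the heat pump rows can be pulled out with a text match and then counted by area before geocoding. Everything specific below is an assumption: the column names (`address`, `zip_code`, `description`) are hypothetical, real permit exports differ by jurisdiction, and the sample rows are made up.

```python
import csv
import io
import re
from collections import Counter

# Matches "heat pump", "heat-pump" and "heatpump", in any letter case.
HEAT_PUMP = re.compile(r"\bheat[\s-]?pump", re.IGNORECASE)


def heat_pump_permits(rows, text_field="description"):
    """Yield permit records whose free-text field mentions a heat pump."""
    for row in rows:
        if HEAT_PUMP.search(row.get(text_field) or ""):
            yield row


def count_by(rows, field="zip_code"):
    """Count records per value of `field` (e.g. per ZIP code)."""
    return Counter(row[field] for row in rows)


# Made-up sample rows with hypothetical column names.
SAMPLE_CSV = """\
address,zip_code,description
1 Example St,90001,Install heat pump water heater
2 Example St,90001,Reroof
3 Example Ave,94102,Replace furnace with HEAT-PUMP system
4 Example Ave,94102,Heatpump mini split
"""

permits = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
matches = list(heat_pump_permits(permits))
print(count_by(matches))
```

The matched rows keep their address column, so they could be passed on to a geocoder and mapped in GIS. Permits only capture installations that were actually permitted, so the counts would be a lower bound.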

submitted by /u/teledude_22