Can anyone provide me the datasets – NPDI Dataset, Pornography 2K Dataset & GGOI Dataset
submitted by /u/ank_007____
[link] [comments]
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
I’m trying to do a simple project in R on ball games and sportsbook projections, looking at trends that usually make props go up or down and how the results usually turn out. Any ideas? Any pointers to datasets that would fit that well?
I could focus on basketball and soccer to make the project simpler since they are the two sports I watch the most.
submitted by /u/Tight_Investigator14
[link] [comments]
I was looking at the njparcels website. I’d like similar data for Texas. Is it possible to get it legally from government websites?
submitted by /u/svg_12345
[link] [comments]
Techmap.io just published new datasets of Job Postings from Ireland on Kaggle. You can find them here: https://www.kaggle.com/techmap/datasets
Job Postings from Ireland (October 2020): 58MB
Job Postings from Ireland (October 2021): 56MB
Job Postings from Ireland (October 2022): 101MB
submitted by /u/Techmap_io
[link] [comments]
Hey everyone, I’m taking a machine learning class in college and I want to build a model in R that predicts Klay Thompson’s performance in NBA games. The problem is that I can’t find a cleaned dataset covering all 716 NBA games he’s played, with covariates such as 3-pointers, rebounds, assists, free throws, etc. I found all this info on statmuse.com and on a site that has a record of every game he’s played, but I need help compiling it into a CSV. Can anyone help me do this?
submitted by /u/driftqueenjulie
[link] [comments]
Hi /r/datasets
For a school project I’m working on, I need data on ESG scores (preferably broken down by pillar) for several companies, particularly European ones, but anything goes. Supplementary data on individual ESG criteria would be useful too. Unfortunately, most data sources for this are very expensive or hardly useful, so any suggestions of accessible datasets would be much appreciated! Thanks in advance for any help!
PS: Datasets about operational risks for companies would be interesting too.
submitted by /u/floflo79
[link] [comments]
Is it always a riddle to find the datasets behind a research paper?
Or is it that some papers don’t share them?
For example, here: https://encyclopedia.pub/entry/2267
Shouldn’t they mention whether the data is available or not?
submitted by /u/Professional_Yak9979
[link] [comments]
Looking for a dataset of electronic invoices with the following specs:
Type: Electronic invoices, not scanned docs, US invoices preferably
File Type: Pdf or jpg/png…
Quantity: At least 500 total invoices, preferably over 1,000
Additional details: The dataset needs to contain both correct and incorrect invoices. Incorrect invoices would be invoices that contain errors, inaccuracies or issues in them. Correct invoices need to have a tag in the name that indicates they are correct, same thing for the incorrect invoices. Not sure if this is the best move but I would be ok with having 2 separate datasets, 1 dataset of correct invoices and another dataset of incorrect invoices.
I am also open to suggestions of sites or resources that have invoices for web scraping purposes.
I am willing to provide additional details if it helps.
Thanks in advance!
submitted by /u/souley16
[link] [comments]
Does anyone know where I could get obesity rates by zip or county? I would need them by a level more detailed than the state level. Thank you
submitted by /u/jbr2811
[link] [comments]
I somehow ended up in a data analytics class where I need to prepare a proposal for a fraud-related investigation, and the prof has given us basically no guidance. I need a dataset that I can run at least three different supervised or semi-supervised analytical techniques on. I was thinking of something related to spam email, but I really don’t know what I’m looking for, and I’m struggling to come up with good ideas. Preferably something simple. Any help is greatly appreciated.
submitted by /u/xnickg77
[link] [comments]
I think I already know the answer but want to get other opinions.
I have two large datasets that I had access to in the past. One was shared with me on GitHub and is still available on the owner’s profile. It’s real data, but redacted for HIPAA reasons.
The other is a dataset I was given access to during my capstone project. It’s also redacted and does not have any direct patient identifiers (it does have medical record numbers, which mean nothing to me, but that is the only thing I’m worried about).
Would it be appropriate for me to re-use these data sets and put them up on my portfolio with data visualizations and as ‘data cleaning’ projects?
Any advice is appreciated
submitted by /u/Potential_Lettuce
[link] [comments]
E.g. I’d like data of all of Khabib’s fights in the UFC, and data on his opponents. Most notably what their rank was in their respective weight class at the time of the fight, their record at the time, etc
submitted by /u/alpachino4
[link] [comments]
Does anyone know of datasets that provide data on boycotts? Things like start/end dates, financial impact, industry/ companies impacted, scope of boycott (sq. miles or # of people), type of product, and/ or reason for boycott.
submitted by /u/Neighborhooddataguy
[link] [comments]
I’ve been searching for it, but all I’ve found are a couple of datasets for specific countries and nothing global, neither free nor paid.
What I need is something like: “country – city name – beach name”, it doesn’t have to be a perfect list of world beaches, but at least it should serve as a starting point.
submitted by /u/montesremotedev
[link] [comments]
The information provided in these data has been submitted to the California Safe Cosmetics Program (CSCP) at the California Department of Public Health (CDPH). The primary goal of the CSCP is to gather data on unsafe and potentially hazardous components in cosmetic products available for sale in California and make this information accessible to the public.
Under the California Safe Cosmetics Act, manufacturers, packers, and/or distributors, as indicated on the product label, are required to submit to the CSCP a list of all cosmetic products sold in California that contain any ingredients known or suspected to cause cancer, birth defects, or other developmental or reproductive harm.
Companies with reportable ingredients in their products must provide information to the CSCP if they meet the following criteria:
They have annual aggregate sales of cosmetic products of one million dollars or more.
They have sold cosmetic products in California on or after January 1, 2007.
To view the data: https://app.gigasheet.com/spreadsheet/Cosmetic-Company-Chemicals/26ed23e9_77da_4708_b5da_8bb23c6efcff
Source: https://catalog.data.gov/dataset/chemicals-in-cosmetics-7d6ab
submitted by /u/sheetheadd
[link] [comments]
The datasets I have right now are too big to load into Google Sheets or RStudio. Please suggest ways to load and work with the data.
submitted by /u/Easy-Inflation3123
[link] [comments]
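One common approach to the question above about data too large for Google Sheets or RStudio is to stream the file instead of loading it whole. The following is a minimal sketch in Python using only the standard library. The function name `summarize_csv`, the column names `region` and `amount`, and the tiny demo file are all hypothetical stand-ins, since the post does not describe the actual data.

```python
import csv
import os
import tempfile
from collections import defaultdict


def summarize_csv(path, group_col, value_col):
    """Stream a CSV row by row and return {group: (row_count, total)}.

    Only one row is held in memory at a time, so the file can be far
    larger than what a spreadsheet can open.
    """
    stats = defaultdict(lambda: [0, 0.0])
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            entry = stats[row[group_col]]
            entry[0] += 1
            entry[1] += float(row[value_col])
    return {group: (count, total) for group, (count, total) in stats.items()}


# Demo on a tiny made-up file; a real run would point at the large CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "amount"])  # hypothetical columns
        writer.writerows([["north", 10], ["south", 5], ["north", 2.5]])
    summary = summarize_csv(demo_path, "region", "amount")

print(summary)
```

The same idea applies in other tools: aggregate or filter while reading, and keep only the reduced result in memory.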
I have a project due where I need to run 5 different linear regressions in Python on a cybersecurity topic such as cyberattacks, fake news, cyber intrusions, identity theft, malware, etc. I need a dataset of 200 lines as a CSV file. I know how to write the code, but finding a good dataset with numeric values is so hard!
submitted by /u/AmericanArsenal17
[link] [comments]
I created a dataset for analyzing crypto price data across a large number of coins traded on Ethereum.
The dataset can be viewed and downloaded from Kaggle here: https://www.kaggle.com/datasets/martkir/historical-ohlc-crypto-price-data-for-1900-coins
I also uploaded the code on Github if you want to reproduce the dataset and/or download fresh data. Link here: https://github.com/martkir/crypto-prices-download
I created the dataset because I couldn’t find a good / free place to download historical price data that was granular (1 min resolution) for a large enough cross section of coins.
Centralized exchanges (e.g. Binance, Kraken) have APIs but only for a small subset of tokens – which misses a lot of the small-cap coins traded on DEXs with interesting statistical properties.
Anyway, hope some of you find this dataset useful 🙂
submitted by /u/112129
[link] [comments]
This might not be the best place for this question. Pointing me to a better forum would be appreciated if that’s true.
I live in Seattle, WA, which has a reputation for being rainy. But it’s not a well deserved one. There are cities in Florida that get more rain than us, for example.
After living here for 20 years, I’m convinced that what makes Seattle noteworthy is rather how dark it is. But any time I try to research this, it’s a dead end. All sources of data break things down into the binary of cloudy / sunny. Usually by day. One infographic I found at least had the nuance to use hours of sunshine.
I’m looking for a source that breaks cities down by average lux over the course of a year. With a smooth range from 120,000 lux down to 10,000 for full daylight, a range of 1,000 to 5 lux for cloud cover, and presumably 10,000 to 1,000 for some sort of partial cloud cover, it seems like there’s a ton of nuance possible here beyond “sunny” or “cloudy”.
With 10% or so of Americans being impacted by seasonal affective disorder, I’m confused why this information isn’t more in demand. I want to look at the big picture of average yearly light exposure.
But I also want my weather app to predict lux for tomorrow. How bright will it be at noon? I want people to have access to the vocabulary of lux like we’ve recently developed the vocabulary of air quality. “Wow, yesterday only got up to 10 lux in Seattle!”
It seems more significant to me than what time sunrise and sunset are, or what the humidity is, but I can’t find evidence that anyone is tracking this information at all 🫤
Can anyone point me to the secret database of global lux records?
submitted by /u/tigerproofrock
[link] [comments]
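The lux bands described in the post above could be turned into a simple yearly summary if readings were available. This is a minimal sketch, assuming a list of daytime lux readings for one city. The readings are made up, the function names are hypothetical, and the "below range" bucket for readings under 5 lux is an added assumption that the post does not mention.

```python
def lux_band(lux):
    """Map a lux reading to the bands given in the post."""
    if lux >= 10_000:
        return "full daylight"   # 10,000 to 120,000 lux
    if lux >= 1_000:
        return "partial cloud"   # 1,000 to 10,000 lux
    if lux >= 5:
        return "cloud cover"     # 5 to 1,000 lux
    return "below range"         # assumption: not defined in the post


def summarize_lux(readings):
    """Return (mean lux, share of readings falling in each band)."""
    mean_lux = sum(readings) / len(readings)
    counts = {}
    for lux in readings:
        band = lux_band(lux)
        counts[band] = counts.get(band, 0) + 1
    shares = {band: n / len(readings) for band, n in counts.items()}
    return mean_lux, shares


# Made-up readings, for illustration only.
sample = [50_000, 8_000, 400, 10, 2_000, 120_000]
mean_lux, shares = summarize_lux(sample)
print(round(mean_lux), shares)
```

Reporting the band shares next to the mean matters because a few very bright hours can dominate an average.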
This post is self-promotional, but I genuinely feel it can offer value to this community to discuss our plans, expose our free datasets, and take feedback on what datasets you would like to see on Snowflake:
https://www.snowflake.com/blog/snowflake-invests-cybersyn-bringing-unique-data-products-to-marketplace/
https://www.cybersyn.com/blog-series-a/
Find all of our products directly here: https://app.snowflake.com/marketplace/listings/Cybersyn%2C%20Inc
submitted by /u/aiatco2
[link] [comments]
Hi, we are currently testing the effect of circadian rhythms on short term recall. The instructions are pretty simple. Download this app (https://apps.apple.com/us/app/short-term-memory/id804088277), play levels 4 and 8 using only 15 seconds to memorize the items. Record how many items you were able to recall for each level. The caveat is that you need to do this once in the evening, and once in the morning. That is the whole purpose of the experiment. Thank you for the participation! You can post your results in the comments or DM me.
submitted by /u/Trevor-Dustin
[link] [comments]
I’m trying to analyze the “nbasal” dataset based on position.
When I run these lines:
model1 = lm(wage ~ exper, data = center_players) # regression on center players
summary(model1)
The output is this
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1184.64     355.26   3.335  0.00174 **
exper          80.21      51.22   1.566  0.12450
when I run this:
model2 = lm(wage ~ exper + points, data = center_players)
summary(model2)
the output is this:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -33.43     244.90  -0.136   0.8921
exper          71.13      29.86   2.382   0.0217 *
points        149.05      16.02   9.306  7.3e-12 ***
I don’t understand how each point increases salary by 149.05 while the intercept becomes -33. Can someone explain this to me?
submitted by /u/Expensive-Still7318
[link] [comments]
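On the regression question above: in a model with several predictors, each coefficient is the estimated change in wage for a one-unit change in that predictor with the others held fixed, and the intercept is the predicted wage when every predictor is zero. When a predictor is added, the intercept no longer has to absorb its average effect, so it can shift a lot. In the posted output the new intercept of -33.43 has a standard error of 244.90 and a p-value of 0.8921, so it is not distinguishable from zero. The sketch below illustrates the shift with made-up numbers, not the nbasal data. It is written in Python rather than the post's R purely for illustration, and `fit_ols` is a hypothetical helper.

```python
def fit_ols(predictors, y):
    """Ordinary least squares via the normal equations.

    predictors: one list of covariate values per observation.
    Returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + [float(v) for v in row] for row in predictors]
    k = len(X[0])
    # Build (X'X) and (X'y).
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    b = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef


# Made-up data: wage is built as exactly -30 + 70*exper + 150*points.
exper = [1, 2, 3, 4, 5, 6]
points = [8, 12, 5, 15, 9, 11]
wage = [-30 + 70 * e + 150 * p for e, p in zip(exper, points)]

model1 = fit_ols([[e] for e in exper], wage)                      # wage ~ exper
model2 = fit_ols([[e, p] for e, p in zip(exper, points)], wage)   # wage ~ exper + points
print("wage ~ exper:          ", [round(c, 2) for c in model1])
print("wage ~ exper + points: ", [round(c, 2) for c in model2])
```

In this toy data the one-predictor fit has a large positive intercept because it has to account for the average contribution of points. Once points enters the model, the intercept drops to the value used to build the data. Note that these coefficients describe associations in the sample, not a guaranteed raise per point scored.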
Hi all, I’m new to working with the PPMI dataset for my research project and need SBR values (LC, RC, LP, RP) and CSF markers (ptau, total tau, beta, alpha-syn). I’m finding it really confusing to work out where I can get the CSV files for these. Could someone help me out? It’s kind of urgent.
submitted by /u/unicorn262001
[link] [comments]
Nice idea to use ChatGPT. It would be great if someone took on the task of creating open datasets, so that resources wouldn’t be wasted on work that has already been done.
submitted by /u/KMiNT21
[link] [comments]