Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Anthropic RLHF Dataset: Human Preference Data (+ Errors I Found)

Hello friends!

I recently found this RLHF-style dataset while browsing Hugging Face Datasets. With Reinforcement Learning from Human Feedback (RLHF) becoming the primary way to train AI assistants, it’s great to see organizations like Anthropic making their RLHF dataset publicly available (released as: hh-rlhf).

Like other RLHF datasets, every example in this one includes an input prompt and two outputs generated by the LLM: a chosen output and a rejected output, where a human rater preferred the former over the latter.
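To make the prompt/chosen/rejected layout concrete, here is a minimal sketch of pulling one preference pair apart. It assumes each record stores two full transcripts that share everything up to the final "Assistant:" turn, which is my reading of the hh-rlhf release rather than something stated above; `PreferencePair` and `split_pair` are hypothetical names made up for illustration.

```python
from typing import NamedTuple

# Assumed turn marker; transcripts are assumed to look like
# "\n\nHuman: ...\n\nAssistant: ..."
ASSISTANT_TAG = "\n\nAssistant:"


class PreferencePair(NamedTuple):
    prompt: str    # shared dialogue, up to and including the last "Assistant:" tag
    chosen: str    # reply the human rater preferred
    rejected: str  # reply the human rater did not prefer


def split_pair(chosen_text: str, rejected_text: str) -> PreferencePair:
    """Split two full transcripts into a shared prompt and two final replies."""
    cut = chosen_text.rfind(ASSISTANT_TAG)
    if cut == -1:
        raise ValueError("no assistant turn found in the chosen transcript")
    cut += len(ASSISTANT_TAG)
    prompt = chosen_text[:cut]
    # Both transcripts should agree on everything before the final reply.
    if not rejected_text.startswith(prompt):
        raise ValueError("transcripts do not share the same prompt")
    return PreferencePair(prompt, chosen_text[cut:].strip(), rejected_text[cut:].strip())


if __name__ == "__main__":
    # Toy transcripts written for this example, not taken from the dataset.
    pair = split_pair(
        "\n\nHuman: What is 2+2?\n\nAssistant: 4.",
        "\n\nHuman: What is 2+2?\n\nAssistant: I don't know.",
    )
    print(pair.chosen, "|", pair.rejected)
```

Raising an error when the two transcripts disagree on the prompt is a deliberate choice: records like that are worth inspecting by hand rather than silently keeping.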

submitted by /u/cmauck10

Dataset For Hyper-partisan Or Politically Valenced Misinformation Articles In The UK.

I’m looking for a dataset containing fact-checked news articles relevant to the UK political context. This is for a study manipulating politically congruent vs incongruent misinformation and attitudes towards it.

I’ve been looking around for ages and am pretty sure that none exists (this kind of thing is understudied in the UK) but would very much appreciate suggestions of places to look, thanks 🙂

submitted by /u/Grouchy_Preparation1

Real World Sales Datasets? Any Good Datasets That I Could Use For My Power BI Portfolio As I Interview For Jobs?

I want to create a few Power BI dashboards for my public analytics portfolio site and am looking for sales datasets. I want to use real world sales data (not mock data) and am trying to find sales data that would interest a wide variety of audiences, since I’ll be interviewing at a variety of different companies/organizations for my first official data analytics job. Ideally the dataset would be fairly “generic” and straightforward, so it won’t require a lot of explanation ahead of time (for example, something “generic” like Amazon sales data, except I assume Amazon doesn’t release their confidential sales data LOL).

I’m also looking at a lot of datasets on Kaggle, GitHub, etc, but I wanted to check if there were any other good sales datasets that you would recommend for this purpose (an entry-level analytics portfolio). I would greatly appreciate it! 😊

Any ideas?

submitted by /u/Expert-Rhubarb-987

Can Someone Please Help Me Compile Klay Thompson Data Into A CSV

Hey everyone, I’m taking a machine learning class in college and I want to build an R model that predicts Klay Thompson’s performance in NBA games. The problem is I can’t find a cleaned dataset with data from all 716 NBA games he’s played, with all the covariates such as 3 pointers, rebounds, assists, free throws, etc. I found all this info on statmuse.com, which has a record of all the games he’s played, but I need help compiling it into a CSV. Can anyone help me do this?
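For the compiling step itself, a minimal sketch in Python (the resulting file can then be loaded in R with `read.csv`). It assumes you already have one record per game from wherever you collected the stats; the column names in `FIELDS` and the `write_game_log` helper are hypothetical, and the demo row is a made-up placeholder, not a real box score.

```python
import csv
import io
from typing import Iterable, Mapping, TextIO

# Hypothetical column names; rename them to match whatever your source provides.
FIELDS = [
    "game_date", "opponent", "points", "three_pointers",
    "rebounds", "assists", "free_throws",
]


def write_game_log(rows: Iterable[Mapping[str, object]], out: TextIO) -> int:
    """Write one CSV line per game and return the number of games written.

    Missing stats are left blank and unknown keys are ignored, so partially
    collected rows do not break the export.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="", extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder row for illustration only.
    demo = [{"game_date": "2000-01-01", "opponent": "AAA", "points": 1}]
    buffer = io.StringIO()
    print(write_game_log(demo, buffer), "game(s) written")
    print(buffer.getvalue())
```

To write a real file, pass `open("games.csv", "w", newline="")` instead of the `StringIO` buffer.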

submitted by /u/driftqueenjulie

Looking For Accessible ESG Datasets For School Project

Hi /r/datasets

For a school project I’m working on, I need data about ESG scores (preferably detailed for each pillar) for several companies (particularly European ones, but anything goes). Supplementary data about different ESG criteria would be useful too. Unfortunately, most data sources for this are very expensive or hardly useful, so any suggestions of accessible datasets like these would be much appreciated! Thanks in advance for any help!

PS: datasets about operational risks for companies would be interesting too

submitted by /u/floflo79

Looking For Dataset Of Correct And Incorrect Electronic Invoices

Looking for a dataset of electronic invoices with the following specs:

Type: Electronic invoices, not scanned docs, US invoices preferably

File Type: PDF or JPG/PNG…

Quantity: At least 500 total invoices, preferably over 1,000

Additional details: The dataset needs to contain both correct and incorrect invoices. Incorrect invoices would be invoices that contain errors, inaccuracies or issues in them. Correct invoices need to have a tag in the name that indicates they are correct, same thing for the incorrect invoices. Not sure if this is the best move but I would be ok with having 2 separate datasets, 1 dataset of correct invoices and another dataset of incorrect invoices.
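As a rough sketch of how the name tag could be consumed later, the snippet below reads a label back out of a filename. The convention it assumes (the words "correct" or "incorrect" appearing as a separate token, as in `inv_0001_correct.pdf`) is purely hypothetical, as is the `label_from_name` helper.

```python
import re
from pathlib import PurePath


def label_from_name(filename: str) -> str:
    """Return "correct" or "incorrect" based on a tag in the file name.

    The name is split into tokens first, because a plain substring check
    would match "correct" inside "incorrect".
    """
    stem = PurePath(filename).stem.lower()
    tokens = re.split(r"[_\-.\s]+", stem)
    if "incorrect" in tokens:
        return "incorrect"
    if "correct" in tokens:
        return "correct"
    raise ValueError(f"no correctness tag in file name: {filename!r}")


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    for name in ["inv_0001_correct.pdf", "inv_0002_incorrect.png"]:
        print(name, "->", label_from_name(name))
```

Files without a tag raise an error instead of getting a default label, so unlabeled invoices cannot slip into either class unnoticed.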

I am also open to suggestions of sites or resources that have invoices for web scraping purposes.

I am willing to provide additional details if it helps.

Thanks in advance!

submitted by /u/souley16

Looking For A Good Fraud Data Set For A Class Project, Not Very Knowledgeable.

I somehow ended up in a data analytics class where I need to prepare a proposal for an investigation related to fraud, and the prof has basically given us no insight. I need a data set that I can run at least three different supervised or semi-supervised analytical techniques on. I was thinking something related to spam email, but I really don’t know what I’m looking for. Struggling to come up with good ideas. Preferably something simple. Any help is greatly appreciated.

submitted by /u/xnickg77

Is It Ethical Or I Guess Allowed For Me To Use A Prior Data Set For Practice?

I think I already know the answer but want to get other opinions.

I have two large data sets that I had access to in the past. One was shared with me on GitHub and is still available on their profile. It’s real data but redacted for HIPAA reasons.

The other is a data set I was given access to during my capstone project. It’s also redacted and does not have any direct patient identifiers (it has medical record numbers, but these mean nothing to me; this is the only thing I’m worried about).

Would it be appropriate for me to re-use these data sets and put them up on my portfolio with data visualizations and as ‘data cleaning’ projects?

Any advice is appreciated

submitted by /u/Potential_Lettuce

Does Any World Beaches Dataset Exist?

I’ve been searching for one, but all I’ve found are a couple of datasets for specific countries and nothing global, neither free nor paid.

What I need is something like: “country – city name – beach name”, it doesn’t have to be a perfect list of world beaches, but at least it should serve as a starting point.

submitted by /u/montesremotedev

Reported Chemicals In Makeup Dataset

The information provided in these data has been submitted to the California Safe Cosmetics Program (CSCP) at the California Department of Public Health (CDPH). The primary goal of the CSCP is to gather data on unsafe and potentially hazardous components in cosmetic products available for sale in California and make this information accessible to the public.

Under the California Safe Cosmetics Act, for all cosmetic products sold in California, the manufacturers, packers, and/or distributors named on the product label are required to submit to the CSCP a list of all cosmetic products that contain any ingredients known or suspected to cause cancer, birth defects, or other developmental or reproductive harm.

Companies with reportable ingredients in their products must provide information to the CSCP if they meet the following criteria:

They have annual aggregate sales of cosmetic products of one million dollars or more.

They have sold cosmetic products in California on or after January 1, 2007.

To view the data: https://app.gigasheet.com/spreadsheet/Cosmetic-Company-Chemicals/26ed23e9_77da_4708_b5da_8bb23c6efcff

Source: https://catalog.data.gov/dataset/chemicals-in-cosmetics-7d6ab

submitted by /u/sheetheadd