Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chicks are. But what do they need it all for? World domination?

English Premier League First Half Vs Second Half Data By Match

Hi! Does anyone know where I could get detailed data on English Premier League soccer games, with stats broken down by first and second half for each match?

I've seen datasets that have half-time and full-time scores, but I'm after more detailed stats (possession, shots on target, etc.).

I'm mostly after recent data (the 2022-2023 season) but would be open to historical data as well.

Would appreciate it if someone could point me in the right direction!
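
For what it's worth, first-half vs second-half goals can at least be derived from half-time and full-time scores. Below is a minimal pandas sketch assuming the football-data.co.uk CSV layout (the URL, season code, and the FTHG/FTAG/HTHG/HTAG column names follow that site's convention and may need adjusting; per-half possession and shot stats would still need a richer, event-level source):

```python
# A minimal sketch: derive first-half vs second-half goals from half-time/full-time
# scores. Column names follow the football-data.co.uk convention; the URL and season
# code are assumptions, so adjust to whatever source you end up using.
import pandas as pd

url = "https://www.football-data.co.uk/mmz4281/2223/E0.csv"  # 2022-23 Premier League
matches = pd.read_csv(url)

matches["first_half_goals"] = matches["HTHG"] + matches["HTAG"]
matches["second_half_goals"] = (matches["FTHG"] + matches["FTAG"]) - matches["first_half_goals"]

print(matches[["Date", "HomeTeam", "AwayTeam", "first_half_goals", "second_half_goals"]].head())
```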

submitted by /u/questily
[link] [comments]

Looking For Time Of Birth Data Or Datasets

Hello everyone, I’m new to this site so I hope I’m posting in the right section.

I am looking for data on the time and date of birth for large numbers of people. I have looked at the HHS website and the Natality data they publish, but I couldn't find any information on the time of birth.

Is there perhaps another way for me to find that, somewhere? Many thanks!

submitted by /u/cxvdxuxj
[link] [comments]

[self-promotion] Feedback Needed: Building Git For Data That Commits Only Diffs (for Storage Efficiency On Large Repositories), Even Without Full Checkouts Of The Datasets

I would really appreciate feedback on a version control system for tabular datasets that I am building, the Data Manager.

Main characteristics:

- Like DVC and Git LFS, it integrates with Git itself.
- Like DVC and Git LFS, it can store large files on AWS S3 and link them in Git via an identifier.
- Unlike DVC and Git LFS, it calculates and commits diffs only, at row, column, and cell level. For append scenarios the commit includes new data only; for edits and deletes, a small diff is committed accordingly. With DVC and Git LFS, the entire dataset is committed again instead: committing 1 MB of new data 1,000 times to a 1 GB dataset yields more than 1 TB in DVC (a dataset that grows linearly from 1 GB to 2 GB, committed 1,000 times, results in a repository of ~1.5 TB), whereas it sums to 2 GB (the 1 GB original dataset plus 1,000 × 1 MB of changes) with the Data Manager; see the back-of-the-envelope sketch after this list.
- Unlike DVC and Git LFS, the diffs for each commit remain visible directly in Git.
- Unlike DVC and Git LFS, the Data Manager allows committing changes to datasets without full checkouts on localhost. You check out kilobytes and can append data to a dataset in a repository of hundreds of gigabytes. Changes on a no-full-checkout branch need to be merged into another branch (on a machine that does operate with full checkouts) to be validated, e.g., against adding a primary key that already exists.
- Since the repositories contain diff histories, snapshots of the datasets at a given commit have to be recreated to be deployable. These can be automatically uploaded to S3 and labeled with the commit hash via the Data Manager.
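
As a back-of-the-envelope sketch of the storage arithmetic above (the numbers mirror the append scenario described in the list; they are illustrative, not benchmarks):

```python
# Compare snapshot-based storage (every commit stores the full dataset) with
# diff-based storage (initial dataset plus the appended rows only).
GB = 1024  # work in MB for simplicity
initial_mb = 1 * GB   # 1 GB starting dataset
commits = 1000
delta_mb = 1          # 1 MB appended per commit

# Snapshot-based (DVC / Git LFS style): each commit stores the whole, growing dataset.
snapshot_total = sum(initial_mb + i * delta_mb for i in range(1, commits + 1))

# Diff-based (Data Manager style): the original dataset plus the deltas only.
diff_total = initial_mb + commits * delta_mb

print(f"snapshot-based repo: ~{snapshot_total / GB / 1024:.2f} TB")   # ~1.45 TB
print(f"diff-based repo:     ~{diff_total / GB:.2f} GB")              # ~1.98 GB
```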

Links:

https://news.ycombinator.com/item?id=35930895
https://news.ycombinator.com/item?id=35806843

This paradigm enables hibernating or cleaning up history on S3 for old datasets, if these are deleted in Git and snapshots of earlier commits are no longer needed. Individual data entries can also be removed for GDPR compliance using versioning on S3 objects, orthogonal to Git.

I built the Data Manager for a pain point I was experiencing: it was impossible to (1) uniquely identify and (2) make available behind an API multiple versions of a collection of datasets and config parameters, (3) without overburdening HDDs due to small but frequent changes to any of the datasets in the repo, and (4) while being able to see the diffs in Git for each commit, in order to enable collaborative discussion, reverting, or further editing if necessary.

Some background: I am building natural language AI algorithms that are (a) easily retrainable on editable training datasets, meaning changes or deletions in the training data are reflected quickly, without traces of past training and without retraining the entire language model (which sounds impossible), and (b) able to trace decisions back to individual training data.

I look forward to constructive feedback and suggestions!

submitted by /u/Usual-Maize1175
[link] [comments]

Excel Sheet Data Processing Help – Separating Helicopter Data From A FOIA Excel Sheet

Data Processing

Hey, I need help processing data. My friend offered me a helicopter ride (we met through someone) in a certain city in the US in January of 2022 … I lost contact with the person who connected us, and the helicopter guy never gave me his name 😭 (he has an electrical engineering license from, I'm assuming, Florida… He owns a house in this city in South Florida).

Fast forward. I filed a FOIA (Freedom of Information Act) request for all helicopters in that city in January 2022 … Fewer than 20 total. Easy. So what happened?

My FOIA response came in ANDDD, according to the letter, they couldn't separate the rotorcraft (helicopters) from the fixed-wing aircraft (small planes 😭) for January 2022.

January 2022 was a VERY BUSY month for planes… It’s going to be an insane amount of data.

(It's probably over 10 pages, with info for about 50 aircraft per page in 11 pt font.)

How do I sort the helicopters out of the data? It was like 15 helis maximum.

HOWEVER…

You can also download a list of all persons with pilot licenses. This guy has a pilot license (he owns and flies his helicopter)… It's on a sheet that identifies the type of license, with something like "P/H" for helicopter… How do you sort all the helicopter owners out of this sheet?

It comes as an Excel sheet.
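
For anyone attempting the license-sheet part, a minimal pandas sketch might look like the following; the file name, the column name, and the "P/H" value are placeholders to check against the actual headers in the downloaded sheet:

```python
# A minimal sketch for filtering the license spreadsheet. "pilot_licenses.xlsx",
# "Certificate Type", and "P/H" are placeholders, not the real FOIA/FAA field names.
import pandas as pd

pilots = pd.read_excel("pilot_licenses.xlsx")

# Keep only rows whose certificate/rating column mentions the helicopter rating.
helicopter_pilots = pilots[
    pilots["Certificate Type"].astype(str).str.contains("P/H", case=False, regex=False, na=False)
]

helicopter_pilots.to_excel("helicopter_pilots.xlsx", index=False)
print(f"{len(helicopter_pilots)} helicopter pilots out of {len(pilots)} rows")
```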

Please advise!!

submitted by /u/Soggy-Nectarine-3578
[link] [comments]

Where Can I Download The Cairo Dataset?

Cairo – Cairo University's dataset consists of a total of 610 student answers: 10 answers to each of 61 questions. These are collected from a single chapter of the official Egyptian curriculum for the Environmental Science course. The average length of a student's answer is 2.2 sentences, 20 words, or 103 characters. The dataset contains a collection of students' responses along with their grades, which vary between 0 and 5 according to the assessment of two human evaluators. An English version of the Cairo University dataset is also available for research in this area. This dataset can be downloaded from the webpage.

The link refers to http://www.aucegypt.edu/src/datasets.htm, but unfortunately the link is dead. And I can’t find any other link.

Basically, I need a dataset of questions, correct answers, students' answers, and their grades (assigned by a human). I want to benchmark my method for automatic short-answer grading, so if you know of any similar dataset, please let me know.

Thank you.

submitted by /u/yokowasis2
[link] [comments]

Value Of 2.8 Million African Student ID Pictures

Being a datahoarder, I stumbled on a way to harvest student ID pictures from an exam authority in sub-Saharan Africa. No illegal hacking involved, just exploiting a predictable URL format.

I have now gathered 2.8 million of them, about 90 GB, spanning about a decade of student exams. Typical ID format, face and shoulders only, often quite small (20-50 KB), with no metadata besides year, exam type, and region.

Is there any monetary value to this? Any open source projects that need such data?

submitted by /u/Joonicks
[link] [comments]

London Stock Exchange Daily Prices Wanted

I am looking for some historic stock exchange prices to analyse. I notice a few sites seem to have them for sale, but does anyone know of any open-source or community-created ones? I'd prefer the LSE, but any stock exchange would do for a first look.

I would like a dataset of about 10 years' worth of daily prices for 100 or more stocks. The smaller sets I have seen tend to include open, close, low, high, and volume values.

I want to try some trading strategies on historic data.
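
Not an official LSE feed, but one freely accessible starting point is Yahoo Finance, which serves daily OHLCV history for LSE-listed tickers (the ".L" suffix) via the yfinance package. A rough sketch, where the ticker list is just an illustration:

```python
# A sketch of pulling ~10 years of daily OHLCV data for LSE-listed tickers from
# Yahoo Finance. The ticker selection and date range are illustrative placeholders.
import yfinance as yf

tickers = ["VOD.L", "BP.L", "HSBA.L"]  # Vodafone, BP, HSBC on the LSE
data = yf.download(tickers, start="2013-01-01", end="2023-01-01", group_by="ticker")

print(data["VOD.L"][["Open", "High", "Low", "Close", "Volume"]].tail())
```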

submitted by /u/brainburger
[link] [comments]

Fed Funds Rate (FFR) Futures Historical Data?

I have a nice little ipynb doing data analysis on Fed funds futures rates. However, the available data is rolling, and I haven't been logging results to a DB to save them for myself.

Is there a way I can access all historical FFR futures data?

For reference, this is what i’m using: https://www.cmegroup.com/markets/interest-rates/cme-fedwatch-tool.html?redirect=/trading/interest-rates/countdown-to-fomc.html

Edit: I’m using automation to scrape all the files from the “download” link/tab on the left
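
One stopgap for the rolling-window problem is to snapshot each scrape into a local SQLite table so history accumulates over time. A rough sketch, where the file name, table name, and CSV layout are assumptions to adapt to the actual downloads:

```python
# Append each scraped FedWatch download to a local SQLite table, stamped with the
# retrieval time, so the rolling data builds up into a history.
import sqlite3
from datetime import datetime, timezone

import pandas as pd

snapshot = pd.read_csv("fedwatch_download.csv")  # today's scraped file (placeholder name)
snapshot["retrieved_at"] = datetime.now(timezone.utc).isoformat()

with sqlite3.connect("ffr_history.db") as conn:
    snapshot.to_sql("fedwatch_snapshots", conn, if_exists="append", index=False)
    total = conn.execute("SELECT COUNT(*) FROM fedwatch_snapshots").fetchone()[0]
    print(f"{total} rows logged so far")
```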

submitted by /u/throwawayrandomvowel
[link] [comments]

Dataset Of Examples Of Logical Fallacies?

I'm working on a project that is going to require a dataset of logical fallacies (and their classification). This has been quite a tricky task, and so far I have come across just one dataset, linked to the paper "Logical Fallacy Detection" by Z. Jin (2022). If anyone is aware of any other examples, or of possible websites to scrape, that would help. Thanks!

submitted by /u/CrossingPearl
[link] [comments]

Trying To Create A Spam Voicemail Dataset

Hey guys, I am working on a project to help predict whether a voicemail is spam! I am building the dataset; I have around 300 voicemails, almost half of which are spam and the rest are not. I want to create a dataset of at least 500-1000 voicemails.

So I am asking that anyone share their spam voicemails and/or normal voicemails (which can be non-personal). They can be in any audio format and shared however you are comfortable!

submitted by /u/thebatgamer
[link] [comments]

English Words “familiarity” Dataset.

There are plenty of word frequency lists, but plurals, adjectives, and adverbs derived from the same word end up in different positions in these lists.

I'm looking for a dataset, or a way to create a dataset, that has all forms of one word clumped together, so it's less about raw frequency and more about how familiar the word (and its different forms) is, if that makes sense.

For instance, I have a list where the word "have" is in 25th place, "has" is 39th, and "had" is 105th. Clearly, anyone who knows one of these words would know the other two as well.
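
One way to build such a list is to collapse an existing frequency list onto lemmas, so that "have", "has", and "had" all contribute to a single score. A rough sketch using NLTK's WordNetLemmatizer (the counts below are placeholders, and proper POS tagging would make the mapping more accurate than the verb-then-noun fallback used here):

```python
# Collapse a (word, count) frequency list onto lemmas to approximate "familiarity".
from collections import defaultdict

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

# Placeholder counts for illustration only.
frequencies = [("have", 8134), ("has", 6012), ("had", 3988), ("dog", 512), ("dogs", 230)]

familiarity = defaultdict(int)
for word, count in frequencies:
    lemma = lemmatizer.lemmatize(word, pos="v")       # try the verb form first
    if lemma == word:
        lemma = lemmatizer.lemmatize(word, pos="n")   # fall back to the noun form
    familiarity[lemma] += count

for lemma, score in sorted(familiarity.items(), key=lambda kv: -kv[1]):
    print(lemma, score)
```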

Apologies if I did not get my point across clearly. Any help is appreciated. Thanks!

submitted by /u/haskpro1995
[link] [comments]

Stanford Cars (cars196) Contains Many Fine-Grained Errors

Hey Redditors,

I know the cars196 dataset is nothing new, but I wanted to share some label errors and outliers that I found within it.

It's interesting to note that the primary goal of the original paper that curated/used this dataset was "fine-grained categorization", meaning discerning the differences between something like a Chevrolet Cargo Van and a GMC Cargo Van. I found numerous examples of images with very nuanced mislabelling, which runs directly counter to the task they sought to research.

Here are a few examples of nuanced label errors that I found:

- Audi TT RS Coupe labeled as an Audi TT Hatchback
- Audi S5 Convertible labeled as an Audi RS4
- Jeep Grand Cherokee labeled as a Dodge Durango

I also found examples of outliers and generally ambiguous images:

- multiple cars in one image
- top-down style images
- vehicles that didn't belong to any of the classes

I found these issues to be pretty interesting, yet I wasn’t surprised. It’s pretty well known that many common ML datasets exhibit thousands of errors.

If you’re interested in how I found them, feel free to read about it here.
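
Not necessarily the approach used here, but one common open-source route to surfacing label issues like these is confident learning via the cleanlab package, given out-of-sample predicted probabilities from any classifier. A sketch, where the file names are placeholders:

```python
# A sketch of automated label-error detection with cleanlab (confident learning).
# Assumes you already trained a classifier on cars196 and saved out-of-sample
# predicted probabilities (e.g., from cross-validation) plus the given labels.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.load("cars196_labels.npy")          # given (possibly noisy) labels, shape (N,)
pred_probs = np.load("cars196_pred_probs.npy")  # out-of-sample probabilities, shape (N, 196)

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # most suspicious examples first
)
print(f"{len(issue_indices)} potential label errors; top 10 indices: {issue_indices[:10]}")
```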

submitted by /u/cmauck10
[link] [comments]

Which Templates Do You Like The Most?

This is for all data enthusiasts out there!
We’ve launched Plug & Play Data Templates on Product Hunt today! 🥳
Our data templates are a step-by-step walkthrough for 50+ use cases with pre-baked, interactive SQL queries, covering 5 critical categories: Product Analytics, Customer Analytics, Sales Analytics, Marketing Analytics, and Finance Analytics.
Please check us out here! 👉🏻https://www.producthunt.com/posts/plug-play-data-templates

submitted by /u/AirbookIO
[link] [comments]

BuzzFeed News “Trending” Strip, 2018–2023

The file contains 3.1 million rows, each representing one article observed at one point in time.

The file uses these columns:

- timestamp: The time (in UTC) of the fetch. All articles from the same fetch will have the same timestamp.
- position: The article's zero-indexed position in the trending strip, from left to right.
- text: The text of the link used to highlight the article. Note: sometimes the same article is associated with different text at different points in time.
- url: The link's URL. Note: sometimes (although relatively rarely) the URL for the same underlying article changes over time.

Note: Although the script generally ran every five minutes, there are some gaps in the data, accounting for roughly 3% of the total time period covered. These gaps owe to two main factors: technical complications (such as server downtime) and periods during which the website swapped out the trending strip with breaking news alerts, single-story highlights, or other notices. Unfortunately, I did not have the foresight to collect data that would distinguish between those scenarios.
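
As a quick illustration of working with these columns, here is a pandas sketch (the file name is a placeholder) that estimates how long each article stayed in the trending strip:

```python
# Estimate each article's time in the trending strip from the columns described above.
import pandas as pd

df = pd.read_csv("buzzfeed_trending.csv", parse_dates=["timestamp"])

# How long did each article stay in the trending strip?
visibility = (
    df.groupby("url")["timestamp"]
      .agg(first_seen="min", last_seen="max", observations="count")
      .assign(hours_visible=lambda d: (d["last_seen"] - d["first_seen"]).dt.total_seconds() / 3600)
      .sort_values("hours_visible", ascending=False)
)
print(visibility.head())
```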

submitted by /u/brianckeegan
[link] [comments]

Looking For Feedback On The New Standards, Data Sources And Methods Hub / Dites-nous Ce Que Vous En Pensez Du Carrefour Des Normes, Sources De Données Et Méthodes [self-promotion]

Statistics Canada has added new features to enhance the overall data user experience on the Standards, Data Sources and Methods Hub. With its improved design, new frequently asked questions section, and quick-access links to resources, the hub is meant to be a one-stop shop for data users, statisticians, and others for:

- variables and classifications
- survey methodology
- key aspects of data quality
- direct access to questionnaires

Explore the hub and tell us what you think, so we can make sure this page meets your needs!

Visit the Standards, Data Sources and Methods Hub.

[We are Canada’s national statistical agency. We are here to engage with Canadians and provide them with high-quality statistical information that matters! Publishing in a subreddit does not imply we endorse the content posted by other redditors.]

***

Improvements have been made to Statistics Canada's Standards, Data Sources and Methods Hub (Carrefour des normes, sources de données et méthodes) to make the user experience more convenient. With its improved design, new Frequently Asked Questions section, and quick-access links to resources, the hub is intended to be a one-stop shop where data users, statisticians, and others will find everything they need on:

- variables and classifications
- survey methodology
- data quality
- direct access to questionnaires

Explore the hub and tell us what you think; we want to make sure it meets your needs!

Visit the Standards, Data Sources and Methods Hub.

[We are Canada's national statistical agency. We are here to engage with Canadians and provide them with high-quality statistical information that matters! Posting in a subreddit does not mean we endorse the content posted by other Reddit users.]

submitted by /u/StatCanada
[link] [comments]

NBA March Madness Or Other Internal-to-company Sports-prediction Related Datasets

I want to answer a question about whether companies that run sports-guessing competitions make good predictions in aggregate, despite the fact that many people in the organisation don't care about sports at all and just pick at random.

What I'm looking for is data from a company that ran a tipping competition for some sporting event, where I can analyse the answers (e.g. if it was the NBA, I can look up the results myself; if it was your internal squash competition, that's fine too, as long as the data includes who actually won as well as the predictions of who would win).
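
A rough sketch of that analysis, assuming a long-format table with hypothetical columns match_id, person_id, predicted_winner, and actual_winner, comparing the average individual against the majority vote:

```python
# Does the majority ("crowd") prediction beat the average individual tipper?
# The file and column names are placeholders for whatever the shared data contains.
import pandas as pd

tips = pd.read_csv("office_tipping_competition.csv")

# Accuracy of the average individual prediction.
individual_acc = (tips["predicted_winner"] == tips["actual_winner"]).mean()

# Accuracy of the crowd: majority vote per match.
crowd = (
    tips.groupby("match_id")
        .agg(crowd_pick=("predicted_winner", lambda s: s.mode().iloc[0]),
             actual=("actual_winner", "first"))
)
crowd_acc = (crowd["crowd_pick"] == crowd["actual"]).mean()

print(f"average individual accuracy: {individual_acc:.1%}")
print(f"majority-vote accuracy:      {crowd_acc:.1%}")
```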

In return I’ll hopefully be able to confirm that there’s a way of maximising your total return on next year’s competition.

submitted by /u/solresol
[link] [comments]

Thesis Help Statistically Significant Data

Hey guys, I need some help. I'm using Statista and working on my thesis.

I have binary data, 1s and 0s (present and not present), and I'm trying to figure out whether the result is statistically significant, but the data isn't normally distributed. I'm not sure what to do. Any help would be appreciated.
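
The right test depends on the exact hypothesis, but with binary data a normal distribution isn't expected anyway. As one illustration (not necessarily the right test for this thesis), a simple binomial test of whether the proportion of 1s differs from an assumed 50% baseline:

```python
# Illustrative only: binary (0/1) outcomes are usually handled with exact or
# chi-square style tests rather than anything assuming normality. The counts and
# the 50% baseline below are placeholders.
from scipy.stats import binomtest

successes = 37    # number of 1s ("present")
n = 50            # total observations
result = binomtest(successes, n, p=0.5, alternative="two-sided")
print(f"observed proportion: {successes / n:.2f}, p-value: {result.pvalue:.4f}")
```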

submitted by /u/Kapotter
[link] [comments]

What Are You Using Your Datasets Actually For?

Hi, my apologies for possibly asking the wrong question here. I am a total newbie to all things machine learning, have just discovered Kaggle and such, and I'm a bit stuck on a silly question: I've discovered like a million different datasets on there, but I'm just wondering how people put these sets to good use. For instance, there's a big dataset about the Titanic. I can't fathom a realistic use case where this dataset would prove useful. I guess I don't understand just yet where the 'machine learning' aspect of datasets like these comes into play. What exactly are you predicting with these?
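
As an illustration of the kind of prediction people do with the Titanic dataset: the usual exercise is to learn a model that predicts whether a passenger survived from attributes like class, sex, age, and fare. A minimal sketch using seaborn's bundled copy of the data (the Kaggle CSV would work the same way):

```python
# Train a simple classifier to predict Titanic survival from passenger attributes.
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

titanic = sns.load_dataset("titanic").dropna(subset=["age"])
X = pd.get_dummies(titanic[["pclass", "sex", "age", "fare"]], drop_first=True)
y = titanic["survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```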

Can somebody please enlighten me as to what I'm obviously missing here? I really want to know.

Thank you

submitted by /u/VHS124
[link] [comments]