Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it all for? World domination?

Trying To Find An ‘Official’ Premier League Dataset Covering Disciplinary Events (Fouls + Yellow/Red Cards) By Matchday

I’m looking for a dataset covering disciplinary events (fouls, yellow cards, and red cards) by fixture, stadium, and matchday (date) in the Premier League, ideally from season 1992/93 (though I expect this to be super unrealistic) up to 2022/23 (though any dataset covering around ten years would be great). This data is out there online in bits and pieces, but it is fan-collected and not from official sources (which would be required for semi-formal research). Has anyone had any luck getting this kind of dataset before, or have any suggestions as to who I could contact? I emailed the Premier League a few months ago but haven’t received a reply. Thanks!!

submitted by /u/oof-oofs
[link] [comments]

3000 Microwave Ovens From Popular E-commerce Sites

I received about 3000 listings when I played around with the NPM package [ecommerce-scraper-js](https://www.npmjs.com/package/ecommerce-scraper-js). Here’s the resulting dataset, if you’re interested. I tried to get 1000 microwave oven listings from each website, but in practice there are not always that many products. There are:

– 480 listings from Amazon;

– 1000 listings from eBay;

– 180 listings from Google Shopping;

– 299 listings from The Home Depot;

– 1000 listings from Walmart.

In total, I received 2959 microwave oven listings.

With this parser, you can get any listings (or selected listing info). Check the docs for more detail; it’s elementary, like:

```javascript
import { config, amazon, walmart, ebay, homeDepot, googleShopping } from "ecommerce-scraper-js";

config.API_KEY = "your_api_key_from_serpApi";

amazon.getListings().then(console.log);
walmart.getListings().then(console.log);
ebay.getListings().then(console.log);
homeDepot.getListings().then(console.log);
googleShopping.getListings().then(console.log);
```

You can load the dataset from [Kaggle](https://www.kaggle.com/datasets/mykhailozub/3000-microwave-ovens-from-popular-e-commerce-sites).

submitted by /u/Character_Equal_2732
[link] [comments]

[self-promotion] 7500 Hotels From Airbnb, Booking, And Hotels.com

I made a hotel parser in JS (hotels-scraper-js) and checked it for usefulness. Here’s the resulting dataset, if you’re interested. For tests, I chose 5 European capitals: Berlin, London, Madrid, Paris, and Rome, with 500 hotels from each site for each city. (In theory there should be 500, but in practice there are not always that many free rooms on the selected dates, so the results may be slightly smaller.) You can get the hotel data you need with this parser. Check the docs for more detail; it’s very simple, like:

```javascript
import { airbnb, booking, hotelsCom } from "hotels-scraper-js";

airbnb.getHotels("YOUR_SEARCH_PARAMS").then(console.log);
booking.getHotels("YOUR_SEARCH_PARAMS").then(console.log);
hotelsCom.getHotels("YOUR_SEARCH_PARAMS").then(console.log);
```

You can load the dataset from Kaggle.

submitted by /u/Character_Equal_2732
[link] [comments]

3.1M BuzzFeed News “Trending” Headlines 2018–2023

BuzzFeed Inc. has shut down BuzzFeed News, but this dataset captures the editor-selected headlines that were featured on their website (a script ran every five minutes for years capturing this data).

This file contains 3.1 million rows, each representing one headline article observed at one point in time: https://app.gigasheet.com/spreadsheet/BuzzFeed-News–Trending–Strip–2018-2023/b8c8cff0_3227_4585_b86b_976d0a7410da

Note: this file includes duplicates. The columns are:

– timestamp: The time (in UTC) of the fetch. All articles from the same fetch will have the same timestamp.

– position: The article’s zero-indexed position in the trending strip, from left to right.

– text: The text of the link used to highlight the article. Note: sometimes the same article is associated with different text at different points in time.

– url: The link’s URL. Note: sometimes (although relatively rarely) the URL for the same underlying article changes over time.
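
Since the file includes duplicates, a minimal dedup sketch may help: collapse repeated observations to one row per unique URL, keeping the first time each article was seen. This assumes you export the sheet into rows of { timestamp, position, text, url } objects (field names from the list above) and that the UTC timestamps sort lexicographically (e.g., ISO 8601):

```javascript
// Collapse repeated observations to one row per unique URL, keeping the
// earliest sighting. Assumes ISO-8601-style timestamps that compare
// correctly as strings; adjust if the export uses another format.
function firstSightings(rows) {
  const seen = new Map();
  for (const row of rows) {
    const prev = seen.get(row.url);
    if (!prev || row.timestamp < prev.timestamp) {
      seen.set(row.url, row);
    }
  }
  return [...seen.values()];
}
```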

Data source: https://github.com/jsvine/buzzfeed-news-trending-strip/

submitted by /u/n1nja5h03s
[link] [comments]

Looking For Crop Yield Dataset Suitable For PCA

I’m looking for datasets suitable for PCA examples – ideally something like crop yields as a function of soil nutrients.

Back in 1998-1999, I took an applied statistics course and the instructor demonstrated PCA through a dataset on crop yields. (No, I don’t remember whether it was wheat or corn or something more specific.)

The set had measured various soil nutrients across a field (potassium, calcium, sodium, phosphorus, nitrogen, etc.), and the idea was to perform PCA regression, which, if memory serves, produced a first PC that looked like pH (e.g., all the cations had positive coefficients, while phosphorus and nitrogen had negative coefficients).
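
For anyone searching under a different name, this technique is usually called principal component regression (PCR). A minimal sketch of the math, assuming an n × p column-centered matrix X of soil-nutrient measurements and a yield vector y (symbols are mine, not from the original course):

```latex
% Sample covariance of the centered nutrient matrix and its eigendecomposition:
C = \frac{1}{n-1} X^\top X = V \Lambda V^\top
% Scores on the first k principal components, then ordinary least squares
% of yield on those scores:
T = X V_k, \qquad \hat{\beta} = (T^\top T)^{-1} T^\top y
```

A “pH-like” first component is then just an eigenvector of C whose loadings are positive for the cations and negative for phosphorus and nitrogen, as described above.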

I’ve looked through a bunch of dataset archives with no luck. If anyone knows this source, or something similar, I’d be really grateful. Thanks in advance for any help.

submitted by /u/geoffh2016
[link] [comments]

What Are Some Good Publicly Available Real-time Data Sources?

This is a cross-post from r/dataengineering via recommendation from the comments there.

I am trying to curate and crowdsource a list of real-time datasets and sources into an “Awesome List” in a GitHub repo – https://github.com/bytewax/awesome-public-real-time-datasets. Finding such sources was something I found difficult when building hobby projects or trying to learn about streaming data.

If you have any recommendations please share a link in the comments or open a PR in the repo :).

submitted by /u/math-bw
[link] [comments]

List Of Code Generation Datasets (Open Source)

I’ve compiled a list of datasets that can be used to train LLMs to generate code from text. Let me know if there is any dataset that I’ve missed!

WikiSQL

A large crowd-sourced dataset for developing natural language interfaces for relational databases.

TheVault

The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.

CodeContests

CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode.

The Pile

The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.

CodeSearchNet

The CodeSearchNet corpus is a dataset of 2 million (comment, code) pairs from open-source libraries hosted on GitHub. It contains code and documentation for several programming languages.

GitHub Code

The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1 TB of data.

MBPP

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on.

CodeXGLUE

A collection of code intelligence tasks and a platform for model evaluation and comparison.

submitted by /u/04RR
[link] [comments]

Dataset Of Every Electronic Device Of The Past 20 Years

Hi there,

I am looking for a dataset of every electronic device of the past 20 years: think of appliances like radios, televisions, phones, printers, baby cameras, and so on. The dataset needs to contain technical specifications like power output/input, etc., and EAN numbers.

Does anyone know how I can obtain such a dataset? (And if it does not exist, I can scrape it, as long as I have a reliable source.)

I would be able to share the dataset if someone gives me instructions on how to get it.

Best regards

submitted by /u/mmnagra31
[link] [comments]

Where Can I Find Resources To Make A MIDI Dataset Of Guitar Tablatures?

I’m making a dataset of MIDI files of guitar accompaniment to various melodies. Ultimate Guitar is the only resource I have found for this so far, but it doesn’t actually provide MIDI files for its tabs; in the Pro plan, it provides tab sheets, and those must be converted to MIDI with transcription software. Provided there are no such existing datasets, where could I scrape for these MIDIs?

submitted by /u/Outrageous_Signal_48
[link] [comments]

289k Medium Articles At Your Fingertips! 🚀

Hello everyone! 👋

I’m thrilled to share an exciting update with all of you today. We’ve just completed a remarkable data project, and the result is nothing short of extraordinary. Introducing our colossal dataset of 289k Medium Articles! 🎉🔥

Dataset Overview:

This incredible collection is the culmination of our meticulous efforts, as we scoured 35 different publications, capturing the evolution of their articles from inception to 26 May 2023. Imagine the vast wealth of knowledge waiting to be explored!

What’s in the Dataset?

Contained within a convenient 1.7GB zip file, the dataset is organized into 35 folders, each corresponding to a specific Medium publication. Dive into these folders, and you’ll discover thousands of JSON files packed with article-related information, including titles, authors, word counts, reading times, claps, comments, publication details, and much more. It’s a data enthusiast’s dream come true! 🤓💡
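
To get a feel for the layout, here is a minimal Node.js sketch that walks the 35 publication folders and tallies articles and claps per publication. The folder structure comes from the description above, but the root path and the exact JSON field names (e.g., "claps") are assumptions:

```javascript
// Walk each publication folder and tally article counts and total claps.
// Assumes ./medium-articles is the unzipped dataset root, that it contains
// only publication folders, and that each JSON file has a numeric "claps"
// field (an assumption based on the description above).
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const root = "./medium-articles";

for (const publication of readdirSync(root)) {
  let articles = 0;
  let claps = 0;
  for (const file of readdirSync(join(root, publication))) {
    if (!file.endsWith(".json")) continue;
    const article = JSON.parse(readFileSync(join(root, publication, file), "utf8"));
    articles += 1;
    claps += article.claps ?? 0;
  }
  console.log(`${publication}: ${articles} articles, ${claps} total claps`);
}
```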

Unleashing the Power of Metadata:

But wait, there’s more! We’ve gone the extra mile to provide you with comprehensive metadata for each article. From the text itself to markups, embeds, links, and other contextual information, this dataset empowers you to delve into the nuances of content and unlock deeper insights. The possibilities are endless! 📄🔍✨

Fueling Research and Innovation:

Whether you’re a data scientist, a researcher, or an innovator, this dataset is a game-changer. It opens up new avenues for groundbreaking research in natural language processing, content analysis, user behavior patterns, and more. Let your curiosity run wild and see where this treasure trove of knowledge takes you! 🚀🔬💥

How to Get Access:

If you’re as excited as we are about this dataset, we’d love to share it with you. Simply reach out to us at [nishu@mediumapi.com](mailto:nishu@mediumapi.com), and our team will guide you through the process of obtaining this invaluable resource. Let’s embark on a journey of discovery together! 📧💻

Responsible Data Usage:

With great data comes great responsibility. We kindly request that all users utilize this dataset strictly for research purposes and in accordance with Medium’s terms and conditions. Let’s maintain ethical data practices and respect the intellectual property rights of content creators. 🙏🔒

Join the Knowledge Revolution:

We believe that knowledge should be shared and accessible to all. This dataset represents a major step toward democratizing information and fostering innovation. Together, we can push the boundaries of what’s possible and create a brighter future. Join us on this thrilling adventure! 🌍💪💡

Let’s ignite a spark of discovery, unravel hidden insights, and propel the world of research and innovation forward. Reach out, grab your slice of this remarkable dataset, and embark on a journey that will redefine the limits of knowledge!

submitted by /u/medium-api
[link] [comments]

English Premier League First Half Vs Second Half Data By Match

Hi! Does anyone know where I could get detailed data on English Premier League soccer games that shows stats broken down by the first and second half of each match?

I see datasets that have scores at half-time and full-time, but I’m after more detailed stats (possession, shots on target, etc.).

I’m mostly after recent data (the 2022–2023 season) but would be open to historic data as well.

Would appreciate it if someone could point me in the right direction!

submitted by /u/questily
[link] [comments]

Looking For Time Of Birth Data Or Datasets

Hello everyone, I’m new to this site, so I hope I’m posting in the right section.

I am looking for data regarding the time and date of birth of large numbers of people. I have tried looking on the HHS website and at the Natality data they published, but I couldn’t find any information regarding the time of birth.

Is there perhaps another way for me to find that somewhere? Many thanks!

submitted by /u/cxvdxuxj
[link] [comments]

[self-promotion] Feedback Needed: Building Git For Data That Commits Only Diffs (for Storage Efficiency On Large Repositories), Even Without Full Checkouts Of The Datasets

I would really appreciate feedback on a version control system for tabular datasets I am building, the Data Manager.

Main characteristics:

– Like DVC and Git LFS, it integrates with Git itself.

– Like DVC and Git LFS, it can store large files on AWS S3 and link them in Git via an identifier.

– Unlike DVC and Git LFS, it calculates and commits diffs only, at row, column, and cell level (see the sketch after this list). For append scenarios, the commit includes the new data only; for edits and deletes, a small diff is committed accordingly. With DVC and Git LFS, the entire dataset is committed again instead: committing 1 MB of new data 1000 times to a 1 GB dataset yields more than 1 TB in DVC (a dataset that increases linearly in size from 1 GB to 2 GB over 1000 commits results in a repository of ~1.5 TB), whereas it sums to 2 GB (the 1 GB original dataset plus 1000 × 1 MB of changes) with the Data Manager.

– Unlike DVC and Git LFS, the diffs for each commit remain visible directly in Git.

– Unlike DVC and Git LFS, the Data Manager allows committing changes to datasets without full checkouts on localhost. You check out kilobytes and can append data to a dataset in a repository of hundreds of gigabytes. The changes on a no-full-checkout branch then need to be merged into another branch (on a machine that does operate with full checkouts) to be validated, e.g., against adding a primary key that already exists.

– Since the repositories contain diff histories, snapshots of the datasets at a given commit have to be recreated to be deployable. These can be automatically uploaded to S3 and labeled after the commit hash, via the Data Manager.
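
To make the diff-only idea concrete, here is a minimal row-level sketch (my illustration of the concept, not the Data Manager’s actual code): given two versions of a table keyed by a primary key, only the added, deleted, and changed rows need to be committed.

```javascript
// Row-level diff between two table versions, keyed by a primary key.
// Committing only this object (instead of the full new table) is what keeps
// the "1 GB + 1000 x 1 MB appends" example at ~2 GB instead of ~1.5 TB.
function diffRows(oldRows, newRows, key = "id") {
  const oldByKey = new Map(oldRows.map((row) => [row[key], row]));
  const newByKey = new Map(newRows.map((row) => [row[key], row]));
  const added = [];
  const changed = [];
  for (const [k, row] of newByKey) {
    const prev = oldByKey.get(k);
    if (!prev) {
      added.push(row);
    } else if (JSON.stringify(prev) !== JSON.stringify(row)) {
      // Key-order-sensitive comparison; fine for a sketch.
      changed.push({ before: prev, after: row });
    }
  }
  const deleted = [...oldByKey.values()].filter((row) => !newByKey.has(row[key]));
  return { added, deleted, changed };
}
```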

Links:

https://news.ycombinator.com/item?id=35930895

https://news.ycombinator.com/item?id=35806843

This paradigm enables hibernating or cleaning up history on S3 for old datasets, if these are deleted in Git and snapshots of earlier commits are no longer needed. Individual data entries can also be removed for GDPR compliance using versioning on S3 objects, orthogonal to Git.

I built the Data Manager for a pain point I was experiencing: it was impossible to (1) uniquely identify and (2) make available behind an API multiple versions of a collection of datasets and config parameters, (3) without overburdening HDDs due to small, but frequent changes to any of the datasets in the repo and (4) while being able to see the diffs in git for each commit in order to enable collaborative discussions and reverting or further editing if necessary.

Some background: I am building natural language AI algorithms (a) easily retrainable on editable training datasets, meaning changes or deletions in the training data are reflected fast, without traces of past training and without retraining the entire language model (sounds impossible), and (b) that explain decisions back to individual training data.

I look forward to constructive feedback and suggestions!

submitted by /u/Usual-Maize1175
[link] [comments]

Excel Sheet Data Processing Help – Separating Helicopter Data From A FOIA Excel Sheet

Data Processing

Hey, I need help processing data. My friend offered me a helicopter ride (we met through someone) in a certain city in the US in January of 2022… I lost contact with the person who connected us, and the helicopter dude never gave me his name 😭 (he has an electrical engineering license from, I’m assuming, Florida… he owns a house in this city in south Florida).

Fast forward. I filed a FOIA (Freedom of Information Act) request for all helicopters in that city in January 2022… less than 20 total. Easy. So what happened?

My FOIA response came in, ANDDD according to the FOIA letter they couldn’t separate the rotorcraft (helicopters) from the fixed-wing aircraft (small planes 😭) for January 2022.

January 2022 was a VERY BUSY month for planes… It’s going to be an insane amount of data.

(But it’s probably over 10 pages, with 50 aircraft entries per page in 11 pt font.)

How do I sort the helicopters out of the data? It was like 15 helis maximum.

HOWEVER…

You can also download a list of all the persons with pilot licenses, and this guy has a pilot license (he owns and flies his helicopter)… It’s a sheet that identifies the type of license, with something like P/H for helicopter… How do you sort all the helicopter owners out of this sheet?

It’s an Excel sheet.

Please advise!!
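
One possible starting point, sketched under heavy assumptions (the real file layout, the column that holds the license type, and the exact helicopter rating code will differ): export the Excel sheet to CSV, then keep only the rows whose license-type column mentions a helicopter rating.

```javascript
// Filter an airmen CSV down to helicopter-rated pilots. The file name,
// column index, and "P/H"-style code are assumptions; adjust to the real sheet.
import { readFileSync, writeFileSync } from "node:fs";

const lines = readFileSync("airmen.csv", "utf8").trim().split("\n");
const header = lines[0];
const LICENSE_COL = 3; // assumption: whichever column holds the license type

const helicopterRows = lines.slice(1).filter((line) => {
  const cols = line.split(","); // naive split; fine while fields contain no commas
  return /helicopter|p\/h/i.test(cols[LICENSE_COL] ?? "");
});

writeFileSync("helicopter-pilots.csv", [header, ...helicopterRows].join("\n"));
console.log(`Kept ${helicopterRows.length} of ${lines.length - 1} rows.`);
```

The same filter idea works on the FOIA aircraft list once you find whichever column marks rotorcraft versus fixed-wing.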

submitted by /u/Soggy-Nectarine-3578
[link] [comments]

Where Can I Download Cairo Dataset ?

Cairo – Cairo University’s dataset consists of a total of 610 student answers: 10 answers for each of 61 questions. These were collected from only one chapter of the official Egyptian curriculum for the Environmental Science course. The average length of a student’s answer is 2.2 sentences, 20 words, or 103 characters. The dataset contains a collection of students’ responses along with their grades, which vary between 0 and 5 according to the assessment of two human evaluators. An English version of the Cairo University dataset is also available for research in this area. This dataset can be downloaded from the webpage.

The link refers to http://www.aucegypt.edu/src/datasets.htm, but unfortunately the link is dead, and I can’t find any other link.

Basically, I need a dataset of questions, correct answers, students’ answers, and their grades (graded by humans). I want to evaluate my method for automatic grading of short answers against it. So, if you know any other similar dataset, please let me know.

Thank you.

submitted by /u/yokowasis2
[link] [comments]

Value Of 2.8 Million African Student ID Pictures

Being a datahoarder, I stumbled on a way to harvest student ID pictures from an exam authority in sub-Saharan Africa. No illegal hacking involved, just exploiting a predictable URL format.

I have now gathered 2.8 million of them, about 90 GB, spanning about a decade of student exams. Typical ID format, face & shoulders only, often quite small (20–50 KB), with no metadata besides year, exam type, and region.

Is there any monetary value to this? Any open-source projects that need such data?

submitted by /u/Joonicks
[link] [comments]