Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

3.1M BuzzFeed News “Trending” Headlines 2018–2023

BuzzFeed Inc. has shut down BuzzFeed News, but this data set captures the editor-selected headlines that were featured on its website (a script ran every five minutes for years to capture this data).

This file contains 3.1 million rows, each representing one headline article observed at one point in time: https://app.gigasheet.com/spreadsheet/BuzzFeed-News–Trending–Strip–2018-2023/b8c8cff0_3227_4585_b86b_976d0a7410da

Note: this file includes duplicates.
timestamp: The time (in UTC) of the fetch. All articles from the same fetch will have the same timestamp.
position: The article’s zero-indexed position in the trending strip, from left to right.
text: The text of the link used to highlight the article. Note: Sometimes the same article is associated with different text at different points in time.
url: The link’s URL. Note: Sometimes (although relatively rarely) the URL for the same underlying article changes over time.
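
Because the file repeats the same article across many fetches, a first step is usually to collapse it to one record per URL. Below is a minimal sketch using only the standard library. It assumes a CSV export with exactly the four columns described above and ISO-style timestamps that sort correctly as strings; the sample rows and the `summarize_by_url` name are made up for illustration.

```python
import csv
import io


def summarize_by_url(csv_file):
    """Collapse repeated observations into one summary record per URL."""
    summary = {}
    for row in csv.DictReader(csv_file):
        entry = summary.setdefault(
            row["url"],
            {
                "first_seen": row["timestamp"],
                "last_seen": row["timestamp"],
                "observations": 0,
                "texts": set(),
            },
        )
        # Assumption: timestamps share one ISO-like format, so string
        # comparison matches chronological order.
        entry["first_seen"] = min(entry["first_seen"], row["timestamp"])
        entry["last_seen"] = max(entry["last_seen"], row["timestamp"])
        entry["observations"] += 1
        # The same article can appear with different link text over time.
        entry["texts"].add(row["text"])
    return summary


# Hypothetical sample rows, not taken from the real file.
SAMPLE = """timestamp,position,text,url
2018-01-01T00:00:00Z,0,Headline A,https://example.com/a
2018-01-01T00:05:00Z,0,Headline A (updated),https://example.com/a
2018-01-01T00:05:00Z,1,Headline B,https://example.com/b
"""

summary = summarize_by_url(io.StringIO(SAMPLE))
```

Note that this groups by URL only, so an article whose URL changed over time would still show up as two records.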

Data source: https://github.com/jsvine/buzzfeed-news-trending-strip/

submitted by /u/n1nja5h03s

Looking For Crop Yield Dataset Suitable For PCA

I’m looking for datasets suitable for PCA examples – ideally something like crop yields as a function of soil nutrients.

Back in 1998-1999, I took an applied statistics course and the instructor demonstrated PCA through a dataset on crop yields. (No, I don’t remember if it was wheat or corn or something more specific.)

The set had measurements of various soil nutrients across a field (potassium, calcium, sodium, phosphorus, nitrogen, etc.), and the idea was to perform PCA regression. If memory serves, the first PC looked like pH (e.g., all the cations had positive coefficients, while phosphorus and nitrogen had negative coefficients).

I’ve looked through a bunch of dataset archives with no luck. If anyone knows this source, or something similar, I’d be really grateful. Thanks in advance for any help.
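
Until the original data turns up, the effect described above can be reproduced on synthetic data. The sketch below is only an illustration: the "soil" values are generated from a single hidden pH-like factor (an assumption, not real measurements), and the first principal component is found with plain power iteration on the covariance matrix rather than a statistics library.

```python
import random
import statistics

NUTRIENTS = ["K", "Ca", "Na", "P", "N"]


def make_synthetic_soil(n_samples=200, seed=0):
    """Fake samples: cations rise with a hidden factor, P and N fall."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        latent = rng.gauss(0, 1)  # stands in for a pH-like gradient
        cations = [latent + rng.gauss(0, 0.3) for _ in range(3)]
        others = [-latent + rng.gauss(0, 0.3) for _ in range(2)]
        rows.append(cations + others)
    return rows


def standardize(rows):
    """Center each column and scale it to unit standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def covariance(rows):
    """Covariance matrix of already-centered rows."""
    n, p = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector of cov via power iteration, unit length."""
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # The sign of an eigenvector is arbitrary; fix it so K loads positively.
    return v if v[0] >= 0 else [-x for x in v]


soil = standardize(make_synthetic_soil())
loadings = first_component(covariance(soil))
# PC1 score per sample; principal component regression would regress
# yield on these scores instead of on the raw nutrient columns.
scores = [sum(l * x for l, x in zip(loadings, row)) for row in soil]

for name, loading in zip(NUTRIENTS, loadings):
    print(f"{name}: {loading:+.2f}")
```

On this made-up data the three cations get positive loadings and P and N get negative ones, which is the pattern remembered from the course.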

submitted by /u/geoffh2016

What Are Some Good Publicly Available Real-time Data Sources?

This is a cross-post from r/dataengineering via recommendation from the comments there.

I am trying to curate and crowdsource a list of real-time datasets and sources into an “Awesome List” in a GitHub repo – https://github.com/bytewax/awesome-public-real-time-datasets. It is something I found difficult when building hobby projects or trying to learn about streaming data.

If you have any recommendations please share a link in the comments or open a PR in the repo :).

submitted by /u/math-bw

List Of Code Generation Datasets (Open Source)

I’ve compiled a list of datasets that can be used to train LLMs to generate code from text. Let me know if there is any dataset that I’ve missed!

WikiSQL

A large crowd-sourced dataset for developing natural language interfaces for relational databases.

TheVault

The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.

CodeContests

CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode.

The Pile

The Pile is an 825 GiB diverse, open-source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

CodeSearchNet

The CodeSearchNet corpus is a dataset of 2 million (comment, code) pairs from open-source libraries hosted on GitHub. It contains code and documentation for several programming languages.

GitHub Code

The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1 TB of data.

MBPP

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on.

CodeXGLUE

A collection of code intelligence tasks and a platform for model evaluation and comparison.

submitted by /u/04RR

Dataset Of Every Electronic Device Of The Past 20 Years

Hi there,

I am looking for a dataset of every electronic device of the past 20 years: think of appliances like radios, televisions, phones, printers, baby cameras, and so on. The dataset needs to contain technical specifications, such as power output/input, and EAN numbers.

Does anyone know how I can obtain such a dataset? If it does not exist, I can scrape it, as long as I have a reliable source.

I would be able to share the dataset if someone gives me instructions on how to get it.
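
If the data ends up being scraped, the EAN column is at least easy to sanity-check, because the last digit of an EAN-13 is a check digit. A minimal sketch follows; the function names are made up for illustration, and it covers 13-digit codes only.

```python
def ean13_check_digit(first12):
    """Check digit for the first 12 digits of an EAN-13 code."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    # Counting from the left, odd positions weigh 1 and even positions 3.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code):
    """True if code is 13 ASCII digits with a matching check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    return ean13_check_digit(code[:12]) == int(code[12])
```

Rows that fail `is_valid_ean13` are most likely typos or truncated values in the source and can be flagged for review rather than silently kept.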

Best regards

submitted by /u/mmnagra31

Where Can I Find Resources To Make A MIDI Dataset Of Guitar Tablatures?

I’m making a dataset of MIDI files of guitar accompaniment to various melodies. Ultimate Guitar is the only resource I have found for this so far, but it doesn’t actually provide MIDI files for its tabs; on the Pro plan, it provides tab sheets, and those must be converted to MIDI with transcription software. Provided there are no such existing datasets, where could I scrape for these MIDIs?
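
For the tab-to-MIDI step, the pitch part is simple arithmetic: each fret raises the open string by one semitone. The sketch below assumes standard tuning (E2 A2 D3 G3 B3 E4, MIDI notes 40 to 64) and only maps fret positions to MIDI note numbers; timing, durations, and writing an actual .mid file are left out. The helper names are hypothetical.

```python
# MIDI note numbers of the open strings in standard tuning.
# String 6 is the low E, string 1 is the high E.
OPEN_STRING_MIDI = {6: 40, 5: 45, 4: 50, 3: 55, 2: 59, 1: 64}


def fret_to_midi(string, fret, capo=0):
    """MIDI note number for a fretted string (one semitone per fret)."""
    if string not in OPEN_STRING_MIDI:
        raise ValueError("string must be between 1 and 6")
    if fret < 0:
        raise ValueError("fret must be non-negative")
    return OPEN_STRING_MIDI[string] + capo + fret


def chord_to_midi(frets, capo=0):
    """Convert a chord shape to MIDI notes.

    frets lists strings 6 down to 1; None marks a muted string,
    so the open C major shape x32010 is [None, 3, 2, 0, 1, 0].
    """
    notes = []
    for string, fret in zip(range(6, 0, -1), frets):
        if fret is not None:
            notes.append(fret_to_midi(string, fret, capo))
    return notes
```

For example, `chord_to_midi([None, 3, 2, 0, 1, 0])` gives the notes C3, E3, G3, C4, and E4 as MIDI numbers.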

submitted by /u/Outrageous_Signal_48

289k Medium Articles At Your Fingertips! 🚀

Hello everyone! 👋

I’m thrilled to share an exciting update with all of you today. We’ve just completed a remarkable data project, and the result is nothing short of extraordinary. Introducing our colossal dataset of 289k Medium Articles! 🎉🔥

Dataset Overview:

This incredible collection is the culmination of our meticulous efforts, as we scoured 35 different publications, capturing the evolution of their articles from inception to 26 May 2023. Imagine the vast wealth of knowledge waiting to be explored!

What’s in the Dataset?

Contained within a convenient 1.7GB zip file, the dataset is organized into 35 folders, each corresponding to a specific Medium publication. Dive into these folders, and you’ll discover thousands of JSON files packed with article-related information, including titles, authors, word counts, reading times, claps, comments, publication details, and much more. It’s a data enthusiast’s dream come true! 🤓💡
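
A layout like the one described (one folder per publication, one JSON file per article) can be walked with the standard library. This is a rough sketch under that assumption only: the exact file naming and the JSON field names are not given in the post, so the code makes no assumptions about keys, and the function names are made up.

```python
import json
from pathlib import Path


def count_articles_by_publication(root):
    """Map each publication folder name to its number of JSON files."""
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = sum(1 for _ in folder.glob("*.json"))
    return counts


def load_article(path):
    """Read one article file and return its parsed JSON content."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

After unzipping, `count_articles_by_publication("path/to/unzipped")` gives a quick check that all 35 folders are present before any heavier processing.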

Unleashing the Power of Metadata:

But wait, there’s more! We’ve gone the extra mile to provide you with comprehensive metadata for each article. From the text itself to markups, embeds, links, and other contextual information, this dataset empowers you to delve into the nuances of content and unlock deeper insights. The possibilities are endless! 📄🔍✨

Fueling Research and Innovation:

Whether you’re a data scientist, a researcher, or an innovator, this dataset is a game-changer. It opens up new avenues for groundbreaking research in natural language processing, content analysis, user behavior patterns, and more. Let your curiosity run wild and see where this treasure trove of knowledge takes you! 🚀🔬💥

How to Get Access:

If you’re as excited as we are about this dataset, we’d love to share it with you. Simply reach out to us at nishu@mediumapi.com, and our team will guide you through the process of obtaining this invaluable resource. Let’s embark on a journey of discovery together! 📧💻

Responsible Data Usage:

With great data comes great responsibility. We kindly request that all users utilize this dataset strictly for research purposes and in accordance with Medium’s terms and conditions. Let’s maintain ethical data practices and respect the intellectual property rights of content creators. 🙏🔒

Join the Knowledge Revolution:

We believe that knowledge should be shared and accessible to all. This dataset represents a major step toward democratizing information and fostering innovation. Together, we can push the boundaries of what’s possible and create a brighter future. Join us on this thrilling adventure! 🌍💪💡

Let’s ignite a spark of discovery, unravel hidden insights, and propel the world of research and innovation forward. Reach out, grab your slice of this remarkable dataset, and embark on a journey that will redefine the limits of knowledge!

submitted by /u/medium-api

English Premier League First Half Vs Second Half Data By Match

Hi! Does anyone know where I could get detailed data on English Premier League soccer games that shows stats broken down by the first and second half of each match?

I see datasets that have scores at half-time and full-time, but I’m after more detailed stats (possession, shots on target, etc.).

Mostly after recent data (2022-2023 season) but would be open to historic as well.

Would appreciate it if someone could point me in the right direction!

submitted by /u/questily

Looking For Time Of Birth Data Or Datasets

Hello everyone, I’m new to this site so I hope I’m posting in the right section.

I am looking for data regarding the time and date of birth of large numbers of people. I have tried looking at the HHS website and the Natality data they published, but I couldn’t find any information regarding the time of birth.

Is there perhaps another way for me to find that, somewhere? Many thanks!

submitted by /u/cxvdxuxj

[self-promotion] Feedback Needed: Building Git For Data That Commits Only Diffs (for Storage Efficiency On Large Repositories), Even Without Full Checkouts Of The Datasets

I would really appreciate feedback on a version control for tabular datasets I am building, the Data Manager.

Main characteristics:

Like DVC and Git LFS, integrates with Git itself.
Like DVC and Git LFS, can store large files on AWS S3 and link them in Git via an identifier.
Unlike DVC and Git LFS, calculates and commits diffs only, at row, column, and cell level. For append scenarios, the commit will include new data only; for edits and deletes, a small diff is committed accordingly. With DVC and Git LFS, the entire dataset is committed again, instead: committing 1 MB of new data 1000 times to a 1 GB dataset yields more than 1 TB in DVC (a dataset that increases linearly in size between 1 GB and 2 GB, committed 1000 times, results in a repository of ~1.5 TB), whereas it sums to 2 GB (1 GB original dataset, plus 1000 times 1 MB changes) with the Data Manager.
Unlike DVC and Git LFS, the diffs for each commit remain visible directly in Git.
Unlike DVC and Git LFS, the Data Manager allows committing changes to datasets without full checkouts on localhost. You check out kilobytes and can append data to a dataset in a repository of hundreds of gigabytes. The changes on a no-full-checkout branch will need to be merged into another branch (on a machine that does operate with full checkouts, instead) to be validated, e.g., against adding a primary key that already exists.
Since the repositories will contain diff histories, snapshots of the datasets at a certain commit have to be recreated to be deployable. These can be automatically uploaded to S3 and labeled after the commit hash, via the Data Manager.
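
The storage figures quoted above can be checked with a few lines of arithmetic. The sketch below takes 1 GB as 1000 MB and assumes, as the post does, that a snapshot-based tool stores the whole dataset again on every commit while a diff-based tool stores only the new megabyte; the function names are made up for illustration.

```python
def full_snapshot_storage_mb(base_mb, delta_mb, commits):
    """Total stored when every commit saves the whole dataset again."""
    # After commit i the dataset holds base_mb + i * delta_mb.
    return sum(base_mb + i * delta_mb for i in range(1, commits + 1))


def diff_storage_mb(base_mb, delta_mb, commits):
    """Total stored when the base is kept once and each commit adds a diff."""
    return base_mb + commits * delta_mb


snapshots = full_snapshot_storage_mb(base_mb=1000, delta_mb=1, commits=1000)
diffs = diff_storage_mb(base_mb=1000, delta_mb=1, commits=1000)

print(f"full snapshots: {snapshots / 1_000_000:.2f} TB")  # about 1.5 TB
print(f"diffs only:     {diffs / 1000:.2f} GB")           # 2 GB
```

The snapshot total is 1000 x 1000 MB for the repeated base plus 1 + 2 + ... + 1000 = 500,500 MB of accumulated appends, i.e. 1,500,500 MB.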

Links:

https://news.ycombinator.com/item?id=35930895
https://news.ycombinator.com/item?id=35806843

This paradigm enables hibernating or cleaning up history on S3 for old datasets, if these are deleted in Git and snapshots of earlier commits are no longer needed. Individual data entries can also be removed for GDPR compliance using versioning on S3 objects, orthogonal to git.

I built the Data Manager for a pain point I was experiencing: it was impossible to (1) uniquely identify and (2) make available behind an API multiple versions of a collection of datasets and config parameters, (3) without overburdening HDDs due to small, but frequent changes to any of the datasets in the repo and (4) while being able to see the diffs in git for each commit in order to enable collaborative discussions and reverting or further editing if necessary.
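
To make the idea of row- and cell-level diffs concrete, here is a toy sketch. It is not the Data Manager's implementation, which the post does not show; it only illustrates one way to compare two versions of a small table keyed by a primary key, with the column name `id` and the function name chosen for the example.

```python
def diff_rows(old, new, key="id"):
    """Compare two lists of row dicts that share a primary-key column."""
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    added = [new_by_key[k] for k in new_by_key if k not in old_by_key]
    removed = [k for k in old_by_key if k not in new_by_key]

    changed = {}
    for k in old_by_key.keys() & new_by_key.keys():
        # Record only the cells that differ, as (old value, new value).
        cells = {
            col: (old_by_key[k].get(col), val)
            for col, val in new_by_key[k].items()
            if old_by_key[k].get(col) != val
        }
        if cells:
            changed[k] = cells
    return {"added": added, "removed": removed, "changed": changed}


v1 = [{"id": 1, "name": "a", "qty": 5}, {"id": 2, "name": "b", "qty": 7}]
v2 = [{"id": 1, "name": "a", "qty": 6}, {"id": 3, "name": "c", "qty": 1}]
delta = diff_rows(v1, v2)
```

Here `delta` records one changed cell (`qty` of row 1), one removed key (2), and one added row (3), which is far smaller than either full version once the tables are large.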

Some background: I am building natural language AI algorithms (a) easily retrainable on editable training datasets, meaning changes or deletions in the training data are reflected fast, without traces of past training and without retraining the entire language model (sounds impossible), and (b) that explain decisions back to individual training data.

I look forward to constructive feedback and suggestions!

submitted by /u/Usual-Maize1175

Excel Sheet Data Processing Help – Helicopter FOIA Separating Data Excel Sheet

Data Processing

Hey, I need help processing data. A friend (met through someone) offered me a helicopter ride in a certain city in the US in January of 2022. I lost contact with the person who connected us, and the helicopter dude never gave me his name 😭 (he has an electrical engineering license from, I’m assuming, Florida, and he owns a house in this city in south Florida).

Fast forward. I requested a FOIA (Freedom of Information Act) record of all helicopters in that city for January 2022. Fewer than 20 total. Easy. What happens?

My FOIA came in ANDDD according to the FOIA letter they couldn’t separate the rotorcraft (helicopters) from the fixed wing (small planes 😭) for January 2022.

January 2022 was a VERY BUSY month for planes… It’s going to be an insane amount of data.

(It’s probably over 10 pages, with info on 50 aircraft per page in 11 pt font.)

How do I sort the helicopters out of the data? It was like 15 helis maximum.

HOWEVER…

You can also download a list of everyone with a pilot license. This guy has a pilot license (he owns and flies his helicopter), and the sheet identifies the type of license, like P/H for helicopter. How do you sort all the helicopter owners from this sheet?

It’s an Excel sheet.

Please advise!!
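
One generic way to pull the rotorcraft rows out is to save the sheet as CSV and keep only the rows whose type column mentions the category you want. The sketch below assumes a header row and a column literally named `aircraft_type`; the real column names in the FOIA file are unknown, so both that name and the sample rows are placeholders.

```python
import csv
import io


def filter_rows(csv_file, column, needle):
    """Keep rows whose value in column contains needle (case-insensitive)."""
    needle = needle.lower()
    return [
        row
        for row in csv.DictReader(csv_file)
        if needle in (row.get(column) or "").lower()
    ]


# Hypothetical rows standing in for the exported sheet.
SAMPLE = """tail_number,aircraft_type
N00001,Fixed wing single engine
N00002,Rotorcraft
N00003,ROTORCRAFT
"""

helicopters = filter_rows(io.StringIO(SAMPLE), "aircraft_type", "rotorcraft")
```

The same result is available without code through Excel's built-in column filter; the function is mainly useful if the export is too large to handle comfortably by hand.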

submitted by /u/Soggy-Nectarine-3578

Where Can I Download Cairo Dataset ?

Cairo – Cairo University’s dataset consists of a total of 610 answers: 10 answers for each of 61 questions. These are collected from only one chapter of the official Egyptian curriculum for the Environmental Science course. The average length of a student’s answer is 2.2 sentences, 20 words, or 103 characters. The dataset contains a collection of students’ responses along with their grades, which vary between 0 and 5 according to an assessment by two human evaluators. An English version of the Cairo University data set is also available for research in this area. This dataset can be downloaded from the webpage.

The link refers to http://www.aucegypt.edu/src/datasets.htm, but unfortunately the link is dead. And I can’t find any other link.

Basically, I need a dataset of questions, correct answers, students’ answers, and their grades (graded by a human). I want to evaluate my method for automatic short-answer grading. So, if you know of any similar dataset, please let me know.
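
Once such a dataset is in hand, comparing an automatic grader with the human grades usually comes down to a couple of agreement measures. Below is a minimal sketch using Pearson correlation and mean absolute error; averaging the two human evaluators into one reference grade is an assumption made for this example, not something the dataset description prescribes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of grades."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: undefined (division by zero) if either list is constant.
    return cov / (spread_x * spread_y)


def mean_absolute_error(xs, ys):
    """Average absolute gap between predicted and reference grades."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def reference_grades(rater_a, rater_b):
    """Average two human evaluators into a single reference grade."""
    return [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
```

A grader can score a high correlation while being consistently too strict or too lenient, which is why the error measure is worth reporting alongside it.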

Thank you.

submitted by /u/yokowasis2

Value Of 2.8 Million African Student ID Pictures

Being a datahoarder, I stumbled on a way to harvest student ID pictures from an exam authority in sub-Saharan Africa. No illegal hacking involved, just exploiting a predictable URL format.

I have now gathered 2.8 million of them, about 90 GB, spanning about a decade of student exams. Typical ID format, face and shoulders only, often quite small (20-50 KB), no metadata besides year, exam type, and region.

Is there any monetary value to this? Any open source projects that need such data?

submitted by /u/Joonicks

London Stock Exchange Daily Prices Wanted

I am looking for some historic stock exchange prices to analyse. I notice a few sites seem to have them for sale, but does anyone know of any open source or community-created ones? I’d prefer the LSE, but any stock exchange would do for a first look.

I would like a dataset of about 10 years’ worth of daily prices, for 100 or more stocks. The smaller sets I have seen tend to have values for opening, closing, low, high and volume.

I want to try some trading strategies on historic data.
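
As an idea of what a first experiment on daily closing prices might look like, here is a sketch of a simple moving-average crossover. It is only one possible strategy, chosen for illustration; the window lengths are arbitrary, the prices are made up, and it ignores costs, dividends, and everything else a real backtest needs.

```python
def moving_average(values, window):
    """Trailing simple moving average; None until a full window exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def crossover_signals(closes, short=3, long=5):
    """Emit (day index, 'buy' or 'sell') when the fast average crosses the slow one."""
    fast = moving_average(closes, short)
    slow = moving_average(closes, long)
    signals = []
    for i in range(1, len(closes)):
        if None in (fast[i - 1], slow[i - 1], fast[i], slow[i]):
            continue  # not enough history yet
        was_above = fast[i - 1] > slow[i - 1]
        is_above = fast[i] > slow[i]
        if is_above and not was_above:
            signals.append((i, "buy"))
        elif was_above and not is_above:
            signals.append((i, "sell"))
    return signals


# Made-up closing prices: a dip, a recovery, then another decline.
closes = [10, 9, 8, 7, 6, 7, 8, 9, 10, 9, 8, 7, 6]
signals = crossover_signals(closes, short=2, long=4)
```

With real data, `closes` would be the closing-price column for one stock, in date order.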

submitted by /u/brainburger

Fed Funds Rate (FFR) Futures Historical Data?

I have a nice little ipynb doing data analysis on fed funds futures rates. However, the available data is rolling, and I haven’t been logging results to a DB to save them for myself.

Is there a way I can access all historical FFR data?

For reference, this is what I’m using: https://www.cmegroup.com/markets/interest-rates/cme-fedwatch-tool.html?redirect=/trading/interest-rates/countdown-to-fomc.html

Edit: I’m using automation to scrape all the files from the “download” link/tab on the left
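
Since the published data is rolling, keeping your own history means storing every fetch with a timestamp. A minimal sketch with SQLite from the standard library follows. The table layout and the field names (`meeting_date`, `target_range`, `probability`) are assumptions about what a scraped row might contain, not the format of the CME downloads.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS snapshots (
               fetched_at   TEXT NOT NULL,
               meeting_date TEXT NOT NULL,
               target_range TEXT NOT NULL,
               probability  REAL NOT NULL,
               PRIMARY KEY (fetched_at, meeting_date, target_range)
           )"""
    )
    return conn


def save_snapshot(conn, rows, fetched_at=None):
    """Store one fetch; rows are dicts with the hypothetical fields above."""
    fetched_at = fetched_at or datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO snapshots VALUES (?, ?, ?, ?)",
            [
                (fetched_at, r["meeting_date"], r["target_range"], r["probability"])
                for r in rows
            ],
        )
    return fetched_at
```

Calling `save_snapshot` at the end of each scrape run, with a file path passed to `open_store`, builds up the history that the rolling source no longer exposes.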

submitted by /u/throwawayrandomvowel

Dataset Of Examples Of Logical Fallacies?

I’m working on a project that is going to require a dataset of logical fallacies (and their classification). This has been quite a tricky task, and so far I have come across just one, linked to the paper “Logical Fallacy Detection” by Z Jin (2022). So if anyone is aware of any other examples or possible websites to scrape, that would help. Thanks!

submitted by /u/CrossingPearl

Trying To Create A Spam Voicemail Dataset

Hey guys, I am working on a project to help predict if a voicemail is spam! I am building the dataset, and I have around 300 voicemails, almost half are spam and the others are not. I want to create a dataset of at least 500-1000 voicemails.

So I am requesting that anyone share their spam voicemails and/or normal voicemails (which can be non-personal). It can be in any audio format and shared however you are comfortable with!

submitted by /u/thebatgamer