Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Any Way To Search For Similar Datasets?

I don’t know if anything like this exists, but I have a data set that shows the year-over-year percent change since 2016 in a specific industry. I took the USA GDP percent change over the same period (just out of curiosity), and there was somewhat of a correlation between the two datasets. My question is: are there any tools that search public data sets for similar percentage changes? I understand there is a high chance that the correlation is a “coincidence”, but does anything like this exist?
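
For anyone who wants to try the comparison by hand, here is a minimal sketch. It assumes each series is an array of yearly percent changes aligned on the same years. The function names, the candidate names `seriesA`/`seriesB`, and all numbers are made up for illustration. With only a handful of yearly points, a high correlation can easily be spurious.

```javascript
// Pearson correlation between two equal-length numeric series.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("series must have the same length (at least 2 points)");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  // A constant series has no defined correlation.
  if (varX === 0 || varY === 0) return NaN;
  return cov / Math.sqrt(varX * varY);
}

// Rank candidate series by the absolute value of their correlation with the target.
function rankBySimilarity(target, candidates) {
  return Object.entries(candidates)
    .map(([name, series]) => ({ name, r: pearson(target, series) }))
    .sort((a, b) => Math.abs(b.r) - Math.abs(a.r));
}

// Hypothetical percent changes, for illustration only.
const industry = [2.1, 3.0, 1.2, -4.5, 6.3, 2.8];
const candidates = {
  seriesA: [1.0, 1.5, 0.6, -2.0, 3.1, 1.4],
  seriesB: [0.5, -0.2, 0.9, 0.1, -0.4, 0.3],
};
console.log(rankBySimilarity(industry, candidates));
```

Sorting by absolute value keeps strongly negative correlations near the top as well, since an inverse relationship is also a "similar" pattern worth a look.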

submitted by /u/lil_cheeks

Found Some More Simple Csv & Excel Datasets For Business Use Cases

I have been looking for simple public datasets in CSV & Excel format that do not require me to be a data analyst to draw some charts and understand my data. Many of the public ones are super complicated, and they are usually about public topics like health, the economy, etc. I found some interesting ones on Kaggle already, but just discovered Maven’s openly available datasets. Here it is: https://mavenanalytics.io/data-playground (I have no affiliation with them, btw).
Also, I find interesting data on Statista, but it is all behind an expensive paywall. If you know of any similar but free or cheaper alternative, pls share. Thank you 🙂

submitted by /u/andidia82

Mental Health Misinformation Dataset

I’ve been searching for datasets regarding misinformation in mental health but I can’t find any. I understand that for ethical reasons they might not be public, but even ones that I could request by filling out a form or asking an author aren’t showing up in my search results. I tried asking ChatGPT and Bard, which both listed three datasets, but when I search for those names nothing about them appears, and LLMs don’t provide links to their “knowledge”.

Would be very happy if I got any leads.

submitted by /u/homebutnothome

Requesting An Images Dataset With Annotated Human Actions To Train Visual Description Model For Accessibility App

Hi everyone, I need help finding a dataset of images annotated with human actions [such as sitting+in-chair, working+on-laptop, etc.]. I found a model capable of generating such tags on Huggingface here, however I was unable to locate its source dataset.

Just for context, I am trying to create a fine-tuned ViT model that incorporates as broad a set of visual tags as possible. My plan is to optimize this model for edge devices [using quantization-aware training + TFLite model conversion] and open-source the weights. Eventually, I am hoping this can be used for a broad range of visual search/tagging/QnA tasks. Currently, I am training the model on the top 2500 Danbooru tags + MIT SUN indoor location tags.

An online demo of the model can be found here. If anyone has any suggestions regarding what other dataset/tags to add, or would like to help with the training efforts, please drop a line. I would really appreciate it.

[Disclosures: I am not affiliated in any way with any of the HuggingFace /Arxiv/Mit.edu links I posted here. The link to the online-demo is maintained by me, but there are no ads or anything else that procures me financial gain on it.]

submitted by /u/DisintegratingBo

Where Can I Find Data About Scientific Papers?

Hi,
I am searching for a database of scientific papers, the bigger the better. Something like the Web of Science papers, but I’d like to have all the information about each paper: the abstract text and the introduction (the whole paper would be perfect), plus information about the authors, academic affiliation, and the sector where it has been published.

submitted by /u/riegel_d

Trying To Find An ‘official’ Premier League Dataset, Covering Disciplinary Events (fouls + Yellow/red Cards) By Matchday

I’m looking for a dataset covering disciplinary events (fouls, yellow cards and red cards) by fixture, stadium and matchday (date) in the Premier League, ideally from season 1992/93 (though I expect this to be super unrealistic) up to 2022/23 (though any dataset covering around ten years would be great). This data is out there online in bits and pieces, but it is fan-collected and not from official sources (which would be required for semi-formal research). Has anyone had any luck getting this kind of dataset before, or have any suggestions as to who I could contact? I emailed the Premier League a few months ago but haven’t received a reply. Thanks!!

submitted by /u/oof-oofs

3000 Microwave Ovens From Popular E-commerce Sites

I received about 3000 listings when I played around with the NPM package [ecommerce-scraper-js](https://www.npmjs.com/package/ecommerce-scraper-js). Here’s the resulting dataset, if you’re interested. I tried to get 1000 microwave oven listings from each website. But there are not always so many products in practice. There are:

– 480 listings from Amazon;

– 1000 listings from eBay;

– 180 listings from Google Shopping;

– 299 listings from The Home Depot;

– 1000 listings from Walmart.

In total, I received 2959 microwave oven listings.

With this parser, you can get any listings (or selected listing info). Check the docs for more detail, it’s elementary, like:

```javascript
import { config, amazon, walmart, ebay, homeDepot, googleShopping } from "ecommerce-scraper-js";

// Set the API key before requesting any listings.
config.API_KEY = "your_api_key_from_serpApi";

// Each call returns a promise; log the listings once it resolves.
amazon.getListings().then(console.log);
walmart.getListings().then(console.log);
ebay.getListings().then(console.log);
homeDepot.getListings().then(console.log);
googleShopping.getListings().then(console.log);
```

You can load the dataset from [Kaggle](https://www.kaggle.com/datasets/mykhailozub/3000-microwave-ovens-from-popular-e-commerce-sites)

submitted by /u/Character_Equal_2732

[self-promotion] 7500 Hotels From Airbnb, Booking, And Hotels.com

I made a hotel parser in JS (hotels-scraper-js) and tested it for usefulness. Here’s the resulting dataset, if you’re interested. For tests, I chose 5 European capitals: Berlin, London, Madrid, Paris, and Rome, with 500 hotels from each site for each city. (In theory, there should be 500, but in practice there are not always that many free rooms on the selected dates, so the results may be slightly fewer.) You can get the hotel data you need with this parser. Check the docs for more detail; it’s very simple, like:

```javascript
import { airbnb, booking, hotelsCom } from "hotels-scraper-js";

airbnb.getHotels("YOUR_SEARCH_PARAMS").then(console.log);
booking.getHotels("YOUR_SEARCH_PARAMS").then(console.log);
hotelsCom.getHotels("YOUR_SEARCH_PARAMS").then(console.log);
```

You can load the dataset from Kaggle

submitted by /u/Character_Equal_2732

3.1M BuzzFeed News “Trending” Headlines 2018–2023

BuzzFeed Inc. has shut down BuzzFeed News, but this data set captures the editor-selected headlines that were featured on its website (a script ran every five minutes for years, capturing this data).

This file contains 3.1 million rows, each representing one headline article observed at one point in time: https://app.gigasheet.com/spreadsheet/BuzzFeed-News–Trending–Strip–2018-2023/b8c8cff0_3227_4585_b86b_976d0a7410da

Note: this file includes duplicates.

– timestamp: The time (in UTC) of the fetch. All articles from the same fetch will have the same timestamp.
– position: The article’s zero-indexed position in the trending strip, from left to right.
– text: The text of the link used to highlight the article. Note: Sometimes the same article is associated with different text at different points in time.
– url: The link’s URL. Note: Sometimes (although relatively rarely) the URL for the same underlying article changes over time.
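
Since the file includes duplicates, one way to work with it is to collapse the rows into one record per URL. The sketch below is only an illustration: it assumes the rows are already parsed into objects with the four fields listed above, and that `timestamp` is an ISO 8601 string (so plain string comparison orders it chronologically). The function name `summarizeByUrl` and the sample rows are hypothetical.

```javascript
// Collapse repeated observations into one summary record per URL.
function summarizeByUrl(rows) {
  const byUrl = new Map();
  for (const row of rows) {
    const seen = byUrl.get(row.url);
    if (!seen) {
      byUrl.set(row.url, {
        url: row.url,
        texts: new Set([row.text]), // the same article can appear with different link text
        firstSeen: row.timestamp,
        lastSeen: row.timestamp,
        observations: 1,
      });
    } else {
      seen.texts.add(row.text);
      // Assumes ISO 8601 timestamps, which sort correctly as strings.
      if (row.timestamp < seen.firstSeen) seen.firstSeen = row.timestamp;
      if (row.timestamp > seen.lastSeen) seen.lastSeen = row.timestamp;
      seen.observations += 1;
    }
  }
  return [...byUrl.values()];
}

// Hypothetical sample rows; real rows come from the file linked above.
const sample = [
  { timestamp: "2020-01-01T00:00:00Z", position: 0, text: "Headline A", url: "https://example.com/a" },
  { timestamp: "2020-01-01T00:05:00Z", position: 1, text: "Headline A (updated)", url: "https://example.com/a" },
  { timestamp: "2020-01-01T00:05:00Z", position: 0, text: "Headline B", url: "https://example.com/b" },
];
console.log(summarizeByUrl(sample));
```

Grouping by `url` will still split an article whose URL changed over time, so treat the result as an approximation.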

Data source: https://github.com/jsvine/buzzfeed-news-trending-strip/

submitted by /u/n1nja5h03s

Looking For Crop Yield Dataset Suitable For PCA

I’m looking for datasets suitable for PCA examples – ideally something like crop yields as a function of soil nutrients.

Back in 1998-1999, I took an applied statistics course and the instructor demonstrated PCA through a dataset on crop yields. (No I don’t remember if it was wheat or corn or something more specific.)

The set contained measurements of various soil nutrients across a field (potassium, calcium, sodium, phosphorus, nitrogen, etc.), and the idea was to perform PCA regression. If memory serves, the first PC looked like pH (e.g., all the cations had positive coefficients, while phosphorus and nitrogen had negative coefficients).
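
To make “the first PC” concrete, here is a rough sketch that estimates the first principal component’s coefficients by power iteration on the covariance matrix. This is a swapped-in, simplified technique, not the method from the course. It only centers the columns, so real nutrient data in different units would probably need standardizing first, and a tested statistics library is the better choice for actual work. The function name and the tiny input table are made up.

```javascript
// Estimate the first principal component (unit-length coefficient vector)
// of a table whose rows are samples and whose columns are variables.
function firstPrincipalComponent(rows, iterations = 500) {
  const n = rows.length;
  if (n < 2) throw new Error("need at least two rows");
  const d = rows[0].length;

  // Center each column on its mean.
  const means = Array.from({ length: d }, (_, j) =>
    rows.reduce((sum, r) => sum + r[j], 0) / n
  );
  const centered = rows.map((r) => r.map((v, j) => v - means[j]));

  // Sample covariance matrix (d x d).
  const cov = Array.from({ length: d }, (_, i) =>
    Array.from({ length: d }, (_, j) =>
      centered.reduce((sum, r) => sum + r[i] * r[j], 0) / (n - 1)
    )
  );

  // Power iteration: repeatedly multiply by the covariance matrix and renormalize.
  // The uneven start vector (1, 2, 3, ...) avoids the obvious case where a
  // uniform start is orthogonal to the answer; it is not a general guarantee.
  let v = Array.from({ length: d }, (_, j) => j + 1);
  for (let k = 0; k < iterations; k++) {
    const w = cov.map((row) => row.reduce((sum, c, j) => sum + c * v[j], 0));
    const norm = Math.hypot(...w);
    if (norm === 0) break; // no variance along the current direction
    v = w.map((x) => x / norm);
  }
  return v;
}

// Made-up measurements: two columns that rise together.
const coefficients = firstPrincipalComponent([[1, 2], [2, 4], [3, 6]]);
console.log(coefficients);
```

The overall sign of the returned vector is arbitrary in PCA; what matters is which variables share a sign and which have opposite signs.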

I’ve looked through a bunch of dataset archives with no luck. If anyone knows this source, or something similar, I’d be really grateful. Thanks in advance for any help.

submitted by /u/geoffh2016

What Are Some Good Publicly Available Real-time Data Sources?

This is a cross-post from r/dataengineering via recommendation from the comments there.

I am trying to curate and crowdsource a list of real-time datasets and sources into an “Awesome List” in a GitHub repo – https://github.com/bytewax/awesome-public-real-time-datasets. It is something I found difficult when building hobby projects or trying to learn about streaming data.

If you have any recommendations please share a link in the comments or open a PR in the repo :).

submitted by /u/math-bw

List Of Code Generation Datasets (open Source)

I’ve compiled a list of datasets that can be used to train LLMs to generate code from text. Let me know if there is any dataset that I’ve missed!

WikiSQL

A large crowd-sourced dataset for developing natural language interfaces for relational databases.

TheVault

The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.

CodeContests

CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode.

The Pile

The Pile is an 825 GiB diverse, open-source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

CodeSearchNet

The CodeSearchNet corpus is a dataset of 2 million (comment, code) pairs from open-source libraries hosted on GitHub. It contains code and documentation for several programming languages.

GitHub Code

The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1 TB of data.

MBPP

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on.

CodeXGLUE

A collection of code intelligence tasks and a platform for model evaluation and comparison.

submitted by /u/04RR

Dataset Of Every Electronic Device Of The Past 20 Years

Hi there,

I am looking for a dataset of every electronic device of the past 20 years: think of appliances like radios, televisions, phones, printers, baby cameras, and so on. The dataset needs to contain technical specifications like power output/input etc. and EAN numbers.

Does anyone know how I can obtain such a dataset? (If it does not exist, I can scrape it, as long as I have a reliable source.)

I would be able to share the dataset if someone gives me instructions on how to get the data.

Best regards

submitted by /u/mmnagra31

Where Can I Find Resources To Make A MIDI Dataset Of Guitar Tablatures?

I’m making a dataset of MIDI files of guitar accompaniment to various melodies. Ultimate Guitar is the only resource I have found for this so far, but it doesn’t actually provide MIDI files for its tabs; in the Pro plan, it provides tab sheets, and those must be converted to MIDI with transcription software. If no such dataset already exists, where could I scrape for these MIDIs?

submitted by /u/Outrageous_Signal_48

289k Medium Articles At Your Fingertips! 🚀

Hello everyone! 👋

I’m thrilled to share an exciting update with all of you today. We’ve just completed a remarkable data project, and the result is nothing short of extraordinary. Introducing our colossal dataset of 289k Medium Articles! 🎉🔥

Dataset Overview:

This incredible collection is the culmination of our meticulous efforts, as we scoured 35 different publications, capturing the evolution of their articles from inception to 26 May 2023. Imagine the vast wealth of knowledge waiting to be explored!

What’s in the Dataset?

Contained within a convenient 1.7GB zip file, the dataset is organized into 35 folders, each corresponding to a specific Medium publication. Dive into these folders, and you’ll discover thousands of JSON files packed with article-related information, including titles, authors, word counts, reading times, claps, comments, publication details, and much more. It’s a data enthusiast’s dream come true! 🤓💡

Unleashing the Power of Metadata:

But wait, there’s more! We’ve gone the extra mile to provide you with comprehensive metadata for each article. From the text itself to markups, embeds, links, and other contextual information, this dataset empowers you to delve into the nuances of content and unlock deeper insights. The possibilities are endless! 📄🔍✨

Fueling Research and Innovation:

Whether you’re a data scientist, a researcher, or an innovator, this dataset is a game-changer. It opens up new avenues for groundbreaking research in natural language processing, content analysis, user behavior patterns, and more. Let your curiosity run wild and see where this treasure trove of knowledge takes you! 🚀🔬💥

How to Get Access:

If you’re as excited as we are about this dataset, we’d love to share it with you. Simply reach out to us at [nishu@mediumapi.com](mailto:nishu@mediumapi.com), and our team will guide you through the process of obtaining this invaluable resource. Let’s embark on a journey of discovery together! 📧💻

Responsible Data Usage:

With great data comes great responsibility. We kindly request that all users utilize this dataset strictly for research purposes and in accordance with Medium’s terms and conditions. Let’s maintain ethical data practices and respect the intellectual property rights of content creators. 🙏🔒

Join the Knowledge Revolution:

We believe that knowledge should be shared and accessible to all. This dataset represents a major step toward democratizing information and fostering innovation. Together, we can push the boundaries of what’s possible and create a brighter future. Join us on this thrilling adventure! 🌍💪💡

Let’s ignite a spark of discovery, unravel hidden insights, and propel the world of research and innovation forward. Reach out, grab your slice of this remarkable dataset, and embark on a journey that will redefine the limits of knowledge!

submitted by /u/medium-api