Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

London Stock Exchange Daily Prices Wanted

I am looking for some historic stock exchange prices to analyse. I notice a few sites seem to have them for sale, but does anyone know of any open source or community-created ones? I’d prefer the LSE, but any stock exchange would do for a first look.

I would like a dataset of about 10 years’ worth of daily prices, for 100 or more stocks. The smaller sets I have seen tend to have values for opening, closing, low, high and volume.

I want to try some trading strategies on historic data.
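As a rough idea of what trying a strategy on historic daily prices can look like, here is a minimal sketch of a moving-average crossover backtest on closing prices. The price list, the window lengths and the function names are all made up for illustration, and the sketch ignores trading costs, dividends and slippage.

```python
from statistics import mean

def sma(values, window):
    """Simple moving average; None until `window` values are available."""
    return [
        mean(values[i - window + 1 : i + 1]) if i >= window - 1 else None
        for i in range(len(values))
    ]

def backtest_crossover(closes, fast=3, slow=5):
    """Hold the stock on any day whose previous close had fast SMA > slow SMA.

    Returns the growth factor of one unit of capital (1.0 means break even).
    """
    fast_ma, slow_ma = sma(closes, fast), sma(closes, slow)
    equity = 1.0
    for day in range(1, len(closes)):
        # The signal only uses data up to yesterday, to avoid look-ahead bias.
        f, s = fast_ma[day - 1], slow_ma[day - 1]
        if f is not None and s is not None and f > s:
            equity *= closes[day] / closes[day - 1]
    return equity

# Hypothetical closing prices, for illustration only.
closes = [100, 101, 102, 104, 103, 105, 107, 106, 108, 110]
result = backtest_crossover(closes)
print(f"growth factor: {result:.4f}")
```

With real data, `closes` would be one stock's closing-price column in date order; the other fields (open, high, low, volume) are not needed for this particular rule.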

submitted by /u/brainburger

Fed Funds Rate (FFR) Futures Historical Data?

I have a nice little ipynb doing data analysis on Fed funds futures rates. However, the available data is rolling, and I haven’t been logging results to a DB to save them for myself.

Is there a way I can access all historical FFR data?

For reference, this is what I’m using: https://www.cmegroup.com/markets/interest-rates/cme-fedwatch-tool.html?redirect=/trading/interest-rates/countdown-to-fomc.html

Edit: I’m using automation to scrape all the files from the “download” link/tab on the left.
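Since the source only exposes a rolling window, one way to build up a history going forward is to append every scraped snapshot to a small database. The sketch below uses SQLite; the table name, the columns and the sample rows are assumptions for illustration and do not reflect the layout of the downloaded files.

```python
import sqlite3
from datetime import datetime, timezone

def init_db(conn):
    # Hypothetical schema: one row per contract month per fetch.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ffr_snapshots (
               fetched_at TEXT NOT NULL,
               contract   TEXT NOT NULL,
               price      REAL NOT NULL,
               PRIMARY KEY (fetched_at, contract)
           )"""
    )

def save_snapshot(conn, rows, fetched_at=None):
    """Store (contract, price) pairs; re-saving the same fetch is a no-op."""
    fetched_at = fetched_at or datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT OR IGNORE INTO ffr_snapshots VALUES (?, ?, ?)",
        [(fetched_at, contract, price) for contract, price in rows],
    )
    conn.commit()
    return fetched_at

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist
init_db(conn)
sample = [("2023-03", 95.35), ("2023-04", 95.10)]  # made-up values
save_snapshot(conn, sample, fetched_at="2023-03-01T00:00:00+00:00")
save_snapshot(conn, sample, fetched_at="2023-03-01T00:00:00+00:00")
row_count = conn.execute("SELECT COUNT(*) FROM ffr_snapshots").fetchone()[0]
print(row_count)
```

The composite primary key together with `INSERT OR IGNORE` means a scraper can be rerun without creating duplicate rows.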

submitted by /u/throwawayrandomvowel

Dataset Of Examples Of Logical Fallacies?

I’m working on a project that is going to require a dataset of logical fallacies (and their classification). This has been quite a tricky task, and so far I have come across just one, linked to the paper “Logical Fallacy Detection” by Z Jin (2022). If anyone is aware of any other examples or possible websites to scrape, that would help. Thanks!

submitted by /u/CrossingPearl

Trying To Create A Spam Voicemail Dataset

Hey guys, I am working on a project to help predict if a voicemail is spam! I am building the dataset, and I have around 300 voicemails, almost half are spam and the others are not. I want to create a dataset of at least 500-1000 voicemails.

So I am requesting that anyone share their spam voicemails and/or normal voicemails (which can be non-personal). It can be in any audio format and shared however you are comfortable with!

submitted by /u/thebatgamer

English Words “familiarity” Dataset.

There are plenty of word frequency lists, but plurals, adjectives and adverbs of the same word end up in different positions in these lists.

I’m looking for a dataset, or a way to create a dataset, that has all forms of one word clumped together, so it’s less about frequency and more about how familiar the word (and its different forms) is, if that makes sense.

For instance, I have a list where the word “have” is in 25th place, “has” in 39th and “had” in 105th. Clearly, anyone who knows one of these words would know the other two as well.

Apologies if I did not get my point across clearly. Any help is appreciated. Thanks!
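One way to read the request is as grouping a frequency list by base form and summing the counts. The sketch below does that with a tiny hand-written mapping; the counts and the `lemma_of` table are made up for illustration, and in practice the mapping would have to come from a lemmatizer or a morphological dictionary.

```python
from collections import defaultdict

# Made-up counts standing in for a real word frequency list.
freq = {"have": 5000, "has": 3000, "had": 1200, "cat": 400, "cats": 150}

# Hypothetical form -> base form table; words not listed map to themselves.
lemma_of = {"has": "have", "had": "have", "cats": "cat"}

def group_by_lemma(freq, lemma_of):
    """Sum the counts of all forms of a word and rank the base forms."""
    totals = defaultdict(int)
    for word, count in freq.items():
        totals[lemma_of.get(word, word)] += count
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

ranking = group_by_lemma(freq, lemma_of)
print(ranking)
```

Summed counts are only a proxy for familiarity, but they do put “have”, “has” and “had” in a single position.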

submitted by /u/haskpro1995

Stanford Cars (cars196) Contains Many Fine-Grained Errors

Hey Redditors,

I know the cars196 dataset is nothing new, but I wanted to share some label errors and outliers that I found within it.

It’s interesting to note that the primary goal of the original paper that curated and used this dataset was “fine-grained categorization”, meaning discerning the differences between something like a Chevrolet Cargo Van and a GMC Cargo Van. I found numerous images with very nuanced mislabelling, which runs directly counter to the task the authors set out to research.

Here are a few examples of nuanced label errors that I found:

- Audi TT RS Coupe labeled as an Audi TT Hatchback
- Audi S5 Convertible labeled as an Audi RS4
- Jeep Grand Cherokee labeled as a Dodge Durango

I also found examples of outliers and generally ambiguous images:

- multiple cars in one image
- top-down style images
- vehicles that didn’t belong to any classes

I found these issues to be pretty interesting, yet I wasn’t surprised. It’s pretty well known that many common ML datasets exhibit thousands of errors.

If you’re interested in how I found them, feel free to read about it here.

submitted by /u/cmauck10

Which Templates Do You Like The Most?

This is for all data enthusiasts out there!
We’ve launched Plug & Play Data Templates on Product Hunt today! 🥳
Our data templates are step-by-step walkthroughs for 50+ use cases with pre-baked, interactive SQL queries, covering 5 critical categories: Product Analytics, Customer Analytics, Sales Analytics, Marketing Analytics and Finance Analytics.
Please check us out here! 👉🏻 https://www.producthunt.com/posts/plug-play-data-templates

submitted by /u/AirbookIO

BuzzFeed News “Trending” Strip, 2018–2023

The file contains 3.1 million rows, each representing one article observed at one point in time.

The file uses these columns:

- timestamp: The time (in UTC) of the fetch. All articles from the same fetch will have the same timestamp.
- position: The article’s zero-indexed position in the trending strip, from left to right.
- text: The text of the link used to highlight the article. Note: Sometimes the same article is associated with different text at different points in time.
- url: The link’s URL. Note: Sometimes (although relatively rarely) the URL for the same underlying article changes over time.

Note: Although the script generally ran every five minutes, there are some gaps in the data, accounting for roughly 3% of the total time period covered. These gaps owe to two main factors: technical complications (such as server downtime) and periods during which the website swapped out the trending strip with breaking news alerts, single-story highlights, or other notices. Unfortunately, I did not have the foresight to collect data that would distinguish between those scenarios.
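For anyone who wants to locate those gaps, a minimal sketch follows. It assumes the `timestamp` column has already been parsed into `datetime` objects; the one-minute tolerance and the sample fetch times are made up for illustration.

```python
from datetime import datetime, timedelta

FIVE_MIN = timedelta(minutes=5)
ONE_MIN = timedelta(minutes=1)

def find_gaps(timestamps, expected=FIVE_MIN, tolerance=ONE_MIN):
    """Return (previous_fetch, next_fetch) pairs that are too far apart."""
    fetches = sorted(set(timestamps))  # many rows share one fetch timestamp
    return [
        (prev, cur)
        for prev, cur in zip(fetches, fetches[1:])
        if cur - prev > expected + tolerance
    ]

def missing_fraction(timestamps, expected=FIVE_MIN, tolerance=ONE_MIN):
    """Share of the covered period that falls inside gaps."""
    fetches = sorted(set(timestamps))
    if len(fetches) < 2:
        return 0.0
    gaps = find_gaps(fetches, expected, tolerance)
    missing = sum(((cur - prev) - expected for prev, cur in gaps), timedelta())
    return missing / (fetches[-1] - fetches[0])

# Hypothetical fetch times: a 20-minute hole between 00:10 and 00:30.
start = datetime(2018, 1, 1)
sample = [start + timedelta(minutes=m) for m in (0, 0, 5, 10, 30, 35)]
gaps = find_gaps(sample)
fraction = missing_fraction(sample)
print(gaps, round(fraction, 3))
```

This only finds where the gaps are. As noted above, the data cannot tell downtime apart from periods when the strip was replaced.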

submitted by /u/brianckeegan

Looking For Feedback On The New Standards, Data Sources And Methods Hub / Tell Us What You Think Of The Standards, Data Sources And Methods Hub [self-promotion]

Statistics Canada added new features to enhance the overall data user experience on the Standards, Data Sources and Methods Hub. With its improved design, new frequently asked questions section and quick access links to resources, the hub is meant to be a one-stop shop for data users, statisticians and others for:

- variables and classifications
- survey methodology
- key aspects of data quality
- direct access to questionnaires

Explore the hub and tell us what you think, so we can make sure this page meets your needs!

Visit the Standards, Data Sources and Methods Hub.

[We are Canada’s national statistical agency. We are here to engage with Canadians and provide them with high-quality statistical information that matters! Publishing in a subreddit does not imply we endorse the content posted by other redditors.]

***

Improvements have been made to Statistics Canada’s Standards, Data Sources and Methods Hub to make the user experience friendlier. With its improved design, new Frequently Asked Questions section and quick access links to resources, the hub aims to be a one-stop shop for data users, statisticians and others, who will find everything they need on:

- variables and classifications
- survey methodology
- data quality
- direct access to questionnaires

Explore the hub and tell us what you think; we want to make sure it meets your needs!

Standards, Data Sources and Methods Hub.

[We are Canada’s national statistical agency. We are here to talk with Canadians and provide them with high-quality statistical information that matters! Posting in a subreddit does not mean that we endorse the content posted by other Reddit users.]

submitted by /u/StatCanada

NBA March Madness Or Other Internal-to-company Sports-prediction Related Datasets

I want to answer a question: do companies that run sports-guessing competitions make good predictions in aggregate, even though many people in the organisation don’t care about sports at all and just pick at random?

What I’m looking for is the data from a company that ran a tipping competition for some sports competition where I can analyse the answers. (e.g. if it was the NBA, I can look that up; if it was your internal squash competition, that’s OK as long as it has who won, as well as the predictions of who would win.)

In return I’ll hopefully be able to confirm that there’s a way of maximising your total return on next year’s competition.
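To make the question concrete, here is a minimal sketch of the comparison such data would allow: the accuracy of the majority pick per game against the average accuracy of individual entrants. The names, games and picks are invented for illustration, and majority voting is only one of several ways to aggregate.

```python
from collections import Counter

def crowd_vs_individuals(predictions, results):
    """Return (majority-vote accuracy, mean individual accuracy)."""
    games = list(results)
    individual = [
        sum(picks[g] == results[g] for g in games) / len(games)
        for picks in predictions.values()
    ]
    crowd_hits = 0
    for g in games:
        votes = Counter(picks[g] for picks in predictions.values())
        consensus = votes.most_common(1)[0][0]  # ties go to the first pick seen
        crowd_hits += consensus == results[g]
    return crowd_hits / len(games), sum(individual) / len(individual)

# Hypothetical tipping competition: three entrants, three games.
results = {"g1": "A", "g2": "B", "g3": "A"}
predictions = {
    "alice": {"g1": "A", "g2": "B", "g3": "B"},
    "bob":   {"g1": "A", "g2": "A", "g3": "A"},
    "carol": {"g1": "B", "g2": "B", "g3": "A"},
}
crowd_acc, mean_acc = crowd_vs_individuals(predictions, results)
print(crowd_acc, round(mean_acc, 3))
```

In this toy case each entrant gets two of three games right while the majority pick is right every time, which is the kind of effect the real data would confirm or refute.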

submitted by /u/solresol

Thesis Help Statistically Significant Data

Hey guys, I need some help. I’m using Statista and working on my thesis.

I have data that’s 1 and 0 (present and not present) and I’m trying to figure out if the data is statistically significant, but it doesn’t follow a normal distribution. I’m not sure what to do. Any help would be appreciated.

submitted by /u/Kapotter

What Are You Using Your Datasets Actually For?

Hi, my apologies for possibly asking the wrong questions here. I am a total newbie to all things machine learning, have just discovered Kaggle and such, and I’m a bit stuck with a silly question: I’ve discovered like a million different datasets on there, but I’m just wondering how people are putting these sets to good use. For instance, there’s a big dataset about the Titanic. I can’t fathom a realistic use case where this dataset would prove to be useful. I guess I don’t understand just yet where the ‘machine learning’ aspect in datasets like these comes into play. What is it exactly you are predicting with these?

Can somebody please enlighten me as to what I’m obviously missing here? I really want to know.

Thank you

submitted by /u/VHS124

What Open-source Dataset Tagging/storage Solutions Are Out There?

I am having trouble finding this: what do people use to store and create these datasets? Not as in ‘JSON’ or relational/non-relational databases, but is there a popular project that streamlines all of this, or should I write my own?

I am a software developer, so the scraping and storing of data isn’t an issue. What I don’t want to do is re-invent the wheel. I am just starting to get into this generation of AI tech.

I’d like to find something that can take in data like images and text with ‘tagged’ context for fine-tuning AI models. Something I can write scrapers and parsers for, add to a database, then export as training data sets.

Like I said, I am about to just write my own stuff to do this, but I feel like this is a common enough problem that I should just use whatever the popular kids are using these days. Trouble is, I am just not finding the right words to search.

So does this exist? Am I overcomplicating this?
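For scale, the “write my own” route described above can be quite small. The sketch below stores items with free-form tags in SQLite and exports one tag’s items as JSON Lines; the schema, the function names and the sample items are assumptions for illustration, not the design of any existing tool.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist
conn.executescript(
    """
    CREATE TABLE items (
        id      INTEGER PRIMARY KEY,
        kind    TEXT NOT NULL,      -- e.g. 'text' or 'image'
        content TEXT NOT NULL       -- the text itself, or a path to the file
    );
    CREATE TABLE tags (
        item_id INTEGER NOT NULL REFERENCES items(id),
        tag     TEXT NOT NULL,
        PRIMARY KEY (item_id, tag)
    );
    """
)

def add_item(conn, kind, content, tags):
    """Insert one scraped item together with its tags."""
    cur = conn.execute(
        "INSERT INTO items (kind, content) VALUES (?, ?)", (kind, content)
    )
    conn.executemany(
        "INSERT INTO tags VALUES (?, ?)", [(cur.lastrowid, t) for t in tags]
    )
    conn.commit()
    return cur.lastrowid

def export_jsonl(conn, tag):
    """Return every item carrying `tag` as JSON Lines, one object per line."""
    items = conn.execute(
        "SELECT i.id, i.kind, i.content FROM items i "
        "JOIN tags t ON t.item_id = i.id WHERE t.tag = ? ORDER BY i.id",
        (tag,),
    ).fetchall()
    lines = []
    for item_id, kind, content in items:
        tag_rows = conn.execute(
            "SELECT tag FROM tags WHERE item_id = ? ORDER BY tag", (item_id,)
        ).fetchall()
        record = {"kind": kind, "content": content, "tags": [r[0] for r in tag_rows]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

add_item(conn, "text", "A tabby cat asleep on a sofa.", ["cat", "indoor"])
add_item(conn, "image", "images/dog_001.jpg", ["dog", "outdoor"])
exported = export_jsonl(conn, "cat")
print(exported)
```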

submitted by /u/drywallfan

Looking For Business Budget History.

Hi all, for a school project I’m looking for a dataset containing business budgets for many companies over the last 10-20 years. We’re Italian, so we would appreciate it if some Italian companies appear in the dataset. Thanks in advance to anyone who can help.

submitted by /u/niger4

Dataset Containing Informal/formal Text?

Does anyone know of a publicly available dataset in any language containing formal discursive text along with a “parallel”, less formal text or know of any place where one can create such a dataset (like English Wikipedia articles and corresponding Simple Wikipedia articles)? The GYAFC dataset (Rao et al. 2018) is similar to what I’m looking for.

submitted by /u/geartrains

How Frequently Is Commoncrawl Data Updated, And What Is Its Coverage Level?

How often is Commoncrawl updated? On a daily cadence? Or weekly/monthly? If Meghan Markle wears a Versace gown, that becomes a BBC article, and that article shows up when Googling “meghan markle” 2-3 minutes after the BBC publishes it. What is the equivalent time for CC?
And secondly, is there a place where I can see CC’s coverage level? I mean: which websites they cover fully, which ones they cover partially, whether they cover reuters.com at all, how much of vice.com they cover, etc.?

submitted by /u/Attitudemonger

Looking For VR Anatomy Learning Dataset

Hi everyone, I’m looking for the VR Anatomy Learning Dataset. This dataset was collected by researchers from the University of Glasgow and contains data on the use of virtual reality for teaching human anatomy. The dataset includes performance data, survey responses, and other metrics related to the effectiveness of virtual reality in anatomy education. Kindly let me know where to find the dataset; any research papers (or website links) on this topic would also be very helpful.

submitted by /u/AbrarHussain-1234

Looking For A Dataset For Live Broadcasting Sports Online Platform

Hi everyone, new here. I need help with a dataset for a school project. I’m required to generate test data (a mock dataset) of web server logs in an Excel file/CSV. The dataset should include the following columns: country, time-stamp, IP address, status, URL, status code, number of website visits, content/sports viewed. The list should include different sports, reflected in the URL, e.g. /athletics/videos/200m-final.jpg (minimum of 3000 entries). Please help.
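Mock data of this shape can be generated with a short script. The sketch below is one possible way, using only the Python standard library; the sports, clip names, country codes and column names are invented for illustration, and “status” is assumed to mean the text form of the status code.

```python
import csv
import io
import random
from datetime import datetime, timedelta, timezone

# Invented sample values; extend these lists as needed.
SPORTS = {
    "athletics": ["200m-final", "marathon", "high-jump"],
    "football": ["cup-final", "highlights"],
    "tennis": ["semi-final", "doubles"],
}
COUNTRIES = ["KE", "GB", "US", "JP", "BR"]
STATUS = {200: "OK", 304: "Not Modified", 404: "Not Found", 500: "Internal Server Error"}
COLUMNS = ["country", "timestamp", "ip_address", "status", "url",
           "status_code", "visits", "sport"]

def make_rows(n, seed=0):
    """Generate n fake log entries; the same seed gives the same rows."""
    rng = random.Random(seed)
    moment = datetime(2023, 1, 1, tzinfo=timezone.utc)
    rows = []
    for _ in range(n):
        moment += timedelta(seconds=rng.randint(1, 120))
        sport = rng.choice(sorted(SPORTS))
        clip = rng.choice(SPORTS[sport])
        # Mostly successful requests, with a few errors mixed in.
        code = rng.choices(list(STATUS), weights=[85, 5, 8, 2])[0]
        rows.append({
            "country": rng.choice(COUNTRIES),
            "timestamp": moment.isoformat(),
            "ip_address": ".".join(str(rng.randint(1, 254)) for _ in range(4)),
            "status": STATUS[code],
            "url": f"/{sport}/videos/{clip}.jpg",
            "status_code": code,
            "visits": rng.randint(1, 50),
            "sport": sport,
        })
    return rows

def to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = make_rows(3000)
csv_text = to_csv(rows)
print(csv_text.splitlines()[0])
```

Writing `csv_text` to a file ending in `.csv` gives something Excel can open directly.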

submitted by /u/byron_0001