Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

English Words “familiarity” Dataset.

There are plenty of Word Frequency lists but plurals, adjectives, adverbs of the same word end up in different positions in these lists.

I’m looking for a dataset, or a way to create one, that has all forms of one word clumped together, so it’s less about frequency and more about how familiar the word (and its different forms) is, if that makes sense.

For instance, I have a list where the word “have” is in 25th place, “has” in 39th and “had” in 105th. Clearly, anyone who knows one of these words would know the other two as well.
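One way to read this request: collapse inflected forms onto a shared base form and rank each group by its best-ranked member. Here is a minimal sketch, assuming a hand-written form-to-lemma table (`LEMMA_OF` is hypothetical; a real run would need a lemmatizer or a morphological lexicon) and using only the three ranks quoted above:

```python
# Hypothetical lookup from inflected form to base form (lemma).
# In practice this table would come from a lemmatizer or a lexicon.
LEMMA_OF = {"have": "have", "has": "have", "had": "have"}

# Ranks quoted in the post; a lower rank means a more frequent word.
RANKS = {"have": 25, "has": 39, "had": 105}


def group_by_lemma(ranks, lemma_of):
    """Map each lemma to (best rank among its forms, sorted list of forms)."""
    groups = {}
    for word, rank in ranks.items():
        lemma = lemma_of.get(word, word)  # unknown words stand for themselves
        best, forms = groups.get(lemma, (rank, []))
        groups[lemma] = (min(best, rank), forms + [word])
    return {lemma: (best, sorted(forms)) for lemma, (best, forms) in groups.items()}


familiarity = group_by_lemma(RANKS, LEMMA_OF)
print(familiarity)
```

Taking the best rank is only one possible choice for the group score; summing the raw frequencies of all forms would be another, if the list provides counts rather than ranks.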

Apologies if I did not get my point across clearly. Any help is appreciated. Thanks!

submitted by /u/haskpro1995

Stanford Cars (cars196) Contains Many Fine-Grained Errors

Hey Redditors,

I know the cars196 dataset is nothing new, but I wanted to share some label errors and outliers that I found within it.

It’s interesting to note that the primary goal of the original paper that curated/used this dataset was “fine-grained categorization”, meaning discerning the differences between something like a Chevrolet Cargo Van and a GMC Cargo Van. I found numerous images with very nuanced mislabelling, which runs directly counter to the task they sought to research.

Here are a few examples of nuanced label errors that I found:

- Audi TT RS Coupe labeled as an Audi TT Hatchback
- Audi S5 Convertible labeled as an Audi RS4
- Jeep Grand Cherokee labeled as a Dodge Durango

I also found examples of outliers and generally ambiguous images:

- multiple cars in one image
- top-down style images
- vehicles that didn’t belong to any classes

I found these issues to be pretty interesting, yet I wasn’t surprised. It’s pretty well known that many common ML datasets exhibit thousands of errors.

If you’re interested in how I found them, feel free to read about it here.

submitted by /u/cmauck10

Which Templates Do You Like The Most?

This is for all data enthusiasts out there!
We’ve launched Plug & Play Data Templates on Product Hunt today! 🥳
Our data templates are step-by-step walkthroughs for 50+ use cases with pre-baked, interactive SQL queries, covering 5 critical categories: Product Analytics, Customer Analytics, Sales Analytics, Marketing Analytics and Finance Analytics.
Please check us out here! 👉🏻 https://www.producthunt.com/posts/plug-play-data-templates

submitted by /u/AirbookIO

BuzzFeed News “Trending” Strip, 2018–2023

The file contains 3.1 million rows, each representing one article observed at one point in time.

The file uses these columns:

- timestamp: The time (in UTC) of the fetch. All articles from the same fetch will have the same timestamp.
- position: The article’s zero-indexed position in the trending strip, from left to right.
- text: The text of the link used to highlight the article. Note: Sometimes the same article is associated with different text at different points in time.
- url: The link’s URL. Note: Sometimes (although relatively rarely) the URL for the same underlying article changes over time.

Note: Although the script generally ran every five minutes, there are some gaps in the data, accounting for roughly 3% of the total time period covered. These gaps owe to two main factors: technical complications (such as server downtime) and periods during which the website swapped out the trending strip with breaking news alerts, single-story highlights, or other notices. Unfortunately, I did not have the foresight to collect data that would distinguish between those scenarios.
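Since every row from one fetch shares a timestamp, the gaps described above can be located by comparing consecutive distinct timestamps. The following is a rough sketch on a tiny made-up sample; the ISO timestamp format and the example rows are assumptions, not taken from the real file:

```python
import csv
import io
from datetime import datetime, timedelta

# Tiny invented sample in the column layout described above.
# The timestamp format is an assumption; the real file may differ.
SAMPLE = """timestamp,position,text,url
2018-01-01T00:00:00,0,Story A,https://example.com/a
2018-01-01T00:00:00,1,Story B,https://example.com/b
2018-01-01T00:05:00,0,Story A,https://example.com/a
2018-01-01T00:30:00,0,Story C,https://example.com/c
"""


def find_gaps(rows, expected=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive fetches are further apart than expected."""
    fetches = sorted({datetime.fromisoformat(row["timestamp"]) for row in rows})
    return [(a, b) for a, b in zip(fetches, fetches[1:]) if b - a > expected]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
gaps = find_gaps(rows)
for start, end in gaps:
    print(f"gap from {start} to {end} ({end - start})")
```

On real data the fetch times probably drift by a few seconds, so a slightly larger `expected` value (say six minutes) may be a safer threshold.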

submitted by /u/brianckeegan

Looking For Feedback On The New Standards, Data Sources And Methods Hub [self-promotion]

Statistics Canada added new features to enhance the overall data user experience on the Standards, Data Sources and Methods Hub. With its improved design, new frequently asked question section and quick access links to resources, the hub is meant to be a one-stop shop for data users, statisticians and others for:

- variables and classifications
- survey methodology
- key aspects of data quality
- direct access to questionnaires.

Explore the hub and tell us what you think, so we can make sure this page meets your needs!

Visit the Standards, Data Sources and Methods Hub.

[We are Canada’s national statistical agency. We are here to engage with Canadians and provide them with high-quality statistical information that matters! Publishing in a subreddit does not imply we endorse the content posted by other redditors.]

submitted by /u/StatCanada

NBA March Madness Or Other Internal-to-company Sports-prediction Related Datasets

I’m wanting to answer a question about whether companies who run sports-guessing competitions make good predictions in aggregate, despite the fact that there will be many people in the organisation that don’t care about sports at all, and just pick at random.

What I’m looking for is the data from a company that ran a tipping competition for some sports competition where I can analyse the answers. (e.g. if it was the NBA, I can look that up; if it was your internal squash competition, that’s OK as long as it has who won, as well as the predictions of who would win.)
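To make the kind of analysis concrete, here is a small sketch of how the aggregate could be scored once such data exists. The participants, picks and winners are all invented, and "most common pick per game" is just one assumed way of defining the company-wide prediction:

```python
from collections import Counter

# Invented example: each participant's pick for three games, plus the actual winners.
PICKS = {
    "alice": ["A", "C", "E"],
    "bob": ["A", "D", "F"],
    "carol": ["B", "C", "F"],
}
WINNERS = ["A", "C", "F"]


def accuracy(picks, winners):
    """Fraction of games where the pick matched the actual winner."""
    return sum(p == w for p, w in zip(picks, winners)) / len(winners)


def crowd_picks(all_picks):
    """Most common pick per game across all participants (ties go to the first pick seen)."""
    return [Counter(game).most_common(1)[0][0] for game in zip(*all_picks.values())]


crowd = crowd_picks(PICKS)
crowd_accuracy = accuracy(crowd, WINNERS)
individual = {name: accuracy(p, WINNERS) for name, p in PICKS.items()}
print(crowd, crowd_accuracy, individual)
```

In this toy case each person gets two of three games right while the majority pick gets all three, which is the effect the question is about; whether it holds in real competitions with many random pickers is exactly what the requested data would show.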

In return I’ll hopefully be able to confirm that there’s a way of maximising your total return on next year’s competition.

submitted by /u/solresol

Thesis Help Statistically Significant Data

Hey guys I need some help. I’m using statista and working on my thesis.

I have data that’s 1 and 0 (present and not present) and I’m trying to figure out if the data is statistically significant but there isn’t a normal distribution. I’m not sure what to do. Any help would be appreciated.

submitted by /u/Kapotter

What Are You Using Your Datasets Actually For?

Hi, my apologies for possibly asking the wrong questions here. I am a total newbie to all things machine learning, have just discovered Kaggle and such, and I’m a bit stuck with a silly question: I’ve discovered like a million different datasets on there, but I’m just wondering how people are putting these sets to good use. For instance, there’s a big dataset about the Titanic. I can’t fathom a realistic use case where this dataset would prove to be useful. I guess I don’t understand just yet where the ‘machine learning’ aspect of datasets like these comes into play. What is it exactly you are predicting with these?

Can somebody please enlighten me as to what I’m obviously missing here? I really want to know.

Thank you

submitted by /u/VHS124

What Open-source Dataset Tagging/storage Solutions Are Out There?

I am having trouble finding this: what do people use to store and create these datasets? Not as in ‘JSON’ or relational/non-relational databases; is there a popular project that streamlines all of this, or should I write my own?

I am a software developer so the scraping and storing of data isn’t an issue, what I don’t want to do is re-invent the wheel. I am just starting to get into this generation of AI tech.

I’d like to find something that can take in data like images and text with ‘tagged’ context for fine-tuning AI models: something I can write scrapers and parsers for, feed into a database, and then export from as training datasets.

Like I said I am about to just write my own stuff to do this but I feel like this is a common enough problem that I should just use whatever the popular kids are using these days. Trouble is I am just not finding the right words to search.

So does this exist? Am I overcomplicating this?

submitted by /u/drywallfan

Looking For Business Budget History.

Hi all, for a school project I’m looking for a dataset containing business budgets for many companies over the last 10-20 years. We’re Italian, so we would appreciate it if some Italian companies appeared in the dataset. Thanks in advance to anyone who can help.

submitted by /u/niger4

Dataset Containing Informal/formal Text?

Does anyone know of a publicly available dataset in any language containing formal discursive text along with a “parallel”, less formal text or know of any place where one can create such a dataset (like English Wikipedia articles and corresponding Simple Wikipedia articles)? The GYAFC dataset (Rao et al. 2018) is similar to what I’m looking for.

submitted by /u/geartrains

How Frequently Is Commoncrawl Data Updated, And What Is Its Coverage Level?

How often is Commoncrawl updated? On a daily cadence? Or weekly/monthly? If Meghan Markle wears a Versace gown, that becomes a BBC article, and that article shows up on Googling “meghan markle” 2-3 minutes after the publishing of the article by BBC. What is the equivalent time for CC?
And secondly, is there a place where I can see CC’s coverage level? I mean: which websites they cover fully, which ones they cover partially, whether they cover reuters.com at all, how much of vice.com they cover, etc.

submitted by /u/Attitudemonger

Looking For VR Anatomy Learning Dataset

Hi everyone, I’m looking for the VR Anatomy Learning Dataset. This dataset was collected by researchers from the University of Glasgow and contains data on the use of virtual reality for teaching human anatomy. The dataset includes performance data, survey responses, and other metrics related to the effectiveness of virtual reality in anatomy education. Kindly let me know where to find the dataset. Any research papers (website links) on this topic would also be very helpful.

submitted by /u/AbrarHussain-1234

Looking For A Dataset For Live Broadcasting Sports Online Platform

Hi everyone, new here. I need help with a dataset for a school project. I’m required to generate test data (a mock dataset of web server logs) in an Excel file or CSV. The dataset should include the following columns: country, timestamp, IP address, status, URL, status code, number of website visits, and content/sports viewed. The list should include different sports, reflected in the URL, e.g. /athletics/videos/200m-final.jpg (minimum of 3000 entries). Please help.
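Mock data like this can be generated rather than found. Below is a minimal sketch using only the Python standard library. Every value in it is an invented placeholder, the column names are one possible spelling of the list above, and treating "status" as the text label for "status code" is an assumption about what the assignment means:

```python
import csv
import io
import random
from datetime import datetime, timedelta

# All values below are invented placeholders for test data.
COUNTRIES = ["KE", "US", "GB", "JP"]
SPORTS = {"athletics": "200m-final", "football": "cup-final", "swimming": "100m-freestyle"}
STATUS = {200: "OK", 404: "Not Found", 500: "Server Error"}


def make_rows(n, seed=0):
    """Build n fake web-server log rows; the seed keeps runs reproducible."""
    rng = random.Random(seed)
    start = datetime(2023, 1, 1)
    rows = []
    for _ in range(n):
        sport = rng.choice(list(SPORTS))
        code = rng.choice(list(STATUS))
        rows.append({
            "country": rng.choice(COUNTRIES),
            "timestamp": (start + timedelta(seconds=rng.randrange(86400 * 30))).isoformat(),
            "ip_address": ".".join(str(rng.randrange(1, 255)) for _ in range(4)),
            "status": STATUS[code],
            "url": f"/{sport}/videos/{SPORTS[sport]}.jpg",
            "status_code": code,
            "visits": rng.randrange(1, 50),
            "sport_viewed": sport,
        })
    return rows


rows = make_rows(3000)
buffer = io.StringIO()  # swap for open("logs.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue().splitlines()[0])
```

The resulting CSV opens directly in Excel. Uniform random choices will look artificial, so weighting the status codes towards 200 would make the logs more realistic.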

submitted by /u/byron_0001

There Was An IMDb Dataset On Kaggle That Had Detailed Ratings Breakdown Of All Movies And Was Later Removed, Since Then I Have Not Found Anything Like It.

Hello. I think it was around February 2020 that someone uploaded an amazing IMDb dataset titled “IMDb movies extensive dataset”. I still have the archive file, but I wanted to find a more recent one. I tried making it myself, but IMDb doesn’t provide their complete data for free: you can get the basic info, but what was really interesting to me was the breakdown data on ratings. Here are the columns from the “IMDB ratings.csv” file:

imdb_title_id,weighted_average_vote,total_votes,mean_vote,median_vote,votes_10,votes_9,votes_8,votes_7,votes_6,votes_5,votes_4,votes_3,votes_2,votes_1,allgenders_0age_avg_vote,allgenders_0age_votes,allgenders_18age_avg_vote,allgenders_18age_votes,allgenders_30age_avg_vote,allgenders_30age_votes,allgenders_45age_avg_vote,allgenders_45age_votes,males_allages_avg_vote,males_allages_votes,males_0age_avg_vote,males_0age_votes,males_18age_avg_vote,males_18age_votes,males_30age_avg_vote,males_30age_votes,males_45age_avg_vote,males_45age_votes,females_allages_avg_vote,females_allages_votes,females_0age_avg_vote,females_0age_votes,females_18age_avg_vote,females_18age_votes,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes

As you can see, it has some juicy information, such as breakdowns by age and gender and, most importantly for me, the top1000_voters columns, which I think are an extremely underrated data point that is rarely mentioned. They are very useful when you want to determine whether the rating of a movie is biased or not. I have noticed that a lot of highly rated Turkish and Indian movies especially have very biased ratings, and using top1000_voters you can find which ones.
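A rough sketch of that check, using three of the column names from the header above. The sample rows are invented, and both the "overall rating far above the top-1000 rating" rule and the 2.0 threshold are assumptions rather than an established method:

```python
import csv
import io

# Invented rows using three of the column names listed above; IDs and numbers are made up.
SAMPLE = """imdb_title_id,weighted_average_vote,top1000_voters_rating
tt0000001,8.9,5.1
tt0000002,7.4,7.1
tt0000003,6.0,6.3
"""


def flag_divergent(rows, threshold=2.0):
    """Return (title id, gap) where the overall rating exceeds the top-1000 rating by at least threshold."""
    flagged = []
    for row in rows:
        gap = float(row["weighted_average_vote"]) - float(row["top1000_voters_rating"])
        if gap >= threshold:
            flagged.append((row["imdb_title_id"], round(gap, 1)))
    return flagged


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = flag_divergent(rows)
print(flagged)
```

With the archived file, the same function could be fed from `csv.DictReader(open("IMDB ratings.csv", newline=""))`, assuming the header matches the one quoted above.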

Also, I was able to find interesting things, such as which movies and genres females prefer more than males (males are biased more towards westerns, while females are biased more towards the family genre).

So my question is: is it possible to get this info from IMDb without paying? I live in a third world country and have no credit card to my name. I love doing these types of exploratory analysis as a hobby, but I can’t pay IMDb the thousands that they are asking for, and for the life of me I can’t figure out how to web-scrape the data past IMDb’s anti-scraping systems.

Also, on a side note, it appears they have removed the detailed ratings breakdown from their website. You can only see how many people voted for each score, but not the breakdown by gender, age or even the top 1000 voters that was there before.

submitted by /u/NoHetro

Local Automotive Repair Shops Data On Repairs Performed

Hi everyone, I have a request for a dataset pertaining to automotive repairs.

I am voluntarily building a free application/platform that anyone can freely use anytime to help the public make informed decisions on where to take their motor vehicles for repairs. My interest in this comes from the fact that I love cars and I hate seeing people get ripped off. I’ve worked on countless cars and helped many people with free repairs. Specifically, this platform would allow users to search for nearby automotive repair shops and they would see a graphical summary view of the quantity of repairs any individual shop has done in a given period of time (X number of brake repairs, Y number of engine oil changes, Z number of front-end alignments, etc.). More features would be added with time but this is the starting point.
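As an illustration of the per-shop summary described above, here is a small sketch of the counting step behind such a chart. The record layout (shop name, repair type, completion date) and all sample values are hypothetical, since no real data source exists yet:

```python
from collections import Counter, defaultdict
from datetime import date

# Invented work-order records: (shop name, repair type, date completed).
RECORDS = [
    ("Shop A", "brake repair", date(2023, 1, 10)),
    ("Shop A", "oil change", date(2023, 1, 12)),
    ("Shop A", "brake repair", date(2023, 2, 3)),
    ("Shop B", "front-end alignment", date(2023, 1, 20)),
    ("Shop A", "oil change", date(2022, 12, 30)),
]


def repair_counts(records, start, end):
    """Count repairs per shop and repair type for dates in [start, end]."""
    counts = defaultdict(Counter)
    for shop, repair, done in records:
        if start <= done <= end:
            counts[shop][repair] += 1
    return {shop: dict(c) for shop, c in counts.items()}


summary = repair_counts(RECORDS, date(2023, 1, 1), date(2023, 3, 31))
print(summary)
```

Only the shop name, repair type and date are needed for this view, so customer names, costs and other details could be dropped before the data ever reaches the platform.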

I have already done legwork before coming here to make this platform a reality.

I contacted my state’s Department of Motor Vehicles (DMV) and submitted a Freedom of Information Act (FOIA) request to obtain access to the necessary dataset. My state’s DMV has a legal clause that specifically requires all automotive repair shops to retain records of estimates, work orders, invoices, parts purchase orders, and appraisals to be available for inspection by the DMV. The DMV kindly responded to my request. Unfortunately, I learned that although all automotive repair shops are required to retain these records, the shops are not obligated to submit them to the DMV for archival at any point in time. Furthermore, audits in which the DMV actually inspects these records are exceptionally rare and depend heavily on circumstances.

For clarification, my intent is to only depict the values contained in these records through visual means such as graphs and charts. Customer names, cost of repairs, parts vendor names, mechanic names, and any other personally identifiable information (except for the name of the shop doing the repair) would all be obscured.

After hitting this brick wall, I learned about some existing platforms that collect and aggregate automotive repair data (RepairPal, iATN, Mechanic Advisor, AutoMD, CarMD). Although these platforms give users the ability to post reviews like Google Reviews and Yelp, they don’t contain the fundamental data I need to build this free platform. Some also sell products or services to automotive repair shops (namely OEM how-to tutorials for specific make/model cars) and I don’t want to get involved with any financial sponsorships or political bureaucracy.

I have thought about reaching out to local automotive repair shops I have close relations with, but there are fewer than a handful that trust me enough to grant me access to their data and whose data I could count on to be accurate. Networking with each automotive repair shop in my entire state is just not realistic.

Any feedback would be greatly appreciated. Thanks in advance!

submitted by /u/justLURKin220020