Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

What Open-source Dataset Tagging/storage Solutions Are Out There?

I am having trouble finding this: what do people use to store and create these datasets? Not as in ‘JSON’ or a relational/non-relational database, but is there a popular project that streamlines all of this, or should I write my own?

I am a software developer, so the scraping and storing of data isn’t an issue; what I don’t want to do is re-invent the wheel. I am just starting to get into this generation of AI tech.

I’d like to find something that can take in data like images and text with ‘tagged’ context for fine-tuning AI models. Something I can write scrapers and parsers for, add the results to a database, and then export as training datasets.
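For reference, the ingest, tag, and export loop described above can be sketched with nothing but the Python standard library. This is a minimal sketch, not a recommendation of any particular project: the table names, the `kind` column, and the JSON Lines export shape are all assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical schema: one row per sample, many-to-many link to tags.
SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,      -- e.g. "text" or "image" (assumed convention)
    content TEXT NOT NULL    -- the text itself, or a path to the image file
);
CREATE TABLE IF NOT EXISTS tags (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS sample_tags (
    sample_id INTEGER REFERENCES samples(id),
    tag_id INTEGER REFERENCES tags(id),
    PRIMARY KEY (sample_id, tag_id)
);
"""

def open_store(path=":memory:"):
    """Open (or create) the store and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def add_sample(conn, kind, content, tags):
    """Insert one sample and attach the given tag names to it."""
    cur = conn.execute(
        "INSERT INTO samples (kind, content) VALUES (?, ?)", (kind, content)
    )
    sample_id = cur.lastrowid
    for name in tags:
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT id FROM tags WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO sample_tags VALUES (?, ?)", (sample_id, tag_id)
        )
    conn.commit()
    return sample_id

def export_jsonl(conn):
    """Return all samples as JSON Lines, one object per sample."""
    lines = []
    samples = conn.execute(
        "SELECT id, kind, content FROM samples ORDER BY id"
    ).fetchall()
    for sample_id, kind, content in samples:
        tag_rows = conn.execute(
            "SELECT t.name FROM tags t "
            "JOIN sample_tags st ON st.tag_id = t.id "
            "WHERE st.sample_id = ? ORDER BY t.name",
            (sample_id,),
        ).fetchall()
        lines.append(json.dumps(
            {"kind": kind, "content": content, "tags": [r[0] for r in tag_rows]}
        ))
    return "\n".join(lines)
```

A scraper would call `add_sample(conn, "text", scraped_text, ["some-tag"])` per item, and `export_jsonl(conn)` produces one JSON object per line for a training pipeline to read.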

Like I said, I am about to just write my own stuff to do this, but I feel like this is a common enough problem that I should just use whatever the popular kids are using these days. Trouble is, I am just not finding the right words to search.

So does this exist? Am I overcomplicating this?

submitted by /u/drywallfan

Looking For Business Budget History.

Hi all, for a school project I’m looking for a dataset containing business budgets for many companies over the last 10-20 years. We’re Italian, so we would appreciate it if some Italian companies appeared in the dataset. Thanks in advance to anyone who can help.

submitted by /u/niger4

Dataset Containing Informal/formal Text?

Does anyone know of a publicly available dataset in any language containing formal discursive text along with a “parallel”, less formal text or know of any place where one can create such a dataset (like English Wikipedia articles and corresponding Simple Wikipedia articles)? The GYAFC dataset (Rao et al. 2018) is similar to what I’m looking for.

submitted by /u/geartrains

How Frequently Is Commoncrawl Data Updated, And What Is Its Coverage Level?

How often is Commoncrawl updated? On a daily cadence? Or weekly/monthly? If Meghan Markle wears a Versace gown, that becomes a BBC article, and that article shows up when Googling “meghan markle” 2-3 minutes after the BBC publishes it. What is the equivalent time for CC?
And secondly, is there a place where I can see CC’s coverage level? I mean: which websites they cover fully, which ones they cover partially, whether they cover reuters.com at all, how much of vice.com they cover, etc.?

submitted by /u/Attitudemonger

Looking For VR Anatomy Learning Dataset

Hi everyone, I’m looking for the VR Anatomy Learning Dataset. This dataset was collected by researchers from the University of Glasgow and contains data on the use of virtual reality for teaching human anatomy. The dataset includes performance data, survey responses, and other metrics related to the effectiveness of virtual reality in anatomy education. Kindly let me know where to find the dataset; any research papers or website links on this topic would also be very helpful.

submitted by /u/AbrarHussain-1234

Looking For A Dataset For Live Broadcasting Sports Online Platform

Hi everyone, new here. I need help with a dataset for a school project. I’m required to generate test data (a mock dataset) of web server logs in an Excel or CSV file. The dataset should include the following columns: country, timestamp, IP address, status, URL, status code, number of website visits, and content/sport viewed. The list should include different sports, reflected in the URL, e.g. /athletics/videos/200m-final.jpg (minimum of 3000 entries). Please help.
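Since the request is for generated test data rather than real logs, a short script can produce it. The following is a rough sketch using only the Python standard library. The country codes, sport and clip names, status labels, weights, and date range are all made up for illustration, and the column names are just one reading of the list above.

```python
import csv
import io
import random
from datetime import datetime, timedelta

# All values below are invented placeholders for a mock dataset.
COUNTRIES = ["KE", "GB", "US", "JM", "JP"]
SPORTS = {
    "athletics": ["200m-final", "marathon"],
    "football": ["cup-final", "derby"],
    "swimming": ["100m-freestyle", "relay"],
}
STATUS = {200: "OK", 404: "Not Found", 500: "Server Error"}
COLUMNS = ["country", "timestamp", "ip_address", "status",
           "url", "status_code", "visits", "sport"]

def mock_rows(n=3000, seed=0):
    """Yield n fake log rows; the seed makes the output reproducible."""
    rng = random.Random(seed)
    start = datetime(2023, 1, 1)
    for _ in range(n):
        sport = rng.choice(sorted(SPORTS))
        clip = rng.choice(SPORTS[sport])
        # Mostly successful requests, with a few errors mixed in.
        code = rng.choices(list(STATUS), weights=[90, 8, 2])[0]
        offset = timedelta(seconds=rng.randrange(30 * 24 * 3600))
        yield {
            "country": rng.choice(COUNTRIES),
            "timestamp": (start + offset).isoformat(),
            "ip_address": ".".join(str(rng.randrange(1, 255)) for _ in range(4)),
            "status": STATUS[code],
            # Follows the URL pattern given in the post.
            "url": f"/{sport}/videos/{clip}.jpg",
            "status_code": code,
            "visits": rng.randint(1, 20),
            "sport": sport,
        }

def write_csv(fileobj, n=3000, seed=0):
    """Write a header plus n mock rows to an open text file object."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(mock_rows(n, seed))
```

To get a file that Excel can open, call it as `with open("logs.csv", "w", newline="") as f: write_csv(f)`.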

submitted by /u/byron_0001

There Was An IMDb Dataset On Kaggle That Had Detailed Ratings Breakdown Of All Movies And Was Later Removed, Since Then I Have Not Found Anything Like It.

Hello, I think it was around February 2020 that someone uploaded an amazing IMDb dataset titled “IMDb movies extensive dataset”. I still have the archive file, but I wanted to find a more recent one. I tried making it myself, but IMDb doesn’t provide their complete data for free. You can get the basic info, but what was really interesting for me was the breakdown data on ratings. Here are the columns from the “IMDB ratings.csv” file:

imdb_title_id,weighted_average_vote,total_votes,mean_vote,median_vote,votes_10,votes_9,votes_8,votes_7,votes_6,votes_5,votes_4,votes_3,votes_2,votes_1,allgenders_0age_avg_vote,allgenders_0age_votes,allgenders_18age_avg_vote,allgenders_18age_votes,allgenders_30age_avg_vote,allgenders_30age_votes,allgenders_45age_avg_vote,allgenders_45age_votes,males_allages_avg_vote,males_allages_votes,males_0age_avg_vote,males_0age_votes,males_18age_avg_vote,males_18age_votes,males_30age_avg_vote,males_30age_votes,males_45age_avg_vote,males_45age_votes,females_allages_avg_vote,females_allages_votes,females_0age_avg_vote,females_0age_votes,females_18age_avg_vote,females_18age_votes,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes

As you can see, it has some juicy information, such as breakdowns by age and gender and, most importantly for me, the top1000_voters columns, which I think are an extremely underrated data point that is rarely mentioned. They are very useful when you want to determine whether a movie’s rating is biased. I have noticed that a lot of highly rated Turkish and Indian movies especially have very biased ratings, and using top1000_voters you can find which ones.

I was also able to find interesting things, such as which movies and genres females prefer more than males (males are biased more towards westerns, while females are biased more towards the family genre).
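As a rough illustration of the kind of comparison described above, here is a sketch that reads a CSV with those column names and computes two gaps per title. The definitions are assumptions, not the poster’s actual method: “bias” is taken to be weighted_average_vote minus top1000_voters_rating, the gender gap is females_allages_avg_vote minus males_allages_avg_vote, and the `min_votes` cutoff is arbitrary.

```python
import csv
import io

def rating_gaps(csv_file, min_votes=1000):
    """Return (title_id, public_minus_top1000, female_minus_male) tuples,
    sorted so the largest public-vs-top-1000 gap comes first."""
    results = []
    for row in csv.DictReader(csv_file):
        try:
            if int(float(row["total_votes"])) < min_votes:
                continue  # too few votes to say much (assumed cutoff)
            public = float(row["weighted_average_vote"])
            top1000 = float(row["top1000_voters_rating"])
            females = float(row["females_allages_avg_vote"])
            males = float(row["males_allages_avg_vote"])
        except (TypeError, ValueError):
            continue  # skip rows with blank or missing values
        results.append((
            row["imdb_title_id"],
            round(public - top1000, 2),
            round(females - males, 2),
        ))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

With the archive file at hand, this would be called as `with open("IMDB ratings.csv", newline="") as f: gaps = rating_gaps(f)`. Titles at the top of the list are rated much higher by the general public than by the top 1000 voters.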

So my question is: is it possible to get this info from IMDb without paying? I live in a third-world country and have no credit card to my name. I love doing these types of exploratory analysis as a hobby, but I can’t pay IMDb the thousands they are asking for, and for the life of me I can’t figure out how to web-scrape the data past IMDb’s anti-scraping systems.

On a side note, it appears they have removed the detailed rating breakdown from their website. You can only see how many people voted for each score, not the breakdown by gender, age, or even the top 1000 voters that was there before.

submitted by /u/NoHetro

Local Automotive Repair Shops Data On Repairs Performed

Hi everyone, I have a request for a dataset pertaining to automotive repairs.

I am voluntarily building a free application/platform that anyone can freely use anytime to help the public make informed decisions on where to take their motor vehicles for repairs. My interest in this comes from the fact that I love cars and I hate seeing people get ripped off. I’ve worked on countless cars and helped many people with free repairs. Specifically, this platform would allow users to search for nearby automotive repair shops and they would see a graphical summary view of the quantity of repairs any individual shop has done in a given period of time (X number of brake repairs, Y number of engine oil changes, Z number of front-end alignments, etc.). More features would be added with time but this is the starting point.

I have already done legwork before coming here to make this platform a reality.

I contacted my state’s Department of Motor Vehicles (DMV) and submitted a Freedom of Information Act (FOIA) request to obtain access to the necessary dataset. My state’s DMV has a legal clause that specifically requires all automotive repair shops to retain records of estimates, work orders, invoices, parts purchase orders, and appraisals to be available for inspection by the DMV. The DMV kindly responded to my request and unfortunately, I learned that although all automotive repair shops are required to retain these records, the shops are not obligated to submit these records to the DMV for archival at any point in time. Furthermore, the circumstances under which the DMV would even audit a shop with the intent to inspect these records would be extremely circumstantial and exceptionally rare.

For clarification, my intent is to only depict the values contained in these records through visual means such as graphs and charts. Customer names, cost of repairs, parts vendor names, mechanic names, and any other personally identifiable information (except for the name of the shop doing the repair) would all be obscured.
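To make the intended output concrete, here is a minimal sketch of the aggregation, assuming each repair record arrives as a dictionary. The field names (`shop`, `repair_type`, `date`) are hypothetical; real invoices and work orders will differ from shop to shop.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical record layout: only these fields are ever read.
KEPT_FIELDS = ("shop", "repair_type", "date")

def summarize(records, start, end):
    """Count repairs per shop and repair type within [start, end]."""
    summary = defaultdict(Counter)
    for rec in records:
        # Reading only the whitelisted fields means customer names, costs,
        # vendors, and mechanic names never reach the output.
        shop, repair_type, when = (rec[f] for f in KEPT_FIELDS)
        if start <= when <= end:
            summary[shop][repair_type] += 1
    return {shop: dict(counts) for shop, counts in summary.items()}
```

The resulting mapping of shop name to repair counts is the only thing a chart or graph would need, so nothing personally identifiable has to be stored past this step.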

After hitting this brick wall, I learned about some existing platforms that collect and aggregate automotive repair data (RepairPal, iATN, Mechanic Advisor, AutoMD, CarMD). Although these platforms give users the ability to post reviews like Google Reviews and Yelp, they don’t contain the fundamental data I need to build this free platform. Some also sell products or services to automotive repair shops (namely OEM how-to tutorials for specific make/model cars) and I don’t want to get involved with any financial sponsorships or political bureaucracy.

I have thought about reaching out to local automotive repair shops I have close relations with, but there are fewer than a handful that trust me enough to grant me access to their data and whose data would be accurate. Networking with each automotive repair shop in my entire state is just not realistic.

Any feedback would be greatly appreciated. Thanks in advance!

submitted by /u/justLURKin220020

[self-promotion] Every Product Listed On LEGO.com, May 2023

I made a little Python crawler that slurps up data about products listed on LEGO.com. That’s every product on the site, not just LEGO sets.

Here’s the crawler’s JSON output from May 9, 2023: https://gist.github.com/ryukoposting/070bea86a3b9fefc285388b0ffe651aa

Each product includes the following information:

- The product’s name
- A link to the product page on LEGO.com
- The product’s price in USD
- The product’s discount price in USD, if there is a discount.
- The number of LEGO pieces in the product (if the product isn’t a LEGO set, this value is null)
- LEGO’s suggested age range of the product, if one is available.
- Whether or not the product is currently available for purchase. Note: this is misleadingly called in_stock, but its value will be true for products that are on backorder.
- The product’s customer rating average, 1-5 stars.
- A list of themes to which the product belongs. Many products have only one theme, but some belong to multiple themes.
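A quick sketch of how the fields listed above might be used once the JSON is loaded. Only `in_stock` is a field name confirmed by the post; `name`, `price`, `discount_price`, and `piece_count` are hypothetical keys, and the top-level structure is assumed to be a list of product objects, so check the actual output before relying on any of this.

```python
import json

def buyable_sets(products, max_price):
    """Return names of LEGO sets that can be purchased at or below max_price."""
    picked = []
    for p in products:
        if p.get("piece_count") is None:
            continue  # null piece count means the product is not a LEGO set
        if not p.get("in_stock"):
            continue  # note: in_stock is also true for backordered products
        # Prefer the discount price when one is present (hypothetical key).
        price = p.get("discount_price")
        if price is None:
            price = p.get("price")
        if price is not None and price <= max_price:
            picked.append(p["name"])
    return picked
```

After saving the crawler output locally, `buyable_sets(json.load(open("products.json")), 30)` would list the sets available for 30 USD or less, where `products.json` is a placeholder file name.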

submitted by /u/ryu-ryu-ryu

Dataset Or Repository Of People Looking To Acquire External Datasets

I am looking for a dataset or repository that has a list of individuals or organizations actively searching and looking to purchase external datasets. The datasets can be used for research, academia, or business purposes, and they can encompass any type of data as long as the potential buyers have the intent and budget to make the purchase. I’m not even sure such a compilation exists (besides r/datasets) but thought it would be worth a try to ask!

submitted by /u/-x-Knight