Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Lyric Dataset With Song Structure For Commercial Use

Hey, I’m trying to find a dataset that contains lyrics together with the song structure, exactly like https://genius.com

For example:

[Intro]
Psst, I see dead people
(Mustard on the beat, ho)

[Verse 1]
Ayy, Mustard on the beat, ho
Deebo any rap nigga, he a free thro

Genius doesn’t allow scraping or commercial use of its data:

Except as expressly authorized by Genius in writing, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Service or the Genius Content, in whole or in part, except that the foregoing does not apply to your own User Content (as defined above) that you legally upload to the Service. In connection with your use of the Service you shall not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods. Any use of the Service or the Genius Content other than as specifically authorized herein is strictly prohibited. As between you and Genius, the technology and software underlying the Service or distributed in connection therewith is the exclusive property of Genius, our affiliates and our partners (the “Software”). You agree not to copy, modify, create a derivative work of, reverse engineer, reverse assemble or otherwise attempt to discover any source code, sell, assign, sublicense, or otherwise transfer any right in the Software. Any rights not expressly granted herein are reserved by Genius.

Do you know any other source of data that contains the lyrics and the song structure (chorus, verse, etc.)? I want to fine-tune Whisper to transcribe lyrics with these tags for a commercial product (a music generation model).
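Whatever source ends up supplying the lyrics, text tagged in this style is straightforward to split into labelled sections before training. A minimal sketch in Python, assuming every tag sits alone on a line in square brackets such as `[Verse 1]`; the name `parse_sections` and the sample text are made up for illustration:

```python
import re

# A section tag is a line consisting only of "[...]", e.g. "[Verse 1]".
TAG_RE = re.compile(r"^\[(?P<tag>[^\[\]]+)\]\s*$")

def parse_sections(text):
    """Split tagged lyrics into a list of (tag, lines) pairs."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate sections
        match = TAG_RE.match(line)
        if match:
            current = (match.group("tag"), [])
            sections.append(current)
        elif current is None:
            # Lines before the first tag go into an untagged section.
            current = ("Untagged", [line])
            sections.append(current)
        else:
            current[1].append(line)
    return sections

# Invented sample text, not taken from any real song.
sample = """[Intro]
La la la

[Verse 1]
First line of the verse
Second line of the verse
"""

sections = parse_sections(sample)
```

Keeping the tag and its lines together makes it easy to emit whatever target format the fine-tuning setup expects.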

I think that suno.com has used genius.com for their music model because they use the same tags for song structure xD.

submitted by /u/Which-Breadfruit-926

Building A Collection Of The Best Datasets And Resources

Hey scientists!

I’m working on cooldata; I’d like to build a more useful way to access open data online.

What are the best resources you use every day (data.gov, etc.)? And more importantly, why do you use them, and how?

I’m starting this by myself as a 20% personal project. The goal is to be fully open and maybe also open source as the thing moves on. (If anyone wants to contribute, I’m happy to listen! Just send a DM.)

Have a nice day!

submitted by /u/antonscap

Tableau Help Or Better Yet, Can You Analyze My Data?

It has been a while (10 years) and I can’t figure out how to join several tables on date/time in Tableau Public. Backstory: I have an annoying health condition (SIBO) that is starving my body of nutrients, and I am trying to figure things out by tracking methane, hydrogen, food intake, meds, symptoms, etc.

https://public.tableau.com/app/profile/mfinaly/viz/SmallIntestinesBacteriaOvergrowth/TrackingmySIBO

submitted by /u/Immediate_Ad3066

How To Scrape Subtitles?

There is very little Irish-language text, audio and English translation available. One of the best sources is this soap opera:

https://www.tg4.ie/en/player/play/?pid=6352950048112&title=Ros%20na%20R%C3%BAn&series=Ros%20na%20R%C3%BAn&pcode=669535&genre=Drama

It is fairly easy to find the URL of the subtitles (the VTT file) manually when on that webpage.

But the VTT URLs use UUIDs that seem pretty random:

https://redirector.playback.eu-west-1.prod.deploys.brightcove.com/v1/1555966122001/7b5d6364-47e2-4016-ae63-93301a7f4e38/ff7182e5-8f90-4af9-8d35-41a3bae7fa1e/441366d1-6c40-4106-9c0f-ecfdc21476b0.vtt

https://redirector.playback.eu-west-1.prod.deploys.brightcove.com/v1/1555966122001/83680fe1-8055-4494-96ff-bc2786f937cc/652c30ad-ff11-45d4-9e0c-46db42f5a34c/0ab149e4-25b0-4c73-8c9a-8130d647de91.vtt

There are subtitle archive sites, but this soap opera is not there. So how would you extract a few hundred sets of VTT files? (I want to build NLP datasets, n-grams etc., not make money or anything.)

I can imagine answers like:

With this site you can hire someone and if you show them the steps they can extract them for you cheap

With this mouse emulator you can do it by XYZ

There is a way around the UUIDs being random by XYZ

But I do not know how any of these would actually work.
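Separately from the download question, turning VTT files into text and n-grams needs only the standard library once the files are on disk. A rough sketch in Python, assuming simple WebVTT files (a `WEBVTT` header, optional cue identifiers, timing lines containing `-->`); `vtt_to_lines` and `ngrams` are illustrative names, and the sample cues are invented rather than taken from the show:

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")  # inline markup such as <i> or <c.colour>

def vtt_to_lines(vtt_text):
    """Return the subtitle text of each cue, one string per cue."""
    cues = []
    normalized = vtt_text.replace("\r\n", "\n").strip()
    # Header and cue blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        # Blocks without a timing line (header, NOTE, STYLE) are skipped.
        for i, line in enumerate(lines):
            if "-->" in line:
                text = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
                text = TAG_RE.sub("", text).strip()
                if text:
                    cues.append(text)
                break
    return cues

def ngrams(tokens, n):
    """Return consecutive n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))

# Invented sample; real files would be read from disk instead.
sample = """WEBVTT

1
00:00:01.000 --> 00:00:03.000
Dia duit, a <i>chara</i>.

2
00:00:03.500 --> 00:00:05.000
Dia duit,
a chara.
"""

lines = vtt_to_lines(sample)
tokens = [word.lower() for line in lines for word in re.findall(r"\w+", line)]
bigram_counts = Counter(ngrams(tokens, 2))
```

Because `\w` is Unicode-aware in Python 3, accented Irish letters stay inside their words when tokenizing.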

submitted by /u/cavedave

Looking For Bacterial Growth Per Time Dataset

Hello everyone, thank you for reading this post. Like the title says, I’m looking for an experimental dataset of bacterial growth over time (it would be better if you also have the protocol, but a real dataset with its source would be awesome). I’m trying to simulate a bacterial growth model and compare it to a real one. Thanks for your attention. All the best to everyone <3
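For the comparison step, one common choice is the logistic growth curve, N(t) = K / (1 + ((K - N0) / N0) * exp(-r * t)). A minimal sketch in Python, assuming that model (the post does not say which model is being simulated); the parameter values and the "observed" numbers are placeholders, not real measurements:

```python
import math

def logistic(t, n0, r, k):
    """Population at time t for logistic growth with initial size n0,
    growth rate r and carrying capacity k."""
    return k / (1 + ((k - n0) / n0) * math.exp(-r * t))

def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))

# Placeholder values only: replace with measured time points and counts.
times = [0, 1, 2, 3, 4]
observed = [10, 26, 65, 140, 250]
predicted = [logistic(t, n0=10, r=1.0, k=1000) for t in times]
error = rmse(predicted, observed)
```

A lower `error` for one parameter set than another indicates a closer fit to the measured curve, which is the basis for fitting `r` and `k` to real data.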

submitted by /u/Fickle_Buy7668

Dataset Browsing Behavior / Search History

Hi everyone,

I am looking to analyze browsing data holistically, so I would like to understand what pages users visit. Search history data from browsers would be best. It would be great if it was recent too (2021-2024). Does anyone know of anything like that? I am a PhD student, so I only have a limited budget.

Thank you in advance!

submitted by /u/KeyScale1232

I Need Ideas For My Data Science Project

(what’s this link thing?) Hello folks, I need ideas for datasets that I can use for a data analysis project for my college. I thought about the relationship between how developed a country is and its unemployment, or about a dataset I found containing a study on the most common ways to study a subject and whether they are effective. However, I couldn’t find the source of that data, so if you could help me find these or give me some better ideas, I would be very grateful.

submitted by /u/vitstola

Open Sourcing Touristic POI Database – Questions Around Format, Interest

We’re planning to open source our touristic POI database (currently 1.4 million points worldwide). There is some effort involved in generalizing it from our internal format, so I wanted to confirm that there is interest in it and get some feedback on the format. I’ve also outlined the process of creating and updating the dataset, as it gives some insight into what to expect from the dataset; if that interests anyone, it is probably the people in this sub.

POI data points

Location (mandatory)

Category (mandatory, more on that later)

Name

Images (designated thumbnail with blur hash, all with permissive licensing information)

Localizations (consisting of a name, teaser and description in one of the supported languages; availability depends)

Rating (mandatory, more on that later)

Source (mandatory, such as Wikidata, OSM, tourism council etc.)

Type (most POIs are individual sights, but “special” POIs such as places, i.e. cities/towns, exist)

Parent (if it exists, a “special” POI such as a city or town)

Links/References (links to the Wikidata entity and Wikipedia/Wikivoyage articles in different languages, but also links to social media (fb, ig, twitter etc.), booking sites (agoda, booking, hotels.com etc.) or relevant 3rd-party sites such as Trip Advisor, Atlas Obscura etc.)

Misc. properties: web address, telephone, zip code, opening hours, heritage designation (UNESCO, UK Grade I building) etc. More depending on the source.

We derive our content from many different sources, some of which we simply map to the above format (especially those derived from regional or country-level tourism councils). The bulk, however, is combined from Wikidata, Wikipedia, Wikivoyage and OpenStreetMap in the following manner.

Process

Process the complete Wikidata dump, keeping only the entities that possess a geocoordinate and an instance-of claim. The instance-of claim is then checked against a list of touristically relevant classes. Note: this claim can be very specific, such as olive sand beach or agricultural theme park, so we expand our list of touristically relevant classes (i.e. beach and amusement park) to include the descendant subclasses. We get a lot of structured information from this source (especially links to other sites) but little in the way of descriptions, images etc.

Process all linked articles in the different language versions of Wikipedia/Wikivoyage (at the moment we look at the English, German, French, Spanish, Italian, Portuguese and Polish sites).

Extract teasers and shorter excerpts for descriptions (Localizations) as well as images with their respective licenses.

Clean up low-quality and unspecific images.

Assign parents to “special” POIs (cities, towns) depending on the “located in administrative region” claim. The assigned POIs then form an area that is used to assign further POIs in that area to the same parent.
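The first step above could be sketched roughly as follows. This is not our actual pipeline code, only an illustration in Python: it assumes entities in the standard Wikidata JSON layout, where P31 is “instance of”, P279 is “subclass of” and P625 is “coordinate location”. The function names are made up, and the IDs in the sample are arbitrary ones chosen for illustration; they do not stand for the classes mentioned above.

```python
from collections import deque

def expand_subclasses(roots, subclass_of):
    """Return roots plus all of their descendant classes.
    subclass_of maps a class ID to its direct parent class IDs (P279)."""
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    seen = set(roots)
    queue = deque(roots)
    while queue:  # breadth-first walk down the subclass tree
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def claim_values(entity, prop):
    """Yield the main values of all statements for one property."""
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            yield snak["datavalue"]["value"]

def is_relevant(entity, relevant_classes):
    """True if the entity has a coordinate (P625) and an instance-of (P31)
    claim pointing at one of the relevant classes."""
    has_coordinate = any(True for _ in claim_values(entity, "P625"))
    classes = {value["id"] for value in claim_values(entity, "P31")}
    return has_coordinate and bool(classes & relevant_classes)

# Arbitrary IDs for illustration: Q903 is a subclass of Q902, which is a subclass of Q901.
subclass_of = {"Q902": ["Q901"], "Q903": ["Q902"]}
relevant = expand_subclasses({"Q901"}, subclass_of)

entity = {
    "id": "Q999",
    "claims": {
        "P31": [{"mainsnak": {"snaktype": "value",
                              "datavalue": {"value": {"id": "Q903"}}}}],
        "P625": [{"mainsnak": {"snaktype": "value",
                               "datavalue": {"value": {"latitude": 35.0,
                                                       "longitude": 25.0}}}}],
    },
}
keep = is_relevant(entity, relevant)
```

Expanding the class list once up front keeps the per-entity check to a set intersection, which matters when streaming through the whole dump.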

Two things would require some work: category and rating. We map information from sources to an internal category representation. It is binary, fast to filter with bit masks but not very flexible and probably not that easy to use. For the open source version I was thinking of creating a taxonomy somewhat similar to the one Foursquare uses but other suggestions are appreciated.
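To illustrate why the bit mask representation is fast but awkward to use, here is a minimal sketch in Python. The category names and bit assignments are hypothetical, not our internal list:

```python
from enum import IntFlag

class Category(IntFlag):
    # Hypothetical categories, one bit each.
    BEACH = 1
    MUSEUM = 2
    AMUSEMENT_PARK = 4
    VIEWPOINT = 8

def matches(poi_mask, wanted):
    """True if the POI has at least one of the wanted categories."""
    return bool(poi_mask & wanted)

# Each POI stores all of its categories in a single integer.
pois = {
    "A": Category.BEACH | Category.VIEWPOINT,
    "B": Category.MUSEUM,
    "C": Category.AMUSEMENT_PARK | Category.MUSEUM,
}

wanted = Category.MUSEUM
selected = sorted(name for name, mask in pois.items() if matches(mask, wanted))
```

Filtering is a single bitwise AND per POI, but every consumer has to know the bit assignments, and adding a hierarchy (beach as a kind of natural feature, say) does not fit into a flat mask. That is the flexibility problem a published taxonomy would address.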

The rating combines a somewhat objective data quality rating (number of images, links to Wikipedia articles, length of descriptions, types of properties present etc.) with a biased weighting of categories (among other information) that fits our use case. We also use user reviews/ratings, but those wouldn’t be part of the dataset. We could use a slightly more generalized aggregate rating and/or different rating components, but more likely than not you would want to use your own weighting if your use case is sufficiently different, so I am wondering what expectations or requests there are here.

Export Formats

TSV and GeoJSON Feature Collections but open to suggestions.
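As a rough idea of what the GeoJSON export could look like, here is a sketch in Python that wraps POI records in a Feature Collection. The property names and the sample record are invented placeholders based on the data points listed earlier, not the final schema:

```python
import json

def poi_to_feature(poi):
    """Convert one POI dict into a GeoJSON Feature with Point geometry."""
    properties = {k: v for k, v in poi.items() if k not in ("lat", "lon")}
    return {
        "type": "Feature",
        # GeoJSON positions are [longitude, latitude], in that order.
        "geometry": {"type": "Point", "coordinates": [poi["lon"], poi["lat"]]},
        "properties": properties,
    }

def to_feature_collection(pois):
    """Wrap a list of POI dicts in a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [poi_to_feature(poi) for poi in pois],
    }

# Invented record for illustration only.
pois = [
    {"name": "Example Beach", "lat": 35.0, "lon": 25.0,
     "category": "beach", "rating": 0.7, "source": "Wikidata"},
]
collection = to_feature_collection(pois)
geojson_text = json.dumps(collection, ensure_ascii=False)
```

Nested data such as localizations or links fits naturally into `properties` here, whereas the TSV export would need them flattened into columns or split into separate files.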

submitted by /u/berlumptsss

Looking For Updated Dataset On Hofstede’s 6 Dimensions Of Culture

Hi, I am trying to use the most recent data from Hofstede’s 6 dimensions for my thesis on how culture impacts AI innovation. I found the data I need here: https://www.hofstede-insights.com/country-comparison-tool But it is not in an Excel format, and typing it over would take a lot of time. Online I could only find datasets from 2015. Is there a more recent version publicly available?

submitted by /u/Electronic-Boat5375

I’m Struggling To Find A Resource That’ll Give Me A List Of Songs That Released Each Year For The Past Decade

I’m conducting a research project where I compare music from before and after the advent of TikTok to see if TikTok really changed how people make music.

I have been looking far and wide for a library, package, API or database that can give me a reliable list of the songs released each year from 2010 to 2023.

Could y’all recommend the most reliable source to get this type of data?

Thanks

submitted by /u/reddit_turtleking

Historical Sale/coupon/promotional Prices At Grocery Chains

Hello! I’m looking for a dataset of historical grocery store item prices, specifically at the promotional / sale / rewards card price (hopefully including details like if it was BOGO, 2 for 1, minimum or maximum purchase requirements, etc.).

I see quite a few price histories but nothing that specifies if it was on sale and what the deal was. And I’m sure grocery chains wouldn’t share this information.

Maybe the best path would be to scrape this data myself going forward?

Thoughts? TIA!

submitted by /u/secondcupoftea

Other Examples Of Websites Like NYC’s Data Visualization?

NYC’s “Open Data” website allows you to quickly visualize the datasets right within your web browser. This includes a tabular view along with customizable graphs and charts:

https://data.cityofnewyork.us/d/k397-673e/visualization

Are there other websites that offer something similar for their respective public (and open source) datasets? I’m curious about the overall UI and UX these websites provide in hopes of drawing some inspiration for a website of my own one day.

submitted by /u/TheCodingCyclist

Explore The Ultimate UFC Dataset On Kaggle!

Hey everyone,

Just wanted to share this awesome find on Kaggle: “The Ultimate UFC Archive (1993-Present)” dataset. It’s a treasure trove of UFC data covering events, fights, fighters, and referees.

What’s Inside:

Event details

Fight outcomes

Fighter statistics

Why It’s Cool:

Detailed fight data

In-depth fighter profiles

Constantly updated

Whether you’re a data enthusiast, a die-hard fan or just curious about MMA, this dataset has something for everyone. Check it out and dive into the world of the UFC!

UFC dataset

Enjoy exploring!

submitted by /u/ShockOk4912

Hotel Data – I’m Building A Hotel Availability App

Does anyone know where hotel apps (e.g. Hotels.com) get their data from? I’m looking to gather availability dates and inventory.

I know that most apps use APIs. I want to see if there is a single system I can connect an app to that will pull hotel data from around the world: inventory and availability dates.

submitted by /u/FreeeRide-