Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

In Need Of Datasets Of Indian And Carabao Mango Leaves

Hello everyone,

I am a college student currently working on a thesis about machine learning, specifically focused on identifying Indian and Carabao mango leaves with and without anthracnose disease using a CNN model.

At this stage, I need a large dataset, likely 1,000 or more images, of the mentioned mango varieties. I am looking for images of leaves affected by anthracnose disease as well as healthy leaves from both the Carabao and Indian mango varieties.

I am reaching out in the hope that you can help us find these datasets, as they will serve as the primary data for our thesis.

Thank you very much for considering my request.

submitted by /u/chadmomentgiga

Looking For A Dataset Of Currently Reported As Phishing/scam Crypto Wallets

Hi guys,

I’m currently working on a project to enhance the detection and prevention of cryptocurrency scams and phishing attempts. A crucial part of this project is identifying and analyzing scam crypto wallets that have been reported by users and security experts.

I am looking for a reliable and up-to-date dataset that contains information about cryptocurrency wallets reported as being involved in phishing or scam activities. Ideally, this dataset should include details such as:

Wallet addresses
Type of scam or phishing attempt

If anyone knows where I can find such a dataset or has resources that could help, I would greatly appreciate your assistance. Open-source datasets or any repositories maintained by security communities or organizations would be extremely helpful.

Thank you in advance for your help!

submitted by /u/Funny-Accident-5612

Datasets Request About Carabao And Indian Mango Leaves

Hello everyone,

I am currently working on a machine learning project, specifically focused on identifying Philippine Indian and Carabao mango leaves with and without anthracnose disease using a CNN model.

At this stage, I need a large dataset, likely 1,000 or more images, of the mentioned mango varieties. I am looking for images of leaves affected by anthracnose disease as well as healthy leaves from both the Carabao and Indian mango varieties.

Thank you very much for considering my request.

submitted by /u/chadmomentgiga

Looking For A Dataset On Suicides In The US

Hi everyone,

Maybe someone knows of some open-access datasets on suicides committed in the U.S. (or number of deaths, if there is a variable for the cause of death) per year (from about 2015 to at least 2020) and per state. The more additional variables there are (such as gender, age, employment status, etc.), the better.

Hope that maybe some of you have seen something of this sort🙏

submitted by /u/dollala

UK Private Companies Datasets For 25m+ Filings

We are a UK FinTech company and have launched a new product that automatically extracts data (including handwritten data) from 25 million filings for millions of UK companies. In addition, there are insights and easy-to-consume charts and tables. The automatically extracted data provides the following for 2m+ private companies:

An industry-first price-per-share and last-round-valuation (market capitalisation) chart
Capital structure, shareholding, and the change in shareholding
Equity fundraising trends in the UK
Top fundraisers and investors in the UK

I would like to hear your feedback on our UK company insights data 🙂

submitted by /u/olive_er

[Paid] Anonymized Dataset For Market Analysis

I’m selling a high-quality dataset that includes email address, full name, phone number, age, location (country), gaming platforms owned (e.g., PC, PlayStation, Xbox, Android), etc.

Price: $1.20 per individual ($120 total)

Format: CSV, Excel and PDF

Delivery: Secure download link or Direct file

DM if you are interested

submitted by /u/Money_Ad3408

Lyric Dataset With Song Structure For Commercial Use

Hey, I’m trying to find a dataset that contains lyrics and the song structure, exactly like https://genius.com

For example:

[Intro]
Psst, I see dead people
(Mustard on the beat, ho)

[Verse 1]
Ayy, Mustard on the beat, ho
Deebo any rap nigga, he a free thro

Genius doesn’t allow scraping or the use of its data for commercial purposes:

Except as expressly authorized by Genius in writing, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Service or the Genius Content, in whole or in part, except that the foregoing does not apply to your own User Content (as defined above) that you legally upload to the Service. In connection with your use of the Service you shall not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods. Any use of the Service or the Genius Content other than as specifically authorized herein is strictly prohibited. As between you and Genius, the technology and software underlying the Service or distributed in connection therewith is the exclusive property of Genius, our affiliates and our partners (the “Software”). You agree not to copy, modify, create a derivative work of, reverse engineer, reverse assemble or otherwise attempt to discover any source code, sell, assign, sublicense, or otherwise transfer any right in the Software. Any rights not expressly granted herein are reserved by Genius.

Do you know of any other source of data that contains the lyrics and the song structure (chorus, verse, etc.)? I want to fine-tune Whisper to transcribe lyrics with these tags for a commercial product (music generation model).

I think that suno.com has used genius.com for their music model because they use the same tags for song structure xD.

submitted by /u/Which-Breadfruit-926

Building A Collection Of The Best Datasets And Resources

Hey scientists!

I’m working on cooldata. I’d like to build a more useful way to access open data online.

What are the best resources you use every day (data.gov, etc.)? And more importantly, why do you use them, and how?

I’m starting this by myself as a 20% personal project. The goal is to be fully open, and maybe also open source as the thing moves on. (If anyone wants to contribute, I’m happy to listen! Just send a DM.)

Have a nice day!

submitted by /u/antonscap

Tableau Help Or Better Yet, Can You Analyze My Data?

It has been a while (10 yrs) and I can’t figure out how to do a join of several tables using date/time in Tableau Public. Backstory: I have an annoying health condition (SIBO) that is starving my body of nutrients, and I am trying to figure things out by tracking methane, hydrogen, food intake, meds, symptoms, etc.

https://public.tableau.com/app/profile/mfinaly/viz/SmallIntestinesBacteriaOvergrowth/TrackingmySIBO

submitted by /u/Immediate_Ad3066

How To Scrape Subtitles?

There is very little Irish language text, audio and English translation available. One of the best sources is this soap opera:

https://www.tg4.ie/en/player/play/?pid=6352950048112&title=Ros%20na%20R%C3%BAn&series=Ros%20na%20R%C3%BAn&pcode=669535&genre=Drama

It is fairly easy to find the URL of the subtitles manually when on that webpage, and so to get the VTT file.

But the VTT URL uses UUIDs that seem pretty random:

https://redirector.playback.eu-west-1.prod.deploys.brightcove.com/v1/1555966122001/7b5d6364-47e2-4016-ae63-93301a7f4e38/ff7182e5-8f90-4af9-8d35-41a3bae7fa1e/441366d1-6c40-4106-9c0f-ecfdc21476b0.vtt

https://redirector.playback.eu-west-1.prod.deploys.brightcove.com/v1/1555966122001/83680fe1-8055-4494-96ff-bc2786f937cc/652c30ad-ff11-45d4-9e0c-46db42f5a34c/0ab149e4-25b0-4c73-8c9a-8130d647de91.vtt

There are subtitle archive sites, but this soap opera is not on them. So how would you extract a few hundred sets of VTT files? (I want to build NLP datasets, n-grams etc., not make money or anything.)

I can imagine answers like:

With this site you can hire someone, and if you show them the steps they can extract the files for you cheaply.

With this mouse emulator you can do it by XYZ.

There is a way around the UUIDs being random by XYZ.

But I do not know how any of these would actually work.
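Whichever way the files end up being collected, turning them into text for n-grams is the more mechanical half. A minimal sketch, assuming the VTT files are already saved locally; the sample cue text and the function names `vtt_to_lines` and `ngrams` are made up for illustration and say nothing about how the player serves its subtitles:

```python
import re
from collections import Counter

# Hypothetical sample standing in for one downloaded .vtt file.
SAMPLE_VTT = """WEBVTT

00:00:01.000 --> 00:00:03.000
Dia duit

00:00:03.500 --> 00:00:05.000
<i>Conas atá tú?</i>
"""

TIMING = re.compile(r"-->")   # every cue has a "start --> end" timing line
TAG = re.compile(r"<[^>]+>")  # inline markup such as <i> or <v Speaker>


def vtt_to_lines(vtt_text):
    """Return the text of each cue, dropping the header, timings and markup."""
    lines = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        rows = block.splitlines()
        # Blocks without a timing line (WEBVTT header, NOTE blocks) are skipped.
        timing_idx = next((i for i, r in enumerate(rows) if TIMING.search(r)), None)
        if timing_idx is None:
            continue
        text = " ".join(TAG.sub("", r).strip() for r in rows[timing_idx + 1:])
        if text:
            lines.append(text)
    return lines


def ngrams(lines, n=2):
    """Count word n-grams inside each cue (never across cue boundaries)."""
    counts = Counter()
    for line in lines:
        words = re.findall(r"\w+", line.lower())
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


if __name__ == "__main__":
    cues = vtt_to_lines(SAMPLE_VTT)
    print(cues)
    print(ngrams(cues, n=2).most_common(3))
```

Counting within a cue only is a design choice here: consecutive cues are often spoken by different characters, so n-grams spanning them would mostly be noise.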

submitted by /u/cavedave

Looking For Bacterial Growth Per Time Dataset

Hello everyone, and thank you for reading this post. As the title says, I’m looking for an experimental dataset on bacterial growth over time (having the protocol as well would be better, but a real dataset with its source would already be awesome). I am trying to simulate a bacterial growth model and want to compare it to real data. Thanks for your attention. All the best to everyone <3

submitted by /u/Fickle_Buy7668

Dataset Browsing Behavior / Search History

Hi everyone,

I am looking to analyze browsing data holistically, so I would like to understand what pages users visit. Search history data from browsers would be best. It would be great if it were recent too (2021-2024). Does anyone know of anything like that? I am a PhD student, so I only have a limited budget.

Thank you in advance!

submitted by /u/KeyScale1232

I Need Ideas For My Data Science Project

(what’s this link thing?) Hello folks, I need ideas for datasets that I can use for a data analysis project for my college. I thought about the relation between more developed countries and unemployment, or a dataset I found that contained a study about what may be the most common way to study a subject and whether it’s effective or not. However, I couldn’t find the source of the data, so if you could help me find these or maybe give me some better ideas, I would be very grateful.

submitted by /u/vitstola

Open Sourcing Touristic POI Database – Questions Around Format, Interest

We’re planning to open source our touristic POI database (currently 1.4 million points worldwide). There is some effort involved in generalizing it from our internal format, so I wanted to a) confirm that there is interest in it and b) get some feedback on the format. I’ve also outlined the process of creating/updating the dataset, as it gives some insight into what to expect from the dataset and may interest the people in this sub.

POI data points

Location (mandatory)
Category (mandatory, more on that later)
Name
Images (designated thumbnail with blur hash, all with permissive licensing information)
Localizations (consisting of a name, teaser and description in one of the supported languages; availability depends)
Rating (mandatory, more on that later)
Source (mandatory, such as Wikidata, OSM, tourism council etc.)
Type (most POIs are individual sights, but “special” POIs such as places, i.e. cities/towns, exist)
Parent (if it exists, a “special” POI such as a city or town)
Links/References (links to the Wikidata entity and Wikipedia/Wikivoyage articles in different languages, but also links to social media (fb, ig, twitter etc.), booking sites (agoda, booking, hotels.com etc.) or relevant 3rd party sites such as Trip Advisor, Atlas Obscura etc.)
Misc. properties: web address, telephone, zip code, opening hours, heritage designation (UNESCO, UK Grade I building) etc.; more depending on the source

We derive our content from many different sources; some of them we simply map to the above format (especially those derived from regional or country-level tourism councils). The bulk, however, is combined from Wikidata, Wikipedia, Wikivoyage and OpenStreetMap in the following manner.

Process

Process the complete Wikidata dump, filtering for entities that possess a geocoordinate and an instance-of claim. The instance-of claim is then checked against a list of touristically relevant classes. Note: this claim can be very specific, such as olive sand beach or agricultural theme park, so we expand our list of touristically relevant classes (i.e. beach and amusement park) to include the descendant subclasses. We get a lot of structured information from this source (especially links to other sites) but little in the way of descriptions, images etc.

Process all linked articles in the different language versions of Wikipedia/Wikivoyage (at the moment we look at the English, German, French, Spanish, Italian, Portuguese and Polish sites). Extract teasers and shorter excerpts for descriptions (Localizations) as well as images with their respective licenses.

Clean up low-quality and unspecific images.

Assign parents, i.e. “special” POIs (cities, towns), depending on the “located in administrative region” claim. The assigned POIs then form an area that is used to assign further POIs in that area to the same parent.
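The subclass expansion in the Wikidata step can be sketched roughly as follows. This is a sketch under stated assumptions, not the actual pipeline: the property IDs are the real Wikidata ones (P31 instance of, P279 subclass of, P625 coordinate location), but the class names and entities are made-up placeholders, and claims are simplified to plain lists, whereas the real dump nests every claim inside a statement object.

```python
# P31 = instance of, P279 = subclass of, P625 = coordinate location.
# Class names below are placeholders, not real Wikidata item IDs.
SUBCLASS_OF = {  # child class -> parent classes, as read from P279 claims
    "olive_sand_beach": ["beach"],
    "agricultural_theme_park": ["theme_park"],
    "theme_park": ["amusement_park"],
    "power_station": ["industrial_building"],
}
RELEVANT_ROOTS = {"beach", "amusement_park"}


def expand_relevant(roots, subclass_of):
    """Return the root classes plus every class descending from one of them."""
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    relevant, stack = set(roots), list(roots)
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in relevant:  # also guards against cycles in the class graph
                relevant.add(child)
                stack.append(child)
    return relevant


def is_poi(entity, relevant):
    """Keep entities that have a coordinate and at least one relevant instance-of class."""
    claims = entity.get("claims", {})
    has_coordinate = bool(claims.get("P625"))
    return has_coordinate and any(c in relevant for c in claims.get("P31", []))


if __name__ == "__main__":
    relevant = expand_relevant(RELEVANT_ROOTS, SUBCLASS_OF)
    entities = [
        {"id": "E1", "claims": {"P31": ["olive_sand_beach"], "P625": [(21.3, -157.8)]}},
        {"id": "E2", "claims": {"P31": ["power_station"], "P625": [(52.5, 13.4)]}},
        {"id": "E3", "claims": {"P31": ["agricultural_theme_park"]}},  # no coordinate
    ]
    print([e["id"] for e in entities if is_poi(e, relevant)])
```

Expanding the class list once up front keeps the per-entity check to a set lookup, which matters when streaming a dump of that size.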

Two things would require some work: category and rating. We map information from sources to an internal category representation. It is binary, fast to filter with bit masks but not very flexible and probably not that easy to use. For the open source version I was thinking of creating a taxonomy somewhat similar to the one Foursquare uses but other suggestions are appreciated.
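For anyone unfamiliar with the bit-mask approach, here is a small illustration of why it is fast to filter but awkward to extend. The category names and values are hypothetical; the actual internal representation is not described in this post.

```python
from enum import IntFlag


class Category(IntFlag):
    # Hypothetical categories, one bit each.
    BEACH = 1
    MUSEUM = 2
    AMUSEMENT_PARK = 4
    HERITAGE = 8


# Made-up POIs; a POI can carry several categories by OR-ing the bits.
POIS = [
    ("Example Beach", Category.BEACH),
    ("Example Castle Museum", Category.MUSEUM | Category.HERITAGE),
    ("Example Fun Park", Category.AMUSEMENT_PARK),
]


def filter_by_mask(items, mask):
    """Keep items sharing at least one category bit with the mask (one AND per item)."""
    return [name for name, cats in items if cats & mask]


if __name__ == "__main__":
    print(filter_by_mask(POIS, Category.BEACH | Category.HERITAGE))
```

The inflexibility shows up as soon as a hierarchy is needed: a flat set of bits has no notion of "museum" being a kind of "cultural site", which is what a taxonomy would add.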

The rating combines a somewhat objective data quality rating (number of images, links to Wikipedia articles, length of descriptions, types of properties present etc.) with a biased weighting of categories (among other information) that fits our use case. We also use user reviews/ratings, but those wouldn’t be part of the dataset. We could use a slightly more generalized aggregate rating and/or different rating components, but more likely than not you would want to use your own weighting if your use case is sufficiently different, so I guess I am wondering what expectations or requests there are here.
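To make the idea of separate rating components concrete, here is one way a data-quality score and a category weighting could be kept apart so that users can plug in their own weights. Everything in it is an assumption for illustration (the field names, the caps of 5 images, 7 links and 1000 characters, the default weight); it is not the formula actually in use.

```python
from dataclasses import dataclass


@dataclass
class Poi:
    # Hypothetical fields; the real schema may differ.
    name: str
    category: str
    n_images: int
    n_wiki_links: int
    description_chars: int


def quality_score(poi):
    """Data-quality component in [0, 1]: each signal is capped, then averaged."""
    parts = [
        min(poi.n_images, 5) / 5,
        min(poi.n_wiki_links, 7) / 7,
        min(poi.description_chars, 1000) / 1000,
    ]
    return sum(parts) / len(parts)


def rating(poi, category_weights, default_weight=0.5):
    """Scale the quality score by a use-case-specific category weight."""
    return quality_score(poi) * category_weights.get(poi.category, default_weight)


if __name__ == "__main__":
    poi = Poi("Example Beach", "beach", n_images=5, n_wiki_links=7, description_chars=1000)
    print(rating(poi, {"beach": 0.8}))
```

If the export shipped the quality component on its own, a consumer with a different use case would only need to supply their own `category_weights`.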

Export Formats

TSV and GeoJSON Feature Collections but open to suggestions.
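As a reference point for feedback on the format, a single POI as a GeoJSON Feature might look like the output of the sketch below. The record and its field names are made up; the one fixed rule comes from the GeoJSON specification, which orders point coordinates as longitude first, then latitude.

```python
import json


def poi_to_feature(poi):
    """Convert one POI dict into a GeoJSON Feature with a Point geometry."""
    properties = {k: v for k, v in poi.items() if k not in ("lat", "lon")}
    return {
        "type": "Feature",
        # GeoJSON coordinate order is [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [poi["lon"], poi["lat"]]},
        "properties": properties,
    }


def to_feature_collection(pois):
    """Wrap a list of POI dicts in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": [poi_to_feature(p) for p in pois]}


# Made-up record; field names are illustrative only.
EXAMPLE = {
    "name": "Example Lighthouse",
    "category": "lighthouse",
    "source": "OSM",
    "lat": 53.0,
    "lon": -9.5,
}

if __name__ == "__main__":
    print(json.dumps(to_feature_collection([EXAMPLE]), ensure_ascii=False, indent=2))
```

Nested values such as localizations or links fit naturally into `properties` here, whereas a TSV export would need them flattened into columns or serialized into a single cell.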

submitted by /u/berlumptsss