Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Where/How To Download All Genome Sequence Out Of NCBI?

I am planning to compare genome sequences, but for that I need data. So I came across the National Center for Biotechnology Information (NCBI), which is an awesome organization.

But there is an issue: the sequences have to be downloaded one by one. Is there any way to download the whole thing onto my server at once, i.e. all the available sequences?

I looked into their FTP page as well, but it provided data in different formats, such as gbff, faa, gpff, and fna. And I’m pretty sure there is more data than this, as it was only about 8 MB.

Ref:

https://www.ncbi.nlm.nih.gov/datasets/taxonomy/37653/
https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/

Any kind of help or suggestions would be appreciated.

submitted by /u/maifee

Where To Find Big Datasets With “updated_at” Date Column?

Hi,

I want to create a sample SCD Type 2 table. To do so, I am looking for a big dataset (>5 GB) that updates daily and has an “updated_at” date attribute representing the date when a row last changed.

Example:

Today the dataset looks like this:

id  color   updated_at
1   blue    01.01.2024
2   red     01.01.2024

Tomorrow the dataset looks like this:

id  color   updated_at
1   yellow  10.03.2024
2   red     01.01.2024

Do you know where I could find such datasets?
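
For reference, here is a minimal sketch of how two such daily snapshots could be folded into an SCD Type 2 history. It only illustrates the idea and is not tied to any particular warehouse or tool. The names `VersionRow`, `apply_snapshot`, `valid_from`, `valid_to` and `is_current` are hypothetical, and the dates are assumed to be in DD.MM.YYYY format.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


def parse_day(text: str) -> date:
    # Assumption: dates in the example are written as DD.MM.YYYY.
    return datetime.strptime(text, "%d.%m.%Y").date()


@dataclass
class VersionRow:
    # Hypothetical SCD Type 2 layout: one row per version of a record.
    id: int
    color: str
    valid_from: date
    valid_to: Optional[date]  # None means the version is still open
    is_current: bool


def apply_snapshot(history, snapshot):
    """Fold one daily snapshot into the version history."""
    current = {row.id: row for row in history if row.is_current}
    for rec in snapshot:
        changed_on = parse_day(rec["updated_at"])
        open_row = current.get(rec["id"])
        if open_row is None:
            # First time this id is seen: open its first version.
            history.append(VersionRow(rec["id"], rec["color"], changed_on, None, True))
        elif changed_on > open_row.valid_from:
            # updated_at moved forward: close the old version, open a new one.
            open_row.valid_to = changed_on
            open_row.is_current = False
            history.append(VersionRow(rec["id"], rec["color"], changed_on, None, True))
    return history


today = [
    {"id": 1, "color": "blue", "updated_at": "01.01.2024"},
    {"id": 2, "color": "red", "updated_at": "01.01.2024"},
]
tomorrow = [
    {"id": 1, "color": "yellow", "updated_at": "10.03.2024"},
    {"id": 2, "color": "red", "updated_at": "01.01.2024"},
]

history = apply_snapshot([], today)
history = apply_snapshot(history, tomorrow)
for row in history:
    print(row)
```

With the two snapshots above, id 1 ends up with a closed “blue” version and an open “yellow” version, while id 2 keeps its single open “red” version. Relying on “updated_at” alone to detect changes is a simplification; a real pipeline might also compare the attribute values.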

submitted by /u/Betelgeitze

A Shared Scorecard To Evaluate Data Annotation Vendors

Evaluating and choosing an annotation partner is not an easy task. There are a lot of options, and it’s not straightforward to know who will be the best fit for a project.

We recently stumbled upon a paper by Andrew Greene titled “Towards a shared rubric for Dataset Annotation”, which describes a set of metrics that can be used to quantitatively evaluate data annotation vendors. So we decided to turn it into an online tool.

A big reason for building this tool is also to bring the welfare of annotators to the attention of all stakeholders.

Until end users start asking for their data to be labeled in an ethical manner, labelers will always be underpaid and treated unfairly, because the competition boils down solely to price. Not only does this “race to the bottom” lead to lower quality annotations, it also means vendors have to “cut corners” to increase their margins.

Our hope is that by using this tool, ML teams will have a clear picture of what to look for when evaluating data annotation service providers, leading to better quality data as well as better treatment of the unsung heroes of AI: the data labelers.

Access the tool here: https://mindkosh.com/annotation-services/annotation-service-provider-evaluation.html

submitted by /u/AdventurousSea4079

Looking For A Large Unlabelled Handwritten Text Dataset

I’m looking for a large amount of handwritten text (in image format), and it doesn’t have to be labelled. Simply put, scanned images of handwritten pages, raw, untouched, but lots of them. I’m not even very particular about the language. It would be nice if the images were separated by language, but even a total mess would be acceptable.

The ones I’ve found so far are all labelled and, as a result, there are not that many samples in them. I was hoping that if a dataset is not labelled, it would be easier to find one with a large number of samples.

These are the ones I’ve found:

CENSUS-HWR (1,812,014 samples)

IAM (16,752 samples)

submitted by /u/Ziadloo

XGLM-564M – Fine Tuning For Ayacucho Quechua

Hi everyone,

I’m trying to fine-tune an XGLM-564M model for the Ayacucho Quechua language. So far, I’ve found two datasets on Hugging Face that could be used for this:

wikipedia/wikipedia
hackathon-pln-es/spanish-to-quechua

I’m facing some problems with the first one and I’m not able to download it because of a missing package called apache_beam. I tried installing it but without any success (I’m using the latest PopOS).

For the second dataset, I’m mainly worried about the quality since I don’t have any knowledge of that language and I’m doing this fine-tuning as part of my uni assignment.

Any help will be greatly appreciated.

Thank you.

submitted by /u/dduka99

Dataset Of Books, Novels, And Other Literary Sources That Have Been Adapted Into Movies/tv Shows

I’m conducting exploratory data analysis on streaming platforms like Netflix, Amazon Prime, and others to guide content acquisition strategies for a new streaming service. Specifically, I’m investigating the performance of movies and TV shows that are adapted from literary sources compared to original content. By ‘perform better,’ I mean whether these adaptations, on average, receive higher ratings on the streaming platforms themselves or on external rating sites such as IMDb.

A similar question was asked before but never received a response: https://www.reddit.com/r/datasets/comments/gscwtz/request_is_there_a_comprehensive_database_of/

I would appreciate any assistance on this!

submitted by /u/2bapesrealm

Request: USDA 12 Basic Soil Class Dataset For Mapping

Hello,

I am a student researching the precontact cultivation of tobacco by Indian tribes in western North America. I am trying to find a map of the 12 basic soil classes (clay, loam, silt loam, etc.) but am having trouble. This would allow me to note where Nicotiana species have proliferated despite regions being outside of their “natural” range. I am accounting for other geospatial factors as well, but this would be extremely helpful. Any assistance would be greatly appreciated 🙂

submitted by /u/infernoparadiso

[Mock] Ideas For A Dummy Inventory Dataset

I’m about to launch into building a dummy warehouse inventory dataset. I’m trying to come up with a playful type of company and product line upon which to base it. I’m after something whimsical, but meaty enough to build a demo around. I’m thinking at least 400-500 SKUs (products), with a compelling set of product categories (2-3 levels of hierarchy, a few dozen total categories). I’ve thought of things like:

• a surf shop chain, with swimming and snorkeling equipment, T-shirts, beach toys and accessories
• a “Flintstonesque” shop with all sorts of sticks and rocks
• something inspired by Wile E. Coyote’s “ACME” (bird seed, exploding tennis balls, anvils…)
• maybe something inspired by SpongeBob SquarePants (shell emporium…)

Any ideas?

(I realize that this isn’t quite the normal fare here. If it’s not close enough, could you suggest another subreddit?)
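
To make the target shape concrete, here is a rough sketch of how a dataset of that size could be generated once a theme is picked, using the surf shop idea as a placeholder. Every category, field name, price range and the `SURF-` SKU prefix is made up for illustration, and this toy version has only two hierarchy levels and a handful of categories rather than a few dozen.

```python
import random

# Hypothetical two-level category hierarchy for the surf shop idea.
CATEGORIES = {
    "Swimming": ["Goggles", "Swim Caps", "Fins"],
    "Snorkeling": ["Masks", "Snorkels", "Snorkel Sets"],
    "Apparel": ["T-Shirts", "Rash Guards"],
    "Beach": ["Toys", "Accessories"],
}


def make_inventory(n_skus, seed=0):
    """Generate n_skus dummy inventory rows with a fixed random seed."""
    rng = random.Random(seed)  # seeded so the demo data is reproducible
    leaves = [(top, sub) for top, subs in CATEGORIES.items() for sub in subs]
    rows = []
    for i in range(1, n_skus + 1):
        top, sub = rng.choice(leaves)
        rows.append({
            "sku": f"SURF-{i:05d}",  # made-up SKU format
            "category": top,
            "subcategory": sub,
            "name": f"{sub} item {i}",
            "unit_price": round(rng.uniform(2.0, 150.0), 2),
            "on_hand": rng.randint(0, 500),
        })
    return rows


inventory = make_inventory(450)
print(inventory[0])
```

Swapping in a different theme would only mean replacing the `CATEGORIES` dictionary and the SKU prefix.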

submitted by /u/waitak

Need Help In A Timeseries Satellite Dataset For A GAN Based Simulation

Hello all,

I am working on an academic project where I am training a GAN on synthetic satellite data of a city / vegetation land. I then change the labels (air quality, water supply, urbanization parameters, etc.) to predict what the new image will look like after the feature changes. I am currently working with synthetic satellite data, and the results are more or less good. However, I want to scale my project to time-series data of either a city or a vegetation land so that I can train my model on real-time data. Can you point me in the right direction if any such dataset exists?

submitted by /u/ultrainstinctmasters

Help On Finding A Text Summarization Dataset

I’m working on a research idea for summarizing content for different audiences. For example, a summary of a particular company document for marketing, HR, or developers that highlights the content most relevant to each of them. Right now I’m having difficulty finding a text summarization dataset that has ground truth for different audiences like this. Can anyone point me in the right direction for finding such a dataset?

submitted by /u/AGENT_SAT

Any Interest In CSGO Datasets (Specifically From HLTV)?

I spent a lot of time accumulating historical match information for all available teams on HLTV. I’d like to know if this is of any value to fellow researchers. I’d be happy to host it, but I just want to know if the interest is there. For anyone interested: I scraped a lot of this data to build a Discord bot that does match predictions for CSGO matches. If you want to hear more about the project or dataset, just PM me or add your contact here: https://yhzshsg2ee.us-east-1.awsapprunner.com/

submitted by /u/smackcam20

Looking For Open-source/public Client-therapist Transcripts Dataset

I put out an AI therapy chatbot, and I’ve used a few publicly available transcripts I’ve scraped together from here and there, but nowhere near enough for proper fine-tuning and a real analysis of its ability to approximate ‘real’ therapists. The one place I found, which actually feels extremely convincing, is fiction.

There is the publication by Alexander Street, Counseling and Psychotherapy Transcripts: Volumes 1-3, but access is always restricted to university students/researchers.

Anyone know of alternatives or a way to access that?

submitted by /u/naftalibp

Student, Need Access To Statista Premium/Pro

If anyone can help out, please do. I’m a university student and the only way to access the sources used on Statista is with a Pro account. I need the actual original info in order to properly cite data in my persuasive essay. The price is extremely steep in my currency and I’m on a budget, so lol, please PM me if you can assist!

I need to access these stats: https://www.statista.com/statistics/1261626/south-africa-gross-tertiary-school-enrollment-ratio/

submitted by /u/digitaldisgust

JazzSet: Large Audio Dataset With Instrumentation And Performer Annotation.

Google Drive: https://drive.google.com/drive/folders/1MkAiT8Zgm2bF-BWKYOdhVOJS-eduIofb?usp=sharing

JazzSet Dataset:

A remarkably large dataset of digitized, high-quality, full-length jazz session recordings from 1905 to 1966, with instrumentation and performer details annotated.

Statistics:

• 40,329 recordings with 399,761 total performance credits.

• 275 credited instrument types or roles for 12,585 individual performers.

• 11,421 marked examples of 843 jazz “standards” (Songs with 5 or more examples).

• 2,202.21952 hours (91.75914 days) of audio. 245 GB, mp3.

• Sourced from a well-curated, session-date-specific public domain collection.

• For 35,201 tracks, definite or probable Discogs IDs are recorded. Definite IDs are identified by a match to one or more Discogs.com releases by record and catalog number; probable IDs come from matching names for those individuals whose names are unambiguous among Discogs artists. These IDs are meant to aid future metadata cleaning and improvement, and to help ensure specific identification of performers, especially if the mappings can be expanded in the future.

All but the audio archive will also be placed on a Neocities page I’ve set up for the project (https://saleach.neocities.org/jazzset/) – all audio in the archive has also been uploaded to the Internet Archive’s “Great 78” project and each card has a direct archive.org file download url so you can explore the set – and download suitable subsets of training material when downloading the entire enormous archive is not practical.

submitted by /u/returnstack

Does Anyone Know Where To Find CENSUS-HWR Dataset?

I’m looking for a large (even unlabelled) handwritten text dataset (in image format of course) and apparently, one of the largest ones is CENSUS-HWR. Their paper (which is not that old – May 2023) points to this link https://censustree.org/data.html which is dead. But this link exists: https://censustree.org/data. It’s just that the data you can download from there is in CSV format which has nothing to do with handwritten text.

Does anyone know where to find the CENSUS-HWR dataset?

submitted by /u/Ziadloo

Looking For A Dataset Of Cryptocurrency-related Scam Data/tweets

Hi all,

I am conducting research on scam detection in tweets related to cryptocurrencies. I need a dataset of scam tweets, but unfortunately everything I have found is just basic cryptocurrency information that isn’t labelled. Since I require a labelled dataset for my model, I need scammy/suspicious tweets such as fake giveaways and other data that has been determined to be sketchy.

Any help on this would be much appreciated.

submitted by /u/Prestigious_Ruin_822

Historical Daily Weather Dataset For All U.S. Cities

I’m trying to get a daily weather dataset for all U.S. cities, and this has proved to be a harder task than I thought. I’m looking for daily aggregated weather metrics, such as minimum temperature, maximum temperature, precipitation, average wind speed, humidity, etc.

The NCEI NOAA API (and its FTP bulk data download option) seemed promising initially, but it’s missing a lot of data for the majority of its weather stations: https://www.ncei.noaa.gov/support/access-data-service-api-user-documentation

I also looked into the Wunderground API, but according to this thread, the price is $10K per year, which I can’t afford: https://www.reddit.com/r/webdev/comments/8tjavu/now_that_the_free_wunderground_api_has_been/

I looked into the National Weather Service API, but this one doesn’t go back far enough and provides only granular data points: https://www.weather.gov/documentation/services-web-api
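
For what it’s worth, granular data points can in principle be rolled up into the daily metrics I’m after, so granularity alone isn’t the blocker. Here is a minimal sketch of that roll-up, assuming each observation is a (timestamp, temperature in °C, precipitation in mm) tuple. That tuple layout, the field names and the sample values are made up for illustration and are not the response format of any of the APIs above.

```python
from collections import defaultdict
from datetime import datetime


def daily_summary(observations):
    """Group (timestamp, temp_c, precip_mm) tuples by calendar day."""
    by_day = defaultdict(list)
    for ts, temp_c, precip_mm in observations:
        by_day[ts.date()].append((temp_c, precip_mm))

    summary = {}
    for day, values in sorted(by_day.items()):
        temps = [t for t, _ in values]
        summary[day] = {
            "temp_min": min(temps),
            "temp_max": max(temps),
            "precip_total": sum(p for _, p in values),
        }
    return summary


# Made-up sample observations for two days.
observations = [
    (datetime(2024, 3, 1, 0, 0), 4.0, 0.0),
    (datetime(2024, 3, 1, 12, 0), 11.5, 1.5),
    (datetime(2024, 3, 1, 18, 0), 8.0, 0.5),
    (datetime(2024, 3, 2, 6, 0), 3.0, 0.0),
    (datetime(2024, 3, 2, 15, 0), 13.0, 0.0),
]

summary = daily_summary(observations)
for day, metrics in summary.items():
    print(day, metrics)
```

This ignores time zones and missing readings, both of which would matter with real station data.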

Does anyone know of another good source for historical weather data?

submitted by /u/Specialist_Dig2115