Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Looking For A Beauty Rating Dataset

I’m working on a project which requires an AI model to rate the beauty of human images. I’m having trouble finding datasets to use; all the ones I’ve found were limited. If it’s possible to gain access to the datasets that other beauty-rating AIs were trained on, it would be really appreciated.

submitted by /u/Ujay_mk

Looking For Emergency Calls/Transcripts Dataset

Hello everyone. I am building a classification AI that takes a voice call as input and needs to classify it as an emergency or a false alarm. I found this 911 Kaggle dataset as a starting point for my training, but it’s pretty limited in size and not very high quality. Since I am going with a multi-modal approach (there are 2 submodels, one for the voice and one for the transcript), can you suggest any decent, high-quality datasets of either audio calls or transcripts relevant to my query? Thank you all in advance!

submitted by /u/ZK2K2

Twitter Count Of Posts Containing Specific Keywords

I’m very confused by what API access is now needed to do this since it seems like this has changed. I’ve searched this sub and googled a ton and haven’t been able to come up with a good answer. If the $100 basic tier would allow me to scrape the data I need for a month to do this analysis I’m okay with that, but I can’t even tell if that access would allow me to comb through the tweets in the way I’m looking to. I’m basically just looking to do something as simple as this (obviously not in SQL language but easiest to explain this way):

SELECT Day, COUNT(DISTINCT tweets) FROM twitter WHERE tweet LIKE '%keywords%' AND date_range BETWEEN x AND y GROUP BY Day
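
Once the tweets are in hand (however they end up being fetched), the counting step itself might look something like this in Python. This is only a sketch: the record fields (`id`, `created_at`, `text`) and the sample data are made up for illustration and are not the actual Twitter API response format.

```python
from collections import defaultdict
from datetime import date, datetime


def daily_keyword_counts(tweets, keywords, start, end):
    """Count distinct matching tweets per day, like the query above."""
    lowered = [k.lower() for k in keywords]
    ids_by_day = defaultdict(set)
    for tweet in tweets:
        day = datetime.fromisoformat(tweet["created_at"]).date()
        if not (start <= day <= end):
            continue  # outside the requested date range
        text = tweet["text"].lower()
        if any(k in text for k in lowered):
            # A set per day keeps duplicate tweet ids from being counted twice
            ids_by_day[day].add(tweet["id"])
    return {day: len(ids) for day, ids in sorted(ids_by_day.items())}


# Hypothetical records, only to show the shape of the input
sample = [
    {"id": 1, "created_at": "2023-06-01T09:00:00", "text": "New dataset released"},
    {"id": 1, "created_at": "2023-06-01T09:00:00", "text": "New dataset released"},
    {"id": 2, "created_at": "2023-06-01T12:30:00", "text": "Nothing relevant"},
    {"id": 3, "created_at": "2023-06-02T08:15:00", "text": "Another DATASET thread"},
    {"id": 4, "created_at": "2023-07-15T08:15:00", "text": "dataset outside range"},
]

counts = daily_keyword_counts(sample, ["dataset"], date(2023, 6, 1), date(2023, 6, 30))
for day, n in counts.items():
    print(day, n)
```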

Thanks for any help!

submitted by /u/BachShitCrazy

CO2 Emission Dataset – Ineedtowrite36characters

Good evening/morning/night everyone;

My professor suggested using the International Energy Agency dataset (as if there was just one) to obtain past data on CO2 emissions per country. The International Energy Agency appears to require 900 euros for twelve-month access as the smallest possible transaction.

Two questions:

1 – do you know any free dataset that covers individual countries’ past CO2 emissions?

2 – do you know any way to get the International Energy Agency dataset for free? Any site? What prompts such a question, of perhaps dubious legality, is that the very director of the agency has started the process of making its database free, as it is basically sustained by public money anyway. It is for a master’s thesis; there is no profit involved.

submitted by /u/Adorable-Snow9464

What Is The Right Methodology For The Following Situation?

We have a setup for surface particle quantification, where we classify particles into a few different classes with respect to their size. However, we are able to measure only roughly 80% of the whole surface. The question is: how do we extrapolate the amount to 100% of the surface, and is a probability plot the right direction? Or do you have any other proposal?
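
If the particles can be assumed to be spread uniformly over the surface (a strong assumption that would have to be checked for the real setup), the simplest estimate is to scale each class count by the inverse of the measured area fraction. A rough sketch in Python, with made-up class names and counts, and Poisson counting statistics assumed for the uncertainty:

```python
import math


def extrapolate_counts(counts, measured_fraction):
    """Scale per-class counts from the measured area to the whole surface.

    Assumes uniform particle density and Poisson counting statistics;
    returns {class: (estimated_total, standard_error)}.
    """
    if not 0 < measured_fraction <= 1:
        raise ValueError("measured_fraction must be in (0, 1]")
    result = {}
    for size_class, n in counts.items():
        estimate = n / measured_fraction
        # sqrt(n) is the Poisson standard deviation of the measured count;
        # it scales by the same factor as the estimate itself.
        std_err = math.sqrt(n) / measured_fraction
        result[size_class] = (estimate, std_err)
    return result


# Hypothetical counts per size class on the ~80% of the surface that was measured
measured = {"small": 400, "medium": 100, "large": 16}
totals = extrapolate_counts(measured, 0.8)
for size_class, (estimate, std_err) in totals.items():
    print(f"{size_class}: {estimate:.1f} +/- {std_err:.1f}")
```

If the unmeasured 20% differs systematically from the rest (edges, for example), this scaling would be biased, so it is only a baseline.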

submitted by /u/R3DBAT

Anyone Who Needs Tax Invoice/bill Data

Hi everyone, I have data for 40k tax invoices/bills, generated by me, which look like real invoices/bills. Can anyone help me connect with someone who needs this data? There is no legal issue, as the invoices belong to me only. You can DM me for rates and further details. Thanks

submitted by /u/devanshu_12345

Looking For A Dataset Designed For Training Automated Image Moderation/censorship On Social Media Platforms

I’m fairly new to reddit so please forgive me if there’s a subreddit this thread would be more suited to!

Context: I’m currently working on my research proposal paper for a PhD in Fine Arts. I’m primarily a painter, so this is a practice-led research project on the subject of post-photography/image theory, post-digital visual culture and traumatic representation. I am by no means a data scientist and have a very base level understanding of ML and image recognition, but as I’m exploring traumatic representation in images on the internet/in relation to screen culture, my work does somewhat intersect with the field of computer vision – which is, of course, what brings me to Reddit.

I’m interested in how image recognition is used for the automated moderation/censorship/removal of “sensitive” content on social media platforms. I’m trying to locate any known dataset that’s been used to train this kind of image recognition model – I know there are plenty of datasets specifically for training ML to identify porn, but as my research revolves around trauma I’d ideally like to find one that includes a broader range of NSFW categories (violence, gore, etc.). I’m not too hopeful that any image based dataset of this kind would be publicly accessible (I suppose you’d hope it wasn’t), but alas, just putting this out here if anyone has any leads.

Even if you can’t answer my question, any thoughts/feedback/comments on this are more than welcome. I don’t particularly speak the language of computer science, but always open to having conversations about the project 🙂

submitted by /u/sentient-glue

About GDELT: Event Classification Into CAMEO Code

Hello,

We are using GDELT events for our project but have realised that many events need reclassification to the correct event code after taking a closer look at the data.

We are considering clustering techniques or using proprietary/OS LLMs for this task. But we want to make sure that we are not duplicating the strategy GDELT itself already uses.

To evaluate this, I have been trying to read about GDELT’s actual classification strategy, without much luck, as I cannot find any documentation on it. What does it do to classify an event to a CAMEO code? How does that happen automatically?

Any help is much appreciated!

submitted by /u/voidwithAface

Is There Any Good Search Suggestion Dataset For Dictionary

I’ve recently been building a dictionary & flashcard app. I’m using cambridge-dictionary-api to get dictionary data, but I also want search suggestions for my search bar. I tried using Puppeteer to get search suggestion data from the Cambridge Dictionary website, but it was sooo slow, so I want to use a trie data structure to generate the suggestions instead. The problem is that I can’t find a dataset of all English words.

Does anyone know of a dataset like that?
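
For reference, the trie part could be sketched like this in Python once a word list is available. The tiny word list here is a stand-in, not a real dataset, and the class and method names are just illustrative:

```python
class Trie:
    """Minimal prefix tree for search suggestions."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def suggest(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`, alphabetically."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored word has this prefix
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            # Push children in reverse order so they pop in alphabetical order
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], word + ch))
        return results


# Placeholder word list; a real app would load a full English word list here
trie = Trie()
for w in ["car", "card", "care", "cat", "dog"]:
    trie.insert(w)

print(trie.suggest("car"))
```

Lookups only walk the characters of the prefix plus the suggestions returned, so they should stay fast even with a large word list held in memory.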

submitted by /u/eliaschen_cat

Public Datasets With Market Names And Their Sizes?

Hello, everyone!

Are there any free publicly available datasets with data like market name, market size in 2023, projected market size, etc. (e.g. global bakery products market size, global smartphone market size, …, basically the most popular and established market sizes)? And are there any paid versions?

During my googling, I only found websites with separate market sizes, each written in the form of a report. I would really like to have a proper dataset, with the biggest markets and their sizes laid out cleanly.

I don’t mind the sizes being slightly inaccurate, but at least the orders of magnitude should be correct.

I tried to generate one using different LLMs, but all of them just hallucinated the numbers. If there isn’t a dataset, I will probably have to just web scrape all the markets one by one.

submitted by /u/PlagueCookie

Help Needed With Extracting A Large Dataset From Multiple Compressed Parts

Hi everyone,

I’m working with a dataset that’s approximately 200GB in size, and it is split into 200 compressed parts on Google Drive, named like this:

My Google Drive has a total capacity of 500GB, with 250GB of free space available.

I understand that on a Linux system, I can combine and uncompress all parts using the following commands:

However, when I try to perform this operation on Google Colab, I encounter the following error:

Has anyone faced a similar issue or does anyone have suggestions on how to handle this? Any help would be greatly appreciated!

Thanks in advance!

submitted by /u/lasindudemel