Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Anyone Who Needs Tax Invoice/bill Data

Hi everyone, I have data for 40k tax invoices/bills that I generated myself and that look just like real invoices/bills. Can anyone help me connect with someone who needs this data? There are no legal issues, as the invoices belong to me alone. You can DM me for rates and further details. Thanks

submitted by /u/devanshu_12345

Looking For A Dataset Designed For Training Automated Image Moderation/censorship On Social Media Platforms

I’m fairly new to Reddit, so please forgive me if there’s a subreddit this thread would be better suited to!

Context: I’m currently working on my research proposal paper for a PhD in Fine Arts. I’m primarily a painter, so this is a practice-led research project on the subject of post-photography/image theory, post-digital visual culture and traumatic representation. I am by no means a data scientist and have only a very basic understanding of ML and image recognition, but as I’m exploring traumatic representation in images on the internet and in relation to screen culture, my work does somewhat intersect with the field of computer vision, which is, of course, what brings me to Reddit.

I’m interested in how image recognition is used for the automated moderation/censorship/removal of “sensitive” content on social media platforms. I’m trying to locate any known dataset that has been used to train this kind of image recognition model. I know there are plenty of datasets specifically for training ML to identify porn, but as my research revolves around trauma I’d ideally like to find one that includes a broader range of NSFW categories (violence, gore, etc.). I’m not too hopeful that any image-based dataset of this kind would be publicly accessible (I suppose you’d hope it wasn’t), but I’m putting this out here in case anyone has any leads.

Even if you can’t answer my question, any thoughts/feedback/comments on this are more than welcome. I don’t particularly speak the language of computer science, but I’m always open to having conversations about the project 🙂

submitted by /u/sentient-glue

About GDELT: Event Classification Into CAMEO Code

Hello,

We are using GDELT events for our project but have realised that many events need reclassification to the correct event code after taking a closer look at the data.

We are considering clustering techniques or proprietary/open-source LLMs for this task, but we want to make sure that we are not duplicating the strategy GDELT itself already uses.

To evaluate this, I have been trying to read about GDELT’s actual classification strategy. What does it do to assign an event to a CAMEO code? How does that happen automatically? I have not had much luck, as I cannot find any documentation on this.

Any help is much appreciated!

submitted by /u/voidwithAface

Is There Any Good Search Suggestion Dataset For Dictionary

I’m building a dictionary & flashcard app and using cambridge-dictionary-api to get the dictionary data. I also want search suggestions for my search bar. I tried using puppeteer to get suggestion data from the Cambridge Dictionary website, but it was far too slow, so I want to use a trie data structure to generate the suggestions instead. However, I can’t find a dataset of all English words.

Does anyone know of a dataset like that?
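For anyone unfamiliar with the trie approach mentioned above, here is a minimal sketch of prefix-based suggestions in Python. The class and method names (`Trie`, `insert`, `suggest`) and the `limit` parameter are made up for illustration, and the tiny word list stands in for whatever English word list ends up being used.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add one word to the trie, creating nodes as needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix, limit=10):
        """Return up to `limit` stored words starting with `prefix`, in alphabetical order."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored word has this prefix
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push children in reverse order so the smallest letter is popped first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


# Hypothetical usage with a placeholder word list.
trie = Trie()
for w in ["car", "card", "care", "cat", "dog"]:
    trie.insert(w)
suggestions = trie.suggest("car")
```

Lookup cost depends on the prefix length and the number of suggestions returned rather than on the size of the whole word list, which is why this tends to be much faster than scraping suggestions per keystroke.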

submitted by /u/eliaschen_cat

Public Datasets With Market Names And Their Sizes?

Hello, everyone!

Are there any free publicly available datasets with data like market name, market size in 2023, projected market size, etc. (e.g. global bakery products market size, global smartphone market size, …, basically the most popular and established market sizes)? And are there any paid versions?

While googling, I only found websites that cover individual markets, each written up in the form of a report. I would really like to have a proper dataset, with the biggest markets and their sizes laid out cleanly.

I don’t mind if the sizes are slightly inaccurate, but at least the orders of magnitude should be correct.

I tried to generate one using different LLMs, but all of them just hallucinated the numbers. If there isn’t a dataset, I will probably have to just web scrape all the markets one by one.

submitted by /u/PlagueCookie

Help Needed With Extracting A Large Dataset From Multiple Compressed Parts

Hi everyone,

I’m working with a dataset that’s approximately 200GB in size, and it is split into 200 compressed parts on Google Drive, named like this:

My Google Drive has a total capacity of 500GB, with 250GB of free space available.

I understand that on a Linux system, I can combine and uncompress all parts using the following commands:

However, when I try to perform this operation on Google Colab, I encounter the following error:

Has anyone faced a similar issue or does anyone have suggestions on how to handle this? Any help would be greatly appreciated!

Thanks in advance!
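The file names and the original commands are not shown in the post, so the following is only a rough sketch of the "combine" step under stated assumptions: that the parts are plain byte-splits of a single archive, and that their names sort correctly (for example, zero-padded numbering). The function name `join_parts`, the pattern argument and the paths are hypothetical. It streams each part in chunks, so memory use stays small regardless of the dataset size.

```python
import shutil
from pathlib import Path


def join_parts(parts_dir, pattern, output_path, chunk_size=16 * 1024 * 1024):
    """Concatenate split parts, in sorted name order, into a single file.

    Assumes the parts are raw byte-splits whose names sort in the right order.
    Returns the number of parts that were joined.
    """
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {parts_dir}")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in fixed-size chunks instead of reading a whole part into memory.
                shutil.copyfileobj(src, out, chunk_size)
    return len(parts)
```

Note that the joined archive needs roughly as much free space as all the parts combined, and the extracted data needs additional space on top of that, which may matter with 250GB free.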

submitted by /u/lasindudemel

School Directory Data – What Can/Can’t I Do?

Several years ago my college accidentally emailed out the master Excel sheet of the entire faculty and student directory. I can’t remember who they sent it to or whether they rescinded it moments later, but I was looking at my inbox when it arrived. I opened it and downloaded it. It contains over 5,000 email addresses, majors, home phone numbers and cell phone numbers. Now I am curious what I could do with this data; I understand it’s usually very hard to come across something like this unless it is sold to you. Are there legal aspects? Could these be email marketing leads? Obviously scammers etc. would love this, but I’d like to be ethical about it.

Thanks…

submitted by /u/Taziot7

U.S. Consumer Expenditures Data By County

I’m looking for public datasets on consumer/household expenditures in the US by county and household size. I know the BLS’s Consumer Expenditure Survey provides this data, but it’s not available on a county level. Does anyone know where this information is available? I’d like to see mean values for rent/mortgage, food (both store-bought groceries and delivery/restaurant), and other household expenses for Manhattan (NY County) specifically. Thank you!

submitted by /u/nd9760

Need To Migrate A SAS Database To A New Software

Hey, I just started a new job as Data Manager with little to no experience in the field, and they told me that they want to move the database away from SAS.

As I said, I have almost no experience in this field, and they are looking for my input on where we can migrate to. It is a fairly big database with (I think) about 1 TB of medical information on different studies and patients (we are studying sleep apnea and other sleep illnesses).

Does anyone have suggestions or ideas on what I could propose to the team to switch to?

I don’t know the exact structure, but we seem to be using SAS for generating queries and storing the database, and we use MySQL to look at the different tables and gather the necessary info.

submitted by /u/Yottarro

Reliable Data Set For The Reddit Dataset

I am working on a project on representation learning for large-scale dynamic networks, and I am looking for a reliable Reddit dataset (the data should include post_id, user_id, time and comment), so that I can build a graph with users as nodes and an edge between two users whenever they comment on the same post.
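To make the intended graph construction concrete, here is a minimal sketch in Python. It assumes the data is available as (post_id, user_id) pairs; the function name `build_user_edges` and the sample records are invented for illustration, and the edge weight (number of shared posts) is an added assumption rather than something the project requires.

```python
from collections import defaultdict
from itertools import combinations


def build_user_edges(comments):
    """Build an undirected co-comment graph from (post_id, user_id) pairs.

    Returns a dict mapping (user_a, user_b), with user_a < user_b,
    to the number of posts both users commented on.
    """
    users_by_post = defaultdict(set)
    for post_id, user_id in comments:
        users_by_post[post_id].add(user_id)

    edge_weights = defaultdict(int)
    for users in users_by_post.values():
        # Every pair of distinct users on the same post gets one edge increment.
        for u, v in combinations(sorted(users), 2):
            edge_weights[(u, v)] += 1
    return dict(edge_weights)


# Made-up sample records, for illustration only.
sample = [
    ("p1", "alice"), ("p1", "bob"),
    ("p2", "alice"), ("p2", "bob"), ("p2", "carol"),
    ("p3", "dave"),
]
edges = build_user_edges(sample)
```

One thing to keep in mind is that a post with n distinct commenters contributes n(n-1)/2 edges, so very popular posts can dominate the graph unless they are capped or sampled.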

The high-level task of the current article is representation learning: building good representations in order to run community search on a graph of social network data. I want to use Reddit data for this, and I have some requirements: the dataset should let me use users as nodes and build an edge whenever different users comment on the same post. I tried a few datasets, but none of them meets my needs. Does anyone have a link to a Reddit dataset that does? Here is what I have tried:

https://github.com/dingidng/reddit-dataset (I can only create a few edges from this data, which does not make sense)
https://snap.stanford.edu/graphsage/#datasets (the nodes are not users)

I also have a problem using Pushshift to access Reddit data: whenever I submit a request for access, it is automatically rejected by the bot. If anyone knows how to use Pushshift to access the data and how to get access permission, please let me know.
https://pushshift.io/signup

This is my first time posting for help, thank you for any help you can provide!

submitted by /u/Terrible_Band6290

Searching For Social Media Screenshot Dataset

I have been searching for a dataset that contains screenshots of social media posts from various platforms (Twitter, Instagram, Truth Social, Facebook, etc.). I have been able to find datasets that contain URLs of social media posts, but none of sufficient size that include screenshots. I would like at least 1,000 images per platform. Please let me know if there are any datasets that you know of or if you have any advice.

submitted by /u/ImpossibleBear6458

Chatbot Datasets That Are Used For RNN And NLP

Hello everyone,

I recently started learning about AI and RNNs, and about how models work. I then wanted to try something else: I thought I could build my first NLP model from scratch, but the main problem is that there is little to no information on how to build a rich dataset to train the model.

I’ve looked everywhere, but whenever I put the model to the test the results are very bad.

Can someone help me or point me to examples of datasets that are used for training a chatbot model? Thanks

submitted by /u/InfiniteAd328

How Reliable Is Data On Wikipedia (war Casualties)?

Interested in working with data on war casualties. Wikipedia has an interesting page (List of battles by casualties), but the data seems implausible/lacking evidence/sources.

E.g., the Battle of Stalingrad is listed with 1,250,000 to 4,172,000 casualties while the Battle of Berlin is listed with 1,286,367 casualties.

These numbers fall outside the ranges I have read elsewhere. Is there a more reliable list/dataset to be found online?

submitted by /u/Vylerios