Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, i’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Synthetic Data For AGI Is Not THAT Hard (math Especially)

The fact is you could easily generate a lot of synthetic data just by asking an already trained bot to rewrite this as a given author that they have a lot of text they trained on. Or just have something like a thesaurus bot (maybe trains with Grammarly) that learns how to swap enough info out without changing the meaning (very strictly cause without this meaning being the same this training is useless although this may limit the scope of the changes allowed but is still generally better than no synthetic data (extremely easy to do with math cause it can just have math rules to define one step changes it generates) ) which is much easier to make than AGI. Thus whatever bot you are using the synthetic data to train on, it has to try to check if these two things the original and the synthetic data match in meaning. Thus it would have to understand the meaning or/and math to follow if the changes that were made match so it could replicate the process on its own.
So this could basically have a bot that can use Symbolab to train AGI in math.
And a bot that uses a more strict Grammarly or some form of thesaurus bot to train the AGI in language comprehension.

submitted by /u/Deamichaelis
[link] [comments]

SOS How To Make Stacked Bar Chart In Excel?

I need to make a stacked bar chart on a recurring basis. Included a few pictures here. The bar chart needs to show 15 grocery stores. Each grocery store has multiple applications. I need to show the number of users for each application by grocery store. Each grocery store application varies in maximum user size (between 100 -50,000).

I have a few problems: My data doesn’t have the exact data I need. The data has emails (with grocery stores embedded). The data also doesn’t have direct numbers, just “FALSE”. How do I turn all of this into a graph automatically and easily change the colors? Any advice is SO appreciated, thank you! I will literally PayPal for help.

submitted by /u/dbdhshhsh
[link] [comments]

Help Pulling Multiple .csv Files With Timestamped Data And Multiple Participants Into One File.

I have 11 .csv files containing data which has information about multiple participants in a study. All of the tables have a ‘timestamp’ column, some have ‘start-time’ and ‘end-time’ columns too. I then have 5 .csv files with data that is *not* timestamped – it contains some background/onboarding information collected at the beginning of the study.

I want to use this data to train a machine learning model.

I need to pull all of this information into one .csv file. I’m not sure how exactly to go about doing this. I’ve thought about matching timestamps for each table, and adding the relevant columns onto the row with the same timestamp, and just having the non-timestamped information in each row for that participant ID.

i.e., it would look something like this:

[ID] [timestamp] [feature1] [added feature 1] [added feature 2]

Then, all of the timestamps associated with each person’s id would have its own row, but some of the features would be empty/null values.

Would it make sense to do this? What are some methods I could use to achieve this?

submitted by /u/an-diabhal
[link] [comments]

Can Fair Use Principles Safeguard The Creation Of A Movie Scene Dataset For Research Purposes?

Hello, I am building a dataset for research purposes, and the content I am using is audiovisual and copyrighted. It consists of video clips from scenes in movies. I have observed that there are datasets with their accompanying papers available, and they don’t seem to have legal issues despite using copyrighted movie scenes.
I wanted to know if fair use covers this type of usage or what recommendations you could give me for publishing a dataset with these characteristics.
Thank you.

That datasets

HOLLYWOOD2: Actions in Context (CVPR 2009) HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do MPII-MD: A Dataset for Movie Description MovieNet: A Holistic Dataset for Movie Understanding (ECCV 2020) MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions MovieQA: Story Understanding Benchmark (CVPR 2016) Video Person-Clustering Dataset: Face, Body, Voice: Video Person-Clustering with Multiple Modalities MovieGraphs: Towards Understanding Human-Centric Situations from Videos (CVPR 2018) Condensed Movies: Story Based Retrieval with Contextual Embeddings (ACCV 2020)

https://github.com/xiaobai1217/Awesome-Video-Datasets

Best regards.

submitted by /u/Tlaloc-Es
[link] [comments]

ETS Compliance Data In Useable Format?

I would like to have a panel dataset with for each year the ETS system existed:
– all firms that handed in too little ETS rights for their emissions
– the number of EST rights they were short
– (nice to have: sector and country)

This data is available at https://ec.europa.eu/clima/ets/allocationComplianceMgt.do?languageCode=en, but the format is really poor. To create a panel, I would need to select each country individually, select years one by one, export the compliance data and add all resulting csv files together. Using a webscrapper this should be doable maybe, but I haven’t done that before.

Companies not handing in enough ETS (and hence being fined) are identified by compliance status “B”. The number of rights they are short can be calculated based on the tables too.

My question is if anybody maybe knows if there is a more accessible version of this data available online. Or maybe someone already scrapped the database? Any leads are appreciated.

submitted by /u/AtkinsonStiglitz
[link] [comments]

Stock Market News Dataset – 2008 Or Later

Hello,

I’m working on a machine learning project, and need a large dataset of financial news. Specifically, I’m looking for news on companies that have a medium market cap or lower, and from a period of 2008 until now… or any interval of time over this period.

Is anyone aware of such a dataset? Or any websites where I can query historical financial news – ideally free?

Thank you.

submitted by /u/JustinPooDough
[link] [comments]

Enlighten Me About These Project’s Dataset.

I have a school project which involves creating an Ingredient-Based Recipe Generator Chatbot for Bicol Cuisine Main Dishes. The chatbot should generate recipes based on user commands, but these commands must contain a minimum of three ingredients. I plan to use fine-tuning with OpenAI’s language model. Since this is my first AI project, I’m a bit confused about how to begin creating the dataset. Can someone help me by explaining how I should go about creating the dataset?

submitted by /u/akameaoi
[link] [comments]

Want A Huge Dataset Of All English Songs

i want to train my AI on songs and poems, so i want a huge dataset of all english songs and poems, any suggestions on websites , i can scrape to get a large set of english songs only i heard of azlyrics but it contains other languages romanized versions too that makes it hard to get english songs only

submitted by /u/innocentboy0000
[link] [comments]

Providing Datasets, Leads As Needed. US Healthcare Available.

Hey all! 👋
👩‍⚕️ Healthcare Datasets Expertise:
Been diving into USA healthcare datasets for a year now 🏥✨
🔧 Services:
Web scraping, data management, and cleaning – I’ve got your data needs covered. Let’s tidy up those datasets and make them shine! 🌟
🌐 Tech Stack:
Python, Node.js, Puppeteer, Scrapy, Selenium, BS4 – name it, I’ve conquered it! 🚀
💬 Let’s Connect:
Ready to boost your projects with quality data? DM me, let’s chat and cook up something awesome together! 📬🤝

submitted by /u/purplepyramid7
[link] [comments]

Looking For Datasets: ClickStream, HealthCare, IOT, Agri, Edtech,Sales

I’m looking for raw datasets either session based or user based, (NOT THE AGGREGATED)

Here’s what I’m looking for, I’ll pay for any or all of the following, I’m fine either with one or many of these ….

1) IOT: timeseries dataset from individual IOT device, I’m fine with any data in it.

2) HealthCare: timeseries for individual patient or procedule, if you have anything else please let me know, it should not be aggregated

3)Agri: Individual sensors or any other device data along with location(perferable)

4)ClickStream: timeseries and session based

5) Sales: timeseries, user or session based along with product and sales cost

6) Edtech: let me know whatever you have.

Please DM me if you can help or point me to some source. I’m fine to pay or free or whatever works.

submitted by /u/Winter-Breadfruit943
[link] [comments]

Help With SPSS Survey Data Set For Grad Student

Struggling grad student here. My advisor is off for the break and I could really use some support with my quantitative analysis. I’m using SPSS on a survey data set I collected. I need to run multiple regression analysis but everything is coming back insignificant. This might be the case, but I would really appreciate a second set of eyes. I’m willing to pay for your time, just wanted to get this knocked out while I’m home for the holidays.

submitted by /u/TiredTiddies
[link] [comments]

Looking For A Comprehensive Sector Categorisation (string) For A Boolean Search On Company Name

Hi,

I’ve got a large list (1M+) of company names which have input by users. I’d like to categorise them by sector, but given the regional bias of some sets (e.g. SIC, NAICS) and the cost of others (Bloomberg) there isn’t a single comprehensive source that I can find.

Does anyone know of one? The end output is career guidance for people getting back into main street work (e.g. mums after kids, veterans leaving the forces).

Thanks

submitted by /u/Early_Respond7150
[link] [comments]

Do You Know Any Dataset Of 3d Human Meshes, Where The Train Images Are Synthetic But The Test Images Are Real?

I need a dataset of human 3d meshes. The most important requirement of this dataset is it to have real test data. The actors of the human 3d meshes must have images in the real scenarios.

The train data can be generated and not be given by the dataset. Since if they provide the meshes with the textures, I can use a software to generate synthetically the train data.

But the test data it must be real.

submitted by /u/henistein
[link] [comments]

I Need A Face+Audio+EEG Dataset For Didactic Purpose

Hello everyone,

I’m a CS student and I’m trying to approach to the emotion recognition. I played a little bit with this multimodal network for emotion recognition (https://github.com/katerynaCh/multimodal-emotion-recognition). I find it pretty cool, with the network that works very well with the Face+Audio modality. However, I was trying to implement in this network the emotion recognition with EEG (I don’t really know how to do it, but still…) but I cannot find any dataset that contains Face, Audio and EEG data. Actually, I find the PME4 dataset (https://figshare.com/articles/dataset/PME4_Emotion_Recognition_with_Audio_Video_EEG_and_EMG/18737924, Face+Audio+EEG+EMG) but it has a very different structure than the RAVDESS dataset used for the multimodal network that I used in first place and I have no idea on how to adapt it to the network, so I was trying to find other datasets.

submitted by /u/_link23_
[link] [comments]

🧼 SUDS – A Guide To Structuring Unstructured Data [self-promotion]

I’ve spent a decent amount of time indexing and formatting a lot of machine learning datasets that include images, audio, video, and text and wanted to propose a simple format that might help us standardize a format for the data with a little more structure. Wouldn’t say it is ground breaking, but I feel like could be a good practice.

https://blog.oxen.ai/suds-a-guide-to-structuring-unstructured-data/

Let me know what you think!

submitted by /u/FallMindless3563
[link] [comments]