Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

[self-promotion] Hosted Embedding Marketplace – Stop Scraping Every New Data Source, Load It As Embeddings On The Fly For Your Large Language Models

We are building a hosted embedding marketplace for builders to augment their leaner open-source LLMs with relevant context. This lets you avoid all the infra for finding, cleaning, and indexing public and third-party datasets, while maintaining the accuracy that comes with larger LLMs.

We will be opening up early access soon. If you have any questions, be sure to reach out and ask!

Learn more here

submitted by /u/achyutjoshi

In Search Of E-Mail Datasets Like ENRON

Hi everyone, I am a PhD student currently working on an NLP and network analysis project. I am in search of an email dataset with sender, receiver, and message information, preferably from companies and with some other metrics such as performance included (which is not an absolute necessity). If anyone knows of such a dataset, like ENRON or SpamAssassin, and could direct me to it, I would be most thankful.
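For the network analysis side, here is a minimal sketch of turning raw messages into a weighted sender-to-receiver edge list with Python's standard `email` package. It assumes the messages are available as RFC 5322 text (as in the ENRON maildir files); the function name `email_edges` and the sample messages are made up for illustration.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def email_edges(raw_messages):
    """Count (sender, recipient) pairs across raw message strings."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # getaddresses() returns (display name, address) pairs.
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient:
                    edges[(sender.lower(), recipient.lower())] += 1
    return edges


# Hypothetical sample messages, not taken from any real dataset.
sample = [
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com, Carol <carol@example.com>\n"
    "Subject: Q3\n\nNumbers attached.",
    "From: bob@example.com\n"
    "To: alice@example.com\n"
    "Subject: Re: Q3\n\nThanks.",
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Follow-up\n\nOne more thing.",
]
edges = email_edges(sample)
```

The resulting counter maps each directed pair to a message count, which can be loaded into a graph library as edge weights.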

submitted by /u/Saklehir

Dataset For Airline Passengers 2019-2022

I’m doing a project where we are gathering data about airlines. I need a dataset with detailed passenger demographics for the years 2019-2022, primarily age, gender, and possibly nationality. It has been a pain in the ass to find anything that specific, and I’m guessing it is hard to find because most datasets have limited information, and others may have restrictions on how the data can be used. If you do find anything, please comment.

submitted by /u/ShrimpChipCEO

Looking For Suitable Dataset To Predict Forest Fires For My Project

The subject of my project is predicting forest fires, and I am looking for a dataset similar to the one shared on Kaggle, but I can’t find one. I looked on Earth Engine and found some datasets, but they don’t provide dates and they are ImageCollections, not CSV. I am familiar with machine learning and with cleaning CSV datasets after turning them into dataframes, but not at all familiar with ImageCollections. So basically my question comes down to two paths:

I use the datasets from Earth Engine, but I don’t know how to work with them. So perhaps someone could give me some tips on how to predict forest fires with them.

Can someone guide me towards a suitable dataset to predict forest fires?

I appreciate all input!

submitted by /u/Ripplekipple

Best Books (10k) Multi-Genre Data [self-promotion]

I started on this idea of finding a comprehensive book dataset that reliably has a description and more than one genre per book (makes things more realistic), since I wanted to cluster books by similarity to find some good ones to read for myself 😉 The only ones I could find on Kaggle had a single genre label, so I collected the data on my own.

So sharing it here in case it helps someone else too:

[Dataset](https://www.kaggle.com/datasets/ishikajohari/best-books-10k-multi-genre-data)

The data was collected from the Goodreads list "Books That Everyone Should Read At Least Once" and contains descriptions, ratings, and multiple genre labels per book.
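As a rough starting point for the similarity idea, here is a sketch that ranks books by Jaccard similarity of their genre sets. It is not the clustering itself, only a pairwise score that a clustering step could build on. The titles and genre lists are invented placeholders, and the dataset's actual column names are not assumed here.

```python
def jaccard(a, b):
    """Jaccard similarity of two genre collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(title, genres_by_title, top_n=3):
    """Return the top_n other titles ranked by genre overlap with `title`."""
    target = genres_by_title[title]
    scored = [
        (jaccard(target, genres), other)
        for other, genres in genres_by_title.items()
        if other != title
    ]
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, round(score, 3)) for score, other in scored[:top_n]]


# Hypothetical rows; real genre lists would come from the dataset.
books = {
    "Book A": ["Classics", "Fiction", "Romance"],
    "Book B": ["Classics", "Fiction", "Historical"],
    "Book C": ["Science Fiction", "Fiction"],
    "Book D": ["Nonfiction", "History"],
}
closest = most_similar("Book A", books, top_n=2)
```

With these placeholder rows, `closest` lists "Book B" ahead of "Book C", since "Book B" shares two of four distinct genres with "Book A" and "Book C" shares one of four.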

submitted by /u/ishika_jo

Free Arrival/departure Aircraft API?

I’m wondering if there is a free aviation API to track arrivals and departures to a set airport. It would collect: Callsign, Aircraft Type, Gate, and Arrival/Departure airport, then plug that into a Google Sheet.
Currently I run this process manually by looking at FlightAware data, but if I can automate this for free that would be great!

submitted by /u/ModeratorOfNothing

Sephora Cosmetic Cost And Reviews Dataset

In March 2023, a Python scraper was used to collect a dataset of over 8,000 beauty products available on the Sephora online store. The dataset includes comprehensive details about the products, such as their names, brands, prices, ingredients, ratings, and other relevant features.

Source of the dataset: https://www.kaggle.com/datasets/nadyinky/sephora-products-and-skincare-reviews

To view the dataset: https://app.gigasheet.com/spreadsheet/Sephora-Cosmetic-Reviews/e74caa44_2abf_4f49_bc0d_3477fcb1663e

submitted by /u/sheetheadd

Datasets Concerning Nextdoor And Ring

Hi all,

Wondering if anyone has any ideas for accessing or gathering data on Nextdoor and Ring user demographics – specifically in four target cities. This is a part of a larger project that is examining the effects of technology on neighborhoods. Been beating my head against a wall trying to figure this out myself and decided I would ask y’all.

Thanks!

submitted by /u/FrightFeats

Getting The Exact Address Of Companies With The Following Data: Company, City, State; What Is My Best Option?

Hi all,

I’m currently trying to merge 3 lists of companies, together over 30,000 rows. To make sure I can delete the duplicates, I want to merge the companies into one list with one shared cell for all of them: the exact address or coordinates.

I’ve tried using the Bing Maps API, but on checking, the addresses don’t show up correctly. What I can do is go into Google Maps and manually enter the company, state & city and then copy the address, but doing this for 30,000 rows will take me years.

What would my best option be to do this? I’m advanced with Zapier & Power Automate.
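One possible first pass that needs no geocoding at all: deduplicate on a normalized (company, city, state) key, then look up addresses only for the rows that survive. The sketch below assumes each row is a dict with `company`, `city`, and `state` fields. The suffix list and the helper names `match_key` and `merge_unique` are illustrative choices, not a standard.

```python
import re

# Legal suffixes ignored when comparing names (an assumed, incomplete list).
SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "company",
            "corporation", "incorporated", "limited"}


def match_key(company, city, state):
    """Build a comparison key that ignores case, punctuation and suffixes."""
    words = re.sub(r"[^a-z0-9]+", " ", company.lower()).split()
    # Drop trailing suffixes, but never strip the name down to nothing.
    while len(words) > 1 and words[-1] in SUFFIXES:
        words.pop()
    return (" ".join(words), city.strip().lower(), state.strip().lower())


def merge_unique(*lists):
    """Merge several row lists, keeping the first row seen for each key."""
    seen = {}
    for rows in lists:
        for row in rows:
            key = match_key(row["company"], row["city"], row["state"])
            seen.setdefault(key, row)
    return list(seen.values())


# Hypothetical rows standing in for the three lists.
list_a = [{"company": "Acme, Inc.", "city": "Austin", "state": "TX"}]
list_b = [
    {"company": "ACME Inc", "city": "austin ", "state": "tx"},
    {"company": "Acme Inc", "city": "Dallas", "state": "TX"},
]
list_c = [{"company": "Globex Corporation", "city": "Springfield", "state": "IL"}]
merged = merge_unique(list_a, list_b, list_c)
```

Exact-key matching like this misses misspellings and abbreviations, so the remaining rows may still need a geocoder or fuzzy matching, but on a much smaller list.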

Many thanks!

submitted by /u/Florent-Lesage