Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

REQUEST: 275M USA Business Email Dataset

Hey r/datasets,

I represent a small business that is looking to replicate the 275,000,000 records in Apollo.io, ZoomInfo, etc. We are just looking for USA biz emails (not consumer).

This is essentially LinkedIn data + emails.

We can go without phone numbers perhaps.

We have some surprisingly low offers already, but please DM me with any leads on a dataset like this.

Thanks in advance!

(Would also accept offers on a two-column dataset: Name / Email)

submitted by /u/Anon_PR_pro

Building A Niche Data Community Of Likeminded People!

Hello everyone,

TL;DR – I’m starting a community for professionals in the data industry or those aiming for big tech data jobs. If you’re interested, please comment below, and I’ll add you to this niche community I’m building.

A bit about me – I’m a Senior Analytics Engineer with extensive experience at major tech companies like Google, Amazon, and Uber. I’ve spent a lot of time mentoring, conducting interviews, and successfully navigating data job interviews.

I want to create a focused community of motivated individuals who are passionate about learning, growing, and advancing their careers in data. Please note that this is not an open-to-all group. I’ve been part of many such “communities” that lost their appeal due to lack of moderation. I’m looking for people who are genuinely interested in learning and growing together, maybe even starting a data-related business.

Imagine a community where we:
* Share insights about big tech companies
* Exchange actual interview questions for various data roles
* Conduct mock interviews to help each other improve
* Access my personal collection of resources and tools that simplify life
* Share job postings and referral opportunities
* Collaborate on creating micro-SaaS projects

If this sounds exciting to you, let me know in the comments or reach out to me.

PS: Would you prefer this community on Slack or Discord?

Cheers!

submitted by /u/IllustratorOk7613

Seeking Feedback: Grocery Pricing Dataset API

Hello, DataMunchers!

I just launched my Grocery Pricing API on RapidAPI, and I’m super stoked to share it with you all! It’s a real-time treasure trove of pricing info for all your grocery needs.

I’m all ears for your thoughts! Any cool features you think would make this API even better? Shoot me your ideas—I’m here to make this tool awesome for us all.

Check it out on RapidAPI and let’s chat about making our data game stronger!

Thanks a ton for your input!

submitted by /u/Affectionate-Olive80

[REQUEST] Saudi Market Data, Live Or Historic.

Hi, I searched online a lot for historic and live (even if it’s only updated daily) Saudi market data but couldn’t seem to find any. I don’t know if such data is open or not, but it feels like market data should be readily available since it’s something public.

Could anyone help me find it, or point me to an open (or even paid) source? Just not TickerChart, which is laggy, faulty, unclean, and expensive, and doesn’t let me easily export data to CSV.

submitted by /u/Pxy_

Searching For A Data Set: School Data Task On The Dietary Habits And Nutritional Knowledge Of High School Students In Relation To Academic Performance

For school I have a task where, using secondary and primary data, I have to investigate my topic: “How do the dietary habits and nutritional knowledge of high school students correlate with overall health and academic performance?” The idea is that, using previous Australian data, I can build some kind of questionnaire to collect primary data. Finding this data is difficult, though, and I was wondering if anyone could point me in the right direction or help me out with a dataset.

submitted by /u/Jeddyson

Independence Of Observations In Datasets

Hi everyone,

I was performing some binary logistic regressions today, but had a bit of a disaster. My analysis involves looking at a country’s International Criminal Court membership as the dependent variable (coded 0 or 1) and other independent factors such as level of democracy etc.

I thought it was going well. However, when it came to my assumptions testing, I realised something was slightly wrong: my Breusch-Pagan test (for residuals) and my GVIF test (for multi-collinearity) had terrible scores.

Then something occurred to me: the dataset I had been using had a row per country per year. I am presuming that this violates the independence of observations, as multiple rows have the same country in them?

Does this mean I have to re-do all my analysis with just one row per country instead? This would mean I would have to change my scope to looking at stats for the country in the year they joined rather than looking across all the years.
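For what it’s worth, the one-row-per-country reshaping described above could be sketched roughly like this in plain Python. The column names (`country`, `year`, `icc_member`) are made up for illustration, and keeping the latest available year for countries that never joined is an assumption, not something settled above.

```python
def collapse_to_one_row_per_country(rows):
    """Reduce country-year rows to a single row per country.

    rows: iterable of dicts with the (hypothetical) keys
    "country", "year" and "icc_member" (0 or 1).
    Members keep the first year in which icc_member == 1.
    Non-members keep their latest year (an assumption).
    """
    by_country = {}
    # Sort so that each country's rows are visited in chronological order.
    for row in sorted(rows, key=lambda r: (r["country"], r["year"])):
        kept = by_country.get(row["country"])
        if kept is None or kept["icc_member"] == 0:
            # Nothing stored yet, or no membership seen so far:
            # move forward to this (later) year.
            by_country[row["country"]] = row
        # Otherwise the stored row is already the joining year; keep it.
    return list(by_country.values())
```

Whether collapsing the panel like this is the right fix for the independence problem is a separate question; the sketch only shows the mechanics of the reshaping.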

I would appreciate any help or advice you could give, as I am slightly stressed and confused!

Many thanks,

Tom

submitted by /u/grovseyy

Worldwide Violence Perception Dataset For The Period 1970-2021

I’m looking for a dataset that measures perceptions of violence or crime globally for the period 1970-2021. The Global Peace Index (GPI) would be ideal, but it only covers the years 2008-2023.

I’m aware that it’s almost impossible to find such a dataset, so I’d take suggestions that measure violence, crime, conflict or any similar proxy for violence perception. However, I can’t deviate much from the period 1970-2021.

submitted by /u/Puzzleheaded_Steak54

How To Obtain Data For Journalist Discovery

Hey everyone,

I’m currently working on developing a platform to assist startups in pitching journalists for media coverage, and I could really use some advice on obtaining the necessary journalist data to make it happen.

As part of our efforts to build a comprehensive Journalist Discovery Module, we’re looking to gather essential data to facilitate the identification and connection with relevant journalists. Here’s a list of the data we need:

* Email Addresses of Journalists
* Recent Articles Written by Journalists (with publication details and dates)
* Social Media Profiles of Journalists (e.g., Twitter, LinkedIn)
* Topics Covered by Journalists

If you’ve got any ideas on how we can access this data, I’d be eternally grateful for your guidance!

submitted by /u/Imaginary-Bench-3175

Looking For A Self-hostable Platform For Sharing Datasets

Objective:

I’m looking to create a website intended to gather together and release datasets for a specific theme (impact investing).

These would be a mixture of unamended open access datasets and a few with my edits. CSV and JSON mostly.

It would be cool to also be able to add blog posts with live data object embeds. And maybe (this is a “stretch feature” idea) include a sandbox for querying a read-only database. But the essential elements would be sharing datasets in a way that’s better than GitHub (no objection to that but I want to give potential visitors a specific site to access).

I tried setting up CKAN today on a VPS and found it a lot of work to get running. I think something a little simpler from an admin perspective would make more sense.

It’s a not-for-profit personal project so I’d like to keep costs reasonable.

Any suggestions for platforms, hosting, or both much appreciated!

submitted by /u/danielrosehill

Need Written Prose On People’s Perception Of Artificial Intelligence (AI) And Their Job Prospects

If anyone can connect me with any written prose (up to and including reddit threads) from everyday working-age people on the adoption of artificial intelligence by corporations and organizations and what they feel it portends for their job prospects now and in the future, I’d sure be thankful. I’m doing a primary research study on such, but I’d like to have unprompted thoughts with which to compare my dataset.

My gratitude abounds.

submitted by /u/molineskytown

Crime Rates In The US- Latest Data Needed

Hi everyone, I’m looking for a reliable open source with the latest available crime rates, crime index, or rankings for all the cities in the USA. Can anybody help me out with this? I have tried looking on the FBI’s site, but all I could find there is data by state or by region population size.

submitted by /u/bandhu_

Looking For An Old Drugbank.ca Dataset

Dear community,

Back in 2019 or 2020, I downloaded the full dataset from Drugbank.ca and have been using it for personal purposes ever since. Unfortunately, I recently lost all my data (both on my NAS and in backup), and I’m unable to re-download the dataset as access is now restricted. I’m not affiliated with any academic institution and sadly, I can’t afford the payment.

Does anyone happen to have an old version of their full database?

I would be *extremely* grateful for your help.

submitted by /u/VohaulsWetDream

Earth Science Dataset Binary Classification

I’m a statistician looking for a dataset in earth science for a binary classification task, i.e., the response variable should be binary. My goal is to test a newly developed version of the invariant causal prediction algorithm, which tries to find the immediate causal drivers of some response variable. Do you have any suggestions for interesting datasets with roughly 3 to 10 covariates (continuous or categorical) and a binary response? Any help would be much appreciated!

submitted by /u/ParticularJacket6330
[link] [comments]

Effective Method For Finding Common Colleges In Two Excel Sheets Despite Inconsistent Formatting

I have two Excel sheets, each containing a huge set of college names in different formats and abbreviations. I want to find the colleges common to both sheets, but because the names are formatted inconsistently, this is proving very tedious and difficult. Kindly suggest the most effective method for doing this.
Is there any way to do it in Excel, either with the help of some other tool or with Excel’s built-in tools? I have already tried sort, filters, find and replace, etc.
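One possible approach outside Excel, sketched in Python with only the standard library: normalize each name, then score pairs with `difflib.SequenceMatcher`. The abbreviation table, the 0.85 threshold, and the function names are illustrative assumptions to tune against the real data, not a definitive method.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; extend it for the actual sheets.
ABBREVIATIONS = {"univ": "university", "inst": "institute",
                 "tech": "technology", "coll": "college"}
STOP_WORDS = {"of", "the", "and"}


def normalize(name):
    """Lowercase, drop punctuation and filler words, expand abbreviations, sort words."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    words = [ABBREVIATIONS.get(w, w) for w in words if w not in STOP_WORDS]
    return " ".join(sorted(words))


def common_colleges(sheet_a, sheet_b, threshold=0.85):
    """For each name in sheet_a, return its best match in sheet_b if it scores >= threshold."""
    normalized_b = [(b, normalize(b)) for b in sheet_b]
    matches = []
    for a in sheet_a:
        key_a = normalize(a)
        best_name, best_score = None, 0.0
        for b, key_b in normalized_b:
            score = SequenceMatcher(None, key_a, key_b).ratio()
            if score > best_score:
                best_name, best_score = b, score
        if best_score >= threshold:
            matches.append((a, best_name, round(best_score, 2)))
    return matches
```

The two columns would first be exported from Excel (for example as CSV) and read into lists. Every name is compared against every other name, so this gets slow on very large sheets, and borderline scores are worth reviewing by hand.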

submitted by /u/Darkness-of-Light

Looking For Dataset, Consisting Of Invoices And Receipts With The Corresponding General Ledger/ERP Entries

Dear community, I’m in search of a comprehensive dataset that includes Receipt Data and Invoice Data, with more than 100,000 item-lines in formats such as PDF, JPG, etc. Additionally, I need the corresponding general ledger/ERP entries, including the chosen account according to the chart of accounts, VAT, and so on.
I haven’t been able to find anything on the web. Does anyone know where I can obtain such datasets?

submitted by /u/Altruistic-Box-5744

Workout Logs (Strength Training) – Exercise, Weight, Reps

Hey everybody,

I’m currently building something that relates molecular biology, time-series algos and more to optimize muscle and strength building.

For that I need data in the form of workout logs from people. They should look something like this:

Deadlift 180kg 1×3

Squats 100kg 3×12

Lying Hamstring curls 50kg 3×8
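A minimal sketch of how lines in this format might be parsed into structured records, assuming the number before the “×” is the set count and the number after it is the reps per set (the examples don’t say which is which). `LogEntry` and `parse_log_line` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    exercise: str
    weight_kg: float
    sets: int  # assumption: the number before "×"
    reps: int  # assumption: the number after "×"


# Exercise name, weight in kg, then sets and reps separated by "x" or "×".
LINE_PATTERN = re.compile(
    r"^\s*(?P<exercise>.+?)\s+(?P<weight>\d+(?:\.\d+)?)\s*kg\s+"
    r"(?P<sets>\d+)\s*[x×]\s*(?P<reps>\d+)\s*$",
    re.IGNORECASE,
)


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry for a line like 'Squats 100kg 3×12', or None if it does not fit."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    return LogEntry(
        exercise=match["exercise"],
        weight_kg=float(match["weight"]),
        sets=int(match["sets"]),
        reps=int(match["reps"]),
    )
```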

Would help me out immensely if you have such a dataset / know someone who does and are willing to share it!

In return, everyone who contributes is invited to use the beta version for free, of course! :)

Cheers,

Tim

submitted by /u/Biotential

Better Way Of Preparing Datasets For Finetuning With Large Text In Each Example???

Is there a better way to prepare datasets?

My datasets are in this format:

text : length 19k

extracted entity 1 : list of entity 1 extracted

extracted entity 2 : list of entity 2 extracted

Does anyone have an idea of how to fine-tune an open-source model with this kind of data?

Is fine-tuning the better option, given that the model (LLM) has to learn to extract items from the text and the text is so long?

Example: I want to train an LLM to look at the whole text of a book and extract author names, place names, and people’s names. I now have data for 100 books. How can I prepare datasets to fine-tune an LLM to be very good at this extraction? Note that I have supervised data: each book’s text along with the author, people, and place names extracted from the whole text.
How can I fine-tune a good model? Let me know.
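One possible way to handle texts this long, sketched below under stated assumptions: split each text into overlapping chunks and pair every chunk with only the labels that literally appear in it. This assumes the 19k length is in characters, that the extracted entities occur verbatim in the text, and that the training tool accepts prompt/completion JSONL. The field names, chunk sizes, and prompt wording are placeholders.

```python
import json


def chunk_text(text, chunk_size=4000, overlap=400):
    """Split text into overlapping character windows that together cover all of it."""
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


def build_examples(text, entities, chunk_size=4000, overlap=400):
    """Build one training example per chunk.

    entities: dict mapping a field name to the list of strings
    extracted from the whole text, e.g. {"places": [...], "people": [...]}.
    """
    examples = []
    for chunk in chunk_text(text, chunk_size, overlap):
        # Keep only labels visible in this chunk, so the model is never
        # trained to output something it cannot see in its input.
        targets = {field: [value for value in values if value in chunk]
                   for field, values in entities.items()}
        examples.append({
            "prompt": "Extract the entities from the text below as JSON.\n\n" + chunk,
            "completion": json.dumps(targets, ensure_ascii=False),
        })
    return examples


def write_jsonl(examples, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for example in examples:
            handle.write(json.dumps(example, ensure_ascii=False) + "\n")
```

At inference time the same chunking would be applied and the per-chunk outputs merged and de-duplicated. Entities that are paraphrased rather than quoted would be missed by the literal `in` check and would need a looser match.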

submitted by /u/Guilty-Tea6607