Need Advice: How To Collect 2k Company Contacts (specific Roles) Without Doing Everything Manually?

Hi everyone, I’m facing a problem and could really use some advice from people who’ve done this before or been in a similar situation.

I need to collect contact details for around 2,000 companies, but the tricky part is that I don’t need generic inboxes like info@ or support@. I specifically need contacts of responsible people (for example: Head of HR, HR Manager, CEO, Founder, or similar decision-makers). Doing this manually company by company feels almost impossible at this scale. I’m facing this challenge for the first time and don’t know where to start.

I’m open to: paid tools, APIs, semi-automated workflows, services you’ve personally used, or even outsourcing ideas (if that’s realistic).

My main questions:

Is this realistically automatable?
Are there tools/services that actually work for role-based contacts?
What should I absolutely avoid (wasting money, getting banned, bad data, etc.)?

I’d really appreciate any real-world experience, tool recommendations, or warnings. Thanks in advance 🙏

submitted by /u/grafieldas

How Do You Usually Clean Messy CSV Or Excel Files?

I’m trying to understand how people deal with messy CSV or Excel files before analysis.

submitted by /u/__Badass_

Any (free) Api Out There To Classify Domain Names?

Basically, I am looking for an API (free if possible) to classify whether a given domain name is listed for sale or developed. Google doesn’t return anything. I did come across the WhoisXML APIs, but they only offer a history API, which is pretty expensive; I tried it, and the data seemed pretty outdated. I need to process at least 1M domains monthly (happy to pay per request). Would appreciate some directions.

submitted by /u/ghad0265

I Need Free Audio Datasets For My Project

Need free, high-quality audio datasets for tasks like speech recognition, sound classification, or environmental noise analysis—ideally with labels, metadata, and permissive licenses (CC0 or similar). Does anyone have recommendations for sources beyond Hugging Face (Common Voice, AudioSet) or Kaggle? Bonus if they’re preprocessed or good for big data tools like Spark/Hadoop. Links, sizes, and usage tips appreciated.

submitted by /u/yobigp

Looking For Geotagged Urban Audio Data.

I’m training a SLAM model to map road noise to GIS maps. Looking for as much geolabeled audio data as possible.

submitted by /u/EverythingGoodWas

Neighborhood Data On Race/ Ethnicity/nationality Density By Area. How To Get That Data?

I need to get data on population density by neighborhood for a niche nationality/ethnicity, for a local business. How do I get that data?

What is my avenue? Is the data available? Is it available through open records?

submitted by /u/Leather-Wheel1115

Creating Datasets For Physical Activities, What Sensors?

Those of you collecting data for sports, hobbies, workouts, or other physical activities: what sensors are you using?

I’m currently using the WitMotion WT901 sensor, but I’d love to know what others are using.

Extra information: I’m finishing an iOS app for collecting phone data specifically for AI data training, with support for time syncing with external sensors. I’ll need this data for my own personal project. I’m trying to figure out whether I’d be better off using a different sensor. The only concern is that some sensors have so little documentation that connecting to them through the app, reading the data, and syncing it with my phone data is an absolute pain. The WitMotion sensor took me forever to get working with the phone’s sensor data.

submitted by /u/programmerguineapigs

I’m Looking For A Very Large Spatial Dataset

I thought this would be easy to find, but it’s been difficult so far. All I’m looking for is:

At least 10,000 observations
Open-source (or at least free to access)
Each observation has two spatial coordinates (x and y or longitude/latitude)
Each observation has at least two numeric variables (one that can be used as an explanatory variable, and one as a response variable)
NOT temporal/time-based

Anyone know where else I can look? I haven’t been able to find anything on the UCI ML repository. I’m sifting through Kaggle now but there are so many options.

submitted by /u/Cold-Priority-2729

Looking For Blood Test Dataset Of Multiple Diseases

I’m new and testing things out with LLM training. Should I look for datasets on individual diseases, or is there a way to find this particular dataset? Someone mentioned using a synthetic dataset, but I’m not sure about it.

Will the LLM learn properly if, for example, one dataset has cholesterol values and another has liver-related values or something?

submitted by /u/SAY_GEX_895

2 Million Messy → Clean Addresses. What Would You Build With This?

Hello fellow developers,

I have a dataset containing 2 million complete Brazilian addresses, manually typed by real users. These addresses include abbreviations, typos, inconsistent formatting, and other common real-world issues.

For each raw address, I also have its fully corrected, standardized, and structured version.

Does anyone have ideas on what kind of solutions or products could be built with this data to solve real-world problems?

Thanks in advance for any insights!

submitted by /u/Hour-Dirt-8505

Extract Data From PDF Figures And Graphs

submitted by /u/cavedave

Datasets Where The Schema Actually Breaks Over Time?

I’m trying to get better at handling real-world data drift, not just loading clean CSVs once.

Are there public datasets where:

Fields get added/removed over time
Data types quietly change
Nulls suddenly spike for no obvious reason

Basically datasets that force you to add validation and monitoring instead of assuming everything stays the same.
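
To make the three failure modes listed above concrete, here is a rough sketch of the kind of check they call for: profile two snapshots of the same feed and diff the profiles. Everything here is an assumption for illustration: the function names `profile` and `diff_profiles`, the row-as-dict layout, and the 0.2 null-rate threshold are made up, not taken from any particular tool.

```python
def profile(rows):
    """Summarize one snapshot: per-field value types and null rate.

    rows is a list of dicts; a missing key, None, or "" counts as null.
    """
    keys = set()
    for row in rows:
        keys.update(row)
    summary = {}
    for key in keys:
        values = [row.get(key) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        summary[key] = {
            "types": {type(v).__name__ for v in present},
            "null_rate": 1 - len(present) / len(rows) if rows else 0.0,
        }
    return summary


def diff_profiles(old, new, null_jump=0.2):
    """Report added/removed fields, changed types, and null-rate spikes.

    null_jump is an arbitrary illustrative threshold, not a recommendation.
    """
    issues = []
    for key in sorted(old.keys() - new.keys()):
        issues.append(f"field removed: {key}")
    for key in sorted(new.keys() - old.keys()):
        issues.append(f"field added: {key}")
    for key in sorted(old.keys() & new.keys()):
        old_types, new_types = old[key]["types"], new[key]["types"]
        # Only compare types when both snapshots have non-null values.
        if old_types and new_types and old_types != new_types:
            issues.append(
                f"type changed: {key} {sorted(old_types)} -> {sorted(new_types)}"
            )
        if new[key]["null_rate"] - old[key]["null_rate"] > null_jump:
            issues.append(f"null spike: {key}")
    return issues


# Two made-up snapshots of the same feed.
old_rows = [{"id": 1, "price": 9.5, "city": "Oslo"},
            {"id": 2, "price": 3.0, "city": "Rome"}]
new_rows = [{"id": "1", "price": None, "zip": "0150"},
            {"id": "2", "price": None, "zip": "00100"}]

for issue in diff_profiles(profile(old_rows), profile(new_rows)):
    print(issue)
```

Run against successive downloads of a long-running feed, a diff like this is one simple way to see which of the three problems a given dataset actually exhibits.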

I’m less interested in size and more in realism.
APIs, government feeds, or long-running open datasets all welcome.

Would love examples + what broke for you when you used them.

submitted by /u/crowpng

6500 Hours Of Multi-person Action Video. Rights Cleared, 1080 30fps

Dataset Overview

∙ Size: 6,500 hours / average clip length 25 minutes / 13 TB

∙ Resolution: 1080p

∙ Frame rate: 30fps

∙ Format: MP4 (H.264)

I have a dataset I’ve gathered at my rage room business. We have 4 rooms with consistent camera and lighting. The camera angle is from the top corner of the room, a standard CCTV angle. Groups of 1-6 people. Full PPE for all subjects, mostly anonymous; some subjects take off the helmet at the end. All subjects have signed a talent release.

Activities: Physical actions including destruction, tool use, object interaction, coordination tasks

Objects: Various materials (glass, electronics, tools)

Scenarios: Both coordinated and chaotic multi-person behavior

Samples available

Looking to license

Open to feedback. I’m currently collecting more video every day and am willing to create custom datasets.

submitted by /u/DrHARDCOREy

Dataset For School Incident Classification

Hi everyone! I’m currently working on a school-related machine learning project where I’m trying to classify short incident reports written in free text. The goal is to help guidance counselors sort through reports more easily by grouping them based on the type of incident and how serious it might be.

I’m using a pretty simple approach (Naive Bayes) and focusing on things like bullying, harassment, misconduct, vandalism, and facility concerns, with labels like minor or major. The model is just meant to assist with organization and prioritization (all final decisions are still made by people).
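
For readers unfamiliar with the approach, a Naive Bayes text classifier of the kind described can be sketched in plain Python. This is only a minimal illustration, not the poster’s actual model: the function names (`tokenize`, `train`, `predict`), the whitespace tokenizer, the add-one smoothing, and the four example reports are all made-up placeholders.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


def train(examples):
    """examples is a list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["labels"].items():
        score = math.log(doc_count / total_docs)  # class prior
        total_words = sum(model["words"][label].values())
        for word in tokenize(text):
            if word not in model["vocab"]:
                continue  # ignore words never seen in training
            count = model["words"][label][word]
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((count + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Synthetic, made-up reports purely for illustration.
examples = [
    ("student pushed another student in hallway", "bullying"),
    ("repeated name calling and threats", "bullying"),
    ("broken window in classroom", "facility"),
    ("leaking pipe in the gym", "facility"),
]
model = train(examples)
print(predict(model, "window broken in the gym"))
```

A second classifier trained the same way on minor/major labels would cover the severity side; with so few examples the output means nothing beyond showing the mechanics.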

Right now, I’m looking for a public, anonymized, or synthetic dataset with short complaint- or incident-style text that I can train the model on. It doesn’t have to be school-specific; anything similar (complaints, reports, misconduct descriptions, etc.) would be super helpful as long as it’s ethical to use.

Since this is an academic project, I can’t use real or identifiable student data, and everything will only be used for research.

If you know of any datasets, past projects, or even tools for generating realistic synthetic text, I’d really appreciate the help. Thanks in advance!

submitted by /u/Soggy_Macaron_5276

I Made A Free Tool To Extract Tables From Any Webpage (Wikipedia, Gov Sites, Etc.)

Made a quick tool and thought some might find it useful!

🔗 lection.app/tools/table-extractor

It does one thing: paste a URL, it finds all HTML tables on the page, and you can download them as CSV or JSON. No signup, no API key, just works.

Works great for:

Wikipedia data tables

Government/public data portals

Sports stats sites

Any page with HTML tables

Limitations: Won’t work on JavaScript-rendered tables (like React dashboards) since it fetches raw HTML. But for most static pages it works pretty well.
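
For anyone curious what "finds all HTML tables in the raw HTML" can look like, here is a rough standard-library sketch. It is not the code behind the tool above; the class name `TableParser` and the function `tables_to_csv` are invented for this example, and it assumes reasonably well-formed, non-nested tables.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <table> in a static HTML string."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables, each a list of rows
        self._table = None  # rows of the table being read
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None
            self._row = None
            self._cell = None


def tables_to_csv(html):
    """Return one CSV string per table found in the HTML."""
    parser = TableParser()
    parser.feed(html)
    results = []
    for table in parser.tables:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(table)
        results.append(buffer.getvalue())
    return results


sample = (
    "<table><tr><th>Name</th><th>Pop</th></tr>"
    "<tr><td>Oslo</td><td>709,000</td></tr></table>"
)
print(tables_to_csv(sample)[0])
```

Because it only parses the HTML string it is given, a sketch like this shares the same limitation: tables that JavaScript builds after page load never appear in the input.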

Let me know if you run into any issues or have suggestions!

submitted by /u/Unmoovable

I’m Looking For Help Creating A Dataset

Hi everyone! I would like to start a new research project, and I would appreciate it a lot if anyone wants to join! The project consists of taking high-quality scans of leaves. I know it sounds basic, but it can have a great impact in the field of natural sciences. It is very hard to find high-quality pictures of leaves online. High-quality scans can reveal the vein structure clearly, opening up a whole set of research possibilities. If anyone is interested in collaborating, you can send me a DM 🙂

submitted by /u/MammothComposer7176

I’m Looking For A Used Car Dataset For University Project

I’m looking for a dataset with the following features for a large number of vehicles:

Brand, model, year
Mileage
Engine, transmission, drivetrain, fuel type, and other specs
Price
Vehicle condition (e.g., minor/moderate/severe damage or Good/Fair/Salvage)

submitted by /u/shamsfathalla

Michelin Star Restaurant Dataset

submitted by /u/cavedave

Open Dataset: 3,023 Enterprise AI Implementations With Analysis

I analyzed 3,023 enterprise AI use cases to understand what’s actually being deployed vs. vendor claims.

Key findings:

Technology maturity:

Copilots: 352 cases (production-ready)
Multimodal: 288 cases (vision + voice + text)
Reasoning models (e.g. o1/o3): 26 cases
Agentic AI: 224 cases (growing)

Vendor landscape:

Google published 996 cases (33% of dataset), Microsoft 755 (25%). These reflect marketing budgets, not market share.

OpenAI published only 151 cases but appears in 500 implementations (3.3x multiplier through Azure).

Breakthrough applications:

4-hour bacterial diagnosis vs 5 days (Biofy)
60x faster code review (cubic)
200K gig workers filed taxes (ClearTax)

Limitations:

This shows what vendors publish, not:

Success rates (failures aren’t documented)
Total cost of ownership
Pilot vs production ratios

My take: Reasoning models show capability breakthroughs but minimal adoption. Multimodal is becoming table stakes. Stop chasing hype; look for measurable production deployments.

Full analysis on Substack.
Dataset (open source) on GitHub.

submitted by /u/abbas_ai

Seeing The Same File-level Data Issues Again And Again, Why Are These Still So Hard To Catch?

Over the last few weeks, I’ve seen multiple discussions and anecdotes around file-level data problems that pass basic validation but still cause downstream pain.

Things like:

placeholder values that silently propagate
zero-width or invisible characters
encoding or locale-specific quirks
delimiter and quoting inconsistencies
numeric values flipping to scientific notation
dates and timezones behaving “correctly” but wrong in context

What’s interesting is that many of these aren’t schema violations and don’t fail parsing. The file looks fine, loads fine, and only causes issues much later.
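
A few of the issues listed above can be caught with cheap per-cell checks even though the file parses cleanly. The sketch below is only an illustration under stated assumptions: the placeholder list, the set of invisible characters, and the names `check_cell` and `scan` are made up for this example, and it assumes every cell has already been read as a string.

```python
import re

# Assumed, non-exhaustive sets; adjust to the data at hand.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
PLACEHOLDERS = {"n/a", "na", "null", "none", "-", "tbd", "unknown"}
SCI_NOTATION = re.compile(r"^[+-]?\d+(\.\d+)?[eE][+-]?\d+$")


def check_cell(value):
    """Return a list of findings for one string cell."""
    findings = []
    if any(ch in ZERO_WIDTH for ch in value):
        findings.append("invisible character")
    # Judge the remaining checks on the visible content only.
    visible = "".join(ch for ch in value if ch not in ZERO_WIDTH).strip()
    if visible.lower() in PLACEHOLDERS:
        findings.append("placeholder value")
    if SCI_NOTATION.match(visible):
        findings.append("scientific notation")
    return findings


def scan(rows):
    """rows is a list of dicts of strings; returns (row, column, finding)."""
    report = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            for finding in check_cell(value):
                report.append((index, column, finding))
    return report


# Made-up rows showing values that load fine but are suspicious.
rows = [
    {"id": "4.2e+11", "city": "\ufeffRome"},
    {"id": "1007", "city": "N/A"},
]
for item in scan(rows):
    print(item)
```

Checks like these say nothing about dates, timezones, or locale quirks, which depend on context rather than on the characters in the cell.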

A common pattern seems to be:

data comes from external teams or manual exports
files change subtly over time
validation focuses on structure, not behavior

Is this problem worth solving? I ask because I have repeatedly tried to resolve it myself, at least to some extent.

One approach I’ve seen discussed is tackling these issues incrementally, case by case, rather than trying to “validate everything” upfront, but adoption itself seems hard, especially when data privacy and workflow friction are concerns.

For people working in data engineering or analytics:

Which file-level issues have caused the most real-world pain for you, despite the files being technically valid?

Curious what patterns others have noticed. And is this a real issue for everyone out there?

submitted by /u/PriorNervous1031

Have You Had Experience Selling Your Own Datasets, And If So, What Was It Like?

I’ve spent several years selling custom datasets to companies, and more recently began developing a data marketplace for professional datasets. The goal is to create a space where high-quality data can be published, bought, and sold. I’d appreciate any feedback on the idea.

submitted by /u/Other-Place2942

Is There A Flights API With Deep Links For Booking?

So over the last few weeks I was playing around with the Duffel API and Amadeus for flight booking. This is just for a random idea I had, and while they work fine, to actually build it I would need to build the entire flow for booking, fetching, managing, checking in, payment, support, etc. Basically, it’s several months’ worth of work for something that might not even work at all.

So I came across this Expedia documentation, which lets you build a link for searching flights, and then you get redirected to their website for booking and whatnot. I would love to have something like this, but in API format, because this only works if you actually open the website and browse the flights manually. Is there any such API?

submitted by /u/randomseller

VC Investor Email Lists Shutting Down Jan 26

If you’re fundraising, this is the last window to access VC emails + LinkedIn.
All datasets go offline after 26 Jan.

https://projectstartups.com

submitted by /u/project_startups

Static Malware Analysis Dataset For University AI Project

Hi! I’m looking for a dataset for static malware analysis that contains only information about features common in malware; it should not include executables or other files that could infect my system. I’m really new to this whole ML thing, and I would really appreciate it if anyone could help me.

submitted by /u/shitty_psychopath

America Isn’t Exceptional — It’s The Exception

submitted by /u/ashendruk

[Resource] Advanced Prompt For Generating Messy Datasets – Perfect For Practicing ETL & Data Cleaning Skills

submitted by /u/dataexec

Here’s A Dataset Of The Ratings Of All 7,072 Movies On IMDb With Over 25,000 Votes

Date of data: 12 January, 2026

Data: All 7,072 movies with over 25,000 votes (that’s the current vote threshold for the IMDb Top 250.)

Instructions: Download the .txt file, rename it to a .csv file, and you can open it in a spreadsheet program and play around with the figures.

Dropbox link.

(Note: you don’t need to sign in to Dropbox to download it. There’s a bypass button at the bottom of the screen.)

A list of the tab-separated columns:

Title
IMDb code
Year
1 ratings
2 ratings
3 ratings
4 ratings
5 ratings
6 ratings
7 ratings
8 ratings
9 ratings
10 ratings
Total number of ratings
Weighted Mean [the IMDb rating that is published on the website]
Arithmetic Mean [the unweighted IMDb rating calculated from the raw totals]
Difference of Means [the difference between the previous two columns]
Standard Deviation
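
For anyone who wants to check the derived columns, the arithmetic mean and standard deviation can be recomputed from the ten rating counts in a row. The sketch below is a guess at the calculation, not the poster’s own script: the function name `rating_stats` is made up, and it assumes the population standard deviation (dividing by the total number of votes), which the column list does not specify.

```python
import math


def rating_stats(counts):
    """counts[0] is the number of 1-star votes, counts[9] the 10-star votes.

    Returns (total votes, arithmetic mean, population standard deviation).
    """
    total = sum(counts)
    mean = sum(score * n for score, n in enumerate(counts, start=1)) / total
    variance = sum(
        n * (score - mean) ** 2 for score, n in enumerate(counts, start=1)
    ) / total
    return total, mean, math.sqrt(variance)


# Made-up counts: ten votes of 8 and ten votes of 10.
total, mean, stdev = rating_stats([0, 0, 0, 0, 0, 0, 0, 10, 0, 10])
print(total, mean, stdev)
```

If the recomputed standard deviation is slightly larger than the published column for low-vote titles, the file may be using the sample formula instead; at 25,000+ votes the two are practically identical.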

submitted by /u/RunDNA

Looking For VIN-based Pre-check / Decoder + Specs + Commercial Use + Recalls (Europe / Worldwide)

submitted by /u/cauchyez

Beta Testers Wanted: API For Fair-value Arb

submitted by /u/the_cryptory_1313

Need Dataset For A Personal Poker Project

Hi guys, I’m planning to work on a poker project, and I want to build a model that predicts and makes betting decisions for poker. I just want help finding a suitable dataset for this project. (I’m new to this stuff, and it’s my first proper project 🙏)

submitted by /u/Flyawayistaken_
