Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for datasets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it all for? World domination?

IT Ops CMDB/DW With Master Data For Commodity Hardware/Software?

Hi Dataseters

I’ve asked LLMs and scoured GitHub etc. for projects, to no avail. Ideally, I’m after a fact/dimension-style open-source schema model (not unlike the BMC/ServiceNow logical CDM data models), with the hardware and software dimensions pre-populated with typical vendors, makes, and models. Postgres/MariaDB would be ideal, but Oracle etc. is fine too; it’s an easy conversion.

Anyone who has Snow/Flexera/ServiceNow might have built such a skeleton frame, with custom tables for midrange/networking, UNSPSC codes, etc.

Sure, I could subscribe to one of the big ITSM vendors, but ideally I’d just fork something the community has already built and then ETL/ELT the facts in for our own use. Going full DIY feels like reinventing the wheel; I’m sure many of you have already built this…
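For anyone sketching such a skeleton themselves, a minimal fact/dimension layout might look like the following. The table names, columns, and sample UNSPSC code are illustrative assumptions, not any vendor's CDM; SQLite is used here for a self-contained demo, but the DDL ports to Postgres/MariaDB with minor type changes.

```python
import sqlite3

# Illustrative star-schema skeleton: two dimensions plus one fact table.
ddl = """
CREATE TABLE dim_hardware_model (
    hw_model_id   INTEGER PRIMARY KEY,
    vendor        TEXT NOT NULL,      -- e.g. 'Dell', 'HPE', 'Cisco'
    model         TEXT NOT NULL,      -- e.g. 'PowerEdge R750'
    category      TEXT,               -- server / switch / storage ...
    unspsc_code   TEXT                -- UNSPSC commodity code
);
CREATE TABLE dim_software_product (
    sw_product_id INTEGER PRIMARY KEY,
    publisher     TEXT NOT NULL,
    product       TEXT NOT NULL,
    edition       TEXT
);
CREATE TABLE fact_installed_ci (
    ci_id         INTEGER PRIMARY KEY,
    hw_model_id   INTEGER REFERENCES dim_hardware_model(hw_model_id),
    sw_product_id INTEGER REFERENCES dim_software_product(sw_product_id),
    hostname      TEXT,
    discovered_at TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
conn.execute(
    "INSERT INTO dim_hardware_model VALUES (1, 'Dell', 'PowerEdge R750', 'server', '43211501')"
)
rows = conn.execute("SELECT vendor, model FROM dim_hardware_model").fetchall()
print(rows)  # [('Dell', 'PowerEdge R750')]
```

The community pre-population the post asks for would then be a matter of seeding the two dimension tables.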

It’s a shot in the dark, but I’m just seeing if anyone has come across useful projects.

thanks in advance

submitted by /u/Laymans_Perspective

Will Pay For Datasets That Contain Unredacted PDFs Of Purchase Orders, Invoices, And Supplier Contracts/Agreements (for Goods Not Services)

Hi r/datasets ,

I’m looking for datasets, either paid or unpaid, to create a benchmark for a specialised extraction pipeline.

Criteria:

  • Recent (last ten years ideally)
  • PDFs (don’t need to be tidy)
  • Not redacted (as much as possible)

Document types:

  • Supplier contracts (for goods not services)
  • Invoices (for goods not services)
  • Purchase Orders (for goods not services)

I’ve already seen Atticus and the UCSF Industry Documents Library (which is the origin of Adam Harley’s dataset). I’ve seen a few posts below, but they aren’t what I’m looking for. I’m honestly happy to pay for the information and the datasets; DM me if you want to strike a deal.

submitted by /u/phililisaveslives

Looking For Football Penalty Shootout Videos (Rear Camera Angle)

Hey everyone! I’m working on a university project where we’re trying to predict the direction of football penalty kicks based on the shooter’s body movement. To do that, we’re using pose estimation and machine learning on real-world footage.
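As a rough illustration of the pose-based idea, here is a toy feature one might compute from two hip keypoints in a rear-view frame. The coordinates, threshold, and decision rule are made-up assumptions for the sketch; a real pipeline would get keypoints from something like MediaPipe or OpenPose and learn the decision with a trained classifier.

```python
import math

def hip_line_angle(left_hip, right_hip):
    """Angle of the hip line in degrees; 0 = square to the camera.
    Points are (x, y) in image coordinates, x growing rightward."""
    dx = right_hip[0] - left_hip[0]
    dy = right_hip[1] - left_hip[1]
    return math.degrees(math.atan2(dy, dx))

def predict_side(left_hip, right_hip, threshold_deg=5.0):
    """Crude rule-of-thumb 'classifier' standing in for a learned model."""
    angle = hip_line_angle(left_hip, right_hip)
    if angle > threshold_deg:
        return "left"
    if angle < -threshold_deg:
        return "right"
    return "center"

# Hips rotated so the right hip sits lower in the frame: one-sided cue.
print(predict_side((100, 200), (160, 215)))  # left
```

In practice you would extract many such geometric features per frame (shoulders, plant foot, hip rotation over time) and feed them to the ML model.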

Right now, I’m building a dataset of penalty shootouts — but I specifically need videos where the camera is placed behind the player, like the rear broadcast angle you usually see in World Cup coverage.

I already have all the penalty shootouts from the 2022 World Cup, but I’d love to collect more of this kind — from other tournaments or even club games. If you remember any videos (on YouTube or elsewhere) with that camera angle, please drop them here 🙏

Thanks in advance — you’d be helping a lot!

submitted by /u/tiagonob

Sharing A Demo Of My Tool For Easy Handwritten Fine-Tuning Dataset Creation!

Hello! I wanted to share a tool I created for making handwritten fine-tuning datasets. I originally built this for myself when I couldn’t find conversational datasets formatted the way I needed while fine-tuning Llama 3 for the first time, and hand-typing JSON files seemed like some sort of torture, so I built a simple little UI that auto-formats everything for me.

I originally built this back when I was a beginner, so it’s very easy to use with no prior dataset creation/formatting experience, but it also has a bunch of added features I believe more experienced devs will appreciate!

I have expanded it to support:
– many formats: ChatML/ChatGPT, Alpaca, and ShareGPT/Vicuna
– multi-turn dataset creation, not just pair-based
– token counting for various models
– custom fields (instructions, system messages, custom IDs)
– auto-saves, with every format type written at once
– formats like Alpaca need no additional data besides input and output; default instructions are auto-applied (customizable)
– a goal-tracking bar
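For anyone unfamiliar with these formats, here is a rough sketch of what the auto-formatting step might produce for one conversation. The field names follow the common conventions of ChatML-style messages, Alpaca, and ShareGPT; this is not the tool's actual code.

```python
import json

def to_formats(system, turns):
    """Render one multi-turn conversation in three common formats.
    `turns` is a list of (user, assistant) string pairs."""
    chatml = [{"role": "system", "content": system}]
    sharegpt = {"conversations": []}
    for user, assistant in turns:
        chatml += [{"role": "user", "content": user},
                   {"role": "assistant", "content": assistant}]
        sharegpt["conversations"] += [{"from": "human", "value": user},
                                      {"from": "gpt", "value": assistant}]
    # Alpaca is pair-based, so only the first turn maps cleanly onto it.
    alpaca = {"instruction": system,
              "input": turns[0][0],
              "output": turns[0][1]}
    return {"chatml": chatml, "alpaca": alpaca, "sharegpt": sharegpt}

out = to_formats("You are a helpful assistant.",
                 [("Hi!", "Hello, how can I help?")])
print(json.dumps(out["alpaca"]))
```

Writing all formats at once, as the tool does, is then just dumping each of the three dicts to its own file.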

I know it seems a bit crazy to be hand-typing datasets, but handwritten data is great for customizing your LLMs and keeping them high quality. I wrote a 1k-interaction conversational dataset with this in a month of free time, and the tool made it much more mindless and easy.

I hope you enjoy it! I’ll be adding new formats over time, depending on what becomes popular or is asked for.

Here is the demo to test out on Hugging Face
(this is not the full version; the link to the full version is at the bottom of the page)

submitted by /u/abaris243

Looking For Data About US States For Multivariate Analysis

Hi everyone, apologies if posts like these aren’t allowed.

I’m looking for a dataset that has data for all 50 US states, such as GDP, CPI, population, poverty rate, household income, etc., in order to run a multivariate analysis.

Do you guys know of any that are from reputable reporting sources? I’ve been having trouble finding one that’s perfect to use.

submitted by /u/theabhster

Built A Comprehensive Geo API With Countries, Airports & 140K+ Cities – Feedback Welcome!

**TL;DR:** Built a comprehensive geographic API that combines countries, airports, and cities in one fast endpoint. Looking for feedback from fellow developers!

What I Built
After getting frustrated with having to integrate 3+ different APIs for basic geographic data in my e-commerce projects, I decided to build something better:

**🌍 Geo Data Master API** – One API for all your geographic needs:
– ✅ 249 countries with ISO alpha-2/alpha-3 codes
– ✅ Major airports worldwide with IATA codes & coordinates
– ✅ 140K+ cities from GeoNames with population data
– ✅ Multi-language support with official status
– ✅ Real-time autocomplete for cities and airports

Tech Stack
– Backend: FastAPI (Python) for performance
– Caching: Redis for sub-millisecond responses
– Database: SQLite with optimized queries
– Infrastructure: Docker + NGINX + SSL
– Data Sources: ISO standards + GeoNames
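As a toy illustration of the “SQLite with optimized queries” plus autocomplete combination, here is a self-contained sketch: the sample rows and schema are made up, and the real service presumably indexes far more columns, but the pattern (an index on the name column turning a prefix `LIKE` into a range scan) is the same.

```python
import sqlite3

# Prefix autocomplete over a tiny city table. The index on name lets
# SQLite satisfy LIKE 'ber%' with a range scan rather than a full scan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, country TEXT, population INTEGER)")
conn.execute("CREATE INDEX idx_city_name ON city(name)")
conn.executemany("INSERT INTO city VALUES (?, ?, ?)", [
    ("berlin", "DE", 3_700_000),
    ("bergen", "NO", 290_000),
    ("madrid", "ES", 3_300_000),
])

def autocomplete(prefix, limit=5):
    """Return up to `limit` city names starting with `prefix`,
    biggest cities first."""
    return [r[0] for r in conn.execute(
        "SELECT name FROM city WHERE name LIKE ? || '%' "
        "ORDER BY population DESC LIMIT ?", (prefix.lower(), limit))]

print(autocomplete("ber"))  # ['berlin', 'bergen']
```

A Redis layer in front of queries like this is what gets the response down to sub-millisecond for repeated prefixes.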

Why I Built This
Working on travel projects, I constantly needed:
– Country dropdowns with proper ISO codes
– Airport data for shipping calculations
– City autocomplete for address forms
– Language detection for localization

Instead of juggling REST Countries API + some airport service + city data, now it’s one clean API.

Performance

  • Sub-millisecond response times (Redis caching)
  • 99.9% uptime with monitoring
  • Handles 10k+ requests/minute easily

What I’m Looking For

  1. Feedback on the API design and endpoints
  2. Use cases I might have missed
  3. Feature requests from the community
  4. Beta testers (generous free tier available)

I’ve made it available on RapidAPI – you can test all endpoints instantly without any setup. The free tier includes 500 requests/day which should be plenty for testing and small projects.

Try it out: https://rapidapi.com/omertabib3005/api/geodatamaster

Happy to answer any technical questions about the implementation!

submitted by /u/COVID-20S

I Made A 50k AI-Generated Banking Support Convo Dataset (BankBot50k)

Hey everyone, I’ve been experimenting with building datasets for chatbot training and decided to go all-in on this one for my first product –

🏦 BankBot 50K — a fully AI-generated dataset with 50,000 realistic customer support convos in the banking world.

It covers stuff like:

  • Lost cards / fraud alerts
  • Loan and credit questions
  • Password resets
  • General customer support issues

It’s designed for:

  • Fine-tuning LLMs (chatbots or assistants)
  • NLP projects
  • Intent classification
  • Prototyping AI customer service flows

Formats: JSON + CSV
Includes: user + agent turns, labeled topics, clean structure
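For a sense of how a dataset like this might be consumed, here is a sketch that flattens one labeled conversation into an intent-classification example. The record shape below is my assumption for illustration, not the seller's published schema.

```python
import json

# Assumed record shape: one labeled multi-turn conversation per record.
sample = json.loads("""
{"topic": "lost_card",
 "turns": [{"speaker": "user",  "text": "I lost my debit card."},
           {"speaker": "agent", "text": "I can block it right away."}]}
""")

def to_intent_example(record):
    """Pair the first user utterance with the topic label, yielding a
    (text, label) example for an intent classifier."""
    first_user = next(t["text"] for t in record["turns"]
                      if t["speaker"] == "user")
    return first_user, record["topic"]

print(to_intent_example(sample))
```

Mapping 50k records this way would give a ready-made intent-classification training set alongside the fine-tuning use case.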

If you’re building something with LLMs or just want some synthetic data to play with, grab it. The full 50K version is up for $25 if anyone needs a bigger set: BankBot 50K Gumroad

Open to feedback, questions, or collabs. Hope it helps someone out here 👇

submitted by /u/Jaycevecc

Need Advice For Finding Datasets For Analysis

I have an assessment that requires me to find a dataset from a reputable, open-access source (e.g., Pavlovia, Kaggle, OpenNeuro, GitHub, or a similar public archive) that is suitable for both a t-test and an ANOVA in R. I’ve tried exploring the aforementioned websites, but I’m having trouble finding appropriate datasets (perhaps because I don’t know how to use the sites properly); many of the datasets I’ve found provide only minimal information, with no links to the actual paper (particularly the ones on Kaggle). Does anybody have advice/tips for finding suitable datasets?

submitted by /u/xmishieee

Looking For A Cheap API To Fetch Employees Of A Company (No Chrome Plugins)

Hey everyone,

I’m working on a project to build an automated lead generation workflow, and I’m looking for a cost-effective API that can return a list of employees for a given company (ideally with names, job titles, LinkedIn URLs, etc.).

Important:

I’m not looking for Chrome extensions or tools that require manual interaction. This needs to be fully automated.

Has anyone come across an API (even a lesser-known one) that’s relatively cheap?

Any pointers would be hugely appreciated!

Thanks in advance.

submitted by /u/Key-Ad-4907

Requesting Data For Dataset Creation

Hello everyone ^ I’m working on creating an extensive dataset consisting of labeled memory dumps from all kinds of different videogames and videogame engines. The things I’m labeling are variables for health, ammo, mana, position, rotation, etc., for the purpose of creating a proof of concept for a digital forensics tool capable of finding specific variables reliably and consistently even with dynamic memory allocation and ASLR in place.

This tool will use AI pattern recognition combined with heuristics to do this, and I’m trying to collect as much diverse data as possible to improve accuracy across different games and engines.

I have already collected quite a bit of real data from multiple engines and games, and I’ve also created a tool that generates a lot of synthetic memory dumps in .bin format with .json files that contain the labels, but I realize that I might need some help with gathering more real data to supplement the synthetic data.
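As a rough sketch of the synthetic-dump idea described above (the layout, label fields, and planted values below are illustrative assumptions, not the author's actual tool): plant a few known int32 variables at non-overlapping offsets inside random bytes, write the `.bin`, and emit a `.json` with the labels.

```python
import json, os, random, struct, tempfile

def make_dump(path, size=4096, seed=0):
    """Write <path>.bin with three known int32 variables planted at
    non-overlapping offsets, and <path>.json with their labels."""
    rng = random.Random(seed)
    buf = bytearray(rng.randbytes(size))          # random filler
    slots = rng.sample(range(size // 8), 3)       # 8-byte slots -> no overlap
    labels = []
    for slot, (name, value) in zip(slots, [("health", 100),
                                           ("ammo", 42),
                                           ("mana", 50)]):
        off = slot * 8
        buf[off:off + 4] = struct.pack("<i", value)   # little-endian int32
        labels.append({"name": name, "offset": off,
                       "type": "int32", "value": value})
    with open(path + ".bin", "wb") as f:
        f.write(bytes(buf))
    with open(path + ".json", "w") as f:
        json.dump(labels, f)
    return labels

base = os.path.join(tempfile.mkdtemp(), "dump0")
labels = make_dump(base)
with open(base + ".bin", "rb") as f:
    data = f.read()
off = labels[1]["offset"]                          # the 'ammo' variable
print(struct.unpack("<i", data[off:off + 4])[0])   # 42
```

Real dumps are much messier than this, of course, which is exactly why the synthetic data needs supplementing with real captures.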

My request is therefore as follows: are there any people willing to assist me in creating this dataset?

I understand that commercially available games are intellectual property and that ToS often restrict reversing or otherwise tampering with games, so I’m mostly using sample projects for engines like Unreal Engine and Unity, or open-source projects that allow this.

Please feel free to send me a message or respond to this post if you are interested in helping or have any suggestions or tips for possible videogames I could legally use to gather data from.

submitted by /u/Cannibull33

Working On A Dashboard Tool (Fusedash.ai) — Looking For Feedback, Partners, Or Interesting Datasets

Hey folks,

So I’ve been working on this project for a while called Fusedash.ai — it’s basically a data visualization and dashboard tool, but we’re trying to make it way more flexible and interactive than most existing platforms (think PowerBI or Tableau but with more real-time and AI stuff baked in).

The idea is that people with zero background in data science or viz tools can upload a dataset (CSV, API, public resources, devices, whatever) and immediately get a fully interactive dashboard that they can customize — layout, charts, maps, filters, storytelling, etc. There’s also an AI assistant that helps you explore the data through chat: ask questions, generate summaries, or get recommendations.

We also recently added a kind of “canvas dashboard” feature that lets users interact with visual elements in real time, kind of like you’re working on a live whiteboard, but with your actual data.

It is still in active dev and there’s a lot to polish, but I’m really proud of where it’s heading. Right now, I’m just looking to connect with anyone who:

  • has interesting datasets and wants to test them in Fusedash
  • is building something similar or wants to collaborate
  • has strong thoughts about where modern dashboards/tools are heading

Not trying to pitch or sell here — just putting it out there in case it clicks with someone. Feedback, critique, or just weird ideas very welcome 🙂

Appreciate your input and have a wonderful day!

submitted by /u/DumyTrue

[Dataset Release] YaMBDa: 4.79B Anonymized User Interactions From Yandex Music

Yandex has released YaMBDa, a large-scale open-source dataset comprising 4.79 billion user interactions from Yandex Music, specifically My Wave (its personalized real-time music feed).

The dataset includes plays, likes/dislikes, timestamps, and various track features. All data is anonymized, containing only numeric identifiers. Although sourced from a music platform, YaMBDa is designed for testing recommender algorithms across various domains — not just streaming services.

Recent progress in recommender systems has been hindered by limited access to large datasets that reflect real-world production loads. Well-known sets like LFM-1B, LFM-2B, and MLHD-27B have become unavailable due to licensing restrictions. With close to 5 billion interaction events, YaMBDa has now presumably surpassed the scale of Criteo’s 4B ad dataset.

Dataset details:

  • Sizes available: 50M, 500M, and full 4.79B events
  • Track embeddings: Derived from audio using CNNs
  • Metadata: Includes track duration, album, artist, etc.
  • is_organic flag: Differentiates organic vs. recommended actions
  • Format: Parquet, compatible with Pandas, Polars, and Spark

Access:

This dataset offers a valuable, hands-on resource for researchers and practitioners working on large-scale recommender systems and related fields.
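To illustrate the kind of analysis the `is_organic` flag enables, here is a minimal stand-in computed over hand-made event dicts. The field names follow the post's description; the actual Parquet column set may differ, and real workloads would read the files with pandas, Polars, or Spark rather than plain Python.

```python
# Tiny stand-in for the event stream (schema assumed for illustration).
events = [
    {"uid": 1, "item_id": 10, "event": "play",    "is_organic": True,  "ts": 1000},
    {"uid": 1, "item_id": 11, "event": "like",    "is_organic": False, "ts": 1005},
    {"uid": 2, "item_id": 10, "event": "dislike", "is_organic": False, "ts": 1010},
]

def organic_share(rows):
    """Fraction of events that were organic rather than recommended --
    a basic sanity check when benchmarking recommenders on this data."""
    return sum(r["is_organic"] for r in rows) / len(rows)

print(round(organic_share(events), 2))  # 0.33
```

At the full 4.79B-event scale, the same aggregation would be a one-line group-by in any of the supported Parquet engines.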

submitted by /u/azalio