Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Collecting News Headlines From The Last 2 Years

Hey Everyone,

So we are working on our Master’s thesis and need to collect news headlines from the Scandinavian market. More precisely: news headlines from Norway, Denmark, and Sweden. We have never tried web scraping before, but we are up for the challenge. Does anyone know the easiest way to gather this data? Is it possible to find it online without doing our own web scraping?
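For what it’s worth, many outlets publish RSS feeds, which sidesteps HTML scraping entirely. A minimal sketch using only Python’s standard library (the feed structure below is an inline placeholder; the real feed URLs are something you’d have to look up per outlet):

```python
import xml.etree.ElementTree as ET

def headlines_from_rss(xml_text: str) -> list[str]:
    """Extract <item><title> texts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title", default="").strip()
            for item in root.iter("item")]

# Inline sample; in practice you would fetch a real feed with
# urllib.request.urlopen(feed_url).read() on a schedule.
sample = """<rss version="2.0"><channel>
<item><title>Headline one</title></item>
<item><title>Headline two</title></item>
</channel></rss>"""
print(headlines_from_rss(sample))
```

Polling feeds like this daily and storing titles with timestamps would build up exactly the kind of two-year headline corpus described, without per-site scrapers.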

submitted by /u/hiddenman12345
[link] [comments]

Offering Free Jobs Dataset Covering Thousands Of Companies, 1 Million+ Active/expired Job Postings Over Last 1 Year

Hi all, I run a job search engine (Meterwork) that I built from the ground up and over the last year I’ve scraped jobs data almost daily directly from the career pages of thousands of companies. My db has well over a million active and expired jobs.

I feel like there’s a lot of potential to create some cool data visualizations, so I was wondering if anyone was interested in the data I have. My only request would be to cite my website if you plan on publishing any blog posts or infographics using the data I share.

I’ve tried creating some tools using the data I have (job duration estimator, job openings tracker, salary tool – links in footer of the website) but I think there’s a lot more potential for interesting use of the data.

So if you have any ideas you’d like to use the data for just let me know and I can figure out how to get it to you.

submitted by /u/jjzwork

Steam Dataset 2025 – 263K Games With Multi-modal Database Architecture (PostgreSQL + Pgvector)

I’ve been working on a modernized Steam dataset that goes beyond the typical CSV-dump approach. This is my third data science project, and the first serious one I’ve published on Zenodo. I’m a systems engineer, so I take a somewhat different approach and provide extensive documentation.

Would love a star on the repo if you’re so inclined or get use from it! https://github.com/vintagedon/steam-dataset-2025

After collecting data on 263,890 applications from Steam’s official API (including games, DLC, software, and tools), I built a multi-modal database system designed for actual data science workflows. I built it both as an exercise, as a way to ‘show my work’, and as preparation for my own paper on the dataset.

What makes this different:

Architecture-first approach: Instead of flat CSV files, this uses PostgreSQL 16 for normalized relational data, Neo4j for publisher/developer relationship graphs, and pgvector for semantic search on game descriptions. The goal was to make it analytically native from the start.

Comprehensive coverage: 263K applications compared to the 27K in the popular 2019 Kaggle dataset. Includes rich HTML descriptions with embedded media, international pricing, detailed metadata, and Steam’s full application catalog as of January 2025.

Semantic search ready: Game descriptions are vectorized using sentence-transformers, enabling queries like “find games similar to Baldur’s Gate 3” based on actual content similarity rather than just tags.
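Under the hood, that kind of query reduces to cosine similarity between the query embedding and the stored description embeddings. A toy pure-Python sketch with made-up 3-d vectors (real sentence-transformers embeddings are 384-d or larger, and the game names here are placeholders):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings"; real ones come from a sentence-transformers model.
catalog = {
    "CRPG A": [0.9, 0.1, 0.0],
    "Racing game": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for the embedded query text
best = max(catalog, key=lambda name: cosine_sim(query, catalog[name]))
print(best)
```

In Postgres the equivalent ranking is pgvector’s cosine-distance operator, e.g. `ORDER BY embedding <=> :query_vec LIMIT 10`.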

Use cases:

  • NLP projects using game descriptions (avg 270 words)
  • Price prediction models with international market data
  • Semantic search and recommendation systems
  • Time-series analysis of gaming trends

Data quality notes:

  • ~56% API success rate (Steam delists games, regional restrictions, content type diversity)
  • Conservative rate limiting (1.5s delays) for sustainable collection
  • All data from the official Steam Web API only (no third-party scrapers)
  • Comprehensive error handling and retry logic

The dataset is fully documented with setup guides, analysis examples, and architectural decision rationale. Built using Python 3.12+, all collection and processing code included.

Repository: https://github.com/vintagedon/steam-dataset-2025

Zenodo Release: https://zenodo.org/records/17266923

Quick stats:

  • 263,890 total applications
  • ~150K successful detailed records
  • International pricing across 40+ currencies
  • 50+ metadata fields per game
  • Vector embeddings for 100K+ descriptions

This is an active project – still refining collection strategies and adding analytical examples. Open to feedback on what analysis would be most useful to include.

Technical stack: Python, PostgreSQL 16, Neo4j, pgvector, sentence-transformers, official Steam Web API

submitted by /u/vintagedon

Here’s A Relational DB Of All Space Biology Papers Since 2010 (with Author Links, Text & More)

I just compiled every space biology publication from 2010–2025 into a clean SQLite dataset (with full text, authors, and author–publication links). 📂 Download the dataset on Kaggle 💻 See the code on GitHub

Here are some highlights 👇

🔬 Top 5 Most Prolific Authors

Name Publications
Kasthuri Venkateswaran 54
Christopher E Mason 49
Afshin Beheshti 29
Sylvain V Costes 29
Nitin K Singh 24

👉 Kasthuri Venkateswaran and Christopher Mason are by far the most prolific contributors to space biology in the last 15 years.
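With an author–publication link table, a ranking like the one above is a single GROUP BY. A self-contained sketch against a toy in-memory database (the table and column names are my assumptions; check the actual schema on Kaggle):

```python
import sqlite3

# Toy schema mirroring the described structure: authors, publications,
# and an author-publication link table (names are assumed, not confirmed).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE authors(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE author_publication(author_id INTEGER, publication_id INTEGER);
INSERT INTO authors VALUES (1, 'Author A'), (2, 'Author B');
INSERT INTO author_publication VALUES (1, 10), (1, 11), (1, 12), (2, 10);
""")

# Most prolific authors: count link rows per author.
rows = con.execute("""
    SELECT a.name, COUNT(*) AS n_pubs
    FROM authors a
    JOIN author_publication ap ON ap.author_id = a.id
    GROUP BY a.id
    ORDER BY n_pubs DESC
    LIMIT 5
""").fetchall()
print(rows)
```

The same query against the real SQLite file (swapping `:memory:` for the dataset path) should reproduce the table above.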

👥 Top 5 Publications with the Most Authors

Title Author Count
The Space Omics and Medical Atlas (SOMA) and international consortium to advance space biology 109
Cosmic kidney disease: an integrated pan-omic, multi-organ, and multi-species view 105
Molecular and physiologic changes in the Spaceflight-Associated Neuro-ocular Syndrome 59
Single-cell multi-ome and immune profiles of the International Space Station crew 50
NASA GeneLab RNA-Seq Consensus Pipeline: Standardization for spaceflight biology 45

👉 The SOMA paper had 109 authors, a clear example of how massive collaborations in space biology research have become.

📈 Publications per Year

Year Publications
2010 9
2011 16
2012 13
2013 20
2014 30
2015 35
2016 28
2017 36
2018 43
2019 33
2020 57
2021 56
2022 56
2023 51
2024 66
2025 23

👉 Notice the surge after 2020, likely tied to Artemis missions, renewed ISS research, and a broader push in space health.

Disclaimer: This dataset was authored by me. Feedback is very welcome! 📂 Dataset on Kaggle 💻 Code on GitHub

submitted by /u/union4breakfast

Open-source Bluesky Social Activity Monitoring Pipeline!

The AT Protocol from 🦋 Bluesky Social is an open-source networking paradigm made for social app builders. More information here: https://docs.bsky.app/docs/advanced-guides/atproto

The OSS community has shipped a great 🐍 Python SDK with a data firehose endpoint, documented here: https://atproto.blue/en/latest/atproto_firehose/index.html

🧠 MOSTLY AI users can now access this streaming endpoint whilst chatting with the MOSTLY AI Assistant! Check out the public dataset here: https://app.mostly.ai/d/datasets/9e915b64-93fe-48c9-9e5c-636dea5b377e

This is a great tool to monitor and analyze social media and track virality trends as they are happening!

Check out the analysis the Assistant built for me here: https://app.mostly.ai/public/artifacts/c3eb4794-9de4-4794-8a85-b3f2ab717a13

Disclosure: MOSTLY AI Affiliate

submitted by /u/SyllabubNo626

Created A Real Time Signal Dashboard That Pulls Trade Signals From Top Tier Eth Traders. Looking For People Who Enjoy Coding, Ai, And Trading.

Over the last 3+ years, I’ve been quietly building a full data pipeline that connects to my archive Ethereum node.
It pulls every transaction on Ethereum mainnet, finds the balance change for every trader at the transaction level (not just the end-of-block balance), and determines whether they bought or sold.

From there, it runs trade cycles using FIFO (first in, first out) to calculate each trader’s ROI, Sharpe ratio, profit, win rate, and more.
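For anyone curious, FIFO trade-cycle accounting matches each sell against the oldest remaining buy lots. A minimal sketch with made-up numbers (single asset, no fees or gas, which the real pipeline would have to account for):

```python
from collections import deque

def fifo_profit(trades):
    """trades: list of (side, qty, price). Returns realized profit."""
    lots = deque()          # open buy lots as [remaining_qty, price]
    profit = 0.0
    for side, qty, price in trades:
        if side == "buy":
            lots.append([qty, price])
            continue
        # Sell: consume the oldest lots first (first in, first out).
        while qty > 1e-12:
            lot = lots[0]
            take = min(qty, lot[0])
            profit += take * (price - lot[1])
            lot[0] -= take
            qty -= take
            if lot[0] <= 1e-12:
                lots.popleft()
    return profit

trades = [("buy", 2, 100.0), ("buy", 1, 120.0), ("sell", 2.5, 150.0)]
print(fifo_profit(trades))  # 2 @ (150-100) + 0.5 @ (150-120) = 115.0
```

ROI and win rate then fall out of grouping these realized cycles per wallet.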

After building everything on historical data, I optimized it to now run on live data — it scores and ranks every trader who has made at least 5 buys and 5 sells in the last 11 months.

After filtering by all these metrics and finding the best of the best out of 500k+ wallets, my system surfaced around 1,900 traders truly worth following.
The lowest ROI among them is 12%, and anything above that can generate signals.

I’ve also finished the website and dashboard, all connected to my PostgreSQL database.
The platform includes ranked lists: Ultra Elites, Elites, Whales, and Growth traders — filtering through 30 million+ wallets to surface just those 1,900 across 4 refined tiers.

If you’d like to become a beta tester and you have trading or Python/coding experience, I’d love your help finding bugs and giving feedback.
I’ve opened 25 seats for the general public; if you message me directly, I won’t charge you for access. I’m simply looking for like-minded, skilled testers who want to experiment with automated execution through the API I built.

submitted by /u/Wrong_Wrongdoer_6455

Looking To Interview People Who’ve Worked On Audio Labeling For ML (PhD Research Project)

Hi everyone, I’m a PhD candidate in Communication researching modern sound technologies. My dissertation is a cultural history of audio datasets used in machine learning: I’m interested in how sound is conceptualized, categorized, and organized within computational systems.

I’m currently looking to speak with people who have done audio labeling or annotation work for ML projects (academic, industry, or open-source). These interviews are part of an oral history component of my research. Specifically, I’d love to hear about:

  • how particular sound categories were developed or negotiated,
  • how disagreements around classification were handled, and
  • how teams decided what counted as a “good” or “usable” data point.

If you’ve been involved in building, maintaining, or labeling sound datasets – from environmental sounds to event ontologies – I’d be very grateful to talk. Conversations are confidential, and I can share more details about the project and consent process if you’re interested. You can DM me here.

Thanks so much for your time and for all the work that goes into shaping this fascinating field.

submitted by /u/heyheymymy621

Looking For Public Datasets On Consumer Search Behavior & Conversational Search (for Academic Research)

Hi everyone,

I’m currently conducting a research project comparing traditional search engines (e.g., Google) and LLM-based conversational search tools (e.g., ChatGPT, Perplexity.ai) in the context of consumer search behavior — specifically, how users search for and choose products like smartphones when factors such as price and features moderate their decisions. I intend to run a controlled experiment collecting the search behavior of approximately 100 participants to provide causal evidence, but I still want to validate those insights against external datasets or benchmarks.

I’m looking for publicly available datasets that capture one or more of the following aspects:

  • Users’ background, including age, gender, education, employment, nationality, residence, and prior knowledge of AI and shopping-related tools.
  • Search behavior logs (queries, clicks, scrolls, or multi-turn interactions).
  • Conversational or query reformulation datasets → datasets where users ask follow-up questions or clarify queries.
  • Consumer choice or e-commerce data (based on price or features).
  • User attitude or satisfaction survey data (e.g., perceived trust, relevance, ease of use, usefulness, overload, decision confidence, and handling contradictory information).

Also open to:

  • Suggestions for benchmark datasets used in Conversational Search or Retrieval-Augmented Generation (RAG) evaluations
  • References to recent arXiv or TREC publications releasing such data

If anyone here knows of datasets that bridge search interactions — or newer LLM-integrated conversational search datasets — I’d really appreciate your input. Thanks in advance!

submitted by /u/Dismal_Priority_2381

[REQUEST] Looking For Sample Bank Statements To Improve Document Parsing

We’re working on a tool that converts financial PDFs into structured data.

To make it more reliable, we need a diverse set of sample bank statements from different banks and countries — both text-based and scanned.
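For text-based PDFs, the hard part is usually turning extracted text lines into structured transactions. A regex sketch (this line format is an assumption for illustration; real statement layouts vary wildly by bank and country, which is exactly why diverse samples matter):

```python
import re

# Assumed layout: "DD/MM/YYYY  DESCRIPTION  AMOUNT" per transaction line.
LINE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<desc>.+?)\s+(?P<amount>-?[\d,]+\.\d{2})$"
)

def parse_line(line: str):
    """Return (date, description, amount) or None if the line doesn't match."""
    m = LINE.match(line.strip())
    if not m:
        return None
    return (m["date"], m["desc"], float(m["amount"].replace(",", "")))

print(parse_line("03/04/2024  ACME PAYROLL DEPOSIT   2,450.00"))
```

Scanned statements would need an OCR pass (e.g. Tesseract) before any of this applies.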

We’re not looking for any personal data.

If you know open sources, educational datasets, or demo files from banks, please share them. We’d also be happy to pay up to $100 for a well-organized collection (50–100 unique PDFs with metadata such as country, bank name, and number of pages).

We’re especially interested in layouts from the United States, Canada, United Kingdom, Australia, New Zealand, Singapore, and France.

The goal isn’t to mine data — it’s to make document parsing smarter, faster, and more accessible.

If you have leads or want to collaborate on building this dataset, please comment or DM me.

submitted by /u/mercuretony

Grantor Datasets For Nonprofit Analysis Project (Massachusetts)

I’m volunteering at a local nonprofit and trying to find data to run an analysis on grantors in Massachusetts. Right now, the best workflow I’ve got is scraping 990-PF filings from Candid (base tier) and copying them into Excel, and even that is limited.

Ideally, the dataset would include info on grantors’ interests, location, income, etc., so I can match them to this nonprofit based on their likelihood to donate to specific causes. I was thinking a market basket analysis might work?

Hoping this could also go in my portfolio for my job search. Anyone have ideas on sources or workflows that might help (ideally free, since it’s unpaid and I’m job hunting)?

submitted by /u/A-Garden-Hoe

Looking For An API That Can Return VAT Numbers Or Official Business IDs To Speed Up Vendor Onboarding

Hey everyone,

I’m trying to find a company enrichment API that can give us a company’s VAT number or official business/registry ID (like their company registration number).

We’re building a workflow to automate vendor onboarding and B2B invoicing, and these IDs are usually the missing piece that slows everything down. Currently, we can extract names, domains, addresses, and other information from our existing data source; however, we still need to manually look up VAT or registry information for compliance purposes.

Ideally, the API could take a company name and country (or domain) and return the VAT ID or official registry number if it’s publicly available. Global coverage would be ideal, but coverage in the EU and the US is sufficient to start.

We’ve reviewed a few major providers, such as Coresignal, but they don’t appear to include VAT or registration IDs in their responses. Before we start testing enterprise options like Creditsafe or D&B, I figured I’d ask here:

Has anyone used an enrichment or KYB-style API that reliably returns VAT or registry IDs? Any recommendations or experiences would be awesome.

Thanks!

submitted by /u/mladenmacanovic

Multilingual Wiki Dataset Sample (5 Languages, 500 Rows) [self-promotion]

I’ve been building a multilingual wiki-style dataset and put together a free sample on Zenodo.

It’s 500 structured entries across five languages with stable IDs, ISO codes, titles, and short text fields.

The idea is to make something researchers and hobbyists can actually use for cross-language analysis or NLP.

For those that are curious, the dataset is permanently archived here: https://doi.org/10.5281/zenodo.17253688

I’d really like feedback on whether this structure feels useful for projects in your workflow!

submitted by /u/uricavelar

Scout Stars: Football Manager 2023 Player Data – 89k Players With 80+ Attributes For Analytics & ML

I’ve created and uploaded a comprehensive dataset from Football Manager 2023 (FM23), featuring stats for nearly 89,000 virtual players across global leagues. This includes attributes like Pace, Dribbling, Finishing, Transfer Value, Injury Proneness, Leadership, and more—over 70 columns in total. It’s cleaned, merged via Python/pandas, and covers everything from youth prospects to veterans in leagues from the Premier League to lower divisions in Argentina, Asia, Africa, and beyond.
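A quick filtering sketch using only the standard library (the column names below are assumptions for illustration; check the dataset’s actual header, and pandas makes this a one-liner):

```python
import csv
import io

# Inline sample standing in for the real FM23 CSV (columns are assumed).
raw = """Name,Age,Position,Pace,Finishing
Young Prospect,18,ST,16,15
Veteran,33,ST,11,17
Slow Youngster,19,ST,9,8
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Scouting filter: young players with elite pace.
prospects = [r["Name"] for r in rows
             if int(r["Age"]) <= 21 and int(r["Pace"]) >= 14]
print(prospects)
```

The same predicate style scales to any of the 70+ attribute columns for ML feature selection.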

submitted by /u/Mental-Flight8195

Multi-Language SMS Dataset For An Application But I Can’t Find It

I’m looking for a multilingual SMS dataset for an application, but I can’t find one

Hello, as mentioned in the title, I’m looking for an SMS dataset. I found a few, but they have the following issues:

Critical Issues:

Class Imbalance – Ham: 4,825 (86.59%) | Spam: 747 (13.41%) → ratio 6.46:1

~440 duplicates in each language (7.5-8%)

🟡 Medium-Level Issues:

Weak Hindi translation – Mixed characters, poor transcription

Wide length distribution – Especially in Hindi (max: 1406!)

Very short messages – Especially in Hindi (95 instances)

How can I find datasets without these issues?
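Two of the issues listed (duplicates and class imbalance) are usually handled in preprocessing rather than by finding a perfect dataset. A minimal sketch on a toy corpus:

```python
import random

random.seed(0)

# Toy stand-in for an SMS corpus: (text, label) pairs with duplicates
# and a ham-heavy imbalance.
data = [
    ("win a prize", "spam"), ("win a prize", "spam"), ("free entry", "spam"),
    ("see you at 5", "ham"), ("ok", "ham"), ("lunch?", "ham"), ("ok", "ham"),
]

# 1) Drop exact duplicates while preserving order.
deduped = list(dict.fromkeys(data))

# 2) Downsample the majority class to the minority-class count.
ham = [d for d in deduped if d[1] == "ham"]
spam = [d for d in deduped if d[1] == "spam"]
k = min(len(ham), len(spam))
balanced = random.sample(ham, k) + random.sample(spam, k)
print(len(deduped), len(balanced))
```

Class weights or oversampling (e.g. SMOTE) are alternatives to downsampling when you can’t afford to throw data away; the translation-quality issues, though, really do require a better-sourced dataset.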

submitted by /u/Extension-Onion2310

How Can I Collect Updated Trail Closure And Reopening Info At Multiple Levels (national, State, Local)?

Hi all,
I’m working on a personal side project and want to build something that keeps track of trail closures and reopenings in the U.S. The goal is to have the most up-to-date look at what’s closing or opening in parks.

I’d like to cover:

  • National parks
  • State parks
  • County and city parks

I’m not sure the best way to approach this. Some questions I have:

  1. Are there existing APIs or open datasets that already track this info?
  2. If not, what would be the best way to scrape government/park websites that post closure announcements?
  3. For anyone who’s done similar projects: how do you handle the fact that every agency posts things in different formats?

Any tools, techniques, or data source suggestions would be super helpful. Thanks!
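For national parks specifically, the NPS developer API (developer.nps.gov) publishes an alerts feed that includes closures. A sketch that filters an alerts-shaped payload (the field names are my recollection of the public docs, so treat the exact shape as an assumption and verify against the real response):

```python
import json

# Sample shaped like an NPS /alerts response (structure assumed).
payload = json.loads("""{
  "data": [
    {"parkCode": "yose", "category": "Park Closure",
     "title": "Trail closed due to rockfall"},
    {"parkCode": "yose", "category": "Information",
     "title": "Shuttle schedule change"}
  ]
}""")

# Keep only closure-type alerts; state/county parks would each need
# their own adapter producing this same (park, title) shape.
closures = [(a["parkCode"], a["title"])
            for a in payload["data"] if "Closure" in a["category"]]
print(closures)
```

Normalizing every agency’s format into one small common record like this is the usual answer to question 3: one adapter per source, one shared schema downstream.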

submitted by /u/Few-Performance-1875

I Am Looking For A Dataset Of Datasets That Have Been Bought And Sold In My Attempt To Value Different Characteristics Of Data.

As the title says, I am trying to find a historical record of datasets that have been bought. Ideally, this dataset of datasets would include a transaction price and the list of variables that were included in the sold dataset.

I am hoping to learn something about how different characteristics of data are valued. However, I cannot seem to find any dataset (of datasets) out there that aligns with what I am searching for. Any help would be greatly appreciated!

submitted by /u/Head-Problem-1385

UAE Real Estate API – 500K+ Properties From PropertyFinder.ae


Overview

I’ve found a comprehensive REST API providing access to 500,000+ UAE real estate listings scraped from PropertyFinder.ae. This includes properties, agents, brokers, and contact information across Dubai, Abu Dhabi, Sharjah, and all UAE emirates.

📊 Dataset Details

Properties: 500K+ listings with full details

  • Apartments, villas, townhouses, commercial spaces
  • Prices, sizes, bedrooms, bathrooms, amenities
  • Listing dates, reference numbers, images
  • Location data with coordinates

Agents: 10K+ real estate agents

  • Contact information (phone, email, WhatsApp)
  • Broker affiliations
  • Super agent status
  • Social media profiles

Brokers: 1K+ real estate companies

  • Company details and contact info
  • Agent teams and property portfolios
  • Logos and addresses

Locations: Complete UAE location hierarchy

  • Emirates, cities, communities, sub-communities
  • GPS coordinates and area classifications

🚀 API Features

12 REST Endpoints covering:

  • Property search with advanced filtering
  • Agent and broker lookups
  • Property recommendations (similar properties)
  • Contact information extraction
  • Relationship mapping (agent → properties, broker → agents)

📈 Use Cases

PropTech Developers:

import requests

# Get luxury apartments in Dubai Marina
response = requests.get(
    "https://api-host.com/properties",
    params={
        "location_name": "Dubai Marina",
        "property_type": "Apartment",
        "price_from": 1000000,
    },
    headers={"x-rapidapi-key": "your-key"},
)

Market Researchers:

  • Price trend analysis by location
  • Agent performance metrics
  • Broker market share analysis
  • Property type distribution

Real Estate Apps:

  • Property listing platforms
  • Agent finder tools
  • Investment analysis dashboards
  • Lead generation systems

🔗 Access

RapidAPI Hub: Search “UAE Real Estate API”
Documentation: Complete guides with code examples
Free Tier: 500 requests to test the data quality.
Link : https://rapidapi.com/market-data-point1-market-data-point-default/api/uae-real-estate-api-propertyfinder-ae-data

📋 Sample Response

{
  "data": [
    {
      "property_id": "14879458",
      "title": "Luxury 2BR Apartment in Dubai Marina",
      "listing_category": "Buy",
      "property_type": "Apartment",
      "price": "1160000.00",
      "currency": "AED",
      "bedrooms": "2",
      "bathrooms": "2",
      "size": "1007.00",
      "agent": {
        "agent_id": "7352356683",
        "name": "Asif Kamal",
        "is_super_agent": true
      },
      "location": {
        "name": "Dubai Marina",
        "full_name": "Dubai Marina, Dubai"
      }
    }
  ],
  "pagination": {
    "total": 15420,
    "limit": 50,
    "has_next": true
  }
}

🎯 Why This Dataset?

  • Most Complete: Includes agent contacts (unique!)
  • Fresh Data: Updated daily from PropertyFinder.ae
  • Production Ready: Professional caching & performance
  • Developer Friendly: RESTful with comprehensive docs
  • Scalable: From hobby projects to enterprise apps

Perfect for anyone building UAE real estate applications, conducting market research, or needing comprehensive property data for analysis.

Questions? Happy to help with integration or discuss specific use cases!

Data sourced from PropertyFinder.ae – UAE’s leading property portal

submitted by /u/Comfortable-Ad-6686

Dataset: AI Use Cases Library V1.0 (2,260 Curated Cases)

Hi all.

I’ve released an open dataset of 2,260 curated AI use cases, compiled from vendor case studies and industry reports.

Files:

  • use-cases.csv — final dataset
  • in-review.csv (266) and excluded.csv (690) for transparency
  • Schema and taxonomy documentation

Supporting materials:

  • Trends analysis and vendor comparison
  • Featured case highlights
  • Charts (industries, domains, outcomes, vendors)
  • Starter Jupyter notebook

License: MIT (code), CC-BY 4.0 (datasets/insights)

The dataset is available in this GitHub repo.

Feedback and contributions are welcome.

submitted by /u/abbas_ai

What African Datasets Are Hardest To Find?

Hey all,

I’ve been thinking a lot about how hard it is to get good data on Africa. A lot of things are either behind paywalls, scattered across random sites, or just not collected properly.

I’m curious. what kind of datasets would you like to see but can never seem to find?

Could be anything:

  • local business/market info
  • transport routes
  • historical or cultural records
  • climate or environmental data
  • health, education, housing, etc.

Basically, if you’ve ever thought “why is this data so hard to get??” — I’d love to hear what it was.

submitted by /u/Exciting_Agency4614

[synthetic] [self-promotion] Synthetic Employee Dataset 800k+ Records For Burnout Turnover And Hr Analytics

Hey everyone,

I made a hybrid synthetic employee dataset with over 800,000 records. The dataset is fully synthetic, so there is no personal or sensitive data, but it is generated to match real-world distributions of employee metrics. It includes performance scores, burnout risk, satisfaction scores, tenure, salaries, skill arrays, and 12 behavioral personas. The dataset is available in JSON and Parquet formats for easy use.

you can use it for things like:

  • predicting who might leave a company
  • analyzing burnout hotspots
  • exploring skill gaps across roles and departments
  • practicing machine learning models on realistic hr data

here is the dataset link for anyone who might be interested: https://huggingface.co/datasets/BrotherTony/employee-burnout-turnover-prediction-800k

would love to hear what you think or if you make something cool with it

submitted by /u/AnyCookie10