Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

What Makes An Egocentric Video Dataset Actually Useful For Research?

I’ve been exploring first-person (egocentric) video datasets recently and noticed that dataset size alone doesn’t seem to tell the whole story.

Some datasets have a huge number of videos, while others focus more on annotation quality, action diversity, object interactions, or long temporal sequences.

For those who have worked with action recognition, embodied AI, AR/VR, robotics perception, or related tasks:

* What dataset characteristics matter most to you?
* How important is annotation quality compared to dataset scale?
* Are there any egocentric datasets you keep coming back to for benchmarking?

I’d be interested to hear what people here consider the most useful datasets for real-world experimentation.

submitted by /u/Vane1st

Open-sourcing BIP-39 Display Wordlists In 31 Languages

Hi everyone,

I wanted to share an open-source Bitcoin UX project we just published:

https://github.com/osem23/bip39-wordlists-tzur

It is a set of BIP-39 display wordlists in 31 languages: English plus 30 native-language lists.

The goal is simple: let users back up and restore a BIP-39 recovery phrase in their own language, without changing the cryptographic seed.

The seed of record remains the canonical English BIP-39 mnemonic. PBKDF2 still runs on the English form. The native-language lists are only a display and input layer, index-paired to canonical English, so they add no new cryptographic surface.
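
A minimal sketch of the display-layer idea described above, under stated assumptions: the two four-word lists are truncated stand-ins (a real list has 2048 entries), the native words are made up for illustration and are not taken from the repo, and the function names are hypothetical. Only the seed derivation (PBKDF2-HMAC-SHA512, 2048 iterations, salt "mnemonic" + passphrase) follows the BIP-39 specification. Checksum validation is left out.

```python
import hashlib
import unicodedata

# Truncated lists for illustration only; a real list has 2048 entries.
ENGLISH = ["abandon", "ability", "able", "about"]
# Made-up display words, index-paired with ENGLISH (not from the repo).
NATIVE = ["abandono", "habilidad", "capaz", "acerca"]

NATIVE_TO_INDEX = {
    unicodedata.normalize("NFKD", word): i for i, word in enumerate(NATIVE)
}


def to_canonical_english(native_phrase: str) -> str:
    """Map each display word to the English word at the same index."""
    words = unicodedata.normalize("NFKD", native_phrase).split()
    try:
        return " ".join(ENGLISH[NATIVE_TO_INDEX[w]] for w in words)
    except KeyError as exc:
        raise ValueError(f"unknown display word: {exc.args[0]}") from None


def seed_from_english(mnemonic: str, passphrase: str = "") -> bytes:
    """BIP-39 seed derivation; always runs on the English mnemonic."""
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, 64)


def seed_from_native(native_phrase: str, passphrase: str = "") -> bytes:
    """Translate the display phrase first, then derive the seed as usual."""
    return seed_from_english(to_canonical_english(native_phrase), passphrase)
```

Because the native words never reach PBKDF2, a phrase entered in the display language and its English equivalent yield the same 64-byte seed.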

The repo includes:

30 native-language display wordlists
2048 entries per language
Bidirectional English-to-native mappings
Validation scripts
Test vectors
Documentation
MIT license

Languages include Arabic, Hindi, Bengali, Urdu, Farsi, Turkish, Vietnamese, Thai, Hebrew, Polish, Ukrainian, Romanian, Swedish, Danish, Filipino, Malay, Indonesian, Russian, Dutch, German, Estonian, and others.

Why we built it:

BIP-39 has canonical wordlists for only 10 languages. Most of the world still has to deal with recovery phrases in English or in a language that is not native to them.

We wanted to explore whether wallets can improve recovery UX for non-English users while staying fully compatible with standard BIP-39 flows.

This is not a new seed scheme, not a wallet, not a token, and not a replacement for canonical BIP-39.

It is a display-layer convention for multilingual recovery UX.

We would appreciate review, criticism, native-speaker corrections, and feedback from wallet developers.

GitHub:
https://github.com/osem23/bip39-wordlists-tzur

submitted by /u/osem23

[Project] Open Database Of 1,000+ IP Camera Specs — JSON/CSV, CC0, 49 Brands

I released an open dataset of IP/CCTV camera specifications under CC0 (public domain).

The problem it solves: camera specs are scattered across vendor PDFs, inconsistent retailer listings, and paywalled databases. There was no single structured open source to query from.

What’s in it:

– 1,000 cameras across 49 brands (Hikvision, Dahua, Reolink, Axis, Hanwha, Tapo, Ubiquiti, and more)

– One JSON file per camera under cameras/<brand>/<model>.json, aggregated into data/cameras.json + CSV

– Fields: resolution, sensor, lens, connectivity (PoE/WiFi/battery/4G), night vision type and range, IP rating, ONVIF/RTSP support, audio, storage, price, market tags

– Schema validated on every PR via GitHub Actions

– CC0 — no attribution required, do whatever you want with it
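
The per-camera layout above (cameras/<brand>/<model>.json merged into data/cameras.json) could be aggregated roughly like this. It is a sketch under assumptions: the function names are hypothetical, the repo's actual build script may work differently, and no particular field names are assumed.

```python
import json
from pathlib import Path


def aggregate_cameras(root: Path) -> list[dict]:
    """Merge every cameras/<brand>/<model>.json under root into one list."""
    records = []
    # Sorting keeps the aggregate stable from run to run.
    for path in sorted((root / "cameras").glob("*/*.json")):
        records.append(json.loads(path.read_text(encoding="utf-8")))
    return records


def write_aggregate(root: Path) -> Path:
    """Write the merged list to data/cameras.json and return that path."""
    out = root / "data" / "cameras.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(aggregate_cameras(root), indent=2), encoding="utf-8")
    return out
```

Calling write_aggregate(Path(".")) from a checkout with that directory layout would regenerate the aggregate file; the CSV export would be a separate step.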

Contributing:

Non-devs can submit cameras via a GitHub issue form (no cloning needed). Developers can use an interactive CLI wizard (npm run add) that writes the JSON file without needing to know the schema.

Browse it: https://ch-bas.github.io/cctv-camera-database/

Repo: https://github.com/ch-bas/cctv-camera-database

Built with Claude Code — specs sourced from manufacturer datasheets, each entry cites its source URL.

submitted by /u/CantaloupeHeavy996

Sick Of “black Box” Cost-of-living Data? I Built An Open-source Tool For Radical Price Transparency, But I Need Your Help.

Hello everyone,

I want to introduce you to an open-source project I’ve been building called Price Compass.

The goal is simple: empower international price transparency through open, verifiable data. Traditional cost-of-living indexes offer “black box” metrics, where you just have to trust their final number. Price Compass serves both aggregated insights and the underlying raw data. Every single price is tagged with a vendor name, timestamp, and a direct link to the product so users can verify, audit, and calculate their own economic indicators.

Why this matters

  • Custom “Shopping Baskets”: Instead of relying on generic averages, you can build a personalized monthly cart reflecting what you actually consume (e.g., 10L milk, 2 gym memberships, 1 transit pass). You no longer have to take media or political comparisons of international living costs on trust; you can independently verify the real numbers across borders yourself.
  • On-the-Fly Aggregation: Switch between Average, Minimum, or Maximum modes to see the full market spectrum instantly.
  • Historical Indicators & Auditing: Because the system tracks and stores raw data over time, you can look back at historical data points to independently reconstruct and verify inflation rates and other economic indicators. Instead of just accepting a government’s official CPI (Consumer Price Index) percentage at face value, you have the historical ledger to audit how prices actually shifted on the ground.
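
A rough sketch of how a custom basket and the Average/Minimum/Maximum modes could fit together. The data layout, item names, and prices are invented for illustration and are not Price Compass's actual schema or data.

```python
from statistics import mean

# Hypothetical raw observations: (item, vendor, price). Values are made up.
PRICES = [
    ("milk_1l", "vendor_a", 1.20),
    ("milk_1l", "vendor_b", 1.00),
    ("milk_1l", "vendor_c", 1.40),
    ("transit_pass", "vendor_d", 50.00),
    ("transit_pass", "vendor_e", 60.00),
]

# One aggregation function per display mode.
AGGREGATORS = {"average": mean, "minimum": min, "maximum": max}


def basket_cost(basket: dict[str, float], prices, mode: str = "average") -> float:
    """Price a basket of {item: quantity} under the chosen aggregation mode."""
    aggregate = AGGREGATORS[mode]
    by_item: dict[str, list[float]] = {}
    for item, _vendor, price in prices:
        by_item.setdefault(item, []).append(price)
    # Each item's observed prices are aggregated, then scaled by quantity.
    return sum(qty * aggregate(by_item[item]) for item, qty in basket.items())


monthly = {"milk_1l": 10, "transit_pass": 1}
for mode in AGGREGATORS:
    print(mode, round(basket_cost(monthly, PRICES, mode), 2))
```

Keeping the raw observations and aggregating on demand is what lets the mode be switched on the fly; a stored average alone could not be turned back into a minimum or maximum.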

Where it stands right now

The frontend is a fast, static page. It is about 70% functional for two countries: Denmark and Hungary.

The Catch (And why I need you)

I love this concept, but I recently hit a wall. After a 2-month pause due to burnout, I returned to find that 1 data source had shut down entirely and 4 others had changed their interfaces. To get back to where I left off, I have to rewrite one scraper from scratch and rework 4 others.

As we all know, scraper-based projects don’t have a finish line. They require constant maintenance as vendors update their sites, making it effectively a part-time job with no end in sight. I don’t have that kind of free time, and writing/rewriting scrapers isn’t how I want to spend my weekends.

How We Can Move Forward (My Proposal)

Right now, the code is a bit of a mess following the burnout pause, and the broken scrapers mean the data is frozen. However, I still deeply believe in this mission. If there is genuine community interest, I am willing to pick this project back up.

Here is what I am looking to gauge before diving back in:

  1. User Interest & Crowdfunding: Would you actually use a tool like this? If the interest is there, I’d consider setting up a non-binding community fund (like Patreon or GoFundMe) down the line to help cover general project costs.
  2. Future Contributors: If you are a developer, data scientist, or open-source enthusiast, would you want to contribute? I’d love to connect with anyone who wants to help expand regional data or collaborate on building a more resilient data architecture.

My commitment to you: If this post gets traction and validates that people want Price Compass to exist, I will commit to jumping back in. I will clean up the codebase, write proper documentation, fix the broken Danish and Hungarian scrapers, and set it up as a clean, welcoming environment for community contributions.

Is a transparent, auditable cost-of-living tool something the community actually wants, or am I shouting into the void? Let me know your thoughts, critiques, or if you’d be down to help build this!

submitted by /u/nlevi-dev

Internal App Ideas Keyword Research Tool Hitting Roadblocks

So I’m trying to build an internal, private tool for myself so I can research app/content ideas I would like to build. I would like to get tips on how to do it. How would you build it? What tools and methods would you use?

I applied for Google Ads API access (waiting for approval). I already have a Source Pack template built, with raw data, staging, and reporting, for Keyword Planner. I need search volume, trend, and competition index, and the same for the other tools.

Google Trends Explore for specific keyword families/seeds.
Pytrends, pytrends-modern, and similar tools seem to be outdated and don’t work. What’s the current way to do that? I get blocked after one request.

Apple charts, Apple reviews for finding pain points etc.

I have no experience with scraping and don’t want to do broad scraping anyway. I just want a report for specific keywords that I can expand on, an opportunity score if you will. I would appreciate any tips.

submitted by /u/serdox

Built An Alternative To OpenCorporates Using Strictly First-party Government Data. Looking For Feedback.

Hey r/datasets, I’ve noticed a lot of offline countries and gaps when using OpenCorporates, so my team and I built an alternative: www.zephira.ai. We source our data directly from official government registries across 200+ countries. I’d love for this community to test it out and let me know how it compares to what you’re currently using.

Mainly interested in understanding:

  • How do you currently verify companies and directors internationally?
  • What data providers do you use today?
  • What are the biggest gaps with providers like OpenCorporates, D&B, Moody’s/BvD, Creditsafe, or local registries?
  • Would registry-sourced company data with API/bulk access be useful for your workflow?

Not trying to make this a sales post. I’d appreciate critical feedback from people who have worked with these datasets.

submitted by /u/SectionLongjumping92

[self-promotion] Built A Rules-based Economic Stress Monitor For 11 African Economies — Dataset Now Available

Been working on this for a few months. The problem: African macro data is either paywalled (Bloomberg, Refinitiv) or significantly lagged (World Bank annual releases). There’s not much in between for developers and researchers who need current, attributed data at a reasonable price.

What I built: a cross-signal economic stress monitor that pulls directly from central banks and national statistics offices across 11 African economies (Nigeria, Ghana, Kenya, South Africa, Zambia, Tanzania, Uganda, Morocco, Côte d’Ivoire, Ethiopia, Rwanda).

Two analytical layers:

– Acute stress: FX momentum, inflation, export-weighted commodity shock, real interest rate, reserve drawdown
– Structural vulnerability: debt distress, fiscal position, banking stress, REER misalignment, political stability

This week’s most interesting finding: Zambia has the lowest acute stress score in the dataset (copper boom, appreciating kwacha, low inflation) while simultaneously carrying one of the highest structural vulnerability scores (debt at 114% of GNI from its 2020 default). The commodity windfall is masking unrestructured debt.

Available on Apify with full source attribution on every record: https://apify.com/malmon/african-economic-stress-monitor

Free monthly newsletter with the findings if you’d rather not run it yourself: https://malmonde.substack.com/p/african-macro-signal-june-2026

Happy to answer questions about methodology or coverage.

submitted by /u/g_kalle

What Is The Best Travel Search API (flights, Hotels, Etc) Today?

I have a little personal project that I’d like to build and I see there are a number of APIs available around the Internet (RapidAPI, apify, etc.)

Is there a known best-in-class API that provides flight information/pricing from most airlines, can discriminate by coach/business, and offer information on hotel availability and pricing too?

A while ago I tried an API from RapidAPI, but quickly discovered that it wasn’t bringing in a lot of stuff from lesser-known airlines (Copa, smaller Euro carriers, etc). I’d like to build this on top of something solid, but that doesn’t require me to buy millions of calls a month since this is a personal project.

submitted by /u/puckpuckgo

How Deepfake Detection Models Perform Across Social Media Platforms

When images are run through social media platforms, they are resized, re-encoded, and pushed through the platform’s codec. In assessing a deepfake detector model, it’s important to ensure the model remains robust across real world platforms.

I built a dataset of varied image formats that mimic the image adjustments made by these popular platforms and tested some open source models on it.

Dataset, Model Results

submitted by /u/Tasty_Pressure_5618

Seeking Multi-year Airbnb Listing Data (prices, Location, Capacity) For European Coastal Cities

I am looking for Airbnb data for research on short-term rental markets. I am especially interested in listings and listing-level data, ideally covering several years so I can analyze changes over time. I am looking for information such as price, location, size, number of guests, minimum stay / length of stay, and other basic listing characteristics.
The geographic scope I am interested in includes tourist coastal cities in Poland, such as Gdańsk, Sopot, and Kołobrzeg, as well as selected cities abroad, such as Dubrovnik, Split, and Rijeka.
The Inside Airbnb website primarily features data for the US. It doesn’t list any Polish cities.

If anyone has access to such data, knows where it can be obtained, or has worked with similar datasets before, I would be very grateful for any contact, advice, or suggestions.

submitted by /u/CatResponsible6064

Which Egocentric Video Datasets Do You Find Most Useful For Research?

I’ve been looking into first-person (egocentric) video datasets for activity recognition and multimodal learning research.

A few challenges that seem to come up repeatedly are:

Motion blur
Rapid viewpoint changes
Occlusions from hands and objects
Long video sequences
Annotation consistency

For people who have worked with these datasets:

Which datasets have been the most useful?
What limitations did you encounter?
How well do current datasets generalize to real-world applications?
Are there any newer datasets you’d recommend exploring?

I’d appreciate hearing about experiences from both research and production environments.

submitted by /u/Vane1st

[Dataset] REFUTE — Scientific Critique & Epistemic Calibration On Recent Paper Summaries (Apache-2.0)

Sharing a dataset I work on. REFUTE is an Apache-2.0 benchmark for testing whether models can critique recent science summaries with calibrated, evidence-grounded judgment.

Configs:

– refute_soundness: judge-free split (no LLM judge needed to score)
– refute_hard_60 / refute_120: harder vignettes

Each item: a paper summary (some with planted flaws / overclaims / missing evidence) + gold labels, with confidence targets scored using Brier (a strictly proper rule), so calibration is measured rather than just accuracy.
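
For readers unfamiliar with Brier scoring, here is a minimal sketch of the binary form: the mean squared gap between a stated confidence and the 0/1 gold label. The function name and example numbers are illustrative, and REFUTE's own scoring code may differ in detail.

```python
def brier_score(confidences: list[float], labels: list[int]) -> float:
    """Mean squared error between confidences in [0, 1] and 0/1 labels."""
    if not confidences or len(confidences) != len(labels):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(confidences, labels)) / len(labels)


# Two hypothetical models, both right on items 1 and 3 and wrong on item 2.
gold = [1, 0, 0]
hedged = [0.9, 0.6, 0.2]         # wrong on item 2, but only mildly confident
overconfident = [1.0, 1.0, 0.0]  # wrong on item 2 with full confidence

print(brier_score(hedged, gold))
print(brier_score(overconfident, gold))
```

Both models have the same accuracy at a 0.5 threshold, yet the overconfident one scores worse (lower is better), which is the sense in which calibration is measured rather than accuracy alone.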

License: Apache-2.0
Load: load_dataset("BGPT-OFFICIAL/refute", "refute_soundness")
Dataset: https://huggingface.co/datasets/BGPT-OFFICIAL/refute
Leaderboard: https://huggingface.co/spaces/BGPT-OFFICIAL/refute-leaderboard

Happy to answer questions about how it was constructed and labeled.

submitted by /u/connerpro

Open-sourced World’s Largest Database Of UFC Stats + Vegas-beating Model And Code

https://mcinerney.ai/writings/i-open-sourced-my-ufc-prediction-model-weights-and-database/ – lessons on using the data for modeling over the last several years.

Here’s my gigantic database of UFC stats, including hour-by-hour odds data for fights over the past 5 years. There are AGENTS.md, CLAUDE.md, and README.md files optimized for CC or Codex to analyze the data.

submitted by /u/FlyingTriangle

Global Jobs Dataset (271M+ Job Openings Since 2018)

Hi everyone,

I work at PredictLeads, where we collect and maintain company datasets focused on business signals.

Our Jobs Dataset currently includes:

  • 271.3 million job openings detected since 2018
  • 8.9 million active job openings with job descriptions available
  • Historical hiring activity and trends
  • Company-level hiring signals
  • API and bulk data access

Documentation:

https://docs.predictleads.com/api_endpoints/job_openings_dataset

In addition to jobs data, we also provide datasets covering:

  • Technologies
  • News Events
  • Funding Events
  • Company Data
  • Website Changes
  • GitHub Activity
  • And more

One thing that makes us a bit different is that we don’t focus on building a platform. We’re a data provider focused primarily on data quality, coverage, and making the data easy to integrate into your existing workflows, data warehouses, CRMs, or enrichment pipelines.

Happy to answer any questions about coverage, use cases, APIs, or data delivery formats.

submitted by /u/Expensive_Horse6568

Crimedatasets – A Comprehensive Collection Of Crime-related Datasets For Python

PyPI: https://pypi.org/project/crimedatasets/
GitHub: https://github.com/lightbluetitan/crimedatasets-py
Docs: https://lightbluetitan.github.io/crimedatasets-py/
pip install crimedatasets
The crimedatasets package provides a comprehensive collection of crime-related datasets from around the world. It includes extensive data on topics such as mass shootings, hate crimes, incarceration statistics, serial killers, corruption indexes, law enforcement data, criminal justice metrics, drug overdoses, and prison facilities.

submitted by /u/renzocrossi

June 2026 Job/Careers Dataset, Use Structured Data + AI In Your Job Search

Reposting this here. I’ve built a crawler that collects live job listings from 5.6 million US company websites and continuously updates a monthly pool of job listing data.

I’ve seen other people doing this on reddit but refusing to be transparent and actually share their datasets for download.

My Airflow DAGs complete a full crawling cycle of all companies and their associated job boards in under 24 hours. This runs on a Windows machine and a modest home network, so my operating costs are near zero.

This data will remain forever free @ jobdatapool.com

submitted by /u/never_sleeping99

Does Anything Exist That Can Automatically Translate Variable And Value Labels In A Stata Dataset?

I’ve been working with a cross-national dataset where all the variable labels and value labels are in a foreign language. Renaming them manually is tedious and error-prone, especially with 200+ variables.

I know I can write a do-file to relabel everything but that still requires me to know what the foreign labels mean and manually enter English equivalents one by one.

Is there any tool or workflow that handles this automatically? Ideally something that takes the .dta file, translates the metadata, and returns a clean English-labeled file without touching the underlying data.

submitted by /u/WordAware2689

[self-promotion] 25 Years Of Official West African FX Rates — Daily Data From Central Banks, Now In One API

Been working on a gap I kept running into: getting official, daily FX rates for West African countries programmatically.

The World Bank has this data but with a 6-12 month lag. Everything else is either paywalled or scraped from aggregators with no attribution.

So I built an actor that pulls directly from the issuing central banks: CBN Nigeria, Bank of Ghana, BCEAO for the 8 WAEMU nations, and Banco de Cabo Verde. 11 countries, 4 currencies, history back to 1996 in some cases.

A few things I found interesting while building it:

The 8 WAEMU countries (Côte d’Ivoire, Senegal, Mali etc.) share a currency pegged to the euro by treaty since 1999, at exactly 655.957 XOF/EUR, never changed. There’s no independently set USD rate; it’s mathematically derived from the ECB daily reference rate.
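
The peg arithmetic can be sketched in a few lines, assuming the ECB reference rate is quoted as USD per 1 EUR. The 1.0800 below is a made-up rate for illustration, and any rounding convention BCEAO applies when publishing is not modeled here.

```python
from decimal import Decimal

XOF_PER_EUR = Decimal("655.957")  # fixed parity stated in the post


def xof_per_usd(usd_per_eur: Decimal) -> Decimal:
    """Derive XOF per 1 USD from a reference rate quoted as USD per 1 EUR."""
    if usd_per_eur <= 0:
        raise ValueError("reference rate must be positive")
    # 1 EUR buys 655.957 XOF and also usd_per_eur USD, so divide the two.
    return XOF_PER_EUR / usd_per_eur


rate = xof_per_usd(Decimal("1.0800"))  # hypothetical reference rate
print(rate.quantize(Decimal("0.001")))
```

Using Decimal rather than float keeps the fixed parity exact, so the only moving input is the daily EUR/USD reference rate.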

Every output record carries the source bank, URL, retrieval timestamp and licence note. CBN explicitly grants permission to copy with attribution, which made things cleaner legally.

Available here if useful: https://apify.com/malmon/west-africa-fx-rates

Happy to answer questions about coverage or methodology.

submitted by /u/g_kalle

Looking For Honest Feedback On A Business/company Dataset I’m Building

Hey everyone,

I’m working on a business/company dataset and I’d really appreciate honest feedback from people who care about datasets, data quality, structure, and usefulness.

Just to be clear, this is not meant to be an ad. I’m not trying to sell anything here. I’m genuinely looking for advice on whether the data is useful, what’s missing, and what would make it more valuable as a dataset.

The idea is to build a structured dataset of business profiles over time. Right now, each company profile can include things like:

  • company name
  • website
  • industry
  • sector
  • location/headquarters
  • short description
  • related business details where available
  • confidence indicators
  • sources/references where possible

The longer-term plan is for the dataset to improve and grow as more businesses are searched and evaluated. But before I keep building in that direction, I’d really like people to look at what it currently returns and tell me whether it’s actually useful from a data perspective.

There’s a free live search page here where you can test the current output:

https://fastbusinessapi.com/trial-search/

I’d really appreciate feedback on things like:

  • whether the fields are useful
  • whether the structure makes sense
  • what fields are missing
  • whether the data feels trustworthy
  • what would make this more useful as a dataset
  • what would make you not use or trust it
  • whether this type of dataset has value if it grows over time

Again, this is genuinely not intended as advertising. I’m asking because I want honest feedback from people who understand datasets before I spend more time building the wrong thing.

Any criticism, advice, or suggestions would be really appreciated.

submitted by /u/Nacez