Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

20,000 Hours Of Real-world Dual-arm Robot Manipulation Data Across 9 Embodiments, Open-sourced With Benchmark And Code (LingBot-VLA)

TL;DR

• 20,000 hours of teleoperated manipulation data from 9 dual-arm robot configurations (AgiBot G1, AgileX, Galaxea R1Pro, Realman, ARX Lift2, Bimanual Franka, and others)

• Videos manually segmented into atomic actions, then labeled with global and sub-task descriptions via VLM

• GM-100 benchmark: 100 tasks × 3 platforms × 130 episodes per task = 39,000 expert demonstrations for post-training evaluation

• Full code, base model weights, and benchmark data released

• Paper: arXiv:2601.18692

• Code: github.com/robbyant/lingbot-vla

• Models/Data: HuggingFace collection

What’s in the data

Each of the 9 embodiments has a dual-arm setup with multiple RGB-D cameras (typically 3 views: head + two wrists). The raw trajectories were collected via teleoperation (VR-based or isomorphic arms depending on the platform). Action spaces range from 12-DoF to 16-DoF depending on the robot. Every video was manually segmented into atomic action clips by human annotators, with static frames at episode start/end removed. Task and sub-task language instructions were then generated using Qwen3-VL-235B. An automated filtering pass removes episodes with technical anomalies, followed by manual review using synchronized multi-view video.

The data curation pipeline is probably the part I found most interesting to work through. About 50% of the atomic actions in the test set are absent from the top 100 most frequent training actions, which gives a sense of how much distribution shift the benchmark actually tests.

Benchmark structure

The GM-100 benchmark covers 100 tabletop manipulation tasks evaluated on 3 platforms (AgileX, AgiBot G1, Galaxea R1Pro). For each task, 150 raw trajectories are collected and the top 130 are retained after quality filtering. Object poses are randomized per trajectory. Evaluation uses two metrics: Success Rate (binary task completion within 3 minutes) and Progress Score (partial credit based on sequential subtask checkpoints). All evaluation rollouts are recorded in rosbag format and will be released.
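To make the two metrics concrete, here is a minimal Python sketch. The `Rollout` record, its field names, and the rule that Progress Score equals the fraction of sequential checkpoints reached are assumptions for illustration only; the paper defines the exact scoring. The sketch also rechecks the demonstration count quoted in the TL;DR.

```python
from dataclasses import dataclass

# Demonstration count from the TL;DR: tasks x platforms x retained episodes.
TOTAL_DEMOS = 100 * 3 * 130

TIME_LIMIT_S = 180  # the 3-minute completion limit mentioned in the post


@dataclass
class Rollout:
    """Hypothetical record of one evaluation rollout (not the released schema)."""
    checkpoints_passed: int  # sequential subtask checkpoints reached, in order
    checkpoints_total: int   # checkpoints defined for the task
    duration_s: float        # wall-clock duration of the rollout


def success(r: Rollout) -> bool:
    """Binary success: all checkpoints reached within the time limit."""
    return r.checkpoints_passed == r.checkpoints_total and r.duration_s <= TIME_LIMIT_S


def progress(r: Rollout) -> float:
    """Partial credit: fraction of checkpoints reached (assumed scoring rule)."""
    return r.checkpoints_passed / r.checkpoints_total


def summarize(rollouts):
    """Return (Success Rate, Progress Score) averaged over rollouts."""
    n = len(rollouts)
    sr = sum(success(r) for r in rollouts) / n
    ps = sum(progress(r) for r in rollouts) / n
    return sr, ps


rollouts = [
    Rollout(4, 4, 95.0),   # completed in time
    Rollout(2, 4, 180.0),  # half of the subtasks done
    Rollout(0, 4, 180.0),  # no progress
    Rollout(3, 4, 180.0),  # failed at the last subtask
]
sr, ps = summarize(rollouts)
print(f"SR={sr:.2%}  PS={ps:.2%}  demos={TOTAL_DEMOS}")
```

A rollout that stalls at the last subtask still scores zero on Success Rate but keeps most of its Progress Score, which is why the reported PS numbers sit well above the SR numbers.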

For context on the numbers: LingBot-VLA w/ depth hits 17.30% average SR and 35.41% PS across all three platforms. π0.5 gets 13.02% SR / 27.65% PS on the same tasks with the same post-training data. These are not high numbers in absolute terms, which honestly reflects how hard 100 diverse real-world manipulation tasks actually are.

Scaling observations from the data

One thing worth flagging for people interested in data scaling: going from 3,000 to 20,000 hours of pre-training data showed consistent improvement with no saturation. The per-platform curves (Fig 5 in the paper) all trend upward at the 20k mark. This is on real hardware, not sim, which makes the continued scaling somewhat surprising given how noisy real-world data tends to be.

Training codebase

The released codebase achieves 261 samples/sec/GPU on an 8-GPU setup (1.5x to 2.8x over OpenPI/StarVLA/Dexbotic depending on the VLM backbone). Uses FSDP with hybrid sharding for the action expert modules and FlexAttention for the sparse multimodal fusion. Scaling efficiency stays close to linear up to 256 GPUs.

Caveats

All data is dual-arm tabletop manipulation only. No mobile manipulation, no single-arm, no legged locomotion. The 17% average success rate means these tasks are far from solved. Depth integration helps on some platforms more than others (AgileX benefits most, AgiBot G1 barely moves). The language annotations are VLM-generated after manual segmentation, so annotation quality depends on both the human segmentation and the VLM’s captioning accuracy.

Disclosure: this is from Robbyant. Sharing because 20k hours of labeled real-robot data with a standardized benchmark is something I haven’t seen at this scale in an open release before, and the benchmark data alone could be useful for people working on evaluation protocols for embodied AI.

Curious what formats and subsets would be most useful for people here to work with directly.

submitted by /u/Independent_Plum_489

How To Investigate Performance Issues In Spark?

Hi everyone,

I’m currently studying ways to optimize pipelines in environments like Databricks, Fabric, and Spark in general, and I’d love to hear what you’ve been doing in practice.

Lately, I’ve been focusing on Shuffle, Skew, Spill, and the Small File Problem.

What other issues have you encountered or studied out there?

More importantly, how do you actually investigate the problem beyond what Spark UI shows?

These are some of the official docs I’ve been using as a base:

https://learn.microsoft.com/azure/databricks/optimizations/?WT.mc_id=studentamb_493906

https://learn.microsoft.com/azure/databricks/optimizations/spark-ui-guide/long-spark-stage-page?WT.mc_id=studentamb_493906

https://learn.microsoft.com/azure/databricks/pyspark/reference/functions/shuffle?WT.mc_id=studentamb_493906

submitted by /u/Significant-Side-578

Is There Research Value In Time-aligned Crypto Market + Sentiment Observations?

Hi,

Over the past few months I’ve built a pipeline that produces weekly observational snapshots of crypto markets, aligning spot market structure (prices, spreads, liquidity context) with aggregated social sentiment.

Each observation captures a monitoring window of spot price samples, paired with aggregated sentiment from the hour preceding the window.
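A minimal sketch of that alignment, using only the Python standard library. The function names, the `(timestamp, value)` tuple layout, and the use of a plain mean as the aggregate are assumptions for illustration, not the pipeline's actual schema.

```python
from datetime import datetime, timedelta
from statistics import mean


def sentiment_before(window_start, sentiment_samples, lookback=timedelta(hours=1)):
    """Mean sentiment over [window_start - lookback, window_start); None if empty."""
    lo = window_start - lookback
    scores = [s for t, s in sentiment_samples if lo <= t < window_start]
    return mean(scores) if scores else None


def build_observation(window_start, window_end, price_samples, sentiment_samples):
    """Pair the spot prices inside the window with sentiment from the hour before it."""
    prices = [p for t, p in price_samples if window_start <= t <= window_end]
    return {
        "window_start": window_start,
        "window_end": window_end,
        "prices": prices,
        "sentiment_prev_hour": sentiment_before(window_start, sentiment_samples),
    }


t0 = datetime(2026, 1, 4, 12, 0)
prices = [(t0 + timedelta(minutes=m), 100.0 + m) for m in (0, 10, 20)]
sentiment = [
    (t0 - timedelta(minutes=90), -1.0),  # too old, ignored
    (t0 - timedelta(minutes=30), 0.25),
    (t0 - timedelta(minutes=5), 0.75),
    (t0 + timedelta(minutes=5), 0.9),    # inside the window, ignored
]
obs = build_observation(t0, t0 + timedelta(minutes=20), prices, sentiment)
print(obs["prices"], obs["sentiment_prev_hour"])
```

Keeping the sentiment interval strictly before the window start means no sentiment sample can overlap the prices it is paired with, which is the property a reviewer would check first for look-ahead leakage.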

I’ve published weekly Sunday samples for inspection:

https://huggingface.co/datasets/Instrumetriq/crypto-market-sentiment-observations

https://github.com/SiCkGFX/instrumetriq-public

What I’m genuinely trying to understand:

– Is this kind of dataset interesting or useful to anyone doing analysis or research?

– Are there obvious methodological red flags?

– Is this solving a real problem, or just an over-engineered artifact?

Critical feedback is welcome. If this is pointless, I’d rather know now.

submitted by /u/SiCkGFX

Active Directory Vulnerability Datasets

TL;DR: Is there a dataset I can feed to LLMs to test their capability in identifying vulnerabilities in Active Directory?

Hi, I’m currently preparing to test different LLMs for their capability in vulnerability detection. As far as I can tell, no such dataset exists. I have, however, seen some articles where the authors built or simulated their own datasets, as in “A Methodological Framework for AI-Assisted Security Assessments of Active Directory Environments”. I would think that some of these researchers might upload their datasets, but I can’t find them. If you have any suggestions for datasets or where I might find them, please leave a comment.

submitted by /u/ThaLazyLand

Best Sources For A Global 2026 Tech & Startup Database? (Website + Email)

Hi everyone,

I’m looking for advice on where to find or purchase a comprehensive, up-to-date global dataset of tech companies and startups for 2026.

I need a global reach (US, EU, Asia) and specifically require datasets that include:

• Company Name

• Official Website URL

• Verified Business Email

I want to avoid outdated lists and “dead” websites from previous years. Does anyone know of reliable providers, directories, or platforms that offer high-quality global exports for this year?

Any recommendations for tools or marketplaces that specialize in recently updated business data would be greatly appreciated.

Thanks!

submitted by /u/Embarrassed_Fig_566

Anyone Working With RGB-D Datasets That Preserve Realistic Sensor Failures (missing Depth On Glass, Mirrors, Reflective Surfaces)?

I’ve been looking for large-scale RGB-D datasets that actually keep the naturally occurring depth holes from consumer sensors instead of filtering them out or only providing clean rendered ground truth. Most public RGB-D datasets (ScanNet++, Hypersim, etc.) either avoid challenging materials or give you near-perfect depth, which is great for some tasks but useless if you’re trying to train models that handle real sensor failures on glass, mirrors, metallic surfaces, etc.

Recently came across the data released alongside the LingBot-Depth paper (“Masked Depth Modeling for Spatial Perception”, arXiv:2601.17895). They open-sourced 3M RGB-D pairs (2M real + 1M synthetic) specifically curated to preserve the missing depth patterns you get from actual hardware.

What’s in the dataset:

Split | Samples | Source | Notes
LingBot-Depth-R | 2M | Real captures (Orbbec Gemini, Intel RealSense, ZED) | Homes, offices, gyms, lobbies, outdoor scenes. Pseudo GT from stereo IR matching with left-right consistency check
LingBot-Depth-S | 1M | Blender renders + SGM stereo | 442 indoor scenes, includes speckle-pattern stereo pairs processed through semi-global matching to simulate real sensor artifacts
Combined training set | ~10M | Above + 7 open-source datasets (ClearGrasp, Hypersim, ARKitScenes, TartanAir, ScanNet++, Taskonomy, ADT) | Open-source splits use artificial corruption + random masking

Each real sample includes synchronized RGB, raw sensor depth (with natural holes), and stereo IR pairs. The synthetic samples include RGB, perfect rendered depth, stereo pairs with speckle patterns, GT disparity, and simulated sensor depth via SGM. Resolution is 960×1280 for the synthetic branch.

The part I found most interesting from a data perspective is the mask ratio distribution. Their synthetic data (processed through open-source SGM) actually has more missing measurements than the real captures, which makes sense since real cameras use proprietary post-processing to fill some holes. They provide the raw mask ratios so you can filter by corruption severity.
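For anyone who wants to recompute or sanity-check mask ratios themselves, here is a minimal Python sketch. It assumes missing depth is encoded as 0.0 and represents depth maps as nested lists; both are illustrative assumptions, so check the dataset card for the real encoding and file format.

```python
def mask_ratio(depth, invalid=0.0):
    """Fraction of pixels with no depth measurement (encoded here as `invalid`)."""
    total = sum(len(row) for row in depth)
    missing = sum(1 for row in depth for d in row if d == invalid)
    return missing / total


def filter_by_mask_ratio(samples, lo=0.0, hi=1.0):
    """Keep the names of samples whose mask ratio falls inside [lo, hi]."""
    return [name for name, depth in samples if lo <= mask_ratio(depth) <= hi]


# Tiny 2x2 "depth maps" in metres; 0.0 marks a hole in the sensor output.
samples = [
    ("clean",  [[1.2, 1.3], [1.1, 1.4]]),
    ("glass",  [[0.0, 1.3], [0.0, 1.4]]),
    ("mirror", [[0.0, 0.0], [0.0, 1.4]]),
]

moderate = filter_by_mask_ratio(samples, lo=0.25, hi=0.6)
print(moderate)
```

Filtering on a band rather than a single threshold makes it easy to build curriculum-style subsets, for example mildly corrupted samples first and heavily corrupted ones later.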

The scene diversity table in the paper covers 20+ environment categories: residential spaces of various sizes, offices, classrooms, labs, retail stores, restaurants, gyms, hospitals, museums, parking garages, elevator interiors, and outdoor environments. Each category is roughly 1.7% to 10.2% of the real data.

Links:

HuggingFace: https://huggingface.co/robbyant/lingbot-depth

GitHub: https://github.com/robbyant/lingbot-depth

Paper: https://arxiv.org/abs/2601.17895

The capture rig is a 3D-printed modular mount that holds different consumer RGB-D cameras on one side and a portable PC on the other. They mention deploying multiple rigs simultaneously to scale collection, which is a neat approach for anyone trying to build similar pipelines.

I’m curious about a few things from anyone who’s worked with similar data:

  1. For those doing depth completion or robotic manipulation research, is 2M real samples with pseudo GT from stereo matching sufficient, or do you find you still need LiDAR-quality ground truth for your use cases?
  2. The synthetic pipeline simulates stereo matching artifacts by running SGM on rendered speckle-pattern stereo pairs rather than just adding random noise to perfect depth. Has anyone compared this approach to simpler corruption strategies (random dropout, Gaussian noise) in terms of downstream model performance?
  3. The scene categories are heavily weighted toward indoor environments. If you’re working on outdoor robotics or autonomous driving with similar sensor failure issues, what datasets are you using for the transparent/reflective object problem?

submitted by /u/Electrical-Shape-266

[Dataset] [Soccer] [Sports Data] 10 Year Dataset: Top-5 European Leagues Match And Player Statistics (2015/16–Present)

I have compiled a structured dataset covering every league match in the Premier League, La Liga, Bundesliga, Serie A, and Ligue 1 from the 2015/16 season to the present.

• Format: Weekly JSON/XML files (one file per league per game-week)

• Player-level detail per appearance: minutes played (start/end), goals, assists, shots, shots on target, saves, fouls committed/drawn, yellow/red cards, penalties (scored/missed/saved/conceded), own goals

• Approximate volume: 1,860 week-files (~18,000 matches, ~550,000 player records)

The dataset was originally created for internal analysis. I am now considering offering the complete archive as a one-time ZIP download.

I am assessing whether there is genuine interest from researchers, analysts, modelers, or others working with football data.

If this type of dataset would be useful for your work (academic, modeling, fantasy, analytics, etc.), please reply with any thoughts on format preferences, coverage priorities, or price expectations.

I can share a small sample week file via DM or comment if helpful to evaluate the structure.

submitted by /u/Specialist-Hand6171

S&P 500 Corporate Ethics Scores – 11 Dimensions

Dataset Overview

Most ESG datasets rely on corporate self-disclosures — companies grading their own homework. This dataset takes a fundamentally different approach. Every score is derived from adversarial sources that companies cannot control: court filings, regulatory fines, investigative journalism, and NGO reports.

The dataset contains integrity scores for all S&P 500 companies, scored across 11 ethical dimensions on a -100 to +100 scale, where -100 represents the worst possible conduct and +100 represents industry-leading ethical performance.

Fields

Each row represents one S&P 500 company. The key fields include:

  • Company information: ticker symbol, company name, stock exchange, industry sector (ISIC classification)

  • Overall rating: Categorical assessment (Excellent, Good, Mixed, Bad, Very Bad)

  • 11 dimension scores (-100 to +100):

  • planet_friendly_business — emissions, pollution, environmental stewardship

  • honest_fair_business — transparency, anti-corruption, fair practices

  • no_war_no_weapons — arms industry involvement, conflict zone exposure

  • fair_pay_worker_respect — labour rights, wages, working conditions

  • better_health_for_all — public health impact, product safety

  • safe_smart_tech — data privacy, AI ethics, technology safety

  • kind_to_animals — animal welfare, testing practices

  • respect_cultures_communities — indigenous rights, community impact

  • fair_money_economic_opportunity — financial inclusion, economic equity

  • fair_trade_ethical_sourcing — supply chain ethics, sourcing practices

  • zero_waste_sustainable_products — circular economy, waste reduction

What Makes This Different from Traditional ESG Data

Traditional ESG providers (MSCI, Sustainalytics, Morningstar) rely heavily on corporate sustainability reports — documents written by the companies themselves. This creates an inherent conflict of interest where companies with better PR departments score higher, regardless of actual conduct.

This dataset is built using NLP analysis of 50,000+ source documents including:

  • Court records and legal proceedings

  • Regulatory enforcement actions and fines

  • Investigative journalism from local and international outlets

  • Reports from NGOs, watchdogs, and advocacy organisations

The result is 11 independent scores that reflect what external evidence says about a company, not what the company says about itself.

Use Cases

  • Alternative ESG analysis — compare these scores against traditional ESG ratings to find discrepancies

  • Ethical portfolio screening — identify S&P 500 holdings with poor conduct in specific dimensions

  • Factor research — explore correlations between ethical conduct and financial performance

  • Sector analysis — compare industries across all 11 dimensions

  • ML/NLP research — use as labelled data for corporate ethics classification tasks

  • ESG score comparison — benchmark against MSCI, Sustainalytics, or Refinitiv scores

Methodology

Scores are generated by Mashini Investments using AI-driven analysis of adversarial source documents.

Each company is evaluated against detailed KPIs within each of the 11 dimensions.

Coverage

– 500 companies — S&P 500 constituents

– 11 dimensions — 5,533 individual scores

– Score range — -100 (worst) to +100 (best)

CC BY-NC-SA 4.0 licence.

Kaggle

submitted by /u/RevolutionaryGate742

Early Global Stress Dataset Based On Anonymous Wearable Data

I’ve recently started collecting an early-stage, fully anonymous dataset showing aggregated stress scores by country and state.

The data is derived from on-device computations and shared only as a single daily score per region (no raw signals, no personal data).

Coverage is still limited, but the dataset is growing gradually.

Sharing here mainly to document the dataset and gather early feedback.

Public overview and weekly summaries are available here:

https://stress-map.org/reports

submitted by /u/maxstrok

[PAID] EU Amazon Product & Price Intelligence Dataset – 4M+ High-Value Products, Continuously Updated

Hi everyone,

I’m offering a large-scale EU Amazon product intelligence dataset with 4 million+ entries, continuously updated.
The dataset is primarily focused on high resale-value products (electronics, lighting, branded goods, durable products, etc.), making it especially useful for arbitrage, pricing analysis, and market research. US Amazon data will be added shortly.

What’s included:

  • Identifiers: ASIN(s), EAN, corresponding Bol.com product IDs (NL/BE)
  • Product details: title, brand, product type, launch date, dimensions, weight
  • Media: product main image
  • Pricing intelligence: historical and current price references from multiple sources (Idealo, Geizhals, Tweakers, Bol.com, and others)
  • Market availability: active and inactive Amazon stores per product
  • Ratings: overall rating and 5-star breakdown

Dataset characteristics:

  • Focused on items with higher resale and margin potential, rather than low-value or disposable products
  • Aggregated from multiple public and third-party sources
  • Continuously updated to reflect new prices, availability, and product changes

Delivery & Format:

  • JSON
  • Provided by store, brand, or product type
  • Full dataset or custom slices available

Who this is for:

  • Amazon sellers and online resellers
  • Price comparison and deal discovery platforms
  • Market researchers and brand monitoring teams
  • E-commerce analytics and data science projects

Sample & Demo:
A small sample (10–50 records) is available on request so you can review structure and data quality before purchasing.

Pricing & Payment:

  • Dataset slices (by store, brand, or product type): €30–€150
  • Full dataset: €500–€1,000
  • Payment via PayPal (Goods & Services)
  • Private seller, dataset provided as-is
  • Digital dataset, delivered electronically, no refunds after delivery

If this sounds useful, feel free to DM me — happy to share a sample or discuss a custom extract.

submitted by /u/Fun_Internal1460

Diabetes Indicators Dataset – 1,000,000 Rows (Privacy-Compliant) Synthetic “paid”

Hello everyone, I’d like to share a high-fidelity synthetic dataset I developed for research and testing purposes.

Please note that the link is to my personal store on Gumroad, where the dataset is available for sale.

Technical Details:

I generated 1,000,000 records based on diabetes health indicators (original source BRFSS 2015) using Gaussian Copula models (SDV library).

• Privacy: The data is 100% synthetic. No risk of re-identification, ideal for development environments requiring GDPR or HIPAA compliance.

• Quality: The statistical correlations between risk factors (BMI, hypertension, smoking) and diabetes diagnosis were accurately preserved.

• Uses: Perfect for training machine learning models, benchmarking databases, or stress-testing healthcare applications.

Link to the dataset: https://borghimuse.gumroad.com/l/xmxal

Feedback and questions about the methodology are welcome!

submitted by /u/Same_Asparagus_1979

Looking For Retail Data Analysis Project Ideas / References

Hi everyone,

I’m working on building a retail data analysis portfolio project and wanted to ask if anyone here has worked on or built a good retail analysis project that they’d be willing to share or let me refer to.

I’m mainly looking for project ideas, problem statements, datasets, or dashboards that reflect real-world retail use cases (sales analysis, customer behavior, inventory, forecasting, etc.).

Any links, GitHub repos, or brief descriptions would be really helpful.

Thank you in advance. I would really appreciate your time and help! 😊

submitted by /u/msfarahs

CAR-bench: A Benchmark For Task Completion, Capability Awareness, And Uncertainty Handling In Multi-turn, Policy-constrained Scenarios In The Automotive Domain. [Mock]

LLM agent benchmarks like τ-bench ask what agents can do. Real deployment asks something harder: do they know when they shouldn’t act?

CAR-bench (https://arxiv.org/abs/2601.22027), a benchmark for automotive voice assistants with domain-specific policies, evaluates three critical LLM Agent capabilities:

1️⃣ Can they complete multi-step requests?
2️⃣ Do they admit limits—or fabricate capabilities?
3️⃣ Do they clarify ambiguity—or just guess?

Three targeted task types:

Base (100 tasks): Multi-step task completion
Hallucination (90 tasks): Admit limits vs. fabricate
Disambiguation (50 tasks): Clarify vs. guess

Tested in a realistic evaluation sandbox:
58 tools · 19 domain policies · 48 cities · 130K POIs · 1.7M routes · multi-turn interactions.

What was found: Completion over compliance.

  • Models prioritize finishing tasks over admitting uncertainty or following policies
  • They act on incomplete info instead of clarifying
  • They bend rules to satisfy the user

SOTA model (Claude-Opus-4.5): only 52% consistent success.

Hallucination: non-thinking models fabricate more often; thinking models improve but plateau at 60%.

Disambiguation: no model exceeds 50% consistent pass rate. GPT-5 succeeds 68% occasionally, but only 36% consistently.

The gap between “works sometimes” and “works reliably” is where deployment fails.
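One way to read the "occasionally" versus "consistently" numbers is as two aggregations over repeated trials of the same task. The sketch below assumes "occasionally" means at least one of k trials passes and "consistently" means all k trials pass; the function names and the toy results are illustrative, and the paper's exact definitions may differ.

```python
def pass_at_k(trials):
    """True if at least one trial of the task succeeded ("works sometimes")."""
    return any(trials)


def pass_hat_k(trials):
    """True only if every trial of the task succeeded ("works reliably")."""
    return all(trials)


def rates(results):
    """Return (occasional, consistent) pass rates over a dict of task -> trial outcomes."""
    n = len(results)
    occasional = sum(pass_at_k(t) for t in results.values()) / n
    consistent = sum(pass_hat_k(t) for t in results.values()) / n
    return occasional, consistent


# Toy outcomes for 4 tasks, 3 trials each (made-up values, not benchmark data).
results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [False, True, False],
}
occasional, consistent = rates(results)
print(f"occasional={occasional:.0%}  consistent={consistent:.0%}")
```

The consistent rate can never exceed the occasional rate, and the distance between the two grows with how flaky the agent is on tasks it can sometimes solve.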

🤖 Curious how to build an agent that beats 54%?

📄 Read the Paper: https://arxiv.org/abs/2601.22027

💻 Run the Code & benchmark: https://github.com/CAR-bench/car-bench

We’re the authors – happy to answer questions!

submitted by /u/Frosty_Ad_6236

Platinum-CoT: High-Value Technical Reasoning. Distilled Via Phi-4 → DeepSeek-R1 (70B) → Qwen 2.5 (32B) Pipeline

I’ve just released a preview of Platinum-CoT, a dataset engineered specifically for high-stakes technical reasoning and CoT distillation.

What makes it different? Unlike generic instruction sets, this uses a triple-model “Platinum” pipeline:

  1. Architect: Phi-4 generates complex, multi-constraint Staff Engineer level problems.
  2. Solver: DeepSeek-R1 (70B) provides the “Gold Standard” Chain-of-Thought reasoning (Avg. ~5.4k chars per path).
  3. Auditor: Qwen 2.5 (32B) performs a strict logic audit; only the highest quality (8+/10) samples are kept.

Featured Domains:

Systems: Zero-copy (io_uring), Rust unsafe auditing, SIMD-optimized matching.

Cloud Native: Cilium networking, eBPF security, Istio sidecar optimization.

FinTech: FIX protocol, low-latency ring buffers.

Check out the parquet preview on HuggingFace:

https://huggingface.co/datasets/BlackSnowDot/Platinum-CoT

submitted by /u/BlackSnowDoto

Urgent Help Needed Regarding A Dataset!!!

I urgently need a dataset of Indian vehicles (autos, cars, trucks, buses, etc.), with some pedestrians in some of the images if possible. I was told to create a custom dataset by taking photos myself, but I don’t have enough time to do so. Does anyone have a similar dataset, or is there one available online? I just need around 500-600 images. PLSS HELPPP!!!

submitted by /u/Slow_Mo_1505

Q4 2025 Price Movements At Sephora Australia — SKU-Level Analysis Across Categories

Hi all, I’ve been tracking quarterly price movements at SKU level across beauty retailers and just finished a Q4 2025 cut for Sephora Australia.

Scope

  • Prices in AUD (pre-discount)
  • Categories across skincare, fragrance, makeup, haircare, tools & bath/body

Category averages (Q4)

  • Bath & Body: +6.0% (10 SKUs)
  • Fragrance: +4.5% (73)
  • Makeup: +3.3% (24)
  • Skincare: +1.7% (103)
  • Tools: +0.6% (13)
  • Haircare: -18.5% (10); the decline is driven by price cuts from Virtue Labs, GHD and Mermade Hair.

I’ve published the full breakdown, subcategory cuts and SKU-level tables at the link in the comments. Similar datasets for Singapore, Malaysia and HK are also available on the site.

submitted by /u/IntelligentHome2342

Best Resource For Managing Large Datasets?

I hope this is the best place to ask this question. What would be the best approach to managing a large dataset of about 60 million rows where several columns would need to be manipulated to either find duplicates or to perform calculations on financial columns? The end goal would be to produce a file with no duplicate rows and final figures. Thanks in advance!

submitted by /u/MsVee21

How Do I Access The AMIGOS Dataset For A Dissertation?

I’m trying to access the dataset and use it for my dissertation. I’m new to this kind of thing and I’m so confused. The website for it doesn’t work (eecs.qmul.ac.uk/…); it says service unavailable. It’s not temporary, as I’ve tried multiple times over months. I thought I’d check with the lovely men and women of Reddit to see if anyone has a solution. I need it soon!

submitted by /u/Smart_Luck7151