Is There a Market for Expert-Annotated Coding Trajectory Datasets (Multi-Turn, Step-Level)?

I’m a senior software engineer (Clojure, Python, Rust, TypeScript/JavaScript, etc.) who works with LLMs daily for real development work, mainly on side projects. I’ve been building tooling to capture and annotate these sessions — not just the final code, but the full multi-turn trajectory with per-step expert annotations: correctness, engineering quality rating, error taxonomy (wrong approach, bad idiom, overengineering, etc.), and how errors were recovered (model self-corrected, expert redirected, expert rewrote).

The closest existing thing I’m aware of is PRM800K for math reasoning, but nothing equivalent exists publicly for code. SWE-bench has pass/fail outcomes but no step-level human quality judgments. Here’s what I want to know:

  1. Is anyone actually buying this kind of data? I know Scale AI, Surge, etc. hire coders for annotation work, but is there demand for independently produced, expert-annotated trajectory datasets?
  2. Is the implicit signal from product usage (accepting/rejecting model outputs in tools like Copilot, Claude Code, Cursor) making explicit annotation redundant? Labs get millions of implicit preference signals for free from their users. Does manual expert annotation add something that’s worth paying for?
  3. Does niche language coverage (e.g., Clojure, Haskell) change the calculus? Underrepresented languages have less implicit data, but does that make expert trajectories in those languages more valuable, or is the buyer pool too small to matter in the first place?
  4. Am I better off just contracting with annotation vendors directly? Rather than selling a dataset, should I be applying to Scale/Surge/DataAnnotation with this tooling and expertise? Or is the tooling unnecessary for those platforms too?

For context, each annotated session includes: the full transcript (readable + machine-parseable), git diffs tied to specific turns, structured YAML annotations with a documented rubric, and session metadata (model used, duration, complexity). I’m still working on the annotation schema, but it’s “informed” by PRM800K, HelpSteer2, and UltraFeedback conventions.
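To make the structure concrete, here’s a minimal sketch of what one per-step annotation record might look like. The field names, rating scale, and taxonomy values below are hypothetical placeholders (the actual schema is still in progress); only the annotation dimensions — correctness, quality, error taxonomy, recovery mode — come from the description above:

```python
from dataclasses import dataclass

# Hypothetical vocabularies -- illustrative, not the real rubric.
ERROR_TAXONOMY = {"wrong_approach", "bad_idiom", "overengineering", "none"}
RECOVERY_MODES = {"model_self_corrected", "expert_redirected", "expert_rewrote", "none"}

@dataclass
class StepAnnotation:
    turn: int                 # index into the session transcript
    correct: bool             # did this step move toward a correct solution?
    quality: int              # engineering-quality rating, 1 (poor) to 5 (excellent)
    error_type: str = "none"  # one entry from the error taxonomy
    recovery: str = "none"    # how (if at all) the error was recovered

    def validate(self) -> None:
        # Reject records that fall outside the documented rubric.
        if not 1 <= self.quality <= 5:
            raise ValueError(f"quality out of range: {self.quality}")
        if self.error_type not in ERROR_TAXONOMY:
            raise ValueError(f"unknown error type: {self.error_type}")
        if self.recovery not in RECOVERY_MODES:
            raise ValueError(f"unknown recovery mode: {self.recovery}")

# Example: an overengineered step that the expert redirected.
step = StepAnnotation(turn=3, correct=False, quality=2,
                      error_type="overengineering", recovery="expert_redirected")
step.validate()  # no exception: the record conforms to the sketch's rubric
```

In the actual dataset these records would live in the YAML annotation files alongside the transcript and per-turn git diffs.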

I’m trying to figure out if this is a real product or if I’m building something the market doesn’t need. Honest feedback appreciated.

submitted by /u/emfuhsiss
