1M+ Explainable Linguistic Typos (Traceable JSONL, C-Based Engine)

I’ve built a “Mutation Engine” that currently generates 17 types of linguistically inspired errors (metathesis, transposition, fortition, etc.), each with a full audit trail.

The Stats:

  • Scale: 1M rows generated in ~15 seconds (written in C; about 0.75 microseconds per operation).
  • Traceability: Every typo includes the reasoning behind it and a step-by-step log.
  • Format: JSONL.
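To give a rough idea of what a traceable mutation could look like, here is a minimal sketch of one operator (adjacent transposition) that records what it did and formats the result as a JSONL line. This is not the engine's actual code: the `Mutation` struct, the function names, and the JSON field names are all assumptions made up for illustration, and the sketch only handles plain ASCII words.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record layout; the real dataset's fields are not described in the post. */
typedef struct {
    char original[64];
    char mutated[64];
    const char *rule;   /* name of the operator that was applied */
    size_t position;    /* index of the first affected character */
} Mutation;

/* Swap the characters at pos and pos + 1 (adjacent transposition).
   Returns 0 on success, -1 if the word does not fit or pos is out of range. */
int transpose(const char *word, size_t pos, Mutation *out)
{
    assert(word != NULL && out != NULL);
    size_t len = strlen(word);
    if (len >= sizeof out->original || len < 2 || pos > len - 2)
        return -1;
    memcpy(out->original, word, len + 1);
    memcpy(out->mutated, word, len + 1);
    out->mutated[pos] = word[pos + 1];
    out->mutated[pos + 1] = word[pos];
    out->rule = "transposition";
    out->position = pos;
    return 0;
}

/* Format one record as a single JSONL line.
   Assumes plain ASCII letters, so no JSON escaping is performed.
   Returns the number of characters written, or -1 if buf is too small. */
int to_jsonl(const Mutation *m, char *buf, size_t cap)
{
    int n = snprintf(buf, cap,
        "{\"original\":\"%s\",\"mutated\":\"%s\",\"rule\":\"%s\",\"position\":%zu}\n",
        m->original, m->mutated, m->rule, m->position);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Calling `transpose("their", 2, &m)` and then `to_jsonl(&m, line, sizeof line)` produces a line whose `mutated` field is `thier`, with the rule name and position stored alongside it so the typo can be traced back to its cause.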

Currently, it’s English-only and has one known minor quirk: the duplication operator occasionally emits a \u0000 (null character).
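Until that quirk is fixed, affected rows can be screened out before training. The sketch below is one possible check, assuming the null character shows up in the file as the JSON escape sequence `\u0000` inside a string; `has_null_escape` is a hypothetical helper name, not part of the engine.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if a JSONL line contains the JSON escape \u0000, else 0.
   Escape pairs are consumed together, so an escaped backslash followed
   by the literal text "u0000" is not reported as a null character. */
int has_null_escape(const char *line)
{
    assert(line != NULL);
    for (const char *p = line; *p != '\0'; p++) {
        if (*p != '\\')
            continue;
        if (strncmp(p + 1, "u0000", 5) == 0)
            return 1;
        if (p[1] != '\0')
            p++;  /* skip the character that this backslash escapes */
    }
    return 0;
}
```

Reading the file line by line and dropping every line for which `has_null_escape` returns 1 leaves only rows without the stray null character.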

Link here.

I’m curious whether this would be useful for anyone’s training pipelines or similar work. I can also generate custom sets if needed.

submitted by /u/Nitro224