I’ve built a “Mutation Engine” that currently generates 17 kinds of linguistically inspired errors (metathesis, transposition, fortition, etc.), each with a full audit trail.
The Stats:
- Scale: 1M rows generated in ~15 seconds (written in C, about 0.75 microseconds per operation).
- Traceability: Every typo includes the logical reasoning and step-by-step logs.
- Format: JSONL.
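To give a feel for what a traced mutation could look like, here is a minimal sketch of a single operator (adjacent transposition) and the JSONL record it might produce. This is illustrative only and not the engine's actual code: the function names `transpose_adjacent` and `format_record` and the field names `original`, `mutated`, `op`, and `log` are all assumptions, and the formatter assumes plain ASCII words that need no JSON escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical operator: swap the characters at positions i and i+1.
   Returns 0 on success, -1 if the swap is out of range or dst is too small. */
int transpose_adjacent(const char *src, size_t i, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);
    size_t len = strlen(src);
    if (len + 1 > dst_size || i + 1 >= len) {
        return -1;
    }
    memcpy(dst, src, len + 1); /* copy including the terminator */
    dst[i] = src[i + 1];
    dst[i + 1] = src[i];
    return 0;
}

/* Format one JSONL record (one JSON object, no trailing newline).
   Field names are assumptions; inputs are assumed to be plain ASCII
   words containing nothing that needs JSON escaping.
   Returns 0 on success, -1 if the record does not fit in out. */
int format_record(const char *original, const char *mutated, size_t i,
                  char *out, size_t out_size)
{
    int n = snprintf(out, out_size,
                     "{\"original\":\"%s\",\"mutated\":\"%s\","
                     "\"op\":\"transposition\","
                     "\"log\":[\"swap index %zu with index %zu\"]}",
                     original, mutated, i, i + 1);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Calling `transpose_adjacent("form", 1, ...)` yields `from`, and `format_record` then writes a single line that records both words, the operator name, and the step taken, so each row can be audited on its own.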
Currently, it’s English-only and has a known minor quirk with the duplication operator (it occasionally emits a \u0000 null character).
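Until the null-character quirk is fixed, affected rows can be filtered out on the consumer side. The sketch below is one possible check, assuming the stray character shows up in the JSONL as the six-character JSON escape `\u0000`; the function name `has_nul_escape` is hypothetical. It walks the escapes one at a time so that an escaped backslash followed by the literal text `u0000` is not flagged by mistake.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the JSONL line contains the JSON escape \u0000, else 0.
   Escapes are consumed pairwise, so "\\\\u0000" in C source (an escaped
   backslash followed by plain text u0000) is not reported. */
int has_nul_escape(const char *line)
{
    assert(line != NULL);
    for (size_t i = 0; line[i] != '\0'; i++) {
        if (line[i] != '\\') {
            continue;
        }
        if (line[i + 1] == '\0') {
            break; /* dangling backslash at end of line */
        }
        if (strncmp(line + i + 1, "u0000", 5) == 0) {
            return 1;
        }
        i++; /* skip the character that this backslash escapes */
    }
    return 0;
}
```

A loader could call this on each line and skip or log the rows where it returns 1.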
I’m curious whether this would be useful for anyone’s training pipelines or similar work. I can also generate custom sets if needed.
submitted by /u/Nitro224