{"id":39655,"date":"2026-03-15T07:14:42","date_gmt":"2026-03-15T06:14:42","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/open-source-tool-for-schema-driven-synthetic-data-generation-for-testing-data-pipelines\/"},"modified":"2026-03-15T07:14:42","modified_gmt":"2026-03-15T06:14:42","slug":"open-source-tool-for-schema-driven-synthetic-data-generation-for-testing-data-pipelines","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/open-source-tool-for-schema-driven-synthetic-data-generation-for-testing-data-pipelines\/","title":{"rendered":"Open-source Tool For Schema-driven Synthetic Data Generation For Testing Data Pipelines"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>Testing data pipelines with realistic data is something I\u2019ve struggled with in several projects. In many environments, we can\u2019t use production data because of privacy constraints, and small handcrafted datasets rarely capture the complexity of real schemas (relationships, constraints, distributions, etc.). <\/p>\n<p>I\u2019ve been experimenting with a schema-driven approach to synthetic data generation and wanted to get feedback from others working on data engineering systems. <\/p>\n<p>The idea is to treat the <strong>schema as the source of truth<\/strong> and attach generation rules to it. From that, you can generate datasets that mirror the structure of production systems while remaining reproducible. <\/p>\n<p>Some of the design ideas I\u2019ve been exploring: <\/p>\n<p>\u2022 define tables, columns, and relationships in a schema definition <\/p>\n<p>\u2022 attach generation rules per column (faker, uuid, sequence, range, weighted choices, etc.) 
<\/p>\n<p>\u2022 validate schemas before generating data <\/p>\n<p>\u2022 generate datasets with a run manifest that records configuration and schema version <\/p>\n<p>\u2022 track lineage so datasets can be reproduced later <\/p>\n<p>I built a small open-source tool around this idea while experimenting with the approach. <\/p>\n<p>Tech stack is fairly straightforward: <\/p>\n<p>Python (FastAPI) for the backend and a small React\/Next.js UI for editing schemas and running generation jobs. <\/p>\n<p>If you\u2019ve worked on similar problems, I\u2019m curious about a few things: <\/p>\n<p>\u2022 How do you currently generate realistic test data for pipelines? <\/p>\n<p>\u2022 Do you rely on anonymised production data, synthetic data, or fixtures? <\/p>\n<p>\u2022 What features would you expect from a synthetic data tool used in data engineering workflows? <\/p>\n<p>Repo for reference if anyone wants to look at the implementation: <\/p>\n<p><a href=\"https:\/\/github.com\/ojasshukla01\/data-forge\">https:\/\/github.com\/ojasshukla01\/data-forge<\/a><\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/Business-Quantity-15\"> \/u\/Business-Quantity-15 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1ru6899\/opensource_tool_for_schemadriven_synthetic_data\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1ru6899\/opensource_tool_for_schemadriven_synthetic_data\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>Testing data pipelines with realistic data is something I\u2019ve struggled with in several projects. In many environments,&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-39655","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/39655","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=39655"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/39655\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=39655"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=39655"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=39655"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}