[Synthetic][self-promotion] Released a Synthetic Multimodal PHI De-identification Benchmark: Streaming Audit Log With 5 Policy Comparisons

Most PHI datasets evaluate masking on static single-modality documents. This one is different.

It captures per-event masking decisions across a simulated longitudinal stream: the same subject appears across clinical notes, ASR transcripts, imaging proxies, waveform data, and audio metadata over time. The idea is to evaluate how re-identification risk accumulates across events rather than within a single record.
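To make "accumulates across events" concrete, here is a minimal sketch. It is not the benchmark's code: the field names (`subject_id`, `modality`, `exposed`) and the sample events are made up for illustration, and "risk" is reduced to a count of distinct quasi-identifiers seen so far for a subject.

```python
from collections import defaultdict

# Hypothetical event stream. Field names and values are illustrative only
# and are NOT the benchmark's actual schema.
events = [
    {"subject_id": "S1", "modality": "clinical_note", "exposed": {"age", "zip3"}},
    {"subject_id": "S1", "modality": "asr_transcript", "exposed": {"zip3", "employer"}},
    {"subject_id": "S2", "modality": "clinical_note", "exposed": {"age"}},
    {"subject_id": "S1", "modality": "audio_metadata", "exposed": {"device_id"}},
]

def cumulative_exposure(stream):
    """Yield (subject_id, per_event_count, cumulative_count) for each event."""
    seen = defaultdict(set)  # subject_id -> quasi-identifiers exposed so far
    for event in stream:
        sid = event["subject_id"]
        seen[sid] |= event["exposed"]
        yield sid, len(event["exposed"]), len(seen[sid])

trace = list(cumulative_exposure(events))
for sid, per_event, cumulative in trace:
    print(sid, per_event, cumulative)
```

No single S1 event exposes more than two identifiers, but by the third S1 event the stream has exposed four distinct ones. That gap between the per-event and cumulative view is what a static, per-document evaluation would miss.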

Five policies are included for comparison: raw, weak, pseudo, redact, and adaptive. The adaptive controller is the interesting one: it escalates masking strength only when cumulative exposure actually justifies it.
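A rough sketch of what "escalate only when cumulative exposure justifies it" could look like. This is an assumption-heavy toy, not the controller in the repo: the thresholds, the per-event risk scores, and the accumulation rule (`1 - prod(1 - r)`) are all hypothetical choices made for the example.

```python
# Fixed policies named in the post, ordered from weakest to strongest masking.
LEVELS = ["raw", "weak", "pseudo", "redact"]

# Hypothetical thresholds on a cumulative exposure score in [0, 1].
# Crossing each threshold moves the controller up one masking level.
THRESHOLDS = [0.25, 0.5, 0.75]

def adaptive_policy(event_risks):
    """Return the masking level chosen for each event in one subject's stream.

    event_risks: per-event risk scores in [0, 1] (assumed, illustrative inputs).
    """
    exposure = 0.0
    chosen = []
    for risk in event_risks:
        # Assumed accumulation rule: treat events as independent chances of
        # re-identification, so exposure = 1 - prod(1 - risk_i).
        exposure = 1.0 - (1.0 - exposure) * (1.0 - risk)
        level = sum(exposure >= t for t in THRESHOLDS)
        chosen.append(LEVELS[level])
    return chosen

decisions = adaptive_policy([0.1, 0.2, 0.3, 0.5])
print(decisions)
```

With these made-up inputs the controller leaves the first event raw and only steps up as exposure builds. Because the score here never decreases, neither does the masking level; a real controller might also credit the masking it has already applied.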

The dataset is fully open, with no data use agreement (DUA) required. Everything runs on synthetic data, with no real patient records anywhere.

Hugging Face: https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark

Code to regenerate: https://github.com/azithteja91/phi-exposure-guard

Happy to answer questions on the schema or the benchmark design.

submitted by /u/Visual_Music_4833