I’ve been working on a side project and ended up compiling a dataset that may be useful beyond what I originally needed it for, so I’m considering releasing it publicly.
At a high level, the dataset contains:
- structured records collected over a multi-year period
- consistent timestamps and identifiers
- minimal preprocessing (basic cleaning + deduplication only)
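For context, the "basic cleaning + deduplication" step is roughly along these lines. This is a minimal sketch, assuming each record is a dict with `id` and `timestamp` fields (the field names and `clean_and_dedupe` helper are illustrative, not the actual pipeline):

```python
from datetime import datetime, timezone

def clean_and_dedupe(records):
    """Drop records missing key fields, normalize timestamps,
    and keep the first occurrence of each (id, timestamp) pair."""
    seen = set()
    out = []
    for rec in records:
        if rec.get("id") is None or rec.get("timestamp") is None:
            continue  # basic cleaning: discard incomplete rows
        ts = datetime.fromisoformat(rec["timestamp"]).replace(tzinfo=timezone.utc)
        key = (rec["id"], ts)
        if key in seen:
            continue  # deduplication: exact (id, timestamp) repeat
        seen.add(key)
        out.append({**rec, "timestamp": ts})
    return sorted(out, key=lambda r: r["timestamp"])

raw = [
    {"id": "a", "timestamp": "2021-01-02", "value": 1},
    {"id": "a", "timestamp": "2021-01-02", "value": 1},   # exact duplicate
    {"id": "b", "timestamp": "2021-01-01", "value": 2},
    {"id": None, "timestamp": "2021-01-03", "value": 3},  # missing id
]
clean = clean_and_dedupe(raw)
print(len(clean))  # 2 records survive: the duplicate and the incomplete row are dropped
```

Shipping the raw records alongside something like this script would also let users audit or rerun the cleaning themselves.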
It’s not tied to a specific paper or product; it’s more something that could support exploratory analysis, modeling, or benchmarking, depending on the use case.
Before publishing, I wanted to sanity-check with this community:
- what details do you usually look for to judge dataset quality?
- do you prefer a single lightly preprocessed release, or separate raw and processed versions?
- anything that would immediately make this more usable for research?
Happy to share more specifics if there’s interest, and open to feedback before release.
submitted by /u/crowpng