Hi,
I’ve just released my latest work: CodeReality.
For now, you can access a 19GB evaluation subset, designed to give a concrete idea of the structure and value of the full dataset, which exceeds 3TB.
- Dataset link: CodeReality on Hugging Face
- Inside you’ll find:
  - the complete analysis, which was also run on the full 3TB dataset,
  - benchmark results for code completion, bug detection, license detection, and retrieval,
  - documentation and notebooks to help with experimentation.
I’m currently working on making the full dataset available directly on Hugging Face.
In the meantime, if you’re interested in an early release or preview, feel free to contact me at [vincenzo.galllo77@hotmail.com](mailto:vincenzo.galllo77@hotmail.com).
submitted by /u/CodeStackDev