Hey, I am currently preparing the experiment for my master's thesis and am looking for datasets. My experiment will use LLMs as a baseline with different RAG variations. Data contamination is a big topic for LLMs, because if the LLM has already been trained on the data I want to use, then the whole experiment is pointless. The dataset I found on zenodo.org is for vulnerability detection.
Public and readable datasets are problematic, but what about downloadable datasets that do not have a preview on their site?
Should I be worried?
submitted by /u/Apprehensive_Win662