When To Worry About Data Contamination In LLM Experiments?

Hey, I am currently preparing my master's thesis experiment and am looking for datasets. My experiment will use LLMs as a baseline with different RAG variations. Data contamination is a big topic for LLMs, because if the LLM has already been trained on the data I want to use, then the whole experiment is pointless. The dataset I found on zenodo.org is for vulnerability detection.

Datasets that are public and readable directly on the web are problematic, but what about datasets that can only be downloaded and have no preview on their site?

Should I be worried?

submitted by /u/Apprehensive_Win662
