Free Dataset: 3250 Graded LLM Runs On Whether Models Trust In-context Docs Over The Actual Code

I ran a benchmark for a tool I built and figured the dataset might be useful to others. It took ~$100 of API credits to produce.

The test is simple: I give the agent a document describing a piece of code it can’t directly see, then record whether it double-checks the doc against the real code or just takes the doc’s word for it. The doc is sometimes accurate and sometimes out of date, so the data captures how each model handles documentation it can and can’t trust. The writeup covers what I found; the dataset lets you check it or look for your own patterns.
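If you want to look for your own patterns, a minimal sketch of the kind of tally this setup supports might look like the following. The field names (`model`, `doc_accurate`, `verified`) and the sample rows are made up for illustration and are not the dataset's actual schema, so check the real column names before adapting it.

```python
from collections import defaultdict

# Hypothetical records: field names and values are assumptions, not the real schema.
runs = [
    {"model": "model-a", "doc_accurate": True, "verified": False},
    {"model": "model-a", "doc_accurate": False, "verified": True},
    {"model": "model-a", "doc_accurate": False, "verified": False},
    {"model": "model-b", "doc_accurate": False, "verified": True},
]


def verification_rates(runs):
    """Map (model, doc_accurate) to the fraction of runs where the agent checked the code."""
    counts = defaultdict(lambda: [0, 0])  # [verified count, total count]
    for run in runs:
        key = (run["model"], run["doc_accurate"])
        counts[key][0] += int(run["verified"])
        counts[key][1] += 1
    return {key: verified / total for key, (verified, total) in counts.items()}


rates = verification_rates(runs)
for (model, doc_accurate), rate in sorted(rates.items()):
    print(f"{model} doc_accurate={doc_accurate}: verified in {rate:.0%} of runs")
```

Splitting by whether the doc was accurate matters here: a model that verifies everything and one that verifies only when the doc is stale would look the same in a single overall rate.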

Dataset
Outcome

Star the repo if it’s useful. Cheers.

submitted by /u/AverageGradientBoost