I’m assembling a small resource to evaluate and improve “idiomaticity” in LLM rewrites (outputs can be fluent but still feel literal).
For that, I’m looking for datasets of English idiomatic expressions with:
- idiom text (canonical form if possible)
- meaning
- example sentences
- ideally some frequency signal
- licensing that allows research
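To make the wish list above concrete, here is a minimal sketch of the record shape I have in mind. The class name `IdiomEntry`, the field names, and the `is_usable` helper are all hypothetical placeholders for illustration, not taken from any existing dataset, and the sample idiom is only there to show the shape.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IdiomEntry:
    """One idiom record (hypothetical schema, for illustration only)."""
    idiom: str                          # idiom text, canonical form if possible
    meaning: str                        # plain-language gloss
    examples: list[str] = field(default_factory=list)  # example sentences
    frequency: Optional[float] = None   # any frequency signal; None if unknown
    license: Optional[str] = None       # license of the source dataset, if known


def is_usable(entry: IdiomEntry) -> bool:
    """Return True when the entry has idiom text, a meaning, and an example."""
    return bool(entry.idiom.strip() and entry.meaning.strip() and entry.examples)


# Illustrative sample record; frequency and license are left unset.
entry = IdiomEntry(
    idiom="kick the bucket",
    meaning="to die",
    examples=["The old farmer finally kicked the bucket last winter."],
)
```

Frequency and license are optional here because those are the fields I expect to be missing most often; a dataset that fills them in would be a bonus.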
Questions
- Are there any well-known public idiom corpora you’d recommend?
- Any good frequency proxies you’ve used for idioms?
- If you’ve built something similar: what fields ended up being most important?
If helpful, I can share the exact retrieval endpoint I’m using for testing, but mostly I’m looking for dataset pointers.
submitted by /u/Own-Importance3687