I’m assembling a small resource to evaluate and improve “idiomaticity” in LLM rewrites (outputs can be fluent but still feel literal).
For that, I’m looking for datasets of English idiomatic expressions with:

idiom text (canonical form if possible)
meaning
example sentences
ideally some frequency signal
licensing that allows research
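
To make the wish list above concrete, here is a rough sketch of the record shape I have in mind. All field names are my own placeholders rather than the schema of any existing dataset, and the frequency unit is deliberately left open.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IdiomRecord:
    """One idiom entry. Every field name here is a hypothetical placeholder."""
    idiom: str                                   # canonical form if available
    meaning: str                                 # short gloss of the figurative sense
    examples: list[str] = field(default_factory=list)  # example sentences
    frequency: Optional[float] = None            # any frequency signal; unit unspecified
    license: Optional[str] = None                # license string as given by the source

    def is_usable(self) -> bool:
        """True when idiom, meaning, and at least one example are present."""
        return bool(self.idiom.strip() and self.meaning.strip() and self.examples)


# Illustrative entry only; the example sentence is made up for this sketch.
record = IdiomRecord(
    idiom="kick the bucket",
    meaning="to die",
    examples=["The old tractor finally kicked the bucket last winter."],
)
```

Frequency and license are optional here because many sources probably won't carry them, while an entry without an example sentence is treated as incomplete.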

Questions

Are there any well-known public idiom corpora you’d recommend?
Any good frequency proxies you’ve used for idioms?
If you’ve built something similar: what fields ended up being most important?

If helpful, I can share the exact retrieval endpoint I’m using for testing, but mostly I’m looking for dataset pointers.

submitted by /u/Own-Importance3687

Looking For Public Datasets Of English Idioms (Idiom Text + Meaning + Example Sentences + Frequency If Possible)
