Seeking a dataset of English lemmas with recognizability scores

I checked out the word prevalence dataset of 62,000 lemmas, but it has some limitations:

It hasn’t been updated since 2019.
It misses modern terms like TikTok.
It doesn’t cover phrases.

I’ve scored about a million English entries from Wiktionary for recognizability. I built the scores for a pun tool, but now I want to reuse the data for a new language project.

The dataset is bloated with inflected forms. Even with the recognizability threshold set at 50 percent, I’m still looking at 100K words and 100K phrases, which is too many to go through by hand. I need to filter the data against Wiktionary’s English lemmas category and split single words and multi-word phrases into separate lists.
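Here is a rough sketch of that filtering step in Python. It is only an illustration under some assumptions: the scores are already loaded as a dict mapping each entry to a value between 0 and 1, the titles from the English lemmas category are available as a plain collection of strings, and "phrase" just means the entry contains whitespace (so hyphenated entries count as single words). The function name `split_lemmas` and the sample data are made up for the example.

```python
from typing import Iterable


def split_lemmas(
    scores: dict[str, float],
    lemmas: Iterable[str],
    threshold: float = 0.5,
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Keep scored entries that are lemmas at or above the threshold,
    then split them into single words and multi-word phrases."""
    lemma_set = set(lemmas)  # set membership keeps the filter fast
    words: list[tuple[str, float]] = []
    phrases: list[tuple[str, float]] = []

    for entry, score in scores.items():
        # Drop inflected forms (not in the lemma set) and low scores.
        if entry not in lemma_set or score < threshold:
            continue
        # Assumption: an entry containing whitespace is a phrase.
        if len(entry.split()) > 1:
            phrases.append((entry, score))
        else:
            words.append((entry, score))

    # Most recognizable first; ties broken alphabetically.
    def order(pair: tuple[str, float]) -> tuple[float, str]:
        return (-pair[1], pair[0])

    words.sort(key=order)
    phrases.sort(key=order)
    return words, phrases


# Hypothetical sample data, not taken from the real dataset.
sample_scores = {
    "run": 0.99,
    "running": 0.98,          # inflected form, not in the lemma list
    "TikTok": 0.90,
    "kick the bucket": 0.80,
    "zyzzyva": 0.10,          # lemma, but below the threshold
}
sample_lemmas = ["run", "TikTok", "kick the bucket", "zyzzyva"]

words, phrases = split_lemmas(sample_scores, sample_lemmas)
```

With the sample data, `words` holds "run" and "TikTok" and `phrases` holds "kick the bucket". The inflected form and the low-scoring lemma are both dropped.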

Since the hard part of scoring is done, the rest should be easy peasy lemma squeezy. I just want to avoid reinventing the wheel if I can.

Before I spin up a separate repository to handle this, I’m checking if a similar dataset already exists. Has anyone seen a project that offers this?

submitted by /u/8ta4