{"id":40775,"date":"2026-05-03T00:27:44","date_gmt":"2026-05-02T22:27:44","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/seeking-a-dataset-of-english-lemmas-with-recognizability-scores\/"},"modified":"2026-05-03T00:27:44","modified_gmt":"2026-05-02T22:27:44","slug":"seeking-a-dataset-of-english-lemmas-with-recognizability-scores","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/seeking-a-dataset-of-english-lemmas-with-recognizability-scores\/","title":{"rendered":"Seeking A Dataset Of English Lemmas With Recognizability Scores"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>I checked out <a href=\"https:\/\/link.springer.com\/article\/10.3758\/s13428-018-1077-9\">the word prevalence dataset<\/a> of 62,000 lemmas. But it has some limitations:<\/p>\n<ul>\n<li>\n<p>It hasn&#8217;t been updated since 2019.<\/p>\n<\/li>\n<li>\n<p>It misses modern terms like TikTok.<\/p>\n<\/li>\n<li>\n<p>It doesn&#8217;t cover phrases.<\/p>\n<\/li>\n<\/ul>\n<p>I&#8217;ve <a href=\"https:\/\/github.com\/8ta4\/pun\/blob\/5260ff7960912935c46c4968a71ed0905a10ad84\/DONTREADME.md#vocabulary\">scored<\/a> about a million English entries from Wiktionary for recognizability. I built this for <a href=\"https:\/\/github.com\/8ta4\/pun\">a pun tool<\/a>. But I want to use <a href=\"https:\/\/github.com\/8ta4\/pun-data\/blob\/4b5a2c1eeb992d2c1b8faea2488768eaac6be9dc\/normalized.edn.gz\">the data<\/a> for a new language project.<\/p>\n<p>The dataset is too bloated because it&#8217;s full of inflected forms. Even if I set the recognizability threshold at 50 percent, I&#8217;m still looking at 100K words and 100K phrases. Going through a list that size is a waste of time. 
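<\/p>\n<p>For concreteness, the filter-and-split step described next could look like this minimal Python sketch (hypothetical inline data; assuming the scores load as a mapping from term to recognizability score and the lemma category as a set of strings):<\/p>\n<pre><code># Hypothetical sketch: keep lemma-category entries above a threshold,
# then split single words from multi-word phrases.
scores = {'run': 0.99, 'running': 0.95, 'TikTok': 0.9, 'piece of cake': 0.8, 'zymurgy': 0.1}
lemmas = {'run', 'TikTok', 'piece of cake', 'zymurgy'}  # 'running' is an inflected form, not a lemma
THRESHOLD = 0.5  # 50 percent recognizability cutoff
words, phrases = [], []
for term, score in scores.items():
    if term in lemmas and score >= THRESHOLD:
        (phrases if ' ' in term else words).append(term)
print(sorted(words))    # ['TikTok', 'run']
print(sorted(phrases))  # ['piece of cake']
<\/code><\/pre>\n<p>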
I need to filter the data through <a href=\"https:\/\/en.wiktionary.org\/wiki\/Category:English_lemmas\">the English lemmas category<\/a> from Wiktionary and split the single words from the multi-word phrases into separate lists.<\/p>\n<p>Since the hard part of scoring is done, the rest should be easy peasy lemma squeezy. I just want to avoid reinventing the wheel if I can.<\/p>\n<p>Before I spin up a separate repository to handle this, I&#8217;m checking if a similar dataset already exists. Has anyone seen a project that offers this?<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/8ta4\"> \/u\/8ta4 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1t22d7p\/seeking_a_dataset_of_english_lemmas_with\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1t22d7p\/seeking_a_dataset_of_english_lemmas_with\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>I checked out the word prevalence dataset of 62,000 lemmas. 
But it has some limitations: It hasn&#8217;t&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-40775","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/40775","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=40775"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/40775\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=40775"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=40775"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=40775"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}