Does anyone have a list of English bigrams and/or trigrams extracted from the OpenSubtitles dataset?
Preferably under a Creative Commons license.
So far I’m only aware of this single-word frequency list: https://github.com/hermitdave/FrequencyWords
submitted by /u/CheBiblioteca