Tal Linzen

The Hebrew Blog Corpus

A 165-million-word corpus of blog posts, automatically part-of-speech tagged and morphologically disambiguated. Search the corpus (experimental interface; feedback is welcome).

There's also a frequency database based on the corpus. It includes orthographic form frequencies and lemma frequencies.

An Excel sheet with the 5,000 most common Hebrew lemmas for each part of speech is available as well.

The Python code used to scrape the website and build the corpus and search engine is available on GitHub.

Software

Lexvars: Contextual predictors for word recognition research, i.e., predictors that capture how the recognized word tends to behave in texts, such as morphological family entropy or verb subcategorization entropy (see paper). Lexvars includes a Python wrapper for the CELEX corpus.
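
To give a rough sense of what an entropy-based predictor measures, here is a minimal sketch that computes the Shannon entropy (in bits) of a frequency distribution over the members of a morphological family. It does not use the Lexvars API; the function name `entropy` and the word counts are made up for illustration, and the exact definition used in Lexvars may differ.

```python
import math

def entropy(counts):
    """Shannon entropy, in bits, of the distribution implied by raw counts."""
    counts = list(counts)
    total = sum(counts)
    # Zero counts contribute nothing to the entropy, so skip them.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical token counts for one morphological family (not real corpus data).
family_counts = {"think": 50, "thinker": 25, "thinking": 25}

h = entropy(family_counts.values())
print(f"Morphological family entropy: {h:.2f} bits")
```

A family whose frequency mass is spread evenly over many members gets a high entropy, while a family dominated by a single form gets an entropy close to zero. The same function could be applied to counts of a verb's subcategorization frames.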