Hungarian language statistics

Hungarian Word Statistics

Hungarian is an agglutinative language: several suffixes can be attached to a single word. As a result, word length ranges from single-syllable words to very long word forms. The statistics were compiled from printed text. A word was defined as the sequence of letters between two spaces in the orthographic text.

Word statistics as a function of syllable count.

The linguistic material for the measurement was a text corpus of 80 million words, from which we selected the distinct word forms (by sorting their written forms). Two words were considered different if the letter sequences between two spaces differed by at least one character. As a result, we obtained a list of 1.5 million different word forms, each occurring only once. The number of syllables in each word form was then determined, so every word was represented by a number. The statistics are as follows.

[Figure: distribution of distinct word forms by number of syllables]

The distribution by number of syllables shows that, among the word forms found in Hungarian texts, 4- and 5-syllable forms are the most numerous, followed by 6- and 3-syllable forms. Word forms with 1 and 2 syllables and with 8 and 9 syllables are the least numerous.
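The counting procedure described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the measurement: the text does not say how syllables were determined, so the sketch assumes one syllable per vowel letter, and the names `VOWELS`, `syllable_count` and `word_form_histogram` are hypothetical.

```python
from collections import Counter

# Assumption: each Hungarian vowel letter marks exactly one syllable.
VOWELS = set("aáeéiíoóöőuúüű")


def syllable_count(word: str) -> int:
    """Count the vowel letters in a word (used here as the syllable count)."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


def word_form_histogram(text: str) -> Counter:
    """Map syllable count -> number of distinct word forms.

    A word is the character sequence between whitespace, and two forms
    are different if they differ in at least one character.
    """
    distinct_forms = set(text.split())  # each word form is kept only once
    return Counter(syllable_count(form) for form in distinct_forms)


sample = "a ház a házak a házakban és a kertekben"
histogram = word_form_histogram(sample)
for syllables in sorted(histogram):
    print(syllables, histogram[syllables])
```

In the sample, the repeated article "a" is counted once, because the statistic is taken over distinct word forms and not over running text.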

Frequency of Hungarian words by syllable numbers in texts

The material for the measurement was the Hungarian National Text Database (150 million words). The articles a and az were excluded from the measurement. The statistical data show that two-syllable words are the most commonly used in texts.
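Unlike the word-form statistics, this measurement counts every occurrence of a word in running text. A minimal sketch under the same assumption (one syllable per vowel letter) might look as follows; `ARTICLES` and `token_histogram` are hypothetical names, and treating capitalized articles such as "Az" as excluded is an assumption as well.

```python
from collections import Counter

# Assumption: each Hungarian vowel letter marks exactly one syllable.
VOWELS = set("aáeéiíoóöőuúüű")
# The articles excluded from the measurement.
ARTICLES = {"a", "az"}


def syllable_count(word: str) -> int:
    """Count the vowel letters in a word (used here as the syllable count)."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


def token_histogram(text: str) -> Counter:
    """Map syllable count -> number of word occurrences in running text."""
    counts = Counter()
    for token in text.split():
        if token.lower() in ARTICLES:
            continue  # skip the articles, also when capitalized
        counts[syllable_count(token)] += 1
    return counts


sample = "az alma a kertben van és a fa alatt áll"
frequencies = token_histogram(sample)
for syllables in sorted(frequencies):
    print(syllables, frequencies[syllables])
```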

[Figure: frequency of Hungarian words in texts by number of syllables]

Hungarian Letter Statistics in Texts

The linguistic material of the study is the full text of the 2005 version of the Hungarian National Text Database, containing 187.6 million words from the fields of press, literature, science, administration and personal texts. Computer algorithms were developed for collecting, sorting and measuring the data. The basic unit examined was the word (a sequence of letters between two spaces). The distribution of Hungarian letters in texts is shown in the table.

[Table: distribution of Hungarian letters in texts]
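A letter distribution of this kind can be computed along the following lines. The sketch is illustrative only and rests on a stated assumption: every alphabetic character is counted separately, so the Hungarian digraphs and the trigraph (cs, dz, dzs, gy, ly, ny, sz, ty, zs) are counted as their component characters. The text does not say how the original measurement handled them. The function name `letter_distribution` is hypothetical.

```python
from collections import Counter


def letter_distribution(text: str) -> dict:
    """Return the relative frequency of each letter, most frequent first.

    Assumption: letters are single characters, so digraphs such as "sz"
    are counted as "s" and "z". Non-alphabetic characters are ignored.
    """
    letters = [
        ch
        for word in text.split()  # the word is the basic unit examined
        for ch in word.lower()
        if ch.isalpha()
    ]
    counts = Counter(letters)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: n / total for letter, n in counts.most_common()}


sample = "alma fa"
for letter, share in letter_distribution(sample).items():
    print(letter, round(share, 3))
```

Counting digraphs as single letters would require segmenting each word first, and that segmentation is ambiguous for some compounds, so the simpler character count is used here.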