ProfiVox HMM

The ProfiVox HMM text-to-speech (TTS) converter is based on statistical machine learning and uses hidden Markov models (HMMs) to generate the parameters that represent the speech signal to be synthesized. Advances in computer technology have made it possible to realise this idea. No deep phonetic or linguistic knowledge is required. Speech melody and rhythm are also learned, so no signal post-processing is required. The synthesized waveform is produced at the output of a speech coder (vocoder).

Training is based on a large speech database (many hours of speech) recorded from several speakers. The algorithm determines the parameters for the middle speech sound of a quinphone (five-sound) sequence step by step. It takes into account the position of the examined element at word and sentence level, and also uses word boundaries and word length during training. The result of training is an optimized parameter database that is much smaller than the original speech database. Training needs to be done only once, but HMM-based training is a time-consuming and knowledge-intensive process.

During synthesis, ProfiVox HMM selects data from the parameter database based on the input text. The system pronounces both declarative sentences and questions correctly. Synthesis is fast and does not require many resources. The speech can be slowed down or sped up.

An advantage of this method is easy adaptation: a parameter database can be created from the voice of another person, and 10-20 minutes of newly recorded speech is enough for an adaptation. More details can be found here, in the summary of the PhD dissertation of Pál Bálint Tóth, who developed the system.
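The context information used during training (the five-sound window around each speech sound, its position in the word and the sentence, and the word length) can be illustrated with a small sketch. This is not the actual ProfiVox HMM label format, which is not described here. The `Context` type, the `quinphone_contexts` function, the `sil` padding symbol and the phone symbols in the example are all hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple

PAD = "sil"  # hypothetical padding symbol for positions outside the sentence


class Context(NamedTuple):
    quinphone: Tuple[str, ...]  # (left2, left1, centre, right1, right2)
    pos_in_word: int            # 0-based index of the centre sound in its word
    word_len: int               # number of sounds in that word
    word_index: int             # 0-based index of the word in the sentence
    num_words: int              # number of words in the sentence


def quinphone_contexts(words: List[List[str]]) -> List[Context]:
    """Return one Context per speech sound.

    The sentence is given as a list of words, each word being a list of
    phone symbols, so word boundaries are known from the input structure.
    """
    phones = [p for word in words for p in word]
    # Pad both ends so that every sound has two neighbours on each side.
    padded = [PAD, PAD] + phones + [PAD, PAD]

    contexts = []
    i = 0  # index of the current sound in the flat phone sequence
    for w_idx, word in enumerate(words):
        for p_idx in range(len(word)):
            window = tuple(padded[i:i + 5])  # centred on padded[i + 2]
            contexts.append(
                Context(window, p_idx, len(word), w_idx, len(words))
            )
            i += 1
    return contexts


# Example: a two-word sentence with made-up phone symbols.
sentence = [["j", "o:"], ["n", "a", "p", "o", "t"]]
contexts = quinphone_contexts(sentence)
```

In an HMM-based synthesizer, descriptions of this kind are typically what the trained models are indexed by: the parameters of the middle sound are looked up according to its neighbours and its position, rather than according to the sound alone.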

ProfiVox HMM voices

Listen to some synthesized voices in Hungarian