Speech databases

Databases of various speech representations for development

Speech technology requires datasets that represent speech in some form; such datasets are called SPEECH DATABASES. The structure and size of speech databases have closely tracked the development of computers and memory capacity. In the 1960s, speech synthesis was performed on the basis of phonetic data: the phonetic library of the first Hungarian formant-based speech synthesizer (HungaroVOX) consisted of only 370 building elements and required 1 kbyte of memory. Later, databases of speech-sound combinations were created for speech synthesis, containing read-aloud but meaningless short sequences (logatom databases); their total size was already a few Mbytes. The Hungarian ProfiVox diad (diphone) and triad (triphone) speech synthesizers built the speech signal from such waveform elements.

Speech databases containing real speech were first created for automatic speech recognition (ASR), to handle the variety of individual pronunciations; one such database was BABEL. Later, speech databases were also created from live speech for statistical speech synthesis (ProfiVox-HMM). These databases were precisely annotated and labeled. Since 2010, researchers have been working with speech databases many gigabytes in size.

Designing and creating speech databases is a complex job. Planning the text to be read, making the audio recordings, and labeling them takes a great deal of time and requires precise work. With the development of machine learning algorithms it has become possible to shorten this work, and in the future speech databases may even be created without human supervision.
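The labeling step described above typically yields time-aligned phone annotations for each recording. A minimal sketch of what such a labeled record might look like, and how the total duration of a database could be tallied from it (the record layout and field names are illustrative assumptions, not the format of any database named here):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Utterance:
    """One labeled recording in a hypothetical speech database."""
    utt_id: str
    transcript: str
    # Time-aligned phone labels: (phone symbol, start in seconds, end in seconds)
    phones: List[Tuple[str, float, float]] = field(default_factory=list)

    def duration(self) -> float:
        # The end time of the last phone label gives the utterance length.
        return self.phones[-1][2] if self.phones else 0.0

def total_duration(database: List[Utterance]) -> float:
    """Sum the labeled duration of every utterance in the database."""
    return sum(u.duration() for u in database)
```

A usage example: two utterances whose labels end at 0.8 s and 1.2 s give a database total of 2.0 s. Real databases store the audio separately and keep such label files (e.g., one per recording) alongside it.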

The phonetic database

| Speech database elements specified with phonetic data

Waveform database

| Diad elements

Waveform database

| Diad-triad combined

Speech databases

| For machine speech

PPBA

| Hungarian Parallel Speech Database

The BEA speech database

| Institute of Linguistics of the Hungarian Academy of Sciences