Siberian Federal University: SSIT master's student creates an algorithm for speech recognition from video
Anton Dzyuba, a master's student at the School of Space and Information Technology (SSIT) of Siberian Federal University, has taught neural networks to read lips in video. The experimental studies were carried out on independently collected videos of Russian-speaking speakers.
Speech recognition is performed in two stages. First, the face is detected in each frame of the video sequence and the lip region is extracted using Haar features. The sequence of lip-region frames is then fed into deep convolutional and recurrent neural networks that recognize speech visemes.
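The article does not disclose the implementation details, but a minimal sketch of such a two-stage pipeline could look as follows, assuming OpenCV's standard Haar cascade for face detection and a simple CNN+GRU classifier in PyTorch. The lip-region heuristic (lower third of the face box), the frame size, the network dimensions and the seven-word vocabulary size are illustrative assumptions, not the author's actual configuration.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

# Stock frontal-face Haar cascade shipped with OpenCV
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def extract_lip_frames(video_path, size=(64, 32)):
    """Stage 1: detect the face in each frame and crop an (assumed) lip region."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
        lips = gray[y + 2 * h // 3 : y + h, x : x + w]       # lower third of the face box
        frames.append(cv2.resize(lips, size))                # size is (width, height)
    cap.release()
    # Shape (T, 1, 32, 64), scaled to [0, 1]
    return torch.from_numpy(np.stack(frames)).float().unsqueeze(1) / 255.0

class LipReader(nn.Module):
    """Stage 2: per-frame CNN features followed by a recurrent layer over time."""
    def __init__(self, num_words=7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        # 32 channels x 8 x 16 spatial positions after two pooling steps
        self.rnn = nn.GRU(input_size=32 * 8 * 16, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, num_words)

    def forward(self, frames):                  # frames: (T, 1, H, W)
        feats = self.cnn(frames).unsqueeze(0)   # (1, T, features)
        _, hidden = self.rnn(feats)
        return self.head(hidden[-1])            # logits over the word vocabulary
```

Splitting the model into a per-frame CNN followed by a GRU mirrors the "convolutional and recurrent" structure described above: the CNN encodes the lip shape in each frame, and the recurrent layer aggregates how that shape changes over time before the final word is predicted.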
The experimental studies used a dataset of 768 utterances spoken by different speakers. The videos were recorded independently, since the experiment required Russian-language speakers. The utterances are annotated with the same labels as the training dataset. The test words were “бегу” (I run), “пила” (saw), “милый” (dear), “усы” (moustache), “вулкан” (volcano), “банан” (banana) and “тонуть” (to sink). The best articulation-based recognition accuracy was 93.7%, for the word “банан”, and the average accuracy was 68%.
“Visual speech recognition is a critical task in communication with hearing-impaired people. Speech recognition by articulation is also used in areas unrelated to medicine, for example in law enforcement. Visemes and phonemes do not have a one-to-one correspondence. Russian has 42 phonemes, of which 6 are vowels and 36 are consonants. Often, several phonemes correspond to the same viseme and look identical on the speaker's face. In the future, we plan to improve the algorithm, increase its accuracy and expand the number of recognized words,” said Anton Dzyuba.
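The many-to-one relationship Dzyuba describes can be illustrated with a small, purely hypothetical fragment of a phoneme-to-viseme map; the actual viseme inventory used in the work is not published, but, for instance, the bilabial consonants are produced with the same lip closure and are visually indistinguishable.

```python
# Illustrative (hypothetical) fragment of a many-to-one phoneme-to-viseme mapping.
PHONEME_TO_VISEME = {
    "п": "V_bilabial", "б": "V_bilabial", "м": "V_bilabial",  # same lip closure
    "ф": "V_labiodental", "в": "V_labiodental",               # lower lip against upper teeth
    "о": "V_rounded", "у": "V_rounded",                       # rounded vowels
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence into the coarser viseme sequence seen on the lips."""
    return [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]

print(to_visemes(["б", "а", "н", "а", "н"]))  # e.g. for the test word "банан"
```

Because several phonemes collapse into one viseme class, a lip-reading system has fewer visual classes to distinguish than there are phonemes, which is one reason articulation-based recognition is harder than acoustic recognition.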
The student's scientific supervisor is Anna Pyataeva, assistant professor at the Department of Artificial Intelligence Systems, SSIT. The results of the work were presented at the VIII International Scientific Conference on Regional Problems of Earth Remote Sensing.