Bidirectional deep architecture for Arabic speech recognition

Citation:

Zerari, Naima, et al. 2019. “Bidirectional deep architecture for Arabic speech recognition”. Open Computer Science 9 (1).

Abstract:

Nowadays, real-life constraints necessitate controlling modern machines through human intervention by means of the sensory organs. The voice is one of the human senses that can control/monitor modern interfaces. In this context, Automatic Speech Recognition is principally used to convert natural voice into computer text as well as to perform an action based on the instructions given by the human. In this paper, we propose a general framework for Arabic speech recognition that uses Long Short-Term Memory (LSTM) and a Neural Network (Multi-Layer Perceptron: MLP) classifier to cope with the non-uniform sequence length of the speech utterances obtained from two feature extraction techniques: (1) Mel Frequency Cepstral Coefficients (MFCC, static and dynamic features) and (2) Filter Bank (FB) coefficients. The neural architecture recognizes isolated Arabic speech via a classification technique. The proposed system involves, first, extracting pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB. Next, the extracted features are padded to deal with the non-uniformity of the sequence lengths. Then, a deep architecture represented by a recurrent LSTM or GRU (Gated Recurrent Unit) network is used to encode the sequence of MFCC/FB features as a fixed-size vector that is fed into a Multi-Layer Perceptron (MLP) network to perform the classification (recognition). The proposed system is assessed using two different databases: the first concerns spoken digit recognition, where a comparison with other related works in the literature is performed, whereas the second contains spoken TV commands. The obtained results show the superiority of the proposed approach.
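Illustrative sketch:

The pipeline described in the abstract (MFCC/FB feature extraction, padding of variable-length utterances, an LSTM encoder producing a fixed-size vector, and an MLP classifier) can be sketched as follows. This is a minimal, hedged reconstruction: the library choices (librosa, Keras), layer sizes, and training hyperparameters are assumptions for illustration, not the authors' exact configuration.

# Minimal sketch of the described pipeline: MFCC (static + dynamic) features,
# post-padding of variable-length sequences, LSTM encoder, MLP classifier.
# Layer sizes and hyperparameters are illustrative assumptions.
import numpy as np
import librosa
import tensorflow as tf

def extract_mfcc(wav_path, n_mfcc=13):
    """Static MFCCs plus first- and second-order deltas -> (frames, 3*n_mfcc)."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T  # time-major layout

def build_model(max_len, feat_dim, n_classes):
    """LSTM encodes the padded sequence into a fixed-size vector; an MLP classifies it."""
    return tf.keras.Sequential([
        tf.keras.layers.Masking(mask_value=0.0, input_shape=(max_len, feat_dim)),
        tf.keras.layers.LSTM(128),                          # fixed-size utterance encoding
        tf.keras.layers.Dense(64, activation="relu"),       # MLP hidden layer
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

def train(wav_paths, labels, n_classes):
    """Pad the variable-length feature sequences to a common length, then train.
    `wav_paths` and `labels` are assumed to be supplied by the caller."""
    feats = [extract_mfcc(p) for p in wav_paths]
    max_len = max(f.shape[0] for f in feats)
    X = tf.keras.preprocessing.sequence.pad_sequences(
        feats, maxlen=max_len, dtype="float32", padding="post", value=0.0)
    y = np.asarray(labels)
    model = build_model(max_len, X.shape[-1], n_classes)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=20, batch_size=32, validation_split=0.1)
    return model

Swapping the LSTM layer for tf.keras.layers.GRU(128) gives the GRU variant mentioned in the abstract; replacing extract_mfcc with a log Mel filter-bank front end gives the FB variant.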

Publisher's Version

See also: Equipe 2