Nowadays, real-life constraints necessitate controlling modern machines through human intervention by means of the sensory organs. Voice is one of the human faculties that can control and monitor modern interfaces. In this context, Automatic Speech Recognition (ASR) is principally used to convert natural voice into computer text, as well as to perform an action based on the instructions given by the human. In this paper, we propose a general framework for Arabic speech recognition that uses a Long Short-Term Memory (LSTM) network and a neural network classifier (Multi-Layer Perceptron, MLP) to cope with the non-uniform sequence length of the speech utterances produced by two feature extraction techniques: (1) Mel Frequency Cepstral Coefficients (MFCC), with static and dynamic features, and (2) Filter Bank (FB) coefficients. The neural architecture recognizes isolated Arabic speech by means of classification. The proposed system first extracts pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB. Next, the extracted features are padded in order to deal with the non-uniform length of the sequences. Then, a deep recurrent architecture, either LSTM or GRU (Gated Recurrent Unit), is used to encode the sequences of MFCC/FB features as a fixed-size vector, which is fed to a Multi-Layer Perceptron (MLP) network to perform the classification (recognition). The proposed system is assessed using two different databases: the first concerns spoken digit recognition, where a comparison with related works in the literature is performed, whereas the second contains spoken TV commands. The obtained results show the superiority of the proposed approach.
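The padding step mentioned in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function name `pad_sequences`, the choice of zero as the padding value, and padding at the end of each utterance are all assumptions made for illustration. The sketch also returns the original lengths, which a recurrent encoder would typically need in order to ignore the padded frames.

```python
from typing import List, Tuple

Frame = List[float]      # one feature vector (e.g. MFCC or FB coefficients)
Utterance = List[Frame]  # one utterance: a variable number of frames


def pad_sequences(batch: List[Utterance],
                  pad_value: float = 0.0) -> Tuple[List[Utterance], List[int]]:
    """Right-pad every utterance to the longest length in the batch.

    Hypothetical helper: returns the padded batch together with the
    original lengths of the utterances.
    """
    if not batch:
        return [], []

    lengths = [len(utt) for utt in batch]
    max_len = max(lengths)

    # Assumption: all frames share the same dimensionality, taken from
    # the first non-empty utterance (0 if every utterance is empty).
    dim = next((len(utt[0]) for utt in batch if utt), 0)

    padded = []
    for utt in batch:
        # Copy the real frames, then append constant padding frames.
        frames = [list(frame) for frame in utt]
        frames += [[pad_value] * dim for _ in range(max_len - len(utt))]
        padded.append(frames)
    return padded, lengths


# Two toy utterances with 3 and 1 frames of 2 coefficients each.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0]],
]
padded, lengths = pad_sequences(batch)
```

After padding, every utterance in the batch has the same number of frames, so the batch can be stacked into a single fixed-shape array before being passed to the recurrent encoder.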