Explain in detail speech vs silence discrimination


written 6.0 years ago by

(i) Locating the start and end points of speech amid noise is cited as one of the important problems in several areas of speech processing.

(ii) The difficulty of detecting speech in the presence of background noise is not negligible, except when recordings are made in a soundproof room, which is not always possible.

(iii) One method that can be used for such an algorithm is based on two simple time-domain measurements: energy and zero-crossing rate.
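
The two measurements named in point (iii) can be sketched for a single frame as follows. This is a minimal illustration, not the code of any particular system: the function names and the toy frames are hypothetical, energy is taken as the sum of squared samples, and the zero-crossing measure is taken as the count of sign changes within the frame.

```python
def short_time_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Number of sign changes between consecutive samples in one frame."""
    crossings = 0
    for prev, curr in zip(frame, frame[1:]):
        # Treat zero as non-negative so a sample sitting at 0 is not counted twice.
        if (prev >= 0) != (curr >= 0):
            crossings += 1
    return crossings


# Hypothetical toy frames: a loud, slowly varying segment and a quiet,
# rapidly alternating one.
voiced_like = [0.0, 0.8, 1.0, 0.8, 0.0, -0.8, -1.0, -0.8]
noise_like = [0.05, -0.05, 0.05, -0.05, 0.05, -0.05, 0.05, -0.05]

print(short_time_energy(voiced_like), zero_crossing_rate(voiced_like))
print(short_time_energy(noise_like), zero_crossing_rate(noise_like))
```

In this toy data the loud segment has high energy and few crossings, while the quiet alternating segment has low energy and many crossings, which is why the two measurements are useful together.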

(iv) There are various problems that arise while locating the beginning and end of a speech utterance, some of which are as follows.

(v) In the figure, the background noise is easily distinguished from the speech.

(vi) Here, a marked change in waveform energy between the background noise and the speech indicates the beginning of the utterance.

(vii) In the figure, it is easy to locate the beginning of the speech.

enter image description here

(viii) In the figure, it is extremely difficult to locate the beginning of the speech signal.

(ix) Locating the beginning or end of an utterance becomes difficult under the following conditions:

b) When the speech segment contains a weak plosive such as |p| (a voiceless bilabial stop) at the beginning or at the end.

c) When there are nasals (e.g. |m|, |n|) at the end.

d) When a voiced fricative such as |v| becomes devoiced at the end.

e) When vowel sounds trail off at the end of the utterance.

(x) Despite the problems posed by the above situations, the combination of energy and zero-crossing rate representations is suitable for developing a useful algorithm for locating the beginning and end of a speech signal.

(xi) Rabiner and Sambur have presented one such algorithm in the context of an isolated word speech recognition system.

(xii) In this system, a speaker utters a word during a prescribed recording interval, and the entire interval is sampled and stored for processing.

(xiii) The algorithm uses a 10 ms analysis frame, and the energy and zero-crossing measurements are computed 100 times per second.
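
The framing arithmetic in point (xiii) can be illustrated with a short sketch. The helper name `split_into_frames` and the 8000 Hz audio sampling rate are assumptions made for the example; the text does not state the audio sampling rate. With consecutive, non-overlapping 10 ms frames, one second of audio yields 100 frames, which matches the measurement rate of 100 times per second.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a sample sequence into consecutive, non-overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame is shorter than one sample")
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Assumed example: one second of silence at 8000 Hz (rate not given in the text).
one_second = [0] * 8000
frames = split_into_frames(one_second, 8000)
print(len(frames), len(frames[0]))  # number of frames, samples per frame
```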

(xiv) It starts with an assumption that the first 100 ms of the interval contains no speech.
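
One way to use the assumption in point (xiv) is to treat the first 10 frames (100 ms at 10 ms per frame) as noise only, estimate thresholds from them, and then look for the first later frame that exceeds a threshold. The sketch below is a simplified illustration under stated assumptions, not the thresholds or search procedure of the Rabiner and Sambur algorithm: the function names, the mean-plus-`k`-standard-deviations rule, the value `k = 3.0`, and the sample data are all hypothetical.

```python
from statistics import mean, pstdev

NOISE_FRAMES = 10  # first 100 ms at 10 ms per frame, assumed to contain no speech


def noise_thresholds(energies, zcrs, k=3.0):
    """Estimate (energy_threshold, zcr_threshold) from the leading noise-only frames.

    k is a hypothetical multiplier on the standard deviation, not a published value.
    """
    noise_e = energies[:NOISE_FRAMES]
    noise_z = zcrs[:NOISE_FRAMES]
    e_thr = mean(noise_e) + k * pstdev(noise_e)
    z_thr = mean(noise_z) + k * pstdev(noise_z)
    return e_thr, z_thr


def first_speech_frame(energies, zcrs, k=3.0):
    """Index of the first frame after the noise window that exceeds either threshold."""
    e_thr, z_thr = noise_thresholds(energies, zcrs, k)
    for i in range(NOISE_FRAMES, len(energies)):
        if energies[i] > e_thr or zcrs[i] > z_thr:
            return i
    return None  # no frame stood out from the background


# Hypothetical per-frame measurements: 12 quiet frames followed by 2 loud ones.
energies = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 9.0, 12.0]
zcrs = [5, 6, 5, 4, 5, 6, 5, 5, 4, 5, 5, 5, 6, 3]
print(first_speech_frame(energies, zcrs))
```

A rule this simple would still struggle with the cases listed in point (ix), such as weak plosives or trailing vowels, whose energy stays close to the noise level.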
