Part 2 : How the brain perceives and processes vocal sounds

by Karen Lederer

The Acoustic and Auditory Phonetics of Human Beatboxing

In Part 2 of this series Karen Lederer outlines some general theories on the perception of speech and how speech and non-speech sounds are processed in the brain.

2.1 Effects of speaking rate on phonetic categories

Summerfield (1981) and Green and Miller (1985) (both cited in Miller et al. 1997, p.121) are among the many phoneticians who found that, on hearing speech,

"[listeners alter] the precise mapping between acoustic signal and phonetic category as speaking rate varies."

In other words, listeners alter the ‘allowed’ voice onset time (VOT) of consonants according to the rate of speech. In rapid speech the VOT of /p/ is shorter than it is in slow speech, yet /p/ is always perceived as /p/ even though its VOT may vary considerably. In voiced/voiceless pairs such as /p/ and /b/, the VOT is essential to their correct interpretation. If the /p/ of rapid speech were heard in slow speech, it might be interpreted as /b/ because of its shorter VOT.

When perceiving speech, listeners shift the location of the voiced/voiceless category boundary according to the rate of speech. If a voiced bilabial beatboxed sound (such as the simple kick drum /b/ in Stowell’s Beatbox Alphabet) is combined with speech, it is unlikely to be perceived as /b/ unless its VOT is within the ‘allowed’ time for a /b/ in normal speech at that rate.
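The rate-dependent boundary described above can be sketched in a few lines of Python. This is only an illustration: the function names, the linear relationship and every number in it (the 25 ms reference boundary, the 200 ms reference syllable and the slope) are assumptions chosen for demonstration, not values reported by Summerfield, Green and Miller, or Miller et al.

```python
def vot_boundary_ms(syllable_ms, base_boundary_ms=25.0,
                    base_syllable_ms=200.0, slope=0.05):
    """Hypothetical /b/-/p/ boundary that moves with speaking rate.

    Longer syllables (slower speech) push the boundary to a longer VOT.
    All default values are illustrative assumptions.
    """
    return base_boundary_ms + slope * (syllable_ms - base_syllable_ms)


def classify_bilabial(vot_ms, syllable_ms):
    """Label a bilabial stop as voiceless /p/ or voiced /b/ at a given rate."""
    return "p" if vot_ms >= vot_boundary_ms(syllable_ms) else "b"


# The same 30 ms VOT falls on different sides of the boundary
# depending on the (assumed) syllable duration.
rapid = classify_bilabial(vot_ms=30.0, syllable_ms=150.0)  # boundary 22.5 ms
slow = classify_bilabial(vot_ms=30.0, syllable_ms=400.0)   # boundary 35.0 ms
print(rapid, slow)  # p b
```

Under these assumed numbers, a token with a 30 ms VOT counts as /p/ in rapid speech but as /b/ in slow speech, which is the pattern the paragraph above describes.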

The sounds of beatboxing are not phonemic as they do not combine to form meaningful units like words. They carry no arbitrary meaning and are not categorised into phonemes according to their specific properties. Each sound is considered individually rather than being assigned to a certain group or phoneme and every element of their acoustics affects how they are perceived.

Evidence for a separate speech mode in the brain is important when considering how beatboxing is interpreted, because beatboxing is produced in the same way as speech and by the same system. However, beatboxing is not confused with speech, even when both are performed by the same person at the same time. Dichotic listening gives some insight into this.

2.2 Dichotic Listening

When a complete sound with three formants is presented to both ears simultaneously, a single speech sound such as /da/ or /ga/ is perceived, depending on the third formant transition. However, experiments show that when a ‘base’ (the 1st and 2nd formants plus the steady part of the third formant) is presented to one ear and the isolated transition of the third formant is presented to the other ear, the listener perceives two separate sounds. In the ear to which the base is presented, s/he hears a speech sound, /da/ or /ga/, but in the other ear (to which the isolated third formant transition is presented) a non-speech ‘chirp’ is perceived (Mann and Liberman 1983, in Moore 1989).

Investigations into dichotic listening (Rand 1974; Mann and Liberman 1983) have found that an isolated formant transition can be interpreted both as speech and as non-speech. In speech, formant transitions rarely occur in isolation. In beatboxing, however, sounds tend to be of very short duration, and formant transitions often occur in isolation because the sound is cut short before the formant settles into a steady state. It is therefore possible that beatboxed sounds could be perceived as speech, as non-speech, or as both.
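The layout of the dichotic stimulus described in this section (a base in one ear, an isolated third formant transition in the other) can be sketched as follows. This is a rough illustration and not a reconstruction of the stimuli used by Rand or by Mann and Liberman: each formant is approximated by a plain sine wave, the first and second formants are held steady for simplicity, and the sample rate, frequencies and durations are all assumed values.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)


def tone(freqs_hz):
    """Sine wave whose frequency follows freqs_hz, one value per sample."""
    phase, samples = 0.0, []
    for freq in freqs_hz:
        phase += 2 * math.pi * freq / SAMPLE_RATE
        samples.append(math.sin(phase))
    return samples


def dichotic_pair(f3_onset_hz, f1_hz=700.0, f2_hz=1200.0,
                  f3_steady_hz=2500.0, transition_s=0.05, total_s=0.25):
    """Return (base, chirp): one channel per ear. All defaults are hypothetical."""
    n_total = round(total_s * SAMPLE_RATE)
    n_trans = round(transition_s * SAMPLE_RATE)
    n_steady = n_total - n_trans

    # Isolated third formant transition: a glide from onset to steady frequency.
    glide = [f3_onset_hz + (f3_steady_hz - f3_onset_hz) * i / n_trans
             for i in range(n_trans)]
    chirp = tone(glide) + [0.0] * n_steady

    # Base: first and second formants plus only the steady part of the third.
    f1 = tone([f1_hz] * n_total)
    f2 = tone([f2_hz] * n_total)
    f3_steady = [0.0] * n_trans + tone([f3_steady_hz] * n_steady)
    base = [(a + b + c) / 3 for a, b, c in zip(f1, f2, f3_steady)]
    return base, chirp


base, chirp = dichotic_pair(f3_onset_hz=2900.0)
```

Played separately over headphones, `base` would go to one ear and `chirp` to the other. Summing the two channels gives the complete three-formant sound that is presented to both ears in the first condition.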

Miriam Makeba’s ‘The Wedding Song’ provides an interesting example of a similar phenomenon related to the perception of non-speech sounds in a speech context. The song is sung in Zulu, an African click language, and Slaney (1995) observes that,

"[when the title is spoken], the click is definitely heard as part of the word. Yet when the same type of click is heard in the song it separates from the speech and becomes part of the instrumental track. To my American-English ears, a click is not normally part of a language and when placed into an ambiguous context, such as a song, the click is no longer heard as a part of the speech signal." Slaney (1995)

This is the same effect that occurs in beatboxing. Although many sounds of beatboxing are similar to those used in speech, they are not the same and are separated from speech in the same way as Zulu clicks in ‘The Wedding Song’.

Note: Although clicks are speech sounds in Zulu, they are not used in English, so native English speakers class them as non-speech sounds.

2.3 The prominence of speech in adverse conditions

Another piece of evidence pointing towards a separate ‘speech mode’ in the brain is that speech is intelligible even in severely adverse conditions. Moore (1989, p.280) notes that the average speech level should exceed that of noise by at least 6 dB for satisfactory communication. However, 50% of words uttered can still be understood when the noise level is the same as the speech level, and speech may even be intelligible when the noise level exceeds the speech level. Moore also notes that speech is still 90% intelligible when all frequencies below 1000 Hz and above 2000 Hz are removed from the signal. He concludes that,

"no single aspect of the speech wave is essential for speech perception." (Moore, 1989 p281)
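To put Moore’s 6 dB figure in more concrete terms, the sketch below converts a ratio of speech and noise levels into decibels using the standard definition for amplitude (sound pressure) ratios, 20·log10(speech/noise). The function name and the example RMS values are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def snr_db(speech_rms, noise_rms):
    """Signal-to-noise ratio in decibels for two RMS amplitude levels."""
    return 20 * math.log10(speech_rms / noise_rms)


# Speech at twice the amplitude of the noise is roughly 6 dB above it.
print(round(snr_db(2.0, 1.0), 2))  # 6.02

# Equal speech and noise levels give 0 dB.
print(snr_db(1.0, 1.0))  # 0.0
```

On this scale the ‘satisfactory’ condition corresponds to speech at about twice the sound pressure of the noise, while the condition in which half of the words are still understood corresponds to 0 dB.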

Moore’s findings show that the human brain is able to detect speech even under the most adverse conditions and with very limited information. A possible explanation for this advanced ability for speech perception is the amplification of typical speech frequencies by the human vocal and auditory systems. The auditory part of this amplification is described by Everest (2001, pp.43-44), who observes that,

"the resonance effect of the ear canal increases sound pressure at the eardrum at certain frequencies. … For the important speech frequencies, sound pressure at the eardrum is increased 5 dB." (Everest 2001, p.43-44)

Everest later identifies these ‘important speech frequencies’ as 2000Hz and 3000Hz (Everest, 2001 p43). Since these frequencies tend to be prominent in sounds created in the mouth, they are effectively amplified twice (once by the mouth and once by the ear) in normal speech and hearing. This amplification allows the important information carried at these speech frequencies to reach the brain even when listening conditions are bad.
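As a rough guide to what a 5 dB increase means, the sketch below converts a decibel gain into a sound pressure ratio with the standard relation ratio = 10^(dB/20). The function name is hypothetical, and the 3 dB vocal-tract gain in the second example is a placeholder value only, since the text gives no figure for the amplification contributed by the mouth.

```python
def db_to_pressure_ratio(gain_db):
    """Convert a gain in decibels to a sound pressure (amplitude) ratio."""
    return 10 ** (gain_db / 20)


# Everest's 5 dB ear canal gain is a pressure increase of roughly 1.78 times.
ear_ratio = db_to_pressure_ratio(5.0)
print(round(ear_ratio, 2))  # 1.78

# Gains in dB add, so the corresponding pressure ratios multiply.
mouth_gain_db = 3.0  # placeholder value, not from the source
combined_ratio = db_to_pressure_ratio(mouth_gain_db + 5.0)
```

Because decibel gains add, being ‘amplified twice’ means the overall pressure ratio is the product of the two individual ratios.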