Part 6 : Results and Analysis

by Karen Lederer

The Acoustic and Auditory Phonetics of Human Beatboxing

In Part 6 of Karen Lederer's series on the Phonetics of Beatboxing, an acoustic comparison of beatboxed and machine-made sounds is made. Each sound studied is dealt with in a separate subsection and the results of each brought together in a conclusion at the end of this thread.

6.0 Results and Analysis

This section analyses the sounds investigated in detail. The descriptions are based on auditory and instrumental analyses; Tyte’s descriptions of how to make each sound have also been considered to help determine the exact articulation of each sound. Results are divided into subsections, each dealing with a single sound. Each subsection is treated as a separate module of the investigation and is organised as follows:

  • Articulatory description of the beatboxed sound.
  • Spectrogram, waveform, power spectrum and averaged spectrum of electronic and beatboxed sounds
  • Table of measurements taken.
  • Description and acoustic comparison of both sounds with reference to articulation.
  • Summary of findings.

Section 6.4 will then summarise the overall findings.

For specifics of measurements taken, terms and abbreviations used, please see Appendix 2.

6.1 Clave click

Electronic Clave

Beatboxed clave

6.1.1.a 6.1.1.b
Key: AP = atmospheric pressure, -P = less pressure

Fig. 6.1.1 The articulation of a clave click.

6.1.1 Articulation

A clave is a wooden block hit by a wooden stick and the beatboxed clave click imitates the TR808’s version of this instrument. Although clicks are used in many paralinguistic contexts, there are relatively few languages known to employ them as phonemes. Some languages of Southern Africa, including Bantu languages that have borrowed clicks, are among these. The clave click, however, is made further back in the mouth than most clicks used in speech. It is made on the velaric ingressive airstream mechanism and involves making two closures in the mouth by placing the tip of the tongue on the hard palate and the dorsum on the velum. A pocket of air is created in the mouth between these points of closure and a sucking action is made which brings the dorsum further back in the mouth (fig 6.1.1.a). This increases the size of the air pocket and rarefies the air within it. When the tip of the tongue is released from the hard palate, air rushes in to equalise pressure (fig 6.1.1.b) and this burst is what creates the clave click sound. The lips are spread for the articulation of the sound, giving it a high frequency to emulate the sound created by the TR808.



Fig.6.1.2 Spectrogram, waveform and power spectrum of electronic clave


Fig 6.1.3 Averaged spectrum for electronic clave



Fig 6.1.4 Spectrogram, waveform and power spectrum of beatboxed clave


Fig 6.1.5 Averaged spectrum of the beatboxed clave

| Property | DM Clave | HBB Clave | Difference |
| --- | --- | --- | --- |
| Total duration of sound (ms) | 34 | 113 | 79 |
| Burst duration (ms) | 2 | 6 | 4 |
| Rate of fade of burst (dB/ms) | 1.15 | 33.3 | 32.15 |
| Rate of fade post-burst (dB/ms) | 1.15 | 0.18 | 0.97 |
| Total rate of fade (dB/ms) | 1.15 | 0.36 | 0.79 |
| Resonant Frequency (Hz) | 2412 | 2335 | 77 |
| Bandwidth of Resonant Frequency (Hz) | 1206 | 1219 | 13 |

Fig 6.1.6 Table to show properties of electronic (DM) and beatboxed (HBB) clave

6.1.2 Comparison of sounds

It is important to note that scales on each spectrogram are different. The spectrogram of the electronic clave shows frequencies up to 22kHz whereas frequencies are only visible up to 6,750Hz in the spectrogram of the beatboxed clave.

Despite the different scales, it can still be seen that there is a similar distribution of energy in these two sounds. Both begin with an energy burst covering all visible frequencies and in both, the energy of the burst quickly becomes concentrated at its resonant frequency. Before the specifics of the energy burst and resonances are discussed, it is necessary to account for some of the unavoidable differences between the electronic and vocally produced sounds.

6.1.3 Energy distribution

The electronic clave is much ‘cleaner’ than the vocally produced clave. There is no energy outside the resonant frequency of the sound, the waveform is sinusoidal and both the waveform and the power spectrum fade at a uniform rate. All the lines on the graphs are smooth and straight.

There are two components to the electronic clave sound. It is transient yet it is also periodic. The energy burst forms the first period of a sinusoidal waveform which resonates at 2,412Hz. There is no energy present outside the resonant frequency and the sound produced is a pure tone.
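The description above, a single frequency whose level falls at a uniform rate, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the TR808 generates its clave: the 44,100Hz sample rate and the function names are assumptions, the initial energy burst is ignored, and the 2,412Hz, 1.15dB/ms and 34ms values are taken from Fig 6.1.6.

```python
import math


def decaying_sine(freq_hz, fade_db_per_ms, duration_ms, sample_rate=44100):
    """Return samples of a sine tone whose level falls linearly in dB."""
    n_samples = int(sample_rate * duration_ms / 1000)
    samples = []
    for n in range(n_samples):
        t_ms = 1000 * n / sample_rate
        # Convert the accumulated fade in dB into an amplitude ratio.
        gain = 10 ** (-fade_db_per_ms * t_ms / 20)
        samples.append(gain * math.sin(2 * math.pi * freq_hz * t_ms / 1000))
    return samples


def level_drop_db(fade_db_per_ms, duration_ms):
    """Total fall in level after duration_ms at a constant rate of fade."""
    return fade_db_per_ms * duration_ms


# Values from Fig 6.1.6; the sample rate is an assumption.
clave = decaying_sine(2412, 1.15, 34)
print(len(clave), "samples generated")
print(round(level_drop_db(1.15, 34), 1), "dB below the starting level at the end")
```

Because the only component is one sinusoid under a smooth envelope, a sound built this way has no energy away from its resonant frequency, which is the 'cleanliness' that the beatboxed version cannot match.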

The beatboxed clave is a transient sound with a complex, aperiodic waveform. Energy resonates at 2,335Hz but there is also energy present outside this frequency. The patchiness of the spectrogram and the irregularity of the waveform are features of the beatboxed clave click that are impossible to avoid.

A pure sinusoidal waveform cannot easily be produced by the vocal tract because vocally produced sounds are created by the build-up and subsequent release of intra-oral pressure. On the release of pressure, air particles vibrate in response to many signals at once so energy is present across a range of frequencies. The most intense vibration is at the natural resonant frequency of the space in which the sound is created, hence the concentration of energy at 2,335Hz, but molecules vibrating at other frequencies cause additional energy. Due to the nature and production of vocally produced sounds, additional noise outside the resonant frequency cannot be avoided and a pure tone cannot be achieved.

6.1.4 Rate Of Fade

Another feature of the beatboxed clave click that is difficult to control is its rate of fade (RF). The power spectrums show that the RF of the electronic clave is uniform throughout at 1.15dB/ms but the beatboxed clave does not follow such a pattern.

The beatboxed power spectrum shows a peak of 81dB as the click sound is generated, but this high level of power is quickly reduced to 61dB in 6ms; the RF recorded for the burst is 33.3dB/ms. After the initial reduction in power, the RF slows to 0.18dB/ms, roughly six times slower than the RF of the electronic clave. This contributes to the longer overall duration of the sound.
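A rate of fade is a fall in level divided by the time over which it occurs. The short Python sketch below works through that arithmetic with the figures in this paragraph; the function name is illustrative and not part of the original analysis. Taken at face value, a fall from 81dB to 61dB over 6ms comes to about 3.3dB/ms, so the 33.3dB/ms figure may have been measured over a shorter window or may carry a misplaced decimal point; the source does not say which.

```python
def rate_of_fade(start_db, end_db, duration_ms):
    """Average rate of fade in dB/ms over a stretch of the power spectrum."""
    return (start_db - end_db) / duration_ms


# Burst of the beatboxed clave: 81dB falling to 61dB in 6ms.
burst_rf = rate_of_fade(81, 61, 6)
print(f"burst rate of fade: {burst_rf:.2f} dB/ms")

# Post-burst fade of the beatboxed clave compared with the electronic clave.
slowdown = 1.15 / 0.18
print(f"post-burst fade is about {slowdown:.1f} times slower than the electronic clave")
```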

In the beatboxed sound the peak in power is created when the tip of the tongue is released and air rushes into the space behind it to equalise pressure. At this point, energy is vibrating at many different frequencies. Most of it is then quickly absorbed by the soft walls of the vocal tract but the resonant frequency energy (and a few other frequencies as discussed above) remains resonating. The absorption of most frequencies accounts for the sudden decrease in power after the energy burst.

A beatboxer has little control over the rate of fade of a transient click like the clave because once the pressure is released subsequent energy is purely resonant. There is no further sound source and no way of prolonging or cutting short the transient sound without altering its other properties. The duration of the sound can, however, be controlled by increasing or decreasing the original amplitude of the click. The rate of fade of the click will remain the same but if the original amplitude is increased, the time taken for the sound to fade to silence is longer.
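The relationship between peak amplitude and duration can be made concrete with a small sketch. It assumes a constant rate of fade and a fixed level at which the sound counts as silent; the 40dB floor and the function name are hypothetical choices for illustration, while 0.36dB/ms is the total rate of fade of the beatboxed clave in Fig 6.1.6.

```python
def time_to_fade_ms(peak_db, floor_db, fade_db_per_ms):
    """Time for a sound to fall from peak_db to floor_db at a constant rate of fade."""
    return (peak_db - floor_db) / fade_db_per_ms


FLOOR_DB = 40.0   # hypothetical level treated as silence (assumption)
CLAVE_RF = 0.36   # total rate of fade of the beatboxed clave, dB/ms (Fig 6.1.6)

for peak_db in (75, 81, 87):
    duration = time_to_fade_ms(peak_db, FLOOR_DB, CLAVE_RF)
    print(f"peak {peak_db}dB: about {duration:.0f}ms to fade")
```

Under these assumptions each extra 6dB of peak level adds roughly 17ms to the sound, while the slope of the fade stays the same.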

6.1.5. Resonant Frequency and Burst Duration

Despite the features of the electronic clave that are impossible to copy vocally, the clave click is imitated remarkably well in other areas.

The burst duration of the electronic clave is 2ms, whereas that of the beatboxed clave is 6ms. Perceptually, the difference between 2ms and 6ms is insignificant. To a person, both sound like a short sharp burst.

The frequency at which the electronic sound resonates is 2,412Hz with a bandwidth of 1,206Hz. The beatboxed clave resonates 77Hz below this, with a bandwidth just 13Hz wider than that of the electronic clave.

The resonant frequency and the burst duration of the sound are both reproduced by the beatboxer remarkably accurately.

6.1.6 Averaged Spectrums

The spectrums of each sound are very similar. However, there is an unexplained notch in the averaged spectrum of the electronic clave which indicates a decrease in power at the resonant frequency of the sound. This is a strange phenomenon that is not visible on the spectrogram, where there is a solid band of energy. The existence of energy concentrated at two close frequencies is also contradicted by the simple sinusoidal waveform of the sound.

A single spectrum of the sound taken at 294ms (fig 6.1.8) shows a single peak of energy at 2,490Hz. It is only after the cessation of the burst that two energy peaks become evident (fig 6.1.9).



Fig 6.1.7 Averaged spectrum for electronic clave



Fig 6.1.8. Single spectrum of electronic clave taken at 294ms



Fig 6.1.9 Single spectrum of electronic clave taken at 298ms

The peaks in fig 6.1.9 are 355Hz apart and there is an intensity difference of just 0.2dB.

The averaged spectrum for the beatboxed clave does not show a notch like that in the electronic clave, but when a single spectrum (Fig 6.1.10) is taken at the centre of the burst, 4ms into the sound (882ms on scale shown), two close peaks are evident 561Hz apart with an intensity difference of 8dB.

As time elapses throughout the duration of the sound the intensities of these peaks are not kept constant. An animated spectrum shows that at times there are two peaks in intensity but at other times there is just one. This indicates the fading in and out of the less intense ‘formant’ throughout the sound. This is also illustrated by the single peak in fig 6.1.11.



Fig 6.1.10 Single spectrum of beatboxed clave taken at 882ms (4ms into the sound)



Fig 6.1.11 Single spectrum of the beatboxed clave at 897ms (19ms into the sound)

Both spectrums show two close formants that merge at different points in each sound. The electronic spectrum shows a single peak in the burst energy and later this splits into two, yet the beatboxed sound shows two peaks in the burst energy that are later and intermittently reduced to one.

The formant-like bands of energy of the beatboxed sound are 206Hz further apart than those of the electronic sound (561Hz against 355Hz).

Although a strange phenomenon in the electronic sound, the beatboxed sound imitates the two peak energy values at the resonant frequency of the sound.

6.1.7 Summary of comparison

The beatboxed clave imitates the properties of the electronic clave accurately in terms of the burst duration, resonant frequency and the presence of two close formant-like bands of energy. However, the nature of sound production in the vocal tract greatly limits beatboxers in their ability to recreate the purity of a simple waveform. Energy exists in the beatboxed sound that does not exist in the electronic one.

The rate of fade is also not imitated accurately and this is, like the waveform, something over which the beatboxer has no control.

6.2 Kick drum

Electronic kick drum

Beatboxed kick drum

This sound is used as a bass drum in beatboxing. It is produced on the pulmonic egressive airstream mechanism and is a forceful, short, snappy bilabial trill articulated with tense musculature. The ‘trill’ component of the sound is made by compressing air in the mouth behind tightly closed lips causing pressure to build up immensely. In much the same way as phonation occurs in the glottis, a series of short pulses of high-pressure air are allowed to escape through tightly pressed together lips. Air pressure is maintained in the mouth by a continuous flow of air from the lungs.

Simultaneously with the production of the trill, a nasal tone is created. Air from the lungs is forced through the larynx where modal phonation occurs, and the phonated air then moves through the open velo-pharyngeal port and out through the nose. In order to best emulate the deep bass sound, the pharyngeal resonance chamber is made as large as possible by lowering the larynx. The voicing of the classic kick begins as soon as air begins to pass through the larynx after the initial burst of the trill.

The sound is made by the sudden release of high pressure air and the vibration of the vocal folds.

Note: Modal phonation is created by the build up of air pressure behind the vocal folds. When the pressure is sufficient, the vocal folds are forced apart and peel open from the bottom. A burst of air is able to escape before the vocal folds are then sucked back together and the process begins again. This process is repeated about 120 times per second in adult males and about 220 times per second in adult females. It generates the tone that is used in normal speech.


Fig.6.2.1. Spectrogram, waveform and power spectrum of the electronic kick drum


Fig. 6.2.2 Averaged spectrum of the electronic kick sound



Fig. 6.2.3 Spectrogram, waveform and power spectrum of the beatboxed classic kick drum


Fig. 6.2.4 Averaged spectrum of the beatboxed kick drum

| Property | Electronic Kick drum | Beatboxed Kick drum | Difference |
| --- | --- | --- | --- |
| Total duration of sound (ms) | 362 | 401 | 39 |
| Burst duration (ms) | 12 | 8 | 4 |
| Rate of fade of burst (dB/ms) | 0.85 | 0.375 | 0.475 |
| Rate of fade post-burst (dB/ms) | 0.021 | 0.141 | 0.12 |
| Total rate of fade (dB/ms) | 0.079 | 0.132 | 0.053 |
| Resonant Frequency (Hz) | 258 | 219 | 39 |
| Bandwidth of resonant frequency (Hz) | 183 | 195 | 12 |
| No. of Striations | 14 | 12 | 2 |
| Average interval between striations (ms) | 20 | 9.1 | 10.9 |

Fig.6.2.5 Table to show properties of electronic and beatboxed kick drums

6.2.1 Comparison of sounds

The electronic kick drum and the beatboxed kick drum were recorded at different sampling rates so the spectrograms are of different scales. The spectrogram for the electronic kick drum shows energy up to 5,500Hz and the beatboxed kick drum shows energy up to 6,500Hz.

6.2.2 Distribution of energy

As for the clave click, there are certain features of the electronic kick drum sound that cannot be accurately imitated by the vocal tract. The most notable is the ‘cleanliness’ of the electronic spectrogram.

In the electronic kick drum there is a burst of energy covering all frequencies followed by a concentrated band of resonant frequency energy at 258Hz. As well as the resonant frequency energy, there is a series of vertical striations of up to 1,759Hz. There is no energy whatsoever other than that which is contained within the burst, the resonant frequency and the striations.

The intense energy of the beatboxed sound follows a similar pattern to that of the electronic sound: there is a burst covering all visible frequencies, a concentrated band of low frequency energy and a series of vertical striations. In the beatboxed kick, however, there is also constant weak energy covering all frequencies until about 141ms into the sound (1,420ms on scale shown), which is not present in the electronic kick.

The constant weak energy is present for the same reasons as the surplus energy in the beatboxed clave, however it is stronger for longer in the kick because the bilabial trill involves the repeated disturbance of air whereas in the clave, air molecules are disturbed just once. Every time a single burst of air escapes during the bilabial trill (i.e. every time a striation appears on the spectrogram), energy at high frequencies is produced and the unwanted frequencies are therefore present until after the last burst of air of the trill is released.

In order to emulate the kick drum sound of the TR808, a large volume of air must be moved through the vocal tract at a high velocity. Inevitably, this moving air causes friction against the walls of the vocal tract and this contributes further to the high frequency energy in the kick drum sound.

6.2.3 Striations

Vertical striations are present in both kick drum sounds however those in the beatboxed kick appear very weak in comparison to those in the electronic one.

The striations in the electronic kick are more prominent because electronic sounds are created synthetically from scratch and are heavily refined so that surplus energy is eliminated. Energy vibrates at specified frequencies and times only, hence the ‘tidiness’ of the electronic spectrograms: no energy other than the resonant frequency energy is present between striations.

The apparent weakness of the striations (including the initial burst) in the beatboxed kick can be partially accounted for by the presence of energy from air molecules vibrating across all frequencies. The striations are effectively ‘hidden’ or ‘masked’ by additional energy, but they are still distinct enough to be detected. Their times of occurrence and the intervals between each striation are shown in figure 6.2.5. Figure 6.2.6 shows the same information for the striations in the electronic kick.

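| Striation no. | Time on scale shown (ms) | Time from beginning of sound (ms) | Interval between striations (ms) |
| --- | --- | --- | --- |
| 1 | 1291 | 11 | |
| 2 | 1298 | 18 | 7 |
| 3 | 1306 | 26 | 8 |
| 4 | 1314 | 34 | 8 |
| 5 | 1322 | 42 | 8 |
| 6 | 1331 | 51 | 9 |
| 7 | 1338 | 58 | 7 |
| 8 | 1347 | 67 | 9 |
| 9 | 1357 | 77 | 10 |
| 10 | 1369 | 89 | 12 |
| 11 | 1379 | 99 | 10 |
| 12 | 1389 | 109 | 10 |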

Figure 6.2.5 Striations in the beatboxed kick drum

Total number of striations: 12

Average interval between striations: 9.1ms

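| Striation no. | Time on scale shown (ms) | Time from beginning of sound (ms) | Interval between striations (ms) |
| --- | --- | --- | --- |
| 1 | 393 | 63 | |
| 2 | 413 | 83 | 20 |
| 3 | 433 | 103 | 20 |
| 4 | 453 | 123 | 20 |
| 5 | 473 | 143 | 20 |
| 6 | 493 | 163 | 20 |
| 7 | 513 | 183 | 20 |
| 8 | 533 | 203 | 20 |
| 9 | 553 | 223 | 20 |
| 10 | 573 | 243 | 20 |
| 11 | 593 | 263 | 20 |
| 12 | 613 | 283 | 20 |
| 13 | 633 | 303 | 20 |
| 14 | 653 | 323 | 20 |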

Figure 6.2.6 Striations in the electronic kick drum

Total number of striations: 14

Average interval between striations: 20ms

The intervals between striations in the electronic sound are just over twice as long as those in the beatboxed sound; however, the number of striations differs by just two.
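The averages quoted for the two striation tables can be recomputed from the onset times. The sketch below is a check on the arithmetic rather than part of the original analysis, and the variable and function names are illustrative. On these figures the eleven gaps in the beatboxed kick average about 8.9ms, so the quoted 9.1ms may have been obtained by dividing the 109ms span by the twelve striations; the source does not say.

```python
# Onset times from the beginning of each sound (ms), as listed in the two tables.
beatboxed_onsets_ms = [11, 18, 26, 34, 42, 51, 58, 67, 77, 89, 99, 109]
electronic_onsets_ms = list(range(63, 324, 20))  # 63, 83, ..., 323


def intervals(onsets_ms):
    """Gaps between successive striations."""
    return [later - earlier for earlier, later in zip(onsets_ms, onsets_ms[1:])]


def mean(values):
    return sum(values) / len(values)


for name, onsets in (("beatboxed", beatboxed_onsets_ms),
                     ("electronic", electronic_onsets_ms)):
    gaps = intervals(onsets)
    print(f"{name}: {len(onsets)} striations, mean interval {mean(gaps):.1f}ms")
```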

The intervals between each striation can be lengthened by the beatboxer by two methods: air can be forced out of the mouth more slowly so that it takes longer to achieve sufficient pressure to force the lips open, or the lips can be relaxed slightly so that they vibrate more slowly. These methods seem to contradict each other but the relaxation of the lips can be likened to the slackening of a taut elastic band. The tauter an elastic band, the faster it vibrates and the slacker it is, the slower it vibrates. To emulate an electronic kick drum more accurately than in this example, beatboxers should reduce the rate at which the lips vibrate.
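An interval between striations converts directly into a repetition rate, which gives a rough idea of how much more slowly the lips would need to vibrate. A minimal sketch, assuming the averaged intervals given for the two kick drums are representative of the whole trill; the function name is illustrative.

```python
def pulses_per_second(interval_ms):
    """Repetition rate implied by a fixed interval between striations."""
    return 1000.0 / interval_ms


beatboxed_rate = pulses_per_second(9.1)   # average interval of the beatboxed kick
electronic_rate = pulses_per_second(20)   # average interval of the electronic kick

print(f"beatboxed kick: about {beatboxed_rate:.0f} pulses per second")
print(f"electronic kick: about {electronic_rate:.0f} pulses per second")
```

On these figures the lip vibration would need to fall from roughly 110 to 50 pulses per second to match the electronic sound.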

6.2.4 Resonant Frequency

| Property | Electronic Kick drum | Beatboxed Kick drum | Difference |
| --- | --- | --- | --- |
| Resonant Frequency (Hz) | 258 | 219 | 39 |
| Bandwidth of resonant frequency (Hz) | 183 | 195 | 12 |

Fig 6.2.7 Resonant frequencies of the beatboxed and electronic kick drums

The resonant frequency of the beatboxed sound is just 39Hz lower than that of the electronic one and the bandwidth is just 12Hz wider. This shows that energy is marginally less concentrated in the beatboxed sound and the resonant frequency is reproduced with a high degree of accuracy.

Note: It must be noted that there is no boundary or rule as to what might constitute an ‘accurate’ imitation of a drum machine sound as we have no comparisons or controls, but when there are thousands of available frequencies to choose from and the beatboxer is only 39Hz off target, the proportions are put into better perspective. If on the other hand there were 50 frequencies available and the beatboxer was ‘out’ by 39Hz, this would not be such an accurate imitation.
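One way to put a frequency difference into proportion is to express it relative to the target, either as a percentage or as a musical interval in cents (100 cents is one equal-tempered semitone). Neither measure is used in the original analysis; the sketch below is offered only as an illustration, with illustrative function names, using the resonant frequencies reported for the kick drum and the clave.

```python
import math


def relative_error_percent(target_hz, actual_hz):
    """Frequency difference as a percentage of the target frequency."""
    return 100 * abs(target_hz - actual_hz) / target_hz


def interval_cents(target_hz, actual_hz):
    """Size of the interval between two frequencies in cents."""
    return 1200 * abs(math.log2(target_hz / actual_hz))


for name, target_hz, actual_hz in (("kick drum", 258, 219), ("clave", 2412, 2335)):
    percent = relative_error_percent(target_hz, actual_hz)
    cents = interval_cents(target_hz, actual_hz)
    print(f"{name}: {percent:.1f}% of target, {cents:.0f} cents")
```

On this relative view a given number of hertz matters more at low frequencies: the 39Hz difference in the kick drum is about 15% of its target, whereas the 77Hz difference in the clave is about 3% of its target.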

6.2.5 Rate of Fade

| Property | Electronic Kick drum | Beatboxed Kick drum | Difference |
| --- | --- | --- | --- |
| Total duration of sound (ms) | 362 | 401 | 39 |
| Burst duration (ms) | 12 | 8 | 4 |
| Rate of fade of burst (dB/ms) | 0.85 | 0.375 | 0.475 |
| Rate of fade post-burst (dB/ms) | 0.021 | 0.141 | 0.12 |
| Total rate of fade (dB/ms) | 0.079 | 0.132 | 0.053 |

Figure 6.2.8 Rates of fade of electronic and beatboxed kick drum

The differences between the shapes of each power spectrum reflect the different rates at which power in each sound fades. The RF of the electronic sound varies from 0.85dB/ms immediately after the initial burst to 0.021dB/ms throughout the rest of the sound.

The beatboxed sound has a steadier RF which varies between 0.375 dB/ms and 0.141dB/ms. Rather than an initial peak and then a tailing off of energy as in the electronic sound, the intensity of the beatboxed sound fades in steps. The intensity of the initial burst is 88dB. There is an immediate fall of 3dB before intensity is maintained at 85dB for 23ms. Intensity then fades at 0.375dB/ms for 223ms before being held at 58dB for 93ms. Intensity then fades back to its original level of 35dB in 50ms.
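The stepped fade described above can be written down as a piecewise-linear envelope. The sketch below is an illustration under stated assumptions: the levels and durations come from this paragraph, the 3dB fall is assumed to span the 8ms burst duration given in the table, and the names are hypothetical. On these figures the long fade from 85dB to 58dB over 223ms works out at about 0.12dB/ms, while 0.375dB/ms matches the 3dB fall over the 8ms burst.

```python
# (label, start level dB, end level dB, duration ms), in order of occurrence.
KICK_SEGMENTS = [
    ("burst", 88, 85, 8),       # assumes the 3dB fall spans the 8ms burst
    ("hold", 85, 85, 23),
    ("fade", 85, 58, 223),
    ("hold", 58, 58, 93),
    ("final fade", 58, 35, 50),
]


def level_at(t_ms, segments=KICK_SEGMENTS):
    """Level in dB at t_ms after onset, interpolating linearly within a segment."""
    elapsed = 0.0
    for _label, start_db, end_db, duration_ms in segments:
        if t_ms <= elapsed + duration_ms:
            fraction = (t_ms - elapsed) / duration_ms
            return start_db + (end_db - start_db) * fraction
        elapsed += duration_ms
    return segments[-1][2]  # after the last segment, stay at the final level


total_ms = sum(duration for _label, _start, _end, duration in KICK_SEGMENTS)
print("total of listed segments:", total_ms, "ms")
for label, start_db, end_db, duration_ms in KICK_SEGMENTS:
    rate = (start_db - end_db) / duration_ms
    print(f"{label}: {rate:.3f} dB/ms")
```

The listed segments add up to 397ms, close to the 401ms total duration given in the table.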

The reason that there is such a small drop in power (3dB) after the initial burst is that a trill requires a constant flow of high pressure air to make the lips vibrate. The first burst of air to force through the lips needs to be only slightly more powerful than subsequent ones to set the lips in motion and to keep the lips in motion a constant flow of high-pressure air is needed. High pressure is maintained until 34ms into the sound. After this, pressure is lost and the intensity of the sound begins to fade. This is evident when power is no longer held at 85dB. At 180ms into the sound (1.461s on scale shown) there is no longer enough pressure to force the lips open at all so the trill and the high frequency energy associated with it stop.

Subsequent intense energy of the sound comes from the nasal tone. The beatboxer can potentially continue the voicing of the kick drum sound and maintain its intensity for several seconds as air can be released from the lungs at a steady rate. The intensity is maintained at 58dB for 93ms before it begins to fade to its original level of 35dB.

6.2.6 Nasal tone in the beatboxed kick drum

On listening to the electronic kick drum sound, it is evident that there is a tonal element to it. This is illustrated in the periodicity of the waveform.

The beatboxed sound also has periodicity which is achieved by both the bilabial trill and the voicing of the nasal tone.

After the cessation of the bilabial trill, it is difficult to ascertain exactly where the nasal tone begins. There is low frequency energy (below 500Hz) characteristic of nasal tones from the beginning of the sound, however no nasal tone can be heard at least for the first 54ms (1,332ms on scale shown). At the end there is a tiny speck of energy at 1,277Hz which, according to Heselwood (2004), can be characteristic of an English nasal tone.

6.2.7 Summary of comparison

The distribution of intense energy is similar in the electronic and beatboxed kick drum sounds. The resonant frequencies of the sounds are within 39Hz of each other and both have a number of striations, but the beatboxed sound contains some high frequency energy that is not present in the electronic sound.

The RF of the electronic sound is varied, involving sharp falls in intensity, whereas the RF of the beatboxed sound is more gradual, falling in ‘steps’ where power is held at the same level for some time.

The beatboxed kick drum shows frequencies below 500Hz that are typical of an English nasal tone.

6.3 Open Hi-hat

Electronic hi-hat

Beatboxed hi-hat

This sound is a strongly articulated alveolar affricate. Air is forced up from the lungs and allowed to build up behind a closure at the alveolar ridge. On release, the tongue moves away from the alveolar ridge yet maintains a stricture of close approximation causing friction and turbulence in the air rushing through.




Fig 6.3.1 Spectrogram, waveform and power spectrum of the electronic open hi-hat


Fig 6.3.2 Averaged spectrum of the electronic open hi-hat





Fig 6.3.3 Spectrogram, waveform and power spectrum of the beatboxed open hi-hat


Fig 6.3.4. Averaged spectrum of the beatboxed open hihat

| Property | DM Hat | HBB Hat | Difference |
| --- | --- | --- | --- |
| Total duration of sound (ms) | 361 | 225 | 136 |
| Energy Density Maximum (EDM) (Hz) | 4976 | 3985 | 991 |
| Average interval between EDMs (Hz) | 822 | 1763 | 941 |
| Amplitude rise time (ms) | 22 | 51 | 29 |
| Hold time (ms) | 0 | 40 | 40 |
| Rate of fade (dB/ms) | 0.056 | 0.196 | 0.14 |

Fig. 6.3.5 Table to show properties of electronic (DM) and beatboxed (HBB) open hi-hat sounds

6.3.1 Comparison of sounds.

Both the electronic and the beatboxed hi-hat sounds contain energy of much higher frequencies than are displayed in the spectrograms in figures 6.3.1 and 6.3.3. Analysis of the sounds can only be carried out on those frequencies displayed so it is only the lower frequencies of each sound that can be compared.

6.3.2 Formant frequencies

Both electronic and beatboxed hi-hat sounds show continuous aperiodic energy over all visible frequencies, but the distribution of energy is different in each sound.

In the electronic hi-hat, intensity increases with frequency so the highest frequencies are the loudest and most prominent. The most intense frequencies are above 4,000Hz and remain strong until the end of the sound while energy fades from the lowest frequencies first.

There are formant-like bands of energy at 4,290Hz and 4,976Hz. Speech Station 2’s formant tracker also detected harmonics at 2,510Hz and 3,517Hz, but these are very weak. (See fig 6.3.6) The peaks in the averaged spectrum (fig. 6.3.2) highlight the most prominent frequencies over the whole sound.

Formant-like bands of energy are found on the spectrogram at:

  1. 2510Hz
  2. 3517Hz
  3. 4290Hz
  4. 4976Hz



Fig 6.3.6 Formant Frequencies of Electronic hi-hat

The beatboxed hi-hat also contains formants but they are more prominent than those of the electronic sound. They occur at:

  1. 872Hz
  2. 1961Hz
  3. 3985Hz
  4. 6161Hz

The distribution of energy in the beatboxed hi-hat is more even than in the electronic hi-hat. Energy is most intense around 4,000Hz, rather than in the higher frequencies as in the electronic sound. According to Kent and Read (1992 p123) ‘the major region of noise energy for the alveolar fricatives is above 4kHz’, and Heselwood (2004) observes that ‘[in spoken English alveolar fricatives] there is hardly any energy below 3,500Hz’. We cannot see far above 4,000Hz on the available spectrograms and there may be hidden EDMs in both sounds that comply with the energy patterns of alveolar fricatives. Even so, the presence of energy below 3,500Hz in the beatboxed hi-hat shows that it also has features that are uncharacteristic of an English alveolar fricative.

The discussion in part 3.0 suggests that beatboxers should avoid typical speech frequencies to produce accurate drum imitations when lyrics are not present. We cannot see whether this is achieved in the hi-hat, however it does seem that non-typical speech frequencies are accentuated by the intensity of energy below 3,500Hz.

There are potentially important frequencies of both hi-hat sounds that cannot be seen on the spectrograms used, however the comparison of EDMs in the frequencies displayed can reveal important differences between each sound.

| Electronic formants (Hz) | Interval between formants (Hz) | Beatboxed formants (Hz) | Interval between formants (Hz) |
| --- | --- | --- | --- |
| 2510 | | 872 | |
| 3517 | 1007 | 1961 | 1089 |
| 4290 | 773 | 3985 | 2024 |
| 4976 | 686 | 6161 | 2176 |

Fig 6.3.7. Distribution and intervals between EDMs of electronic and beatboxed hi-hat sounds

Average interval (electronic): 822Hz

Average interval (beatboxed): 1763Hz

The most intense formant of the beatboxed sound is nearly 1,000Hz lower than that of the electronic sound. Figure 6.3.7 shows that the formants in the beatboxed sound are also an average of 941Hz further apart than those of the electronic sound.

As mentioned previously, however, information on the intensity patterns of energy at higher frequencies is not available and may prove otherwise.

Accurate comparisons of the hi-hat sounds cannot be made with reference to their lower formant frequencies only and other features of both sounds must be carefully studied.

6.3.3 Cessation of energy at the end of the sound.

An important difference that is evident in figures 6.3.1 and 6.3.3 is the order in which frequencies are lost from each sound.

In the electronic sound, energy fades from the lower frequencies before the higher ones. Most of the energy below 1,400Hz has faded out by 190ms into the sound (492ms on scale shown) and energy continues to be lost gradually from the lower frequencies first.

In the beatboxed sound, energy is lost less gradually. At 170ms into the sound (203ms on scale shown) energy at all frequencies becomes simultaneously less intense. After this point, energy is still present at all visible frequencies but it is weaker than previously. When the sound ends completely, energy at all frequencies ceases at the same time.

Energy is lost simultaneously from all frequencies due to a sudden reduction in the velocity of air flowing through the stricture of close approximation at the alveolar ridge. The rate of fade of the hi-hat sound is controlled more by the velocity of airflow than the movement of active articulators so all frequencies change intensity in the same way at the same time. It is possible to stop the sound dead by cutting off airflow or even to increase intensity after the beginning of a sound by increasing velocity.

If the tongue was moved away from the alveolar ridge slowly and in synchrony with the reduction in velocity of airflow, there would be a more gradual reduction in frequencies and the sound would stop less abruptly, however it would not necessarily emulate the electronic sound more accurately as the highest frequencies would most likely be lost before the lower ones. To consciously control the movement of the tongue away from the alveolar ridge and gradually reduce airflow at the same time is likely to increase the duration of the beatboxed sound so that it no longer emulates the electronic sound accurately.

6.3.4 Rate Of Fade

The cessation of energy over different frequencies can be related to the rate of fade of a sound. The power spectrum in Figure 6.3.3 shows that the beatboxed sound is held at its maximum intensity for 40ms before it begins to fade, whereas the electronic sound begins to fade immediately after its maximum power is reached. The eventual RF of the beatboxed sound is 0.14dB/ms faster than that of the electronic sound.

The spectrogram of the beatboxed hi-hat suggests that the airflow is maintained at a steady velocity throughout the hold (i.e. where the power spectrum shows a constant intensity for 40ms), slowed at 170ms (203ms on scale shown) and then stopped completely at 217ms (250ms on scale shown). This pattern does not emulate the more uniform rate of fade of the electronic sound.

An important difference between beatboxed and electronic sounds is that the intensity of a beatboxed sound does not necessarily have to fall after the onset of the sound. Unlike a real or electronic hi-hat sound, the source of energy is not momentary and power can be increased, decreased or maintained at the will of the speaker. To increase power, airflow is increased or strictures are made smaller. The ability for adaption like this allows beatboxers to maintain energy at a given intensity or in the case of the beatboxed hi-hat, a long amplitude rise time can be achieved which gives the impression of the sound ‘fading in’.

Note: A real hi-hat is different to both the electronic and the beatboxed hi-hat but the electronic hi-hat aims to emulate its properties, including its gradual and consistent rate of fade.

6.3.5 Amplitude rise time.

The beatboxed waveform is an almond shape and the power spectrum forms a dome whereas the electronic waveform is more triangular with a wedge shaped power spectrum. This shows that the beatboxed sound takes longer to reach its maximum amplitude than the electronic sound. The maximum power of the electronic sound is reached within 22ms but the beatboxed sound takes 51ms, over twice as long, to reach maximum amplitude.
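Taken together with the hold time and rate of fade in Fig. 6.3.5, these rise times imply how long each sound spends fading and how far its level falls. The sketch below is a rough illustration, assuming each sound consists only of a rise, an optional hold and a single linear fade, and reading the rate of fade as dB/ms; the function name is hypothetical.

```python
def fade_budget(total_ms, rise_ms, hold_ms, fade_db_per_ms):
    """Return (time spent fading in ms, total fall in level in dB)."""
    fade_ms = total_ms - rise_ms - hold_ms
    return fade_ms, fade_ms * fade_db_per_ms


# Values from Fig. 6.3.5: total duration, rise time, hold time, rate of fade.
electronic = fade_budget(361, 22, 0, 0.056)
beatboxed = fade_budget(225, 51, 40, 0.196)

print(f"electronic: fades for {electronic[0]}ms, falling about {electronic[1]:.0f}dB")
print(f"beatboxed: fades for {beatboxed[0]}ms, falling about {beatboxed[1]:.0f}dB")
```

Under this simplification the beatboxed hi-hat spends less than half as long fading as the electronic one, which is consistent with the dome-shaped power spectrum described above.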

6.3.6 Duration

The beatboxer has much control over the duration of the hi-hat but the beatboxed version is 136ms shorter than the electronic one. In order to create a more accurate hi-hat imitation, the rate of fade of the beatboxed sound should be more gradual and this would automatically increase duration.

6.3.7 Summary of comparison

As far as can be seen from the visible frequencies, the beatboxed open hi-hat is not an accurate imitation of the electronic sound. Both sounds contain formant-like bands of continuous aperiodic energy; however, the distribution of this energy does not follow the same patterns across frequencies or across time. The rate and manner in which the two sounds fade also differ, with the electronic sound fading gradually from the lower frequencies first and the beatboxed sound fading more abruptly with all frequencies ceasing at once. The beatboxed hi-hat contains frequencies that are both typical and atypical of speech sounds made at the same place of articulation.

6.4 Summary of Results

In the beatboxed clave and kick drum sounds the patterns of intense energy matched those of the corresponding electronic sounds very accurately. The hi-hat, however, did not (as far as we can see) match the energy patterns of the electronic sound so closely.

Electronic sounds are made synthetically from scratch and programmed to vibrate at specified frequencies only. The electronic generation of transient sounds does not involve large volumes of air being shifted and the frequencies which vibrate are determined only by the signal recorded to the magnetic tape. In contrast, transient clicks and plosives generated in the mouth involve the compression and release of air which vibrates across more than one frequency. Therefore the purity of transient electronic sounds cannot be replicated perfectly in the mouth. As we have seen, energy is present in the beatboxed clave and kick sounds that is not present in the corresponding electronic sounds.

The constant turbulent energy of the hi-hat sound is easily replicated but, despite this advantage, the beatboxed sound does not imitate the electronic hi-hat accurately, as the energy density maxima of the two sounds do not follow the same pattern.

The rate of fade was not copied accurately in any of the three sounds. When it was constant in the electronic sound (as for the clave), the beatboxed RF was varied; however, when the electronic RF was varied, it was more constant in the beatboxed sound.

The most accurately imitated RF was that of the hi-hat, with both electronic and beatboxed sounds fading slowly at first and then more quickly; however, the amplitude rise time and the rate at which individual frequencies were lost were not imitated accurately.

It is relevant to mention here that the rate of fade of the clave cannot be controlled by the beatboxer because it is a transient sound. Although the kick sound is also transient, the beatboxer has more control over its rate of fade because the source of the sound is ongoing.

The production of sound in the mouth involves the movement of different articulators at different times to create the desired effect and, particularly in the beatboxed kick drum, these movements are reflected in the acoustic patterns of the sound.

In the spectrogram of the kick drum it is evident when the lips stop vibrating. The vocal fold vibration can be easily identified as a low-frequency bar of energy. The spectrograms for the electronic sounds are not so modular and it is not possible to separate the components of them so easily.

The peak energy frequencies of the clave and hi-hat are always closer together in electronic sounds than they are in beatboxed sounds. This suggests that the concentration of energy into narrow bandwidths is not easily achieved by the vocal tract.

As a general conclusion to this section, it seems that the best replicated features of electronic sounds vary according to the kind of sound being copied; in the transient sounds (the clave and kick), the resonant frequencies, burst durations and all features that can be controlled by the beatboxer are replicated very accurately. Only those features that cannot be controlled, such as the rate of fade and the presence of surplus energy, differ from the electronic sound.

In the hi-hat, the beatboxer has more control over the rate of fade and duration of the sound, yet these are not replicated as well as in the clave and kick. The hi-hat also fails to reproduce the resonant frequencies of the electronic sound as well as the clave and kick sounds do; however, the constant aperiodic energy of the electronic hi-hat is replicated well. The hi-hat sound relies more on formants than the clave and kick sounds and these are what render it an inaccurate imitation of the electronic version.

It can be concluded that the accuracy with which an electronic sound is reproduced by a beatboxer does not depend on how much control a beatboxer has over a sound, but on the nature of the sound he must produce and whether a similar sound exists in his language. The beatboxer has least control over the clave click sound and most over the hi-hat; however, the degree of accuracy with which each is reproduced runs in the opposite order. The clave is the most accurately reproduced sound and the hi-hat the least.

