Audio Processing

Audio Processing
by Edward Chow

Reference: Material here adapted from the following Related web pages and literature.

Davis Pan, "A Tutorial on MPEG/Audio Compression", IEEE Multimedia, pp. 60-74, 1995.
Audio Compression Overview by Dr. Ze-Nian Li at Simon Fraser University
A good technical overview of psychoacoustics, MPEG Audio and other audio compression algorithms.
Tutorial paper on MPEG/Audio Compression by Davis Pan
Dolby AC-3: Multichannel Perceptual Coding at Dolby
FAQ about MP3, MPEG Audio Layer 3 by Fraunhofer Institute.
Modern Audio Technology by Martin Clifford, 1992, Prentice Hall,
Data and Computer Communications, by William Stallings, 5th edition, 1999, Macmillan.

Element of Sound

Longitudinal wave: medium vibrate in the same direction of wave advance.
Sound can be converted via transducer to other form of energy for storage.
Inverse Square Law: the intensity of the sound radiation decreases in proportion to the square of its distance from the sound source.
Acoustic masking- one sound can cover (mask) another making it inaudible (especially in mid- to treble range)
Sound Pressure Level (SPL) is a variation above and below normal atmospheric pressure.
Sound depends on the production of changes in atmospheric pressure.
Audio band- 20-20kHz

0-20Hz- infrasonic

above 20kHz- ultrasonic (supersonic)

Material here adapted from “Modern Audio Technology” by Martin Clifford, 1992, Prentice Hall, and “Data and Computer Communications”, by William Stallings, 4th edition, 1994, Macmillan.

Hearing-response Characteristics

Basic characteristics of a sound wave

Harmonics and Tone Color

The same note play on different musical instruments generate different tone.
The harmonics composed in tones make them different.

Here are some math related to the harmonics

Example of periodical signals

Example of phase difference

Time domain and frequency domain

Fletcher-Munson Curve

Human hearing does not have flat frequency response.
More audio energy is needed for low and high audio frequencies to produce the same sensation of loudness as for midrange tones.
But once it is too loud, the “sensation” of loudness is relatively uniform.

Noise and frequency range of musical instruments

Sound Measurement

The Decibel is used for comparing strengths of a pair of acoustical or electrical powers, voltages, currents. It is a ratio not specific absolute value.
db formula
where P_RL is strength of the reference signal and P₁ is the strength of signal to be compared.

When P1=3.1PRL, dB=20*Log10(3.1PRL/PRL)=20*Log10(3.1)=20*0.49993=9.827 ~ 10. If we use 3.162 instead 3.1, it will closer to 10.

Measurement of Sound Pressure

bar (barometric)- an CGS unit of atmospheric pressure
= 106 dyn/cm2.
= 14.7 lb/in2at sea level
The sound pressure received by microphone is much smaller and is measured in terms of a millionth of an atmosphere or in microbars.
a threshold of hearing begin ~ 0.0002 mbar.
Dyn is defined as the force that will produce a velocity of 1 cm/sec when acting on a mass of

Microphone Measurements

Sound range in dBs

Digital vs. Analog

Reality: signal got distorted over distance because

impedance (within transmission medium)
interference (outside force, e.g. cloud, lightening)

Data Encoding and Modulation

Modulation of analog signals for digital data

Pulse Code Modulation (PCM)

Quantization error can be reduced by having higher sampling rate (sample more frequently) and more quantization level (more bits for each sample).

Example: With Stereo, 16 bits/sample, 44kHz sampling rate, PCM encoding, how many bits of data will be generated by a three minute sound recording?
Ans: 3min*60sec/min*44000samples/sec*16bits/sample*2(channel)/8bits/byte=31.68MB/s

Masking

A phenomenon of the human hearing system. Normal human ears are sensitive to a wide range of frequencies. However, when a lot of signal energy is present at one frequency, the ear cannot hear lower energy at nearby frequencies.
We say that the louder frequency masks the softer frequencies. The louder frequency is called the masker. Here is the digram the depict what masking phenomenon:

Figure 2.5. Audio masking threshold. (from Mirinal Mandal's "Multimedia Signals and Systems") , The threshold of hearing determines the weakest sound audible by the human ear in the 20-20KHz range. A masker tone (at 300 Hz) raises this threshold in the neighboring frequency range. It is observed that two tones at 180 and 450 Hz are masked by the masker, i.e., these tones will not be audible. SPL: sound pressure level.

Sub Band Coding (SBC)

The basic idea of SBC is to save signal bandwidth by throwing away information about frequencies which are masked.
The result won't be the same as the original signal, but if the computation is done right, human ears can't hear the difference.
Divide signal into bands and perform masking computation and throw away weak signals in each band.

MP3: MPEG Audio Layer 3

It is a perceptual audio coding scheme, exploiting the masking property of the human ear, and trying to maintain the original sound quality as far as possible.
MPEG Audio specifies a family of three audio coding schemes, simply called Layer-1, Layer-2, and Layer-3. From Layer-1 to Layer-3, encoder complexity and performance (sound quality per bitrate) are increasing.
In MPEG 1 Audio standard,

all three layers may use 32, 44.1 or 48 kHz sampling frequency.
All Layers are allowed to work with similar bitrates:
                            Layer-1: from 32 kbps to 448 kbps
                             Layer-2: from 32 kbps to 384 kbps
                             Layer-3: from 32 kbps to 320 kbps

In MPEG-1: max. 1.5 Mbits/sec for audio and video, About 1.2 Mbits/sec for video, 0.3 Mbits/sec for audio
In MPEG 2 Audio standard, there two new extensions:

low sample rate extension: extend sampling rates to 8, 16, 22.05 or 24 kHz.
multichannel extension: address surround sound applications, with up to 5 main audio channels (left, center, right, left surround, right surround) and optionally 1 extra "low frequency enhancement (LFE)" channel for subwoofer signals;
multilingual extension: allow the inclusion of up to 7 more audio channels.

Compression ratios: MP3 achieve 1:10-1:12 compression ratio, at about 64 kbit/s per audio channel, while maintaining the original CD sound quality .
MPEG Audio Algorithm

Steps in algorithm:

Use convolution filters to divide the audio signal (e.g., 48 kHz sound) into frequency subbands that approximate the 32 critical bands --> sub-band filtering.
Determine amount of masking for each band caused by nearby band using the results shown above (this is called the psychoacoustic model).
If the power in a band is below the masking threshold, don't encode it.
Otherwise, determine number of bits needed to represent the coefficient such that noise introduced by quantization is below the masking effect (Recall that 1 bit of quantization introduces about 6 dB of noise).
Format bitstream

Example:

After analysis, the first levels of 16 of the 32 bands are these:

----------------------------------------------------------------------

Band        1  2   3   4  5  6   7   8   9  10  11  12  13  14  15  16  

Level (db)  0  8  12  10  6  2  10  60  35  20  15   2   3   5   3   1

----------------------------------------------------------------------

If the level of the 8th band is 60dB,

it gives a masking of 12 dB in the 7th band, 15dB in the 9th.

Level in 7th band is 10 dB ( < 12 dB ), so ignore it.

Level in 9th band is 35 dB ( > 15 dB ), so send it.

--> Can encode with up to 2 bits (= 12 dB) of quantization error.

MPEG Layers

MPEG defines 3 layers for audio. Basic model is same, but codec complexity increases with each layer.
Divides data into frames, each of them contains 384 samples, 12 samples from each of the 32 filtered subbands as shown below.

Figure: Grouping of Subband Samples for Layer 1, 2, and 3

Layer 1: DCT type filter with one frame and equal frequency spread per band. Psychoacoustic model only uses frequency masking.
Layer 2: Use three frames in filter (before, current, next, a total of 1152 samples). This models a little bit of the temporal masking.
Layer 3: Better critical band filter is used (non-equal frequencies), psychoacoustic model includes temporal masking effects, takes into account stereo redundancy, and uses Huffman coder.

Audio Recording and Editing Tool: Adobe Audition

Audition is a audio editing software that records, playbacks, and editing sound files, and provides utilities for coverting the sournd files in different format.
For the recording exercise, you can use PCs in El Pomar media library, and bring your own headset with a microphone. The red/microphone connector go to the socket with the red microphone symobl in the sound card or that under the front panel cover. The black/headset speaker connector go to the socket with the headset symobl in the sound card or that under the front panel cover.
Due to health reason, the school no long lends out the headset. I have a light microphone you can check out from me.
Select Start | Program | Audition 1.5 or click on the Audition icon, on the desktop, and the following main window appear.

Select File | "new" menuitem, "New Waveform" dialog window appear.
Specify the sampling frequency, bits per sample, and mono/stereo.

For human speech we can get by with 11kHz, 8bit, and mono setting.

For singing, we can increase from 11 kHz to 22 kHz.

For music recording, use 44kHz and 16 bit.

Hit OK after choosing the sampling parameters.

Press "Record" button on the lower left corner of the screen to start recording.
Hit "Stop" button when you are done.
The recorded wave form will be displayed on the main window.
Most of the time, the signal will be pretty weak.
Select Transform | Amplify | Normalized menuitem to enhance the signal to a "normal" level.
Editng the sound file by selecting the area (point and drag) of the wave form and hit "delete" button.
Select File | save as menuitem. Choose filename and audio file type, typically .wav (Microsoft) format or .mp3 format.
.ra provides quite impressive compression ratio.

Audition can save sound as .wav or .mp3 but not .ra file. Real Audio encoding considers the speed of the transmission media and provides even higher compression ratio. See the example below. It is more than 5:1 ratio.

To convert among wave, mp3, to rm audio file, you can use file converter such as mp3 rm converter, http://www.mp3towav.org/MP3-RM-Converter/

To play back real audio file, download the basic real player 10 from http://www.real.com/.

Here are three sample audio files generated by CoolEdit an earlier version of Audition:
Vincent singing 176KB, au format and

Vincent singing, 178KB, wav format,

Vincent singing, 113KB, mp3 format and

Vincent singing, 33KB, realaudio format,

WavePad:

NCH Swift Sound provides a freeware WavePad audio editing tool. It has the same basic editing capability. Besides saving pcm .wave and MP3 files, it can also generate RSS Podcast audio format (actually it is .rss with .mp3).
- Example of .RSS file generated:
  <?xml version="1.0" ?>
  <rss version="2.0">
  <channel>
  <title>MikeInStadium</title>
  <description>MikeInStadium</description>
  <link>http://cs.uccs.edu/~cs525/audio/wavepad/</link>
  <item>
  <title>MikeInStadium</title>
  <description>MikeInStadium</description>
  <enclosure url="http://cs.uccs.edu/~cs525/audio/wavepad/MikeInStadium.mp3" length="94877" type="audio/mpeg" />
  </item>
  </channel>
  </rss>
Text-To-Speech: WavePad also contains text speech option (need to download speechapi.exe and mstts.exe from http://www.nch.com.au/speech/ or http://www.microsoft.com/msagent/downloads.htm
Here is the sound file that include first my voice recording, following by a text-to-speech synthesis results for "This is a CS525 test." with "Mike in Stadium" voice characteristic setting.
Potential research project: Study Agent SDK and explore the use of a window machine as a server, that given an text input (stream), produce a mp3 voice stream.

Exercises:

With mono, 16 bits, 22kHz, PCM encoding, how many bytes of data will be generated by a three minute sound recording?
Use Audition to record a less than 5 second voice with mono, 16 bit, 22kHz, encoding. Edit out the unnecessary silence portion of the sound track and apply the normalized special effect. Save it in .wav file and .mp3 formats. sFtp the two audio files to your cs525 home page directory and create a hyperlink in your class personal web page.
In EN233/EN138/A210 lab, we have MATLAB software package which allows to specify waveform and generate the correponding audio wave file for testing. We can use it to verify the audio masking concept used in MP3. The following MATLAB code from Mirinal Mandal's "Multimedia Signals and Systems" textbook example 2.1 will generate three wave files that demo strate the audio masking concept.

fs=44100; % sampling frequency
nb=16; % 16-bit/sample
sig1=0.5*sin(2*pi*(2000/44100)*[1:1*44100]); % 2000 Hz, 1 sec audio
sig2=0.5*sin(2*pi*(2150/44100)*[1:1*44100]); % 2150 Hz, 1 sec audio
sig3=[sig1 sig1+sig2]; % 2000 Hz and 2150 Hz tones are equally strong
sig4=[sig1 sig1+0.1*sig2]; % 2000 Hz is 20dB stronger than 2150 Hz
sig5=[sig1 sig1+0.01*sig2]; % 2000 Hz is 40dB stronger than 2150 Hz
wavwrite(sig1, fs, nb, 'C:\work\cs525\S2010\chow\sig1.wav');
wavwrite(sig2, fs, nb, 'C:\work\cs525\S2010\chow\sig2.wav');
wavwrite(sig3, fs, nb, 'C:\work\cs525\S2010\chow\test1.wav');
wavwrite(sig4, fs, nb, 'C:\work\cs525\S2010\chow\test2.wav');
wavwrite(sig5, fs, nb, 'C:\work\cs525\S2010\chow\test3.wav');

Follow the following step to generate the wave files.
- Login to A210 PC with your csnet account. Double click on the MATLAB R2009a icon, similar to on the desktop to start the application.
- Alternatively, you can also use remote desktop connection to login to rats or lats.eas.uccs.edu from home with your ufp account. Make sure you select "ufp" domain on the 3rd "log on to " selection.
  - They are win2003 servre. For MathLab, select the MATLAB R2009b icon on the desktop, or start | All Programs | MATLAB | R2007B | MATLAB R2009b.
- In the "My Computer" verify that your ufp directory is mapped as Z: drive.
- If it is, create a cs525 directory there and copy the above Mathlab code into the lower right command window. It will generate the thress test wave files.
- If you can not find yoru ufp directory and you would like to use the c:\temp directory for saving the , you need to replace the "Z:\cs525\" path in the last three command of the above mathlab code with "C:\temp" and then copy it into the command window.
- Play the three wave files. Verify that in test2.wav you barely detect the 2150 Hz sound which is 20dB or 0.1 weaker, and in test3.wav, you cannot detect the 2150 Hz sound which is 40dB or 0.01 weaker.
- Note that the masking threshold is subjective, varying from person to person. You may want to experiment with the dB ratio and see how sensitive your ear is.