General Audio Coding (AAC based)
This key component of MPEG-4 Audio covers the bitrate range from 16 kbit/s per channel up to bitrates higher than 64 kbit/s per channel. Using MPEG-4 General Audio, quality levels ranging from better than AM up to transparent audio quality can be achieved. MPEG-4 General Audio supports four so-called Audio Object Types (see the paper on MPEG-4 Profiling in this issue), of which AAC Main, AAC LC and AAC SSR are derived from MPEG-2 AAC [2], adding some functionalities to further improve the bitrate efficiency. The fourth Audio Object Type, AAC LTP, is unique to MPEG-4 but defined in a backwards compatible way.
Since MPEG-4 Audio is defined such that it remains backwards compatible to MPEG-2 AAC, it supports all tools defined in MPEG-2 AAC, including the tools exclusively used in the Main Profile and the Scalable Sampling Rate (SSR) Profile, namely frequency domain prediction and the SSR filterbank plus gain control.
Figure 2: Building Blocks of the MPEG-4 General Audio Coder

Additionally, MPEG-4 Audio defines ways for bitrate scalability. The supported methods for bitrate scalability are described here. Fig. 2 shows the arrangement of the building blocks of an MPEG-4 GA encoder in the processing chain. These building blocks will be described in the following subsections. The same building blocks are present in a decoder implementation, performing the inverse processing steps. For the sake of simplicity we omit references to decoding in the following subsections unless explicitly necessary for understanding the underlying processing mechanism.
Filterbank and block switching 
One of the main components of each transform coder is the conversion of the incoming audio signal from the time domain into the frequency domain. MPEG-2 AAC supports two different approaches to this. The standard transform is a straightforward Modified Discrete Cosine Transform (MDCT). However, the AAC SSR Audio Object Type applies a different conversion using a hybrid filterbank.
Standard Filterbank 
The filterbank in MPEG-4 GA is derived from MPEG-2 AAC, i.e. it is an MDCT supporting block lengths of 2048 points and 256 points which can be switched dynamically. Compared to previously known transform coding schemes the long block transform is rather long, offering improved coding efficiency for stationary signals. The shorter of the two block lengths is rather small, providing optimized coding capabilities for transient signals. MPEG-4 GA supports an additional mode with block lengths of 1920 / 240 points to facilitate scalability with the speech coding algorithms in MPEG-4 Audio. All blocks are overlapped by 50% with the preceding and the following block.
For improved frequency selectivity the incoming audio samples are windowed prior to the transform. MPEG-4 AAC supports two different window shapes that can be switched dynamically: a sine shaped window and a Kaiser-Bessel Derived (KBD) window, the latter offering improved far-off rejection compared to the sine shaped window.
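As an illustration, the windowed MDCT and its inverse can be sketched as follows. This is a direct O(N²) formulation with the sine window; real implementations use FFT-based fast algorithms, and the small block length used below is purely illustrative:

```python
import math

def sine_window(two_n):
    # Sine window; it fulfills the Princen-Bradley condition required
    # for perfect reconstruction with 50% overlap.
    return [math.sin(math.pi / two_n * (n + 0.5)) for n in range(two_n)]

def mdct(block):
    # Direct MDCT of one windowed block: 2N time samples -> N coefficients.
    two_n = len(block)
    n_half = two_n // 2
    w = sine_window(two_n)
    return [sum(w[n] * block[n] *
                math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
                for n in range(two_n))
            for k in range(n_half)]

def imdct(coeffs):
    # Inverse transform plus synthesis windowing; overlap-adding two
    # consecutive outputs cancels the time domain aliasing.
    n_half = len(coeffs)
    two_n = 2 * n_half
    w = sine_window(two_n)
    return [w[n] * (2.0 / n_half) *
            sum(coeffs[k] *
                math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
                for k in range(n_half))
            for n in range(two_n)]
```

Overlap-adding the second half of one inverse-transformed block with the first half of the next reconstructs the input signal, which is the time domain aliasing cancellation property the 50% overlap relies on.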
An important feature of the time-to-frequency transform is the signal adaptive selection of the transform length. This is controlled by analyzing the short-time variance of the incoming time signal.
To assure block synchronicity between two audio channels with different block length sequences, eight short transforms are performed in a row, using 50% overlap each and specially designed transition windows at the beginning and the end of a short sequence. This keeps the spacing between consecutive blocks at a constant 2048 input samples.
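The block length decision itself is left to the encoder. A hypothetical energy-based attack detector along the following lines illustrates the idea; the sub-block count and threshold are illustrative choices, not taken from the standard:

```python
def needs_short_blocks(samples, n_sub=8, threshold=10.0):
    # Hypothetical transient detector: split the long block into
    # sub-blocks and compare their energies. A sudden jump in
    # short-time energy suggests a transient, so short transforms
    # should be used for this block.
    size = len(samples) // n_sub
    energies = [sum(s * s for s in samples[i * size:(i + 1) * size])
                for i in range(n_sub)]
    for prev, cur in zip(energies, energies[1:]):
        if cur > threshold * max(prev, 1e-12):
            return True
    return False
```

A stationary signal keeps the long transform, while a silent-to-loud transition triggers the switch to the short block sequence.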
For further processing of the spectral data in the quantization and coding part, the spectrum is arranged in so-called scalefactor bands roughly reflecting the Bark scale of the human auditory system.
Filterbank and Gain Control in SSR Profile 
In the SSR profile the MDCT is preceded by a processing block containing a uniformly spaced 4-band Polyphase Quadrature Filter (PQF) and a gain control module. The gain control can attenuate or amplify the output of each PQF band to reduce pre-echo effects. After gain control is performed, an MDCT is calculated on each PQF band, having a quarter of the length of the original MDCT.
Frequency domain prediction 
The frequency domain prediction improves redundancy reduction for stationary signal segments. It is only supported in the Audio Object Type AAC Main. Since stationary signals are nearly always found in long transform blocks, it is not supported in short blocks. The actual implementation of the predictor is a second order backwards adaptive lattice structure, calculated independently for every frequency line. The use of the predicted values instead of the original ones can be controlled on a scalefactor band basis and is decided based on the achieved prediction gain in that band. To improve stability of the predictors, a cyclic reset mechanism is applied which is synchronized between encoder and decoder via a dedicated bitstream element. The required processing power of the frequency domain prediction and its sensitivity to numerical imperfections make this tool hard to use on fixed point platforms. Additionally, the backwards adaptive structure of the predictor makes such bitstreams quite sensitive to transmission errors.
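The following sketch illustrates the principle of a backwards adaptive second order predictor operating on a single frequency line across successive frames. For brevity it uses a plain normalized LMS update instead of the lattice structure prescribed by the standard; the step size is an illustrative choice:

```python
def prediction_gain(line_history, mu=0.05):
    # Backwards adaptive second order predictor for one frequency line:
    # coefficients are updated from past, already decoded values only,
    # so no side information has to be transmitted for them.
    a1 = a2 = 0.0          # predictor coefficients
    x1 = x2 = 0.0          # last two values of this line
    sig = err = 0.0        # signal and residual energies
    for x in line_history:
        pred = a1 * x1 + a2 * x2
        e = x - pred
        sig += x * x
        err += e * e
        norm = x1 * x1 + x2 * x2 + 1e-12
        a1 += mu * e * x1 / norm   # normalized LMS update
        a2 += mu * e * x2 / norm
        x2, x1 = x1, x
    return sig / max(err, 1e-12)   # >1 means prediction pays off
```

A nearly stationary line yields a high prediction gain, which is exactly the per-band criterion used to decide whether the predicted values replace the original ones.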
Long term prediction (LTP) 
Long term prediction (LTP), newly introduced in MPEG-4, is an efficient tool for reducing the redundancy of a signal between successive coding frames. This tool is especially effective for the parts of a signal which have a clear pitch structure. The implementation complexity of LTP is significantly lower than the complexity of the MPEG-2 AAC frequency domain prediction. Because the long term predictor is a forward adaptive predictor (prediction coefficients are sent as side information), it is inherently less sensitive to round-off numerical errors in the decoder or bit errors in the transmitted spectral coefficients.
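A minimal sketch of such a forward adaptive pitch predictor search is given below. The lag range and the restriction to prediction segments lying fully in the past are simplifications of this sketch, not properties of the standard:

```python
def ltp_search(history, frame, min_lag, max_lag):
    # Search the past decoded samples for the lag with the smallest
    # residual energy and compute the matching gain. Lag and gain are
    # sent as side information (forward adaptive prediction).
    best_lag, best_gain = 0, 0.0
    best_err = sum(f * f for f in frame)  # energy without prediction
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        if start < 0 or start + len(frame) > len(history):
            continue  # keep the sketch simple: segment fully in the past
        past = history[start:start + len(frame)]
        energy = sum(p * p for p in past)
        if energy <= 0.0:
            continue
        gain = sum(p * f for p, f in zip(past, frame)) / energy
        err = sum((f - gain * p) ** 2 for f, p in zip(frame, past))
        if err < best_err:
            best_err, best_lag, best_gain = err, lag, gain
    return best_lag, best_gain, best_err
```

For a signal with a clear pitch the search locks onto the pitch period (or a multiple of it) and the residual energy drops to nearly zero, which is where the bitrate saving comes from.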
Quantization 
The adaptive quantization of the spectral values is the main source of the bitrate reduction in all transform coders. It assigns a bit allocation to the spectral values according to the accuracy demands determined by the perceptual model, realizing the irrelevancy reduction. The key components of the quantization process are the quantization function actually used and the noise shaping that is achieved via the scalefactors. The quantizer used in MPEG-4 GA has been designed similarly to the one used in MPEG-1/2 Layer-3. It is a non-linear quantizer with an x^(3/4) characteristic. The main advantage of this non-linear quantization over a conventional linear quantizer is the implicit noise shaping that this quantization creates. The absolute quantizer stepsize is determined via a specific bitstream element. It can be adjusted in 1.5 dB steps.
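The power-law quantizer can be sketched as follows. The rounding offset 0.4054 is a value commonly used in encoder implementations, not something the decoder depends on; one step of the stepsize parameter corresponds to an amplitude factor of 2^0.25, i.e. roughly 1.5 dB:

```python
def quantize(x, stepsize):
    # Non-linear quantization with the x^(3/4) characteristic: large
    # values are quantized more coarsely, giving implicit noise shaping.
    scale = 2.0 ** (-0.25 * stepsize)   # one stepsize step = ~1.5 dB
    mag = int((abs(x) * scale) ** 0.75 + 0.4054)
    return mag if x >= 0 else -mag

def dequantize(q, stepsize):
    # Decoder side: invert the power law.
    scale = 2.0 ** (0.25 * stepsize)
    return (abs(q) ** (4.0 / 3.0)) * scale * (1 if q >= 0 else -1)
```

Increasing the stepsize shrinks the range of quantized values and thereby the number of bits needed, at the price of larger quantization noise.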
Scalefactors 
While there already is an inherent noise shaping in the non-linear quantizer, it is usually not sufficient to achieve acceptable audio quality. To improve the subjective quality of the coded signal the noise is further shaped via scalefactors. Scalefactors work as follows: they are used to amplify the signal in certain spectral regions (the scalefactor bands) to increase the signal-to-noise ratio in these bands. Thus they implicitly modify the bit allocation over frequency, since higher spectral values usually need more bits to be coded afterwards. As for the global quantizer, the stepsize of the scalefactors is 1.5 dB. To properly reconstruct the original spectral values in the decoder, the scalefactors have to be transmitted within the bitstream. MPEG-4 GA uses an advanced technique to code the scalefactors as efficiently as possible. First, it exploits the fact that scalefactors usually do not change much from one scalefactor band to the next, so a differential encoding already provides some advantage. Second, it uses a Huffman code to further reduce the redundancy within the scalefactor data.
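The differential coding step can be sketched as follows; the subsequent Huffman coding of the differences is omitted:

```python
def encode_scalefactors(sfs):
    # Transmit the first scalefactor, then only the (usually small)
    # differences between neighbouring bands. In the real bitstream
    # these differences are then Huffman coded.
    return [sfs[0]] + [b - a for a, b in zip(sfs, sfs[1:])]

def decode_scalefactors(diffs):
    # Decoder side: accumulate the differences back into scalefactors.
    sfs = [diffs[0]]
    for d in diffs[1:]:
        sfs.append(sfs[-1] + d)
    return sfs
```

Since neighbouring scalefactors are similar, the differences cluster around zero, which is exactly the situation a Huffman code exploits well.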
Noiseless coding 
The noiseless coding kernel within an MPEG-4 GA encoder tries to optimize the redundancy reduction within the spectral data coding. The spectral data is encoded using a Huffman code which is selected from a set of available codebooks according to the maximum quantized value. The set of available codebooks includes one used to signal that all spectral coefficients in the respective scalefactor band are "0", implying that neither spectral coefficients nor a scalefactor are transmitted for that band. The selected table has to be transmitted inside the so-called section_data, creating a certain amount of side-information overhead. To find the optimum tradeoff between selecting the optimum table for each scalefactor band and minimizing the number of section_data elements to be transmitted, an efficient grouping algorithm is applied to the spectral data.
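A simplified selection of the codebook by the maximum absolute quantized value might look as follows. The table below lists only one codebook per size class for brevity, whereas the standard defines pairs of codebooks per class plus an escape mechanism:

```python
# Illustrative mapping: largest absolute value representable ->
# codebook number (0 signals an all-zero band).
CODEBOOK_MAX = [(0, 0), (1, 1), (2, 3), (4, 5), (7, 7), (12, 9)]

def select_codebook(band_values):
    # Pick the smallest codebook able to represent the band's peak
    # quantized value; larger values fall back to the escape codebook.
    peak = max((abs(v) for v in band_values), default=0)
    for max_val, book in CODEBOOK_MAX:
        if peak <= max_val:
            return book
    return 11  # escape codebook
```

In a real encoder this per-band choice is then balanced against the section_data overhead by merging adjacent bands that can share a codebook.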
Joint stereo coding 
Joint stereo coding methods try to increase the coding efficiency when encoding stereo signals by exploiting commonalities between the left and right signal. MPEG-4 GA contains two different joint stereo coding algorithms, namely Mid-Side (MS) stereo coding and Intensity stereo coding. MS stereo applies a matrix to the left and right channel signals, computing the sum and difference of the two original signals. Whenever a signal is concentrated in the middle of the stereo image, MS stereo can achieve a significant saving in bitrate. Even more important is the fact that by applying the inverse matrix in the decoder the quantization noise becomes correlated and falls in the middle of the stereo image, where it is masked by the signal.
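The MS matrixing and its inverse can be sketched as:

```python
def ms_encode(left, right):
    # Mid carries the sum, side the difference. For a signal centred
    # in the stereo image the side channel is close to zero and very
    # cheap to code.
    mid = [(l + r) * 0.5 for l, r in zip(left, right)]
    side = [(l - r) * 0.5 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    # Inverse matrix in the decoder recovers left and right.
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

For identical left and right channels the side signal vanishes entirely, so essentially only one channel has to be coded.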
Intensity stereo coding is a method that achieves a saving in bitrate by replacing the left and the right signal by a single representative signal plus directional information. This replacement is psychoacoustically justified in the higher frequency range, since the human auditory system is insensitive to the signal phase at frequencies above approximately 2 kHz.
Intensity stereo is by definition a lossy coding method; thus it is primarily useful at low bitrates. For coding at higher bitrates only MS stereo is used.
Temporal noise shaping 
Conventional transform coding schemes often encounter problems with signals that vary heavily over time, especially speech signals. The main reason for this is that the distribution of quantization noise can be controlled over frequency but is constant over a complete transform block. If the signal characteristic changes drastically within such a block without leading to a switch to shorter transform lengths, e.g. in the case of pitchy speech signals, this equal distribution of quantization noise can lead to audible artifacts.
To overcome this limitation, a new feature called Temporal Noise Shaping (TNS) [3] was introduced into MPEG-2 AAC. The basic idea of TNS relies on the duality of time and frequency domain. TNS uses a prediction approach in the frequency domain to shape the quantization noise over time. It applies a filter to the original spectrum and quantizes this filtered signal. Additionally, quantized filter coefficients are transmitted in the bitstream. These are used in the decoder to undo the filtering performed in the encoder, leading to a temporally shaped distribution of quantization noise in the decoded audio signal.
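The filtering principle can be sketched as follows. The predictor coefficients are assumed given here; an encoder would derive them from the spectrum itself (e.g. via its autocorrelation) and transmit them in quantized form:

```python
def tns_analysis(spectrum, coeffs):
    # Apply the prediction error filter along the frequency axis;
    # the residual is what gets quantized and transmitted.
    out = []
    for i, x in enumerate(spectrum):
        pred = sum(a * spectrum[i - j - 1]
                   for j, a in enumerate(coeffs) if i - j - 1 >= 0)
        out.append(x - pred)
    return out

def tns_synthesis(residual, coeffs):
    # Decoder side: run the inverse (all-pole) filter to undo the
    # analysis filtering. Without coefficient quantization the round
    # trip is exact; with quantization, the decoder's inverse filter
    # shapes the quantization noise over time.
    out = []
    for i, r in enumerate(residual):
        pred = sum(a * out[i - j - 1]
                   for j, a in enumerate(coeffs) if i - j - 1 >= 0)
        out.append(r + pred)
    return out
```

Because the quantization noise added to the residual passes through the decoder's synthesis filter, its temporal envelope follows that of the signal, which is the intended noise shaping over time.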
TNS can be viewed as a postprocessing step of the transform, creating a continuous signal adaptive filterbank instead of the conventional two-step switched filterbank approach. The actual implementation of the TNS approach within MPEG-2 AAC and MPEG-4 GA allows for up to three distinct filters applied to different spectral regions of the input signal, further improving the flexibility of this novel approach.
Perceptual noise substitution (PNS) 
A feature newly introduced into MPEG-4 GA, i.e. not available within MPEG-2 AAC, is Perceptual Noise Substitution (PNS) [4]. It aims at a further optimization of the bitrate efficiency of AAC at lower bitrates.
The technique of Perceptual Noise Substitution is based on the observation that one noise sounds like the other. This means that the actual fine structure of a noise signal is of minor importance for the subjective perception of such a signal. Consequently, instead of transmitting the actual spectral components of a noisy signal, the bitstream just signals that this frequency region is noise-like and gives some additional information on the total power in that band. PNS can be switched on a scalefactor band basis, so even if only some spectral regions have a noisy structure, PNS can be used to save bits. In the decoder, randomly generated noise is inserted into the appropriate spectral region according to the power level signaled within the bitstream.
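A minimal sketch of the substitution principle, with the coding of the power value itself omitted:

```python
def pns_encode(band):
    # Encoder: replace the coefficients of a noise-like band by its
    # total power -- the only information that gets transmitted.
    return sum(x * x for x in band)

def pns_decode(power, band_width, rng):
    # Decoder: insert pseudo-random noise and rescale it so that the
    # reconstructed band has the signalled power.
    noise = [rng.uniform(-1.0, 1.0) for _ in range(band_width)]
    energy = sum(n * n for n in noise)
    gain = (power / energy) ** 0.5 if energy > 0 else 0.0
    return [gain * n for n in noise]
```

The reconstructed coefficients differ from the originals sample by sample, but their power (and thus the perceived loudness of the noise) matches what was signalled.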
From the above description it is obvious that the most challenging task in the context of PNS is not to enter the appropriate information into the bitstream, but to reliably determine which spectral regions may be treated as noise-like and thus may be coded using PNS without creating severe coding artifacts. Considerable work has been spent on this task, most of which is reflected in [5].
Source: http://www.chiariglione.org/mpeg/tutorials/papers/icj-mpeg4-si/09-natural_audio_paper/gacoding.html