Introduction to Digital Audio

Introduction

First, we need some background in Digital Signals. This can be mathematically quite advanced, but since I would like this article to be accessible to as wide an audience as possible, here is a link that explains what is needed (not even Calculus is required):

https://brianmcfee.net/dstbook-site/content/intro.html

An important point not emphasised in the above: if a signal has maximum frequency f, Shannon's sampling theorem guarantees not only that it can be reconstructed when sampled at a rate above 2f, but that it can be reconstructed exactly, with no phase shift, ringing, blurring, etc. This may seem strange. I will write another Insights article explaining it in a way that is easy to understand, but for now, accept it. Usually, some quite advanced math is needed; I will reference that in the article, but only for mathematically advanced readers. That article will also explain exactly how upsampling and downsampling are done.

The most fundamental building block of modern digital audio is quickly becoming outdated: CD audio sampled at 44.1 kHz, with each sample 16 bits, abbreviated as 44.1/16.

Dither

We know from the link on digital signals that 44.1/16 has a signal-to-noise ratio (SNR) of 96 dB. But enter dither:

What is dither, and is it still relevant in the Hi-Res audio age?

The actual SNR using triangular (TPDF) dither is about 112 dB.
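To make the idea of dither concrete, here is a minimal Python sketch, not taken from the linked article. It assumes a quantiser step of one least significant bit (LSB) and generates triangular (TPDF) dither as the difference of two uniform random numbers; the function names and the 0.3 LSB test tone are purely illustrative. It shows that a signal too small to flip a bit vanishes without dither, but survives, buried in noise, with it.

```python
import math
import random


def quantize(value, step=1.0):
    """Round to the nearest multiple of `step` (one LSB)."""
    return step * math.floor(value / step + 0.5)


def tpdf_dither(rng, step=1.0):
    """Triangular-PDF noise spanning +/- one LSB: difference of two uniform draws."""
    return (rng.random() - rng.random()) * step


def requantize(samples, use_dither, seed=0):
    """Quantize every sample, optionally adding TPDF dither first."""
    rng = random.Random(seed)
    return [quantize(s + (tpdf_dither(rng) if use_dither else 0.0)) for s in samples]


def mean_product(a, b):
    """Average of a[i] * b[i]; measures how much of `b` survives in `a`."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


# A sine wave only 0.3 LSB tall: too small to flip any bit on its own.
N = 20000
signal = [0.3 * math.sin(2 * math.pi * 50 * i / N) for i in range(N)]

plain = requantize(signal, use_dither=False)
dithered = requantize(signal, use_dither=True)

print("undithered output levels:", sorted(set(plain)))
print("signal power:            ", round(mean_product(signal, signal), 4))
print("dithered x signal:       ", round(mean_product(dithered, signal), 4))
```

Without dither every output sample is zero, so the tone is simply gone. With dither, the correlation between the output and the original tone comes out close to the tone's own power, which is the sense in which dither trades distortion for benign noise.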

From Shannon, 44.1/16 allows transmitting frequencies up to about 22 kHz.

To determine the number of bits needed, the background noise of high-quality recordings needs to be examined. The minimum SNR under ideal conditions with the best equipment is 110 dB. 130 dB would be a very reasonable SNR to aim at, allowing a good margin of safety. Indeed, that is close to the thermal noise limit of a Digital to Analogue Converter (DAC). More than that would seem like ‘gilding the lily’.

Using 16 bits with dither gives an SNR of 112 dB. As we will see, we can increase this further, achieving well over 130 dB. Fourteen bits are found to be enough to achieve 130 dB with modern high-quality DACs.

Consequences of Aliasing in a DAC

A fascinating phenomenon happens when you convert digital to analogue due to aliasing. You get your original audio plus reflections of it that go on forever up the frequency range. The output needs to be filtered at about 20 kHz to eliminate those. They are above audibility, so leaving them there has no direct audible consequences, but they can play havoc with amplifiers, etc., when listening to audio. Some designers don’t bother removing them; their DACs are called NOS (non-oversampling) DACs. Most designers, though, like to remove them.

Combine this with the filter that limits the recorded signal to 22 kHz, so that Shannon holds and the signal can be exactly reproduced without aliasing, and two hard-to-design steep analogue filters are required. Well, life is not perfect, and the first DACs to appear did just this.

Then engineers started to have bright ideas.

Oversampling

Is there an easier way to tackle the filter issue in the CD player? While the minimum frequency you can sample at to reproduce 22 kHz is 44 kHz, nothing stops DAC designers at the other end from increasing the sampling frequency, let us say, eight times to 352.8 kHz. This is called oversampling. You take one 44.1 kHz sample, then seven zero samples, and continue this way. Designing a 22 kHz digital filter that uses this upsampled data is straightforward, as will be explained in the article on exact reproduction. Now, all these copies start at about 176 kHz instead of 22 kHz, which makes them much easier to filter. Oversampling was the first idea.
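The zero-stuffing step and the digital filter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the filter is a Hann-windowed sinc reaching 16 input samples either side, chosen here for simplicity and not the design any particular DAC uses, and all function names are made up for the example.

```python
import math


def zero_stuff(samples, factor):
    """Follow every input sample with factor-1 zeros."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0.0] * (factor - 1))
    return out


def lowpass_taps(factor, half_width=16):
    """Hann-windowed sinc with its cutoff at the original Nyquist frequency."""
    reach = factor * half_width
    taps = []
    for k in range(-reach, reach + 1):
        x = k / factor
        sinc = 1.0 if k == 0 else math.sin(math.pi * x) / (math.pi * x)
        hann = 0.5 * (1.0 + math.cos(math.pi * k / reach))
        taps.append(sinc * hann)
    return taps


def fir_filter(samples, taps):
    """Symmetric FIR filter; output[i] is centred on input[i]."""
    reach = len(taps) // 2
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, tap in enumerate(taps):
            idx = i + j - reach
            if 0 <= idx < len(samples):
                acc += samples[idx] * tap
        out.append(acc)
    return out


CD_RATE_HZ = 44_100
FACTOR = 8  # eight times oversampling, as in the text

# A 1 kHz tone sampled at the CD rate, then upsampled eight times.
tone = [math.sin(2 * math.pi * 1000 * n / CD_RATE_HZ) for n in range(128)]
upsampled = fir_filter(zero_stuff(tone, FACTOR), lowpass_taps(FACTOR))

# Compare with the same tone sampled directly at the higher rate,
# ignoring the ends where the filter runs off the data.
ideal = [math.sin(2 * math.pi * 1000 * m / (CD_RATE_HZ * FACTOR))
         for m in range(len(upsampled))]
reach = FACTOR * 16
worst = max(abs(a - b) for a, b in zip(upsampled[reach:-reach], ideal[reach:-reach]))
print("samples in:", len(tone), "samples out:", len(upsampled))
print("worst interior error:", worst)
```

Away from the ends of the data, the filtered output follows the tone as if it had been sampled at the higher rate in the first place; the seven zeros after each sample have been replaced by interpolated values.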

This had the following important byproduct. If the audio is dithered, adding extra zero samples means that not all the samples are dithered any more; the noise is concentrated in the non-zero samples. Applying the 22 kHz filter spreads the noise evenly across all samples, so for eight times oversampling, the overall noise is eight times less. Each halving of the noise means 3 dB less noise. So, with two times oversampling, you have not 112 dB SNR but 115 dB; eight times oversampling is three halvings, so we have 121 dB SNR. The first DAC chips could not handle 16 bits; 14 bits was the maximum. But using four times oversampling and an early form of noise shaping (to be discussed later), they were made equivalent to 16-bit DACs.
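The 3 dB per halving rule is just the noise power ratio expressed in decibels. A small Python sketch of the arithmetic, taking the article's 112 dB dithered figure as the starting point (the function name is illustrative):

```python
import math


def oversampling_gain_db(factor):
    """SNR gain from spreading the same noise over `factor` times the bandwidth."""
    return 10.0 * math.log10(factor)


DITHERED_SNR_DB = 112.0  # figure quoted in the article for dithered 44.1/16

for factor in (2, 4, 8):
    gain = oversampling_gain_db(factor)
    print(f"{factor}x oversampling: +{gain:.1f} dB -> {DITHERED_SNR_DB + gain:.0f} dB SNR")
```

Two, four and eight times oversampling come out at roughly 115, 118 and 121 dB respectively.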

The Details of a Modern High Quality DAC

As always, things have moved on since those early days.

Let’s look at a modern DAC like the PS Audio Direct Stream (DS). I use that as an example because I own one and have investigated how it works. It is nothing special; most other high-quality DACs these days work similarly.

It oversamples a whopping 1280 times, a sampling rate of about 56 MHz. Consider what this oversampling does to SNR. Let’s keep dividing 1280 by 2: 640, 320, 160, 80, 40, 20, 10, 5, 2.5, 1.25. Counting the halvings, we get ten doublings. This is an extra 30 dB on the 112 dB we have after dithering, giving an SNR of 142 dB, way over what is required when the thermal limit is considered. Each bit removed costs about 6 dB, so fourteen bits give 130 dB SNR. If an SNR below 130 dB is acceptable, 12 or even 8 bits could be used, giving an SNR of 118 dB and 94 dB, respectively. Considering the DS has an overall noise floor of 120 dB, 12 bits would be acceptable. An even better strategy would be to locate the noise floor of the recording and only transmit enough bits to reproduce above that. FLAC compression does not compress noise well, and doing this will reduce the size of FLAC files considerably.
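The bookkeeping in this paragraph can be written out as a short Python sketch. It takes the article's own figures as inputs (112 dB for dithered 16-bit audio, 3 dB per doubling of the sampling rate) and assumes the usual rule of thumb of about 6 dB per bit; the names are illustrative.

```python
import math

CD_RATE_HZ = 44_100
OVERSAMPLING = 1280
DITHERED_16_BIT_SNR_DB = 112.0  # figure quoted in the article
DB_PER_DOUBLING = 3.0           # the article's rule of thumb
DB_PER_BIT = 6.0                # assumption: rounded from about 6.02 dB per bit

sample_rate_hz = CD_RATE_HZ * OVERSAMPLING
doublings = math.floor(math.log2(OVERSAMPLING))  # whole doublings only
full_snr_db = DITHERED_16_BIT_SNR_DB + DB_PER_DOUBLING * doublings


def snr_for_bits(bits):
    """Estimated SNR after oversampling when only `bits` bits are kept."""
    return full_snr_db - DB_PER_BIT * (16 - bits)


print(f"sampling rate: {sample_rate_hz / 1e6:.3f} MHz")
print(f"whole doublings in {OVERSAMPLING}: {doublings}")
for bits in (16, 14, 12):
    print(f"{bits} bits -> about {snr_for_bits(bits):.0f} dB")
```

Since log2(1280) is a little over 10, counting ten whole doublings is slightly conservative.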

As an experiment, I took some 44.1/16 audio, changed it to 44.1/8 with dither and played it on my computer. During quiet passages, you could hear a faint hiss. But through my Direct Stream DAC, it is dead quiet even with my ear next to the speaker. As I said, 130 dB has a margin of safety on the best recordings, but even 94 dB is good.

How Modern Digital Audio Is Created

This leads us naturally to how modern audio is created. The exact implementation will vary, but here goes. We feed the output of a microphone into one side of a comparator. It outputs a one if that input is greater than the other side; otherwise, a zero. The output is sampled at a very high frequency, say 56 MHz (1280 times oversampling), and then fed into an integrator whose output voltage slowly rises if a one is present and falls if a zero is present. This voltage is the other side of the comparator. If the input voltage is positive, each sample will be a one, and the integrator will slowly rise. Eventually, it will be greater than the input voltage, and a zero is output, so the voltage falls. Thus, we have a large number of zeroes and ones that are easy to convert to an analogue signal by simply using a low-pass filter, such as a capacitor or a high-quality transformer whose frequency response drops off at, say, about 70 kHz.
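A rough software model may help here. The Python sketch below is a first-order sigma-delta loop, a close relative of the comparator-and-integrator arrangement just described and not a model of any particular converter. In this variant the integrator accumulates the difference between the input and the fed-back one-bit output, which is what allows a plain low-pass filter to recover the signal. The moving average stands in for the capacitor or transformer, and the sample counts are scaled far down from 56 MHz so the example runs quickly.

```python
import math


def sigma_delta(samples):
    """First-order sigma-delta loop: inputs in (-1, 1) -> list of 0/1 bits."""
    integrator = 0.0
    bits = []
    for x in samples:
        bit = 1 if integrator >= 0.0 else 0   # comparator
        feedback = 1.0 if bit else -1.0       # one-bit output fed back
        integrator += x - feedback            # accumulate the running error
        bits.append(bit)
    return bits


def moving_average(bits, window):
    """Crude low-pass filter: mean of `window` consecutive bits mapped to +/-1."""
    levels = [1.0 if b else -1.0 for b in bits]
    out = []
    running = 0.0
    for i, level in enumerate(levels):
        running += level
        if i >= window:
            running -= levels[i - window]
        if i >= window - 1:
            out.append(running / window)
    return out


PERIOD = 8192   # samples per cycle of the test tone (heavily oversampled)
WINDOW = 256    # length of the averaging filter

tone = [0.5 * math.sin(2 * math.pi * i / PERIOD) for i in range(2 * PERIOD)]
bits = sigma_delta(tone)
recovered = moving_average(bits, WINDOW)

# recovered[j] averages bits j .. j+WINDOW-1, so compare it with the tone
# at the centre of that window.
centre = (WINDOW - 1) / 2
worst_error = max(
    abs(r - 0.5 * math.sin(2 * math.pi * (j + centre) / PERIOD))
    for j, r in enumerate(recovered)
)
print("first 32 bits:", "".join(str(b) for b in bits[:32]))
print("worst error after low-pass filtering:", worst_error)
```

The bitstream itself looks nothing like a sine wave, yet averaging over a few hundred bits brings the tone back to within a small error.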

DXD Audio

To create the master from which audio files are distributed, we digitally filter the 1280 times oversampled one-bit audio down to eight times oversampled audio (352.8 kHz). This is called DXD. Why DXD? Audio engineers want a format guaranteed to have a sampling frequency above any maximum possible audio frequency, so Shannon implies exact reconstruction. They decided to make it much more than necessary. Nearly all recordings have frequencies over 22 kHz that are not swamped by noise. A few recordings do have frequencies not masked by noise above 44 kHz. It is rare to come across a recording with frequencies above 88 kHz, and none, to my knowledge, are above 176 kHz. 24-bit resolution is used for the same reason.

Noise Shaped Dither

After downsampling the audio to DXD, the resolution is more like 8 bits than 24 bits.  This is where a trick called noise shaping comes in. It is explained here:

https://www.analog.com/en/technical-articles/behind-the-sigma-delta-adc-topology.html

The link covers what I said previously about increasing resolution using TPDF dither and upsampling, but explained a bit differently. It also discusses another type of dither, called noise-shaped dither. Noise-shaped dither does not increase SNR equally across all frequencies: compared to TPDF dither, the SNR increases at the lower frequencies but much less at higher frequencies. The sampling rate of the one-bit audio, e.g. 56 MHz, records frequencies up to 28 MHz. This is far too high to be of any concern, so we can have a horrid SNR at those frequencies but a much better SNR, equivalent to 24 bits, at the DXD frequencies.
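The principle behind noise shaping can be shown with a first-order error-feedback quantiser, sketched below in Python. This is the simplest textbook form, used here as an assumption for illustration; real converters use higher-order shapers and add dither, both omitted. Summing the error over a block of samples serves as a crude measure of the low-frequency noise.

```python
import math


def quantize(value):
    """Round to the nearest whole step."""
    return float(math.floor(value + 0.5))


def plain_requantize(samples):
    """Ordinary rounding, with no memory of earlier errors."""
    return [quantize(s) for s in samples]


def noise_shaped_requantize(samples):
    """First-order error feedback: subtract the previous quantisation error."""
    out = []
    error = 0.0
    for s in samples:
        target = s - error
        q = quantize(target)
        error = q - target
        out.append(q)
    return out


def low_band_error_power(original, quantized, window=128):
    """Mean square of the error summed over `window` samples (a crude low-pass)."""
    err = [q - o for q, o in zip(quantized, original)]
    running = sum(err[:window])
    total = 0.0
    count = 0
    for i in range(window, len(err)):
        total += running * running
        count += 1
        running += err[i] - err[i - window]
    return total / count


# A test tone 100 steps tall at an arbitrary frequency.
tone = [100.0 * math.sin(2 * math.pi * 0.0371 * i) for i in range(8000)]

plain_power = low_band_error_power(tone, plain_requantize(tone))
shaped_power = low_band_error_power(tone, noise_shaped_requantize(tone))
print("low-frequency error power, plain rounding:", plain_power)
print("low-frequency error power, noise shaped:  ", shaped_power)
```

With error feedback, each output error is the difference between two successive quantisation errors, so the errors largely cancel when summed over a block: the noise has been pushed up to high frequencies, leaving the low band much cleaner.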

Further Details of Direct Stream DAC

Knowing this, we can complete the picture of how the DS DAC works. Everything is upsampled to 1280 times the CD sampling rate. Then it is downsampled by a factor of ten, converted to a one-bit stream with noise shaping using the same process described earlier, and passed through a transformer to get rid of the digital high frequencies and give the audio output. The designer arranged it so that above about 70 kHz, the drop in the transformer’s frequency response cancels the rise in noise from the one-bit converter and its noise shaper. The SNR is 120 dB up to very high frequencies.

Why downsample by a factor of ten before converting to one-bit audio with noise shaping? The other name for one-bit audio is Direct Stream Digital (DSD). When first implemented, it was done at 64 times oversampling. Doubling that gives 128 times oversampling, also called 2x DSD. There are also 4x DSD, 8x DSD, even 16x DSD. Downsampling 1280 times oversampling by ten gives 128 times, i.e. 2x DSD, and as explained in the following, 2x DSD is the sweet spot:
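For reference, the DSD rates are simple multiples of the CD sampling rate, and the factor of ten lands exactly on 2x DSD. A short Python sketch of the arithmetic (the function name is illustrative):

```python
CD_RATE_HZ = 44_100


def dsd_rate_hz(multiple):
    """Sampling rate of `multiple`x DSD, where 1x DSD is 64 times the CD rate."""
    return 64 * multiple * CD_RATE_HZ


for multiple in (1, 2, 4, 8, 16):
    print(f"{multiple}x DSD: {dsd_rate_hz(multiple) / 1e6:.4f} MHz")

# The DS DAC's internal rate of 1280 times the CD rate, downsampled by ten.
internal_rate_hz = 1280 * CD_RATE_HZ
print("1280x divided by 10:", internal_rate_hz // 10, "Hz")
print("matches 2x DSD:", internal_rate_hz // 10 == dsd_rate_hz(2))
```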

https://positive-feedback.com/audio-discourse/raising-the-sample-rate-of-dsd-is-there-a-sweet-spot/

Distributing Digital Audio

That’s basically how modern audio is recorded and played back. For those who want the ultimate fidelity, you can purchase the DXD master. But in most cases, everything is recovered by downsampling it to 176.4 kHz or 88.2 kHz. 44.1 kHz is becoming less popular among those who want the highest quality audio because the 22 kHz filter removes actual recorded frequencies. How audible this is remains a matter of debate. But 88.2 kHz, for nearly all recordings, is enough to preserve all frequencies. Remember Shannon: provided the highest frequency is below half the sampling frequency, you get exact reproduction. Many DAC designers put a 50 kHz filter on the output to reduce noise because so few recordings have content above 50 kHz that is not masked by recording noise. If you use such a DAC (Chord DACs, for example, do this), 88.2 kHz sampled recordings are good enough. If you really want to be careful, 176.4 kHz may have some minor benefits, but there is certainly no need to go to DXD. However, note what I say below about FLAC lossless compression.

Reduction in File Size Using FLAC

FLAC is a lossless audio compression standard that compresses very well; it generally reduces file sizes by about 50%. Like all lossless compression algorithms I am aware of, it has an Achilles heel: noise. It does not compress noise well. This is apparent when comparing 44.1/16 and 88.2/16. Since the difference between the two is just low-level high-frequency information, one would not expect a great increase in file size when compressed. But that turns out not to be true: 88.2/16 is compressed by about 50%, just like 44.1/16. The reason is noise. Yes, the high-frequency information is small, but the noise level is still the same. To increase the effectiveness of FLAC, reducing the noise will help considerably.

Noise resides mostly in the lower bits of a recording. Removing those will help the efficiency of FLAC. Now that you understand dithering, you can see how: we could use dither and keep only 16, 14, 12 or even 8 bits instead of 24.
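The effect of noisy low bits on lossless compression is easy to demonstrate. The Python sketch below uses zlib from the standard library purely as a stand-in, since FLAC itself works differently (it uses linear prediction), so only the trend is meaningful and not the actual sizes. The low bits are removed by plain truncation to keep the example short, where a real workflow would dither; the tone and noise levels are arbitrary choices.

```python
import math
import random
import struct
import zlib


def compressed_size(samples):
    """Bytes needed by zlib for the samples stored as 16-bit little-endian PCM."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return len(zlib.compress(raw, 9))


rng = random.Random(1)  # fixed seed so the run is repeatable
N = 44_100              # one second at the CD rate

# A clean 441 Hz tone, the same tone with noise filling the lowest 8 bits,
# and the noisy tone with those 8 bits removed again.
tone = [round(8000 * math.sin(2 * math.pi * 441 * i / N)) for i in range(N)]
noisy = [s + rng.randint(-128, 127) for s in tone]
trimmed = [(s >> 8) << 8 for s in noisy]

print("raw size:            ", 2 * N, "bytes")
print("clean tone:          ", compressed_size(tone), "bytes")
print("noisy tone:          ", compressed_size(noisy), "bytes")
print("noisy, low bits cut: ", compressed_size(trimmed), "bytes")
```

The noisy version compresses far worse than the clean tone, and cutting the noisy low bits recovers much of the saving, which is the same trade the paragraph above proposes for FLAC.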

There is a further trick that can be used. A program called XIFEO can be downloaded that determines the maximum frequency of a recording that is not masked by noise. It applies a filter above that frequency, removing all noise above it. From Shannon, this will not affect exact reproduction, but since noise is often present at high frequencies, the final file is better compressed by FLAC. The only trouble is that the company that sold the program went out of business. A demo version is still available that processes just the first minute of a recording, but that would usually be enough to find the bit depth and cut-off frequency.

IMHO, this may eventually become the standard way audio is distributed.

Another issue is something audio engineers noticed: as the sampling rate is increased, the audio sounds better. Not only this, but the effect continues well into MHz sampling rates. We can only hear up to 20 kHz, so it can’t be due to the reconstruction of higher frequencies. I won’t go into the hypothesised reasons for this, except to note it is a phenomenon well known to audio engineers. However, as suggested above, we get exact reconstruction on playback if the audio is produced correctly. We then upsample to a high sampling rate to simulate high sampling frequencies, which is what the upsampling of 1280 times in the PS Audio DAC does.

This is important. A system called MQA was devised to reduce time smear, one of the hypothesised reasons high sampling rates sound better. This is of no importance in the system I described because we have exact reproduction at a very high sampling rate: there is no time smear, simple as that. MQA caused a lot of heated debate in Hi-Fi circles. But IMHO, it is a non-issue because modern DACs have exact reproduction at very high sampling rates.

Next article: https://www.physicsforums.com/insights/digital-filtering-and-exact-reconstruction-of-digital-audio/
