Overview
The next subject for cross-platform programming is sound. Granted, this probably won't apply to many apps, but I write lots of software for games and audio/video editing. For me, audio output is often a significant design factor.
There are a number of different APIs on the Mac that can be used to output audio data. However, when writing C++ code that needs to consume PCM data generated by the app (as opposed to playing a file from disk), the best choice is the AudioQueue functions. The downside to using AudioQueue is that audio data must be queued up one buffer at a time, which increases the latency of audio output. If all you need to do is play audio data, this is not much of an issue. But when low latency is important (such as in a game engine or a real-time audio editing app), more testing and tweaking may be required to reduce latency without running the risk of audio dropping out when the CPU is heavily loaded.
For Windows, DirectSound offers a very nice, low-latency interface for outputting audio data. At least on XP. Vista moved the DirectSound functionality into software, which increases the latency for audio output (amongst other audio output changes that outraged many a gamer 'round the world). With DirectSound, it is possible to access the output buffer that is read by the hardware (or the software AudioStack on Vista), removing the need to write a full buffer of data at a time. This allows throughput latency to be reduced, but requires more logic to track read/write positions.
And of course there is always the option of trying a third-party library like Fmod or OpenAL. Every time I've needed to write audio code, it has either needed to be at the lowest level of the OS, or inside a PCI device, so Fmod, OpenAL, and similar libraries have been the wrong tool for the job. While they may be good tools for other programmers, I have not personally used them enough to comment on their value.
The sample code discussed in this article is available in crossplat.zip. It contains a QzAudioTest project that can be built with both DevStudio 7 and Xcode 3. The relevant code is found in QzAudioTest/AudioTest.cpp and Qz/QzSoundDriver.h.
Terminology
Something worth mentioning is terminology. When dealing with audio data, each individual number (usually 16 bits) is referred to as a "sample". However, what do you call the left/right pair of samples when playing stereo data?
There are quite a few answers to that question (none of them standardized). A common term is simply "pair", which is reasonable for stereo data, but not for mono or 5.1 audio. Some programmers refer to a pair of samples as a "sample" as well, making the terminology even more confusing. Does a "sample" refer to one number, two numbers, or more?
Some other common terms are "block", "group", and "frame". The Mac's AudioQueue documentation uses "frame" as the term of choice. Another is "slice", which is often used by audio DSP programmers. Though I am thankfully not a DSP programmer myself, "slice" is the term I am most accustomed to, so that is the one that appears throughout this article and the related code. (I've also heard "macroblock" and "tuple" used by some video engineers, but those terms apply to video data, not audio data.)
Needless to say, "sample" is an overloaded and confusing term when doing audio programming.
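To make the later listings easier to follow, here is a sketch of how the sizing constants used throughout the code relate to one another. The channel count, bit depth, slice size, and buffer count match what is described in this article; the slices-per-buffer value is an assumed placeholder, since the real constant lives in the Qz headers.

// Hypothetical values -- the actual constants live in the Qz library and may differ,
// but the relationships are what matter: one slice holds one sample per channel.
static const U32 c_AudioChannelCount = 2;    // stereo
static const U32 c_SampleBitDepth    = 16;   // 16-bit signed samples
static const U32 c_BytesPerSlice     = c_AudioChannelCount * (c_SampleBitDepth / 8); // 4 bytes
static const U32 c_AudioBufferCount  = 3;    // buffers in the playback ring
static const U32 c_SlicesPerBuffer   = 4096; // assumed buffer length in slices
static const U32 c_BytesPerBuffer    = c_SlicesPerBuffer * c_BytesPerSlice;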
Another term that comes up in the Mac programming docs is "packet". This
refers to a sequence of slices. For PCM data, a slice is a packet.
But for compressed data, such as ADPCM, a packet may contain several
hundred slices. Since this code only covers using PCM data, all
mFramesPerPacket
settings should be set to 1.
Base Class
To make life easier for myself, I have a single class definition for outputting audio data on both Windows and Mac:
class QzSoundDriver
{
private:
    // Each platform has a custom internal struct that contains the data
    // needed to interface with the sound player.  The struct is stored
    // as a black-box void* pointer so the rest of the app can remain
    // free from any platform-specific header files.
    void* m_pContext;

    U32   m_SampleRate;

public:
    QzSoundDriver(void);
    ~QzSoundDriver(void);

    bool Init(U32 sampleRate);
    bool Start(void);
    bool Stop(void);

    U32  MinWriteCount(void);
    U32  FreeBufferCount(void);
    U32  UpdatePosition(U32 lookahead);
    bool WriteToBuffer(S16 *pData, U32 sliceCount);

    // Only implemented for Mac version, since it is needed by the internal
    // callback handler.
    void IncFreeCount(void);
};
This header file is very short and simple. It also completely hides all of the platform-specific implementation within QzSoundDriverWin.cpp and QzSoundDriverMac.cpp. I'll cover the details of these two implementations throughout the rest of this article.
Higher-level audio playback logic only knows about the common definition
from QzSoundDriver.h
, allowing it to run the same on both
platforms. The Qz library does not have a full sound engine (at least not
yet — one will no doubt be added in time as I use it for more
projects).
Indeed, covering the design issues of a fully-featured sound engine is beyond the scope of this article. The only decent book I can recommend on the subject is Game Audio Programming, by James Boer, which designs and implements an audio engine, covering many of the relevant topics. The book uses DirectX for audio, spends more time on DirectMusic (which is effectively deprecated) than it does on DirectSound, and dedicates lots of time to EAX and similar hardware technologies, which Microsoft seems to have killed off with Vista. At the time of this writing, Boer's book is six years old — positively ancient as programming books go — but still the only one I know of worth mentioning on the subject.
I can only assume that audio programming is nowhere near as sexy as graphics programming.
Mac AudioQueue
The Mac-specific code is found in Qz/QzSoundDriverMac.cpp
.
This uses the AudioQueue functionality to output audio data.
We'll start off with the code to initialize an AudioQueue.
bool QzSoundDriver::Init(U32 sampleRate)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    // Protect against multiple calls to Init().
    FreeContext(pContext);

    pContext->DataFormat.mSampleRate       = sampleRate;
    pContext->DataFormat.mFormatID         = kAudioFormatLinearPCM;
    pContext->DataFormat.mFormatFlags      = kAudioFormatFlagIsSignedInteger
                                           | kAudioFormatFlagIsPacked;
    pContext->DataFormat.mBytesPerPacket   = c_BytesPerSlice;
    pContext->DataFormat.mFramesPerPacket  = 1;
    pContext->DataFormat.mBytesPerFrame    = c_BytesPerSlice;
    pContext->DataFormat.mChannelsPerFrame = c_AudioChannelCount;
    pContext->DataFormat.mBitsPerChannel   = c_SampleBitDepth;
    pContext->DataFormat.mReserved         = 0;

    m_SampleRate = sampleRate;

    // Create the audio queue that will be used to manage the array of audio
    // buffers used to queue samples.
    AudioQueueNewOutput(&(pContext->DataFormat), AudioCallback, this,
                        NULL, NULL, 0, &(pContext->Queue));

    // Clear these values.  We will pre-queue three buffers of silence before
    // starting output.  When the callback handler is called for the first
    // time, it will indicate that buffer [0] is free, and will increment
    // FreeBufferCount.  This causes the first call to WriteToBuffer() to
    // fill up that buffer, and from there we rely on state logic to keep
    // filling the correct buffers in the correct order.
    pContext->FreeBufferCount = 0;
    pContext->NextFreeBuffer  = 0;

    // Allocate the three buffers we will be using, fill each with silence
    // (all zeroes), and queue them up so they are ready to go once audio
    // output is started.
    for (U32 i = 0; i < c_AudioBufferCount; ++i) {
        AudioQueueAllocateBuffer(pContext->Queue, c_BytesPerBuffer,
                                 &(pContext->Buffers[i]));

        pContext->Buffers[i]->mAudioDataByteSize = c_BytesPerBuffer;

        memset(pContext->Buffers[i]->mAudioData, 0, c_BytesPerBuffer);

        AudioQueueEnqueueBuffer(pContext->Queue, pContext->Buffers[i], 0, NULL);
    }

    // Prime the pump.  This will "decode" the PCM data.  However, since the
    // data is already PCM, this really doesn't do anything, and audio will
    // start up and play without this call.  But the docs recommend making
    // this call, and since someone may change this code to take non-PCM
    // audio data, it's a good idea to keep this here for completeness.
    AudioQueuePrime(pContext->Queue, 0, NULL);

    return true;
}
The first thing the code needs to do is fill in the DataFormat
structure, which defines the formatting of the audio data. Some of the
information is redundant, since we're using 16-bit, stereo, linear PCM.
But even with PCM audio, there are many ways of representing the data, so we
must fill in the data fields correctly or risk getting garbled noise out of
the speakers.
- mSampleRate
- The sample rate of the audio data, typically 44,100 Hz. If audio needs to be played at different sample rates, a separate AudioQueue will be needed for each of them.
- mFormatID
- Specifies the base formatting of the audio (Mp3, PCM, ADPCM, etc.). In this case, we need to use linear PCM.
- mFormatFlags
- Additional flags are needed to specify variations of the format ID. PCM data could be represented as floats or integers, and integers could be 8, 16, 24, or 32 bits long — 16-bit PCM is a signed integer format. And we need to tag it as packed data, so the audio system won't try to interpret it as 16-bit data stored in 32-bit integers.
- mBytesPerPacket
- The packet size does not apply to PCM data. Set the packet size to be the same size as a slice (or "frame", to use Apple's terminology). If we were using a format like ADPCM, we would need to know the size of a packet (typically stored in the file header).
- mFramesPerPacket
- There is only one slice (or "frame") per packet with PCM data.
- mBytesPerFrame
- The frame size is channel_count × sample_size, or 4 bytes for 16-bit stereo.
- mChannelsPerFrame
- The code is hardwired for stereo data, so we set the channel count to 2.
- mBitsPerChannel
- Here "channel" is effectively yet another synonym for "sample": this field gives the size of each individual sample, and we're using 16-bit samples. Note that 16- and 24-bit samples can also be stored in 32-bit integers, aligned as either the high or low bits. Make certain that the bits in mFormatFlags are set up correctly to describe the positioning of the sample bits.
Once the DataFormat struct has been filled in, we call AudioQueueNewOutput to create a new AudioQueue object that uses exactly this format. After it has been created, the AudioQueue can only accept audio data in this format, so all buffers created by this AudioQueue are assumed to contain data in exactly this format.
The call to AudioQueueNewOutput
also sets the callback function
that will be used to notify the app when each buffer has been consumed by
the audio hardware. The Apple documentation takes the approach of having
the callback handler refill the buffer and enqueue the buffer. However,
this introduces multithreading issues into the code, and usually requires
the callback handler to call into higher level code to fetch the audio data
that will be played. This tends to make things more complex than necessary,
since multithreading and reentrancy are complex subjects that can introduce
subtle bugs.
The callback handler used here does nothing more than increment the
FreeBufferCount
field. This completely avoids multithreading
and reentrancy. Instead, higher level code needs to periodically poll
QzSoundDriver
to find out if there are any empty buffers —
if there are, it will fill another buffer with audio data and feed that
in through WriteToBuffer
. Look at the main loop in
AudioTest.cpp
for an example of how this works.
(Note that this is a bad example, since driving audio output from
the main loop is unreliable. Typically a separate thread is used to keep
audio output fed, which is a topic worthy of an article all its own.)
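For reference, the output callback can be as small as the following sketch. The real AudioCallback in QzSoundDriverMac.cpp may differ in detail, but the only work it needs to do is forward to IncFreeCount() through the user-data pointer that was passed to AudioQueueNewOutput.

// Sketch of the output callback: the "this" pointer passed to AudioQueueNewOutput()
// comes back as pUserData, so we bump the free-buffer count and return immediately.
static void AudioCallback(void *pUserData, AudioQueueRef queue, AudioQueueBufferRef pBuffer)
{
    QzSoundDriver *pDriver = reinterpret_cast<QzSoundDriver*>(pUserData);
    pDriver->IncFreeCount();
}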
After we have created the AudioQueue, we need to allocate some buffers with
AudioQueueAllocateBuffer
. As mentioned, these buffers are
required to only contain audio data in the format specified by DataFormat
.
Once we have a new buffer, we need to zero it out (for signed 16-bit audio,
zeroes represent silence), then queue it with AudioQueueEnqueueBuffer
.
For now, the queued buffers will sit in the AudioQueue, waiting for playback
to be started.
By queuing up several buffers of silence, we can guarantee that audio playback
will start immediately, instead of having to wait for higher-level code to
start filling buffers. In addition, by pre-queuing all of the buffers, we can
set FreeBufferCount
to zero, since all of the buffers are being
used, and we set NextFreeBuffer
to zero, since the first buffer
that will be consumed and returned will be buffer [0]
. This
effectively provides a very simple state machine that allows us to always know
which buffer needs to be filled and re-queued next, without having to explicitly
keep track of each pBuffer
that gets handed to the callback function.
Once all of the buffers have been zeroed out and queued, we can call
AudioQueuePrime
to preemtively "decode" the contents of the
buffers. This is not really necessary, since we are dealing with PCM data.
However, if we were using any type of encoded data (such as MP3 or ADPCM),
we would want to prime the queue to avoid extra delays when starting the
playback.
Finally, after all of the above initialization is complete, we can start
audio playback by calling AudioQueueStart
:
bool QzSoundDriver::Start(void)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    AudioQueueStart(pContext->Queue, NULL);

    return true;
}
Likewise, playback can be stopped by calling AudioQueueStop
.
Note that we also call FreeContext
to assure that all of
the audio buffers and the AudioQueue itself are deleted.
bool QzSoundDriver::Stop(void)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    OSStatus status = AudioQueueStop(pContext->Queue, true);

    FreeContext(pContext);

    return (noErr == status);
}
Now we get to the WriteToBuffer
function. This takes an array of
16-bit stereo PCM samples, which are written into the next available audio
buffer. This code assumes that the caller has first verified that there is
at least one empty audio buffer available, and that it is providing the exact number of slices required to fill the buffer.
(As an aside, you could try to write fewer than c_SlicesPerBuffer slices into each buffer, but buffers with less data in them will be consumed faster, producing a degree of unpredictability in how quickly buffers are consumed. It can also significantly increase the risk of the buffers being consumed faster than the higher-level code can fill them, causing audio output to drop out and introduce other artifacts. Always completely filling each buffer avoids these dangers.)
bool QzSoundDriver::WriteToBuffer(S16 *pData, U32 sliceCount)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    // We need at least one free buffer.  The caller should have used
    // UpdatePosition() to see if there is any buffer space available
    // before calling WriteToBuffer().
    if (0 == pContext->FreeBufferCount) {
        return false;
    }

    // We can only accept a full buffer of audio at a time.
    // (Actually, we could accept less, but that would make this
    // control logic more elaborate.)
    if (c_SlicesPerBuffer != sliceCount) {
        return false;
    }

    U32 index = pContext->NextFreeBuffer;

    // Blit the data into the next available buffer.
    pContext->Buffers[index]->mAudioDataByteSize = c_BytesPerBuffer;
    memcpy(pContext->Buffers[index]->mAudioData, pData, c_BytesPerBuffer);

    // Queue the buffer for playback.
    AudioQueueEnqueueBuffer(pContext->Queue, pContext->Buffers[index], 0, NULL);

    // Decrement the count of empty buffers and advance NextFreeBuffer
    // around to index of the next buffer in the three-buffer ring.
    QzThreadSafeDecrement((S32*)&(pContext->FreeBufferCount));
    pContext->NextFreeBuffer = (pContext->NextFreeBuffer + 1) % c_AudioBufferCount;

    return true;
}
Note that at the end of WriteToBuffer
, we decrement the count of empty
buffers and advance NextFreeBuffer
to the next buffer in the ring.
We're doing this to maintain a simple state machine with the callback function.
By always queuing buffers in a circular order, we always know the order in which
they are processed, and which one needs to be filled next.
Next, we have the IncFreeCount
function. This is called from the
callback handler so we can increment FreeBufferCount
. The higher-level
code will periodically poll FreeBufferCount
to see if there is an
empty buffer ready to be filled.
void QzSoundDriver::IncFreeCount(void)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    QzThreadSafeIncrement((S32*)&(pContext->FreeBufferCount));
}
And that's the basics of playing audio using AudioQueue. I'll go over how the higher-level code uses this functionality further down in the Driving the Driver section.
Windows DirectSound
The Windows-specific code is found in Qz/QzSoundDriverWin.cpp
,
which uses DirectSound for audio output.
Compared to AudioQueue, the DirectSound code is more involved, both for initialization and in accessing the contents of the audio buffer.
We'll start with the basic initialization routines. And once again, I'm leaving most of the error checking code out to keep these listings to a more manageable length.
bool QzSoundDriver::Init(U32 sampleRate)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    m_SampleRate = sampleRate;

    HRESULT hr = S_OK;

    // Create the DirectSound object required for buffer allocation.
    hr = DirectSoundCreate8(NULL, &(pContext->pDxSound), NULL);

    // If we're running in normal window (QzMainWin.cpp), this
    // will be the handle where graphics are being rendered.
    HWND hWindow = g_hWindow;

    // However, if hWindow is not defined, assume we're running
    // from a console app, so we can fetch that window handle.
    if (NULL == hWindow) {
        hWindow = GetConsoleWindow();
    }

    // Set the cooperative level.  We need to use PRIORITY level
    // so we can call SetFormat() on the primary mixing buffer.
    hr = pContext->pDxSound->SetCooperativeLevel(hWindow, DSSCL_PRIORITY);

    // Fill in a struct that defines the primary mixing buffer.
    // Although we won't be directly writing to this buffer, we
    // do need to access it to set the sampling rate (otherwise
    // our audio data may be resampled on playback, which can
    // reduce quality).
    DSBUFFERDESC desc;
    SafeZeroVar(desc);
    desc.dwSize        = sizeof(DSBUFFERDESC);
    desc.dwFlags       = DSBCAPS_PRIMARYBUFFER;
    desc.dwBufferBytes = 0;
    desc.lpwfxFormat   = 0;

    IDirectSoundBuffer *pPrimaryBuffer = NULL;
    hr = pContext->pDxSound->CreateSoundBuffer(&desc, &pPrimaryBuffer, 0);

    // Now fill in a WAV struct that defines the formatting of the
    // audio data as 16-bit stereo PCM.
    WAVEFORMATEX format;
    SafeZeroVar(format);
    format.wFormatTag      = WAVE_FORMAT_PCM;
    format.nChannels       = c_AudioChannelCount;
    format.nSamplesPerSec  = m_SampleRate;
    format.nBlockAlign     = 2 * c_AudioChannelCount;
    format.nAvgBytesPerSec = format.nSamplesPerSec * U32(format.nBlockAlign);
    format.wBitsPerSample  = c_SampleBitDepth;

    // Now we SetFormat() on the primary buffer, which sets the audio
    // sample rate that will be used for audio output.
    hr = pPrimaryBuffer->SetFormat(&format);

    // Release the reference to the primary buffer.
    // We won't be needing it again.
    SafeRelease(pPrimaryBuffer);

    // The context info contains the current position at which we will
    // start writing data.  Note that we're setting WriteOffset and
    // BytesRemaining as if that many samples of silence exist at the
    // start of the buffer.  That gives us this amount of playback time
    // after starting before we must write data into the buffer, or
    // risk glitching the audio output and screwing up the write
    // position.
    pContext->BufferSize     = m_SampleRate * c_BytesPerSlice;
    pContext->Position       = 0;
    pContext->WriteOffset    = c_AudioBufferCount * c_BytesPerBuffer;
    pContext->BytesRemaining = c_AudioBufferCount * c_BytesPerBuffer;

    // The exact same buffer settings are used to create the buffer
    // into which we will be writing audio data.  (I'm manually refilling
    // this struct, since I've had problems in the past with Win32 calls
    // changing the contents of the WAV struct.  It doesn't seem to
    // happen here, but once bitten...)
    SafeZeroVar(format);
    format.wFormatTag      = WAVE_FORMAT_PCM;
    format.nChannels       = c_AudioChannelCount;
    format.nSamplesPerSec  = m_SampleRate;
    format.nBlockAlign     = 2 * c_AudioChannelCount;
    format.nAvgBytesPerSec = format.nSamplesPerSec * U32(format.nBlockAlign);
    format.wBitsPerSample  = c_SampleBitDepth;

    // Set up a DSBUFFERDESC structure.  Note that this has different
    // settings than for the primary buffer.  Primarily, we need to be
    // able to get the current playback position so we can write into
    // the buffer as it is being played.
    SafeZeroVar(desc);
    desc.dwSize        = sizeof(DSBUFFERDESC);
    desc.dwFlags       = DSBCAPS_GETCURRENTPOSITION2;
    desc.dwBufferBytes = format.nAvgBytesPerSec;
    desc.lpwfxFormat   = &format;

    // Set this flag to make sound keep playing when app does not have
    // focus.  Otherwise sound becomes inaudible (but continues to be
    // processed) when mouse/keyboard focus switches to another window.
    desc.dwFlags |= DSBCAPS_GLOBALFOCUS;

    // This creates a basic DirectSound buffer.
    IDirectSoundBuffer *pBuffer = NULL;
    hr = pContext->pDxSound->CreateSoundBuffer(&desc, &pBuffer, NULL);

    // Now QI for a DirectSound 8 sound buffer.  This interface
    // will allow us to call GetCurrentPosition() on the buffer.
    hr = pBuffer->QueryInterface(g_SoundBuffer8Guid, (void**)&(pContext->pBuffer));

    // Release the reference to the basic sound buffer interface.
    // We will be using the DS 8 interface from here on out.
    SafeRelease(pBuffer);

    S16* pAdr      = NULL;
    U32  byteCount = 0;

    // Lock the buffer so we can zero out the entire thing.
    hr = pContext->pBuffer->Lock(0, pContext->BufferSize, (void**)&pAdr, &byteCount,
                                 NULL, NULL, DSBLOCK_ENTIREBUFFER);

    // It is most unlikely that the buffer could be lost if we're still
    // initializing the app, but try to restore it if we can.
    if (DSERR_BUFFERLOST == hr) {
        pContext->pBuffer->Restore();
        hr = pContext->pBuffer->Lock(0, pContext->BufferSize, (void**)&pAdr, &byteCount,
                                     NULL, NULL, DSBLOCK_ENTIREBUFFER);
    }

    // Zero-out the buffer so it starts off outputting silence.
    memset(pAdr, 0, pContext->BufferSize);

    hr = pContext->pBuffer->Unlock(pAdr, byteCount, NULL, 0);

    LogMessage("DirectSound driver Init() completed");

    return true;
}
The code above starts off with a call to DirectSoundCreate8
to
create the basic DirectSound 8 object needed to access the device. Once we
have this pointer, we can create all of the buffers and access the control
routines needed to drive audio.
Next we need to call into SetCooperativeLevel
. This achieves
two goals. First, it associates audio playback with a specific window —
by default, audio output is silenced when that window loses input focus
(it gets minimized, another window moves in front of it, etc.). Second,
this assigns PRIORITY
rights to the device, which is needed to
access the primary mixing buffer so we can adjust the buffer's settings.
Then we can create the primary output buffer. Since the primary buffer
controls the audio sampling rate played from the output device, we need
to call SetFormat
so we can control the format and sample
rate of the buffer. Once those settings have changed, we can release
the reference, since we won't need to touch the primary buffer again:
DirectSound keeps its own internal reference which will last until the
device is shut down, making for one less COM reference for us to track.
Next, we create a second sound buffer. This is the one into which we will be
writing audio data. However, we cannot use the standard COM interface,
since that does not allow us to access the current read/write positions
within the buffer. For that, we need to QueryInterface
for
a version 8 DirectSound buffer (and make certain that we have set the
DSBCAPS_GETCURRENTPOSITION2
caps bit before creating the
buffer). Then we can release the base COM reference, since we only need
the version 8 interface.
Finally, we need to lock the buffer with the DSBLOCK_ENTIREBUFFER flag so we can zero out its entire contents. This way audio output will start playing silence as soon as we start playback.
Now we can start audio playback:
bool QzSoundDriver::Start(void)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    if ((NULL == pContext->pDxSound) || (NULL == pContext->pBuffer)) {
        LogErrorMessage("Start(): DS not initialized");
        return false;
    }

    HRESULT hr = S_OK;

    // Set playback to start from the beginning of the buffer.
    hr = pContext->pBuffer->SetCurrentPosition(0);

    // Start playback to loop infinitely over the contents of the buffer.
    hr = pContext->pBuffer->Play(0, 0, DSBPLAY_LOOPING);

    return true;
}
Here we need to call SetCurrentPosition to make certain that playback will start from the beginning of the buffer (in theory it should anyway, but let's not take any chances — things could be messed up if playback was started, stopped, then started again). Then we can issue the Play call that starts audio output from the buffer, with output looping infinitely. After this point, we need to keep writing audio data into the buffer to stay ahead of where the hardware is pulling audio data out of the buffer.
bool QzSoundDriver::Stop(void)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    if ((NULL == pContext->pDxSound) || (NULL == pContext->pBuffer)) {
        LogErrorMessage("Stop(): DS not initialized");
        return false;
    }

    HRESULT hr = S_OK;

    hr = pContext->pBuffer->Stop();

    return true;
}
Stopping playback simply involves calling Stop. Note that stopping is immediate, so if audio is still playing non-zero samples, a pop or click will be heard as audio comes to an abrupt halt. This is more noticeable on some hardware than others: cheap audio output like SoundMAX exhibits the problem badly, whereas SoundBlaster hardware does some extra output filtering that makes the artifacts less obvious. The cleanest way to stop audio is to scale the audio data down to zero over time, give playback enough time to consume that ramped-down data, then stop playback. But that logic has to exist at a higher level.
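That higher-level ramp is easy to sketch. Assuming the final buffer of 16-bit stereo slices is available before playback stops, something like the following hypothetical helper (not part of QzSoundDriver) scales it down to silence:

// Hypothetical helper: apply a linear fade-out across one buffer of interleaved
// 16-bit stereo data, so the last thing the hardware plays trails off to silence.
static void FadeOutSlices(S16 *pData, U32 sliceCount)
{
    if (sliceCount < 2) {
        return;
    }

    for (U32 i = 0; i < sliceCount; ++i) {
        // Gain runs from 1.0 at the first slice down to 0.0 at the last slice.
        float gain = float(sliceCount - 1 - i) / float(sliceCount - 1);

        pData[(i * 2) + 0] = S16(float(pData[(i * 2) + 0]) * gain);  // left
        pData[(i * 2) + 1] = S16(float(pData[(i * 2) + 1]) * gain);  // right
    }
}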
And now we get to the DirectSound version of WriteToBuffer
:
bool QzSoundDriver::WriteToBuffer(S16 *pData, U32 sliceCount)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    if (NULL == pContext->pBuffer) {
        return false;
    }

    U32  size1 = 0;
    U32  size2 = 0;
    U08* ptr1  = NULL;
    U08* ptr2  = NULL;

    U32 byteCount = sliceCount * c_BytesPerSlice;

    // Lock a range of memory within the buffer, starting from where we last
    // wrote data, through to the end of where we are going to be writing
    // data this time around.  Since this is a ring buffer, we get back two
    // size values and two pointers.  We only need the second pair of values
    // if the data wraps around the end of the buffer.
    HRESULT hr = pContext->pBuffer->Lock(pContext->WriteOffset, byteCount,
                                         (void**)&ptr1, &size1,
                                         (void**)&ptr2, &size2, 0);

    if (DSERR_BUFFERLOST == hr) {
        hr = pContext->pBuffer->Restore();

        // NOTE: It is possible for the restore to fail if the app window
        // does not have focus.
        hr = pContext->pBuffer->Lock(pContext->WriteOffset, byteCount,
                                     (void**)&ptr1, &size1,
                                     (void**)&ptr2, &size2, 0);
    }

    // Copy the first segment of data.  If this is in the middle of the
    // buffer, we're done.
    if (NULL != ptr1) {
        memcpy(ptr1, pData, size1);
    }

    // However, if the data range wraps around the end of the buffer,
    // we need to copy the rest of the data to the second memory range,
    // which will start at the beginning of the buffer.
    if (NULL != ptr2) {
        memcpy(ptr2, reinterpret_cast<U08*>(pData) + size1, size2);
    }

    hr = pContext->pBuffer->Unlock(ptr1, size1, ptr2, size2);

    // Update the write position, keeping in mind that this is a ring buffer.
    pContext->WriteOffset = (pContext->WriteOffset + byteCount) % pContext->BufferSize;

    pContext->BytesRemaining += byteCount;

    return true;
}
Unlike Mac's AudioQueue, where we are feeding full buffers of audio to output, with DirectSound we are writing audio data into the output buffer as the hardware is reading data from the same buffer. Since we have the buffer looping infinitely, this is simply a ring buffer. By keeping track of where we last wrote data, we know the offset at which the next write operation needs to start.
We prepare to write data by calling Lock
. This maps that
part of the buffer into addressable memory, allowing us to memcpy
audio data into it. (Maybe. Many drivers actually provide a separate buffer
for writing, then transfer data from there to hardware after the buffer is
Unlock
ed. But we seldom need to know about how the DirectSound
driver implements its internal functionality.)
Because we are dealing with a ring buffer, there is the special case when
the range of data we need to write will wrap around the end of the buffer.
That is why the Lock
function will return two pointers and two
size values. Normally, the second pointer will be NULL
, so we only need to
memcpy
to the first pointer. However, if the second pointer
is non-NULL
, then a second memcpy
is required to
blit the rest of the sound data to the second memory address.
And that concludes the basic functionality within the DirectSound version of QzSoundDriver.
Driving the Driver
Now we can turn our attention to how we can drive audio output using either
of the two platform-specific implementations. The code for this is found
in AudioTest.cpp
.
The sample code there sits in a loop, polling to see if there is an empty
buffer to process, filling it with audio data, then Sleep
ing
for a few milliseconds before repeating. This is a functional but rather
clumsy loop. Using a signal event is better than sleeping, since the code
can be placed in a worker thread and wake up in response to events. But
implementing a good audio engine is beyond the scope of this (already too
long) article. Sleeping is adequate for the purpose of this simple demo.
Here is a simplified version of the main loop:
QzSoundDriver driver;

driver.Init(sampleRate);
driver.Start();

for (;;) {
    // How many empty buffers are there to fill up?  Make this
    // call only once, and do it before we enter the loop.
    // If we tried to repeatedly call this inside the inner
    // loop, we would end up in an infinite loop when running
    // with the DirectSound version of the driver.
    U32 freeCount = driver.FreeBufferCount();

    for (U32 i = 0; i < freeCount; ++i) {
        // Find out how much data is currently stored in the output buffer.
        U32 sliceCount = driver.UpdatePosition(lookahead);

        // If the amount of audio data is less than the desired
        // lookahead amount (and it always should be), we'll need
        // to mix up more audio to fill in the empty space in the
        // buffer.
        if (sliceCount > 0) {
            GenerateSamples(pScratch, sliceCount);

            driver.WriteToBuffer(pScratch, sliceCount);
        }
    }

    // Wake up about 30 times a second to feed in more audio data.
    // At a 44,100 Hz sample rate, sleeping for 30 milliseconds
    // will consume about 1,300 slices per iteration of the loop.
    QzSleep(30);
}

driver.Stop();
Once the driver is initialized and started, we drop into the main loop
that keeps the driver fed with audio data. Periodically, the loop will
wake up and call FreeBufferCount
to find out if there are
any buffers that need to be filled. If there is an empty buffer, it
calls UpdatePosition
to find out how much data is needed
to fill the buffer. Then it can go off to generate the data (such as
mixing its own audio data, or calling a codec to decompress audio from
a file), then call WriteToBuffer
to write the data into
the next available buffer for output.
However, the AudioQueue and DirectSound code behave differently. Those differences are hidden behind the class interface, allowing this loop to work the same with both implementations.
The first difference is with FreeBufferCount
. The AudioQueue
implementation actually keeps track of how many buffers are empty, and
returns the actual count. With DirectSound, however, FreeBufferCount
always returns 1, since there is only one buffer, and it is always
possible to write more data into the buffer since the hardware read
pointer is constantly advancing around the buffer.
Therefore it is important for FreeBufferCount to be called only once, with the inner loop then repeating exactly that many times. Otherwise the DirectSound implementation would cause this to drop into an infinite loop.
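For comparison, the DirectSound version of FreeBufferCount amounts to little more than the following sketch (the version in QzSoundDriverWin.cpp may add validity checks):

// DirectSound sketch: there is only the one ring buffer, and it can always accept
// more data, so report exactly one "free buffer" per polling pass.
U32 QzSoundDriver::FreeBufferCount(void)
{
    return 1;
}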
The next important function is UpdatePosition
. For AudioQueue,
this is trivial: it either returns the number of samples that fit into one
buffer, or it returns zero if there are no empty buffers (which should not
happen if the code is calling FreeBufferCount
first).
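A sketch of what the Mac version of UpdatePosition amounts to (the actual code in QzSoundDriverMac.cpp may differ in detail):

// Mac sketch: ignore the lookahead hint, since each write must be exactly one full
// buffer.  Report a full buffer's worth of slices whenever at least one buffer is free.
U32 QzSoundDriver::UpdatePosition(U32 lookahead)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    if (0 == pContext->FreeBufferCount) {
        return 0;
    }

    return c_SlicesPerBuffer;
}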
However, the DirectSound implementation of UpdatePosition
is rather more elaborate. Since a single ring buffer is being used for
audio output, state logic needs to keep track of where data was written,
read back the current "write" position from the hardware to figure out
how much audio data has been consumed since the last update, then
calculate how many slices need to be written to maintain the requested
lookahead distance within the buffer.
U32 QzSoundDriver::UpdatePosition(U32 lookahead)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    U32 write = 0;

    // Get the "write" position.  This is where it is safe to start writing
    // audio data.  Assume that all data in the buffer before this point
    // has been consumed, and plan to start writing data somewhere after
    // this position.
    HRESULT hr = pContext->pBuffer->GetCurrentPosition(NULL, &write);

    if (FAILED(hr)) {
        PrintHRESULT("GetCurrentPosition() failed", hr);
        return 0;
    }

    // Ring buffer logic to figure out how much data has been consumed
    // since the last time we updated the position information.
    U32 consumed = (pContext->BufferSize + write - pContext->Position) % pContext->BufferSize;

    pContext->Position = write;

    // How much valid data is stored in the buffer?
    if (consumed >= pContext->BytesRemaining) {
        // This is the bad condition: we haven't kept up, and the hardware
        // has consumed all of the data we've given it.  We can try to
        // write more data, but we've already had an audio glitch at this
        // point.  And by the time we write more data into the buffer,
        // the hardware will almost certainly have advanced beyond this new
        // write position, so another glitch will happen then.  Hopefully
        // after this point we can return to keeping ahead of the hardware.
        //
        // If not, we might consider stopping audio, clearing out the whole
        // buffer, then restarting with a long duration of silent data in
        // the buffer.  That will give higher-level code time to queue up
        // more audio data and once again try to keep in advance of where
        // the hardware is reading from the buffer.
        //
        pContext->BytesRemaining = 0;
        pContext->WriteOffset    = write;
    }
    else {
        pContext->BytesRemaining -= consumed;
    }

    // How many slices are currently sitting in the buffer.
    U32 sliceCount = pContext->BytesRemaining / c_BytesPerSlice;

    // If there are not at least "lookahead" number of slices, return the
    // number of slices needed to keep "lookahead" slices in advance of what
    // the hardware is processing.
    if (sliceCount < lookahead) {
        return lookahead - sliceCount;
    }

    return 0;
}
Hopefully the comments in the above code are enough to describe what it is doing. The key point is that since we are writing to the buffer as hardware is reading from the buffer, we must always keep ahead of the hardware. This is where the size of the "lookahead" distance is important.
Since we are writing to the output buffer, we can reduce the lookahead distance to reduce the latency of audio samples that are being played. This can make a game sound more responsive (e.g., bullet sound effects occur closer to the time at which a muzzle flash is displayed on the screen), or make audio timeline scrubbing more responsive in an audio/video editing app.
The downside of reducing lookahead distance is that we increase the chance of audio glitching if the CPU becomes too heavily loaded or the frame rate drops too much (which is a good reason to have the audio being driven from a separate thread, rather than trying to update it once per frame in the main rendering loop).
Increasing the lookahead distance will reduce the odds of a sound glitch, but makes audio output more laggy. There is no good answer to this problem. The only option is empirical: test the performance of your app under heavy loads, and increase lookahead to whatever gives the best results.
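Since the lookahead value is expressed in slices, converting a target latency in milliseconds into the value passed to UpdatePosition is a one-line calculation. A minimal sketch (this helper is hypothetical, not part of the sample code):

// Hypothetical helper: convert a desired latency in milliseconds into a slice count.
// At 44,100 Hz, a 100 ms lookahead works out to 4,410 slices.
static U32 LookaheadFromMilliseconds(U32 sampleRate, U32 milliseconds)
{
    return (sampleRate * milliseconds) / 1000;
}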
And to add to the problem, Vista removed hardware acceleration from DirectSound and moved all audio processing into software, so that the new software AudioStack can allow multiple apps to use DirectSound concurrently, and allow additional per-application control of sound. This will no doubt add even more latency to audio output. Trying to compensate for it by reducing the lookahead distance will only make the problem worse.
Thankfully, most people are not good at perceiving sound latency if it is kept down to 100-150 milliseconds. Try displaying an animation of an explosion, then play a sound effect 100 milliseconds later. Most people will think that the animation and the sound happened at the same time — that is the way our brains are wired, thanks to the speed of sound. Keeping latency down is not as essential as many programmers think.
In my experience, 100 milliseconds is adequate for most applications. And I say this having written audio control apps for real experts who literally can detect a difference of 16 milliseconds. Those individuals are incredibly rare and can be safely ignored until you need to write software for them to use. The rest of us normal folk are not trained for that kind of perception.
Ogg/Vorbis
While on the subject of sound, I'll throw in some notes on writing code to
read Ogg Vorbis sound files. Check out the QzOggDecoder.cpp
class for a working implementation.
For DevStudio, you'll just need to include vorbisfile.h
in the
project. Provided all of the relevant headers are in the path, DevStudio will
pick up all of the files it needs to compile. To make life simpler, I built
a custom libvorbis.dll
that has all of the decode functionality
combined into a single DLL, instead of the two or three different DLLs
required by a regular build.
To build with Xcode, you'll need to obtain Ogg.framework and Vorbis.framework
from Xiph.org. The one curiosity
here is that these two frameworks ended up in /Library/Frameworks
instead of /System/Library/Frameworks
. To get the code to
compile in Xcode, it is necessary to define a __MACOSX__
symbol
to get the definitions in vorbistypes.h
to work correctly. For whatever
reason, this symbol is not defined when building with Xcode. (Maybe
the folks at Xiph are out of sync with the current version of Xcode?
Maybe I have no clue how to properly install software on a Mac?)
Since the header files are located in two different frameworks, the following explicit includes are required to build in Xcode:
#define __MACOSX__      // need to explicitly define __MACOSX__ for VorbisTypes.h

#include <Ogg/os_types.h>
#include <Ogg/ogg.h>
#include <Vorbis/vorbisfile.h>
Ogg uses a set of callbacks to access data from a file. There is a standard handler that can be used if you want to read from a file, but sometimes it is easier to read the whole Ogg file into memory and access from there. By implementing your own callback functions, you can have the Ogg library decode data from memory.
ov_callbacks callbacks;

callbacks.read_func  = CallbackRead;
callbacks.seek_func  = CallbackSeek;
callbacks.close_func = CallbackClose;
callbacks.tell_func  = CallbackTell;

ov_open_callbacks(this, &m_Context, NULL, 0, callbacks);
These four callback functions take the place of fread, fseek, fclose, and ftell.
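Since ov_open_callbacks expects plain C function pointers, the four callbacks are typically static trampolines that cast the datasource pointer back to the decoder object and forward to the member functions shown below. A sketch of what those trampolines might look like (the actual wrappers in QzOggDecoder.cpp may be organized differently):

// Sketch of the static trampolines handed to ov_open_callbacks().  The "datasource"
// pointer is the "this" pointer passed as the first argument to ov_open_callbacks().
static size_t CallbackRead(void *pDest, size_t size, size_t count, void *pSource)
{
    return reinterpret_cast<QzOggDecoder*>(pSource)->DataRead(pDest, U32(size), U32(count));
}

static int CallbackSeek(void *pSource, ogg_int64_t offset, int origin)
{
    return reinterpret_cast<QzOggDecoder*>(pSource)->DataSeek(U32(offset), origin);
}

static int CallbackClose(void *pSource)
{
    return int(reinterpret_cast<QzOggDecoder*>(pSource)->DataClose());
}

static long CallbackTell(void *pSource)
{
    return long(reinterpret_cast<QzOggDecoder*>(pSource)->DataTell());
}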
The first function required is one that emulates the behavior of ftell
.
This just needs to return the current offset into the data buffer:
U32 QzOggDecoder::DataTell(void)
{
    return m_Offset;
}
The next function needed is the equivalent of fseek
. Caution needs to
be taken with this function, since Ogg may try to seek past the end of the file
when doing the initial scan of the data. All three origin values
(SEEK_CUR
,
SEEK_SET
, and
SEEK_END
) need to be supported:
int QzOggDecoder::DataSeek(U32 offset, int origin)
{
    switch (origin) {
        case SEEK_CUR:
            m_Offset += offset;
            if (m_Offset > m_ByteCount) {
                m_Offset = m_ByteCount;
            }
            break;

        case SEEK_SET:
            if (offset < m_ByteCount) {
                m_Offset = offset;
            }
            else {
                m_Offset = m_ByteCount;
            }
            break;

        case SEEK_END:
            if (offset < m_ByteCount) {
                m_Offset = m_ByteCount - offset;
            }
            else {
                m_Offset = 0;
            }
            break;

        default:
            return -1;
    }

    return 0;
}
Another required function needs to behave the same as fread
,
including its behavior when given data size values other than 1. Caution
also needs to be taken when attempting to read past the end of the buffer,
which Ogg Vorbis often will do when it initially scans the data as part
of the ov_open_callbacks
function.
U32 QzOggDecoder::DataRead(void *pDest, U32 dataSize, U32 dataCount)
{
    U32 byteCount = dataSize * dataCount;

    // Ogg will happily read past the end of the file when starting up.
    // Do a range check to avoid running past the end of the buffer.
    // Make certain that the byte count to read is a multiple of the
    // requested element size.
    if ((m_Offset + byteCount) > m_ByteCount) {
        byteCount = ((m_ByteCount - m_Offset) / dataSize) * dataSize;
    }

    if (byteCount > 0) {
        memcpy(pDest, m_pData + m_Offset, byteCount);
    }

    m_Offset += byteCount;

    // Make certain that the value returned is the number of data units
    // that were read, not the total number of bytes.
    return byteCount / dataSize;
}
Ogg also requires a fclose
function. Since the data resides
in memory, this function doesn't need to do anything.
U32 QzOggDecoder::DataClose(void)
{
    return 0;
}
Decoding of the audio data is done through the ov_read
function,
which extracts the data into PCM format and writes the results to the given
buffer. The data will be packed, so mono data only requires 16 bits per slice
(assuming 16-bit format), while stereo data would be 32 bits per slice.
The one issue to plan for is that Ogg typically only decodes 4096 bytes at a time,
storing this in an internal buffer. Each call to ov_read
will
only return as much data as is stored in the buffer. For example, if you ask for
3072 bytes per call, the first call would return 3072 bytes, but the second call
would only return 1024 bytes. The next call to ov_read
will decode
another 4096 bytes of data.
Therefore the call to ov_read needs to reside within a loop that keeps running until the sample buffer has been filled, or until ov_read returns zero, indicating that it has reached the end of the audio data.
U32 QzOggDecoder::Read(S16 samples[], U32 sliceCount)
{
    if (false == m_IsOpen) {
        return 0;
    }

    // The bitstream value is simply a status indicator for which bitstream
    // is currently being decoded.  The value is not needed for anything by
    // this code.
    int bitstream = 0;

    U32 offset        = 0;
    U32 requestedSize = sliceCount * m_ChannelCount * sizeof(S16);

    char* pBuffer = reinterpret_cast<char*>(samples);

    // This may need to repeat several times to fill up the output buffer.
    // Ogg usually decodes data in 4KB blocks, which probably won't be enough
    // to fill the buffer in a single call to ov_read().
    //
    while (offset < requestedSize) {
        // Parameters 4, 5, and 6 are control parameters.
        //   param 4: 0 = little-endian, 1 = big-endian
        //   param 5: 1 = 8-bit samples, 2 = 16-bit samples
        //   param 6: 0 = unsigned,      1 = signed
        S32 byteCount = ov_read(&m_Context, pBuffer + offset,
#ifdef IS_BIG_ENDIAN
                                requestedSize - offset, 1, 2, 1, &bitstream);
#else
                                requestedSize - offset, 0, 2, 1, &bitstream);
#endif

        // Trap errors, and always return zero to trick the caller into
        // thinking the file is complete so it won't attempt to decode any
        // more data (note that this won't necessarily do any good if the
        // file is being looped, and could potentially produce infinite
        // loops if it keeps calling Loop() and Read()).
        if (byteCount < 0) {
            UtfFormat fmt;
            fmt.AddString(OggErrorToString(byteCount));
            LogErrorMessage("ov_read() %1;", fmt);
            return 0;
        }

        // End of file.
        if (0 == byteCount) {
            break;
        }

        offset += byteCount;
    }

    return offset / (m_ChannelCount * sizeof(S16));
}
To loop the contents of an Ogg file, call ov_raw_seek
to reset back
to the start of the file. This requires having a second level loop that calls
ov_raw_seek
after ov_read
reports it has reached the
end of the file.
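A minimal sketch of that reset, assuming it lives in a Loop() method on the decoder (the real QzOggDecoder.cpp may do additional bookkeeping):

// Sketch: rewind decoding back to the first sample so playback can loop.
bool QzOggDecoder::Loop(void)
{
    if (false == m_IsOpen) {
        return false;
    }

    // ov_raw_seek(0) repositions the decoder at the start of the stream.
    return (0 == ov_raw_seek(&m_Context, 0));
}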
As a final note, keep in mind that real-time decoding of Ogg data may not be the most efficient approach. Decoding an audio file can take 0.5% to more than 1.0% of the CPU. When dealing with lots of small audio clips that are going to be repeated, it can be much more efficient to decode a small Ogg file once, cache the decoded results in a buffer, then play back from that buffer. But that is an example of a higher-level optimization, which is not shown in the demo code — this is the sort of thing that would need to be part of a full audio playback engine.
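As a rough, hypothetical illustration of that idea (this helper is not part of the sample code), a short clip can be decoded once into a cached PCM buffer and then mixed from memory as many times as needed:

#include <vector>

// Hypothetical example: decode an entire (already opened) Ogg clip into a cached
// PCM buffer so it can be played repeatedly without paying the decode cost each time.
static void CacheDecodedClip(QzOggDecoder &decoder, U32 channelCount, std::vector<S16> &cache)
{
    S16 scratch[4096];
    U32 slicesPerRead = U32(sizeof(scratch) / sizeof(S16)) / channelCount;

    for (;;) {
        U32 sliceCount = decoder.Read(scratch, slicesPerRead);
        if (0 == sliceCount) {
            break;      // end of file (or a decode error)
        }

        cache.insert(cache.end(), scratch, scratch + (sliceCount * channelCount));
    }
}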
Those are the highlights of using Ogg Vorbis to decode audio data.
Refer to QzOggDecoder.cpp
for a commented class that
does all of the interface work needed to decode a .ogg
file.
Closing
This concludes my final planned article on writing code that can function identically on both Mac and Windows. The contents of these five articles cover everything I've ever needed to do to get my apps working on both platforms. But then, I don't do UI work, so supporting native GUI functionality is not something that often concerns me. If you need cross-platform GUI support, look at wxWidgets or something similar.
Granted, as I continue using Macs for more things, struggling to figure out the simplest of things due to the paucity of relevant samples, there will probably be more things I feel inclined to write about. If it takes me an entire weekend to figure out how to do something that should be simple, then it's probably worth writing an article about how it works.