Overview
The next subject for cross-platform programming is sound. Granted, this probably won't apply to many apps, but I write lots of software for games and audio/video editing. For me, audio output is often a significant design factor.
There are a number of different APIs on the Mac that can be used to output audio data. However, when writing C++ code that needs to consume PCM data generated by the app (as opposed to playing a file from disk), the best choice is the AudioQueue functions. The downside to using AudioQueue is that audio data must be queued up one buffer at a time, which increases the latency of audio output. If all you need to do is play audio data, this is not much of an issue. But when low latency is important (such as in a game engine or a real-time audio editing app), more testing and tweaking may be required to reduce latency without running the risk of audio dropping out when the CPU is heavily loaded.
For Windows, DirectSound offers a very nice, low-latency interface for outputting audio data. At least on XP. Vista moved the DirectSound functionality into software, which increases the latency for audio output (amongst other audio output changes that outraged many a gamer 'round the world). With DirectSound, it is possible to access the output buffer that is read by the hardware (or the software AudioStack on Vista), removing the need to write a full buffer of data at a time. This allows throughput latency to be reduced, but requires more logic to track read/write positions.
And of course there is always the option of trying a third-party library like Fmod or OpenAL. Every time I've needed to write audio code, it has either needed to be at the lowest level of the OS, or inside a PCI device, so Fmod, OpenAL, and similar libraries have been the wrong tool for the job. While they may be good tools for other programmers, I have not personally used them enough to comment on their value.
The sample code discussed in this article is available in crossplat.zip. It contains a QzAudioTest project that can be built with both DevStudio 7 and Xcode 3. The relevant code is found in QzAudioTest/AudioTest.cpp and Qz/QzSoundDriver.h.
Terminology
Something worth mentioning is terminology. When dealing with audio data, each individual number (usually 16 bits) is referred to as a "sample". However, what do you call the left/right pair of samples when playing stereo data?
There are quite a few answers to that question (none of them standardized). A common term is simply "pair", which is reasonable for stereo data, but not for mono or 5.1 audio. Some programmers refer to a pair of samples as a "sample" as well, making the terminology even more confusing. Does a "sample" refer to one number, two numbers, or more?
Some other common terms are "block", "group", and "frame". The Mac's AudioQueue documentation uses "frame" as the term of choice. Another is "slice", which is often used by audio DSP programmers. Though I am thankfully not a DSP programmer myself, "slice" is the term I am most accustomed to, so that is the one that appears throughout this article and the related code. (I've also heard "macroblock" and "tuple" used by some video engineers, but those terms apply to video data, not audio data.)
Needless to say, "sample" is an overloaded and confusing term when doing audio programming.
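To make the later listings easier to follow, here is a sketch of how the sizing constants used throughout the code relate to one another. The channel count, bit depth, slice size, and buffer count match what is described in this article; the slices-per-buffer value is an assumed placeholder, since the real constant lives in the Qz headers.

// Hypothetical values -- the actual constants live in the Qz library and may differ,
// but the relationships are what matter: one slice holds one sample per channel.
static const U32 c_AudioChannelCount = 2;    // stereo
static const U32 c_SampleBitDepth    = 16;   // 16-bit signed samples
static const U32 c_BytesPerSlice     = c_AudioChannelCount * (c_SampleBitDepth / 8); // 4 bytes
static const U32 c_AudioBufferCount  = 3;    // buffers in the playback ring
static const U32 c_SlicesPerBuffer   = 4096; // assumed buffer length in slices
static const U32 c_BytesPerBuffer    = c_SlicesPerBuffer * c_BytesPerSlice;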
Another term that comes up in the Mac programming docs is "packet". This
refers to a sequence of slices. For PCM data, a slice is a packet.
But for compressed data, such as ADPCM, a packet may contain several
hundred slices. Since this code only covers using PCM data, all
mFramesPerPacket
settings should be set to 1.
Base Class
To make life easier for myself, I have a single class definition for outputting audio data on both Windows and Mac:
class QzSoundDriver
{
private:
    // Each platform has a custom internal struct that contains the data
    // needed to interface with the sound player.  The struct is stored
    // as a black-box void* pointer so the rest of the app can remain
    // free from any platform-specific header files.
    void* m_pContext;

    U32   m_SampleRate;

public:
    QzSoundDriver(void);
    ~QzSoundDriver(void);

    bool Init(U32 sampleRate);
    bool Start(void);
    bool Stop(void);

    U32  MinWriteCount(void);
    U32  FreeBufferCount(void);
    U32  UpdatePosition(U32 lookahead);
    bool WriteToBuffer(S16 *pData, U32 sliceCount);

    // Only implemented for Mac version, since it is needed by the internal
    // callback handler.
    void IncFreeCount(void);
};
This header file is very short and simple. It also completely hides all of the platform-specific implementation within QzSoundDriverWin.cpp and QzSoundDriverMac.cpp. I'll cover the details of these two implementations throughout the rest of this article.
Higher-level audio playback logic only knows about the common definition
from QzSoundDriver.h
, allowing it to run the same on both
platforms. The Qz library does not have a full sound engine (at least not
yet — one will no doubt be added in time as I use it for more
projects).
Indeed, covering the design issues of a fully-featured sound engine is beyond the scope of this article. The only decent book I can recommend on the subject is Game Audio Programming, by James Boer, which designs and implements an audio engine, covering many of the relevant topics. The book uses DirectX for audio, spends more time on DirectMusic (which is effectively deprecated) than it does on DirectSound, and dedicates lots of time to EAX and similar hardware technologies, which Microsoft seems to have killed off with Vista. At the time of this writing, Boer's book is six years old — positively ancient as programming books go — but still the only one I know of worth mentioning on the subject.
I can only assume that audio programming is nowhere near as sexy as graphics programming.
Mac AudioQueue
The Mac-specific code is found in Qz/QzSoundDriverMac.cpp
.
This uses the AudioQueue functionality to output audio data.
We'll start off with the code to initialize an AudioQueue.
bool QzSoundDriver::Init(U32 sampleRate)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    // Protect against multiple calls to Init().
    FreeContext(pContext);

    pContext->DataFormat.mSampleRate       = sampleRate;
    pContext->DataFormat.mFormatID         = kAudioFormatLinearPCM;
    pContext->DataFormat.mFormatFlags      = kAudioFormatFlagIsSignedInteger
                                           | kAudioFormatFlagIsPacked;
    pContext->DataFormat.mBytesPerPacket   = c_BytesPerSlice;
    pContext->DataFormat.mFramesPerPacket  = 1;
    pContext->DataFormat.mBytesPerFrame    = c_BytesPerSlice;
    pContext->DataFormat.mChannelsPerFrame = c_AudioChannelCount;
    pContext->DataFormat.mBitsPerChannel   = c_SampleBitDepth;
    pContext->DataFormat.mReserved         = 0;

    m_SampleRate = sampleRate;

    // Create the audio queue that will be used to manage the array of audio
    // buffers used to queue samples.
    AudioQueueNewOutput(&(pContext->DataFormat), AudioCallback, this,
                        NULL, NULL, 0, &(pContext->Queue));

    // Clear these values.  We will pre-queue three buffers of silence before
    // starting output.  When the callback handler is called for the first
    // time, it will indicate that buffer [0] is free, and will increment
    // FreeBufferCount.  This causes the first call to WriteToBuffer() to
    // fill up that buffer, and from there we rely on state logic to keep
    // filling the correct buffers in the correct order.
    pContext->FreeBufferCount = 0;
    pContext->NextFreeBuffer  = 0;

    // Allocate the three buffers we will be using, fill each with silence
    // (all zeroes), and queue them up so they are ready to go once audio
    // output is started.
    for (U32 i = 0; i < c_AudioBufferCount; ++i) {
        AudioQueueAllocateBuffer(pContext->Queue, c_BytesPerBuffer,
                                 &(pContext->Buffers[i]));

        pContext->Buffers[i]->mAudioDataByteSize = c_BytesPerBuffer;

        memset(pContext->Buffers[i]->mAudioData, 0, c_BytesPerBuffer);

        AudioQueueEnqueueBuffer(pContext->Queue, pContext->Buffers[i], 0, NULL);
    }

    // Prime the pump.  This will "decode" the PCM data.  However, since the
    // data is already PCM, this really doesn't do anything, and audio will
    // start up and play without this call.  But the docs recommend making
    // this call, and since someone may change this code to take non-PCM
    // audio data, it's a good idea to keep this here for completeness.
    AudioQueuePrime(pContext->Queue, 0, NULL);

    return true;
}
The first thing the code needs to do is fill in the DataFormat
structure, which defines the formatting of the audio data. Some of the
information is redundant, since we're using 16-bit, stereo, linear PCM.
But even with PCM audio, there are many ways of representing the data, so we
must fill in the data fields correctly or risk getting garbled noise out of
the speakers.
- mSampleRate
- The sample rate of the audio data, typically 44,100 Hz. If audio needs to be played at different sample rates, a separate AudioQueue will be needed for each of them.
- mFormatID
- Specifies the base formatting of the audio (Mp3, PCM, ADPCM, etc.). In this case, we need to use linear PCM.
- mFormatFlags
- Additional flags are needed to specify variations of the format ID. PCM data could be represented as floats or integers, and integers could be 8, 16, 24, or 32 bits long — 16-bit PCM is a signed integer format. And we need to tag it as packed data, so the audio system won't try to interpret it as 16-bit data stored in 32-bit integers.
- mBytesPerPacket
- The packet size does not apply to PCM data. Set the packet size to be the same size as a slice (or "frame", to use Apple's terminology). If we were using a format like ADPCM, we would need to know the size of a packet (typically stored in the file header).
- mFramesPerPacket
- There is only one slice (or "frame") per packet with PCM data.
- mBytesPerFrame
- The frame size is channel_count × sample_size, or 4 bytes for 16-bit stereo.
- mChannelsPerFrame
- The code is hardwired for stereo data, so we set the channel count to 2.
- mBitsPerChannel
- Here "channel" is effectively yet another synonym for "sample": this field gives the size of each individual sample, and we're using 16-bit samples. Note that 16- and 24-bit samples can also be stored in 32-bit integers, aligned as either the high or low bits. Make certain that the bits in mFormatFlags are set up correctly to describe the positioning of the sample bits.
Once the DataFormat struct has been filled in, we call AudioQueueNewOutput to create a new AudioQueue object that uses exactly this format. After it has been created, the AudioQueue can only accept audio data in this format, so all buffers created by this AudioQueue are assumed to contain data in exactly this format.
The call to AudioQueueNewOutput
also sets the callback function
that will be used to notify the app when each buffer has been consumed by
the audio hardware. The Apple documentation takes the approach of having
the callback handler refill the buffer and enqueue the buffer. However,
this introduces multithreading issues into the code, and usually requires
the callback handler to call into higher level code to fetch the audio data
that will be played. This tends to make things more complex than necessary,
since multithreading and reentrancy are complex subjects that can introduce
subtle bugs.
The callback handler used here does nothing more than increment the
FreeBufferCount
field. This completely avoids multithreading
and reentrancy. Instead, higher level code needs to periodically poll
QzSoundDriver
to find out if there are any empty buffers —
if there are, it will fill another buffer with audio data and feed that
in through WriteToBuffer
. Look at the main loop in
AudioTest.cpp
for an example of how this works.
(Note that this is a bad example, since driving audio output from
the main loop is unreliable. Typically a separate thread is used to keep
audio output fed, which is a topic worthy of an article all its own.)
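For reference, the output callback can be as small as the following sketch. The real AudioCallback in QzSoundDriverMac.cpp may differ in detail, but the only work it needs to do is forward to IncFreeCount() through the user-data pointer that was passed to AudioQueueNewOutput.

// Sketch of the output callback: the "this" pointer passed to AudioQueueNewOutput()
// comes back as pUserData, so we bump the free-buffer count and return immediately.
static void AudioCallback(void *pUserData, AudioQueueRef queue, AudioQueueBufferRef pBuffer)
{
    QzSoundDriver *pDriver = reinterpret_cast<QzSoundDriver*>(pUserData);
    pDriver->IncFreeCount();
}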
After we have created the AudioQueue, we need to allocate some buffers with
AudioQueueAllocateBuffer
. As mentioned, these buffers are
required to only contain audio data in the format specified by DataFormat
.
Once we have a new buffer, we need to zero it out (for signed 16-bit audio,
zeroes represent silence), then queue it with AudioQueueEnqueueBuffer
.
For now, the queued buffers will sit in the AudioQueue, waiting for playback
to be started.
By queuing up several buffers of silence, we can guarantee that audio playback
will start immediately, instead of having to wait for higher-level code to
start filling buffers. In addition, by pre-queuing all of the buffers, we can
set FreeBufferCount
to zero, since all of the buffers are being
used, and we set NextFreeBuffer
to zero, since the first buffer
that will be consumed and returned will be buffer [0]
. This
effectively provides a very simple state machine that allows us to always know
which buffer needs to be filled and re-queued next, without having to explicitly
keep track of each pBuffer
that gets handed to the callback function.
Once all of the buffers have been zeroed out and queued, we can call
AudioQueuePrime
to preemtively "decode" the contents of the
buffers. This is not really necessary, since we are dealing with PCM data.
However, if we were using any type of encoded data (such as MP3 or ADPCM),
we would want to prime the queue to avoid extra delays when starting the
playback.
Finally, after all of the above initialization is complete, we can start
audio playback by calling AudioQueueStart
:
bool QzSoundDriver::Start(void)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    AudioQueueStart(pContext->Queue, NULL);

    return true;
}
Likewise, playback can be stopped by calling AudioQueueStop
.
Note that we also call FreeContext
to assure that all of
the audio buffers and the AudioQueue itself are deleted.
bool QzSoundDriver::Stop(void)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    OSStatus status = AudioQueueStop(pContext->Queue, true);

    FreeContext(pContext);

    return (noErr == status);
}
Now we get to the WriteToBuffer
function. This takes an array of
16-bit stereo PCM samples, which are written into the next available audio
buffer. This code assumes that the caller has first verified that there is
at least one empty audio buffer available, and that it is providing the exact number of slices required to fill the buffer.
(As an aside, you could try to write fewer than c_SlicesPerBuffer slices into each buffer, but buffers with less data in them will be consumed faster, producing a degree of unpredictability in how quickly buffers are consumed. It can also significantly increase the risk of the buffers being consumed faster than the higher-level code can fill them, causing audio output to drop out and introduce other artifacts. Always completely filling each buffer avoids these dangers.)
bool QzSoundDriver::WriteToBuffer(S16 *pData, U32 sliceCount)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    // We need at least one free buffer.  The caller should have used
    // UpdatePosition() to see if there is any buffer space available
    // before calling WriteToBuffer().
    if (0 == pContext->FreeBufferCount) {
        return false;
    }

    // We can only accept a full buffer of audio at a time.
    // (Actually, we could accept less, but that would make this
    // control logic more elaborate.)
    if (c_SlicesPerBuffer != sliceCount) {
        return false;
    }

    U32 index = pContext->NextFreeBuffer;

    // Blit the data into the next available buffer.
    pContext->Buffers[index]->mAudioDataByteSize = c_BytesPerBuffer;
    memcpy(pContext->Buffers[index]->mAudioData, pData, c_BytesPerBuffer);

    // Queue the buffer for playback.
    AudioQueueEnqueueBuffer(pContext->Queue, pContext->Buffers[index], 0, NULL);

    // Decrement the count of empty buffers and advance NextFreeBuffer
    // around to index of the next buffer in the three-buffer ring.
    QzThreadSafeDecrement((S32*)&(pContext->FreeBufferCount));
    pContext->NextFreeBuffer = (pContext->NextFreeBuffer + 1) % c_AudioBufferCount;

    return true;
}
Note that at the end of WriteToBuffer
, we decrement the count of empty
buffers and advance NextFreeBuffer
to the next buffer in the ring.
We're doing this to maintain a simple state machine with the callback function.
By always queuing buffers in a circular order, we always know the order in which
they are processed, and which one needs to be filled next.
Next, we have the IncFreeCount
function. This is called from the
callback handler so we can increment FreeBufferCount
. The higher-level
code will periodically poll FreeBufferCount
to see if there is an
empty buffer ready to be filled.
void QzSoundDriver::IncFreeCount(void)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    QzThreadSafeIncrement((S32*)&(pContext->FreeBufferCount));
}
And that's the basics of playing audio using AudioQueue. I'll go over how the higher-level code uses this functionality further down in the Driving the Driver section.
Windows DirectSound
The Windows-specific code is found in Qz/QzSoundDriverWin.cpp
,
which uses DirectSound for audio output.
Compared to AudioQueue, the DirectSound code is more involved, both for initialization and in accessing the contents of the audio buffer.
We'll start with the basic initialization routines. And once again, I'm leaving most of the error checking code out to keep these listings to a more manageable length.
bool QzSoundDriver::Init(U32 sampleRate)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    m_SampleRate = sampleRate;

    HRESULT hr = S_OK;

    // Create the DirectSound object required for buffer allocation.
    hr = DirectSoundCreate8(NULL, &(pContext->pDxSound), NULL);

    // If we're running in normal window (QzMainWin.cpp), this
    // will be the handle where graphics are being rendered.
    HWND hWindow = g_hWindow;

    // However, if hWindow is not defined, assume we're running
    // from a console app, so we can fetch that window handle.
    if (NULL == hWindow) {
        hWindow = GetConsoleWindow();
    }

    // Set the cooperative level.  We need to use PRIORITY level
    // so we can call SetFormat() on the primary mixing buffer.
    hr = pContext->pDxSound->SetCooperativeLevel(hWindow, DSSCL_PRIORITY);

    // Fill in a struct that defines the primary mixing buffer.
    // Although we won't be directly writing to this buffer, we
    // do need to access it to set the sampling rate (otherwise
    // our audio data may be resampled on playback, which can
    // reduce quality).
    DSBUFFERDESC desc;
    SafeZeroVar(desc);
    desc.dwSize        = sizeof(DSBUFFERDESC);
    desc.dwFlags       = DSBCAPS_PRIMARYBUFFER;
    desc.dwBufferBytes = 0;
    desc.lpwfxFormat   = 0;

    IDirectSoundBuffer *pPrimaryBuffer = NULL;
    hr = pContext->pDxSound->CreateSoundBuffer(&desc, &pPrimaryBuffer, 0);

    // Now fill in a WAV struct that defines the formatting of the
    // audio data as 16-bit stereo PCM.
    WAVEFORMATEX format;
    SafeZeroVar(format);
    format.wFormatTag      = WAVE_FORMAT_PCM;
    format.nChannels       = c_AudioChannelCount;
    format.nSamplesPerSec  = m_SampleRate;
    format.nBlockAlign     = 2 * c_AudioChannelCount;
    format.nAvgBytesPerSec = format.nSamplesPerSec * U32(format.nBlockAlign);
    format.wBitsPerSample  = c_SampleBitDepth;

    // Now we SetFormat() on the primary buffer, which sets the audio
    // sample rate that will be used for audio output.
    hr = pPrimaryBuffer->SetFormat(&format);

    // Release the reference to the primary buffer.
    // We won't be needing it again.
    SafeRelease(pPrimaryBuffer);

    // The context info contains the current position at which we will
    // start writing data.  Note that we're setting WriteOffset and
    // BytesRemaining as if that many samples of silence exist at the
    // start of the buffer.  That gives us this amount of playback time
    // after starting before we must write data into the buffer, or
    // risk glitching the audio output and screwing up the write
    // position.
    pContext->BufferSize     = m_SampleRate * c_BytesPerSlice;
    pContext->Position       = 0;
    pContext->WriteOffset    = c_AudioBufferCount * c_BytesPerBuffer;
    pContext->BytesRemaining = c_AudioBufferCount * c_BytesPerBuffer;

    // The exact same buffer settings are used to create the buffer
    // into which we will be writing audio data.  (I'm manually refilling
    // this struct, since I've had problems in the past with Win32 calls
    // changing the contents of the WAV struct.  It doesn't seem to
    // happen here, but once bitten...)
    SafeZeroVar(format);
    format.wFormatTag      = WAVE_FORMAT_PCM;
    format.nChannels       = c_AudioChannelCount;
    format.nSamplesPerSec  = m_SampleRate;
    format.nBlockAlign     = 2 * c_AudioChannelCount;
    format.nAvgBytesPerSec = format.nSamplesPerSec * U32(format.nBlockAlign);
    format.wBitsPerSample  = c_SampleBitDepth;

    // Set up a DSBUFFERDESC structure.  Note that this has different
    // settings than for the primary buffer.  Primarily, we need to be
    // able to get the current playback position so we can write into
    // the buffer as it is being played.
    SafeZeroVar(desc);
    desc.dwSize        = sizeof(DSBUFFERDESC);
    desc.dwFlags       = DSBCAPS_GETCURRENTPOSITION2;
    desc.dwBufferBytes = format.nAvgBytesPerSec;
    desc.lpwfxFormat   = &format;

    // Set this flag to make sound keep playing when app does not have
    // focus.  Otherwise sound becomes inaudible (but continues to be
    // processed) when mouse/keyboard focus switches to another window.
    desc.dwFlags |= DSBCAPS_GLOBALFOCUS;

    // This creates a basic DirectSound buffer.
    IDirectSoundBuffer *pBuffer = NULL;
    hr = pContext->pDxSound->CreateSoundBuffer(&desc, &pBuffer, NULL);

    // Now QI for a DirectSound 8 sound buffer.  This interface
    // will allow us to call GetCurrentPosition() on the buffer.
    hr = pBuffer->QueryInterface(g_SoundBuffer8Guid, (void**)&(pContext->pBuffer));

    // Release the reference to the basic sound buffer interface.
    // We will be using the DS 8 interface from here on out.
    SafeRelease(pBuffer);

    S16* pAdr      = NULL;
    U32  byteCount = 0;

    // Lock the buffer so we can zero out the entire thing.
    hr = pContext->pBuffer->Lock(0, pContext->BufferSize, (void**)&pAdr, &byteCount,
                                 NULL, NULL, DSBLOCK_ENTIREBUFFER);

    // It is most unlikely that the buffer could be lost if we're still
    // initializing the app, but try to restore it if we can.
    if (DSERR_BUFFERLOST == hr) {
        pContext->pBuffer->Restore();
        hr = pContext->pBuffer->Lock(0, pContext->BufferSize, (void**)&pAdr, &byteCount,
                                     NULL, NULL, DSBLOCK_ENTIREBUFFER);
    }

    // Zero-out the buffer so it starts off outputting silence.
    memset(pAdr, 0, pContext->BufferSize);

    hr = pContext->pBuffer->Unlock(pAdr, byteCount, NULL, 0);

    LogMessage("DirectSound driver Init() completed");

    return true;
}
The code above starts off with a call to DirectSoundCreate8
to
create the basic DirectSound 8 object needed to access the device. Once we
have this pointer, we can create all of the buffers and access the control
routines needed to drive audio.
Next we need to call into SetCooperativeLevel
. This achieves
two goals. First, it associates audio playback with a specific window —
by default, audio output is silenced when that window loses input focus
(it gets minimized, another window moves in front of it, etc.). Second,
this assigns PRIORITY
rights to the device, which is needed to
access the primary mixing buffer so we can adjust the buffer's settings.
Then we can create the primary output buffer. Since the primary buffer
controls the audio sampling rate played from the output device, we need
to call SetFormat
so we can control the format and sample
rate of the buffer. Once those settings have changed, we can release
the reference, since we won't need to touch the primary buffer again:
DirectSound keeps its own internal reference which will last until the
device is shut down, making for one less COM reference for us to track.
Next, we create a second sound buffer. This is the one into which we will be
writing audio data. However, we cannot use the standard COM interface,
since that does not allow us to access the current read/write positions
within the buffer. For that, we need to QueryInterface
for
a version 8 DirectSound buffer (and make certain that we have set the
DSBCAPS_GETCURRENTPOSITION2
caps bit before creating the
buffer). Then we can release the base COM reference, since we only need
the version 8 interface.
Finally, we need to lock the buffer with the DSBLOCK_ENTIREBUFFER flag so we can zero out its entire contents. This way audio output will start playing silence as soon as we start playback.
Now we can start audio playback:
bool QzSoundDriver::Start(void)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    if ((NULL == pContext->pDxSound) || (NULL == pContext->pBuffer)) {
        LogErrorMessage("Start(): DS not initialized");
        return false;
    }

    HRESULT hr = S_OK;

    // Set playback to start from the beginning of the buffer.
    hr = pContext->pBuffer->SetCurrentPosition(0);

    // Start playback to loop infinitely over the contents of the buffer.
    hr = pContext->pBuffer->Play(0, 0, DSBPLAY_LOOPING);

    return true;
}
Here we need to call SetCurrentPosition to make certain that playback will start from the beginning of the buffer (in theory it should anyway, but let's not take any chances — things could be messed up if playback was started, stopped, then started again). Then we can issue the Play call that starts audio output from the buffer, with output looping infinitely. After this point, we need to keep writing audio data into the buffer to stay ahead of where the hardware is pulling audio data out of the buffer.
bool QzSoundDriver::Stop(void)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    if ((NULL == pContext->pDxSound) || (NULL == pContext->pBuffer)) {
        LogErrorMessage("Stop(): DS not initialized");
        return false;
    }

    HRESULT hr = S_OK;

    hr = pContext->pBuffer->Stop();

    return true;
}
Stopping playback simply involves calling Stop. Note that stopping is immediate, so if audio is still playing non-zero samples, a pop or click will be heard as audio comes to an abrupt halt. This is more noticeable on some hardware than others: cheap audio output like SoundMAX exhibits the problem badly, whereas SoundBlaster hardware does some extra output filtering that makes the artifacts less obvious. The cleanest way to stop audio is to scale the audio data down to zero over time, give playback enough time to consume that ramped-down data, then stop playback. But that logic has to exist at a higher level.
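That higher-level ramp is easy to sketch. Assuming the final buffer of 16-bit stereo slices is available before playback stops, something like the following hypothetical helper (not part of QzSoundDriver) scales it down to silence:

// Hypothetical helper: apply a linear fade-out across one buffer of interleaved
// 16-bit stereo data, so the last thing the hardware plays trails off to silence.
static void FadeOutSlices(S16 *pData, U32 sliceCount)
{
    if (sliceCount < 2) {
        return;
    }

    for (U32 i = 0; i < sliceCount; ++i) {
        // Gain runs from 1.0 at the first slice down to 0.0 at the last slice.
        float gain = float(sliceCount - 1 - i) / float(sliceCount - 1);

        pData[(i * 2) + 0] = S16(float(pData[(i * 2) + 0]) * gain);  // left
        pData[(i * 2) + 1] = S16(float(pData[(i * 2) + 1]) * gain);  // right
    }
}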
And now we get to the DirectSound version of WriteToBuffer
:
bool QzSoundDriver::WriteToBuffer(S16 *pData, U32 sliceCount)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    if (NULL == pContext->pBuffer) {
        return false;
    }

    U32  size1 = 0;
    U32  size2 = 0;
    U08* ptr1  = NULL;
    U08* ptr2  = NULL;

    U32 byteCount = sliceCount * c_BytesPerSlice;

    // Lock a range of memory within the buffer, starting from where we last
    // wrote data, through to the end of where we are going to be writing
    // data this time around.  Since this is a ring buffer, we get back two
    // size values and two pointers.  We only need the second pair of values
    // if the data wraps around the end of the buffer.
    HRESULT hr = pContext->pBuffer->Lock(pContext->WriteOffset, byteCount,
                                         (void**)&ptr1, &size1,
                                         (void**)&ptr2, &size2, 0);

    if (DSERR_BUFFERLOST == hr) {
        hr = pContext->pBuffer->Restore();

        // NOTE: It is possible for the restore to fail if the app window
        // does not have focus.
        hr = pContext->pBuffer->Lock(pContext->WriteOffset, byteCount,
                                     (void**)&ptr1, &size1,
                                     (void**)&ptr2, &size2, 0);
    }

    // Copy the first segment of data.  If this is in the middle of the
    // buffer, we're done.
    if (NULL != ptr1) {
        memcpy(ptr1, pData, size1);
    }

    // However, if the data range wraps around the end of the buffer,
    // we need to copy the rest of the data to the second memory range,
    // which will start at the beginning of the buffer.
    if (NULL != ptr2) {
        memcpy(ptr2, reinterpret_cast<U08*>(pData) + size1, size2);
    }

    hr = pContext->pBuffer->Unlock(ptr1, size1, ptr2, size2);

    // Update the write position, keeping in mind that this is a ring buffer.
    pContext->WriteOffset = (pContext->WriteOffset + byteCount) % pContext->BufferSize;

    pContext->BytesRemaining += byteCount;

    return true;
}
Unlike Mac's AudioQueue, where we are feeding full buffers of audio to output, with DirectSound we are writing audio data into the output buffer as the hardware is reading data from the same buffer. Since we have the buffer looping infinitely, this is simply a ring buffer. By keeping track of where we last wrote data, we know the offset at which the next write operation needs to start.
We prepare to write data by calling Lock
. This maps that
part of the buffer into addressable memory, allowing us to memcpy
audio data into it. (Maybe. Many drivers actually provide a separate buffer
for writing, then transfer data from there to hardware after the buffer is
Unlock
ed. But we seldom need to know about how the DirectSound
driver implements its internal functionality.)
Because we are dealing with a ring buffer, there is the special case when
the range of data we need to write will wrap around the end of the buffer.
That is why the Lock
function will return two pointers and two
size values. Normally, the second pointer will be NULL
, so we only need to
memcpy
to the first pointer. However, if the second pointer
is non-NULL
, then a second memcpy
is required to
blit the rest of the sound data to the second memory address.
And that concludes the basic functionality within the DirectSound version of QzSoundDriver.
Driving the Driver
Now we can turn our attention to how we can drive audio output using either
of the two platform-specific implementations. The code for this is found
in AudioTest.cpp
.
The sample code there sits in a loop, polling to see if there is an empty
buffer to process, filling it with audio data, then Sleep
ing
for a few milliseconds before repeating. This is a functional but rather
clumsy loop. Using a signal event is better than sleeping, since the code
can be placed in a worker thread and wake up in response to events. But
implementing a good audio engine is beyond the scope of this (already too
long) article. Sleeping is adequate for the purpose of this simple demo.
Here is a simplified version of the main loop:
QzSoundDriver driver;

driver.Init(sampleRate);
driver.Start();

for (;;) {
    // How many empty buffers are there to fill up?  Make this
    // call only once, and do it before we enter the loop.
    // If we tried to repeatedly call this inside the inner
    // loop, we would end up in an infinite loop when running
    // with the DirectSound version of the driver.
    U32 freeCount = driver.FreeBufferCount();

    for (U32 i = 0; i < freeCount; ++i) {
        // Find out how much data is currently stored in the output buffer.
        U32 sliceCount = driver.UpdatePosition(lookahead);

        // If the amount of audio data is less than the desired
        // lookahead amount (and it always should be), we'll need
        // to mix up more audio to fill in the empty space in the
        // buffer.
        if (sliceCount > 0) {
            GenerateSamples(pScratch, sliceCount);

            driver.WriteToBuffer(pScratch, sliceCount);
        }
    }

    // Wake up about 30 times a second to feed in more audio data.
    // At a 44,100 Hz sample rate, sleeping for 30 milliseconds
    // will consume about 1,300 slices per iteration of the loop.
    QzSleep(30);
}

driver.Stop();
Once the driver is initialized and started, we drop into the main loop
that keeps the driver fed with audio data. Periodically, the loop will
wake up and call FreeBufferCount
to find out if there are
any buffers that need to be filled. If there is an empty buffer, it
calls UpdatePosition
to find out how much data is needed
to fill the buffer. Then it can go off to generate the data (such as
mixing its own audio data, or calling a codec to decompress audio from
a file), then call WriteToBuffer
to write the data into
the next available buffer for output.
However, the AudioQueue and DirectSound code behave differently. Those differences are hidden behind the class interface, allowing this loop to work the same with both implementations.
The first difference is with FreeBufferCount
. The AudioQueue
implementation actually keeps track of how many buffers are empty, and
returns the actual count. With DirectSound, however, FreeBufferCount
always returns 1, since there is only one buffer, and it is always
possible to write more data into the buffer since the hardware read
pointer is constantly advancing around the buffer.
Therefore it is important for FreeBufferCount to be called only once, with the inner loop then repeating exactly that many times. Otherwise the DirectSound implementation would cause this to drop into an infinite loop.
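For comparison, the DirectSound version of FreeBufferCount amounts to little more than the following sketch (the version in QzSoundDriverWin.cpp may add validity checks):

// DirectSound sketch: there is only the one ring buffer, and it can always accept
// more data, so report exactly one "free buffer" per polling pass.
U32 QzSoundDriver::FreeBufferCount(void)
{
    return 1;
}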
The next important function is UpdatePosition
. For AudioQueue,
this is trivial: it either returns the number of samples that fit into one
buffer, or it returns zero if there are no empty buffers (which should not
happen if the code is calling FreeBufferCount
first).
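A sketch of what the Mac version of UpdatePosition amounts to (the actual code in QzSoundDriverMac.cpp may differ in detail):

// Mac sketch: ignore the lookahead hint, since each write must be exactly one full
// buffer.  Report a full buffer's worth of slices whenever at least one buffer is free.
U32 QzSoundDriver::UpdatePosition(U32 lookahead)
{
    AudioStreamContext_t *pContext =
        reinterpret_cast<AudioStreamContext_t*>(m_pContext);

    if (0 == pContext->FreeBufferCount) {
        return 0;
    }

    return c_SlicesPerBuffer;
}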
However, the DirectSound implementation of UpdatePosition
is rather more elaborate. Since a single ring buffer is being used for
audio output, state logic needs to keep track of where data was written,
read back the current "write" position from the hardware to figure out
how much audio data has been consumed since the last update, then
calculate how many slices need to be written to maintain the requested
lookahead distance within the buffer.
U32 QzSoundDriver::UpdatePosition(U32 lookahead)
{
    DirectSoundContext_t *pContext =
        reinterpret_cast<DirectSoundContext_t*>(m_pContext);

    U32 write = 0;

    // Get the "write" position.  This is where it is safe to start writing
    // audio data.  Assume that all data in the buffer before this point
    // has been consumed, and plan to start writing data somewhere after
    // this position.
    HRESULT hr = pContext->pBuffer->GetCurrentPosition(NULL, &write);

    if (FAILED(hr)) {
        PrintHRESULT("GetCurrentPosition() failed", hr);
        return 0;
    }

    // Ring buffer logic to figure out how much data has been consumed
    // since the last time we updated the position information.
    U32 consumed = (pContext->BufferSize + write - pContext->Position) % pContext->BufferSize;

    pContext->Position = write;

    // How much valid data is stored in the buffer?
    if (consumed >= pContext->BytesRemaining) {
        // This is the bad condition: we haven't kept up, and the hardware
        // has consumed all of the data we've given it.  We can try to
        // write more data, but we've already had an audio glitch at this
        // point.  And by the time we write more data into the buffer,
        // the hardware will almost certainly have advanced beyond this new
        // write position, so another glitch will happen then.  Hopefully
        // after this point we can return to keeping ahead of the hardware.
        //
        // If not, we might consider stopping audio, clearing out the whole
        // buffer, then restarting with a long duration of silent data in
        // the buffer.  That will give higher-level code time to queue up
        // more audio data and once again try to keep in advance of where
        // the hardware is reading from the buffer.
        //
        pContext->BytesRemaining = 0;
        pContext->WriteOffset    = write;
    }
    else {
        pContext->BytesRemaining -= consumed;
    }

    // How many slices are currently sitting in the buffer.
    U32 sliceCount = pContext->BytesRemaining / c_BytesPerSlice;

    // If there are not at least "lookahead" number of slices, return the
    // number of slices needed to keep "lookahead" slices in advance of what
    // the hardware is processing.
    if (sliceCount < lookahead) {
        return lookahead - sliceCount;
    }

    return 0;
}
Hopefully the comments in the above code are enough to describe what it is doing. The key point is that since we are writing to the buffer as hardware is reading from the buffer, we must always keep ahead of the hardware. This is where the size of the "lookahead" distance is important.
Since we are writing to the output buffer, we can reduce the lookahead distance to reduce the latency of audio samples that are being played. This can make a game sound more responsive (e.g., bullet sound effects occur closer to the time at which a muzzle flash is displayed on the screen), or make audio timeline scrubbing more responsive in an audio/video editing app.
The downside of reducing lookahead distance is that we increase the chance of audio glitching if the CPU becomes too heavily loaded or the frame rate drops too much (which is a good reason to have the audio being driven from a separate thread, rather than trying to update it once per frame in the main rendering loop).
Increasing the lookahead distance will reduce the odds of a sound glitch, but makes audio output more laggy. There is no good answer to this problem. The only option is empirical: test the performance of your app under heavy loads, and increase lookahead to whatever gives the best results.
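Since the lookahead value is expressed in slices, converting a target latency in milliseconds into the value passed to UpdatePosition is a one-line calculation. A minimal sketch (this helper is hypothetical, not part of the sample code):

// Hypothetical helper: convert a desired latency in milliseconds into a slice count.
// At 44,100 Hz, a 100 ms lookahead works out to 4,410 slices.
static U32 LookaheadFromMilliseconds(U32 sampleRate, U32 milliseconds)
{
    return (sampleRate * milliseconds) / 1000;
}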
And to add to the problem, Vista removed hardware acceleration from DirectSound and moved all audio processing into software, so that the new software AudioStack can allow multiple apps to use DirectSound concurrently, and allow additional per-application control of sound. This will no doubt add even more latency to audio output. Trying to compensate for it by reducing the lookahead distance will only make the problem worse.
Thankfully, most people are not good at perceiving sound latency if it is kept down to 100-150 milliseconds. Try displaying an animation of an explosion, then play a sound effect 100 milliseconds later. Most people will think that the animation and the sound happened at the same time — that is the way our brains are wired, thanks to the speed of sound. Keeping latency down is not as essential as many programmers think.
In my experience, 100 milliseconds is adequate for most applications. And I say this having written audio control apps for real experts who literally can detect a difference of 16 milliseconds. Those individuals are incredibly rare and can be safely ignored until you need to write software for them to use. The rest of us normal folk are not trained for that kind of perception.
Ogg/Vorbis
While on the subject of sound, I'll throw in some notes on writing code to
read Ogg Vorbis sound files. Check out the QzOggDecoder.cpp
class for a working implementation.
For DevStudio, you'll just need to include vorbisfile.h
in the
project. Provided all of the relevant headers are in the path, DevStudio will
pick up all of the files it needs to compile. To make life simpler, I built
a custom libvorbis.dll
that has all of the decode functionality
combined into a single DLL, instead of the two or three different DLLs
required by a regular build.
To build with Xcode, you'll need to obtain Ogg.framework and Vorbis.framework
from Xiph.org. The one curiosity
here is that these two frameworks ended up in /Library/Frameworks
instead of /System/Library/Frameworks
. To get the code to
compile in Xcode, it is necessary to define a __MACOSX__
symbol
to get the definitions in vorbistypes.h
to work correctly. For whatever
reason, this symbol is not defined when building with Xcode. (Maybe
the folks at Xiph are out of sync with the current version of Xcode?
Maybe I have no clue how to properly install software on a Mac?)
Since the header files are located in two different frameworks, the following explicit includes are required to build in Xcode:
#define __MACOSX__      // need to explicitly define __MACOSX__ for VorbisTypes.h

#include <Ogg/os_types.h>
#include <Ogg/ogg.h>
#include <Vorbis/vorbisfile.h>
Ogg uses a set of callbacks to access data from a file. There is a standard handler that can be used if you want to read from a file, but sometimes it is easier to read the whole Ogg file into memory and access from there. By implementing your own callback functions, you can have the Ogg library decode data from memory.
ov_callbacks callbacks;

callbacks.read_func  = CallbackRead;
callbacks.seek_func  = CallbackSeek;
callbacks.close_func = CallbackClose;
callbacks.tell_func  = CallbackTell;

ov_open_callbacks(this, &m_Context, NULL, 0, callbacks);
These four callback functions take the place of fread, fseek, fclose, and ftell.
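Since ov_open_callbacks expects plain C function pointers, the four callbacks are typically static trampolines that cast the datasource pointer back to the decoder object and forward to the member functions shown below. A sketch of what those trampolines might look like (the actual wrappers in QzOggDecoder.cpp may be organized differently):

// Sketch of the static trampolines handed to ov_open_callbacks().  The "datasource"
// pointer is the "this" pointer passed as the first argument to ov_open_callbacks().
static size_t CallbackRead(void *pDest, size_t size, size_t count, void *pSource)
{
    return reinterpret_cast<QzOggDecoder*>(pSource)->DataRead(pDest, U32(size), U32(count));
}

static int CallbackSeek(void *pSource, ogg_int64_t offset, int origin)
{
    return reinterpret_cast<QzOggDecoder*>(pSource)->DataSeek(U32(offset), origin);
}

static int CallbackClose(void *pSource)
{
    return int(reinterpret_cast<QzOggDecoder*>(pSource)->DataClose());
}

static long CallbackTell(void *pSource)
{
    return long(reinterpret_cast<QzOggDecoder*>(pSource)->DataTell());
}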
The first function required is one that emulates the behavior of ftell
.
This just needs to return the current offset into the data buffer:
U32 QzOggDecoder::DataTell(void)
{
    return m_Offset;
}
The next function needed is the equivalent of fseek
. Caution needs to
be taken with this function, since Ogg may try to seek past the end of the file
when doing the initial scan of the data. All three origin values
(SEEK_CUR
,
SEEK_SET
, and
SEEK_END
) need to be supported:
int QzOggDecoder::DataSeek(U32 offset, int origin)
{
    switch (origin) {
        case SEEK_CUR:
            m_Offset += offset;
            if (m_Offset > m_ByteCount) {
                m_Offset = m_ByteCount;
            }
            break;

        case SEEK_SET:
            if (offset < m_ByteCount) {
                m_Offset = offset;
            }
            else {
                m_Offset = m_ByteCount;
            }
            break;

        case SEEK_END:
            if (offset < m_ByteCount) {
                m_Offset = m_ByteCount - offset;
            }
            else {
                m_Offset = 0;
            }
            break;

        default:
            return -1;
    }

    return 0;
}
Another required function needs to behave the same as fread
,
including its behavior when given data size values other than 1. Caution
also needs to be taken when attempting to read past the end of the buffer,
which Ogg Vorbis often will do when it initially scans the data as part
of the ov_open_callbacks
function.
U32 QzOggDecoder::DataRead(void *pDest, U32 dataSize, U32 dataCount)
{
    U32 byteCount = dataSize * dataCount;

    // Ogg will happily read past the end of the file when starting up.
    // Do a range check to avoid running past the end of the buffer.
    // Make certain that the byte count to read is a multiple of the
    // requested element size.
    if ((m_Offset + byteCount) > m_ByteCount) {
        byteCount = ((m_ByteCount - m_Offset) / dataSize) * dataSize;
    }

    if (byteCount > 0) {
        memcpy(pDest, m_pData + m_Offset, byteCount);
    }

    m_Offset += byteCount;

    // Make certain that the value returned is the number of data units
    // that were read, not the total number of bytes.
    return byteCount / dataSize;
}
Ogg also requires a fclose
function. Since the data resides
in memory, this function doesn't need to do anything.
U32 QzOggDecoder::DataClose(void)
{
    return 0;
}
Decoding of the audio data is done through the ov_read
function,
which extracts the data into PCM format and writes the results to the given
buffer. The data will be packed, so mono data only requires 16 bits per slice
(assuming 16-bit format), while stereo data would be 32 bits per slice.
The one issue to plan for is that Ogg typically only decodes 4096 bytes at a time,
storing this in an internal buffer. Each call to ov_read
will
only return as much data as is stored in the buffer. For example, if you ask for
3072 bytes per call, the first call would return 3072 bytes, but the second call
would only return 1024 bytes. The next call to ov_read
will decode
another 4096 bytes of data.
Therefore the call to ov_read needs to reside within a loop that keeps running until the sample buffer has been filled, or until ov_read returns zero, indicating that it has reached the end of the audio data.
U32 QzOggDecoder::Read(S16 samples[], U32 sliceCount)
{
    if (false == m_IsOpen) {
        return 0;
    }

    // The bitstream value is simply a status indicator for which bitstream
    // is currently being decoded.  The value is not needed for anything by
    // this code.
    int bitstream = 0;

    U32 offset        = 0;
    U32 requestedSize = sliceCount * m_ChannelCount * sizeof(S16);

    char* pBuffer = reinterpret_cast<char*>(samples);

    // This may need to repeat several times to fill up the output buffer.
    // Ogg usually decodes data in 4KB blocks, which probably won't be enough
    // to fill the buffer in a single call to ov_read().
    //
    while (offset < requestedSize) {
        // Parameters 4, 5, and 6 are control parameters.
        //   param 4: 0 = little-endian, 1 = big-endian
        //   param 5: 1 = 8-bit samples, 2 = 16-bit samples
        //   param 6: 0 = unsigned,      1 = signed
        S32 byteCount = ov_read(&m_Context, pBuffer + offset,
#ifdef IS_BIG_ENDIAN
                                requestedSize - offset, 1, 2, 1, &bitstream);
#else
                                requestedSize - offset, 0, 2, 1, &bitstream);
#endif

        // Trap errors, and always return zero to trick the caller into
        // thinking the file is complete so it won't attempt to decode any
        // more data (note that this won't necessarily do any good if the
        // file is being looped, and could potentially produce infinite
        // loops if it keeps calling Loop() and Read()).
        if (byteCount < 0) {
            UtfFormat fmt;
            fmt.AddString(OggErrorToString(byteCount));
            LogErrorMessage("ov_read() %1;", fmt);
            return 0;
        }

        // End of file.
        if (0 == byteCount) {
            break;
        }

        offset += byteCount;
    }

    return offset / (m_ChannelCount * sizeof(S16));
}
To loop the contents of an Ogg file, call ov_raw_seek
to reset back
to the start of the file. This requires having a second level loop that calls
ov_raw_seek
after ov_read
reports it has reached the
end of the file.
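A minimal sketch of that reset, assuming it lives in a Loop() method on the decoder (the real QzOggDecoder.cpp may do additional bookkeeping):

// Sketch: rewind decoding back to the first sample so playback can loop.
bool QzOggDecoder::Loop(void)
{
    if (false == m_IsOpen) {
        return false;
    }

    // ov_raw_seek(0) repositions the decoder at the start of the stream.
    return (0 == ov_raw_seek(&m_Context, 0));
}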
As a final note, keep in mind that real-time decoding of Ogg data may not be the most efficient approach. Decoding an audio file can take 0.5% to more than 1.0% of the CPU. When dealing with lots of small audio clips that are going to be repeated, it can be much more efficient to decode a small Ogg file once, cache the decoded results in a buffer, then play back from that buffer. But that is an example of a higher-level optimization, which is not shown in the demo code — this is the sort of thing that would need to be part of a full audio playback engine.
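As a rough, hypothetical illustration of that idea (this helper is not part of the sample code), a short clip can be decoded once into a cached PCM buffer and then mixed from memory as many times as needed:

#include <vector>

// Hypothetical example: decode an entire (already opened) Ogg clip into a cached
// PCM buffer so it can be played repeatedly without paying the decode cost each time.
static void CacheDecodedClip(QzOggDecoder &decoder, U32 channelCount, std::vector<S16> &cache)
{
    S16 scratch[4096];
    U32 slicesPerRead = U32(sizeof(scratch) / sizeof(S16)) / channelCount;

    for (;;) {
        U32 sliceCount = decoder.Read(scratch, slicesPerRead);
        if (0 == sliceCount) {
            break;      // end of file (or a decode error)
        }

        cache.insert(cache.end(), scratch, scratch + (sliceCount * channelCount));
    }
}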
Those are the highlights of using Ogg Vorbis to decode audio data.
Refer to QzOggDecoder.cpp
for a commented class that
does all of the interface work needed to decode a .ogg
file.
Closing
This concludes my final planned article on writing code that can function identically on both Mac and Windows. The contents of these five articles cover everything I've ever needed to do to get my apps working on both platforms. But then, I don't do UI work, so supporting native GUI functionality is not something that often concerns me. If you need cross-platform GUI support, look at wxWidgets or something similar.
Granted, as I continue using Macs for more things, struggling to figure out the simplest of things due to the paucity of relevant samples, there will probably be more things I feel inclined to write about. If it takes me an entire weekend to figure out how to do something that should be simple, then it's probably worth writing an article about how it works.