Simplifying Matrix Math

Overview

There are numerous ways in which matrices and vectors are represented in source code: left-handled vs. right-handled, row-major vs. column-major matrices, pre-multiply vs. post-multiply logic, row vectors vs. column vectors, vector-matrix multiplication where the vector is on the left or right side of the matrix, reduced 4×3 matrices, and probably a few other mutant varieties that have yet to add tasks to my bug list.

The problems that arise from these varied and inconsistent representations can result in headache-inducing debugging problems for any programmer who has to write code to bridge two or more libraries that process matrices.

And to make it even aggravating, the documentation may not reflect the actual memory layout of matrices. Take OpenGL for example. One naïve glance at the documentation might give you the impression that the coefficients for a translation matrix are arranged like this:

float f[16] =
{
    1.0f, 0.0f, 0.0f, x,
    0.0f, 1.0f, 0.0f, y,
    0.0f, 0.0f, 1.0f, z,
    0.0f, 0.0f, 0.0f, 1.0f
};

When in fact, OpenGL matrices are stored this way:

float f[16] =
{
    1.0f, 0.0f, 0.0f, 0.0f,
    0.0f, 1.0f, 0.0f, 0.0f,
    0.0f, 0.0f, 1.0f, 0.0f,
    x,    y,    z,    1.0f
};

Then, when reading comments on the differences between OpenGL and DirectX, you may well get lost in the row-major vs. column-major verbiage. Which is a shame, since both OpenGL and DirectX store matrices with exactly the same coefficient ordering in memory. The difference between them is how the matrices are generated. On the one hand, it is a fairly well-known fact by more experienced programmers that OpenGL and DirectX use the same memory layout, and is often pointed out on various graphics programming forums. Yet at the same time, you might never intuit this just by reading the documentation and the various confused posts on graphics forums.

I'm not even going to try to explain the difference in how OpenGL and DirectX go about creating matrices. The one constant I've learned with matrices is that everyone has a different internalization of how matrices are used and what all those terms mean: left, right, column-major, row-major, etc.

You would think that the same terms would mean the same thing to everyone, but they do not. Odds are quite good that you yourself consider the "correct" matrix representation to be the one you first learned, and anything different is either "wrong" or simply confusing. Worse, programmers often are taught the "wrong" meaning for all of those terms, where even "wrong" depends on whether you have a math or computing background.

In the end, I've learned to never trust the documentation, comments, or even function names in matrix libraries. Making even one assumption is likely to result in bugs. The only approach I trust is to visually study the source code (if it is available), and write lots of test code to verify how the coefficients are laid out, whether the "left" and "right" matrices are swapped when multiplied together, whether vector-matrix multiplication is row or column based, etc.

So what is the point of all this whining?

Mostly, it's a cautionary warning. You may think your understanding of how matrices is wrong when you code something that doesn't work right, when the odds are pretty good the API you're using is simply doing things using matrix math conventions that are different from those you have experienced in the past.

A Common Library

I was writing some code a few years back that was capable of using either DirectX, OpenGL, or a custom software renderer. As to be expected, there were plenty of matrix operations that needed to be performed, but the core of the code knew nothing about which renderer was being used. Trying to use any one API's matrix routines would cause problems for the other APIs.

So I used every programmer's favorite fallback option: I reinvented the wheel.

Happy for the excuse to code up a brand new matrix stack class (and learn by doing), I created a set of routines that handled matrix transforms the way I learned when I was an undergrad, and again when doing graphics research in grad school, and again in every graphics book I've ever read: build a 4×4 transform matrix, then do a bog standard 4×4 matrix multiplication.

In the process, it occurred to me that transform matrices are mostly zeroes. Doing a full matrix multiply when most coefficients are zero seemed wasteful. So I sat down and worked out all of the common transforms long-hand, figuring out exactly how to transform the contents of the matrix without needing a full 4×4 matrix multiplication.

The end result was a roughly 50% reduction in time to compute all of the transforms. It also eliminated a fair bit of redundant code since it allowed me to refactor out lots of temporary variables along with all those separate calls to create matrices then multiply them, not to mention removing those moments of confusion over the order to compose all of those matrices with this library as opposed to what that library does.

Best of all, since DirectX and OpenGL use the sample physical ordering of coefficients, a single matrix library could perform all matrix operations and hand the same array of 16 floats to either API and get the same results.

And that custom software renderer? I still had to transpose every matrix before using it, but that's what happens when you write code according to the way people write matrix documentation as opposed to how those matrices are actually implemented.

Still, after looking around, I have never seen anyone do anything other than use the 4×4 matrix multiply approach to matrix transforms. This cannot be a new idea, but I've never seen it mentioned anywhere before. So here's an overview of what this code does...

Simplifying Matrix Math

I'm not going to work out all of the matrix simplification here. It would be much too cumbersome to format it in HTML. But do have a look at the library of code in this zip file: crossplat.zip. This is the same library for the Windows/MacOS compatibility articles I have been writing. The relevant file is QzMatrix4x4.h and cpp.

For a quick example, I'll show a simple translation operation. Let's assume that you start with a transform matrix filled with unknown coefficients, each represented by the letters a–p.

float matrix[16] =
{
    a, b, c, d,
    e, f, g, h,
    i, j, k, l,
    m, n, o, p
};

If you want to apply a translation by (x,y,z), you would construct this matrix.

float translate[16] =
{
    1, 0, 0, 0,
    0, 1, 0, 0,
    0, 0, 1, 0,
    x, y, z, 1
};

To apply the translation such that it occurs before any other transformation, you would perform the matrix multiplication translate × matrix (unless, of course, you've been taught to multiply matrices the other way around, in which case it would be matrix × translate).

The result of this 4×4 matrix multiplication would be the following matrix:

float f[16] =
{
    a, b, c, d,
    e, f, g, h,
    i, j, k, l,
    x*a + y*e + z*i + m,
    x*b + y*f + z*j + n,
    x*c + y*g + z*k + o,
    x*d + y*h + z*l + p
};

Note that only the bottom row of coefficients is changed. The other values are completely left alone. Similarly, all other variations of translation, scaling, and rotation can be simplified, leaving many coefficients unchanged.

By factoring the full 4×4 matrix multiplication out of the picture, all of the standard transformations will run much faster. The only expensive operation that remains is for rotations, which must still compute a sin and cos — indeed, these two transcendental functions combined can be much more expensive to execute than a full 4×4 matrix multiplication, depending upon the CPU type (see the results at the end of the article).

Many of the routines in my library are prefixed by Pre... or Post.... Pre-transforms are applied before any other operation, and post-transforms are applied afterwards. For example, if you needed to scale a model down 50%, then rotate it around the X-axis, and finally translate it to a specific (x,y,z) position, you could do it with the following sequence of operations:

    matrix.MakeScale(0.5f);
    matrix.PostRotateX(angle);
    matrix.PostTranslate(x, y, z);

On the other hand, you may be accustomed to building up a transform matrix in the reverse order, starting with the last transform and working down to to first transform:

    matrix.MakeTranslate(x, y, z);
    matrix.PreRotateX(angle);
    matrix.PreScale(0.5f);

I've found that explicitly naming the transforms as pre- or post- eliminates the confusion surrounding matrix composition. Mileage may vary. Your very notion of "pre" and "post" may be the exact opposite of mine.

The Code

What follows is the matrix implementation I've used over a number of projects. It produces matrices that are suitable for use with both DirectX and OpenGL. It was coded to mimic the functionality of the DirectX library routines, but has grown in a different direction since then.

In my code, I've defined the base 4×4 matrix as follows:

struct QzMatrix4x4_t
{
    union {
        float c4x4[4][4];
        float cFlat[16];
    };
};

This allows the coefficients to be accessed as either a linear array or as a 2D array. For C and C++, the 2D array is perfectly safe. However, if you needed to translate the code to some other language, you would would probably be forced to remove the 2D array and access all coefficients linearly, which makes the code less readable. A good optimizing C++ compiler will simplify this code for you, so manually translating the code should not be required.

For anyone not having access to original source file, QzMatrix4x4.cpp, here is a copy of the relevant code:

void QzMatrix4x4::Multiply(const QzMatrix4x4 &first, const QzMatrix4x4 &second)
{
    const float *m1 = first.m_Matrix.cFlat;
    const float *m2 = second.m_Matrix.cFlat;
          float *r  = m_Matrix.cFlat;

    for (U32 row = 0; row < 4; ++row) {
        for (U32 col = 0; col < 4; ++col) {
            r[(row<<2)+col] = m1[(row<<2)  ] * m2[col   ]
                            + m1[(row<<2)+1] * m2[col+ 4]
                            + m1[(row<<2)+2] * m2[col+ 8]
                            + m1[(row<<2)+3] * m2[col+12];
        }
    }
}

void QzMatrix4x4::PreMultiply(const QzMatrix4x4 &first)
{
    // Multiplication is destructive, so we need make a duplicate of the
    // original value of m_Matrix for use in computing the new m_Matrix.
    const QzMatrix4x4_t dup = m_Matrix;
    const float*        m1  = first.m_Matrix.cFlat;
    const float*        m2  = dup.cFlat;
          float*        r   = m_Matrix.cFlat;

    for (U32 row = 0; row < 4; ++row) {
        for (U32 col = 0; col < 4; ++col) {
            r[(row<<2)+col] = m1[(row<<2)  ] * m2[col   ]
                            + m1[(row<<2)+1] * m2[col+ 4]
                            + m1[(row<<2)+2] * m2[col+ 8]
                            + m1[(row<<2)+3] * m2[col+12];
        }
    }
}

void QzMatrix4x4::PostMultiply(const QzMatrix4x4 &second)
{
    // Multiplication is destructive, so we need make a duplicate of the
    // original value of m_Matrix for use in computing the new m_Matrix.
    const QzMatrix4x4_t dup = m_Matrix;
    const float*        m1  = dup.cFlat;
    const float*        m2  = second.m_Matrix.cFlat;
          float*        r   = m_Matrix.cFlat;

    for (U32 row = 0; row < 4; ++row) {
        for (U32 col = 0; col < 4; ++col) {
            r[(row<<2)+col] = m1[(row<<2)  ] * m2[col   ]
                            + m1[(row<<2)+1] * m2[col+ 4]
                            + m1[(row<<2)+2] * m2[col+ 8]
                            + m1[(row<<2)+3] * m2[col+12];
        }
    }
}

void QzMatrix4x4::PreRotateX(float angle)
{
    float c = cosf(angle);
    float s = sinf(angle);

    float *p = m_Matrix.cFlat;

    for (U32 i = 0; i < 4; ++i) {
        float t = p[4+i];
        float u = p[8+i];
        p[4+i]  = (t * c) + (u * s);
        p[8+i]  = (u * c) - (t * s);
    }
}

void QzMatrix4x4::PostRotateX(float angle)
{
    float c = cosf(angle);
    float s = sinf(angle);

    float *p = m_Matrix.cFlat;

    for (U32 i = 0; i < 4; ++i) {
        float t = p[(i<<2)+1];
        float u = p[(i<<2)+2];
        p[(i<<2)+1] = (t * c) - (u * s);
        p[(i<<2)+2] = (t * s) + (u * c);
    }
}

void QzMatrix4x4::PreRotateY(float angle)
{
    float c = cosf(angle);
    float s = sinf(angle);

    float *p = m_Matrix.cFlat;

    for (U32 i = 0; i < 4; ++i) {
        float t = p[  i];
        float u = p[8+i];
        p[  i]  = (t * c) - (u * s);
        p[8+i]  = (u * c) + (t * s);
    }
}

void QzMatrix4x4::PostRotateY(float angle)
{
    float c = cosf(angle);
    float s = sinf(angle);

    float *p = m_Matrix.cFlat;

    for (U32 i = 0; i < 4; ++i) {
        float t = p[(i<<2)  ];
        float u = p[(i<<2)+2];
        p[(i<<2)  ] = (u * s) + (t * c);
        p[(i<<2)+2] = (u * c) - (t * s);
    }
}

void QzMatrix4x4::PreRotateZ(float angle)
{
    float c = cosf(angle);
    float s = sinf(angle);

    float *p = m_Matrix.cFlat;

    for (U32 i = 0; i < 4; ++i) {
        float t = p[  i];
        float u = p[4+i];
        p[  i]  = (t * c) + (u * s);
        p[4+i]  = (u * c) - (t * s);
    }
}

void QzMatrix4x4::PostRotateZ(float angle)
{
    float c = cosf(angle);
    float s = sinf(angle);

    float *p = m_Matrix.cFlat;

    for (U32 i = 0; i < 4; ++i) {
        float t = p[(i<<2)  ];
        float u = p[(i<<2)+1];
        p[(i<<2)  ] = (t * c) - (u * s);
        p[(i<<2)+1] = (t * s) + (u * c);
    }
}

void QzMatrix4x4::PreScale(float x, float y, float z)
{
    for (U32 i = 0; i < 4; ++i) {
        m_Matrix.c4x4[0][i] *= x;
        m_Matrix.c4x4[1][i] *= y;
        m_Matrix.c4x4[2][i] *= z;
    }
}

void QzMatrix4x4::PreScale(float x)
{
    for (U32 i = 0; i < 4; ++i) {
        m_Matrix.c4x4[0][i] *= x;
        m_Matrix.c4x4[1][i] *= x;
        m_Matrix.c4x4[2][i] *= x;
    }
}

void QzMatrix4x4::PostScale(float x, float y, float z)
{
    for (U32 i = 0; i < 4; ++i) {
        m_Matrix.c4x4[i][0] *= x;
        m_Matrix.c4x4[i][1] *= y;
        m_Matrix.c4x4[i][2] *= z;
    }
}

void QzMatrix4x4::PostScale(float x)
{
    for (U32 i = 0; i < 4; ++i) {
        m_Matrix.c4x4[i][0] *= x;
        m_Matrix.c4x4[i][1] *= x;
        m_Matrix.c4x4[i][2] *= x;
    }
}

void QzMatrix4x4::PreTranslate(float x, float y, float z)
{
    for (U32 i = 0; i < 4; ++i) {
        m_Matrix.c4x4[3][i] += (x * m_Matrix.c4x4[0][i])
                            +  (y * m_Matrix.c4x4[1][i])
                            +  (z * m_Matrix.c4x4[2][i]);
    }
}

void QzMatrix4x4::PostTranslate(float x, float y, float z)
{
    for (U32 i = 0; i < 4; ++i) {
        m_Matrix.c4x4[i][0] += m_Matrix.c4x4[i][3] * x;
        m_Matrix.c4x4[i][1] += m_Matrix.c4x4[i][3] * y;
        m_Matrix.c4x4[i][2] += m_Matrix.c4x4[i][3] * z;
    }
}

The zip file, crossplat.zip, contains a QzUnitTest project. Running this project will verify that all of these transforms yield the same results as the traditional matrix multiplication approach.

Performance Results

The unit test in the zip file will also perform some benchmarking of these operations. The results are written out to trace.txt. I've benchmarked a few machines to give an idea how well these operations run.

Average Microseconds Per Call
Function MacBook
Core 2 Duo
2.4 GHz
Windows
Core 2 6600
2.4 GHz
Windows
Core 2 Duo T5470
1.6 GHz
Windows
P4
3.6 GHz
Windows
P4
1.6 GHz
sine + cosine 0.044 0.073 0.100 0.101 0.250
MakeIdentity 0.017 0.015 0.020 0.025 0.048
MakeRotateX 0.054 0.083 0.109 0.133 0.315
MakeRotateY 0.055 0.082 0.109 0.132 0.308
MakeRotateZ 0.054 0.082 0.109 0.131 0.310
MakeArbitraryAxisRotation 0.060 0.084 0.111 0.123 0.300
MakeScale 0.020 0.015 0.021 0.025 0.048
MakeTranslate 0.019 0.016 0.021 0.025 0.048
Multiply 0.063 0.057 0.079 0.062 0.108
PostMultiply 0.070 0.072 0.097 0.086 0.158
PreMultiply 0.072 0.072 0.096 0.085 0.154
PostRotateX 0.056 0.091 0.122 0.108 0.252
PreRotateX 0.056 0.091 0.122 0.104 0.258
PostRotateY 0.056 0.090 0.122 0.104 0.253
PreRotateY 0.056 0.091 0.121 0.108 0.249
PostRotateZ 0.056 0.091 0.121 0.108 0.247
PreRotateZ 0.056 0.091 0.124 0.104 0.247
PostScale 0.012 0.013 0.018 0.014 0.029
PreScale 0.012 0.012 0.016 0.014 0.028
PostTranslate 0.017 0.017 0.023 0.019 0.034
PreTranslate 0.015 0.014 0.019 0.017 0.031

On Windows, these numbers were taken from code compiled with the 2003 Professional version of DevStudio, which has an optimizing compiler. The Standard and Express versions of DevStudio seldom (never?) have optimizing compilers, so tightly written C code like matrix multiplication can easily take several times as many CPU cycles to execute as code generated by an optimizing compiler.

As an example of how much difference this makes, consider the Multiply function. On a version of DevStudio with an optimizing compiler, a release build takes 0.057 microseconds, or 0.299 microseconds with a debug build. Compiling the same code on the Standard 2003 version of DevStudio, this function takes 0.212 microseconds, almost four times as long as code generated by an optimizing compiler. Obviously, if you care about performance on Windows, you will have to spring the several hundred dollars required for a Professional version of DevStudio.

So what do the numbers in the previous table indicate?

Going by the numbers from the Core 2 6600, translating an object would require a call to MakeTranslate to create the 4×4 translation matrix, then a call to MatrixMultiply, for a total of 0.057 + 0.016 = 0.073 μs. Directly applying the transform via PostTranslate would only take 0.017 μs, or about a 4× difference in processing time.

Creating a rotation matrix would require 0.083 μs, or 0.140 μs counting the matrix multiply, versus the 0.091 μs required to apply the rotation directly with PostRotateX. Here the difference is not as pronounced, since there is a 0.073 μs overhead required for the sinf and cosf function calls needed to compute the rotation coefficients. Still, most of the sampled processors exhibit a 30% to 50% reduction in CPU time for rotations. And when running with optimized sine functions, the performance improvement is even larger.

In some cases, the pre- and post- transforms require copying the contents of the matrix to avoid order dependence problems with computing new coefficients. It may only involve copying 16 floats, but this will still add a few nanoseconds to the transform time.

How well do these simplifications perform on embedded processors? That heavily depends on CPU speed versus memory speed. When the CPU is significantly faster than memory, a tight matrix multiply loop gains a preformance improvement since it requires a smaller instruction cache footprint — the simplified transforms should still be faster, but the difference may not be very dramatic. On the other hand, when CPU speed is closer to memory speed (often the case with embedded devices), there is a greater performance improvement when the total number of CPU instructions is reduced, so the simplifications in this article provide a greater benefit — even though the memory footprint of the unrolled instructions is larger than a tight loop.

With today's PC hardware, the speed of matrix operations is not likely to be a cost factor (unless you're doing something like skinned animations in software), but the weak embedded processors used in handhelds can still benefit from every clock cycle that is saved by simplifying the matrix transforms.