In researching my recent blog entry on raytracing, I found the sweet elegance that I always look for in computer architectures. The algorithms are pretty straightforward, albeit pretty compute intensive, so the barrier to entry into this area seems low enough that I can work on it when I get a chance. It also looks to benefit immensely from parallel processing, another interest of mine, and will get me into that area as well. That, and the demos I saw showed some wicked shadow effects that really added to the realism of scenes, so it'll be cool to show off as a CDT demo as well (as opposed to the spinning polygons I use as an SDL/OpenGL demo right now that you may have seen at ESC).
My first step was to build a vector class that does math with 3D vectors, a critical component of all graphics programming. The sample I was looking at used regular C++ floating point arithmetic with a vector composed of a float for each of the three axes.
class vector {
public:
    vector(float _x, float _y, float _z)
        : x(_x), y(_y), z(_z) { }
    void operator +=(const vector & v) {
        x += v.x; y += v.y; z += v.z;
    }
private:
    float x, y, z;
};
Pretty basic. But this is the first example of an algorithm that can benefit from parallelism. Since I have a fairly new laptop, I wondered if I could leverage SSE (Streaming SIMD Extensions) to implement this. I also wondered how well gcc, and the MinGW variant I'm using, handles SSE. So I gave it a try.
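#include <xmmintrin.h> // __m128 and _mm_load_ps
class vector {
public:
    vector(float _x, float _y, float _z) {
        // _mm_load_ps requires a 16-byte aligned source
        float array[4] __attribute__((aligned(16)))
            = { _x, _y, _z, 1 };
        xyz = _mm_load_ps(array);
    }
    void operator +=(const vector & v) {
        // gcc's vector extensions allow arithmetic operators on __m128
        xyz += v.xyz;
    }
private:
    __m128 xyz;
};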
The constructor is a bit more complicated. As with most things dealing with SSE, 16-byte alignment is critical for good performance. Looking at the generated assembly, I was pleased to see that gcc, once I made sure to put the -msse2 option on the compile, worked hard at keeping the instances of __m128 aligned like that. The performance tests I ran with addition showed an OK improvement in performance, especially as the number of math operations grew. But when I tried multiplication instead of addition, the performance gains were astronomical. Well worth the extra typing.
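For anyone who wants to try the multiplication case, here is a minimal sketch of the same idea written with the explicit intrinsics from xmmintrin.h instead of gcc's operator syntax. The class name vec4 and the get() accessor are made up for illustration and are not part of the code above. _mm_set_ps also avoids the temporary aligned array in the constructor.

```cpp
#include <cassert>      // assert, used for the bounds check in get()
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps, _mm_add_ps, _mm_mul_ps, _mm_store_ps

// Hypothetical name, chosen so it does not clash with the `vector` class in the post.
class vec4 {
public:
    // _mm_set_ps takes its arguments from the highest lane down to the lowest,
    // so x lands in lane 0. The unused fourth lane is set to 1, as in the post.
    vec4(float x, float y, float z)
        : xyz(_mm_set_ps(1.0f, z, y, x)) { }

    // Lane-wise addition of all four floats.
    void operator+=(const vec4& v) { xyz = _mm_add_ps(xyz, v.xyz); }

    // Lane-wise multiplication of all four floats.
    void operator*=(const vec4& v) { xyz = _mm_mul_ps(xyz, v.xyz); }

    // Illustrative accessor: copy the lanes to an aligned array and read one.
    // This is slow and only meant for inspecting results.
    float get(int i) const {
        assert(i >= 0 && i < 4);
        alignas(16) float out[4];
        _mm_store_ps(out, xyz);  // _mm_store_ps needs a 16-byte aligned target
        return out[i];
    }

private:
    __m128 xyz;
};
```

With this sketch, multiplying vec4(1, 2, 3) by vec4(4, 5, 6) in place leaves 4, 10 and 18 in the first three lanes. It assumes a compiler with C++11 support (for alignas) and SSE enabled, for example via -msse or -msse2 on 32-bit gcc.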
Now that I've got that under my belt, I can't wait to actually draw something...
BTW, I guess I didn't really explain, but the __m128 contains four single precision floats and math operations occur on all four at the same time. That's Single Instruction, Multiple Data (SIMD): in this case, one addition instruction operating on four pairs of floats.
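To make the "four pairs at once" point concrete, here is a small sketch. The helper name add4 is made up for illustration. It uses the unaligned load and store intrinsics so the caller's arrays need no special alignment, at some cost in speed on older CPUs.

```cpp
#include <cassert>      // assert, used to check the pointer arguments
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3.
// Each pointer must refer to at least four floats.
void add4(const float* a, const float* b, float* out) {
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);             // load four floats from a
    __m128 vb = _mm_loadu_ps(b);             // load four floats from b
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // a single packed add covers all four lanes
}
```

Calling add4 on {1, 2, 3, 4} and {10, 20, 30, 40} writes {11, 22, 33, 44} to the output array.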
Hi Doug,
After being "inspired" to try out SSE instructions myself, I made a small vector class (not very different from your example, but with a few more functions). To test the performance I made a small for-loop where I did some vector multiplication, calculated the dot product and normalized the vector.
I was a bit annoyed when I saw that my pure C++ implementation was actually faster than the SSE one. It was not until I increased the number of iterations to 1 billion that the SSE class was faster, but at that number it was about 4 times faster.
So please do tell more about your findings. (I have never used the SSE instruction-set before so I could have done some big no-no's for all I know :))