Archive for July, 2007

2D Software Renderer

Sunday, July 1st, 2007

A few months ago, we made a simple little tycoon-style game, with the goal of selling it from our own website. It was more of an experiment and learning experience than a serious attempt to break into the downloadable games industry.

Since we wanted to maximize our potential market, and thereby had to keep the system requirements down to a minimum, we chose to do the game in 2D, rather than 3D (and the game being a tycoon game made this decision easier as well).

And because of this I found myself, for the first time in a long while, in the position of having to write a 2D graphics engine. It was a nice change from all the 3D I’ve been doing in recent years, and I was quite pleased with the end result.

Requirements and objectives

For the type of game that we were developing, we had a need for drawing maybe a hundred furniture and about as many characters (at a maximum). We wanted 20-30 different types of characters, each with 20-30 animations. We were using Poser to do all the art, and when rendering in Poser an alpha channel is generated for each image, giving it nice, smooth, antialiased edges. We felt that it would be nice to use this alpha channel, as it would make everything look much more nice and smooth. The game would have quite a lot of text on some of its screens, so we wanted it to run at a resolution of 800×600, to make it more readable.

Hardware or software blitting?

Right from the start I discarded the idea to use hardware accelerated rendering (by which I mean storing the graphics in video memory and use the DirectDraw blit-methods to let the graphics card take care of the drawing). This could probably have given me more of raw blitting power, but there’s a number of disadvantages with that technique.

First, alpha blending doesn’t work, so for this I would have to roll my own anyway.

Second, there’s a big risk of compatibility problems. On more than one occasion, the graphics drivers for modern 3D cards have had bugs in them, causing the hardware accelerated blitting in 2D mode to be so slow that it’s practically unusable. We wanted to definitely avoid a situation where we had to tell out customers that they had to download the previous version of the drivers to be able to play the game. Also, some cards don’t allow the use of bitmaps wider than the screen resolution, while some do.

Bearing in mind that we wanted the game to work even on rather old computers, it was quite likely that the video memory wouldn’t be enough to hold all the graphics (just the animations would be anywhere between 4000 and 18000 images!). So if we where to use hardware blitting we would need a system for copying bitmaps from system memory to video memory, and that system would need to take into account the amount of available video memory on the customers video card. Way too much compexity for my taste :-)

So, the decision to do all blitting in system-memory, using custom algorithms, and then copy the result to the video card, was a decision I would not regret.

Memory requirements and quality

Due to the large amount of graphics (in regards to the target platform), and the fact that the game where to be sold and delivered through the internet (bigger size = more bandwidth = higher cost, as well as being less appealing to customers with modems), it was necessary to keep the actual size of the graphics down as much as possible. And not only did it have to take little space in the installation package, but it also hade to take up little space in the memory when running the game, since we couldn’t count on the target machine having lots of memory.

Using 32 bit images and keeping everything uncompressed was not going to be feasible. I had on an earlier game (Mr. FroOonk – The Crystal Collector ) used the technique of having a 16-bit screen mode with up to 256 unique colors per sprite (8-bit sprites with individual palettes). Using this technique, each sprite will be much smaller, but since it has got its own palette with up to 256 colors, the quality is still quite good, especially if the sprites aren’t huge ones with lots of different colors in lots of different shades.

So I did some quick tests, converting some of the graphics to 8 bits, and since the quality was quite acceptable, I decided to go with that method for this project as well. This meant that the memory usage was instantly cut in half, by using 16 bits (8 bits color + 8 bits alpha) rather than 32 bits.

RLE compression

I considered a few different compression techniques, but the target platform wasn’t fast enough for anything fancy – I wouldn’t want to use all the CPU power to decompress graphics…

So basically, what I was left with was RLE-compression. This is of course a very well-known and proven technique, but for those of you who are not familiar with it, it basically means that instead of storing a color value for every pixel in a bitmap, you store the color value and the number of contiguous pixels of that color. This means that images with large areas of the same color compress very well, while an image where every pixel is different from its neighbours won’t compress anything at all.

In our specific case, we had very few images that filled their entire bitmap. Most things were stuff like furniture or characters, and those have lots of empty space surrounding them. Also, we had downsampled the images to only 256 colors, and thereby increased the likelihood of adjacent pixels having the same value, compared to how it would be if we had kept the 24 bits per pixel.

For the alpha channel the situation was similar, most images had only masks for furniture and people, with most pixels being either entirely transparent or totally opaque, with only a few pixels around the edges being semi-transparent.

So, RLE-compression had great potential to accomplish a quite decent compression rate, so now it was more a question of performance.

Rendering & Performance

There’s a few interesting things to note when it comes to the rendering. First of all, it’s important to note that the speed of the rendering have more to do with memory access speed than raw CPU performance. Most of the time is simply spent retrieving a color value from one memory address and outputting it to another. And when it comes to alpha-blending, it’s even worse, when you’re reading color values from TWO memory positions, use some CPU-power to blend them together, and then write them out to the same position as you just read one of them (which may well lead to a stall). Anyone who’s ever run some alpha-blending code through a profiler will know just how much slower alpha-blending is than normal blitting.

To sum things up: for each pixel to be drawn, there’s three posibilities: the pixel is opaque (and needs to overwrite the destination completely), the pixel is transparent (and should not be written at all) or semi-transparent to some degree (needs to be alpha-blended).

RLE-compression has a big advantage when it comes to handling opaque and transparent pixels. For solid pixels, it’s simply a matter off writing to the destination the number of pixels specified of the given color. For fully transparent pixels, we just skip ahead the number of pixels specified, which means that with RLE compression, fully transparent pixels are basically free, as no processing and no memory access is being done for them.

Optimizations

After some thinking, I concluded that there is 3 types of images:

  1. Images where all pixels are solid
  2. Images with only fully opaque or fully transparent pixels
  3. Images that also contain semi-transparent pixels

Rendering images of type 1 is very straightforward:

Read the value for how many pixels there are in the next Run-Length (N)
Read the palette index value for the next Run-Length
Look up the RGB color value of the palette index (C)
Fill (N) pixels ahead with the color (C)
Process the next Run-length

(If you’re familiar with RLE-compression, you’ll know that obviously not all pixels are stored as a run-length; if there is a long run of pixels which all are different from their neighbours, they will be stored in a special run-length, so we don’t have to store a run-length of 1 for each and every one of those pixesl. I kept that out of the description above, as it doesn’t have a big impact, but wanted to point it out anyway).

It get’s a little bit more involved when rendering images of type 2:

Read the value for how many pixels there are in the next Run-Length (N)
Read the palette index value for the next Run-Length
If the palette index is 0 (0 means this pixel run  is fully transparent):
   Skip the next (N) pixels
Else
   Look up the RGB color value of the palette index (C)
   Fill (N) pixels ahead with the color (C)
Process the next Run-length

And finally type 3:

Read the value for how many pixels there are in the next Run-Length (N)
Read the palette index and the alpha value value for the next Run-Length
If the alpha value is 0 (0 means this pixel run is fully transparent):
   Skip the next (N) pixels
Else if the alpha value is 255 (255 means this pixel run is fully opaque)
   Look up the RGB color value of the palette index (C)
   Fill (N) pixels ahead with the color (C)
Else
   Look up the RGB color value of the palette index (C)
   Read the RGB color value of the destination pixel (D)
   Blend between (C) and (D) using the alpha value, to get the final RGB color value (B)
   Fill (N) pixels ahead with the color (B)
Process the next Run-length

It might be tempting to write the type 3 code such that it could handle all three cases (which would be trivial to do), but it’s not a good idea from a performance point of view.

Rendering will most likely be the bottle neck of the code, so anything that can be done to simplify the inner loop will definitely be worth it.

Divide and Conquer

So I decided to divide my bitmaps into two separate parts: all solid pixels ended up in one RLE-compressed part, and all semi-transparent pixels in another one. I also stored a flag for whether the solid part contained any fully transparent pixels or not.

This division allowed me to draw the solid part using method one as described above, if the flag was set to false, or using method two if it was set to true. The difference is that rather than having a condition which is executed for each pixel, I do the same check once for the entire bitmap. This means that bitmaps which doesn’t contain a single fully transparent pixel will render as fast as possible, and only bitmaps which actually contains some fully transparent pixels will suffer the performance loss which comes form the extra condition executed per pixel to check for transparency.

All semi-transparent pixels are separated into their own RLE-compressed section. In our case, this means that most of the pixels in this section is fully transparent (as this will be true for the pixels that was fully transparent in the original bitmap, and also for l he pixels which was opaque in the original bitmap, as they are now included in the other RLE-compressed part).

Dividing the bitmaps in two sections like this means that the first part doesn’t need an associated alpha channel, and can be rendered at the highest speed. The second part does have an alpha channel, but it wil compress really well using the RLE-compression, as the large chunks of pixels (the ones between the few semi-transparent edge pixels this section contains) will have a color value of 0 and an alpha value of 0 (as all fully opaque pixels are stored in the first part).

Low-level Optimizations

The system I’ve described here had good performance, really good performance actually… But after analyzing it with a profiler (AMD CodeAnalyst, free ) I found a significant bottle-neck in the code which writes pixels to the 16-bit screen buffer. Writing to one half of a 32-bit value, directly followed by a write to the other half, seemed to cause some kind of stall. So I changed the code so that I combine two pixels into one 32-bit variable, and that got me a significant speedup. A quick look at the generated assembler code showed me that the compiler now was able to use a register to combine the two pixels, instead of writing to the same 4-byte group twice in a row.

I choose to not continue optimizing the code much beyond this point. The system was fast enough for my needs even without resorting to inline assembler, and by not doing that I ended up with a renderer which will be extremely simple to port to other patforms. The only thing needed is a function which copies the data from my 16-bit back buffer to the screen buffer of the video card.

Conclusion

I am very happy with the final renderer. The bitmaps are small enough; my RLE-compressed, palettized bitmaps are about the same file size as the orgininal PNG’s, and when I compress them in an installation package they shrink by another 50%.

The loading of bitmaps is very quick, as I’m saving them in a format which allows me to pretty much just load them straight into memory, without having to unpack them or do any conversions.

Rendering is very fast, faster than I would have expected. With the game engine I was using at the time, it was very easy to swap renderers, and when I swapped this 2D-renderer for a DirectX9 hardware accekerared renderer, things ran significantly slower!

(This was mostly because the directx-renderer wasn’t optimized for this type of rendering, and there was a lot of state changes and texture uploads going on. I would still have expected that even this naive implementation using hardware acceleration would have beaten the 100% CPU-based version.)

I will use this renderer for a long time, for all the Colossus games. I used three days to write and optimize the renderer, and I think it’s safe to say it was a very good investment :-)

The source code for the renderer is part of my Pixie engine, so check that out if you’re interested in a more detailed study. It’s the file RLEBitmap.cpp you want to look at.