I'm changing the displacement map format.
The current format is a histogram normalized u8, generated from the source maps which are u16.
When zooming in there simply isn't enough precision, and we start to see stair-stepping from the u8 quantization.
So the format I am switching to is roughly based on BC4, but as I'm decoding on the CPU I decided to simplify it.
BC4 stores per 4x4 pixels a min and max value, and a 3 bit interpolation weight per pixel. The total size is 64 bits per 4x4 block, half of u8.
First I removed the 2nd mode, where the order of min/max is swapped; this simplifies the decoder.
Second I changed it so that rather than storing a min and max value per 4x4, we store a min and ratio between min and 255.
This changes the precision range from 3-11 to 3-19 bits and is only possible because I removed the 2nd mode.
Third I changed the way the indices are stored, so that 0-7 maps directly to the lerp ratio.
I have no idea why BC4 stores its indices in such a seemingly random order, but it makes decoding on the CPU much harder.
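To make the three changes concrete, here is a minimal scalar sketch of what a DBlock decode could look like. The exact bit packing (min in the low byte, ratio next, then 16 row-major 3-bit indices) and the rounding are my assumptions, not the actual implementation; `dblock_decode_pixel` is a hypothetical name.

```c
#include <stdint.h>

/* Assumed DBlock layout (64 bits per 4x4 block):
   bits  0..7  : min   (u8)
   bits  8..15 : ratio (u8); max is reconstructed as
                 max = min + (255 - min) * ratio / 255
   bits 16..63 : 16 x 3-bit indices, row-major, where the
                 index 0..7 maps directly onto the lerp weight. */
static uint8_t dblock_decode_pixel(uint64_t block, int x, int y)
{
    uint32_t min_v = (uint32_t)(block & 0xFF);
    uint32_t ratio = (uint32_t)((block >> 8) & 0xFF);
    /* reconstruct max from min and the ratio against the remaining range */
    uint32_t max_v = min_v + ((255u - min_v) * ratio + 127u) / 255u;
    /* 3-bit index; unlike BC4, 0..7 maps linearly onto the lerp weight */
    uint32_t idx = (uint32_t)((block >> (16 + 3 * (y * 4 + x))) & 0x7u);
    return (uint8_t)(min_v + ((max_v - min_v) * idx + 3u) / 7u);
}
```

Note how dropping BC4's second mode and its index permutation leaves the per-pixel path as one shift, one mask, and one lerp.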
I'll refer to my format as DBlock.
I wrote a fail-fast test comparing the DBlock and U8 encodings. One immediate downside of DBlock is that the decoder requires more instructions than U8. On the other hand, the total memory is half, so fewer cache misses. Also, the blocks are in 4x4 format, which is essentially a partial Morton order, so again better cache access patterns.
Visually DBlock wins in virtually every case despite being half the memory. This is because the vast majority of 4x4 blocks are better captured by a local gradient encoding. In theory for blocks that contain large differences between min/max DBlock should look worse, but I am not easily able to spot the difference.
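For reference, a per-block fit along these lines can be sketched as below. This is my own guess at the encoder, not the actual one: take the block's min and max, quantize max as a ratio of the headroom above min, then snap each pixel to the nearest of the 8 lerp levels. `dblock_encode` and the rounding choices are assumptions.

```c
#include <stdint.h>

/* Hypothetical encoder for the assumed layout:
   min in bits 0..7, ratio in bits 8..15, 16 x 3-bit indices above. */
static uint64_t dblock_encode(const uint8_t px[16])
{
    uint32_t lo = 255, hi = 0;
    for (int i = 0; i < 16; i++) {
        if (px[i] < lo) lo = px[i];
        if (px[i] > hi) hi = px[i];
    }
    /* ratio quantizes the span (hi - lo) against the headroom (255 - lo) */
    uint32_t ratio = (lo == 255) ? 0
        : ((hi - lo) * 255u + (255u - lo) / 2u) / (255u - lo);
    /* reconstruct max exactly the way the decoder would */
    uint32_t max_v = lo + ((255u - lo) * ratio + 127u) / 255u;
    uint64_t block = (uint64_t)lo | ((uint64_t)ratio << 8);
    for (int i = 0; i < 16; i++) {
        uint32_t idx = 0;
        if (max_v > lo)
            idx = ((uint32_t)(px[i] - lo) * 7u + (max_v - lo) / 2u) / (max_v - lo);
        if (idx > 7) idx = 7;
        block |= (uint64_t)idx << (16 + 3 * i);
    }
    return block;
}
```

The intuition for why this wins visually: within a 4x4 block the 8 levels only have to cover the local span hi - lo, which for smooth displacement data is far smaller than the full 0..255 range.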
The only real downside is the complexity of the decoder; in particular, with U8 I was taking advantage of the fact that I could load multiple pixels with a single gather. This is far more difficult with DBlock, since each "pixel" is now 4x4 pixels and 8 bytes in size.
Also, this is all SIMD based; I don't have scalar decoders. Thus branching on whether the X coordinate crosses from one block to another doesn't work well: it is very likely that at least one of the lanes will cross blocks.
The initial decoder simply does many more gathers than the U8 one. For a 2x2 bilinear sample the U8 decoder required 2 gathers, with an extra 2 only when the X coordinate wrapped around (repeat sampling).
The initial DBlock decoder for bilinear requires 8 gathers (in the best case we theoretically only need 1 or 2, but the worst case is 8), although in many cases they are gathering from the same block, so it should be possible to eliminate some of these once I write a more optimized decoder.
One interesting thing is that for quadratic sampling (3x3), DBlock's number of gathers doesn't change from bilinear, and we can get everything we need with 8.
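The gather counts above fall out of how many distinct blocks a sample footprint can straddle. A small sketch of the arithmetic (assuming non-negative texel coordinates; the helper name is mine):

```c
/* Number of distinct 4x4 blocks an n-texel-wide footprint starting at
   coordinate c touches along one axis: ceil(((c mod 4) + n) / 4). */
static int blocks_touched_1d(int c, int n)
{
    return ((c % 4) + n + 3) / 4;
}
```

Both a 2-wide (bilinear) and a 3-wide (quadratic) footprint touch at most 2 blocks per axis, so at most 2x2 = 4 blocks total; with 32-bit gathers each 64-bit block costs two gathers, which is where the worst case of 8 comes from, and why going from 2x2 to 3x3 doesn't add any.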
One perf improvement I quickly added was switching the gathers from 32-bit to 64-bit, as our blocks are 64 bits and 64-bit gathers do run faster.
Another option would be to drop the gathers and use 128-bit loads per lane, then shuffle everything around. I'll probably test this at some point, but it is a large number of instructions, so it may not be a win despite dropping the gathers.
*Update: I wrote the 128-bit load/swizzle based version and it does outperform the gather based implementation, and is now the default version. This is on Zen 2, where gather is particularly slow and decodes to an obscene number of µops.
Future: At some point I'd like to port the volumes to also using this format (Edit: this has been done although it is a slightly different format since it targets 3D blocks)