I rewrote the main renderer to use a visibility buffer. If you are unfamiliar with the term, it is a renderer whose first pass has a very simple pixel shader that writes out only the draw call ID and triangle index.
Given the high triangle count, I figured the visibility buffer would perform better, since it cancels out much of the 2x2 quad overshading that arises with tiny triangles, and this turned out to be correct.
The game now runs significantly faster, and is also much more portable, as I don't have to rely on driver extensions to get access to barycentrics.
My visibility buffer is 32 bits wide: 14 bits for the triangle ID and 18 for the draw call ID. This did require limiting triangle counts to what fits in 14 bits (16,384 triangles), which was not an existing requirement, but it was simple enough to add.
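
To make the 14/18 split concrete, here is a minimal sketch of the packing, written in C++ for illustration. The bit order (draw call ID in the upper bits, triangle ID in the lower bits) and every name in it are assumptions, not the actual shader code, which may lay the bits out differently.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: [ 18 bits draw call ID | 14 bits triangle ID ].
// The split sizes come from the post; the ordering is a guess.
constexpr uint32_t kTriangleBits = 14;
constexpr uint32_t kDrawCallBits = 18;
constexpr uint32_t kMaxTriangles = 1u << kTriangleBits;  // 16,384
constexpr uint32_t kMaxDrawCalls = 1u << kDrawCallBits;  // 262,144
constexpr uint32_t kTriangleMask = kMaxTriangles - 1;

static_assert(kTriangleBits + kDrawCallBits == 32,
              "both IDs must fill exactly one 32-bit texel");

// Hypothetical helper type for the decoded value.
struct VisibilitySample {
    uint32_t drawCallId;
    uint32_t triangleId;
};

// Pack both IDs into one 32-bit value, as the first pass would write it.
inline uint32_t packVisibility(uint32_t drawCallId, uint32_t triangleId) {
    assert(drawCallId < kMaxDrawCalls);  // must fit in 18 bits
    assert(triangleId < kMaxTriangles);  // must fit in 14 bits
    return (drawCallId << kTriangleBits) | triangleId;
}

// Recover both IDs, as a later shading pass would after reading the buffer.
inline VisibilitySample unpackVisibility(uint32_t packed) {
    return VisibilitySample{packed >> kTriangleBits, packed & kTriangleMask};
}
```

With this layout, `unpackVisibility(packVisibility(12345, 6789))` gives back draw call 12345 and triangle 6789. The range checks in `packVisibility` are where the new 14-bit triangle limit shows up: anything above 16,383 would spill into the draw call bits.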