5 hours ago, matt77hias said:
Instead of changing the CPU code, you can also change the used conventions in your shader code. For example: HLSL has both a row_major and column_major type modifier.
Be careful with that if you are loading matrices from buffers such as StructuredBuffer. In that case the matrix is loaded as four separate float4 vectors, and with row_major this can result in higher register usage when multiplying (at least that's what I've seen in the HLSL asm). I'm not sure about the performance implications of row_major vs. column_major when operating on constant buffers.
Example with a matrix loaded from a structured buffer and multiplied with a float4:
float3 pos = mul(float4(inPos, 1), matrixBuffer).xyz;
This results in the following HLSL assembly when the buffer stores a column_major float4x4:
0x00000494: ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r6.xyzw, r2.w, l(0), t2.xyzw
0x000004C0: ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r7.xyzw, r2.w, l(16), t2.xyzw
0x000004EC: ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r8.xyzw, r2.w, l(32), t2.xyzw
0x00000518: dp4 r6.x, r3.xyzw, r6.xyzw
0x00000534: dp4 r6.y, r3.xyzw, r7.xyzw
0x00000550: dp4 r6.z, r3.xyzw, r8.xyzw
With row_major float4x4:
0x00000494: ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r6.xyz, r2.w, l(0), t2.xyzx
0x000004C0: ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r7.xyz, r2.w, l(16), t2.xyzx
0x000004EC: ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r8.xyz, r2.w, l(32), t2.xyzx
0x00000518: ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r9.xyw, r2.w, l(48), t2.xyxz
0x00000544: mov r10.x, r6.x
0x00000558: mov r10.y, r7.x
0x0000056C: mov r10.z, r8.x
0x00000580: mov r10.w, r9.x
0x00000594: dp4 r10.x, r3.xyzw, r10.xyzw
0x000005B0: mov r11.x, r6.y
0x000005C4: mov r11.y, r7.y
0x000005D8: mov r11.z, r8.y
0x000005EC: mov r11.w, r9.y
0x00000600: dp4 r10.y, r3.xyzw, r11.xyzw
0x0000061C: mov r9.x, r6.z
0x00000630: mov r9.y, r7.z
0x00000644: mov r9.z, r8.z
0x00000658: dp4 r10.z, r3.xyzw, r9.xyzw
Fun fact: the row_major version even has to load the last row of the matrix, while the column_major version can get away with not loading the last column (I think because I'm not interested in the .w component?).
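For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two cases could be declared. The names Bone, transform, boneBuffer and SkinPosition are made up for illustration and are not from my actual shader; the t2 register and the 64-byte stride are only chosen to match the listings above.

```hlsl
// Hypothetical element type: one float4x4 gives the 64-byte stride seen in the asm.
struct Bone
{
    // Change this to 'row_major' to get the second (longer) listing.
    column_major float4x4 transform;
};

// Assumed binding slot, picked to match t2 in the listings.
StructuredBuffer<Bone> boneBuffer : register(t2);

// Hypothetical helper: transforms a position by one matrix from the buffer.
float3 SkinPosition(float3 inPos, uint boneIndex)
{
    float4x4 m = boneBuffer[boneIndex].transform;

    // Row vector times matrix: each result component is a dot product
    // of the vector with one column of the matrix.
    return mul(float4(inPos, 1), m).xyz;
}
```

This would also explain the load counts: with mul(vector, matrix) each output component only needs one column, so with column_major storage three 16-byte loads cover .xyz, while with row_major storage every row contributes to every output component and all four rows have to be fetched. Keep in mind that switching the modifier changes how the same bytes are interpreted, so the data uploaded from the CPU has to be transposed (or the mul argument order swapped) to get the same result.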