DirectXMath matrix query

Started by
5 comments, last by turanszkij 6 years, 2 months ago

Hi Guys,

I have been writing my own implementation of the XMMatrix* functions in assembly and the results are amazing, being roughly 500% faster than the MS implementation.

It got me thinking, why do we need to 'transpose' the matrix results as an additional call?

I understand why it has to be done, being the adjustment from Row major to Column major etc. 

But wouldn't it make sense for the library to do this for us behind the scenes, making things even more efficient? Or am I missing something?

Why wouldn't the MS libraries already take transposition into account?

Love to hear your thoughts on this. :D 

1 hour ago, lonewolff said:

It got me thinking, why do we need to 'transpose' the matrix results as an additional call?

I understand why it has to be done, being the adjustment from Row major to Column major etc. 

But wouldn't it make sense for the library to do this for us behind the scenes, making things even more efficient. Or am I missing something?

 

You don't if you use the same conventions everywhere. The MS samples for some strange, unknown reason choose to use row-vector mathematics & row-major array indexing in their CPU code, and then choose to use row-vector mathematics & column-major array indexing in their HLSL code.

Why? God knows... Legacy reasons??

If you just use the same conventions everywhere -- both linear algebra convention (left-to-right matrix multiplication vs right-to-left matrix multiplication) and computer-science array indexing convention (row-major or column-major schemes), then there's no reason to shuffle data around.

Alternatively, as a funny quirk (not something that I recommend): if you swap both your comp-sci and your mathematical conventions at the same time, there's no reason to transpose. E.g. if MS used column-vector math (aka backwards multiplication order) in their HLSL code instead of row-vector math, while keeping column-major array indexing, then their row-vector/row-major CPU matrices would work without modification.

This is because swapping one of the conventions is basically equivalent to doing a transpose (which is why the MS examples need a transpose -- to cancel out the convention change), but doing two convention changes is doing two transposes, which cancels out :D
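To make the "one swap is a transpose" point concrete, here is a minimal C++ sketch. It deliberately avoids DirectXMath; Vec4, Mat4 and all function names are hypothetical helpers made up for illustration. It checks that a row vector times M gives the same result as the transpose of M times a column vector.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;  // m[row][col], math indexing only

// Row-vector convention: result = v * M (vector on the left).
Vec4 mul_row_vector(const Vec4& v, const Mat4& m) {
    Vec4 out{};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            out[c] += v[r] * m[r][c];
    return out;
}

// Column-vector convention: result = M * v (vector on the right).
Vec4 mul_column_vector(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r] += m[r][c] * v[c];
    return out;
}

Mat4 transpose(const Mat4& m) {
    Mat4 t{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            t[c][r] = m[r][c];
    return t;
}

// Translation written for row vectors: the offset sits in the bottom row.
Mat4 row_vector_translation(float px, float py, float pz) {
    Mat4 m{};
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0f;
    m[3][0] = px;
    m[3][1] = py;
    m[3][2] = pz;
    return m;
}

// True if switching the math convention is undone by one transpose.
bool conventions_agree() {
    const Mat4 m = row_vector_translation(5, 6, 7);
    const Vec4 p = {1, 2, 3, 1};
    return mul_row_vector(p, m) == mul_column_vector(transpose(m), p);
}

void run_checks() {
    assert(conventions_agree());
}
```

The values are small integers, so the exact float comparison is safe here; real code would compare with a tolerance.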


Linear algebra:
Row vectors -- Column Vectors
Xx Xy Xz 0  --  Xx Yx Zx Px
Yx Yy Yz 0  --  Xy Yy Zy Py
Zx Zy Zz 0  --  Xz Yz Zz Pz
Px Py Pz 1  --  0  0  0  1

Computer science:
Row major   -- Column Major
 0  1  2  3 --  0  4  8 12
 4  5  6  7 --  1  5  9 13
 8  9 10 11 --  2  6 10 14
12 13 14 15 --  3  7 11 15

Converting the first two matrices into 1D arrays using the second two schemes gives 4 possible conventions that someone could use, but only results in two unique data layouts in RAM :)
Row vectors + Row major: (MS example CPU convention)
Xx Xy Xz 0, Yx Yy Yz 0, Zx Zy Zz 0, Px Py Pz 1
Column Vectors + Column Major (same as above!): (I use this convention everywhere in my engine)
Xx Xy Xz 0, Yx Yy Yz 0, Zx Zy Zz 0, Px Py Pz 1
Row vectors + Column major: (MS example GPU convention)
Xx Yx Zx Px, Xy Yy Zy Py, Xz Yz Zz Pz, 0  0  0  1
Column vectors + Row major (same as above!):
Xx Yx Zx Px, Xy Yy Zy Py, Xz Yz Zz Pz, 0  0  0  1
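The four-conventions-but-two-layouts claim can be checked mechanically. The following is only a sketch: the Grid struct and the function names are invented for this example, and the cell labels are the ones used in the tables above.

```cpp
#include <array>
#include <cassert>
#include <string>

struct Grid { const char* cell[4][4]; };  // cell[row][col], as written on paper
using Flat = std::array<std::string, 16>; // 16 consecutive slots in RAM

// The same transform written for row vectors and for column vectors.
const Grid kRowVectorForm = {{
    {"Xx", "Xy", "Xz", "0"},
    {"Yx", "Yy", "Yz", "0"},
    {"Zx", "Zy", "Zz", "0"},
    {"Px", "Py", "Pz", "1"},
}};
const Grid kColumnVectorForm = {{
    {"Xx", "Yx", "Zx", "Px"},
    {"Xy", "Yy", "Zy", "Py"},
    {"Xz", "Yz", "Zz", "Pz"},
    {"0", "0", "0", "1"},
}};

// Row-major: element (r, c) lands at index r * 4 + c.
Flat flatten_row_major(const Grid& g) {
    Flat out;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r * 4 + c] = g.cell[r][c];
    return out;
}

// Column-major: element (r, c) lands at index c * 4 + r.
Flat flatten_column_major(const Grid& g) {
    Flat out;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[c * 4 + r] = g.cell[r][c];
    return out;
}

// Four convention pairs collapse into two distinct memory layouts.
bool only_two_layouts() {
    const Flat a = flatten_row_major(kRowVectorForm);
    const Flat b = flatten_column_major(kColumnVectorForm);
    const Flat c = flatten_column_major(kRowVectorForm);
    const Flat d = flatten_row_major(kColumnVectorForm);
    return a == b && c == d && a != c;
}

void run_layout_checks() {
    assert(only_two_layouts());
}
```

Changing exactly one of the two conventions moves you from one layout to the other, which is the point where a transpose becomes necessary.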

Awesome reply, thank you.

I'll have a closer think about what I want to achieve here.  :D

Instead of changing the CPU code, you can also change the used conventions in your shader code. For example: HLSL has both a row_major and column_major type modifier.
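As a rough CPU-side emulation of what those modifiers change (this is C++, not HLSL, and the Packing enum and element_at function are hypothetical names made up for the sketch): the same 16 floats are simply indexed differently, so picking the modifier that matches the CPU layout removes the need for a transpose before upload.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for the row_major / column_major type modifiers.
enum class Packing { RowMajor, ColumnMajor };

// Fetch matrix element (row, col) from 16 contiguous floats.
float element_at(const std::array<float, 16>& data, int row, int col, Packing p) {
    return p == Packing::RowMajor ? data[row * 4 + col] : data[col * 4 + row];
}

bool modifier_replaces_transpose() {
    // CPU side: row-vector translation by (5, 6, 7), stored row-major, not transposed.
    const std::array<float, 16> uploaded = {1, 0, 0, 0,
                                            0, 1, 0, 0,
                                            0, 0, 1, 0,
                                            5, 6, 7, 1};
    bool ok = true;
    // Read as row-major: the translation is still in the bottom row.
    ok = ok && element_at(uploaded, 3, 0, Packing::RowMajor) == 5.0f;
    // Read as column-major: the reader sees the transposed matrix instead.
    ok = ok && element_at(uploaded, 0, 3, Packing::ColumnMajor) == 5.0f;
    ok = ok && element_at(uploaded, 3, 0, Packing::ColumnMajor) == 0.0f;
    return ok;
}

void run_packing_checks() {
    assert(modifier_replaces_transpose());
}
```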

🧙

Nice! This is great to know!

5 hours ago, matt77hias said:

Instead of changing the CPU code, you can also change the used conventions in your shader code. For example: HLSL has both a row_major and column_major type modifier.

Beware of that if you are loading matrices from buffers such as StructuredBuffer. In this case the matrix is loaded as four separate float4 vectors, and in row_major mode this can result in more register usage when multiplying (at least that is what I've seen in the HLSL assembly). I'm not sure about the performance implications of row_major versus column_major when operating on constant buffers.

 

Example with a matrix loaded from a structured buffer and multiplied with a float4:


float3 pos = mul(float4(inPos, 1), matrixBuffer).xyz;

This results in the following HLSL assembly when the buffer stores column_major float4x4 matrices:


0x00000494:     ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r6.xyzw, r2.w, l(0), t2.xyzw
0x000004C0:     ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r7.xyzw, r2.w, l(16), t2.xyzw
0x000004EC:     ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r8.xyzw, r2.w, l(32), t2.xyzw
0x00000518:     dp4 r6.x, r3.xyzw, r6.xyzw
0x00000534:     dp4 r6.y, r3.xyzw, r7.xyzw
0x00000550:     dp4 r6.z, r3.xyzw, r8.xyzw

With row_major float4x4:


0x00000494:     ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r6.xyz, r2.w, l(0), t2.xyzx
0x000004C0:     ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r7.xyz, r2.w, l(16), t2.xyzx
0x000004EC:     ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r8.xyz, r2.w, l(32), t2.xyzx
0x00000518:     ld_structured_indexable(structured_buffer, stride=64)(mixed,mixed,mixed,mixed) r9.xyw, r2.w, l(48), t2.xyxz
0x00000544:     mov r10.x, r6.x
0x00000558:     mov r10.y, r7.x
0x0000056C:     mov r10.z, r8.x
0x00000580:     mov r10.w, r9.x
0x00000594:     dp4 r10.x, r3.xyzw, r10.xyzw
0x000005B0:     mov r11.x, r6.y
0x000005C4:     mov r11.y, r7.y
0x000005D8:     mov r11.z, r8.y
0x000005EC:     mov r11.w, r9.y
0x00000600:     dp4 r10.y, r3.xyzw, r11.xyzw
0x0000061C:     mov r9.x, r6.z
0x00000630:     mov r9.y, r7.z
0x00000644:     mov r9.z, r8.z
0x00000658:     dp4 r10.z, r3.xyzw, r9.xyzw

Fun fact: the row_major version even has to load the last row of the matrix, while the column_major version can get away without loading the last column (I think because I am not interested in the .w component?).
