Discussion:
vec_ld versus vec_vsx_ld on power8
Ewart Timothée
2015-03-13 09:04:19 UTC
Hello all,

I have a question about using VMX/VSX on a POWER8 processor on a little-endian system.
Using intrinsic functions, if I perform an operation with vec_vsx_ld(…) / vec_vsx_st(), the compiler adds
a permutation after the load and another before the store (the memory is correctly aligned):

lxvd2x …
xxpermdi …
operations ….
xxpermdi
stxvd2x …

If I use vec_ld() - vec_st()

lvx
operations …
stvx

Reading the ISA, I do not see a real difference between these two instructions (or I missed it).
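
For reference, the two variants being compared can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the original code: the names `add_vmx`, `add_vsx`, `add_scalar` and `add_demo` are hypothetical, `n` is assumed to be a multiple of 4, the pointers are assumed to be 16-byte aligned, and a scalar fallback is included only so the file also builds on machines without VSX.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ALTIVEC__) && defined(__VSX__)
#include <altivec.h>

/* Variant 1: vec_ld / vec_st (lvx / stvx in the listing above).
   Pointers must be 16-byte aligned. */
void add_vmx(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __vector float va = vec_ld(0, a + i);
        __vector float vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, c + i);
    }
}

/* Variant 2: vec_vsx_ld / vec_vsx_st (lxvd2x / stxvd2x in the listing
   above, with the extra xxpermdi on little endian). */
void add_vsx(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __vector float va = vec_vsx_ld(0, a + i);
        __vector float vb = vec_vsx_ld(0, b + i);
        vec_vsx_st(vec_add(va, vb), 0, c + i);
    }
}
#else
/* Scalar fallback (assumption: only here so the sketch compiles without VSX). */
static void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
void add_vmx(const float *a, const float *b, float *c, size_t n) { add_scalar(a, b, c, n); }
void add_vsx(const float *a, const float *b, float *c, size_t n) { add_scalar(a, b, c, n); }
#endif

/* Hypothetical self-check: returns 1 when both variants produce the sums. */
int add_demo(void)
{
    _Alignas(16) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    _Alignas(16) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    _Alignas(16) float c1[8];
    _Alignas(16) float c2[8];
    int ok = 1;

    add_vmx(a, b, c1, 8);
    add_vsx(a, b, c2, 8);
    for (size_t i = 0; i < 8; ++i)
        ok = ok && c1[i] == a[i] + b[i] && c2[i] == a[i] + b[i];
    assert(ok);
    return ok;
}
```

An element-wise add gives the same result whatever the element order inside the register, so this kernel alone cannot reveal an ordering difference between the two loads.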

So my 3 questions are:

Why do I get these permutations?
What is the cost of these permutations?
What is the performance difference between vec_vsx_ld and vec_ld?


Best

Tim



Timothée Ewart, Ph. D.
http://www.linkedin.com/in/tewart
Bill Schmidt
2015-03-13 15:16:10 UTC
Ewart Timothée
2015-03-13 15:42:26 UTC
Thank you very much for this answer.
I know my memory is aligned, so I will use vec_ld/vec_st only.

best

Tim
Bill Schmidt
2015-03-13 16:50:01 UTC
Hi Tim,

Actually, I left out another very good reason why you may want to use
vec_vsx_ld/st. Sorry for forgetting this.

As you saw, vec_ld translates into the lvx instruction. This
instruction loads a sequence of 16 bytes into a vector register. For
big endian, the first byte in memory is loaded into the high order byte
of the register. For little endian, the first byte in memory is loaded
into the low order byte of the register.

This is fine if the data you are loading is arrays of characters, but is
not so fine if you are loading arrays of larger items. Suppose you are
loading four integers {1, 2, 3, 4} into a register with lvx. In big
endian you will see:

00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04

In little endian you will see:

04 00 00 00 03 00 00 00 02 00 00 00 01 00 00 00

But for this to be interpreted as a vector of integers ordered for
little endian, what you really want is:

00 00 00 04 00 00 00 03 00 00 00 02 00 00 00 01

If you use vec_vsx_ld, the compiler will generate an lxvd2x instruction
followed by an xxpermdi that swaps the doublewords. After the lxvd2x
you will have:

00 00 00 02 00 00 00 01 00 00 00 04 00 00 00 03

because the two LE doublewords are loaded in BE (reversed) order.
Swapping the two doublewords restores sanity:

00 00 00 04 00 00 00 03 00 00 00 02 00 00 00 01

So, even if your data is properly aligned, the use of vec_ld = lvx is
only correct if you are loading arrays of bytes. Arrays of anything
larger must use vec_vsx_ld to avoid errors.
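
The load-then-swap sequence described above can be modelled on plain byte arrays. This is only a sketch under assumptions: `reg128`, `model_lxvd2x_le`, `swap_doublewords` and `swap_demo` are hypothetical names, the register is written with its high-order byte first (as in the dumps above), and the load model (each 8-byte doubleword read as a little-endian value, first doubleword placed in the high half) is chosen to reproduce the dumps in this post rather than taken from the ISA text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 128-bit register as 16 bytes, index 0 = highest-order byte,
   matching the left-to-right dumps in the post. */
typedef struct { uint8_t b[16]; } reg128;

/* Assumed model of the little-endian doubleword load: the first 8 bytes
   of memory form the high doubleword, the next 8 the low doubleword, and
   each doubleword is read as a little-endian 64-bit value. */
static reg128 model_lxvd2x_le(const uint8_t mem[16])
{
    reg128 r;
    for (int dw = 0; dw < 2; ++dw)
        for (int k = 0; k < 8; ++k)
            r.b[dw * 8 + k] = mem[dw * 8 + (7 - k)];
    return r;
}

/* Model of the doubleword swap: exchange the high and low 8 bytes. */
static reg128 swap_doublewords(reg128 r)
{
    reg128 out;
    memcpy(out.b, r.b + 8, 8);
    memcpy(out.b + 8, r.b, 8);
    return out;
}

/* Hypothetical self-check: returns 1 when both dumps are reproduced. */
int swap_demo(void)
{
    /* int {1, 2, 3, 4} as stored in little-endian memory */
    const uint8_t mem[16] = {1,0,0,0, 2,0,0,0, 3,0,0,0, 4,0,0,0};
    const uint8_t after_load[16] = {0,0,0,2, 0,0,0,1, 0,0,0,4, 0,0,0,3};
    const uint8_t after_swap[16] = {0,0,0,4, 0,0,0,3, 0,0,0,2, 0,0,0,1};

    reg128 loaded = model_lxvd2x_le(mem);
    reg128 swapped = swap_doublewords(loaded);
    int ok = memcmp(loaded.b, after_load, 16) == 0 &&
             memcmp(swapped.b, after_swap, 16) == 0;
    assert(ok);
    return ok;
}
```

Calling `swap_demo()` checks the intermediate and final byte patterns against the two dumps quoted above.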

Again, sorry for my previous omission!

Thanks,

Bill Schmidt, Ph.D.
IBM Linux Technology Center
Ewart Timothée
2015-03-13 17:11:53 UTC
Hello,

I am super confused now.

Scenario 1, what I have in my code:
machine boots in LE.

1) memory: LE
2) I load (vec_ld)
3) register: LE
4) VSU computes in LE
5) I store (vec_st)
6) memory: LE

Scenario 2 (I did not test it, but it is what I get if I tell gcc to compile for BE):
machine boots in BE

1) memory: BE
2) I load (vec_vsx_ld)
3) register: BE
4) VSU computes in BE
5) I store (vec_vsx_st)
6) memory: BE

At this point the VSU computes in both orders.

Chimera scenario 3, what I understand:

machine boots in LE

1) memory: LE
2) I load (vec_vsx_ld) (the load swaps the elements)
3) register: BE
4) swap: LE
5) VSU computes in LE
6) swap: BE
7) I store (vec_vsx_st) (the store swaps the elements)
8) memory: BE

I understand that vec_vsx_ld/vec_vsx_st load/store from LE/BE, but as the VSU can compute
in both modes, what should I swap? (To be precise, I am working with 32/64-bit floats.)

Best,

Tim

Timothée Ewart, Ph. D.
http://www.linkedin.com/in/tewart
Bill Schmidt
2015-03-13 17:27:25 UTC
Hi Tim,

Sorry to have confused you. This stuff is a bit boggling the first 200
times you look at it...

For both 32-bit and 64-bit floating-point, you should use vec_vsx_ld on
both BE and LE machines, and the compiler will take care of doing the
right thing for you in both cases. You do not have to add any swaps
yourself.

When compiling for big-endian, vec_vsx_ld will translate into either
lxvw4x (for 32-bit floating-point) or lxvd2x (for 64-bit
floating-point). The values will be loaded into the register from
left-to-right (BE ordering).

When compiling for little-endian, vec_vsx_ld will translate into lxvd2x
followed by xxpermdi for both 32-bit and 64-bit floating-point. This
does the right thing in both cases. The values will be loaded into the
register from right-to-left (LE ordering).
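
As an illustration of writing the same source for both BE and LE, here is a minimal sketch for 64-bit floating-point. The names `scale_f64`, `first_element` and `scale_demo` are hypothetical, `n` is assumed to be a multiple of 2, and the scalar branch is an assumption added only so the file builds without VSX. No swap appears in the source; any permutation is left to the compiler.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ALTIVEC__) && defined(__VSX__)
#include <altivec.h>

/* Multiply n doubles by s, two at a time, using the VSX load/store
   intrinsics. The same source is meant to be compiled for BE and LE. */
void scale_f64(const double *in, double *out, size_t n, double s)
{
    const __vector double vs = vec_splats(s);
    for (size_t i = 0; i < n; i += 2) {
        __vector double v = vec_vsx_ld(0, in + i);
        vec_vsx_st(vec_mul(v, vs), 0, out + i);
    }
}

/* Element 0 of the loaded vector. Under the programming model described
   in the post, this is expected to be p[0] on both BE and LE. */
double first_element(const double *p)
{
    return vec_extract(vec_vsx_ld(0, p), 0);
}
#else
/* Scalar fallback (assumption: only here so the sketch compiles without VSX). */
void scale_f64(const double *in, double *out, size_t n, double s)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * s;
}
double first_element(const double *p)
{
    return p[0];
}
#endif

/* Hypothetical self-check: returns 1 when results match the scalar math. */
int scale_demo(void)
{
    const double in[4] = {1.0, 2.0, 3.0, 4.0};
    double out[4];
    int ok;

    scale_f64(in, out, 4, 0.5);
    ok = out[0] == 0.5 && out[1] == 1.0 && out[2] == 1.5 && out[3] == 2.0 &&
         first_element(in) == 1.0;
    assert(ok);
    return ok;
}
```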

The vector programming model is set up to allow you to usually code the
same way for both BE and LE. This is discussed more in Chapter 6 of the
ELFv2 ABI manual, which can be obtained from the OpenPOWER Connect
website (free registration required):

https://www-03.ibm.com/technologyconnect/tgcm/TGCMServlet.wss?alias=OpenPOWER&linkid=1n0000

Bill