vec_ld versus vec_vsx

Discussion:

vec_ld versus vec_vsx_ld on power8

Ewart Timothée

2015-03-13 09:04:19 UTC

Hello all,

I have an issue/question about using VMX/VSX on a POWER8 processor on a little-endian system.
Using intrinsic functions, if I perform an operation with vec_vsx_ld(…) - vec_vsx_st(), the compiler adds
a permutation and then performs the operation (memory is correctly aligned):

lxvd2x …
xxpermdi …
operations ….
xxpermdi
stxvd2x …

If I use vec_ld() - vec_st()

lvx
operations …
stvx

Reading the ISA, I do not see a real difference between these two instructions (or I missed it).

So my 3 questions are:

Why do I have permutations?
What is the cost of these permutations?
What is the performance difference between vec_vsx_ld and vec_ld?

Best

Tim

Timothée Ewart, Ph. D.
http://www.linkedin.com/in/tewart

Bill Schmidt

2015-03-13 15:16:10 UTC

Hi Tim,

I'll discuss the loads here for simplicity; the situation for stores is
analogous.

There are a couple of differences between lvx and lxvd2x. The most
important one is that lxvd2x supports unaligned loads, while lvx does
not. You'll note that lvx will zero out the lower 4 bits of the
effective address in order to force an aligned load.
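
The address handling described above can be modelled in portable C. This is only a sketch of the effective-address arithmetic, not of the instructions themselves, and the function names are hypothetical:

```c
#include <stdint.h>

/* Model of the address lvx actually uses: the low 4 bits of the
   effective address are cleared, so the access is always to a
   16-byte-aligned quadword. */
static uintptr_t lvx_effective_address(uintptr_t ea)
{
    return ea & ~(uintptr_t)0xF;
}

/* Model of lxvd2x: the address is used unchanged, so an unaligned
   address still refers to the bytes the caller asked for. */
static uintptr_t lxvd2x_effective_address(uintptr_t ea)
{
    return ea;
}
```

With an address such as 0x1005, the lvx model yields 0x1000, which means the load silently returns the enclosing aligned quadword rather than the 16 bytes starting at 0x1005.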

lxvd2x loads two doublewords into a vector register using big-endian
element order, regardless of whether the processor is running in
big-endian or little-endian mode. That is, the first doubleword from
memory goes into the high-order bits of the vector register, and the
second doubleword goes into the low-order bits. This is semantically
incorrect for little-endian, so the xxpermdi swaps the doublewords in
the register to correct for this.
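
As a rough illustration of the load-then-swap behaviour just described, here is a portable C model in which a vector register is represented as two doublewords. The type and function names are assumptions made for this sketch; it does not use the real instructions:

```c
#include <stdint.h>

/* A 128-bit vector register modelled as two doublewords:
   dw[0] is the high-order half, dw[1] is the low-order half. */
typedef struct { uint64_t dw[2]; } vreg_model;

/* Model of lxvd2x: the first doubleword in memory lands in the
   high-order half and the second in the low-order half, whatever
   the endian mode. */
static vreg_model model_lxvd2x(const uint64_t mem[2])
{
    vreg_model r = { { mem[0], mem[1] } };
    return r;
}

/* Model of the xxpermdi the compiler adds on little-endian:
   exchange the two doublewords of the register. */
static vreg_model model_swap_doublewords(vreg_model r)
{
    vreg_model s = { { r.dw[1], r.dw[0] } };
    return s;
}
```

Applying the swap twice gives back the original register, which is why a swap after the load and another before the store can cancel out when the computation in between does not care which lane is which.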

At optimization -O1 and higher, gcc will remove many of the xxpermdi
instructions that are added to correct for LE semantics. In many vector
computations, the lanes where the computations are performed do not
matter, so we don't have to perform the swaps.

For unaligned loads where we are unable to remove the swaps, this is
still better than the alternative using lvx. An unaligned load requires
a four-instruction sequence to load the two aligned quadwords that
contain the desired data, set up a permutation control vector, and
combine the desired pieces of the two aligned quadwords into a vector
register. This can be pipelined in a loop so that only one load occurs
per loop iteration, but that requires additional vector copies. The
four-instruction sequence takes longer and increases vector register
pressure more than an lxvd2x/xxpermdi.
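
To make the lvx-based alternative concrete, the sketch below models it in portable C: two aligned 16-byte loads, a shift amount that plays the role of the permute control vector, and a final combine. It is an illustration under stated assumptions, not the code the compiler emits; the function name is hypothetical, and it assumes every aligned quadword touched by the access is readable.

```c
#include <stdint.h>
#include <string.h>

/* Portable model of an unaligned 16-byte load built from aligned loads. */
static void model_unaligned_load(const uint8_t *p, uint8_t out[16])
{
    uintptr_t ea = (uintptr_t)p;

    /* Two aligned loads: each address has its low 4 bits cleared.
       Using ea + 15 for the second one means an already aligned p
       touches only a single quadword. */
    const uint8_t *first  = (const uint8_t *)(ea & ~(uintptr_t)0xF);
    const uint8_t *second = (const uint8_t *)((ea + 15) & ~(uintptr_t)0xF);

    /* The misalignment stands in for the permute control vector. */
    unsigned shift = (unsigned)(ea & 0xF);

    uint8_t both[32];
    memcpy(both, first, 16);
    memcpy(both + 16, second, 16);

    /* Combine: select the 16 bytes that start at the original address. */
    memcpy(out, both + shift, 16);
}
```

The extra temporary (`both`) corresponds to the additional vector registers the real sequence needs, which is the register-pressure cost mentioned above.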

When the data is known to be aligned, lvx performs equivalently to
lxvd2x if we are able to remove the permutes, and is preferable to
lxvd2x if not.

There are cases where we do not yet use lvx in lieu of lxvd2x when we
could do so and improve performance. For example, saving and restoring
of vector parameters in a function prolog and epilog does not yet always
use lvx. This is a performance opportunity we plan to improve in the
future.

A rule of thumb for your purposes is that if you can guarantee that you
are using aligned data, you should use vec_ld and vec_st, and otherwise
you should use vec_vsx_ld and vec_vsx_st. Depending on your
application, it may be worthwhile to copy your data into an aligned
buffer before performing vector calculations on it. GCC provides
attributes that will allow you to specify alignment on a 16-byte
boundary.
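
A minimal sketch of the alignment attribute mentioned above, using GCC's `aligned` attribute. The buffer name, its size, and the helper function are illustrative assumptions:

```c
#include <stdint.h>

/* Request a 16-byte boundary for this buffer with GCC's aligned
   attribute. The name and size are placeholders. */
static float aligned_buf[1024] __attribute__((aligned(16)));

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}
```

A check such as `is_aligned16(ptr)` can be used at run time to decide whether data received from elsewhere needs to be copied into an aligned buffer first.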

Note that the above discussion presumes POWER8, which is the only POWER
hardware that currently supports little-endian distributions and
applications. Unaligned load/store performance on earlier processors
was less efficient, so the tradeoffs differ.

I hope this is helpful!

Bill Schmidt, Ph.D.
IBM Linux Technology Center

Ewart Timothée

2015-03-13 15:42:26 UTC

thank you very much for this answer.
I know my memory is aligned so I will use vec_ld/st only.

best

Tim

Bill Schmidt

2015-03-13 16:50:01 UTC

Hi Tim,

Actually, I left out another very good reason why you may want to use
vec_vsx_ld/st. Sorry for forgetting this.

As you saw, vec_ld translates into the lvx instruction. This
instruction loads a sequence of 16 bytes into a vector register. For
big endian, the first byte in memory is loaded into the high order byte
of the register. For little endian, the first byte in memory is loaded
into the low order byte of the register.

This is fine if the data you are loading is arrays of characters, but is
not so fine if you are loading arrays of larger items. Suppose you are
loading four integers {1, 2, 3, 4} into a register with lvx. In big
endian you will see:

00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04

In little endian you will see:

04 00 00 00 03 00 00 00 02 00 00 00 01 00 00 00

But for this to be interpreted as a vector of integers ordered for
little endian, what you really want is:

00 00 00 04 00 00 00 03 00 00 00 02 00 00 00 01

If you use vec_vsx_ld, the compiler will generate an lxvd2x instruction
followed by an xxpermdi that swaps the doublewords. After the lxvd2x
you will have:

00 00 00 02 00 00 00 01 00 00 00 04 00 00 00 03

because the two LE doublewords are loaded in BE (reversed) order.
Swapping the two doublewords restores sanity:

00 00 00 04 00 00 00 03 00 00 00 02 00 00 00 01

So, even if your data is properly aligned, the use of vec_ld = lvx is
only correct if you are loading arrays of bytes. Arrays of anything
larger must use vec_vsx_ld to avoid errors.

Again, sorry for my previous omission!

Thanks,

Bill Schmidt, Ph.D.
IBM Linux Technology Center

Ewart Timothée

2015-03-13 17:11:53 UTC

Hello,

I am super confused now.

Scenario 1, what I have in my code:
machine boots in LE.

1) memory: LE
2) I load (vec_ld)
3) register: LE
4) VSU computes in LE
5) I store (vec_st)
6) memory: LE

Scenario 2 (I did not test it, but it is what I get if I tell gcc to compile for BE):
machine boots in BE.

1) memory: BE
2) I load (vec_vsx_ld)
3) register: BE
4) VSU computes in BE
5) I store (vec_vsx_st)
6) memory: BE

At this point the VSU computes in both orders.

Chimera scenario 3, what I understand:

machine boots in LE.

1) memory: LE
2) I load (vec_vsx_ld) (the load swaps the elements)
3) register: BE
4) swap: LE
5) VSU computes in LE
6) swap: BE
7) I store (vec_vsx_st) (the store swaps the elements)
8) memory: BE

I understand that vec_vsx_ld/vec_vsx_st load/store from LE/BE, but since the VSU can compute
in both modes, what should I swap? (To be precise, I am working with 32/64-bit floats.)

Best,

Tim

Timothée Ewart, Ph. D.
http://www.linkedin.com/in/tewart

Bill Schmidt

2015-03-13 17:27:25 UTC

Hi Tim,

Sorry to have confused you. This stuff is a bit boggling the first 200
times you look at it...

For both 32-bit and 64-bit floating-point, you should use vec_vsx_ld on
both BE and LE machines, and the compiler will take care of doing the
right thing for you in both cases. You do not have to add any swaps
yourself.

When compiling for big-endian, vec_vsx_ld will translate into either
lxvw4x (for 32-bit floating-point) or lxvd2x (for 64-bit
floating-point). The values will be loaded into the register from
left-to-right (BE ordering).

When compiling for little-endian, vec_vsx_ld will translate into lxvd2x
followed by xxpermdi for both 32-bit and 64-bit floating-point. This
does the right thing in both cases. The values will be loaded into the
register from right-to-left (LE ordering).
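
As a hedged sketch of what "the same source for BE and LE" can look like, the function below adds two float arrays with vec_vsx_ld and vec_vsx_st and contains no explicit swaps. The function name, the `__VSX__` guard, and the scalar fallback are assumptions added for this example so that it also builds on machines without VSX; they are not part of the advice above.

```c
#include <stddef.h>

#if defined(__VSX__)
#include <altivec.h>
#endif

/* out[i] = a[i] + b[i]; the pointers need not be 16-byte aligned. */
static void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

#if defined(__VSX__)
    /* Identical source for big-endian and little-endian builds:
       the compiler chooses the load/store instructions and any
       swaps that are needed. */
    for (; i + 4 <= n; i += 4) {
        __vector float va = vec_vsx_ld(0, a + i);
        __vector float vb = vec_vsx_ld(0, b + i);
        vec_vsx_st(vec_add(va, vb), 0, out + i);
    }
#endif

    /* Scalar tail, and the whole loop when VSX is not available. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The first argument of vec_vsx_ld and the second of vec_vsx_st is a byte offset added to the pointer, so passing 0 with `a + i` addresses four consecutive floats starting at element `i`.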

The vector programming model is set up to allow you to usually code the
same way for both BE and LE. This is discussed more in Chapter 6 of the
ELFv2 ABI manual, which can be obtained from the OpenPOWER Connect
website (free registration required):

https://www-03.ibm.com/technologyconnect/tgcm/TGCMServlet.wss?alias=OpenPOWER&linkid=1n0000

Bill
