Discussion:
[RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc
Aaron Sawdey
2018-10-17 20:28:15 UTC
I've previously posted a patch to add vector/vsx inline expansion of
strcmp/strncmp for the power8/power9 processors. Here are some of the
other items I have in the pipeline that I hope to get into gcc9:

* vector/vsx support for inline expansion of memcmp to non-loop code.
This improves performance of small memcmp.
* vector/vsx support for inline expansion of memcmp to loop code. This
will close the performance gap for lengths of about 128-512 bytes
by making the loop code closer to the performance of the library
memcmp.
* generate inline expansion to a loop for strcmp/strncmp. This closes
  another performance gap: the vector/vsx strcmp/strncmp code we
  currently generate is much faster than the library call, but we
  only emit a comparison of the first 64 bytes to avoid exploding
  code size. Similar code in a loop would stay compact and would
  allow comparing perhaps the first 512 bytes inline before falling
  back to the library function.
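To make the non-loop memcmp idea concrete, here is a rough C model of
the kind of straight-line code such an expansion produces. This is an
illustrative sketch, not the rtl GCC actually emits: the real expansion
uses 16-byte vsx loads where available, and the function names below
are made up for the example.

```c
#include <stdint.h>

/* Load 8 bytes and interpret them as a big-endian integer so that an
   ordinary integer comparison agrees with memcmp's lexicographic
   byte order. */
static uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Straight-line (non-loop) comparison of exactly 16 bytes, mimicking
   an unrolled inline memcmp expansion: the first differing chunk
   decides the result, with no byte loop. */
static int memcmp16_inline(const void *a, const void *b)
{
    const unsigned char *pa = a, *pb = b;
    uint64_t x, y;

    x = load_be64(pa);          /* first 8-byte chunk */
    y = load_be64(pb);
    if (x != y)
        return x < y ? -1 : 1;

    x = load_be64(pa + 8);      /* second 8-byte chunk */
    y = load_be64(pb + 8);
    if (x != y)
        return x < y ? -1 : 1;

    return 0;
}
```

Comparing chunks as big-endian integers is what lets one wide compare
stand in for up to eight byte compares.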

If anyone has any other input on the inline expansion work I've been
doing for the rs6000 target, please let me know.

Thanks!
Aaron
--
Aaron Sawdey, Ph.D. ***@linux.vnet.ibm.com
050-2/C113 (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Florian Weimer
2018-10-17 21:03:58 UTC
Post by Aaron Sawdey
I've previously posted a patch to add vector/vsx inline expansion of
strcmp/strncmp for the power8/power9 processors. [...]
If anyone has any other input on the inline expansion work I've been
doing for the rs6000 target, please let me know.
The inline expansion of strcmp is problematic for valgrind:

<https://bugs.kde.org/show_bug.cgi?id=386945>

We currently see around 0.5 KiB of instructions for each call to
strcmp. I find it hard to believe that this improves general system
performance except in micro-benchmarks.
Aaron Sawdey
2018-10-18 14:48:22 UTC
Post by Florian Weimer
Post by Aaron Sawdey
I've previously posted a patch to add vector/vsx inline expansion of
strcmp/strncmp for the power8/power9 processors. [...]
The inline expansion of strcmp is problematic for valgrind:
<https://bugs.kde.org/show_bug.cgi?id=386945>
I'm aware of this. One thing that will help is that I believe the vsx
expansion for strcmp/strncmp does not have this problem, so with
current gcc9 trunk the problem should only be seen if one of the
strings is known at compile time to be shorter than 16 bytes, or if
-mcpu=power7, or if vector/vsx is disabled. My position is that it is
valgrind's problem if it doesn't understand correct code, but I also
want valgrind to be a useful tool, so I'm going to take a look and see
if I can find a gpr sequence that is equally fast and that valgrind
can understand.
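For reference, the well-known gpr idiom that word-at-a-time string
code builds on (not necessarily the exact sequence GCC emits) is the
zero-byte test below. It is the full-width load such code depends on,
which may read bytes past the terminating nul, that valgrind reports
as a use of uninitialized memory, even though those bytes never
influence the result.

```c
#include <stdint.h>

/* Classic zero-byte test: the expression is nonzero iff some byte of
   w is 0x00. A word-at-a-time strcmp loads 8 bytes per step and uses
   this to spot the terminating nul without a byte loop. */
static int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL)
            & ~w
            & 0x8080808080808080ULL) != 0;
}
```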
Post by Florian Weimer
We currently see around 0.5 KiB of instructions for each call to
strcmp. I find it hard to believe that this improves general system
performance except in micro-benchmarks.
The expansion of strcmp where both arguments are strings of unknown
length at compile time will compare 64 bytes and then call strcmp on
the remainder if no difference is found. If the gpr sequence is used
(p7, or vec/vsx disabled), the overhead is 91 instructions. If the p8
vsx sequence is used, the overhead is 59 instructions. If the p9 vsx
sequence is used, the overhead is 41 instructions.
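As an illustrative sketch, the overall shape of that expansion looks
like the following C. The real emitted code is unrolled gpr or
vector/vsx compares, not a byte loop, and strcmp_expanded is a
made-up name for the example.

```c
#include <string.h>

/* Model of the expansion shape: compare up to 64 bytes inline, and
   only call the library strcmp on the tail if no difference and no
   terminating nul was found in the inline portion. */
static int strcmp_expanded(const char *a, const char *b)
{
    for (int i = 0; i < 64; i++) {
        unsigned char ca = (unsigned char) a[i];
        unsigned char cb = (unsigned char) b[i];
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == '\0')         /* strings matched and ended inline */
            return 0;
    }
    /* first 64 bytes identical: hand the remainder to the library */
    return strcmp(a + 64, b + 64);
}
```

Strings that differ early, or that are identical and at most 64 bytes
long, never leave the inline code; only the rare long-common-prefix
case pays for the library call.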

Yes, this will increase the instruction footprint. However, the
processors this targets (p7, p8, p9) all have aggressive instruction
prefetch. Doing some of the comparison inline makes the common cases,
strings that are totally different or that are identical and <= 64
bytes long, very much faster. Avoiding the plt call also means less
pressure on the count cache and better branch prediction elsewhere.

If you are aware of any real world code that is faster when built
with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know
so I can look at avoiding those situations.

Aaron
Segher Boessenkool
2018-10-18 16:59:47 UTC
Post by Aaron Sawdey
[...] I'm going to take a look and see if I can find a gpr sequence
that is equally fast and that valgrind can understand.
If we can do that without losing performance, that is nice of course :-)
Post by Aaron Sawdey
The expansion of strcmp where both arguments are strings of unknown
length at compile time will compare 64 bytes then call strcmp on the
remainder if no difference is found. If the gpr sequence is used (p7
or vec/vsx disabled) then the overhead is 91 instructions. If the
p8 vsx sequence is used, the overhead is 59 instructions. If the p9
vsx sequence is used, then the overhead is 41 instructions.
That is 0.355 kB, 0.230 kB, and 0.160 kB respectively, at 4 bytes per
instruction.
Post by Aaron Sawdey
[...]
If you are aware of any real world code that is faster when built
with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me
know so I can look at avoiding those situations.
+1

Thanks Aaron! Both for all the original work, and for looking at it
once again.


Segher
