Discussion:
PROPOSAL: Extend inline asm syntax with size spec
Borislav Petkov
2018-10-07 09:18:06 UTC
Hi people,

this is an attempt to see whether gcc's inline asm heuristic when
estimating inline asm statements' cost for better inlining can be
improved.

AFAIU, the problem arises when one ends up using a lot of inline
asm statements in the kernel: due to the inline asm cost estimation
heuristic - which counts lines, I think - for example in a macro like
this one:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/cpufeature.h#n162

the resulting code ends up not inlining the functions themselves which
use this macro. I.e., you see a CALL <function> instead of its body
getting inlined directly.

Even though it should be inlined, because the actual instructions are
only a couple in most cases, and all those other directives end up in
another section anyway.
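
To illustrate, here is a simplified sketch - not the literal kernel
macro, and with a made-up section name - of the kind of construct behind
that link. There is one real .text instruction, but every directive line
still counts toward gcc's estimate:

/* Simplified sketch of a _static_cpu_has-style construct: one real
 * instruction, plus directives whose bytes land in another section,
 * yet each line inflates gcc's line-based size estimate. */
static inline int has_feature(void)
{
	int ret;

	asm ("movl $1, %0\n"				/* the only .text instruction */
	     "1:\n"
	     ".pushsection .discard.example, \"a\"\n"	/* made-up section */
	     ".long 1b\n"
	     ".popsection"
	     : "=r" (ret));
	return ret;
}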

The issue is explained below in the forwarded mail in a larger detail
too.

Now, Richard suggested doing something like:

1) inline asm ("...")
2) asm ("..." : : : : <size-expr>)
3) asm ("...") __attribute__((asm_size(<size-expr>)));

with which the user can tell gcc what the size of that inline asm
statement is, thus allowing for more precise cost estimation and, in
the end, better inlining.
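
For illustration, usage of hypothetical option 3) might look like this
(proposed syntax only - the asm_size attribute does not exist in gcc):

/* Hypothetical: declare that this asm occupies 1 byte of .text,
 * no matter how many directive lines the template contains. */
asm ("nop\n"
     ".pushsection .discard.example, \"a\"\n"	/* made-up section */
     ".long 0\n"
     ".popsection"
     : : : "memory") __attribute__((asm_size(1)));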

And FWIW 3) looks pretty straightforward to me because attributes are
pretty common anyway.

But I'm sure there are other options and I'm sure people will have
better/different ideas so feel free to chime in.

Thx.
This patch-set deals with an interesting yet stupid problem: kernel code
that does not get inlined despite its simplicity. There are several
causes for this behavior: the "cold" attribute on __init; different
function optimization levels; conditional constant computations based on
__builtin_constant_p(); and finally large inline assembly blocks.
This patch-set deals with the inline assembly problem. I separated these
patches from the others (that were sent in the RFC) for easier
inclusion. I also separated out the removal of unnecessary new-lines,
which will be sent separately.
The problem with inline assembly is that the kernel often uses it for
things other than code - for example, assembly directives and data. GCC,
however, is oblivious to the content of the blocks and assumes their
cost in space and time is proportional to the number of perceived
assembly "instructions", which it derives from the number of newlines
and semicolons. Alternatives, paravirt and other mechanisms are
affected, causing code not to be inlined and degrading compilation
quality in general.
The solution that this patch-set carries for this problem is to create
an assembly macro, and then call it from the inline assembly block. As
a result, the compiler sees a single "instruction" and assigns a more
appropriate cost to the code.
To avoid uglification of the code, as many noted, the macros are first
precompiled into an assembly file, which is later assembled together
with the C files. This also makes it possible to avoid the duplicate
asm and C implementations that existed before, as can be seen in the
exception-table changes.
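
As a sketch of the approach (names here are illustrative, not the
actual patch): the asm body moves into an assembler macro in a file
that is assembled alongside every C file,

# sketch of an entry in arch/x86/kernel/macros.S (illustrative names)
.macro FEATURE_CHECK_SKETCH out:req
	movl $1, \out
1:
	.pushsection .discard.example, "a"
	.long 1b
	.popsection
.endm

and the inline asm on the C side shrinks to a single perceived
instruction:

asm ("FEATURE_CHECK_SKETCH %0" : "=r" (ret));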
Overall this patch-set slightly increases the kernel size; my build was:
text data bss dec hex filename
18140829 10224724 2957312 31322865 1ddf2f1 ./vmlinux before
18163608 10227348 2957312 31348268 1de562c ./vmlinux after (+0.1%)
The number of static functions in the image is reduced by 379, but the
actual inlining improvement is even bigger, which does not always show
in these numbers: a function may be inlined, causing the calling
function not to be inlined.
I ran a limited number of benchmarks, and in general the performance
impact is not very notable. You can still see >10 cycles shaved off some
syscalls that manipulate page-tables (e.g., mprotect()), in which
paravirt caused many functions not to be inlined. In addition, this
patch-set can prevent issues such as [1], and improves code readability
and maintainability.
Update: Rasmus recently caused me (inadvertently) to become paranoid
about the dependencies. To clarify: if any of the headers changes, any C
file which uses macros that are included in macros.S is fine as long as
it includes the header as well (as it should). Adding an assertion to
check that this is done might turn out slightly ugly, and nobody else is
concerned about it. Another minor issue is that changes to macros.S do
not trigger a global rebuild, but that is pretty similar to changes to
the Makefile not triggering a rebuild.
[1] https://patchwork.kernel.org/patch/10450037/
v8->v9: * Restoring the '-pipe' parameter (Rasmus)
* Adding Kees's tested-by tag (Kees)
v7->v8: * Add acks (Masahiro, Max)
* Rebase on 4.19 (Ingo)
v6->v7: * Fix context switch tracking (Ingo)
* Fix xtensa build error (Ingo)
* Rebase on 4.18-rc8
v5->v6: * Removing more code from jump-labels (PeterZ)
* Fix build issue on i386 (0-day, PeterZ)
v4->v5: * Makefile fixes (Masahiro, Sam)
v3->v4: * Changed naming of macros in 2 patches (PeterZ)
* Minor cleanup of the paravirt patch
v2->v3: * Several build issues resolved (0-day)
* Wrong comments fix (Josh)
* Change asm vs C order in refcount (Kees)
v1->v2: * Compiling the macros into a separate .s file, improving
readability (Linus)
* Improving assembly formatting, applying most of the comments
according to my judgment (Jan)
* Adding exception-table, cpufeature and jump-labels
* Removing new-line cleanup; to be submitted separately
xtensa: defining LINKER_SCRIPT for the linker script
Makefile: Prepare for using macros for inline asm
x86: objtool: use asm macro for better compiler decisions
x86: refcount: prevent gcc distortions
x86: alternatives: macrofy locks for better inlining
x86: bug: prevent gcc distortions
x86: prevent inline distortion by paravirt ops
x86: extable: use macros instead of inline assembly
x86: cpufeature: use macros instead of inline assembly
x86: jump-labels: use macros instead of inline assembly
Makefile | 9 ++-
arch/x86/Makefile | 7 ++
arch/x86/entry/calling.h | 2 +-
arch/x86/include/asm/alternative-asm.h | 20 ++++--
arch/x86/include/asm/alternative.h | 11 +--
arch/x86/include/asm/asm.h | 61 +++++++---------
arch/x86/include/asm/bug.h | 98 +++++++++++++++-----------
arch/x86/include/asm/cpufeature.h | 82 ++++++++++++---------
arch/x86/include/asm/jump_label.h | 77 ++++++++------------
arch/x86/include/asm/paravirt_types.h | 56 +++++++--------
arch/x86/include/asm/refcount.h | 74 +++++++++++--------
arch/x86/kernel/macros.S | 16 +++++
arch/xtensa/kernel/Makefile | 4 +-
include/asm-generic/bug.h | 8 +--
include/linux/compiler.h | 56 +++++++++++----
scripts/Kbuild.include | 4 +-
scripts/mod/Makefile | 2 +
17 files changed, 331 insertions(+), 256 deletions(-)
create mode 100644 arch/x86/kernel/macros.S
--
2.17.1
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Segher Boessenkool
2018-10-07 13:22:28 UTC
Post by Borislav Petkov
this is an attempt to see whether gcc's inline asm heuristic when
estimating inline asm statements' cost for better inlining can be
improved.
GCC already estimates the *size* of inline asm, and this is required
*for correctness*. So any workaround that works against this will only
end in tears.

Taking size as an estimate of cost is not very good. But in your
motivating example you actually *do* care mostly about size, namely, for
the inlining decisions.

So I guess the real issue is that the inline asm size estimate for x86
isn't very good (since it has to be pessimistic, and x86 insns can be
huge)?
Post by Borislav Petkov
1) inline asm ("...")
What would the semantics of this be?
Post by Borislav Petkov
2) asm ("..." : : : : <size-expr>)
This potentially conflicts with the syntax for asm goto.
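For reference, asm goto already uses the section after the fourth colon
for its labels - exactly the slot the proposed <size-expr> would take:

/* Fragment (assumes an int x and a C label was_set exist); the
 * sections are: outputs : inputs : clobbers : labels. */
asm goto ("test %0, %0\n\t"
	  "jnz %l[was_set]"
	  : /* no outputs allowed with asm goto */
	  : "r" (x)
	  : "cc"
	  : was_set);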
Post by Borislav Petkov
3) asm ("...") __attribute__((asm_size(<size-expr>)));
Eww.
Post by Borislav Petkov
with which user can tell gcc what the size of that inline asm statement
is and thus allow for more precise cost estimation and in the end better
inlining.
More precise *size* estimates, yes. And if the user lies he should not
be surprised to get assembler errors, etc.
Post by Borislav Petkov
And FWIW 3) looks pretty straight-forward to me because attributes are
pretty common anyways.
I don't like 2) either. But 1) looks interesting, depends what its
semantics would be? "Don't count this insn's size for inlining decisions",
maybe?


Another option is to just force inlining for those few functions where
GCC currently makes an inlining decision you don't like. Or are there
more than a few?


Segher
Borislav Petkov
2018-10-07 14:13:49 UTC
Post by Segher Boessenkool
GCC already estimates the *size* of inline asm, and this is required
*for correctness*.
I didn't say it didn't - but the heuristic could use improving.
Post by Segher Boessenkool
So I guess the real issue is that the inline asm size estimate for x86
isn't very good (since it has to be pessimistic, and x86 insns can be
huge)?
Well, the size thing could be just a "parameter" or "hint" of sorts, to
tell gcc to inline the function X containing the asm statement into the
function Y which calls X. If you look at the patchset, it is moving
everything to asm macros where gcc is apparently able to do better
inlining.
Post by Segher Boessenkool
Post by Borislav Petkov
3) asm ("...") __attribute__((asm_size(<size-expr>)));
Eww.
Why?
Post by Segher Boessenkool
More precise *size* estimates, yes. And if the user lies he should not
be surprised to get assembler errors, etc.
Yes.

Another option would be for gcc to parse the inline asm directly and
do a more precise size estimation. That is a much more involved and
complicated solution, so I guess we wanna look at the simpler ones
first.

:-)
Post by Segher Boessenkool
I don't like 2) either. But 1) looks interesting, depends what its
semantics would be? "Don't count this insn's size for inlining decisions",
maybe?
Or simply "this asm statement has a size of 1" to mean, inline it
everywhere. Which has the same caveats as above.
Post by Segher Boessenkool
Another option is to just force inlining for those few functions where
GCC currently makes an inlining decision you don't like. Or are there
more than a few?
I'm afraid they're more than a few and this should work automatically,
if possible.

Thx.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Segher Boessenkool
2018-10-07 15:14:28 UTC
Post by Borislav Petkov
Post by Segher Boessenkool
GCC already estimates the *size* of inline asm, and this is required
*for correctness*.
I didn't say it didn't - but the heuristic could use improving.
How? It is as sharp an estimate as can be *already*: number of insns
times maximum size per insn.

If you get this wrong, conditional branches (and similar things, but
conditional branches usually hit first, and hurt most) will stop working
correctly, unless binutils uses relaxation for those on your architecture
(most don't).
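
To make that concrete with an illustrative sketch: a handful of
perceived "instructions" can assemble to kilobytes, and an
under-estimate breaks any short-range branch emitted around the asm:

/* Counted as ~3 insns by the line heuristic, but assembles to 16384
 * bytes.  A compiler-generated conditional branch jumping over it
 * could end up out of encodable range on a target with small branch
 * displacements and no assembler relaxation. */
void pad(int c)
{
	if (c)
		asm (".rept 16384\n\tnop\n\t.endr");
}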
Post by Borislav Petkov
Post by Segher Boessenkool
So I guess the real issue is that the inline asm size estimate for x86
isn't very good (since it has to be pessimistic, and x86 insns can be
huge)?
Well, the size thing could be just a "parameter" or "hint" of sorts, to
tell gcc to inline the function X which is inlining the asm statement
into the function Y which is calling function X. If you look at the
patchset, it is moving everything to asm macros where gcc is apparently
able to do better inlining.
Yes, that will cause fewer problems I think: do not override size _at all_,
but give a hint to the inliner.
Post by Borislav Petkov
Post by Segher Boessenkool
Post by Borislav Petkov
3) asm ("...") __attribute__((asm_size(<size-expr>)));
Eww.
Why?
Attributes are clumsy and clunky and kludgy.

It never is well-defined how attributes interact, and the more attributes
we define and use, the more that matters.
Post by Borislav Petkov
Post by Segher Boessenkool
More precise *size* estimates, yes. And if the user lies he should not
be surprised to get assembler errors, etc.
Yes.
Another option would be if gcc parses the inline asm directly and
does a more precise size estimation. Which is a lot more involved and
complicated solution so I guess we wanna look at the simpler ones first.
:-)
Which is *impossible* to do. Inline assembler is free-form text.
Post by Borislav Petkov
Post by Segher Boessenkool
I don't like 2) either. But 1) looks interesting, depends what its
semantics would be? "Don't count this insn's size for inlining decisions",
maybe?
Or simply "this asm statement has a size of 1" to mean, inline it
everywhere. Which has the same caveats as above.
"Has minimum length" then (size 1 cannot work on most archs).
Post by Borislav Petkov
Post by Segher Boessenkool
Another option is to just force inlining for those few functions where
GCC currently makes an inlining decision you don't like. Or are there
more than a few?
I'm afraid they're more than a few and this should work automatically,
if possible.
Would counting *all* asms as having minimum length for inlining decisions
work? Will that give bad side effects?

Or since this problem is quite specific to x86, maybe some target hook is
wanted? Things work quite well elsewhere as-is, degrading that is not a
good idea.


Segher
Segher Boessenkool
2018-10-08 07:53:00 UTC
Post by Segher Boessenkool
Post by Borislav Petkov
Post by Segher Boessenkool
More precise *size* estimates, yes. And if the user lies he should not
be surprised to get assembler errors, etc.
Yes.
Another option would be if gcc parses the inline asm directly and
does a more precise size estimation. Which is a lot more involved and
complicated solution so I guess we wanna look at the simpler ones first.
:-)
Which is *impossible* to do. Inline assembler is free-form text.
"Impossible" is false: only under GCC's model and semantics of inline
asm that is, and only under the (false) assumption that the semantics
of the asm statement (which is a GCC extension to begin with) cannot
be changed like it has been changed multiple times in the past.
"Difficult", "not worth our while", perhaps.
If we throw out our current definition of inline assembler, and of the
internal backend interfaces, then sure you can do it. This of course
invalidates all code that uses GCC inline assembler, and all GCC backends
(both in-tree and out-of-tree, both current and historical).

If other compilers think everyone should rewrite all of their code because
those compiler do inline asm wro^H^H^Hdifferently, that is their problem;
GCC should not deny all history and screw over all its users.


Segher
Michael Matz
2018-10-07 15:53:26 UTC
Hi Segher,
Post by Segher Boessenkool
Post by Borislav Petkov
this is an attempt to see whether gcc's inline asm heuristic when
estimating inline asm statements' cost for better inlining can be
improved.
GCC already estimates the *size* of inline asm, and this is required
*for correctness*. So any workaround that works against this will only
end in tears.
You're right and wrong. GCC can't even estimate the size of mildly
complicated inline asms right now, so your claim of it being necessary
for correctness can't be true in this absolute form. I know what you're
trying to say, but still, consider inline asms like this:

insn1
.section bla
insn2
.previous

or
invoke_asm_macro foo,bar

in both cases GCC's size estimate will be wrong however you want to
count it. This is actually the motivating example for the kernel guys:
the games they play within their inline asms make the estimates wildly
wrong, to the point where it interacts with the inliner.
Post by Segher Boessenkool
So I guess the real issue is that the inline asm size estimate for x86
isn't very good (since it has to be pessimistic, and x86 insns can be
huge)?
No, see above: even if we were to improve the size estimates (e.g. based
on some average instruction size) the kernel examples would still be off
because they switch sections back and forth, use asm macros and computed
.fill directives and maybe further stuff. GCC will never be able to
accurately calculate these sizes (without a built-in assembler, which
hopefully no one proposes).

So, there is a case for extending the inline-asm facility to say
"size is complicated here, assume this for inline decisions".
Post by Segher Boessenkool
Post by Borislav Petkov
1) inline asm ("...")
What would the semantics of this be?
The size of the inline asm wouldn't be counted towards the inliner size
limits (or be counted as "1").
Post by Segher Boessenkool
I don't like 2) either. But 1) looks interesting, depends what its
semantics would be? "Don't count this insn's size for inlining decisions",
maybe?
TBH, I like the inline asm (...) suggestion most currently, but what if
we want to add more attributes to asms? We could add further special
keywords to the clobber list:
asm ("...." : : : "cc,memory,inline");
sure, it might seem strange to "clobber" inline, but if we reinterpret
the clobber list as an arbitrary set of attributes for this asm, it'd be
fine.
Post by Segher Boessenkool
Another option is to just force inlining for those few functions where
GCC currently makes an inlining decision you don't like. Or are there
more than a few?
I think the examples I saw from Boris were all indirect inlines:

static inline void foo() { asm("large-looking-but-small-asm"); }
static void bar1() { ... foo() ... }
static void bar2() { ... foo() ... }
void goo (void) { bar1(); } // bar1 should have been inlined

So, while the immediate asm user was marked as always-inline, that in
turn caused users of it to become non-inlined. I'm assuming the kernel
guys did proper measurements that they _really_ get some non-trivial
speed benefit by inlining bar1/bar2, but for some reason (I didn't
inquire) didn't want to mark them all as inline as well.


Ciao,
Michael.
Segher Boessenkool
2018-10-08 07:31:28 UTC
Hi!
Post by Michael Matz
Post by Segher Boessenkool
Post by Borislav Petkov
this is an attempt to see whether gcc's inline asm heuristic when
estimating inline asm statements' cost for better inlining can be
improved.
GCC already estimates the *size* of inline asm, and this is required
*for correctness*. So any workaround that works against this will only
end in tears.
You're right and wrong. GCC can't even estimate the size of mildly
complicated inline asms right now, so your claim of it being necessary for
correctness can't be true in this absolute form. I know what you try to
insn1
.section bla
insn2
.previous
or
invoke_asm_macro foo,bar
in both cases GCC's size estimate will be wrong however you want to count
it. This is actually the motivating example for the kernel guys, the
games they play within their inline asms make the estimates be wildly
wrong to a point it interacts with the inliner.
Right. The manual says:

"""
Some targets require that GCC track the size of each instruction used
in order to generate correct code. Because the final length of the
code produced by an @code{asm} statement is only known by the
assembler, GCC must make an estimate as to how big it will be. It
does this by counting the number of instructions in the pattern of the
@code{asm} and multiplying that by the length of the longest
instruction supported by that processor. (When working out the number
of instructions, it assumes that any occurrence of a newline or of
whatever statement separator character is supported by the assembler --
typically @samp{;} --- indicates the end of an instruction.)

Normally, GCC's estimate is adequate to ensure that correct
code is generated, but it is possible to confuse the compiler if you use
pseudo instructions or assembler macros that expand into multiple real
instructions, or if you use assembler directives that expand to more
space in the object file than is needed for a single instruction.
If this happens then the assembler may produce a diagnostic saying that
a label is unreachable.
"""

It *is* necessary for correctness, except you can do things that can
confuse the compiler and then you are on your own anyway.
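
To make the counting rule concrete (illustrative example, not from the
manual):

/* Two separators (';' and '\n') => counted as three instructions,
 * each assumed to have the target's maximum length (15 bytes on x86):
 * ~45 bytes estimated for ~5 bytes of actual code. */
asm ("xorl %%eax, %%eax; incl %%eax\n\tnop" : : : "eax");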
Post by Michael Matz
Post by Segher Boessenkool
So I guess the real issue is that the inline asm size estimate for x86
isn't very good (since it has to be pessimistic, and x86 insns can be
huge)?
No, see above, even if we were to improve the size estimates (e.g. based
on some average instruction size) the kernel examples would still be off
because they switch sections back and forth, use asm macros and computed
.fill directives and maybe further stuff. GCC will never be able to
accurately calculate these sizes
What *is* such a size, anyway? It can be spread over multiple sections
(some of which support section merging), and you can have huge
alignments, etc. What is needed here is not knowing the maximum size of
the binary output (however you want to define that), but some way for
the compiler to understand how bad it is to inline some assembler.
Maybe manual direction, maybe just the current heuristics can be tweaked
a bit, maybe we need to invent some attribute or two.
Post by Michael Matz
(without a built-in assembler, which hopefully no one proposes).
Not me, that's for sure.
Post by Michael Matz
So, there is a case for extending the inline-asm facility to say
"size is complicated here, assume this for inline decisions".
Yeah, that's an option. It may be too complicated though, or just not
useful in its generality, so that everyone will use "1" (or "1 normal
size instruction"), and then we are better off just making something
for _that_ (or making it the default).
Post by Michael Matz
Post by Segher Boessenkool
Post by Borislav Petkov
1) inline asm ("...")
What would the semantics of this be?
The size of the inline asm wouldn't be counted towards the inliner size
limits (or be counted as "1").
That sounds like a good option.
Post by Michael Matz
Post by Segher Boessenkool
I don't like 2) either. But 1) looks interesting, depends what its
semantics would be? "Don't count this insn's size for inlining decisions",
maybe?
TBH, I like the inline asm (...) suggestion most currently, but what if we
want to add more attributes to asms? We could add further special
asm ("...." : : : "cc,memory,inline");
sure, it might seem strange to "clobber" inline, but if we reinterpret the
clobber list as arbitrary set of attributes for this asm, it'd be fine.
All of a target's register names and alternative register names are
allowed in the clobber list. Will that never conflict with an attribute
name? We already *have* syntax for specifying attributes on an asm (on
*any* statement even), so mixing these two things has no advantage.

Both "cc" and "memory" have their own problems of course, adding more
things to this just feels bad. It may not be so bad ;-)
Post by Michael Matz
Post by Segher Boessenkool
Another option is to just force inlining for those few functions where
GCC currently makes an inlining decision you don't like. Or are there
more than a few?
static inline void foo() { asm("large-looking-but-small-asm"); }
static void bar1() { ... foo() ... }
static void bar2() { ... foo() ... }
void goo (void) { bar1(); } // bar1 should have been inlined
So, while the immediate asm user was marked as always inline that in turn
caused users of it to become non-inlined. I'm assuming the kernel guys
did proper measurements that they _really_ get some non-trivial speed
benefit by inlining bar1/bar2, but for some reasons (I didn't inquire)
didn't want to mark them all as inline as well.
Yeah that makes sense, like if this happens with the fixup stuff, it will
quickly spiral out of control.


Segher
Richard Biener
2018-10-08 09:07:46 UTC
Post by Segher Boessenkool
Hi!
Post by Michael Matz
Post by Segher Boessenkool
Post by Borislav Petkov
this is an attempt to see whether gcc's inline asm heuristic when
estimating inline asm statements' cost for better inlining can be
improved.
GCC already estimates the *size* of inline asm, and this is required
*for correctness*. So any workaround that works against this will only
end in tears.
You're right and wrong. GCC can't even estimate the size of mildly
complicated inline asms right now, so your claim of it being necessary for
correctness can't be true in this absolute form. I know what you try to
insn1
.section bla
insn2
.previous
or
invoke_asm_macro foo,bar
in both cases GCC's size estimate will be wrong however you want to count
it. This is actually the motivating example for the kernel guys, the
games they play within their inline asms make the estimates be wildly
wrong to a point it interacts with the inliner.
"""
Some targets require that GCC track the size of each instruction used
in order to generate correct code. Because the final length of the
assembler, GCC must make an estimate as to how big it will be. It
does this by counting the number of instructions in the pattern of the
@code{asm} and multiplying that by the length of the longest
instruction supported by that processor. (When working out the number
of instructions, it assumes that any occurrence of a newline or of
whatever statement separator character is supported by the assembler --
Normally, GCC's estimate is adequate to ensure that correct
code is generated, but it is possible to confuse the compiler if you use
pseudo instructions or assembler macros that expand into multiple real
instructions, or if you use assembler directives that expand to more
space in the object file than is needed for a single instruction.
If this happens then the assembler may produce a diagnostic saying that
a label is unreachable.
"""
It *is* necessary for correctness, except you can do things that can
confuse the compiler and then you are on your own anyway.
Post by Michael Matz
Post by Segher Boessenkool
So I guess the real issue is that the inline asm size estimate for x86
isn't very good (since it has to be pessimistic, and x86 insns can be
huge)?
No, see above, even if we were to improve the size estimates (e.g. based
on some average instruction size) the kernel examples would still be off
because they switch sections back and forth, use asm macros and computed
.fill directives and maybe further stuff. GCC will never be able to
accurately calculate these sizes
What *is* such a size, anyway? If it can be spread over multiple sections
(some of which support section merging), and you can have huge alignments,
etc. What is needed here is not knowing the maximum size of the binary
output (however you want to define that), but some way for the compiler
to understand how bad it is to inline some assembler. Maybe manual
direction, maybe just the current heuristics can be tweaked a bit, maybe
we need to invent some attribute or two.
Post by Michael Matz
(without a built-in assembler, which hopefully no one proposes).
Not me, that's for sure.
Post by Michael Matz
So, there is a case for extending the inline-asm facility to say
"size is complicated here, assume this for inline decisions".
Yeah, that's an option. It may be too complicated though, or just not
useful in its generality, so that everyone will use "1" (or "1 normal
size instruction"), and then we are better off just making something
for _that_ (or making it the default).
Post by Michael Matz
Post by Segher Boessenkool
Post by Borislav Petkov
1) inline asm ("...")
What would the semantics of this be?
The size of the inline asm wouldn't be counted towards the inliner size
limits (or be counted as "1").
That sounds like a good option.
Yes, I also like it for simplicity. It also avoids the requirement
of translating the number (in bytes?) given by the user to
"number of GIMPLE instructions" as needed by the inliner.
Post by Segher Boessenkool
Post by Michael Matz
Post by Segher Boessenkool
I don't like 2) either. But 1) looks interesting, depends what its
semantics would be? "Don't count this insn's size for inlining decisions",
maybe?
TBH, I like the inline asm (...) suggestion most currently, but what if we
want to add more attributes to asms? We could add further special
asm ("...." : : : "cc,memory,inline");
sure, it might seem strange to "clobber" inline, but if we reinterpret the
clobber list as arbitrary set of attributes for this asm, it'd be fine.
All of a target's register names and alternative register names are
allowed in the clobber list. Will that never conflict with an attribute
name? We already *have* syntax for specifying attributes on an asm (on
*any* statement even), so mixing these two things has no advantage.
Heh, but I failed to make an example with attribute syntax working.
IIRC attributes do not work on stmts. What could work is to use
a #pragma though.

Richard.
Post by Segher Boessenkool
Both "cc" and "memory" have their own problems of course, adding more
things to this just feels bad. It may not be so bad ;-)
Post by Michael Matz
Post by Segher Boessenkool
Another option is to just force inlining for those few functions where
GCC currently makes an inlining decision you don't like. Or are there
more than a few?
static inline void foo() { asm("large-looking-but-small-asm"); }
static void bar1() { ... foo() ... }
static void bar2() { ... foo() ... }
void goo (void) { bar1(); } // bar1 should have been inlined
So, while the immediate asm user was marked as always inline that in turn
caused users of it to become non-inlined. I'm assuming the kernel guys
did proper measurements that they _really_ get some non-trivial speed
benefit by inlining bar1/bar2, but for some reasons (I didn't inquire)
didn't want to mark them all as inline as well.
Yeah that makes sense, like if this happens with the fixup stuff, it will
quickly spiral out of control.
Segher Boessenkool
2018-10-08 10:02:46 UTC
Post by Richard Biener
Post by Segher Boessenkool
All of a targets register names and alternative register names are
allowed in the clobber list. Will that never conflict with an attribute
name? We already *have* syntax for specifying attributes on an asm (on
*any* statement even), so mixing these two things has no advantage.
Heh, but I failed to make an example with attribute syntax working.
IIRC attributes do not work on stmts. What could work is to use
a #pragma though.
Apparently statement attributes currently(?) only work for null statements.
Oh well.


Segher
Segher Boessenkool
2018-10-09 14:53:30 UTC
Post by Richard Biener
Post by Segher Boessenkool
Post by Michael Matz
Post by Segher Boessenkool
Post by Borislav Petkov
1) inline asm ("...")
What would the semantics of this be?
The size of the inline asm wouldn't be counted towards the inliner size
limits (or be counted as "1").
That sounds like a good option.
Yes, I also like it for simplicity. It also avoids the requirement
of translating the number (in bytes?) given by the user to
"number of GIMPLE instructions" as needed by the inliner.
This patch implements this, for C only so far. And the syntax is
"asm inline", which is more in line with other syntax.

How does this look?


Segher



diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index 1f173fc..94b1d41 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -6287,7 +6287,7 @@ static tree
c_parser_asm_statement (c_parser *parser)
{
tree quals, str, outputs, inputs, clobbers, labels, ret;
- bool simple, is_goto;
+ bool simple, is_goto, is_inline;
location_t asm_loc = c_parser_peek_token (parser)->location;
int section, nsections;

@@ -6318,6 +6318,13 @@ c_parser_asm_statement (c_parser *parser)
is_goto = true;
}

+ is_inline = false;
+ if (c_parser_next_token_is_keyword (parser, RID_INLINE))
+ {
+ c_parser_consume_token (parser);
+ is_inline = true;
+ }
+
/* ??? Follow the C++ parser rather than using the
lex_untranslated_string kludge. */
parser->lex_untranslated_string = true;
@@ -6393,7 +6400,8 @@ c_parser_asm_statement (c_parser *parser)
c_parser_skip_to_end_of_block_or_statement (parser);

ret = build_asm_stmt (quals, build_asm_expr (asm_loc, str, outputs, inputs,
- clobbers, labels, simple));
+ clobbers, labels, simple,
+ is_inline));

error:
parser->lex_untranslated_string = false;
diff --git a/gcc/c/c-tree.h b/gcc/c/c-tree.h
index 017c01c..f5629300 100644
--- a/gcc/c/c-tree.h
+++ b/gcc/c/c-tree.h
@@ -677,7 +677,8 @@ extern tree build_compound_literal (location_t, tree, tree, bool,
extern void check_compound_literal_type (location_t, struct c_type_name *);
extern tree c_start_case (location_t, location_t, tree, bool);
extern void c_finish_case (tree, tree);
-extern tree build_asm_expr (location_t, tree, tree, tree, tree, tree, bool);
+extern tree build_asm_expr (location_t, tree, tree, tree, tree, tree, bool,
+ bool);
extern tree build_asm_stmt (tree, tree);
extern int c_types_compatible_p (tree, tree);
extern tree c_begin_compound_stmt (bool);
diff --git a/gcc/c/c-typeck.c b/gcc/c/c-typeck.c
index 9d09b8d..e013100 100644
--- a/gcc/c/c-typeck.c
+++ b/gcc/c/c-typeck.c
@@ -10064,7 +10064,7 @@ build_asm_stmt (tree cv_qualifier, tree args)
are subtly different. We use a ASM_EXPR node to represent this. */
tree
build_asm_expr (location_t loc, tree string, tree outputs, tree inputs,
- tree clobbers, tree labels, bool simple)
+ tree clobbers, tree labels, bool simple, bool is_inline)
{
tree tail;
tree args;
@@ -10182,6 +10182,7 @@ build_asm_expr (location_t loc, tree string, tree outputs, tree inputs,
as volatile. */
ASM_INPUT_P (args) = simple;
ASM_VOLATILE_P (args) = (noutputs == 0);
+ ASM_INLINE_P (args) = is_inline;

return args;
}
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index 83e2273..1a00fa3 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -2019,6 +2019,8 @@ dump_gimple_asm (pretty_printer *buffer, gasm *gs, int spc, dump_flags_t flags)
pp_string (buffer, "__asm__");
if (gimple_asm_volatile_p (gs))
pp_string (buffer, " __volatile__");
+ if (gimple_asm_inline_p (gs))
+ pp_string (buffer, " __inline__");
if (gimple_asm_nlabels (gs))
pp_string (buffer, " goto");
pp_string (buffer, "(\"");
diff --git a/gcc/gimple.h b/gcc/gimple.h
index a5dda93..8a58e07 100644
--- a/gcc/gimple.h
+++ b/gcc/gimple.h
@@ -137,6 +137,7 @@ enum gimple_rhs_class
enum gf_mask {
GF_ASM_INPUT = 1 << 0,
GF_ASM_VOLATILE = 1 << 1,
+ GF_ASM_INLINE = 1 << 2,
GF_CALL_FROM_THUNK = 1 << 0,
GF_CALL_RETURN_SLOT_OPT = 1 << 1,
GF_CALL_TAILCALL = 1 << 2,
@@ -3911,7 +3912,7 @@ gimple_asm_volatile_p (const gasm *asm_stmt)
}


-/* If VOLATLE_P is true, mark asm statement ASM_STMT as volatile. */
+/* If VOLATILE_P is true, mark asm statement ASM_STMT as volatile. */

static inline void
gimple_asm_set_volatile (gasm *asm_stmt, bool volatile_p)
@@ -3923,6 +3924,27 @@ gimple_asm_set_volatile (gasm *asm_stmt, bool volatile_p)
}


+/* Return true if asm statement ASM_STMT is marked inline. */
+
+static inline bool
+gimple_asm_inline_p (const gasm *asm_stmt)
+{
+ return (asm_stmt->subcode & GF_ASM_INLINE) != 0;
+}
+
+
+/* If INLINE_P is true, mark asm statement ASM_STMT as inline. */
+
+static inline void
+gimple_asm_set_inline (gasm *asm_stmt, bool inline_p)
+{
+ if (inline_p)
+ asm_stmt->subcode |= GF_ASM_INLINE;
+ else
+ asm_stmt->subcode &= ~GF_ASM_INLINE;
+}
+
+
/* If INPUT_P is true, mark asm ASM_STMT as an ASM_INPUT. */

static inline void
diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index 509fc2f..10b80f2 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -6315,6 +6315,7 @@ gimplify_asm_expr (tree *expr_p, gimple_seq *pre_p, gimple_seq *post_p)

gimple_asm_set_volatile (stmt, ASM_VOLATILE_P (expr) || noutputs == 0);
gimple_asm_set_input (stmt, ASM_INPUT_P (expr));
+ gimple_asm_set_inline (stmt, ASM_INLINE_P (expr));

gimplify_seq_add_stmt (pre_p, stmt);
}
diff --git a/gcc/ipa-icf-gimple.c b/gcc/ipa-icf-gimple.c
index ba39ea3..5361139 100644
--- a/gcc/ipa-icf-gimple.c
+++ b/gcc/ipa-icf-gimple.c
@@ -993,6 +993,9 @@ func_checker::compare_gimple_asm (const gasm *g1, const gasm *g2)
if (gimple_asm_input_p (g1) != gimple_asm_input_p (g2))
return false;

+ if (gimple_asm_inline_p (g1) != gimple_asm_inline_p (g2))
+ return false;
+
if (gimple_asm_ninputs (g1) != gimple_asm_ninputs (g2))
return false;

diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 9134253..6b1d2ea 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -4108,6 +4108,8 @@ estimate_num_insns (gimple *stmt, eni_weights *weights)
with very long asm statements. */
if (count > 1000)
count = 1000;
+ if (gimple_asm_inline_p (as_a <gasm *> (stmt)))
+ count = !!count;
return MAX (1, count);
}

diff --git a/gcc/tree.h b/gcc/tree.h
index 2e45f9d..160b3a7 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1245,6 +1245,9 @@ extern tree maybe_wrap_with_location (tree, location_t);
ASM_OPERAND with no operands. */
#define ASM_INPUT_P(NODE) (ASM_EXPR_CHECK (NODE)->base.static_flag)
#define ASM_VOLATILE_P(NODE) (ASM_EXPR_CHECK (NODE)->base.public_flag)
+/* Nonzero if we want to consider this asm as minimum length and cost
+ for inlining decisions. */
+#define ASM_INLINE_P(NODE) (ASM_EXPR_CHECK (NODE)->base.protected_flag)

/* COND_EXPR accessors. */
#define COND_EXPR_COND(NODE) (TREE_OPERAND (COND_EXPR_CHECK (NODE), 0))
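
For illustration, use of the new keyword would look like this
(hypothetical example, not part of the patch; the section name is made
up):

/* With this patch, "asm inline" makes the inliner cost the statement
 * as minimum size, no matter how many lines the template expands to. */
static inline void trace_point(void)
{
	asm inline ("1: nop\n"
		    ".pushsection .discard.trace, \"a\"\n"
		    ".quad 1b\n"
		    ".popsection");
}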
Richard Biener
2018-10-10 07:12:48 UTC
Post by Segher Boessenkool
Post by Richard Biener
Post by Segher Boessenkool
Post by Michael Matz
Post by Segher Boessenkool
Post by Borislav Petkov
1) inline asm ("...")
What would the semantics of this be?
The size of the inline asm wouldn't be counted towards the inliner size
limits (or be counted as "1").
That sounds like a good option.
Yes, I also like it for simplicity. It also avoids the requirement
of translating the number (in bytes?) given by the user to
"number of GIMPLE instructions" as needed by the inliner.
This patch implements this, for C only so far. And the syntax is
"asm inline", which is more in line with other syntax.
How does this look?
Looks good. A few nits - you need to document this in extend.texi, the
tree flag use needs documenting in tree-core.h, and we need a testcase
(I'd suggest one that shows we inline a function with "large" asm inline
() even at -Os).

Oh, and I don't think we want C and C++ to diverge - so you need to
cook up C++ support as well.

Can kernel folks give this a second and third thought please so we
don't implement sth that in the end won't satisfy you guys?

Thanks for doing this,
Richard.
--
Richard Biener <***@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Segher Boessenkool
2018-10-10 07:53:54 UTC
Post by Richard Biener
Post by Segher Boessenkool
Post by Richard Biener
Post by Segher Boessenkool
Post by Michael Matz
Post by Segher Boessenkool
Post by Borislav Petkov
1) inline asm ("...")
What would the semantics of this be?
The size of the inline asm wouldn't be counted towards the inliner size
limits (or be counted as "1").
That sounds like a good option.
Yes, I also like it for simplicity. It also avoids the requirement
of translating the number (in bytes?) given by the user to
"number of GIMPLE instructions" as needed by the inliner.
This patch implements this, for C only so far. And the syntax is
"asm inline", which is more in line with other syntax.
How does this look?
Looks good. A few nits - you need to document this in extend.texi, the
Yup.
Post by Richard Biener
tree flag use needs documenting in tree-core.h,
Ah yes.
Post by Richard Biener
and we need a testcase
(I'd suggest one that shows we inline a function with "large" asm inline
() even at -Os).
I have one. Oh, and I probably should do a comment at the one line of
code that isn't just bookkeeping ;-)
Post by Richard Biener
Oh, and I don't think we want C and C++ to diverge - so you need to
cook up C++ support as well.
Right, that's why I said "C only so far".
Post by Richard Biener
Can kernel folks give this a second and third thought please so we
don't implement sth that in the end won't satisfy you guys?
Or actually try it out and see if it has the desired effect! Nothing
beats field trials.

I'll do the C++ thing today hopefully, and send things to gcc-***@.


Segher
Segher Boessenkool
2018-10-10 08:03:25 UTC
Post by Richard Biener
Can kernel folks give this a second and third thought please so we
don't implement sth that in the end won't satisfy you guys?
So this basically passes '0 size' to the inliner, which should be better
than passing in the explicit size, as we'd inevitably get it wrong
in some cases.
The code immediately after this makes it size 1, even for things like
asm(""); I suppose this works better for the inliner. But that's a detail
(and it might change); the description says "consider this asm as minimum
length and cost for inlining decisions", which works for either 0 or 1.
I also like 'size 0' for the reason that we tend to write assembly code
and mark it 'inline' if we really think it matters to performance,
so making it more likely to be inlined when used within another inline
function is a plus as well.
You can think of it as meaning "we want this asm inlined always", and then
whether that actually happens depends on if the function around it is
inlined or not.


Segher
Borislav Petkov
2018-10-10 08:19:06 UTC
Post by Segher Boessenkool
The code immediately after this makes it size 1, even for things like
asm(""), I suppose this works better for the inliner. But that's a detail
(and it might change); the description says "consider this asm as minimum
length and cost for inlining decisions", which works for either 0 or 1.
Thanks for implementing this, much appreciated. If you need people to
test stuff, lemme know.
Post by Segher Boessenkool
You can think of it as meaning "we want this asm inlined always", and then
whether that actually happens depends on if the function around it is
inlined or not.
My only concern is how we would catch the other extreme, where the
inline asm grows too big and we end up inlining it everywhere and thus
getting fat. The 0day bot already builds tinyconfigs but we should be
looking at vmlinux size growth too.

Thx.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Richard Biener
2018-10-10 08:35:54 UTC
Post by Borislav Petkov
Post by Segher Boessenkool
The code immediately after this makes it size 1, even for things like
asm(""), I suppose this works better for the inliner. But that's a detail
(and it might change); the description says "consider this asm as minimum
length and cost for inlining decisions", which works for either 0 or 1.
Thanks for implementing this, much appreciated. If you need people to
test stuff, lemme know.
Post by Segher Boessenkool
You can think of it as meaning "we want this asm inlined always", and then
whether that actually happens depends on if the function around it is
inlined or not.
My only concern is how we would catch the other extremity where the
inline asm grows too big and we end up inlining it everywhere and thus
getting fat. The 0day bot already builds tinyconfigs but we should be
looking at vmlinux size growth too.
Well, it's like always-inline functions: you have to be careful
and _not_ put it everywhere... It's also somewhat different from
always-inline functions, as those lose their special-ness once
inlined (the inlined body is properly accounted for in size).

Richard.
Segher Boessenkool
2018-10-10 18:54:33 UTC
Post by Borislav Petkov
Post by Segher Boessenkool
The code immediately after this makes it size 1, even for things like
asm(""), I suppose this works better for the inliner. But that's a detail
(and it might change); the description says "consider this asm as minimum
length and cost for inlining decisions", which works for either 0 or 1.
Thanks for implementing this, much appreciated. If you need people to
test stuff, lemme know.
It would be great to hear from kernel people if it works adequately for
what you guys want it for :-)
Post by Borislav Petkov
Post by Segher Boessenkool
You can think of it as meaning "we want this asm inlined always", and then
whether that actually happens depends on if the function around it is
inlined or not.
My only concern is how we would catch the other extremity where the
inline asm grows too big and we end up inlining it everywhere and thus
getting fat. The 0day bot already builds tinyconfigs but we should be
looking at vmlinux size growth too.
But this isn't really different from other always_inline concerns afaics?
So you should be able to catch it the same way, too.


Segher
Borislav Petkov
2018-10-10 19:14:27 UTC
Post by Segher Boessenkool
It would be great to hear from kernel people if it works adequately for
what you guys want it for :-)
Sure, ping me when you have the final version and I'll try to build gcc
with it and do some size comparisons.

Thx.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Borislav Petkov
2018-10-13 19:33:35 UTC
Ok,

with Segher's help I've been playing with his patch on top of bleeding
edge gcc 9 and here are my observations. Please double-check me for
booboos so that they can be addressed while there's time.

So here's what I see ontop of 4.19-rc7:

First, I marked the alternative asm() as inline and undeffed the "inline"
keyword. I needed to do that because of the funky games we do with
"inline" redefinitions in include/linux/compiler_types.h.

And Segher hinted at either doing:

asm volatile inline(...

or

asm volatile __inline__(...

but both "inline" variants are defined as macros in that file.

Which means we either need to #undef inline before using it in asm() or
come up with something cleverer.

Anyway, did this:

---
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 4cd6a3b71824..7c0639087da7 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -165,11 +165,13 @@ static inline int alternatives_text_reserved(void *start, void *end)
* For non barrier like inlines please define new variants
* without volatile and memory clobber.
*/
+
+#undef inline
#define alternative(oldinstr, newinstr, feature) \
- asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
+ asm volatile inline(ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")

#define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
- asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
+ asm volatile inline(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")

/*
* Alternative inline assembly with input.
---

Build failed at link time with:

arch/x86/boot/compressed/cmdline.o: In function `native_save_fl':
cmdline.c:(.text+0x0): multiple definition of `native_save_fl'
arch/x86/boot/compressed/misc.o:misc.c:(.text+0x1e0): first defined here
arch/x86/boot/compressed/cmdline.o: In function `native_restore_fl':
cmdline.c:(.text+0x10): multiple definition of `native_restore_fl'
arch/x86/boot/compressed/misc.o:misc.c:(.text+0x1f0): first defined here
arch/x86/boot/compressed/error.o: In function `native_save_fl':
error.c:(.text+0x0): multiple definition of `native_save_fl'

which I had to fix with this:

---
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 15450a675031..0d772598c37c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -14,8 +14,7 @@
*/

/* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */
-extern inline unsigned long native_save_fl(void);
-extern inline unsigned long native_save_fl(void)
+static inline unsigned long native_save_fl(void)
{
unsigned long flags;

@@ -33,8 +32,7 @@ ex
---

That "extern inline" declaration looks fishy to me anyway, maybe not really
needed ? Apparently, gcc < 4.9 complains with -Werror=missing-prototypes...

Then the build worked and the results look like this:

text data bss dec hex filename
17287384 5040656 2019532 24347572 17383b4 vmlinux-before
17288020 5040656 2015436 24344112 1737630 vmlinux-2nd-version

so some inlining must be happening.

Then I did this:

---
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 9c5606d88f61..a0170344cf08 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -20,7 +20,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
/* no memory constraint because it doesn't change any memory gcc knows
about */
stac();
- asm volatile(
+ asm volatile inline(
" testq %[size8],%[size8]\n"
" jz 4f\n"
"0: movq $0,(%[dst])\n"
---

to force inlining of a somewhat bigger asm() statement. And yap, more
got inlined:

text data bss dec hex filename
17287384 5040656 2019532 24347572 17383b4 vmlinux-before
17288020 5040656 2015436 24344112 1737630 vmlinux-2nd-version
17288076 5040656 2015436 24344168 1737668 vmlinux-2nd-version__clear_user

so more stuff gets inlined.

Looking at the asm output, it had before:

---
clear_user:
# ./arch/x86/include/asm/current.h:15: return this_cpu_read_stable(current_task);
#APP
# 15 "./arch/x86/include/asm/current.h" 1
movq %gs:current_task,%rax #, pfo_ret__
# 0 "" 2
# arch/x86/lib/usercopy_64.c:51: if (access_ok(VERIFY_WRITE, to, n))
#NO_APP
movq 2520(%rax), %rdx # pfo_ret___12->thread.addr_limit.seg, _1
movq %rdi, %rax # to, tmp93
addq %rsi, %rax # n, tmp93
jc .L3 #,
# arch/x86/lib/usercopy_64.c:51: if (access_ok(VERIFY_WRITE, to, n))
cmpq %rax, %rdx # tmp93, _1
jb .L3 #,
# arch/x86/lib/usercopy_64.c:52: return __clear_user(to, n);
jmp __clear_user #
---

note the JMP to __clear_user. After marking the asm() in __clear_user() as
inline, clear_user() inlines __clear_user() directly:

---
clear_user:
# ./arch/x86/include/asm/current.h:15: return this_cpu_read_stable(current_task);
#APP
# 15 "./arch/x86/include/asm/current.h" 1
movq %gs:current_task,%rax #, pfo_ret__
# 0 "" 2
# arch/x86/lib/usercopy_64.c:51: if (access_ok(VERIFY_WRITE, to, n))
#NO_APP
movq 2520(%rax), %rdx # pfo_ret___12->thread.addr_limit.seg, _1
movq %rdi, %rax # to, tmp95
addq %rsi, %rax # n, tmp95
jc .L8 #,
# arch/x86/lib/usercopy_64.c:51: if (access_ok(VERIFY_WRITE, to, n))
cmpq %rax, %rdx # tmp95, _1
jb .L8 #,
# ./arch/x86/include/asm/smap.h:58: alternative("", __stringify(__ASM_STAC), X86_FEATURE_SMAP);
...

this last line is the stac() macro, which gets inlined due to the
inlined alternative() macro above.

So I guess this all looks like what we wanted.

And if this lands in gcc9, we would need to do an asm_volatile() macro
which is defined differently based on the compiler used.
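
Something like this, perhaps (a sketch; CC_HAS_ASM_INLINE is a
hypothetical feature-test define, and the __inline spelling is used to
dodge the kernel's "inline" macro, provided that spelling isn't itself
redefined):

#ifdef CC_HAS_ASM_INLINE	/* hypothetical feature test */
# define asm_inline	asm __inline
#else
# define asm_inline	asm
#endif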

Thoughts, suggestions, etc are most welcome.

Thx.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Borislav Petkov
2018-10-13 21:30:15 UTC
I apologize for coming in late here with an alternative proposal, but would
you be happy if GCC gave you a way to designate a portion of the asm template
string that shouldn't be counted as its cost because it doesn't go into the
.text section? This wouldn't interact with your redefinitions of the inline
keyword, and you could do something like (assuming we go with %` ... %`
delimiters)
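(A sketch of the idea; purely hypothetical, since the delimiter syntax
and exact rules were still under discussion. Everything between the
markers would be excluded from the size estimate:)

---
asm ("insn1\n\t"
     "%`"
     ".pushsection .discard.example\n\t"
     ".quad 0\n\t"
     ".popsection\n\t"
     "%`"
     "insn2");
---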
I don't mind it but I see you guys are still discussing what would be
the better solution here, on the gcc ML. And we, as one user, are a
happy camper as long as it does what it is meant to do. But how the
feature looks syntactically is something for the gcc folks to decide, as
they're going to support it for the foreseeable future, and I'm very well
aware of how important it is for a supportable feature to be designed
properly.

Thx.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Borislav Petkov
2018-10-25 10:24:05 UTC
Permalink
Ping.

This is a good point in time, methinks, where kernel folk on CC here
should have a look at this and speak up whether it is useful for us in
this form.

Frankly, I'm a bit unsure about us using this while supporting both
old compilers which don't have it and new compilers which do. Because
the old compilers should get to see the now upstreamed assembler macros
and the new compilers will simply inline the "asm volatile inline"
constructs. And how ugly that would become...

I guess we can try this with smaller macros first and keep them all
nicely hidden in header files.
Post by Borislav Petkov
Ok,
with Segher's help I've been playing with his patch ontop of bleeding
edge gcc 9 and here are my observations. Please double-check me for
booboos so that they can be addressed while there's time.
First marked the alternative asm() as inline and undeffed the "inline"
keyword. I need to do that because of the funky games we do with
"inline" redefinitions in include/linux/compiler_types.h.
asm volatile inline(...
or
asm volatile __inline__(...
but both "inline" variants are defined as macros in that file.
Which means we either need to #undef inline before using it in asm() or
come up with something cleverer.
---
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 4cd6a3b71824..7c0639087da7 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -165,11 +165,13 @@ static inline int alternatives_text_reserved(void *start, void *end)
* For non barrier like inlines please define new variants
* without volatile and memory clobber.
*/
+
+#undef inline
#define alternative(oldinstr, newinstr, feature) \
- asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
+ asm volatile inline(ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
#define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
- asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
+ asm volatile inline(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
/*
* Alternative inline assembly with input.
---
cmdline.c:(.text+0x0): multiple definition of `native_save_fl'
arch/x86/boot/compressed/misc.o:misc.c:(.text+0x1e0): first defined here
cmdline.c:(.text+0x10): multiple definition of `native_restore_fl'
arch/x86/boot/compressed/misc.o:misc.c:(.text+0x1f0): first defined here
error.c:(.text+0x0): multiple definition of `native_save_fl'
---
[...]
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Segher Boessenkool
2018-10-31 16:31:16 UTC
Permalink
Post by Borislav Petkov
text data bss dec hex filename
17385183 5064780 1953892 24403855 1745f8f defconfig-build/vmlinux
17385678 5064780 1953892 24404350 174617e defconfig-build/vmlinux
Which shows we inline more (look for asm_volatile for the actual
changes).
So yes, this seems like something we could work with.
Great to hear. Note that with my latest patches
(see https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01931.html ) there
is no required order "asm volatile inline" anymore, so you can just say
"asm inline volatile". (And similar for "goto" as well).


Segher
Joe Perches
2018-11-01 05:20:00 UTC
Permalink
Post by Borislav Petkov
text data bss dec hex filename
17385183 5064780 1953892 24403855 1745f8f defconfig-build/vmlinux
17385678 5064780 1953892 24404350 174617e defconfig-build/vmlinux
Which shows we inline more (look for asm_volatile for the actual
changes).
[]
Post by Borislav Petkov
scripts/checkpatch.pl | 8 ++---
scripts/genksyms/keywords.c | 4 +--
scripts/kernel-doc | 4 +--
I believe these should be excluded from the conversions.

Other than that, generic conversion by script seems a good idea.
Joe Perches
2018-11-01 09:20:40 UTC
Permalink
Post by Joe Perches
Post by Borislav Petkov
text data bss dec hex filename
17385183 5064780 1953892 24403855 1745f8f defconfig-build/vmlinux
17385678 5064780 1953892 24404350 174617e defconfig-build/vmlinux
Which shows we inline more (look for asm_volatile for the actual
changes).
[]
Post by Borislav Petkov
scripts/checkpatch.pl | 8 ++---
scripts/genksyms/keywords.c | 4 +--
scripts/kernel-doc | 4 +--
I believe these should be excluded from the conversions.
Probably, yes. It compiled, which was all I cared about :-)
BTW, if we do that conversion, we should upgrade the checkpatch warning to
an error, I suppose.
More like remove it altogether, as __inline and __inline__
will no longer be #defined.

$ git grep -P 'define\s+__inline'
include/linux/compiler_types.h:#define __inline__ inline
include/linux/compiler_types.h:#define __inline inline
Richard Biener
2018-10-10 10:29:03 UTC
Permalink
Post by Segher Boessenkool
Post by Richard Biener
Can kernel folks give this a second and third thought please so we
don't implement sth that in the end won't satisfy you guys?
So this basically passes '0 size' to the inliner, which should be better
than passing in the explicit size, as we'd inevitably get it wrong
in cases.
The code immediately after this makes it size 1, even for things like
asm(""), I suppose this works better for the inliner. But that's a detail
(and it might change); the description says "consider this asm as minimum
length and cost for inlining decisions", which works for either 0 or 1.
It was made 1 as otherwise the inliner happily explodes on

void foo () { asm(""); foo(); }
Post by Segher Boessenkool
I also like 'size 0' for the reason that we tend to write assembly code
and mark it 'inline' if we really think it matters to performance,
so making it more likely to be inlined when used within another inline
function is a plus as well.
You can think of it as meaning "we want this asm inlined always", and then
whether that actually happens depends on if the function around it is
inlined or not.
Richard.
Nadav Amit
2018-10-10 16:31:41 UTC
Permalink
Post by Segher Boessenkool
Post by Richard Biener
Post by Segher Boessenkool
Post by Michael Matz
Post by Segher Boessenkool
Post by Borislav Petkov
1) inline asm ("...")
What would the semantics of this be?
The size of the inline asm wouldn't be counted towards the inliner size
limits (or be counted as "1").
That sounds like a good option.
Yes, I also like it for simplicity. It also avoids the requirement
of translating the number (in bytes?) given by the user to
"number of GIMPLE instructions" as needed by the inliner.
This patch implements this, for C only so far. And the syntax is
"asm inline", which is more in line with other syntax.
How does this look?
It looks good to me in general. I have a couple of reservations, but I
suspect you will not want to address them:

1. It is not backward compatible, requiring a C macro to wrap it, as the
kernel might be built with different compilers.

2. It is specific to asm. I do not have in mind another use case (excluding
the __builtin_constant_p), but it would be nicer IMHO to have a builtin
saying “ignore the cost of this statement” for the matter of optimizations.

Regards,
Nadav
Segher Boessenkool
2018-10-10 19:21:12 UTC
Permalink
Hi Nadav,
Post by Nadav Amit
Post by Segher Boessenkool
How does this look?
It looks good to me in general. I have a couple of reservations, but I
1. It is not backward compatible, requiring a C macro to wrap it, as the
kernel might be built with different compilers.
How *could* it be backward compatible? There should be an error or at least
a warning if the compiler does not support this, in general.

For the kernel, the kernel already has plenty of infrastructure to support
this (compiler.h etc.) For other applications it is quite trivial, too.
Post by Nadav Amit
2. It is specific to asm.
Yes, and that is on purpose.
Post by Nadav Amit
I do not have in mind another use case (excluding
the __builtin_constant_p), but it would be nicer IMHO to have a builtin
saying “ignore the cost of this statement” for the matter of optimizations.
That is a hundred or a thousand times more work to design and implement
(including testing etc.) I'm not going to do it, but feel free to try
yourself!


Segher
Richard Biener
2018-10-11 07:04:58 UTC
Permalink
Post by Nadav Amit
Post by Segher Boessenkool
[...]
This patch implements this, for C only so far. And the syntax is
"asm inline", which is more in line with other syntax.
How does this look?
It looks good to me in general. I have a couple of reservations, but I
1. It is not backward compatible, requiring a C macro to wrap it, as the
kernel might be built with different compilers.
2. It is specific to asm. I do not have in mind another use case (excluding
the __builtin_constant_p), but it would be nicer IMHO to have a builtin
saying “ignore the cost of this statement” for the matter of optimizations.
The only easy possibility that comes to my mind is sth like

__attribute__((always_inline, zero_cost)) foo ()
{
... your stmts ...
}

and us, upon inlining, marking the inlined stmts properly. That would
also work for the asm() case and it would be backwards compatible
(well, you'd get a diagnostic for the unknown zero_cost attribute).

There's the slight complication that if you have, say

_1 = _2 * 3; // zero-cost
_4 = _1 * 2;

and optimization ends up combining those to

_4 = _2 * 6;

then is this stmt supposed to be zero-cost or not? Compare that to

_1 = _2 * 3;
_4 = _1 * 2; // zero-cost

So outside of asm() there are new issues that come up with respect
to expected (cost) semantics.

Richard.
Richard Biener
2018-11-29 13:09:25 UTC
Permalink
But, I'd like to ask if x86 people want to keep this macros.s approach,
or revert 77b0bf55bc675 right now, assuming the compiler will eventually
solve the issue?
Yap, considering how elegant the compiler solution is and how many
problems this macros-in-asm thing causes, I think we should patch
out the latter and wait for gcc9. I mean, the savings are not so
mind-blowing to have to deal with the fallout.
But this is just my opinion.
I'd be not opposed to backporting the asm inline support.

Of course we still have to be happy with it and install the patch ;)

Are you (kernel folks) happy with asm inline ()?

Richard.
Thx.
--
Richard Biener <***@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Richard Biener
2018-11-29 13:24:55 UTC
Permalink
Post by Richard Biener
I'd be not opposed to backporting the asm inline support.
Even better! :-)
Post by Richard Biener
Of course we still have to be happy with it and install the patch ;)
Are you (kernel folks) happy with asm inline ()?
OK, Segher - can you ping the latest version of the patch please?

Thanks,
Richard.
Segher Boessenkool
2018-11-29 12:25:02 UTC
Permalink
On Wed, Oct 10, 2018 at 1:14 AM Segher Boessenkool
Post by Segher Boessenkool
[...]
This patch implements this, for C only so far. And the syntax is
"asm inline", which is more in line with other syntax.
How does this look?
Thank you very much for your work.
https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01932.html
How is the progress of this in GCC ML?
Latest patch was pinged a few times:
https://gcc.gnu.org/ml/gcc-patches/2018-11/msg01569.html .

I'll ping it again. Will fix the subject as well if I remember to, sigh.
I am really hoping the issue will be solved by the compiler
instead of by the in-kernel workaround.
This will only be fixed from GCC 9 on, if the compiler adopts it. The
kernel wants to support ancient GCC, so it will need to have a workaround
for older GCC versions anyway.


Segher
Segher Boessenkool
2018-11-30 13:16:34 UTC
Permalink
Post by Segher Boessenkool
This will only be fixed from GCC 9 on, if the compiler adopts it. The
kernel wants to support ancient GCC, so it will need to have a workaround
for older GCC versions anyway.
What about backporting it, like Richard says?
Let me first get it into GCC trunk :-)

It should backport fine, sure; and I'll work on that.


Segher

Segher Boessenkool
2018-10-08 08:18:57 UTC
Permalink
Post by Michael Matz
(without a built-in assembler, which hopefully no one proposes).
There are disadvantages (the main one is having to implement it), but a built-in assembler has
- Better optimizations: for example -Os could more accurately estimate true instruction size.
GCC already knows the exact instruction size in almost all cases, for most
backends. It is an advantage to not *have to* keep track of exact insn
size in all cases.
- Better inlining: as the examples in this thread are showing.
That's a red herring; the actual issue is that inlining currently makes some
spectacularly wrong decisions for code involving asm. That would not be solved
by outputting binary code instead of assembler code. All those decisions
are made long before code is output.
- Better padding/alignment: right now GCC has no notion about the precise cache layout of the
assembly code it generates and the code alignment options it has are crude.
This isn't true. Maybe some targets do not care.

And of course GCC only knows this as far as it knows the alignments of the
sections it outputs into; for example, ASLR is a nice performance killer
at times. And if your linker scripts align sections to less than a cache
line things do not look rosy either, etc.
It got away with
this so far because the x86 rule of thumb is that dense code is usually the right choice.
- Better compiler performance: it would be faster as well to immediately emit assembly
instructions, just as GCC's use of the preprocessor as a library speeds up compilation
*significantly* compared with creating a separate preprocessor task.
So how much faster do you think it would be? Do you have actual numbers?
And no, -O0 does not count.
- Better future integration of assembly blocks: GCC could begin to actually understand the
assembly statements in inline asm and allow more user-friendly extensions to its
historically complex and difficult to master inline asm syntax.
If you want to add a different kind of inline assembler, you can already.
There is no need for outputting binary code for that.
I mean, it's a fact that the GNU project has *already* defined their own assembly syntax which
departs from decades old platform assembly syntax
GCC's asm syntax is over three decades old itself. When it was defined
all other C inline assembler syntaxes were much younger than that. I don't
see what your argument is here.
- and how the assembler is called by the
compiler is basically an implementation detail, not a conceptual choice.
But the *format* of the interface representation, in this case textual
assembler code, is quite fundamental.
The random
multi-process unidirectional assembler choice of the past should not be treated as orthodoxy.
Then I have good news for you: no assembler works that way these days, and
that has been true for decades.


Segher
Nadav Amit
2018-10-07 16:13:19 UTC
Permalink
Second try after being blocked by the gcc mailing list:

at 9:09 AM, Nadav Amit <***@vmware.com> wrote:

at 2:18 AM, Borislav Petkov <***@alien8.de> wrote:

[...]

Thanks for taking care of it. I would like to mention a second issue, since
you may want to resolve both with a single solution: not inlining
conditional __builtin_constant_p(), in which there are two code-paths - one
for constants and one for variables.

Consider for example the Linux kernel ilog2 macro, which has a condition
based on __builtin_constant_p() (
https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/log2.h#L160
). When estimating the code size, the compiler mistakenly counts the “heavy”
code-path that is supposed to be evaluated only at compilation time. This
causes functions such as kmalloc() to be considered “big”. kmalloc() is
marked with the always_inline attribute, so instead the calling functions,
such as kzalloc(), are not inlined.
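
The shape of the problem, reduced to a sketch (illustrative, not the
kernel's actual macro):

---
/* Illustrative only: the constant arm folds away completely when n is a
 * compile-time constant, but its size is still counted by the inliner. */
static inline int my_ilog2(unsigned long n)
{
	if (__builtin_constant_p(n))
		return n < 2 ? 0 :
		       n < 4 ? 1 :
		       n < 8 ? 2 :
		       /* ...one comparison per bit position... */
		       63;
	return 8 * sizeof(n) - 1 - __builtin_clzl(n);
}
---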

When I thought about hacking gcc to solve this issue, I considered an
intrinsic that would override the cost of a given statement. This solution
is not too nice, but may solve both issues.

In addition, note that AFAIU a wrong code cost estimation can also impact
loop and other optimizations.

Regards,
Nadav
Richard Biener
2018-10-07 16:46:24 UTC
Permalink
Post by Nadav Amit
[...]
Consider for example the Linux kernel ilog2 macro, which has a condition
based on __builtin_constant_p() (
https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/log2.h#L160
). When estimating the code size, the compiler mistakenly counts the “heavy”
code-path that is supposed to be evaluated only at compilation time.
But this is a misconception about __builtin_constant_p. It doesn't guard sth like 'constexpr' regions. If you try to use it with those semantics you'll fail (apparently you do).

Of course IPA CP code size estimates when seeing a constant fed to bcp might be not optimal, that's another issue of course.

Richard.
Nadav Amit
2018-10-07 19:06:07 UTC
Permalink
Post by Richard Biener
Post by Borislav Petkov
Post by Borislav Petkov
Hi people,
this is an attempt to see whether gcc's inline asm heuristic when
estimating inline asm statements' cost for better inlining can be
improved.
AFAIU, the problematic arises when one ends up using a lot of inline
asm statements in the kernel but due to the inline asm cost
estimation
Post by Borislav Petkov
heuristic which counts lines, I think, for example like in this here
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Farch%2Fx86%2Finclude%2Fasm%2Fcpufeature.h%23n162&amp;data=02%7C01%7Cnamit%40vmware.com%7C860403cecb874db64b7e08d62c746f46%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636745275975505381&amp;sdata=Nd0636K9Z1IsUs1RWSRAhVuVboLxlBCB4peiAMfmQzQ%3D&amp;reserved=0
Post by Borislav Petkov
the resulting code ends up not inlining the functions themselves
which
Post by Borislav Petkov
use this macro. I.e., you see a CALL <function> instead of its body
getting inlined directly.
Even though it should be because the actual instructions are only a
couple in most cases and all those other directives end up in another
section anyway.
The issue is explained below in the forwarded mail in a larger detail
too.
1) inline asm ("...")
2) asm ("..." : : : : <size-expr>)
3) asm ("...") __attribute__((asm_size(<size-expr>)));
with which user can tell gcc what the size of that inline asm
statement
Post by Borislav Petkov
is and thus allow for more precise cost estimation and in the end
better
Post by Borislav Petkov
inlining.
And FWIW 3) looks pretty straight-forward to me because attributes
are
Post by Borislav Petkov
pretty common anyways.
But I'm sure there are other options and I'm sure people will have
better/different ideas so feel free to chime in.
Thanks for taking care of it. I would like to mention a second issue,
since
you may want to resolve both with a single solution: not inlining
conditional __builtin_constant_p(), in which there are two code-paths -
one
for constants and one for variables.
Consider for example the Linux kernel ilog2 macro, which has a
condition
based on __builtin_constant_p() (
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Felixir.bootlin.com%2Flinux%2Fv4.19-rc7%2Fsource%2Finclude%2Flinux%2Flog2.h%23L160&amp;data=02%7C01%7Cnamit%40vmware.com%7C860403cecb874db64b7e08d62c746f46%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636745275975515386&amp;sdata=Hk39Za9%2FxcFyK0sGENB24d6QySjsDGzF%2FwqjnUEMiGk%3D&amp;reserved=0
). The compiler mistakenly considers the “heavy” code-path that is
supposed
to be evaluated only in compilation time to evaluate the code size.
But this is a misconception about __builtin_constant_p. It doesn't guard sth like 'constexpr' regions. If you try to use it with those semantics you'll fail (apparently you do).
Of course IPA CP code size estimates when seeing a constant fed to bcp might be not optimal, that's another issue of course.
I understand that this might not be the right way to implement macros
such as ilog2() and test_bit(), but this code has been around for some time.

I thought of using __builtin_choose_expr() instead of the ternary operator,
but this introduces a different problem, as the variable version is used
instead of the constant one in many cases. From my brief experiments with
llvm, it appears that llvm has neither of these issues (wrong cost
attributed to inline asm and conditions based on __builtin_constant_p()).
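
For reference, that variant would look roughly like this (a sketch, with
const_ilog2() and var_ilog2() as stand-in names; __builtin_choose_expr()
is resolved in the front end, where __builtin_constant_p(n) is still 0 for
most arguments, which is why the variable version wins):

---
/* Sketch: const_ilog2() and var_ilog2() are hypothetical stand-ins. */
#define my_ilog2(n)						\
	__builtin_choose_expr(__builtin_constant_p(n),		\
			      const_ilog2(n),			\
			      var_ilog2(n))
---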

So what alternative do you propose to implement ilog2() like behavior? I was
digging through the gcc code to find a workaround with no success.
Jeff Law
2018-10-07 19:52:24 UTC
Permalink
Post by Nadav Amit
[...]
I understand that this might not be the right way to implement macros
such as ilog2() and test_bit(), but this code has been around for some time.
That doesn't make it right -- and there have been numerous bogus bugs
reported against ilog2 because the authors of ilog2 haven't had a clear
understanding of the semantics of builtin_constant_p.


Jeff
Richard Biener
2018-10-08 07:46:49 UTC
Permalink
Post by Nadav Amit
[...]
I understand that this might not be the right way to implement macros
such as ilog2() and test_bit(), but this code has been around for some time.
I thought of using __builtin_choose_expr() instead of the ternary operator,
but this introduces a different problem, as the variable version is used
instead of the constant one in many cases. From my brief experiments with
llvm, it appears that llvm has neither of these issues (wrong cost
attributed to inline asm and conditions based on __builtin_constant_p()).
So what alternative do you propose to implement ilog2() like behavior? I was
digging through the gcc code to find a workaround with no success.
1) Don't try to cheat the compilers constant propagation abilities
2) Use a language that allows this (C++)
3) Define (and implement) the corresponding GNU C extension

__builtin_constant_p() isn't the right fit (I wonder what it was
implemented for in the first place though...).

I suppose you want sth like

if (__builtin_constant_p (x))
return __constexpr ...;

or use a call and have constexpr functions. Note it wouldn't be
C++-constexpr like since you want the constexpr evaluation to
happen very late in the compilation to benefit from optimizations
and you are fine with the non-constexpr path.

Properly defining a language extension is hard.

Richard.
Post by Nadav Amit
Thanks,
Nadav
--
Richard Biener <***@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
David Laight
2018-10-09 06:56:19 UTC
Permalink
From: Michael Matz
Sent: 07 October 2018 16:53
...
static inline void foo() { asm("large-looking-but-small-asm"); }
static void bar1() { ... foo() ... }
static void bar2() { ... foo() ... }
void goo (void) { bar1(); } // bar1 should have been inlined
So, while the immediate asm user was marked as always inline, that in turn
caused users of it to become non-inlined. I'm assuming the kernel guys
did proper measurements that they _really_ get some non-trivial speed
benefit by inlining bar1/bar2, but for some reason (I didn't inquire)
didn't want to mark them all as inline as well.
Could you add a 'size' attribute to the 'always inlined' foo() above
rather than trying to add one to the asm() statement itself.
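Something along these lines, perhaps (hypothetical syntax; neither the
attribute name nor its units exist today):

---
/* Hypothetical: asm_size() does not exist; 4 would be a size estimate. */
static __attribute__((always_inline, asm_size(4))) inline
void foo(void)
{
	asm("large-looking-but-small-asm");
}
---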
Then add a warning in the documentation that small size attributes
might make the assembly fail due to limited branch offsets (etc).
Size '1' probably ought to be reserved for things that definitely
fit in a delay slot.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)