[RFC] Kernel livepatching support in GCC

Discussion:

[RFC] Kernel livepatching support in GCC

Maxim Kuvyrkov

2015-05-28 08:39:27 UTC

Hi,

Akashi-san and I have been discussing required GCC changes to make kernel's livepatching work for AArch64 and other architectures. At the moment livepatching is supported for x86[_64] using the following options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is geek-speak for "please add several NOPs at the very beginning of each function, and make a section with addresses of all those NOP pads".

The above long-ish list of options is a historical artifact of how livepatching support evolved for x86. The end result is that for livepatching (or ftrace, or possible future kernel features) to work compiler needs to generate a little bit of empty code space at the beginning of each function. Kernel can later use that space to insert call sequences for various hooks.

Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount options to other architectures, we should implement a target-independent option -fprolog-pad=N, which will generate a pad of N nops at the beginning of each function and add a section entry describing the pad similar to -mrecord-mcount [1].

Since adding NOPs is much less architecture-specific then outputting call instruction sequences, this option can be handled in a target-independent way at least for some/most architectures.

Comments?

As I found out today, the team from Huawei has implemented [2], which follows x86 example of -mfentry option generating a hard-coded call sequence. I hope that this proposal can be easily incorporated into their work since most of the livepatching changes are in the kernel.

[1] Technically, generating a NOP pad and adding a section entry in .__mcount_loc are two separate actions, so we may want to have a -fprolog-pad-record option. My instinct is to stick with a single option for now, since we can always add more later.

[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/346905.html

--
Maxim Kuvyrkov
www.linaro.org

Richard Biener

2015-05-28 08:59:10 UTC

Post by Maxim Kuvyrkov
Hi,
Akashi-san and I have been discussing required GCC changes to make
kernel's livepatching work for AArch64 and other architectures. At the
moment livepatching is supported for x86[_64] using the following
options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is
geek-speak for "please add several NOPs at the very beginning of each
function, and make a section with addresses of all those NOP pads".
The above long-ish list of options is a historical artifact of how
livepatching support evolved for x86. The end result is that for
livepatching (or ftrace, or possible future kernel features) to work
compiler needs to generate a little bit of empty code space at the
beginning of each function. Kernel can later use that space to insert
call sequences for various hooks.
Our proposal is that instead of adding
-mfentry/-mnop-count/-mrecord-mcount options to other architectures, we
should implement a target-independent option -fprolog-pad=N, which will
generate a pad of N nops at the beginning of each function and add a
section entry describing the pad similar to -mrecord-mcount [1].
Since adding NOPs is much less architecture-specific then outputting
call instruction sequences, this option can be handled in a
target-independent way at least for some/most architectures.
Comments?

Maybe follow s390 -mhotpatch instead?

Post by Maxim Kuvyrkov
As I found out today, the team from Huawei has implemented [2], which
follows x86 example of -mfentry option generating a hard-coded call
sequence. I hope that this proposal can be easily incorporated into
their work since most of the livepatching changes are in the kernel.
[1] Technically, generating a NOP pad and adding a section entry in
.__mcount_loc are two separate actions, so we may want to have a
-fprolog-pad-record option. My instinct is to stick with a single
option for now, since we can always add more later.
[2]
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/346905.html
--
Maxim Kuvyrkov
www.linaro.org

Maxim Kuvyrkov

2015-05-28 09:16:58 UTC

Post by Richard Biener

Post by Maxim Kuvyrkov
Hi,
Akashi-san and I have been discussing required GCC changes to make
kernel's livepatching work for AArch64 and other architectures. At the
moment livepatching is supported for x86[_64] using the following
options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is
geek-speak for "please add several NOPs at the very beginning of each
function, and make a section with addresses of all those NOP pads".
The above long-ish list of options is a historical artifact of how
livepatching support evolved for x86. The end result is that for
livepatching (or ftrace, or possible future kernel features) to work
compiler needs to generate a little bit of empty code space at the
beginning of each function. Kernel can later use that space to insert
call sequences for various hooks.
Our proposal is that instead of adding
-mfentry/-mnop-count/-mrecord-mcount options to other architectures, we
should implement a target-independent option -fprolog-pad=N, which will
generate a pad of N nops at the beginning of each function and add a
section entry describing the pad similar to -mrecord-mcount [1].
Since adding NOPs is much less architecture-specific then outputting
call instruction sequences, this option can be handled in a
target-independent way at least for some/most architectures.
Comments?

Maybe follow s390 -mhotpatch instead?

Regarding implementation of the option, it will follow what s390 is doing with function attributes to mark which functions to apply nop-treatment to (using attributes will avoid problems with [coming] LTO builds of the kernel). The new option will set value of the attribute on all functions in current compilation unit, and then nops will be generated from the attribute specification.

On the other hand, s390 does not generate a section of descriptor entries of NOP pads, which seems like a useful (or necessary) option. A more-or-less generic implementation should, therefore, combine s390's attributes approach to annotating functions and x86's approach to providing information in an ELF section about NOP entries. Or can we record value of a function attribute in ELF in a generic way?

Whatever the specifics, implementation of livepatch support should be decoupled from -pg/mcount dependency as I don't see any real need in overloading mcount with livepatching stuff.

--
Maxim Kuvyrkov
www.linaro.org

Andreas Krebbel

2015-05-28 15:37:53 UTC

...

Post by Maxim Kuvyrkov

Post by Richard Biener
Maybe follow s390 -mhotpatch instead?

Regarding implementation of the option, it will follow what s390 is doing with function attributes to mark which functions to apply nop-treatment to (using attributes will avoid problems with [coming] LTO builds of the kernel). The new option will set value of the attribute on all functions in current compilation unit, and then nops will be generated from the attribute specification.
On the other hand, s390 does not generate a section of descriptor entries of NOP pads, which seems like a useful (or necessary) option. A more-or-less generic implementation should, therefore, combine s390's attributes approach to annotating functions and x86's approach to providing information in an ELF section about NOP entries. Or can we record value of a function attribute in ELF in a generic way?

I agree that would be useful. The only reason we didn't implement that was that our kernel guys were
confident enough to be able to detect patchable functions without it. We discussed two solutions to
that:

1. Add special relocations pointing to the patchable areas.
2. Add a special section listing all patchable areas. I think systemtap is doing something similiar
for their probes.

Post by Maxim Kuvyrkov
Whatever the specifics, implementation of livepatch support should be decoupled from -pg/mcount dependency as I don't see any real need in overloading mcount with livepatching stuff.

Agreed.

For user space hotpatching we also needed hotpatching areas *before* the function label to emit
trampolines there. This perhaps should be covered by a generic approach as well.

-Andreas-

Ondřej Bílka

2015-06-04 07:15:31 UTC

Post by Andreas Krebbel
...

Post by Maxim Kuvyrkov

Post by Richard Biener
Maybe follow s390 -mhotpatch instead?

Regarding implementation of the option, it will follow what s390 is doing with function attributes to mark which functions to apply nop-treatment to (using attributes will avoid problems with [coming] LTO builds of the kernel). The new option will set value of the attribute on all functions in current compilation unit, and then nops will be generated from the attribute specification.
On the other hand, s390 does not generate a section of descriptor entries of NOP pads, which seems like a useful (or necessary) option. A more-or-less generic implementation should, therefore, combine s390's attributes approach to annotating functions and x86's approach to providing information in an ELF section about NOP entries. Or can we record value of a function attribute in ELF in a generic way?

I agree that would be useful. The only reason we didn't implement that was that our kernel guys were
confident enough to be able to detect patchable functions without it. We discussed two solutions to
1. Add special relocations pointing to the patchable areas.
2. Add a special section listing all patchable areas. I think systemtap is doing something similiar
for their probes.

As I am bit concerned with performance why require nops there? Add a
byte count number >= requested thats boundary of next instruction. When
lifepatching for return you need to copy this followed by jump back to next
instruction. Then gcc could fill that with instructions that don't
depend on address, fill with nops as trivial first implementation.

Would that be possible?

Andreas Krebbel

2015-06-09 14:10:31 UTC

Post by OndÅej BÃlka

Post by Andreas Krebbel
...

Post by Maxim Kuvyrkov

Post by Richard Biener
Maybe follow s390 -mhotpatch instead?

Regarding implementation of the option, it will follow what s390 is doing with function attributes to mark which functions to apply nop-treatment to (using attributes will avoid problems with [coming] LTO builds of the kernel). The new option will set value of the attribute on all functions in current compilation unit, and then nops will be generated from the attribute specification.
On the other hand, s390 does not generate a section of descriptor entries of NOP pads, which seems like a useful (or necessary) option. A more-or-less generic implementation should, therefore, combine s390's attributes approach to annotating functions and x86's approach to providing information in an ELF section about NOP entries. Or can we record value of a function attribute in ELF in a generic way?

I agree that would be useful. The only reason we didn't implement that was that our kernel guys were
confident enough to be able to detect patchable functions without it. We discussed two solutions to
1. Add special relocations pointing to the patchable areas.
2. Add a special section listing all patchable areas. I think systemtap is doing something similiar
for their probes.

As I am bit concerned with performance why require nops there? Add a
byte count number >= requested thats boundary of next instruction. When
lifepatching for return you need to copy this followed by jump back to next
instruction. Then gcc could fill that with instructions that don't
depend on address, fill with nops as trivial first implementation.
Would that be possible?

So instead of placing NOPs to be overwritten you intend to simply overwrite the existing code after
making a backup of it? While that sounds appealing I think you could run into trouble when trying
to atomically patch the code. On S/390 it could happen that you overwrite 3 2 byte instructions
with a 6 byte instruction. The effect of the first two might be visible when doing the update so you
would have to keep all CPUs outside this area when patching. With the -mhotpatch option it is
guaranteed that a 6 byte NOP is used. With the proper alignment we can do an atomic update then.
For your approach we have to arrange that bigger instructions get scheduled first and if this isn't
possible we would have to add proper NOPs instead. That might be an improvement which could be added
to that feature.
Btw. we have compared several benchmarks with kernels either using or not using -mhotpatch and there
was no noticable penalty. However, I agree that it would be nice to get rid of the NOPs althogether.

-Andreas-

Andi Kleen

2015-06-09 15:52:57 UTC

Post by Andreas Krebbel

Post by OndÅej BÃlka
As I am bit concerned with performance why require nops there? Add a
byte count number >= requested thats boundary of next instruction. When
lifepatching for return you need to copy this followed by jump back to next
instruction. Then gcc could fill that with instructions that don't
depend on address, fill with nops as trivial first implementation.
Would that be possible?

So instead of placing NOPs to be overwritten you intend to simply overwrite the existing code after
making a backup of it?

This is how Linux k/uprobes work. But it only works for a subset of instructions and is
fairly complicated because you need a complete decoder that is able to adjust any program counter
relative data offsets. Having a patch area is far easier and more reliable.

-Andi

--
***@linux.intel.com -- Speaking for myself only

Andi Kleen

2015-05-28 18:05:02 UTC

Post by Maxim Kuvyrkov
Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount options to other architectures, we should implement a target-independent option -fprolog-pad=N, which will generate a pad of N nops at the beginning of each function and add a section entry describing the pad similar to -mrecord-mcount [1].

Sounds fine to me.

-Andi

--
***@linux.intel.com -- Speaking for myself only

Jim Wilson

2015-05-28 22:41:07 UTC

Post by Maxim Kuvyrkov
Hi,
Akashi-san and I have been discussing required GCC changes to make kernel's livepatching work for AArch64 and other architectures. At the moment livepatching is supported for x86[_64] using the following options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is geek-speak for "please add several NOPs at the very beginning of each function, and make a section with addresses of all those NOP pads".

FYI, there is also the darwin/rs6000 -mfix-and-continue support, which
adds 5 nops to the prologue. This was a part of a gdb feature, to allow
one to load a fixed function into a binary inside the debugger, and then
continue executing with the fixed code. It sounds like your kernel
feature is doing something very similar. If you are making this a
generic feature, then maybe the darwin/rs6000 -mfix-and-continue support
can be merged with it somehow.

Jim

Li Bin

2015-05-30 08:22:09 UTC

Post by Maxim Kuvyrkov
Hi,
Akashi-san and I have been discussing required GCC changes to make kernel's livepatching work for AArch64 and other architectures. At the moment livepatching is supported for x86[_64] using the following options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is geek-speak for "please add several NOPs at the very beginning of each function, and make a section with addresses of all those NOP pads".
The above long-ish list of options is a historical artifact of how livepatching support evolved for x86. The end result is that for livepatching (or ftrace, or possible future kernel features) to work compiler needs to generate a little bit of empty code space at the beginning of each function. Kernel can later use that space to insert call sequences for various hooks.
Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount options to other architectures, we should implement a target-independent option -fprolog-pad=N, which will generate a pad of N nops at the beginning of each function and add a section entry describing the pad similar to -mrecord-mcount [1].
Since adding NOPs is much less architecture-specific then outputting call instruction sequences, this option can be handled in a target-independent way at least for some/most architectures.
Comments?

This proposal sounds good to me, and I look forward to it be merged soon:)
Then I'll make the appropriate changes in kernel.
Thanks!
Li Bin

Post by Maxim Kuvyrkov
As I found out today, the team from Huawei has implemented [2], which follows x86 example of -mfentry option generating a hard-coded call sequence. I hope that this proposal can be easily incorporated into their work since most of the livepatching changes are in the kernel.
[1] Technically, generating a NOP pad and adding a section entry in .__mcount_loc are two separate actions, so we may want to have a -fprolog-pad-record option. My instinct is to stick with a single option for now, since we can always add more later.
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/346905.html
--
Maxim Kuvyrkov
www.linaro.org

David Brown

2015-06-04 09:45:10 UTC

Post by Maxim Kuvyrkov
Hi,
Akashi-san and I have been discussing required GCC changes to make
kernel's livepatching work for AArch64 and other architectures. At
the moment livepatching is supported for x86[_64] using the following
options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is
geek-speak for "please add several NOPs at the very beginning of each
function, and make a section with addresses of all those NOP pads".
The above long-ish list of options is a historical artifact of how
livepatching support evolved for x86. The end result is that for
livepatching (or ftrace, or possible future kernel features) to work
compiler needs to generate a little bit of empty code space at the
beginning of each function. Kernel can later use that space to
insert call sequences for various hooks.
Our proposal is that instead of adding
-mfentry/-mnop-count/-mrecord-mcount options to other architectures,
we should implement a target-independent option -fprolog-pad=N, which
will generate a pad of N nops at the beginning of each function and
add a section entry describing the pad similar to -mrecord-mcount
[1].
Since adding NOPs is much less architecture-specific then outputting
call instruction sequences, this option can be handled in a
target-independent way at least for some/most architectures.
Comments?

Rather than just a sequence of NOP's, should the first NOP be a
unconditional branch to the beginning of the real function? I don't
know if this applies to AArch64 cpus, but I believe some cpus can handle
such branches already in the decode unit, thus avoiding any extra cycles
for skipping the NOPs.

David

Andi Kleen

2015-06-04 15:17:48 UTC

Post by David Brown
Rather than just a sequence of NOP's, should the first NOP be a
unconditional branch to the beginning of the real function? I don't
know if this applies to AArch64 cpus, but I believe some cpus can handle
such branches already in the decode unit, thus avoiding any extra cycles
for skipping the NOPs.

nops are very cheap. Typically they are already discard in the frontend.
It's unlikely all of this is worth it.

-Andi

--
***@linux.intel.com -- Speaking for myself only

Andrew Pinski

2015-06-04 15:30:20 UTC

Post by Andi Kleen

Post by David Brown
Rather than just a sequence of NOP's, should the first NOP be a
unconditional branch to the beginning of the real function? I don't
know if this applies to AArch64 cpus, but I believe some cpus can handle
such branches already in the decode unit, thus avoiding any extra cycles
for skipping the NOPs.

nops are very cheap. Typically they are already discard in the frontend.
It's unlikely all of this is worth it.

Maybe on Intel's chip but not for an example on ThunderX they are not
discarded. But we can issue 2 at a time. So what is a few cycles
overhead for each function when an icache miss is much higher.

Thanks,
Andrew

Maxim Kuvyrkov

2015-10-13 12:51:12 UTC

Hi,

The feedback in this thread was overall positive with good suggestions
on implementation details. I'm starting to work on the first draft,
and plan to post something in 2-4 weeks.

Thanks.

Post by Maxim Kuvyrkov
Hi,
Akashi-san and I have been discussing required GCC changes to make kernel's livepatching work for AArch64 and other architectures. At the moment livepatching is supported for x86[_64] using the following options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is geek-speak for "please add several NOPs at the very beginning of each function, and make a section with addresses of all those NOP pads".
The above long-ish list of options is a historical artifact of how livepatching support evolved for x86. The end result is that for livepatching (or ftrace, or possible future kernel features) to work compiler needs to generate a little bit of empty code space at the beginning of each function. Kernel can later use that space to insert call sequences for various hooks.
Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount options to other architectures, we should implement a target-independent option -fprolog-pad=N, which will generate a pad of N nops at the beginning of each function and add a section entry describing the pad similar to -mrecord-mcount [1].
Since adding NOPs is much less architecture-specific then outputting call instruction sequences, this option can be handled in a target-independent way at least for some/most architectures.
Comments?
As I found out today, the team from Huawei has implemented [2], which follows x86 example of -mfentry option generating a hard-coded call sequence. I hope that this proposal can be easily incorporated into their work since most of the livepatching changes are in the kernel.
[1] Technically, generating a NOP pad and adding a section entry in .__mcount_loc are two separate actions, so we may want to have a -fprolog-pad-record option. My instinct is to stick with a single option for now, since we can always add more later.
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/346905.html
--
Maxim Kuvyrkov
www.linaro.org

--
Maxim Kuvyrkov
www.linaro.org

libin

2015-10-22 09:07:04 UTC

Post by Maxim Kuvyrkov
Hi,
Akashi-san and I have been discussing required GCC changes to make kernel's livepatching work for AArch64 and other architectures. At the moment livepatching is supported for x86[_64] using the following options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is geek-speak for "please add several NOPs at the very beginning of each function, and make a section with addresses of all those NOP pads".
The above long-ish list of options is a historical artifact of how livepatching support evolved for x86. The end result is that for livepatching (or ftrace, or possible future kernel features) to work compiler needs to generate a little bit of empty code space at the beginning of each function. Kernel can later use that space to insert call sequences for various hooks.
Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount options to other architectures, we should implement a target-independent option -fprolog-pad=N, which will generate a pad of N nops at the beginning of each function and add a section entry describing the pad similar to -mrecord-mcount [1].
Since adding NOPs is much less architecture-specific then outputting call instruction sequences, this option can be handled in a target-independent way at least for some/most architectures.
Comments?
As I found out today, the team from Huawei has implemented [2], which follows x86 example of -mfentry option generating a hard-coded call sequence. I hope that this proposal can be easily incorporated into their work since most of the livepatching changes are in the kernel.

Thanks very much for your effort for this, and the arch-independed implementation
is very good to me, but only have one question that how to enture the atomic
replacement of multi instructions in kernel side?

And before this arch-independed option, can we consider the arch-depended -mfentry
implemention for arm64 like arch x86 firstly? I will post it soon.

livepatch for arm64 based on this arm64 -mfentry feature on github:
https://github.com/libin2015/livepatch-for-arm64.git master

discussions on this topic:
https://lkml.org/lkml/2015/5/28/54

Thanks,
Li Bin

Post by Maxim Kuvyrkov
[1] Technically, generating a NOP pad and adding a section entry in .__mcount_loc are two separate actions, so we may want to have a -fprolog-pad-record option. My instinct is to stick with a single option for now, since we can always add more later.
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/346905.html
--
Maxim Kuvyrkov
www.linaro.org

libin

2015-10-22 09:23:30 UTC

From: Jiangjiji <***@huawei.com>
Date: Sat, 10 Oct 2015 15:29:57 +0800
Subject: [PATCH] * gcc/config/aarch64/aarch64.opt: Add a new option.
* gcc/config/aarch64/aarch64.c: Add some new functions and Macros.
* gcc/config/aarch64/aarch64.h: Modify PROFILE_HOOK and FUNCTION_PROFILER.

Signed-off-by: Jiangjiji <***@huawei.com>
Signed-off-by: Li Bin <***@huawei.com>
---
gcc/config/aarch64/aarch64.c | 23 +++++++++++++++++++++++
gcc/config/aarch64/aarch64.h | 13 ++++++++-----
gcc/config/aarch64/aarch64.opt | 4 ++++
3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 752df4e..c70b161 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -440,6 +440,17 @@ aarch64_is_long_call_p (rtx sym)
return aarch64_decl_is_long_call_p (SYMBOL_REF_DECL (sym));
}

+void
+aarch64_function_profiler (FILE *file, int labelno ATTRIBUTE_UNUSED)
+{
+ if (flag_fentry)
+ {
+ fprintf (file, "\tmov\tx9, x30\n");
+ fprintf (file, "\tbl\t__fentry__\n");
+ fprintf (file, "\tmov\tx30, x9\n");
+ }
+}
+
/* Return true if the offsets to a zero/sign-extract operation
represent an expression that matches an extend operation. The
operands represent the paramters from
@@ -7414,6 +7425,15 @@ aarch64_emit_unlikely_jump (rtx insn)
add_int_reg_note (insn, REG_BR_PROB, very_unlikely);
}

+/* Return true, if profiling code should be emitted before
+ * prologue. Otherwise it returns false.
+ * Note: For x86 with "hotfix" it is sorried. */
+static bool
+aarch64_profile_before_prologue (void)
+{
+ return flag_fentry != 0;
+}
+
/* Expand a compare and swap pattern. */

void
@@ -8454,6 +8474,9 @@ aarch64_cannot_change_mode_class (enum machine_mode from,
#undef TARGET_ASM_ALIGNED_SI_OP
#define TARGET_ASM_ALIGNED_SI_OP "\t.word\t"

+#undef TARGET_PROFILE_BEFORE_PROLOGUE
+#define TARGET_PROFILE_BEFORE_PROLOGUE aarch64_profile_before_prologue
+
#undef TARGET_ASM_CAN_OUTPUT_MI_THUNK
#define TARGET_ASM_CAN_OUTPUT_MI_THUNK \
hook_bool_const_tree_hwi_hwi_const_tree_true
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 77b2bb9..65e34fc 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -804,13 +804,16 @@ do { \
#define PROFILE_HOOK(LABEL) \
{ \
rtx fun, lr; \
- lr = get_hard_reg_initial_val (Pmode, LR_REGNUM); \
- fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME); \
- emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode); \
+ if (!flag_fentry)
+ {
+ lr = get_hard_reg_initial_val (Pmode, LR_REGNUM); \
+ fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME); \
+ emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode); \
+ }
}

-/* All the work done in PROFILE_HOOK, but still required. */
-#define FUNCTION_PROFILER(STREAM, LABELNO) do { } while (0)
+#define FUNCTION_PROFILER(STREAM, LABELNO)
+ aarch64_function_profiler(STREAM, LABELNO)

/* For some reason, the Linux headers think they know how to define
these macros. They don't!!! */
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 266d873..9e4b408 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -124,3 +124,7 @@ Enum(aarch64_abi) String(ilp32) Value(AARCH64_ABI_ILP32)

EnumValue
Enum(aarch64_abi) String(lp64) Value(AARCH64_ABI_LP64)
+
+mfentry
+Target Report Var(flag_fentry) Init(0)
+Emit profiling counter call at function entry immediately after prologue.

--
1.7.1

Szabolcs Nagy

2015-10-22 10:16:33 UTC

Post by libin
Date: Sat, 10 Oct 2015 15:29:57 +0800
Subject: [PATCH] * gcc/config/aarch64/aarch64.opt: Add a new option.
* gcc/config/aarch64/aarch64.c: Add some new functions and Macros.
* gcc/config/aarch64/aarch64.h: Modify PROFILE_HOOK and FUNCTION_PROFILER.

this patch might be worth submitting to gcc-patches.

i assume this is not redundant with respect to the
nop-padding work.

Post by libin
---
gcc/config/aarch64/aarch64.c | 23 +++++++++++++++++++++++
gcc/config/aarch64/aarch64.h | 13 ++++++++-----
gcc/config/aarch64/aarch64.opt | 4 ++++
3 files changed, 35 insertions(+), 5 deletions(-)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 752df4e..c70b161 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -440,6 +440,17 @@ aarch64_is_long_call_p (rtx sym)
return aarch64_decl_is_long_call_p (SYMBOL_REF_DECL (sym));
}
+void
+aarch64_function_profiler (FILE *file, int labelno ATTRIBUTE_UNUSED)
+{
+ if (flag_fentry)
+ {
+ fprintf (file, "\tmov\tx9, x30\n");
+ fprintf (file, "\tbl\t__fentry__\n");
+ fprintf (file, "\tmov\tx30, x9\n");
+ }
+}
+

you can even omit the mov x30,x9 at the call site if
__fentry__ does

stp x9,x30,[sp,#-16]!
//... profiling
ldp x30,x9,[sp],#16
ret x9

is there a problem with this?

i think the rest of the patch means that -pg retains
the old behaviour and -pg -mfentry emits this new entry.

note that -pg rejects -fomit-frame-pointer (for no good
reason), that should be fixed separately (it seems the
kernel now relies on frame pointers on aarch64, but the
mcount abi does not require this and e.g. the glibc
mcount does not use it.)

Post by libin
/* Return true if the offsets to a zero/sign-extract operation
represent an expression that matches an extend operation. The
operands represent the paramters from
@@ -7414,6 +7425,15 @@ aarch64_emit_unlikely_jump (rtx insn)
add_int_reg_note (insn, REG_BR_PROB, very_unlikely);
}
+/* Return true, if profiling code should be emitted before
+ * prologue. Otherwise it returns false.
+ * Note: For x86 with "hotfix" it is sorried. */
+static bool
+aarch64_profile_before_prologue (void)
+{
+ return flag_fentry != 0;
+}
+
/* Expand a compare and swap pattern. */
void
@@ -8454,6 +8474,9 @@ aarch64_cannot_change_mode_class (enum machine_mode from,
#undef TARGET_ASM_ALIGNED_SI_OP
#define TARGET_ASM_ALIGNED_SI_OP "\t.word\t"
+#undef TARGET_PROFILE_BEFORE_PROLOGUE
+#define TARGET_PROFILE_BEFORE_PROLOGUE aarch64_profile_before_prologue
+
#undef TARGET_ASM_CAN_OUTPUT_MI_THUNK
#define TARGET_ASM_CAN_OUTPUT_MI_THUNK \
hook_bool_const_tree_hwi_hwi_const_tree_true
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 77b2bb9..65e34fc 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -804,13 +804,16 @@ do { \
#define PROFILE_HOOK(LABEL) \
{ \
rtx fun, lr; \
- lr = get_hard_reg_initial_val (Pmode, LR_REGNUM); \
- fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME); \
- emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode); \
+ if (!flag_fentry)
+ {
+ lr = get_hard_reg_initial_val (Pmode, LR_REGNUM); \
+ fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME); \
+ emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode); \
+ }
}
-/* All the work done in PROFILE_HOOK, but still required. */
-#define FUNCTION_PROFILER(STREAM, LABELNO) do { } while (0)
+#define FUNCTION_PROFILER(STREAM, LABELNO)
+ aarch64_function_profiler(STREAM, LABELNO)
/* For some reason, the Linux headers think they know how to define
these macros. They don't!!! */
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 266d873..9e4b408 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -124,3 +124,7 @@ Enum(aarch64_abi) String(ilp32) Value(AARCH64_ABI_ILP32)
EnumValue
Enum(aarch64_abi) String(lp64) Value(AARCH64_ABI_LP64)
+
+mfentry
+Target Report Var(flag_fentry) Init(0)
+Emit profiling counter call at function entry immediately after prologue.

libin

2015-10-22 13:06:30 UTC

Post by Szabolcs Nagy

Post by libin
Date: Sat, 10 Oct 2015 15:29:57 +0800
Subject: [PATCH] * gcc/config/aarch64/aarch64.opt: Add a new option.
* gcc/config/aarch64/aarch64.c: Add some new functions and Macros.
* gcc/config/aarch64/aarch64.h: Modify PROFILE_HOOK and FUNCTION_PROFILER.

this patch might be worth submitting to gcc-patches.

Ok, i will submit it to gcc-patches :)

Post by Szabolcs Nagy
i assume this is not redundant with respect to the
nop-padding work.

Post by libin
---
gcc/config/aarch64/aarch64.c | 23 +++++++++++++++++++++++
gcc/config/aarch64/aarch64.h | 13 ++++++++-----
gcc/config/aarch64/aarch64.opt | 4 ++++
3 files changed, 35 insertions(+), 5 deletions(-)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 752df4e..c70b161 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -440,6 +440,17 @@ aarch64_is_long_call_p (rtx sym)
return aarch64_decl_is_long_call_p (SYMBOL_REF_DECL (sym));
}
+void
+aarch64_function_profiler (FILE *file, int labelno ATTRIBUTE_UNUSED)
+{
+ if (flag_fentry)
+ {
+ fprintf (file, "\tmov\tx9, x30\n");
+ fprintf (file, "\tbl\t__fentry__\n");
+ fprintf (file, "\tmov\tx30, x9\n");
+ }
+}
+

you can even omit the mov x30,x9 at the call site if
__fentry__ does
stp x9,x30,[sp,#-16]!
//... profiling
ldp x30,x9,[sp],#16
ret x9
is there a problem with this?

__fentry__ is responsible to protect all registers.

Post by Szabolcs Nagy
i think the rest of the patch means that -pg retains
the old behaviour and -pg -mfentry emits this new entry.
note that -pg rejects -fomit-frame-pointer (for no good
reason), that should be fixed separately (it seems the
kernel now relies on frame pointers on aarch64, but the
mcount abi does not require this and e.g. the glibc
mcount does not use it.)

Yes, kernel configuration CONFIG_FRAME_POINTER is forced on arm64,
that the kernel is compiled without -fomit-frame-pointer.

Thanks,

Li Bin

Post by Szabolcs Nagy

Post by libin
/* Return true if the offsets to a zero/sign-extract operation
represent an expression that matches an extend operation. The
operands represent the paramters from
@@ -7414,6 +7425,15 @@ aarch64_emit_unlikely_jump (rtx insn)
add_int_reg_note (insn, REG_BR_PROB, very_unlikely);
}
+/* Return true, if profiling code should be emitted before
+ * prologue. Otherwise it returns false.
+ * Note: For x86 with "hotfix" it is sorried. */
+static bool
+aarch64_profile_before_prologue (void)
+{
+ return flag_fentry != 0;
+}
+
/* Expand a compare and swap pattern. */
void
@@ -8454,6 +8474,9 @@ aarch64_cannot_change_mode_class (enum machine_mode from,
#undef TARGET_ASM_ALIGNED_SI_OP
#define TARGET_ASM_ALIGNED_SI_OP "\t.word\t"
+#undef TARGET_PROFILE_BEFORE_PROLOGUE
+#define TARGET_PROFILE_BEFORE_PROLOGUE aarch64_profile_before_prologue
+
#undef TARGET_ASM_CAN_OUTPUT_MI_THUNK
#define TARGET_ASM_CAN_OUTPUT_MI_THUNK \
hook_bool_const_tree_hwi_hwi_const_tree_true
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 77b2bb9..65e34fc 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -804,13 +804,16 @@ do { \
#define PROFILE_HOOK(LABEL) \
{ \
rtx fun, lr; \
- lr = get_hard_reg_initial_val (Pmode, LR_REGNUM); \
- fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME); \
- emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode); \
+ if (!flag_fentry)
+ {
+ lr = get_hard_reg_initial_val (Pmode, LR_REGNUM); \
+ fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME); \
+ emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode); \
+ }
}
-/* All the work done in PROFILE_HOOK, but still required. */
-#define FUNCTION_PROFILER(STREAM, LABELNO) do { } while (0)
+#define FUNCTION_PROFILER(STREAM, LABELNO)
+ aarch64_function_profiler(STREAM, LABELNO)
/* For some reason, the Linux headers think they know how to define
these macros. They don't!!! */
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 266d873..9e4b408 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -124,3 +124,7 @@ Enum(aarch64_abi) String(ilp32) Value(AARCH64_ABI_ILP32)
EnumValue
Enum(aarch64_abi) String(lp64) Value(AARCH64_ABI_LP64)
+
+mfentry
+Target Report Var(flag_fentry) Init(0)
+Emit profiling counter call at function entry immediately after prologue.

.

AKASHI Takahiro

2015-10-22 10:14:28 UTC

Li,
(added linux-arm-kernel to Cc.)

Post by libin

Post by Maxim Kuvyrkov
Hi,
Akashi-san and I have been discussing required GCC changes to make kernel's livepatching work for AArch64 and other
architectures. At the moment livepatching is supported for x86[_64] using the following options: "-pg -mfentry
-mrecord-mcount -mnop-mcount" which is geek-speak for "please add several NOPs at the very beginning of each function,
and make a section with addresses of all those NOP pads".
The above long-ish list of options is a historical artifact of how livepatching support evolved for x86. The end
result is that for livepatching (or ftrace, or possible future kernel features) to work compiler needs to generate a
little bit of empty code space at the beginning of each function. Kernel can later use that space to insert call
sequences for various hooks.
Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount options to other architectures, we should
implement a target-independent option -fprolog-pad=N, which will generate a pad of N nops at the beginning of each
function and add a section entry describing the pad similar to -mrecord-mcount [1].
Since adding NOPs is much less architecture-specific then outputting call instruction sequences, this option can be
handled in a target-independent way at least for some/most architectures.
Comments?
As I found out today, the team from Huawei has implemented [2], which follows x86 example of -mfentry option
generating a hard-coded call sequence. I hope that this proposal can be easily incorporated into their work since
most of the livepatching changes are in the kernel.

Thanks very much for your effort for this, and the arch-independed implementation
is very good to me, but only have one question that how to enture the atomic
replacement of multi instructions in kernel side?

I have one idea, but we'd better discuss this topic in, at least including, linux-arm-kernel.

Post by libin
And before this arch-independed option, can we consider the arch-depended -mfentry
implemention for arm64 like arch x86 firstly? I will post it soon.
https://github.com/libin2015/livepatch-for-arm64.git master

I also have my own version of livepatch support for arm64 using yet-coming "-fprolog-add=N" option :)
As we discussed before, the main difference will be how we should preserve LR register when invoking
a ftrace hook (ftrace_regs_caller).
But again, this is a topic to discuss mainly in linux-arm-kernel.
(I have no intention of excluding gcc ml from the discussions.)

Thanks,
-Takahiro AKASHI

Post by libin
https://lkml.org/lkml/2015/5/28/54
Thanks,
Li Bin

Post by Maxim Kuvyrkov
[1] Technically, generating a NOP pad and adding a section entry in .__mcount_loc are two separate actions, so we may
want to have a -fprolog-pad-record option. My instinct is to stick with a single option for now, since we can always
add more later.
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/346905.html
--
Maxim Kuvyrkov
www.linaro.org

Szabolcs Nagy

2015-10-22 10:26:28 UTC

Post by AKASHI Takahiro

Post by libin

Post by Maxim Kuvyrkov
Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount options to other architectures,
we should
implement a target-independent option -fprolog-pad=N, which will generate a pad of N nops at the beginning
of each
function and add a section entry describing the pad similar to -mrecord-mcount [1].
Since adding NOPs is much less architecture-specific then outputting call instruction sequences, this option
can be
handled in a target-independent way at least for some/most architectures.
Comments?
As I found out today, the team from Huawei has implemented [2], which follows x86 example of -mfentry option
generating a hard-coded call sequence. I hope that this proposal can be easily incorporated into their work
since
most of the livepatching changes are in the kernel.

Thanks very much for your effort for this, and the arch-independed implementation
is very good to me, but only have one question that how to enture the atomic
replacement of multi instructions in kernel side?

I have one idea, but we'd better discuss this topic in, at least including, linux-arm-kernel.

Post by libin
And before this arch-independed option, can we consider the arch-depended -mfentry
implemention for arm64 like arch x86 firstly? I will post it soon.
https://github.com/libin2015/livepatch-for-arm64.git master

I also have my own version of livepatch support for arm64 using yet-coming "-fprolog-add=N" option :)
As we discussed before, the main difference will be how we should preserve LR register when invoking
a ftrace hook (ftrace_regs_caller).
But again, this is a topic to discuss mainly in linux-arm-kernel.
(I have no intention of excluding gcc ml from the discussions.)

is -fprolog-add=N enough from gcc?

i assume it solves the live patching, but i thought -mfentry
might be still necessary when live patching is not used.

or is the kernel fine with the current mcount abi for that?
(note that changes the code generation in leaf functions
and currently the kernel relies on frame pointers etc.)

AKASHI Takahiro

2015-10-23 09:11:03 UTC

Post by Szabolcs Nagy

Post by AKASHI Takahiro

Post by libin

Post by Maxim Kuvyrkov
Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount options to other architectures,
we should
implement a target-independent option -fprolog-pad=N, which will generate a pad of N nops at the beginning
of each
function and add a section entry describing the pad similar to -mrecord-mcount [1].
Since adding NOPs is much less architecture-specific then outputting call instruction sequences, this option
can be
handled in a target-independent way at least for some/most architectures.
Comments?
As I found out today, the team from Huawei has implemented [2], which follows x86 example of -mfentry option
generating a hard-coded call sequence. I hope that this proposal can be easily incorporated into their work
since
most of the livepatching changes are in the kernel.

Thanks very much for your effort for this, and the arch-independed implementation
is very good to me, but only have one question that how to enture the atomic
replacement of multi instructions in kernel side?

I have one idea, but we'd better discuss this topic in, at least including, linux-arm-kernel.

Post by libin
And before this arch-independed option, can we consider the arch-depended -mfentry
implemention for arm64 like arch x86 firstly? I will post it soon.
https://github.com/libin2015/livepatch-for-arm64.git master

I also have my own version of livepatch support for arm64 using yet-coming "-fprolog-add=N" option :)
As we discussed before, the main difference will be how we should preserve LR register when invoking
a ftrace hook (ftrace_regs_caller).
But again, this is a topic to discuss mainly in linux-arm-kernel.
(I have no intention of excluding gcc ml from the discussions.)

is -fprolog-add=N enough from gcc?

Yes, as far as I correctly understand this option.

Post by Szabolcs Nagy
i assume it solves the live patching, but i thought -mfentry
might be still necessary when live patching is not used.

No.
- Livepatch depends on ftrace's DYNAMIC_FTRACE_WITH_REGS feature
- DYNAMIC_FTRACE_WITH_REGS can be implemented either with -fprolog-add=N or -mfentry
- x86 is the only architecture that supports -mfentry AFAIK
- and it is used in the kernel solely to implement this ftrace feature AFAIK
- So once a generic option, fprolog-add=N, is supported, we have no reason to add arch-specific -mfentry.

Post by Szabolcs Nagy
or is the kernel fine with the current mcount abi for that?
(note that changes the code generation in leaf functions

Can you please elaborate your comments in more details?
I didn't get your point here.

Thanks,
-Takahiro AKASHI

Post by Szabolcs Nagy
and currently the kernel relies on frame pointers etc.)

Szabolcs Nagy

2015-10-23 10:23:22 UTC

Post by AKASHI Takahiro

Post by Szabolcs Nagy

Post by AKASHI Takahiro
I also have my own version of livepatch support for arm64 using yet-coming "-fprolog-add=N" option :)
As we discussed before, the main difference will be how we should preserve LR register when invoking
a ftrace hook (ftrace_regs_caller).
But again, this is a topic to discuss mainly in linux-arm-kernel.
(I have no intention of excluding gcc ml from the discussions.)

is -fprolog-add=N enough from gcc?

Yes, as far as I correctly understand this option.

Post by Szabolcs Nagy
i assume it solves the live patching, but i thought -mfentry
might be still necessary when live patching is not used.

No.
- Livepatch depends on ftrace's DYNAMIC_FTRACE_WITH_REGS feature
- DYNAMIC_FTRACE_WITH_REGS can be implemented either with -fprolog-add=N or -mfentry
- x86 is the only architecture that supports -mfentry AFAIK
- and it is used in the kernel solely to implement this ftrace feature AFAIK
- So once a generic option, fprolog-add=N, is supported, we have no reason to add arch-specific -mfentry.

Post by Szabolcs Nagy
or is the kernel fine with the current mcount abi for that?
(note that changes the code generation in leaf functions

Can you please elaborate your comments in more details?
I didn't get your point here.

ok, i may be confused.

i thought there is a static ftrace (functions are
instrumented with mcount using -pg) and a dynamic one
where the code is modified at runtime.

then i thought adding -fprolog-pad=N would be good for the
dynamic case, but not for the static case.

the static case may need improvements too because the
current way (using regular c call abi for mcount) affects
code generation more significantly than the proposed
-mfentry solution would (e.g. leaf functions turn into
non-leaf ones).

hence the question: is the kernel satisfied with -pg mcount
for the static ftrace or does it want -mfentry behaviour
instead?

21 Replies
51 Views
Permalink to this page
Disable enhanced parsing

Thread Navigation

Maxim Kuvyrkov 2015-05-28 08:39:27 UTC

Richard Biener 2015-05-28 08:59:10 UTC

Maxim Kuvyrkov 2015-05-28 09:16:58 UTC

Andreas Krebbel 2015-05-28 15:37:53 UTC

Ondřej Bílka 2015-06-04 07:15:31 UTC

Andreas Krebbel 2015-06-09 14:10:31 UTC

Andi Kleen 2015-06-09 15:52:57 UTC

Andi Kleen 2015-05-28 18:05:02 UTC

Jim Wilson 2015-05-28 22:41:07 UTC

Li Bin 2015-05-30 08:22:09 UTC

David Brown 2015-06-04 09:45:10 UTC

Andi Kleen 2015-06-04 15:17:48 UTC

Andrew Pinski 2015-06-04 15:30:20 UTC

Maxim Kuvyrkov 2015-10-13 12:51:12 UTC

libin 2015-10-22 09:07:04 UTC

libin 2015-10-22 09:23:30 UTC

Szabolcs Nagy 2015-10-22 10:16:33 UTC

libin 2015-10-22 13:06:30 UTC

AKASHI Takahiro 2015-10-22 10:14:28 UTC

Szabolcs Nagy 2015-10-22 10:26:28 UTC

AKASHI Takahiro 2015-10-23 09:11:03 UTC

Szabolcs Nagy 2015-10-23 10:23:22 UTC

about - legalese

Loading...