TLSDESC clobber ABI stability/futureproofness?

Discussion:

Rich Felker

2018-10-10 20:51:45 UTC

It's recently come up in musl libc development that the tlsdesc asm
functions, at least for some archs, are potentially not future-proof,
in that, for a given fixed version of the asm in the dynamic linker,
it seems possible for a future ISA level and compiler supporting that
ISA level to produce code, in the C functions called in the dynamic
fallback case, instructions which clobber registers which are normally
call-clobbered, but which are non-clobbered in the tlsdesc ABI. This
does not risk breakage when an existing valid build of libc/ldso is
used on new hardware and new applications that provide new registers,
but it does risk breakage if an existing source version of libc/ldso
is built with a compiler supporting new extensions, which is difficult
to preclude and not something we want to try to preclude.
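
For readers less familiar with the mechanism, here is a simplified C
model of the two kinds of descriptor entry points under discussion. It
is only a sketch: struct tlsdesc, the function names, the flat "memory"
buffer standing in for the thread pointer and a dynamic block, and the
call counter are all made up for illustration. The real entry points
are asm with a special calling convention, which C cannot express.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a TLS descriptor: an entry point plus an argument. */
struct tlsdesc {
    ptrdiff_t (*entry)(struct tlsdesc *);
    size_t arg;
};

/* One flat buffer stands in for a thread's memory: the "thread pointer"
   is its base, and a dynamic TLS block is assumed to live at offset 64. */
static char memory[128];
#define THREAD_POINTER (memory)
#define DYNAMIC_BLOCK  (memory + 64)

static int slow_path_calls;   /* counts entries into the fallback path */

/* Static case: the argument already is the offset from the thread pointer. */
static ptrdiff_t tlsdesc_static(struct tlsdesc *d)
{
    return (ptrdiff_t)d->arg;
}

/* Dynamic case: in a real libc this is where C code may run, and where
   compiler-generated instructions could touch registers the asm wrapper
   did not know to save. */
static ptrdiff_t tlsdesc_dynamic(struct tlsdesc *d)
{
    slow_path_calls++;
    return (DYNAMIC_BLOCK + d->arg) - THREAD_POINTER;
}

/* What an access site does: call through the descriptor, add the result
   to the thread pointer. */
static char *tls_access(struct tlsdesc *d)
{
    return THREAD_POINTER + d->entry(d);
}
```

The caller of tls_access sees the same interface in both cases; only
the dynamic entry can reach code whose register usage depends on how
the library was compiled.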

For aarch64 at least, according to discussions I had with Szabolcs
Nagy, there is an intent that any new extensions to the aarch64
register file be treated as clobbered by tlsdesc functions, rather
than preserved. However I can't find any statement of such intent for
x86 or in general.

In the x86 spec, the closest I can find is the phrasing:

"being able to assume no registers are clobbered by the call"

and the comment in the pseudo-C:

/* Preserve any call-clobbered registers not preserved because of
the above across the call below. */

Source: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt

What is the policy for i386 and x86_64? Are normally call-clobbered
registers from new register file extensions intended to be preserved
by the tlsdesc functions, or clobberable by them? Is fxsave/xsave
supposed to be used when available? What about more obscure stuff like
3dnow or other vendor extensions?

Is there any intended policy for other/future archs that add TLSDESC
support?

I'm hopeful there's some hidden intent I'm just not finding that makes
this issue a lot simpler than I fear it is. If not we're considering
raising a signal to request installation of new dynamic TLS (punting
to the kernel to save/restore everything) or getting rid of
just-in-time installation of new TLS entirely in favor of having
dlopen install it:

https://www.openwall.com/lists/musl/2018/10/10/2
https://www.openwall.com/lists/musl/2018/10/10/3

Rich

Alexandre Oliva

2018-10-11 03:53:04 UTC

Post by Rich Felker
It's recently come up in musl libc development that the tlsdesc asm
functions, at least for some archs, are potentially not future-proof,
in that, for a given fixed version of the asm in the dynamic linker,
it seems possible for a future ISA level and compiler supporting that
ISA level to produce code, in the C functions called in the dynamic
fallback case, instructions which clobber registers which are normally
call-clobbered, but which are non-clobbered in the tlsdesc ABI. This
does not risk breakage when an existing valid build of libc/ldso is
used on new hardware and new applications that provide new registers,
but it does risk breakage if an existing source version of libc/ldso
is built with a compiler supporting new extensions, which is difficult
to preclude and not something we want to try to preclude.

I understand the concern. I considered it back when I designed TLSDesc,
and my reasoning was that if the implementation of the fallback dynamic
TLS descriptor allocator could possibly use some register, the library
implementation should know about it as well, and work to preserve it. I
realize this might not be the case for an old library built by a new
compiler for a newer version of the library's target. Other pieces of
the library may fail as well, if registers unknown to it are available
and used by the compiler (setjmp/longjmp, *context, dynamic PLT
resolution come to mind), besides the usual difficulties building old
code with newer tools, so I figured it wasn't worth sacrificing the
performance of the normal TLSDesc case to make this aspect of register
set extensions easier.

There might be another particularly risky case, namely, that the memory
allocator used by TLS descriptors be overridden by code that uses more
registers than the library knows to preserve. Memory allocation within
the dynamic loader, including lazy TLS Descriptor relocation resolution,
is a context in which we should probably use internal, non-overridable
memory allocators, if we don't already. This would reduce the present
risky case to the one in the paragraph above.
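
As an illustration of what an internal, non-overridable allocator could
look like, here is a minimal bump allocator in C. It is a sketch under
assumptions, not code from any libc: the name internal_alloc, the fixed
4096-byte arena, and the lack of a free operation are all choices made
for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical private arena; the size is arbitrary for this sketch. */
#define ARENA_SIZE 4096
static _Alignas(max_align_t) unsigned char arena[ARENA_SIZE];
static size_t arena_used;

/* Bump-allocate size bytes aligned to align, which must be a power of
   two no larger than the alignment of max_align_t. Returns NULL when
   the arena is exhausted. Being static and never calling malloc, it
   cannot be interposed and cannot reach an interposed malloc. */
static void *internal_alloc(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    assert(align <= _Alignof(max_align_t));

    size_t start = (arena_used + align - 1) & ~(align - 1);
    if (start > ARENA_SIZE || size > ARENA_SIZE - start)
        return NULL;

    arena_used = start + size;
    return arena + start;
}
```

Because everything reachable from this function is inside the library,
the set of registers it can touch is decided entirely by how the
library itself is compiled.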

Post by Rich Felker
For aarch64 at least, according to discussions I had with Szabolcs
Nagy, there is an intent that any new extensions to the aarch64
register file be treated as clobbered by tlsdesc functions, rather
than preserved.

That's unfortunate. I'm not sure I understand the reasoning behind this
intent. Maybe we should discuss it further?

Post by Rich Felker
"being able to assume no registers are clobbered by the call"
/* Preserve any call-clobbered registers not preserved because of
the above across the call below. */
Source: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
What is the policy for i386 and x86_64?

I don't know that my proposal ever became authoritative policy, but even
if it is, I guess I have to agree it is underspecified and the reasoning
above could be added.

Post by Rich Felker
Are normally call-clobbered registers from new register file
extensions intended to be preserved by the tlsdesc functions, or
clobberable by them?

My thinking has always been that they should be preserved, which doesn't
necessarily mean they have to be saved and restored. Only if the
implementation of tlsdesc could possibly modify them should it arrange
for their entry-point values to be restored before returning. This
implies not calling overridable functions in the internal
implementation, and compiling at least the bits used by the tlsdesc
implementation so as to use only the register set known and supported by
the library.

Anyway, thanks for bringing this up. I'm amending the x86 TLSDesc
proposal to cover this with the following footnote:

(*) Preserving a register does not necessarily imply saving and
restoring it. If the system library implementation does not use or
even know about a certain extended register set, it need not save it,
because it will presumably not modify it. This assumes the TLS
Descriptor implementation is self-contained within the system library,
with no overridable callbacks. A consequence is that, even if
other parts of the system library are compiled so as to use an
extended register set, those used by the implementation of TLS
Descriptors, including lazy relocations, should be limited to using
the register set that the interfaces are known to preserve.

after:

[...] This penalizes
the case that requires dynamic TLS, since it must preserve (*) all
call-clobbered registers [...]

Please let me know your thoughts about this change, e.g., whether it's
enough to address your concerns or if you envision a need for more than
that. Thanks,

--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe

Szabolcs Nagy

2018-10-11 11:47:39 UTC

Post by Alexandre Oliva

That's unfortunate. I'm not sure I understand the reasoning behind this
intent. Maybe we should discuss it further?

sve registers overlap with existing float registers
so float register access clobbers them.

so new code is suddenly not compatible with existing
tlsdesc entry points in the libc.

i think extensions should not cause such abi break.
we could mark binaries so they fail to load on an old
system instead of failing randomly at runtime, but
requiring new libc for a new system is suboptimal
(you cannot deploy stable linux distros then).

if we update the libc then the tlsdesc entry has to
save/restore all sve regs, which can be huge state
(cause excessive stack usage), but more importantly
suddenly the process becomes "sve enabled" even if it
otherwise does not use sve at all (linux kernel keeps
track of which processes use sve instructions, ones
that don't can enter the kernel more quickly as the
sve state does not have to be saved)

i don't see a good solution for this, but in practice
it's unlikely that user code would need tls access and
sve together much, so it seems reasonable to just add
sve registers to tlsdesc call clobber list and do the
same for future extensions too (tlsdesc call will not
be worse than a normal indirect call).

(in principle it's possible that tlsdesc entry avoids
using any float regs, but in practice that requires
hackery in the libc: marking every affected translation
units with -mgeneral-regs-only or si

Rich Felker

2018-10-11 15:15:56 UTC


Alexandre Oliva

2018-10-11 23:18:37 UTC

Post by Rich Felker
This is indeed the big risk for glibc right now (with lazy,
non-fail-safe allocation of dynamic TLS)

Yeah, dynamic TLS was a can of worms in that regard even before lazy TLS
relocations.

Post by Rich Felker
that it's unlikely for vector-heavy code to be using TLS where the TLS
address load can't be hoisted out of the blocks where the
call-clobbered vector regs are in use. Generally, if such hoisting is
performed, the main/only advantage of avoiding clobbers is for
registers which may contain incoming arguments.

I see. Well, the more registers are preserved, the better for the ideal
fast path, but even if some are not, you're still better off than
explicitly calling tls_get_addr...

Post by Rich Felker
unless there is some future-proof approach to
save-all/restore-all that works on all archs with TLSDESC

Please don't single out TLSDESC as if the problem affected it alone.
Lazy relocation with traditional PLT entries for functions is also
supposed to save and restore all registers, and the same issues arise,
except they're a lot more common. The only difference is that failures
to preserve registers are less visible, because most of the time you're
resolving them to functions that abide by the normal ABI, but once
specialized calling conventions kick in, the very same issues arise.
TLS descriptors are just one case of such specialized calling
conventions. Indeed, one of the reasons that made me decide this
arrangement was acceptable was precisely because the problem already
existed with preexisting lazy PLT resolution.

Rich Felker

2018-10-11 23:32:31 UTC

Post by Alexandre Oliva

Post by Rich Felker
This is indeed the big risk for glibc right now (with lazy,
non-fail-safe allocation of dynamic TLS)

Yeah, dynamic TLS was a can of worms in that regard even before lazy TLS
relocations.

I see. Well, the more registers are preserved, the better for the ideal
fast path, but even if some are not, you're still better off than
explicitly calling tls_get_addr...

Also, it seems gcc is failing to do this hoisting right on x86_64
right now, regardless of which TLS model is used. Bug report should be
coming soon.

Post by Alexandre Oliva

Post by Rich Felker
unless there is some future-proof approach to
save-all/restore-all that works on all archs with TLSDESC

Right, but lazy relocations are a "feature" you can easily just omit,
and we do. However the only way to omit this path from TLSDESC is
installing the new TLS to all live threads at dlopen time. That's
actually not a bad idea -- it drops the compare/branch from the
dynamic tlsdesc code path, and likewise in __tls_get_addr, making both
forms of dynamic TLS (possibly considerably) faster. I'm just
concerned about whether it can be done without making thread
creation/exit significantly slower.
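
To make the trade-off concrete, here is a toy C model of the two
schemes. It is a sketch only: struct dtv, module_size, and the three
functions are hypothetical names, calloc stands in for whatever
allocator a libc would use, and all locking between threads is left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_MODULES 8

/* Toy dynamic thread vector: one dynamic TLS block pointer per module. */
struct dtv {
    void *block[MAX_MODULES];   /* NULL means not installed in this thread */
};

/* Per-module TLS size, recorded when the module is "loaded". */
static size_t module_size[MAX_MODULES];

/* Lazy scheme: every access pays a compare/branch, and the first access
   in each thread runs allocation code at the access site. */
static void *tls_addr_lazy(struct dtv *dtv, size_t mod, size_t off)
{
    assert(mod < MAX_MODULES && off < module_size[mod]);
    if (!dtv->block[mod]) {
        dtv->block[mod] = calloc(1, module_size[mod]);
        if (!dtv->block[mod])
            return NULL;        /* awkward to report from an access site */
    }
    return (char *)dtv->block[mod] + off;
}

/* Eager scheme: "dlopen" installs the block in every live thread and can
   report failure cleanly. Returns 0 on success, -1 on allocation failure. */
static int install_module(struct dtv **threads, size_t n, size_t mod, size_t size)
{
    assert(mod < MAX_MODULES && size > 0);
    for (size_t i = 0; i < n; i++) {
        threads[i]->block[mod] = calloc(1, size);
        if (!threads[i]->block[mod]) {
            while (i--) {       /* roll back the threads already done */
                free(threads[i]->block[mod]);
                threads[i]->block[mod] = NULL;
            }
            return -1;
        }
    }
    module_size[mod] = size;
    return 0;
}

/* With eager installation the access path needs no test and no call. */
static void *tls_addr_eager(struct dtv *dtv, size_t mod, size_t off)
{
    return (char *)dtv->block[mod] + off;
}
```

The cost that moves is visible in install_module: the loop over live
threads is work done at load time, and a real implementation would
also have to populate every new thread's vector at creation.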

Post by Alexandre Oliva
except they're a lot more common. The only difference is that failures
to preserve registers are less visible, because most of the time you're
resolving them to functions that abide by the normal ABI, but once
specialized calling conventions kick in, the very same issues arise.
TLS descriptors are just one case of such specialized calling
conventions. Indeed, one of the reasons that made me decide this
arrangement was acceptable was precisely because the problem already
existed with preexisting lazy PLT resolution.

I see. From that perspective, it's less of a constraint than
constraints that already existed elsewhere. Unfortunately from our
perspective in musl those greater constraints don't exist, and the one
imposed by TLSDESC is the unique one of its kind.

Rich

Alexandre Oliva

2018-10-13 07:00:32 UTC

Post by Rich Felker
However the only way to omit this path from TLSDESC is
installing the new TLS to all live threads at dlopen time

Well, one could just as easily drop dynamic TLS altogether, forcing all
TLS into the Static TLS block until it fills up, and failing for good if
it does. But then, you don't need TLS Descriptors at all, just go with
Initial Exec. It helps if init can set the Static TLS block size from
an environment variable or somesuch.

But your statement appears to be conflating two separate issues, namely
the allocation of a TLS Descriptor during lazy TLS resolution, and the
allocation of per-thread dynamic TLS blocks upon first access.

The former is just as easy and harmless to disable as lazy function
relocations. The latter is not exclusive to TLSDesc: __tls_get_addr has
historically used malloc to grow the DTV and to allocate dynamic TLS
blocks, and if overriders to malloc end up using/clobbering unexpected
registers, even if just because of lazy PLT resolution for calls in its
implementation, things might go wrong. Sure enough, __tls_get_addr
doesn't use a specialized ABI, so this is usually not an issue.

Post by Rich Felker
That's actually not a bad idea -- it drops the compare/branch from the
dynamic tlsdesc code path, and likewise in __tls_get_addr, making both
forms of dynamic TLS (possibly considerably) faster.

But then you have to add some form of synchronization so that other
threads can actually mess with your DTV without conflicts, from
releasing dlclose()d dynamic blocks to growing the DTV and releasing the
old DTV while its owner thread is using it.

I wonder if it would make sense to introduce an overridable
call-clobbered-regs-preserving wrapper function, that lazy PLT resolvers
and Dynamic TLSDesc calls would call, and that could be easily extended
to preserve extended register files without having to modify the library
proper. LD_PRELOAD could bring it in, and it could even use ifunc
relocations, to be able to cover all available registers on arches with
multiple register file extensions.
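
A rough C model of that wrapper idea follows. Everything in it is
hypothetical: a plain function pointer stands in for a symbol that
LD_PRELOAD or an ifunc could replace, and a global variable
(fake_ext_reg) stands in for an extended register, since real register
save/restore cannot be written in portable C.

```c
#include <assert.h>
#include <stddef.h>

typedef void *(*slow_fn)(void *arg);

/* Stand-in for an extended register the library knows nothing about. */
static unsigned long fake_ext_reg;

/* Default wrapper shipped with the library: adds no extra preservation. */
static void *default_preserving_call(slow_fn fn, void *arg)
{
    return fn(arg);
}

/* Hypothetical override point, modeled here as a function pointer. */
static void *(*preserving_call)(slow_fn, void *) = default_preserving_call;

/* What a lazy PLT resolver or dynamic TLSDESC entry would call instead
   of calling the slow path directly. */
static void *call_slow_path(slow_fn fn, void *arg)
{
    return preserving_call(fn, arg);
}

/* An override supplied from outside the library: it saves and restores
   the extra state around the call. */
static void *ext_preserving_call(slow_fn fn, void *arg)
{
    unsigned long saved = fake_ext_reg;
    void *ret = fn(arg);
    fake_ext_reg = saved;
    return ret;
}

/* A slow path compiled by a "newer compiler" that clobbers the register. */
static void *clobbering_slow_path(void *arg)
{
    fake_ext_reg = 0;
    return arg;
}
```

With the default wrapper the clobber is visible to the caller; after
assigning ext_preserving_call to preserving_call it is not, and the
library itself did not have to change.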

Rich Felker

2018-10-14 02:21:00 UTC
