Richard Henderson
2018-09-08 10:04:38 UTC
This attempts to capture (most of) the two hour discussion that
we had in the Entrance Hall at GNU Cauldron yesterday. Please
correct any faulty memory on my part and forward or cc this to
the appropriate RISC-V forum.
The RISC-V vector extension as described in the presentation is something
other than what is present in the currently released 2.2 standard. To
clarify the language within this message, based on what I remember:
vl: the current number of elements that are active in each vector reg
maxvl: the maximum number of elements in the current config
maxel: the (maximum) element width within each vector reg
vsz: (new language here, vector size) maxvl * maxel
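To make the vsz definition concrete, here is a minimal sketch in Python. The class name VConfig and the two example configurations are made up for illustration, and it assumes maxel is counted in bytes (consistent with "MAXEL = 8" for doubles later in this message).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VConfig:
    """Toy model of one vector configuration (hypothetical, for illustration)."""
    maxvl: int  # maximum number of elements in the current config
    maxel: int  # maximum element width, assumed here to be in bytes

    @property
    def vsz(self) -> int:
        # vsz = maxvl * maxel: bytes needed to hold one whole vector register
        return self.maxvl * self.maxel


# Two made-up configurations: the size of a full register differs between
# them, which is why a spill slot cannot be sized without knowing the config.
bytes_cfg = VConfig(maxvl=64, maxel=1)
doubles_cfg = VConfig(maxvl=16, maxel=8)
print(bytes_cfg.vsz, doubles_cfg.vsz)
```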
(I) We talked about the needs of the register allocator.
No one wants their loop kernel to spill registers, but eventually
it will happen, and we have to be prepared for it.
For a series of loop nests, while there is no spilling, it is easy
to select a vconfig for each loop nest independently. However, when
spills occur, we need to be able to allocate spill slots of sufficient
size, and locate their position within the stack frame.
Because VSZ can change based on vconfig, the only manageable frame
allocation strategy may require selecting a single vconfig for the
entire function.
We posited new instructions, vspill and vfill, that ignore VL, ignore
predication, and operate on all MAXVL elements of MAXEL. This allows
the compiler to save and restore the entire contents of the register
without knowing the current configuration.
We posited a new instruction, akin to SVE's addvl, such as
addvsz rd, rs, imm (rd = rs + imm * vsz)
which can be used to rapidly size the stack frame and form addresses
of spill slots within the stack frame, also without knowing the current
configuration. E.g.
addvsz sp, fp, -7
to allocate space for 7 spill slots below the frame pointer.
When a spill needs to occur, use e.g.
addvsz tmp, fp, -3
vspill v0, tmp
In particular, this approach allows one to form the address in one
instruction instead of 3, and also provides VSZ implicitly, as opposed
to requiring the register allocator to somehow re-materialize this value.
(Which is not impossible, given the example of pic_offset_register.
But also given that example, requires invasive special-casing within
the register allocator. So, ew.)
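As a sanity check on the arithmetic, the posited addvsz can be modelled in a few lines of Python. This is only a sketch of the semantics written above (rd = rs + imm * vsz). The frame pointer value and the vsz of 64 bytes are made-up numbers, and the three-step variant is my guess at what "3 instructions" would look like, not something we wrote down.

```python
def addvsz(rs: int, imm: int, vsz: int) -> int:
    """Model of the posited `addvsz rd, rs, imm`: rd = rs + imm * vsz."""
    return rs + imm * vsz


def address_without_addvsz(fp: int, slot: int, vsz_reg: int) -> int:
    """Assumed three-step equivalent when vsz has to live in a register."""
    offset = vsz_reg * slot   # 1: multiply (or shift) by the slot index
    addr = fp - offset        # 2: subtract from the frame pointer
    return addr               # (3 would be re-materializing vsz_reg itself)


fp = 0x1000    # hypothetical frame pointer
vsz = 64       # hypothetical vsz in bytes; the hardware would supply this

sp = addvsz(fp, -7, vsz)    # addvsz sp, fp, -7  : room for 7 spill slots
tmp = addvsz(fp, -3, vsz)   # addvsz tmp, fp, -3 : address of the third slot
print(hex(sp), hex(tmp))
```

The point of the model is that the compiler only ever emits the slot index as an immediate; the scaling by vsz happens at run time.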
(II) We talked about the needs of a "simd" abi
The normal integer abi is already fixed, and because of that the
entire vector state must be call-clobbered. However, OpenMP
has a #pragma omp declare simd that instructs the compiler to generate
a set of vectorized versions of a given function.
Compare
http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665628/abi_sve_aapcs64_100986_0000_00_en.pdf
https://software.intel.com/en-us/articles/vector-simd-function-abi
Something similar must be defined for RISC-V. Such an abi must
consider how vconfig is to be managed across function boundaries
and with separate compilation. In my opinion this should be done
before finalizing the ISA, as detailed below.
(II-a) The callee must know how many registers are enabled by vconfig.
The simplest solution is simply to require all 32 registers to be enabled.
Expanding on this slightly, one could require a reduced set N (e.g. 16)
and define this as abi. This would trade off potentially unused
registers and potentially more spilling for longer vectors in the
(presumably) common case.
One could require N registers by default and override this by an
explicit target-specific clause in the #pragma. This would allow
programmers to tune the compiler output (bearing in mind that changing
the clause changes the function abi), while also providing a sensible
default for code that has not been explicitly tuned for a given risc-v
implementation.
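The trade-off between register count and vector length can be illustrated with a toy calculation. This sketch ASSUMES that an implementation has a fixed amount of vector storage that is divided evenly among the enabled registers; that model, the function name, and the 4096-byte figure are all hypothetical and are not taken from the 2.2 document.

```python
def maxvl_for(total_bytes: int, nregs: int, maxel: int) -> int:
    """Elements per register if a fixed storage budget is split evenly.

    Assumption (not from the spec): total_bytes of vector storage are
    shared equally by the nregs enabled registers, each element taking
    maxel bytes.
    """
    if nregs <= 0 or maxel <= 0:
        raise ValueError("nregs and maxel must be positive")
    return total_bytes // (nregs * maxel)


storage = 4096  # hypothetical total vector storage in bytes
for nregs in (32, 16, 8):
    print(nregs, "registers ->", maxvl_for(storage, nregs, maxel=8), "doubles each")
```

Under that assumption, halving the number of enabled registers doubles maxvl, which is the "longer vectors" side of the trade.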
(II-b) The callee must know MAXEL.
It seems obvious that MAXEL must be set as large as the largest vector
type that is passed in or returned from the callee. However, the
callee may be passed a vector of bytes but internally may need to
operate on doubles as part of the computation. One cannot simply
set a new vconfig without potentially losing data in the byte input.
We suggested the possibility of generating up to 4 different functions
(or equivalently 4 different entry point symbols), which allows the
caller to give the callee the minimal MAXEL, N, that may be assumed.
We didn't talk specifics, but assume N would be encoded into the
normal _Z name mangling that is used for C++ and the other target
specific simd abis. But for purposes of example here let's just
use foo_N.
A function taking a vector of doubles of course requires MAXEL = 8,
and so only one entry point foo_8 is generated.
A function taking a vector of bytes would generate all of foo_1,
foo_2, foo_4, foo_8. Suppose further that the function requires at
maximum a uint16_t intermediate. In that case, foo_2, foo_4, foo_8
would all be aliases for the same symbol. However, foo_1 would need
to do something different.
While the "something" of course lies firmly in the compiler prerogative,
so long as correct results are obtained, we suggested a simple approach
that may perform as well as might be expected: spill the inputs to the
stack, adjust vconfig as required, and vectorize the operation on the
data on the stack just like any other array.
We again suggested that a target-specific clause in the #pragma might
allow tuning the compiler output by specifying the minimum MAXEL.
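The foo_N scheme above can be summarized as a small decision table. The following Python sketch is only my reading of the discussion: the function name entry_points and the labels "direct" and "fallback" are invented here, and the real name mangling was not discussed.

```python
from typing import Dict

WIDTHS = (1, 2, 4, 8)  # candidate MAXEL values, in bytes


def entry_points(arg_width: int, internal_width: int,
                 name: str = "foo") -> Dict[str, str]:
    """Plan the entry points for one simd function.

    arg_width:      widest element type passed in or returned (bytes)
    internal_width: widest element type the body needs internally (bytes)

    An entry point name_N exists for every N >= arg_width. It is "direct"
    when the caller's MAXEL already covers internal_width, so all such
    symbols can alias one body; otherwise it is a "fallback" that must do
    something different (e.g. spill the inputs and change vconfig).
    """
    if arg_width not in WIDTHS or internal_width not in WIDTHS:
        raise ValueError("widths must be one of 1, 2, 4, 8")
    plan = {}
    for n in WIDTHS:
        if n < arg_width:
            continue  # arguments would not fit with this MAXEL
        plan[f"{name}_{n}"] = "direct" if n >= internal_width else "fallback"
    return plan


print(entry_points(8, 8))  # vector of doubles
print(entry_points(1, 2))  # vector of bytes, 16-bit work inside
```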
(II-c) The callee must be able to reset to a previous vconfig
When performing the "something" above, one needs to be able to save
the current vconfig and restore it at the end. The specification
of the vconfig insn is not within the 2.2 document, nor was it
fully defined within the presentation. However, the presentation
implied that the argument is always an immediate. A read-modify-write
(RMW) version is also required, one that takes its input from a register.
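What is being asked for is essentially a swap: install a new vconfig and get the old one back in a register, so that it can be restored later. A minimal Python model of that save/restore pattern follows. The class VectorUnit, the method swap_vconfig, and the treatment of vconfig as an opaque integer are all hypothetical; no such instruction is defined anywhere yet.

```python
from contextlib import contextmanager


class VectorUnit:
    """Toy stand-in for the vector state; vconfig is an opaque integer."""

    def __init__(self, vconfig: int):
        self.vconfig = vconfig

    def swap_vconfig(self, new: int) -> int:
        """Hypothetical RMW operation: install `new`, return the old value."""
        old = self.vconfig
        self.vconfig = new
        return old


@contextmanager
def temporary_vconfig(unit: VectorUnit, new: int):
    """Run a block under `new`, then restore whatever the caller had."""
    saved = unit.swap_vconfig(new)   # save the caller's config in a "register"
    try:
        yield unit
    finally:
        unit.swap_vconfig(saved)     # restore from the register, not an immediate


unit = VectorUnit(vconfig=0x11)      # made-up encoding for the caller's config
with temporary_vconfig(unit, 0x88):  # made-up encoding for the wider config
    pass                             # the "something different" would go here
```

The restore step is the reason an immediate-only form is not enough: the callee does not know at compile time which vconfig the caller was using.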
(II-d) The callee should be able to call-save registers
The simd abi may well want to provide some call-saved vector registers.
This is where vspill, vfill, and addvsz being able to operate without
the callee knowing the exact vconfig is imperative.
r~