RISC-V vector extension cauldron discussion

Discussion:

Richard Henderson

2018-09-08 10:04:38 UTC

This attempts to capture (most of) the two hour discussion that
we had in the Entrance Hall at GNU Cauldron yesterday. Please
correct any faulty memory on my part and forward or cc this to
the appropriate RISC-V forum.

The RISC-V vector extension as described is something other than what is
present in the currently released 2.2 standard. To clarify the
language within this message, based on what I remember:

vl: the current number of elements that are active in each vector reg
maxvl: the maximum number of elements in the current config
maxel: the (maximum) element width within each vector reg
vsz: (new language here, vector size) maxvl * maxel

(I) We talked about the needs of the register allocator.

No one wants their loop kernel to spill registers, but eventually
it will happen, and we have to be prepared for it.

For a series of loop nests, while there is no spilling, it is easy
to select a vconfig for each loop nest independently. However, when
spills occur, we need to be able to allocate spill slots of sufficient
size, and locate their position within the stack frame.

Because VSZ can change based on vconfig, the only manageable frame
allocation strategy may require selecting a single vconfig for the
entire function.

We posited new instructions, vspill and vfill, that ignore VL, ignore
predication, and operate on all MAXVL elements of MAXEL. This allows
the compiler to save and restore the entire contents of the register
without knowing the current configuration.

We posited a new instruction, akin to SVE's addvl, such as

addvsz rd, rs, imm (rd = rs + imm * vsz)

which can be used to rapidly size the stack frame and form addresses
of spill slots within the stack frame, also without knowing the current
configuration. E.g.

addvsz sp, fp, -7

to allocate space for 7 spill slots below the frame pointer.
When a spill needs to occur, use e.g.

addvsz tmp, fp, -3
vspill v0, tmp

In particular, this approach allows one to form the address in one
instruction instead of 3, and also provides VSZ implicitly as opposed
to requiring the register allocator to somehow re-materialize this value.

(Which is not impossible, given the example of pic_offset_register.
But also given that example, requires invasive special-casing within
the register allocator. So, ew.)
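
As a rough illustration of the addvsz arithmetic above, here is a minimal
C model. It is only a sketch: the struct vconfig_model type, its field
names, and the choice of bytes as the unit for maxel are assumptions made
for the example, and addvsz itself is the posited instruction, not anything
in a released spec.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the vector configuration. The field names follow
   the terms defined at the top of this message, not any ratified spec. */
struct vconfig_model {
    uint32_t maxvl;   /* maximum number of elements per vector register */
    uint32_t maxel;   /* maximum element width, in bytes (assumed unit) */
};

/* vsz = maxvl * maxel: bytes occupied by one whole vector register. */
static uint64_t vsz(const struct vconfig_model *cfg)
{
    assert(cfg->maxvl > 0 && cfg->maxel > 0);
    return (uint64_t)cfg->maxvl * cfg->maxel;
}

/* Model of the posited "addvsz rd, rs, imm": rd = rs + imm * vsz.
   Unsigned wraparound gives the right answer for negative imm. */
static uint64_t addvsz(const struct vconfig_model *cfg, uint64_t rs, int64_t imm)
{
    return rs + (uint64_t)imm * vsz(cfg);
}
```

With maxvl = 8 and maxel = 8, addvsz(&cfg, fp, -7) yields fp - 448, and
the frame layout never has to name the configuration explicitly.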

(II) We talked about the needs of a "simd" abi

The normal integer abi is already fixed, and because of that the
entire vector state must needs be call-clobbered. However, OpenMP
has a #pragma omp declare simd that instructs the compiler to generate
a set of vectorized versions of a given function.

Compare

http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665628/abi_sve_aapcs64_100986_0000_00_en.pdf
https://software.intel.com/en-us/articles/vector-simd-function-abi

Something similar must be defined for RISC-V. Such an abi must
consider how vconfig is to be managed across function boundaries
and with separate compilation. In my opinion this should be done
before finalizing the ISA, as detailed below.

(II-a) The callee must know how many registers are enabled by vconfig.

The simplest solution is simply to require all 32 registers to be enabled.

Expanding on this slightly, one could require a reduced set N (e.g. 16)
and define this as abi. This would trade off potentially unused
registers and potentially more spilling for longer vectors in the
(presumably) common case.

One could require N registers by default and override this by an
explicit target-specific clause in the #pragma. This would allow
programmers to tune the compiler output (bearing in mind that changing
the clause changes the function abi), while also providing a sensible
default for code that has not been explicitly tuned for a given risc-v
implementation.

(II-b) The callee must know MAXEL.

It seems obvious that MAXEL must be set as large as the largest vector
type that is passed in or returned from the callee. However, the
callee may be passed a vector of bytes but internally may need to
operate on doubles as part of the computation. One cannot simply
set a new vconfig without potentially losing data in the byte input.

We suggested the possibility of generating up to 4 different functions
(or equivalently 4 different entry point symbols), which allows the
caller to give the callee the minimal MAXEL, N, that may be assumed.

We didn't talk specifics, but assume N would be encoded into the
normal _Z name mangling that is used for C++ and the other target
specific simd abis. But for purposes of example here let's just
use foo_N.

A function taking a vector of doubles of course requires MAXEL = 8,
and so only one entry point foo_8 is generated.

A function taking a vector of bytes would generate all of foo_1,
foo_2, foo_4, foo_8. Suppose further that the function requires at
maximum a uint16_t intermediate. In that case, foo_2, foo_4, foo_8
would all be aliases for the same symbol. However, foo_1 would need
to do something different.
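
The entry-point rule in these examples can be written down as a small C
helper. This is a sketch under assumptions: the enum, the function name
classify_entry, and the two width parameters (widest element passed in or
out, widest element the function works with internally) are invented for
illustration and are not part of any proposed abi text.

```c
#include <assert.h>

enum entry_kind {
    ENTRY_ABSENT,     /* foo_N is not generated at all */
    ENTRY_DIRECT,     /* foo_N can alias the common implementation */
    ENTRY_FALLBACK    /* foo_N must do "something different" */
};

/* arg_width:  widest vector element passed in or returned, in bytes.
   work_width: widest element the body needs internally, in bytes.
   n:          the minimal MAXEL the caller promises (the N in foo_N). */
static enum entry_kind classify_entry(unsigned arg_width, unsigned work_width,
                                      unsigned n)
{
    assert(n == 1 || n == 2 || n == 4 || n == 8);
    assert(arg_width <= work_width);
    if (n < arg_width)
        return ENTRY_ABSENT;     /* caller could not even hold the arguments */
    if (n >= work_width)
        return ENTRY_DIRECT;     /* current MAXEL is already wide enough */
    return ENTRY_FALLBACK;       /* MAXEL too small for the internal work */
}
```

For the byte-vector function with 16-bit internal values this classifies
foo_1 as the fallback and foo_2, foo_4, foo_8 as aliases. For the function
taking doubles only foo_8 exists.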

While the "something" of course lies firmly in the compiler prerogative,
so long as correct results are obtained, we suggested a simple approach
that may perform as well as might be expected: spill the inputs to the
stack, adjust vconfig as required, and vectorize the operation on the
data on the stack just like any other array.
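
A scalar C model of that fallback might look like the following. It is a
sketch only: foo_1_fallback, SPILL_MAXVL and the arithmetic in the loop are
made up for the example, and the point where a real callee would change
vconfig is marked with a comment because no such instruction is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SPILL_MAXVL 64   /* assumed upper bound on elements per register */

/* Hypothetical body of foo_1: the inputs arrive as a byte vector, but the
   computation needs 16-bit values internally. */
static void foo_1_fallback(const uint8_t *vreg_in, uint8_t *vreg_out, size_t vl)
{
    uint8_t spill[SPILL_MAXVL];

    assert(vl <= SPILL_MAXVL);
    memcpy(spill, vreg_in, vl);            /* 1. spill the byte inputs */
    /* 2. a real callee would adjust vconfig for wider elements here */
    for (size_t i = 0; i < vl; i++) {      /* 3. treat it as a plain array */
        uint16_t wide = (uint16_t)(spill[i] * 3u + 1u);
        spill[i] = (uint8_t)(wide >> 1);
    }
    memcpy(vreg_out, spill, vl);           /* 4. hand results back as bytes */
}
```

An input element of 100 gives 150 here, which would not survive if the
multiply were done in 8 bits.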

We again suggested that a target-specific clause in the #pragma might
allow tuning the compiler output by specifying the minimum MAXEL.

(II-c) The callee must be able to reset to a previous vconfig

When performing the "something" above, one needs to be able to save
the current vconfig and restore it at the end. The specification
of the vconfig insn is not within the 2.2 document, nor was it
fully defined within the presentation. However, the presentation
implied that the argument is always immediate. A RMW version is
required, and with input from a register.
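
To make the save/restore pattern concrete, here is a small C model of a
read-modify-write vconfig. Everything in it is hypothetical: model_vconfig
stands in for the hardware state, and vconfig_swap is an invented name for
the register-input RMW form asked for above.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t model_vconfig;   /* stands in for the hardware vconfig */

/* Hypothetical RMW form: install new_cfg and return the previous value. */
static uint32_t vconfig_swap(uint32_t new_cfg)
{
    uint32_t old = model_vconfig;
    model_vconfig = new_cfg;
    return old;
}

/* Callee pattern: reconfigure temporarily, then put the caller's
   configuration back from a value held in an ordinary register. */
static uint32_t with_temporary_config(uint32_t wanted)
{
    uint32_t saved = vconfig_swap(wanted);   /* save and set in one step */
    uint32_t seen = model_vconfig;           /* ... work under 'wanted' ... */
    uint32_t dropped = vconfig_swap(saved);  /* restore the previous config */
    assert(dropped == wanted);
    return seen;
}
```

An immediate-only vconfig cannot express the second swap, since the saved
value is not known at compile time.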

(II-d) The callee should be able to call-save registers

The simd abi may well want to provide some call-saved vector registers.
This is where vspill, vfill, and addvsz being able to operate without
the callee knowing the exact vconfig is imperative.

r~

Jakub Jelinek

2018-09-08 21:08:16 UTC

Post by Richard Henderson
Something similar must be defined for RISC-V. Such an abi must
consider how vconfig is to be managed across function boundaries
and with separate compilation. In my opinion this should be done
before finalizing the ISA, as detailed below.
(II-a) The callee must know how many registers are enabled by vconfig.
The simplest solution is simply to require all 32 registers to be enabled.

If masked vs. non-masked vector operations have approximately the same cost,
and due to the properties of the V extension only a single simdlen (a variable
one) is meaningful, then it is possible to avoid duplicating
#pragma omp declare simd/__attribute__((simd)) functions without explicit
notinbranch or inbranch attributes: just use the masked ones in the ABI and
pass an all-true vector in v1.

If we reduce the number of copies that way, perhaps there is a way to offer,
next to the scalar copy, 2 or 3 vector variants that differ by the number of
vector registers. Say the ABI document states that each function should be
emitted in 3 vector variants: one for 32 registers, another for 16 and
another for 8 (of course, if the function isn't externally visible, the
compiler can choose to emit just the ones that are needed or use any other
ABI it chooses). It would then always just require that maxel is 8 in the
simd ABI, because trying to spill all the arguments, save the current vconfig
setting, reconfigure, do the work with multiple loop iterations, spill all
the results, restore the previous vconfig setting and fill in the result
value might be too costly in case the implementation needs a MAXEL higher
than the caller provided.
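
The idea of keeping only the masked variant can be modelled in plain C as
below. This is a sketch, not generated compiler output: scale, scale_masked
and scale_unmasked are hypothetical names, the bool array stands in for the
mask register, and the 16-lane bound is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LANES 16   /* assumed upper bound on vl for this model */

/* The scalar function the user wrote. */
static int scale(int x)
{
    return 2 * x + 1;
}

/* Model of the single masked vector variant: lanes whose mask bit is
   false keep their previous destination value. */
static void scale_masked(int *dst, const int *src, const bool *mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[i] = scale(src[i]);
}

/* A call site outside any branch reuses the masked variant and simply
   passes an all-true mask. */
static void scale_unmasked(int *dst, const int *src, size_t vl)
{
    bool all_true[MODEL_LANES];

    assert(vl <= MODEL_LANES);
    for (size_t i = 0; i < vl; i++)
        all_true[i] = true;
    scale_masked(dst, src, all_true, vl);
}
```

Whether this is a good trade depends on the first premise, that masked and
non-masked operations cost about the same.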

Jakub

Palmer Dabbelt

2018-09-11 16:28:59 UTC

Sorry if this is a bit roundabout, I wrote this over a few days and ended up
thinking a bunch while writing. As a result, the beginning might not match the
end.

Post by Richard Henderson
This attempts to capture (most of) the two hour discussion that
we had in the Entrance Hall at GNU Cauldron yesterday. Please
correct any faulty memory on my part and forward or cc this to
the appropriate RISC-V forum.

Thanks!

I think Roger forwarded this to the private RISC-V vector mailing list, but I'm
not in the club so I'll have to talk about it here :)

Post by Richard Henderson
The RISC-V vector extension as described is something other than what is
present in the currently released 2.2 standard. To clarify the

Yes. The current RISC-V ISA standard contains no vector instructions, they
will be added under the "V" extension as part of a future revision of the
RISC-V standard. This is how we manage the standard: as new revisions of the
ISA manual come out we can add new extensions, but we can never change or
remove an existing extension.

Post by Richard Henderson
vl: the current number of elements that are active in each vector reg
maxvl: the maximum number of elements in the current config
maxel: the (maximum) element width within each vector reg
vsz: (new language here, vector size) maxvl * maxel
(I) We talked about the needs of the register allocator.
No one wants their loop kernel to spill registers, but eventually
it will happen, and we have to be prepared for it.
For a series of loop nests, while there is no spilling, it is easy
to select a vconfig for each loop nest independently. However, when
spills occur, we need to be able to allocate spill slots of sufficient
size, and locate their position within the stack frame.
Because VSZ can change based on vconfig, the only manageable frame
allocation strategy may require selecting a single vconfig for the
entire function.
We posited new instructions, vspill and vfill, that ignore VL, ignore
predication, and operate on all MAXVL elements of MAXEL. This allows
the compiler to save and restore the entire contents of the register
without knowing the current configuration.

While I'm not part of the vector working group, I'd anticipate these sorts of
instructions don't make it into the V extension because they leak too much
about the microarchitecture to software. One of the goals of the V extension
is to allow for software compatibility between different implementations, and
instructions with semantics like these tend to lead to incompatible software.

Additionally, I don't think this is necessary because our proposed vector ABI
is to clobber the entire state of the vector unit on all function calls. As a
result, the compiler should always know how to spill registers: when you enter
the function you don't have to spill anything, and if you spill as part of the
loop you must have set up the vector unit in some manner and therefore should
know about it.

This ABI may be a bad idea, but if we stick with it then we should be able to
get away without this sort of instruction. There are a few other places in the
ISA where we've somewhat wed ourselves to this ABI, but with some modifications
it might be possible to produce a reasonable one.

An instruction that does something similar (or even more revealing of the
microarchitecture) may be necessary as part of the supervisor-mode vector
extensions, but those are less scary because we can just mandate that
supervisors write compatible code. We operate under the assumption that we
can't mandate anything about userspace.

Post by Richard Henderson
We posited a new instruction, akin to SVE's addvl, such as
addvsz rd, rs, imm (rd = rs + imm * vsz)
which can be used to rapidly size the stack frame and form addresses
of spill slots within the stack frame, also without knowing the current
configuration. E.g.
addvsz sp, fp, -7
to allocate space for 7 spill slots below the frame pointer.
When a spill needs to occur, use e.g.
addvsz tmp, fp, -3
vspill v0, tmp
In particular, this approach allows one to form the address in one
instruction instead of 3, and also provides VSZ implicitly as opposed
to requiring the register allocator to somehow re-materialize this value.
(Which is not impossible, given the example of pic_offset_register.
But also given that example, requires invasive special-casing within
the register allocator. So, ew.)

I agree here: some mechanism to obtain the current VL should be
provided by the ISA. While it's not strictly necessary given the proposed ABI,
it does avoid a bunch of headaches trying to ensure that value sticks around.
I think it will end up being useful for any ABI that avoids clobbering every
vector register. I'd propose that the vector config should also be obtainable,
which will also be useful for such an ABI.

Post by Richard Henderson
(II) We talked about the needs of a "simd" abi
The normal integer abi is already fixed, and because of that the
entire vector state must needs be call-clobbered. However, OpenMP
has a #pragma omp declare simd that instructs the compiler to generate
a set of vectorized versions of a given function.
Compare
http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665628/abi_sve_aapcs64_100986_0000_00_en.pdf
https://software.intel.com/en-us/articles/vector-simd-function-abi
Something similar must be defined for RISC-V. Such an abi must
consider how vconfig is to be managed across function boundaries
and with separate compilation. In my opinion this should be done
before finalizing the ISA, as detailed below.

Must is a strong word, but I agree that we should at least ensure that it's
possible to define a sane ABI that saves vector registers around function calls
and passes arguments via vector registers. In other words: I think we'll still
want to support something like "-march=rv64gcv -mabi=lp64d", but I don't think
we want to preclude ourselves from "-march=rv64gcv -mabi=lp64dv" being better.

I think the best way to go about this is to figure out what features of an ABI
might be worth having, and then to enumerate the mechanisms that a V-style ISA
extension must provide in order to sanely implement such an ABI. Essentially
we've still got time to change the ISA, so let's just design a good ABI, figure
out what's necessary from the ISA to implement said ABI, and then make sure
that's in the standard.

The ABI features I can think of are:

* Passing at least one argument in a vector register.
   - Presumably we'll clobber vector argument registers on calls, like we do
     for everything else. Thus there isn't any ISA requirement here.
   - How does one go about indicating at the C level that an argument is
     passed in a register? If we just say "any __attribute__((vector)) of
     length less than N bytes/elements" then N must be less than the ISA
     mandated minimum vector length (IIRC 4 elements?) -- that might be OK.
* Saving the contents of at least one vector register across a function call.
  In order to do so we need:
   - A mechanism for determining the number of bytes used by a vector
     register, to reserve stack space.
   - A mechanism for saving a vector register to the stack. This could be a
     simple vector store, but if we want to maintain the entire register (as
     opposed to just the first vl elements) we need
* Saving vl across a function call.
   - We need a mechanism for determining the vector length. Currently the
     only way to do so is destructive, we'd need a non-destructive way to do
     so.
* Saving vconfig across a function call.
   - There is no way to determine the config, we'd need a way to do so.

My proposed vector ABI is:

* Don't pass any vector arguments in registers.
* Save all enabled vector registers across function calls, but only the first
VL elements of each vector (again, with the number of saved-over-call
registers decided upon later). This allows some flexibility on the side of
the caller, as it can inform the callee that it does not actually need any of
the saved registers. This also avoids any of the headaches related to saving
MAXVL elements.
* Save vl and vconfig across function calls.

One of the nice things here is backwards compatibility -- we can actually avoid
a second vector ABI here, with a simple runtime check to see if the vector unit
is enabled before saving any registers. This also allows us to avoid my
biggest fear with any proposed vector ABI: a big cliff for any function that
uses the vector unit where a bunch of data is saved that isn't actually needed
by the caller. My worry here is that this favors short-VL implementations
over long-VL implementations, which seems like the wrong way to go -- the
long-VL implementations will be fast either way, so it'd be better to penalize
those. Having two ABIs for the two implementation styles would be a mess.
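
To put rough numbers on the proposal, the save area it implies can be
modelled in C. This is a sketch under assumptions: struct vstate_model and
its fields are invented, the element width is assumed uniform across
registers, and two unsigned long slots are assumed for vl and vconfig.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of what the saving code would need to know. */
struct vstate_model {
    bool     enabled;        /* is the vector unit in use at all? */
    unsigned enabled_regs;   /* registers enabled by vconfig */
    unsigned vl;             /* active elements per register */
    unsigned elem_bytes;     /* element width in bytes (assumed uniform) */
};

/* Bytes needed to save the first vl elements of every enabled register,
   plus vl and vconfig themselves. */
static size_t save_area_bytes(const struct vstate_model *s)
{
    if (!s->enabled)
        return 0;            /* the runtime check: nothing to save */
    assert(s->enabled_regs <= 32);
    return (size_t)s->enabled_regs * s->vl * s->elem_bytes
           + 2 * sizeof(unsigned long);
}
```

The cost scales with vl rather than MAXVL, and drops to zero when the
vector unit is not enabled.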

I have no idea if this is sane or not, so feel free to suggest something else
:). I have to go play around with the OpenMP SIMD stuff to see if this makes
any sense in that realm.

Post by Richard Henderson
(II-a) The callee must know how many registers are enabled by vconfig.
The simplest solution is simply to require all 32 registers to be enabled.
Expanding on this slightly, one could require a reduced set N (e.g. 16)
and define this as abi. This would trade off potentially unused
registers and potentially more spilling for longer vectors in the
(presumably) common case.
One could require N registers by default and override this by an
explicit target-specific clause in the #pragma. This would allow
programmers to tune the compiler output (bearing in mind that changing
the clause changes the function abi), while also providing a sensible
default for code that has not been explicitly tuned for a given risc-v
implementation.

Makes sense -- my only worry here is that we're leaving a lot on the floor.
Maybe this is just because I'm not really a vector guy, but my biggest worry
with the vector unit is ensuring that memcpy() and friends are reasonably
efficient. I'm a bit worried about throwing a factor of 32 in vector length on
the floor here (or requiring saving a huge vector state), particularly as I
think that most vectorized code won't need to worry about calling standard ABI
functions.

Post by Richard Henderson
(II-b) The callee must know MAXEL.
It seems obvious that MAXEL must be set as large as the largest vector
type that is passed in or returned from the callee. However, the
callee may be passed a vector of bytes but internally may need to
operate on doubles as part of the computation. One cannot simply
set a new vconfig without potentially losing data in the byte input.

I think the important distinction here is not that it's impossible to set a
vconfig without clearing the vector state, but instead that it's impossible to
interrogate the current state of the vector unit.

Post by Richard Henderson
We suggested the possibility of generating up to 4 different functions
(or equivalently 4 different entry point symbols), which allows the
caller to give the callee the minimal MAXEL, N, that may be assumed.
We didn't talk specifics, but assume N would be encoded into the
normal _Z name mangling that is used for C++ and the other target
specific simd abis. But for purposes of example here let's just
use foo_N.
A function taking a vector of doubles of course requires MAXEL = 8,
and so only one entry point foo_8 is generated.
A function taking a vector of bytes would generate all of foo_1,
foo_2, foo_4, foo_8. Suppose further that the function requires at
maximum a uint16_t intermediate. In that case, foo_2, foo_4, foo_8
would all be aliases for the same symbol. However, foo_1 would need
to do something different.
While the "something" of course lies firmly in the compiler prerogative,
so long as correct results are obtained, we suggested a simple approach
that may perform as well as might be expected: spill the inputs to the
stack, adjust vconfig as required, and vectorize the operation on the
data on the stack just like any other array.

While this sounds like a reasonable approach, it also seems quite complicated.
This might be a bit of a pipe dream, but I was hoping to avoid this sort of
complexity.

Post by Richard Henderson
We again suggested that a target-specific clause in the #pragma might
allow tuning the compiler output by specifying the minimum MAXEL.

Unfortunately, it certainly smells like some mechanism for allowing users to
override the ABI on a per-function basis will be necessary.

Post by Richard Henderson
(II-c) The callee must be able to reset to a previous vconfig
When performing the "something" above, one needs to be able to save
the current vconfig and restore it at the end. The specification
of the vconfig insn is not within the 2.2 document, nor was it
fully defined within the presentation. However, the presentation
implied that the argument is always immediate. A RMW version is
required, and with input from a register.

I agree.

Post by Richard Henderson
(II-d) The callee should be able to call-save registers
The simd abi may well want to provide some call-saved vector registers.
This is where vspill, vfill, and addvsz being able to operate without
the callee knowing the exact vconfig is imperative.

I agree.

Richard Henderson

2018-09-11 21:34:24 UTC

Post by Richard Henderson
The RISC-V vector extension as described is something other than what is
present in the currently released 2.2 standard. To clarify the

Well, right, but it does have a draft of the V extension.
What was presented did not match that, which is what I was trying to describe.

Post by Richard Henderson
We posited new instructions, vspill and vfill, that ignore VL, ignore
predication, and operate on all MAXVL elements of MAXEL. This allows
the compiler to save and restore the entire contents of the register
without knowing the current configuration.

Pardon? How do they leak micro-architecture detail?
They load and store the *architectural* contents of the registers.

Additionally, I don't think this is necessary because our proposed vector ABI
is to clobber the entire state of the vector unit on all function calls.

Yes, but I was foreshadowing...

Post by Richard Henderson
(II) We talked about the needs of a "simd" abi

... this, in which we would not necessarily know the vconfig.

Sure.

* Passing at least one argument in a vector register.
   - Presumably we'll clobber vector argument registers on calls, like we do
     for everything else. Thus there isn't any ISA requirement here.
   - How does one go about indicating at the C level that an argument is
     passed in a register? If we just say "any __attribute__((vector)) of
     length less than N bytes/elements" then N must be less than the ISA
     mandated minimum vector length (IIRC 4 elements?) -- that might be OK.

Here I think you need to read the SVE document.

I would not use this abi for __attribute__((vector(fixed-size))) at all, but
for the variable length vectors that the auto-vectorizer uses, since that's
exactly what these functions are for.

* Saving the contents of at least one vector register across a function call.
   - A mechanism for determining the number of bytes used by a vector
     register, to reserve stack space.
   - A mechanism for saving a vector register to the stack. This could be a
     simple vector store, but if we want to maintain the entire register (as
     opposed to just the first vl elements) we need

This is exactly what I was talking about above for vspill/vfill.

* Saving vl across a function call.
   - We need a mechanism for determining the vector length. Currently the
     only way to do so is destructive, we'd need a non-destructive way to do
     so.
* Saving vconfig across a function call.
   - There is no way to determine the config, we'd need a way to do so.

Correct.

I will note that the above addvsz can be used as "addvsz tmp, x0, 1" to extract
VSZ. I can't think of how often extracting MAXEL and MAXVL individually would
be useful, so maybe just being able to get them from a read-vconfig insn would
be enough.

* Don't pass any vector arguments in registers.

If you're going to do that why define a new ABI at all?

For memcpy, that's always going to be a normal abi, so it can legitimately
clobber all of the vector registers in any way it likes -- e.g. reconfig to
maximize byte vector length.

I'm a bit worried about throwing a factor of 32 in vector length on
the floor here (or requiring saving a huge vector state),

Jakub talked a bit about this in his reply.

particularly as I
think that most vectorized code won't need to worry about calling standard ABI
functions.

Well, yes, most things that we can vectorize don't need this.
But loops that would use this ABI would otherwise be non-vectorizable.

r~

Palmer Dabbelt

2018-09-29 01:45:05 UTC

Post by Richard Henderson

Post by Richard Henderson
The RISC-V vector extension as described is something other than what is
present in the currently released 2.2 standard. To clarify the

Well, right, but it does have a draft of the V extension.
What was presented did not match that, which is what I was trying to describe.

Ah, I didn't know that. I guess I should look at our ISA manual more often...
:) Regardless, the presented extensions have drifted from v2.2 in various ways
such that I'm no longer sure what is real any more.

Post by Richard Henderson

Pardon? How do they leak micro-architecture detail?
They load and store the *architectural* contents of the registers.

Ya, I think I might have been wrong here. When I actually went through this I
think there might be a way to implement these with reasonable performance and
without leaking any micro-architectural state.

My worry here was exactly how this whole "ignore VL" idea interplays with what
values are allowed to exist in registers, which keeps flopping back and forth
based on how much type support we're baking into the base ISA and who yells
loudest in the meetings. IIRC the current proposal is to have something like

getvl t0 <- vl [imaginary instruction so we can restore vl]
setvl 0
vxor v0, v0
setvl t0

be defined to change no V state, in which case I think it would be possible to
define a sane version of these instructions. I also think that a reasonable
class of vector ABIs might rely on these, so we should probably figure out how
to make them work.
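
A toy C model of that sequence, to show what "change no V state" would
mean. It is only a sketch: struct vunit_model, MODEL_MAXVL and the helper
names are invented, getvl is the imaginary instruction from above, and the
model assumes an operation writes exactly the first vl elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAXVL 8   /* assumed register length for the model */

struct vunit_model {
    size_t   vl;                 /* active elements */
    uint32_t v0[MODEL_MAXVL];    /* contents of v0 */
};

static size_t getvl(const struct vunit_model *u)
{
    return u->vl;
}

static void setvl(struct vunit_model *u, size_t vl)
{
    assert(vl <= MODEL_MAXVL);
    u->vl = vl;
}

/* vxor v0, v0: in this model only the first vl elements are written. */
static void vxor_v0_v0(struct vunit_model *u)
{
    for (size_t i = 0; i < u->vl; i++)
        u->v0[i] ^= u->v0[i];
}

/* The four-instruction sequence quoted above. */
static void sequence(struct vunit_model *u)
{
    size_t t0 = getvl(u);
    setvl(u, 0);
    vxor_v0_v0(u);   /* vl is 0, so no element is touched */
    setvl(u, t0);
}
```

Under that assumption every element of v0 and vl itself come out of the
sequence unchanged.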

Post by Richard Henderson

Additionally, I don't think this is necessary because our proposed vector ABI
is to clobber the entire state of the vector unit on all function calls.

Yes, but I was foreshadowing...

Post by Richard Henderson
(II) We talked about the needs of a "simd" abi

... this, in which we would not necessarily know the vconfig.

I think we start agreeing in a hundred lines or so... :)

Post by Richard Henderson

Sure.

Here I think you need to read the SVE document.

Yes, I agree that I should read the SVE document. In fact, I opened it and
scanned through it and thought "gee, I should really read this". By the time I
got through the rest of your email the cauldron was almost over, and I figured
I should send something.

Post by Richard Henderson
I would not use this abi for __attribute__((vector(fixed-size))) at all, but
for the variable length vectors that the auto-vectorizer uses, since that's
exactly what these functions are for.

This is exactly what I was talking about above for vspill/vfill.

Yes, and I think that by this point I'd already convinced myself you were
right. Sorry for being somewhat incoherent, I wrote my response at about one
line per hour because I learned a lot while doing so.

Post by Richard Henderson

Correct.
I will note that the above addvsz can be used as "addvsz tmp, x0, 1" to extract
VSZ. I can't think of how often extracting MAXEL and MAXVL individually would
be useful, so maybe just being able to get them from a read-vconfig insn would
be enough.

Yes. I was trying to avoid explicitly describing an instruction encoding of
what was necessary, as once we get into encodings we'll get painted into a
corner eventually. I generally like to try to figure out what information is
necessary, and how fast we might need to obtain that information, before trying
to pack it into an encoding.

Post by Richard Henderson

* Don't pass any vector arguments in registers.

If you're going to do that why define a new ABI at all?

Ya, I think I might have gone too far down the rabbit hole of "let's not define
a new ABI". My biggest concern here is how the ABI maps to user code, which is
where I really need to read for a bit.

Post by Richard Henderson

For memcpy, that's always going to be a normal abi, so it can legitimately
clobber all of the vector registers in any way it likes -- e.g. reconfig to
maximize byte vector length.

I'm a bit worried about throwing a factor of 32 in vector length on
the floor here (or requiring saving a huge vector state),

Jakub talked a bit about this in his reply.

Yes, I saw this. My worry here is that these sorts of things are in the realm
of the microarchitecture leaking into the ABI, and I'd really like to avoid
that. It might be a bit of a pipe dream, but I'd still like to give it a shot
-- I'm really trying to avoid a huge ABI explosion in RISC-V land, as that's a
mess.

Post by Richard Henderson

particularly as I
think that most vectorized code won't need to worry about calling standard ABI
functions.

Well, yes, most things that we can vectorize don't need this.
But loops that would use this ABI would otherwise be non-vectorizable.

Can we actually just sit down with everyone and talk about this at some point?
We're doing RISC-V things at Plumbers, ELC-E, and FOSDEM (as well as the RISC-V
Summit). I feel like doing this over email is going to be inefficient, largely
because I'm just going to be stuck making stupid responses for a while since I
really don't know what I'm doing here.

Thanks for spending so much time on this!