[RFC][GCC][mid-end] Support vectorization of complex numbers using machine instructions.

Discussion:

Tamar Christina

2018-10-16 14:32:51 UTC

Hi All,

I am trying to add support to the auto-vectorizer for complex operations where
a target has instructions for.

The instructions I have are only available as vector instructions. The operations
are complex addition with a rotation or complex fmla with a rotation for
half floats, floats and doubles.

They expect the complex number to be broken down and stored in vectors as
real/img parts. GCC already does this first part when it lowers complex numbers
very early on in tree, so that's good.

As a simple example, I am trying to get GCC to emit an internal function
.FCOMPLEX_ADD_ROT_90 (Complex addition with a 90* rotation)
when the target supports it.

my C example is:

void f90 (double complex a[N], double complex b[N], double complex c[N])
{
for (int i=0; i < N; i++)
c[i] = a[i] + b[i] * I;
}

Which in tree looks like

_3 = a_15(D) + _2;
_12 = REALPART_EXPR <*_3>;
_22 = IMAGPART_EXPR <*_3>;
_5 = b_16(D) + _2;
_6 = IMAGPART_EXPR <*_5>;
_8 = REALPART_EXPR <*_5>;
_10 = c_17(D) + _2;
_4 = _12 - _6;
_13 = _8 + _22;
REALPART_EXPR <*_10> = _4;
IMAGPART_EXPR <*_10> = _13;

after some rewriting from match.pd.

what I'm after is for it to get rewritten as something like

_3 = a_15(D) + _2;
_5 = b_16(D) + _2;
_10 = c_17(D) + _2;
*_10 = .FCOMPLEX_ADD_ROT_90 (*_5, *_3)

1) My first attempt to do this was in tree-vect-patterns.c as just another
vectorizer pattern. The first problem is that I need to match a pair of
statements

REALPART_EXPR <*_10> = _4;
IMAGPART_EXPR <*_10> = _13;

and not just a single one. This I can solve with getting the gsi for the
statement being inspected and walking back up the tree to find the second pair.

This works, but I am stopped by that the vectorizer (quite reasonably) doesn't
know what to do when the statement is already a vector stmt. So it bails out
and rejects the pattern substitution.

2) I thought about introducing two internal FN that would be treated as a pair to
match against later, but not sure this would work. The problem with generating
the two internal functions or doing the whole matching in combine (the vectorizer
will always vectorize this so I could match the add and sub in a pattern later)
is that I need to prevent it from treating them as a compound structure and
instead just as a normal vector. In AArch64 terms I want to stop it from doing
ld2 (load multiple 2-elem structures) and instead use ld1 loads (load multiple
single element structures). In certain cases (rotations) it also thinks it has a
permute and inserts a rotate in there which is also not desired.

3) So I abandoned vec-patterns and instead tried to do it in tree-vect-slp.c in
vect_analyze_slp_instance just after the SLP tree is created. Matching the SLP
tree is quite simple and getting it to emit the right SLP tree was simple enough,
except that at this point all data references and loads have already been calculated.

Which left me in a very painful process of removing the loads and forced me to
reconstruct all this information. But I kept hitting more and more things I
needed to manually recreate, which feels like not the right approach. If I just
add a new stmt in and leave the ones in place, it just ends up getting ignored
silently.

My guess is because this statement has no data reference to anything.

Any suggestions on what would be the right approach and that would be acceptable
for upstreaming?

Thanks,
Tamar

Richard Sandiford

2018-10-16 21:22:52 UTC

Permalink

Post by Tamar Christina
Hi All,
I am trying to add support to the auto-vectorizer for complex operations where
a target has instructions for.
The instructions I have are only available as vector instructions. The operations
are complex addition with a rotation or complex fmla with a rotation for
half floats, floats and doubles.
They expect the complex number to be broken down and stored in vectors as
real/img parts. GCC already does this first part when it lowers complex numbers
very early on in tree, so that's good.
As a simple example, I am trying to get GCC to emit an internal function
.FCOMPLEX_ADD_ROT_90 (Complex addition with a 90* rotation)
when the target supports it.
void f90 (double complex a[N], double complex b[N], double complex c[N])
{
for (int i=0; i < N; i++)
c[i] = a[i] + b[i] * I;
}
Which in tree looks like
_3 = a_15(D) + _2;
_12 = REALPART_EXPR <*_3>;
_22 = IMAGPART_EXPR <*_3>;
_5 = b_16(D) + _2;
_6 = IMAGPART_EXPR <*_5>;
_8 = REALPART_EXPR <*_5>;
_10 = c_17(D) + _2;
_4 = _12 - _6;
_13 = _8 + _22;
REALPART_EXPR <*_10> = _4;
IMAGPART_EXPR <*_10> = _13;
[...]
3) So I abandoned vec-patterns and instead tried to do it in
tree-vect-slp.c in vect_analyze_slp_instance just after the SLP tree
is created. Matching the SLP tree is quite simple and getting it to
emit the right SLP tree was simple enough,except that at this point
all data references and loads have already been calculated.

(3) seems like the way to go. Can you explain in more detail why it
didn't work? The SLP tree after matching should look something like this:

REALPART_EXPR <*_10> = _4;
IMAGPART_EXPR <*_10> = _13;

_4 = .COMPLEX_ADD_ROT_90 (_12, _8)
_13 = .COMPLEX_ADD_ROT_90 (_22, _6)

_12 = REALPART_EXPR <*_3>;
_22 = IMAGPART_EXPR <*_3>;

_8 = REALPART_EXPR <*_5>;
_6 = IMAGPART_EXPR <*_5>;

The operands to the individual .COMPLEX_ADD_ROT_90s aren't the
operands that actually determine the associated scalar result, but
that's bound to be the case with something that includes an internal
permute. All we're trying to describe is an operation that does the
right thing when vectorised.

If you didn't have the .COMPLEX_ADD_ROT_90 and just fell back
on mixed two-operator SLP, the final node would be in the
opposite order:

_6 = IMAGPART_EXPR <*_5>;
_8 = REALPART_EXPR <*_5>;

So if you're doing the matching after building the initial tree,
you'd need to swap the statements in that node so that _8 comes
first and cancel the associated load permute. If you're doing the
matching on the fly while building the SLP tree then the subnodes
should start out in the right order.

Thanks,
Richard

Tamar Christina

2018-10-17 12:41:52 UTC

Permalink

Hi Richard,

Post by Tamar Christina
[...]
3) So I abandoned vec-patterns and instead tried to do it in
tree-vect-slp.c in vect_analyze_slp_instance just after the SLP tree
is created. Matching the SLP tree is quite simple and getting it to
emit the right SLP tree was simple enough,except that at this point
all data references and loads have already been calculated.

(3) seems like the way to go. Can you explain in more detail why it didn't
REALPART_EXPR <*_10> = _4;
IMAGPART_EXPR <*_10> = _13;
_4 = .COMPLEX_ADD_ROT_90 (_12, _8)
_13 = .COMPLEX_ADD_ROT_90 (_22, _6)
_12 = REALPART_EXPR <*_3>;
_22 = IMAGPART_EXPR <*_3>;
_8 = REALPART_EXPR <*_5>;
_6 = IMAGPART_EXPR <*_5>;
The operands to the individual .COMPLEX_ADD_ROT_90s aren't the
operands that actually determine the associated scalar result, but that's
bound to be the case with something that includes an internal permute. All
we're trying to describe is an operation that does the right thing when
vectorised.
If you didn't have the .COMPLEX_ADD_ROT_90 and just fell back on mixed
_6 = IMAGPART_EXPR <*_5>;
_8 = REALPART_EXPR <*_5>;
So if you're doing the matching after building the initial tree, you'd need to
swap the statements in that node so that _8 comes first and cancel the
associated load permute. If you're doing the matching on the fly while
building the SLP tree then the subnodes should start out in the right order.

Ah, I hadn't tried it this way because in the SLP version, I had originally started with looking
at the complex fma, which would have a considerably longer match pattern.

_3 = c_14(D) + _2;
_11 = REALPART_EXPR <*_3>;
_21 = IMAGPART_EXPR <*_3>;
_5 = a_15(D) + _2;
_22 = REALPART_EXPR <*_5>;
_12 = IMAGPART_EXPR <*_5>;
_7 = b_16(D) + _2;
_19 = REALPART_EXPR <*_7>;
_20 = IMAGPART_EXPR <*_7>;
_25 = _19 * _22;
_26 = _12 * _20;
_27 = _20 * _22;
_28 = _12 * _19;
_29 = _25 - _26;
_30 = _27 + _28;
_31 = _11 + _29;
_32 = _21 + _30;
REALPART_EXPR <*_3> = _31;
IMAGPART_EXPR <*_3> = _32;

So In this case I should replace _31 and _32 right? but I can't remove the other statements otherwise it'll complain later about the missing references. I could replace _31 and _32 with something using all the variables I would need, however when I tried this previously in vect-patterns there was a block on build in functions with more than 4 arguments (and currently 3 is the limit for built in functions in the def file as well). I don’t know if that same limitation is in place if I replace it in SLP.

The complex add basically creates this vector

b⋅c - e⋅f + l
b⋅e + c⋅f + n

so I'd need 5 parameters and then I'm guessing the other expressions would be removed by DCE at some point?

Kind Regards,
Tamar

Thanks

Richard Sandiford

2018-10-17 20:37:02 UTC

Permalink

Post by Tamar Christina
Hi Richard,

Ah, I hadn't tried it this way because in the SLP version, I had originally started with looking
at the complex fma, which would have a considerably longer match pattern.
_3 = c_14(D) + _2;
_11 = REALPART_EXPR <*_3>;
_21 = IMAGPART_EXPR <*_3>;
_5 = a_15(D) + _2;
_22 = REALPART_EXPR <*_5>;
_12 = IMAGPART_EXPR <*_5>;
_7 = b_16(D) + _2;
_19 = REALPART_EXPR <*_7>;
_20 = IMAGPART_EXPR <*_7>;
_25 = _19 * _22;
_26 = _12 * _20;
_27 = _20 * _22;
_28 = _12 * _19;
_29 = _25 - _26;
_30 = _27 + _28;
_31 = _11 + _29;
_32 = _21 + _30;
REALPART_EXPR <*_3> = _31;
IMAGPART_EXPR <*_3> = _32;
So In this case I should replace _31 and _32 right? but I can't remove the other statements otherwise it'll complain later about the missing references. I could replace _31 and _32 with something using all the variables I would need, however when I tried this previously in vect-patterns there was a block on build in functions with more than 4 arguments (and currently 3 is the limit for built in functions in the def file as well). I don’t know if that same limitation is in place if I replace it in SLP.
The complex add basically creates this vector
b⋅c - e⋅f + l
b⋅e + c⋅f + n
so I'd need 5 parameters and then I'm guessing the other expressions would be removed by DCE at some point?

Are you planning to make the FCMLA behaviour directly available
as an internal function or provide a higher-level one that does
a full complex multiply, with the target lowering that into
individual instructions where necessary?

Either way, each individual FCMLA should only need three scalar inputs.
Like with FCADD, it doesn't matter whether the operands to the individual
scalar FCMLAs are the ones (or the only ones) that determine the
associated FCMLA scalar result. All the node needs to do is describe
something that would work when vectorised.

What to do with the intermediate results you don't need is an
interesting question :-). Like you say, I was hoping DCE would
get rid of them later. Does that not work?

Thanks,
Richard

Tamar Christina

2018-10-18 17:14:27 UTC

Permalink

Hi Richard,

Thanks for all the help so far,

Post by Tamar Christina

Post by Tamar Christina
so I'd need 5 parameters and then I'm guessing the other expressions

would be removed by DCE at some point?
Are you planning to make the FCMLA behaviour directly available as an
internal function or provide a higher-level one that does a full complex
multiply, with the target lowering that into individual instructions where
necessary?

I was planning on doing it as one internal function and leave it up to the target
to expand it however it needs to.

Post by Tamar Christina
Either way, each individual FCMLA should only need three scalar inputs.
Like with FCADD, it doesn't matter whether the operands to the individual
scalar FCMLAs are the ones (or the only ones) that determine the associated
FCMLA scalar result. All the node needs to do is describe something that
would work when vectorised.

Ah yes that makes sense. I see what you mean.

Post by Tamar Christina
What to do with the intermediate results you don't need is an interesting
question :-). Like you say, I was hoping DCE would get rid of them later.
Does that not work?

I haven't tried it yet 😊 But I assume it'll work too.
I have complex add almost working, it generates the right code for the vectorized
loop. The loads are also corrected and the permute is gone and I update all the data references
for the two statements I replaced.

However for the scalar tail loop I have a problem since I only have vector
versions of the instructions, and the scalar loop is created from the same SLP tree.
So I end up with the builtins in the tail loop with nothing to expand them to and with
no way to differentiate between the two calls to the internal fn.

I would need to somehow undo this for the scalar part..

Kind Regards,
Tamar

Post by Tamar Christina
Thank

Richard Sandiford

2018-10-19 07:50:55 UTC

Permalink

Post by Tamar Christina

Post by Tamar Christina
so I'd need 5 parameters and then I'm guessing the other expressions

I was planning on doing it as one internal function and leave it up to
the target to expand it however it needs to.

OK, sounds good.

Post by Tamar Christina

Post by Tamar Christina
What to do with the intermediate results you don't need is an interesting
question :-). Like you say, I was hoping DCE would get rid of them later.
Does that not work?

Not sure what you mean by the last bit. Why do you need to replace
data references rather than just use the existing ones?

Post by Tamar Christina
However for the scalar tail loop I have a problem since I only have
vector versions of the instructions, and the scalar loop is created
from the same SLP tree. So I end up with the builtins in the tail
loop with nothing to expand them to and with no way to differentiate
between the two calls to the internal fn.
I would need to somehow undo this for the scalar part..

The epilogue loop should just be a copy of the basic block before
vectorisation is applied. The new calls shouldn't be in that,
just in the SLP tree. (This is how pattern statements work too:
they're never added to the basic block, they're just temporary
statements attached to internal vectoriser structures.)

Thanks,
Richard