Tamar Christina
2018-10-16 14:32:51 UTC
Hi All,
I am trying to add support to the auto-vectorizer for complex operations that
a target has instructions for.
The instructions I have are only available as vector instructions. The operations
are complex addition with a rotation or complex fmla with a rotation for
half floats, floats and doubles.
They expect the complex number to be broken down and stored in vectors as
real/imaginary parts. GCC already does this first part when it lowers complex
numbers very early on in tree, so that's good.
As a simple example, I am trying to get GCC to emit an internal function
.FCOMPLEX_ADD_ROT_90 (complex addition with a 90 degree rotation)
when the target supports it.
My C example is:
#include <complex.h>

void f90 (double complex a[N], double complex b[N], double complex c[N])
{
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i] * I;
}
Which in tree looks like
_3 = a_15(D) + _2;
_12 = REALPART_EXPR <*_3>;
_22 = IMAGPART_EXPR <*_3>;
_5 = b_16(D) + _2;
_6 = IMAGPART_EXPR <*_5>;
_8 = REALPART_EXPR <*_5>;
_10 = c_17(D) + _2;
_4 = _12 - _6;
_13 = _8 + _22;
REALPART_EXPR <*_10> = _4;
IMAGPART_EXPR <*_10> = _13;
after some rewriting from match.pd.
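For reference, the lowered form above can be written out in plain C. This is only a sketch: add_rot90_parts and check_lowering are hypothetical names used for illustration and are not part of GCC. It shows that the subtract/add pair on the real and imaginary parts computes the same value as a + b * I.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical helper (not a GCC function): computes a + b*I from separate
   real and imaginary parts, mirroring the lowered statements above.  */
static void add_rot90_parts (double a_re, double a_im,
                             double b_re, double b_im,
                             double *c_re, double *c_im)
{
  *c_re = a_re - b_im;   /* _4  = _12 - _6;  */
  *c_im = b_re + a_im;   /* _13 = _8 + _22;  */
}

/* Hypothetical check: asserts that the lowered form agrees with the
   complex expression from the C example for the given finite inputs.  */
static void check_lowering (double complex a, double complex b)
{
  double re, im;
  add_rot90_parts (creal (a), cimag (a), creal (b), cimag (b), &re, &im);

  double complex ref = a + b * I;
  assert (re == creal (ref));
  assert (im == cimag (ref));
}
```

For example, a = 1 + 2i and b = 3 + 4i give a real part of 1 - 4 and an imaginary part of 3 + 2.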
What I'm after is for it to be rewritten as something like
_3 = a_15(D) + _2;
_5 = b_16(D) + _2;
_10 = c_17(D) + _2;
*_10 = .FCOMPLEX_ADD_ROT_90 (*_5, *_3)
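To make the intended semantics concrete, here is a scalar model of what I would expect such an internal function to compute on one vector. This is a sketch under assumptions: the name fcomplex_add_rot_90_model, the (a, b) argument order with b as the rotated operand, and the interleaved lane layout (re0, im0, re1, im1, ...) are mine for illustration, not an existing GCC interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of .FCOMPLEX_ADD_ROT_90 (not GCC code).
   Both inputs hold complex numbers interleaved as re0, im0, re1, im1, ...
   and the result is a + b*I per complex element, in the same layout.  */
static void fcomplex_add_rot_90_model (const double *a, const double *b,
                                       double *c, size_t lanes)
{
  assert (lanes % 2 == 0);   /* each complex number takes two lanes */

  for (size_t i = 0; i < lanes; i += 2)
    {
      double re = a[i] - b[i + 1];       /* real: a.re - b.im */
      double im = a[i + 1] + b[i];       /* imag: a.im + b.re */
      c[i] = re;
      c[i + 1] = im;
    }
}
```

The point of the model is that no de-interleaving is needed: each output lane only depends on the matching pair of lanes in the two inputs.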
1) My first attempt to do this was in tree-vect-patterns.c as just another
vectorizer pattern. The first problem is that I need to match a pair of
statements
REALPART_EXPR <*_10> = _4;
IMAGPART_EXPR <*_10> = _13;
and not just a single one. This I can solve by getting the gsi for the
statement being inspected and walking back up the tree to find the other
statement of the pair. This works, but I am stopped by the fact that the
vectorizer (quite reasonably) doesn't know what to do when the statement is
already a vector stmt. So it bails out and rejects the pattern substitution.
2) I thought about introducing two internal functions that would be treated as a
pair to match against later, but I am not sure this would work. The problem with generating
the two internal functions or doing the whole matching in combine (the vectorizer
will always vectorize this so I could match the add and sub in a pattern later)
is that I need to prevent it from treating them as a compound structure and
instead just as a normal vector. In AArch64 terms I want to stop it from doing
ld2 (load multiple 2-elem structures) and instead use ld1 loads (load multiple
single element structures). In certain cases (rotations) it also thinks it has a
permute and inserts a rotate in there which is also not desired.
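The difference between the two load styles can be modelled in scalar C. This is only an illustration of the data layout: the function names are hypothetical and the code is neither AArch64 intrinsics nor GCC code.

```c
#include <assert.h>
#include <stddef.h>

/* ld1-style model: a contiguous load, so the lanes stay interleaved as
   re0, im0, re1, im1, ...  This is the layout wanted here.  */
static void load_contiguous_model (const double *mem, double *vec,
                                   size_t lanes)
{
  assert (mem && vec);
  for (size_t i = 0; i < lanes; i++)
    vec[i] = mem[i];
}

/* ld2-style model: n two-element structures are de-interleaved into two
   vectors, one with all real parts and one with all imaginary parts.  */
static void load_deinterleave_model (const double *mem, double *even,
                                     double *odd, size_t n)
{
  assert (mem && even && odd);
  for (size_t i = 0; i < n; i++)
    {
      even[i] = mem[2 * i];       /* real parts */
      odd[i] = mem[2 * i + 1];    /* imaginary parts */
    }
}
```

With the de-interleaved form the real and imaginary parts end up in different registers, which does not match instructions that expect both parts of each complex number side by side in one vector.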
3) So I abandoned the vectorizer patterns and instead tried to do it in
tree-vect-slp.c in vect_analyze_slp_instance, just after the SLP tree is
created. Matching the SLP tree is quite simple, and getting it to emit the
right SLP tree was simple enough, except that at this point all data references
and loads have already been calculated. That left me in a very painful process
of removing the loads and forced me to reconstruct all this information. But I
kept hitting more and more things I needed to manually recreate, which does not
feel like the right approach. If I just add a new stmt and leave the existing
ones in place, the new stmt just ends up getting silently ignored.
My guess is that this is because the new statement has no data reference to
anything.
Any suggestions on what the right approach would be, and one that would be
acceptable for upstreaming?
Thanks,
Tamar