Thomas Schwinge
2018-10-12 17:35:09 UTC
Hi!
This is my first time looking into the existing vectorization
functionality in GCC (yay!), and with that also my first time
encountering GCC's scalar evolution (scev) machinery (yay!), and the
chains of recurrences (chrec) that it uses (yay!).
Obviously, I'm doing my own reading and experimenting right now, but
maybe somebody can cut that short if my current question doesn't make
much sense and is thus easily answered:
    int a[N_J][N_I];

    #pragma acc loop collapse(2)
    for (int j = 0; j < N_J; ++j)
      for (int i = 0; i < N_I; ++i)
        a[j][i] = 0;
Without "-fopenacc" (so the pragma is ignored), this vectorizes (for
the x86_64 target, for example, without OpenACC code offloading). It
also vectorizes with "-fopenacc" enabled if the "collapse(2)" clause is
removed and another "#pragma acc loop" is added in front of the inner
"i" loop instead. But with the "collapse(2)" clause in effect, these
two nested loops get, well, "collapse"d by omp-expand into one:
    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
      {
        int j = tmp / N_I;
        int i = tmp % N_I;
        a[j][i] = 0;
      }
This does not vectorize, because scalar evolution runs into
unhandled (chrec_dont_know) TRUNC_DIV_EXPR and TRUNC_MOD_EXPR in
gcc/tree-scalar-evolution.c:interpret_rhs_expression. Do I have a chance
of teaching it to handle these without much effort?
If that's not reasonable, I shall look for other options to address the
problem that currently vectorization gets pessimized by "-fopenacc" and
in particular the code rewriting for the "collapse" clause.
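To illustrate why the division and modulo are awkward here, a minimal
sketch (plain C, not GCC code; the N_I and N_J values and the function
name count_i_wraps are made up for this example): as far as I understand
it, an affine chrec describes a value that advances by a fixed step per
iteration, but "i = tmp % N_I" advances by +1 only until it wraps back to
0 at each row boundary, and "j = tmp / N_I" changes only at those same
points.

```c
#define N_J 4   /* arbitrary example extent of the outer loop */
#define N_I 6   /* arbitrary example extent of the inner loop */

/* Count the iterations of the collapsed loop at which i does NOT
   advance by exactly +1 relative to the previous iteration, that is,
   where it wraps back to 0.  A single affine recurrence with a
   constant step would have no such iterations.  */
int count_i_wraps(void)
{
  int wraps = 0;
  int prev_i = 0;                 /* value of i at tmp == 0 */
  for (int tmp = 1; tmp < N_J * N_I; ++tmp)
    {
      int i = tmp % N_I;
      if (i != prev_i + 1)
        ++wraps;                  /* row boundary: i jumped back to 0 */
      prev_i = i;
    }
  return wraps;
}
```

With these extents, the step of i breaks once per row boundary, so
neither i nor j can be described by one constant-step recurrence over
the whole collapsed loop.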
By the way, the problem also shows up in a similar OpenMP example: when
such a "collapse" clause is present, the inner loop's code no longer
vectorizes. (But I've not considered that case in any more detail; Jakub
CCed in case that's something to look into? Basically, I don't know how
OpenMP threads' loop iterations are meant to interact with OpenMP SIMD.)
Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
also present when running the following code through the vectorizer:
    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
      {
        int j = tmp / N_I;
        int i = tmp % N_I;
        a[j][i] = 0;
      }
... whereas the following variant (obviously) does vectorize:
    int a[N_J * N_I];

    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
      a[tmp] = 0;
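For reference, the div/mod form and the flat form visit exactly the same
memory locations in the same order, since C arrays are row-major and
(tmp / N_I) * N_I + tmp % N_I == tmp for non-negative tmp. A small
sketch that checks this (the extents and the function name
count_mismatches are arbitrary, chosen only for this example):

```c
#include <stddef.h>

#define N_J 4   /* arbitrary example extents */
#define N_I 6

static int a[N_J][N_I];

/* Return the number of iterations in which the address of a[j][i],
   with j and i recovered from tmp by division and modulo, differs
   from the address of the tmp-th int counted from the start of a.  */
int count_mismatches(void)
{
  int mismatches = 0;
  for (int tmp = 0; tmp < N_J * N_I; ++tmp)
    {
      int j = tmp / N_I;
      int i = tmp % N_I;
      if ((char *) &a[j][i] != (char *) a + (size_t) tmp * sizeof (int))
        ++mismatches;
    }
  return mismatches;
}
```

So in this simple case, the access could in principle be rewritten as a
unit-stride one; the open question is only whether the compiler can be
taught to see that.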
Hmm. Linearization. From a quick search, I found some 2010 work by
Sebastian Pop on that topic, in the Graphite context
(gcc/graphite-flattening.c), but that got pulled out again in 2012.
(I have not yet looked up the history, and have not yet looked whether
that'd be relevant here at all -- and we're not using Graphite here.)
Regarding that, am I missing something obvious?
Grüße
Thomas