Discussion:
Can we support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?
Thomas Schwinge
2018-10-12 17:35:09 UTC
Hi!

I'm for the first time looking into the existing vectorization
functionality in GCC (yay!), and with that I'm also for the first time
encountering GCC's scalar evolution (scev) machinery (yay!), and the
chains of recurrences (chrec) used by that (yay!).

Obviously, I'm right now doing my own reading and experimenting, but
maybe somebody can cut that short, if my current question doesn't make
much sense, and is thus easily answered:

int a[N_J][N_I];

#pragma acc loop collapse(2)
for (int j = 0; j < N_J; ++j)
  for (int i = 0; i < N_I; ++i)
    a[j][i] = 0;

Without "-fopenacc" (thus the pragma ignored), this does vectorize (for
the x86_64 target, for example, without OpenACC code offloading), and
also does it vectorize with "-fopenacc" enabled but the "collapse(2)"
clause removed and instead another "#pragma acc loop" added in front of
the inner "i" loop. But with the "collapse(2)" clause in effect, these
two nested loops get, well, "collapse"d by omp-expand into one:

for (int tmp = 0; tmp < N_J * N_I; ++tmp)
  {
    int j = tmp / N_I;
    int i = tmp % N_I;
    a[j][i] = 0;
  }

This does not vectorize because of scalar evolution running into
unhandled (chrec_dont_know) TRUNC_DIV_EXPR and TRUNC_MOD_EXPR in
gcc/tree-scalar-evolution.c:interpret_rhs_expression. Do I have a chance
of teaching it to handle these without big effort?


If that's not reasonable, I shall look for other options to address the
problem that vectorization currently gets pessimized by "-fopenacc", and
in particular by the code rewriting for the "collapse" clause.


By the way, the problem can similarly be demonstrated with an OpenMP
example: there, too, when such a "collapse" clause is present, the inner
loop's code no longer vectorizes. (But I've not considered that case in
any more detail; Jakub CCed in case that's something to look into? I
don't know how OpenMP threads' loop iterations are meant to interact with
OpenMP SIMD, basically.)


Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
also present when running the following code through the vectorizer:

for (int tmp = 0; tmp < N_J * N_I; ++tmp)
  {
    int j = tmp / N_I;
    int i = tmp % N_I;
    a[j][i] = 0;
  }

... whereas the following variant (obviously) does vectorize:

int a[N_J * N_I];

for (int tmp = 0; tmp < N_J * N_I; ++tmp)
  a[tmp] = 0;
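
For reference, a self-contained form of the failing case (my reduction;
the concrete macro values are arbitrary), where something like
"-O3 -fopt-info-vec-missed" reports the div/mod loop as not vectorized:

#define N_J 100
#define N_I 100

int a[N_J][N_I];

void collapsed (void)
{
  for (int tmp = 0; tmp < N_J * N_I; ++tmp)
    {
      int j = tmp / N_I;  /* scev: unhandled TRUNC_DIV_EXPR */
      int i = tmp % N_I;  /* scev: unhandled TRUNC_MOD_EXPR */
      a[j][i] = 0;
    }
}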

Hmm. Linearization. From a quick search, I found some 2010 work by
Sebastian Pop on that topic, in the Graphite context
(gcc/graphite-flattening.c), but that got pulled out again in 2012.
(I have not yet looked up the history, and have not yet checked whether
that'd be relevant here at all -- and we're not using Graphite here.)

Regarding that, am I missing something obvious?


Grüße
Thomas
Marc Glisse
2018-10-12 19:14:11 UTC
Post by Thomas Schwinge
Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
for (int tmp = 0; tmp < N_J * N_I; ++tmp)
{
int j = tmp / N_I;
int i = tmp % N_I;
a[j][i] = 0;
}
int a[N_J * N_I];
for (int tmp = 0; tmp < N_J * N_I; ++tmp)
a[tmp] = 0;
I had a quick look at the difference, and a[j][i] remains in this form
throughout optimization. If I write instead *((*(a+j))+i) = 0; I get

j_10 = tmp_17 / 1025;
i_11 = tmp_17 % 1025;
_1 = (long unsigned int) j_10;
_2 = _1 * 1025;
_3 = (sizetype) i_11;
_4 = _2 + _3;

or for a power of 2

j_10 = tmp_17 >> 10;
i_11 = tmp_17 & 1023;
_1 = (long unsigned int) j_10;
_2 = _1 * 1024;
_3 = (sizetype) i_11;
_4 = _2 + _3;

and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
I think that's true).

So there are missing match.pd transformations in addition to whatever
scev/ivdep/other work is needed.
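
For reference, the single-type identity itself is unconditional: C
guarantees (a/b)*b + a%b == a whenever a/b is representable (C11 6.5.5),
which a quick sanity check confirms (untested sketch):

#include <assert.h>

int main (void)
{
  for (int tmp = -10000; tmp < 10000; ++tmp)
    {
      int j = tmp / 100;
      int i = tmp % 100;
      /* trunc_div/trunc_mod round-trip, also for negative tmp.  */
      assert (j * 100 + i == tmp);
    }
  return 0;
}

So the difficulty is entirely in the conversions, not the arithmetic.
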
--
Marc Glisse
Sebastian Pop
2018-10-15 20:22:29 UTC
Post by Marc Glisse
Post by Thomas Schwinge
Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
for (int tmp = 0; tmp < N_J * N_I; ++tmp)
{
int j = tmp / N_I;
int i = tmp % N_I;
a[j][i] = 0;
}
int a[N_J * N_I];
for (int tmp = 0; tmp < N_J * N_I; ++tmp)
a[tmp] = 0;
I had a quick look at the difference, and a[j][i] remains in this form
throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
j_10 = tmp_17 / 1025;
i_11 = tmp_17 % 1025;
_1 = (long unsigned int) j_10;
_2 = _1 * 1025;
_3 = (sizetype) i_11;
_4 = _2 + _3;
or for a power of 2
j_10 = tmp_17 >> 10;
i_11 = tmp_17 & 1023;
_1 = (long unsigned int) j_10;
_2 = _1 * 1024;
_3 = (sizetype) i_11;
_4 = _2 + _3;
and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
I think that's true).
If this folding is correct, the dependence analysis would not have
to handle array accesses with div and mod, and it would be able
to classify the loop as parallel, which would enable vectorization.
Post by Marc Glisse
So there are missing match.pd transformations in addition to whatever
scev/ivdep/other work is needed.
--
Marc Glisse
Thomas Schwinge
2018-10-22 16:35:36 UTC
Hi!

Thanks for all your comments already! I continued looking into this for a
bit (but then got interrupted by a higher-priority task). Regarding this:
Post by Marc Glisse
Post by Thomas Schwinge
Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
for (int tmp = 0; tmp < N_J * N_I; ++tmp)
{
int j = tmp / N_I;
int i = tmp % N_I;
a[j][i] = 0;
}
int a[N_J * N_I];
for (int tmp = 0; tmp < N_J * N_I; ++tmp)
a[tmp] = 0;
I had a quick look at the difference, and a[j][i] remains in this form
throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
j_10 = tmp_17 / 1025;
i_11 = tmp_17 % 1025;
_1 = (long unsigned int) j_10;
_2 = _1 * 1025;
_3 = (sizetype) i_11;
_4 = _2 + _3;
or for a power of 2
j_10 = tmp_17 >> 10;
i_11 = tmp_17 & 1023;
_1 = (long unsigned int) j_10;
_2 = _1 * 1024;
_3 = (sizetype) i_11;
_4 = _2 + _3;
and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
I think that's true).
So there are missing match.pd transformations in addition to whatever
scev/ivdep/other work is needed.
With a very simplistic "match.pd" rule (not yet doing any special-case
checking etc.):

diff --git gcc/match.pd gcc/match.pd
index b36d7ccb5dc3..4c23116308da 100644
--- gcc/match.pd
+++ gcc/match.pd
@@ -5126,3 +5126,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
{ wide_int_to_tree (sizetype, off); })
{ swap_p ? @0 : @2; }))
{ rhs_tree; })))))))))
+
+/* Given:
+
+     j = in / N_I
+     i = in % N_I
+
+   ..., fold:
+
+     out = j * N_I + i
+
+   ..., into:
+
+     out = in
+*/
+
+/* As long as we only consider N_I being INTEGER_CST (which is always the
+   second argument?), we probably don't need ":c" variants? */
+
+(simplify
+ (plus:c
+  (mult:c
+   (trunc_div @0 INTEGER_CST@1)
+   INTEGER_CST@1)
+  (trunc_mod @0 INTEGER_CST@1))
+ (convert @0))

..., the original code:

int f1(int in)
{
  int j = in / N_I;
  int i = in % N_I;

  int out = j * N_I + i;

  return out;
}

... gets simplified from ("div-mod-0.c.027t.objsz1"):

f1 (int in)
{
  int out;
  int i;
  int j;
  int _1;
  int _6;

  <bb 2> :
  gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
  gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
  gimple_assign <mult_expr, _1, j_3, 100, NULL>
  gimple_assign <plus_expr, out_5, i_4, _1, NULL>
  gimple_assign <ssa_name, _6, out_5, NULL, NULL>
  gimple_return <_6>

}

... to ("div-mod-0.c.028t.ccp1"):

f1 (int in)
{
  int out;
  int i;
  int j;
  int _1;

  <bb 2> :
  gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
  gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
  gimple_assign <mult_expr, _1, j_3, 100, NULL>
  gimple_return <in_2(D)>

}

(The three dead "gimple_assign"s get eliminated later on.)

So, that works.

However, it doesn't work yet for the original construct that I'd run
into, which looks like this:

[...]
  int i;
  int j;
[...]
  signed int .offset.5_2;
[...]
  unsigned int .offset.7_23;
  unsigned int .iter.0_24;
  unsigned int _25;
  unsigned int _26;
[...]
  unsigned int .iter.0_32;
[...]

  <bb 9> :
  # gimple_phi <.offset.5_2, .offset.5_21(8), .offset.5_30(9)>
  gimple_assign <nop_expr, .offset.7_23, .offset.5_2, NULL, NULL>
  gimple_assign <plus_expr, .iter.0_24, 0, .offset.7_23, NULL>
  gimple_assign <trunc_div_expr, _25, .iter.0_24, 100, NULL>
  gimple_assign <trunc_mod_expr, _26, .iter.0_24, 100, NULL>
  gimple_assign <nop_expr, i_27, _26, NULL, NULL>
  gimple_assign <nop_expr, j_28, _25, NULL, NULL>
  gimple_assign <integer_cst, a[j_28][i_27], 123, NULL, NULL>
[...]

Resolving the "a[j][i] = 123" we'll need to look into later.

As Marc noted above, with that changed into "*(*(a + j) + i) = 123", we
get:

[...]
  int i;
  int j;
  long unsigned int _1;
  long unsigned int _2;
  sizetype _3;
  sizetype _4;
  sizetype _5;
  int * _6;
[...]
  signed int .offset.5_8;
[...]
  unsigned int .offset.7_29;
  unsigned int .iter.0_30;
  unsigned int _31;
  unsigned int _32;
[...]

  <bb 9> :
  # gimple_phi <.offset.5_8, .offset.5_27(8), .offset.5_36(9)>
  gimple_assign <nop_expr, .offset.7_29, .offset.5_8, NULL, NULL>
  gimple_assign <plus_expr, .iter.0_30, 0, .offset.7_29, NULL>
  gimple_assign <trunc_div_expr, _31, .iter.0_30, 100, NULL>
  gimple_assign <trunc_mod_expr, _32, .iter.0_30, 100, NULL>
  gimple_assign <nop_expr, i_33, _32, NULL, NULL>
  gimple_assign <nop_expr, j_34, _31, NULL, NULL>
  gimple_assign <nop_expr, _1, j_34, NULL, NULL>
  gimple_assign <mult_expr, _2, _1, 100, NULL>
  gimple_assign <nop_expr, _3, i_33, NULL, NULL>
  gimple_assign <plus_expr, _4, _2, _3, NULL>
  gimple_assign <mult_expr, _5, _4, 4, NULL>
  gimple_assign <pointer_plus_expr, _6, &a, _5, NULL>
  gimple_assign <integer_cst, *_6, 123, NULL, NULL>
[...]

Here, unless I'm confused, "_4" is supposed to be equal to ".iter.0_30",
but "match.pd" doesn't agree yet. Note the many "nop_expr"s here, which
I suppose I have not yet figured out how to handle? I tried some things
but couldn't get it to work. Apparently the existing instances of the
"(match (nop_convert @0)" and "Basic strip-useless-type-conversions /
strip_nops" rules also don't handle these; should they? Or are the types
here in fact mixed up too much?

I hope to get some time again soon to continue looking into this, but if
anybody got any ideas, I'm all ears.


Grüße
Thomas
Richard Biener
2018-10-23 09:14:18 UTC
Post by Thomas Schwinge
Hi!
Thanks for all your comments already! I continued looking into this for a
bit (but then got interrupted by a higher-priority task). Regarding this:
Post by Marc Glisse
Post by Thomas Schwinge
Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
for (int tmp = 0; tmp < N_J * N_I; ++tmp)
{
int j = tmp / N_I;
int i = tmp % N_I;
a[j][i] = 0;
}
int a[N_J * N_I];
for (int tmp = 0; tmp < N_J * N_I; ++tmp)
a[tmp] = 0;
I had a quick look at the difference, and a[j][i] remains in this form
throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
j_10 = tmp_17 / 1025;
i_11 = tmp_17 % 1025;
_1 = (long unsigned int) j_10;
_2 = _1 * 1025;
_3 = (sizetype) i_11;
_4 = _2 + _3;
or for a power of 2
j_10 = tmp_17 >> 10;
i_11 = tmp_17 & 1023;
_1 = (long unsigned int) j_10;
_2 = _1 * 1024;
_3 = (sizetype) i_11;
_4 = _2 + _3;
and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
I think that's true).
So there are missing match.pd transformations in addition to whatever
scev/ivdep/other work is needed.
With a very simplistic "match.pd" rule (not yet doing any special-case checking etc.):
diff --git gcc/match.pd gcc/match.pd
index b36d7ccb5dc3..4c23116308da 100644
--- gcc/match.pd
+++ gcc/match.pd
@@ -5126,3 +5126,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
{ wide_int_to_tree (sizetype, off); })
{ rhs_tree; })))))))))
+
+/* Given:
+
+     j = in / N_I
+     i = in % N_I
+
+   ..., fold:
+
+     out = j * N_I + i
+
+   ..., into:
+
+     out = in
+*/
+
+/* As long as we only consider N_I being INTEGER_CST (which is always the
+   second argument?), we probably don't need ":c" variants? */
+
+(simplify
+ (plus:c
+  (mult:c
+   (trunc_div @0 INTEGER_CST@1)
+   INTEGER_CST@1)
+  (trunc_mod @0 INTEGER_CST@1))
+ (convert @0))
int f1(int in)
{
int j = in / N_I;
int i = in % N_I;
int out = j * N_I + i;
return out;
}
f1 (int in)
{
int out;
int i;
int j;
int _1;
int _6;
gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
gimple_assign <mult_expr, _1, j_3, 100, NULL>
gimple_assign <plus_expr, out_5, i_4, _1, NULL>
gimple_assign <ssa_name, _6, out_5, NULL, NULL>
gimple_return <_6>
}
f1 (int in)
{
int out;
int i;
int j;
int _1;
gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
gimple_assign <mult_expr, _1, j_3, 100, NULL>
gimple_return <in_2(D)>
}
(The three dead "gimple_assign"s get eliminated later on.)
So, that works.
However, it doesn't work yet for the original construct that I'd run
[...]
int i;
int j;
[...]
signed int .offset.5_2;
[...]
unsigned int .offset.7_23;
unsigned int .iter.0_24;
unsigned int _25;
unsigned int _26;
[...]
unsigned int .iter.0_32;
[...]
# gimple_phi <.offset.5_2, .offset.5_21(8), .offset.5_30(9)>
gimple_assign <nop_expr, .offset.7_23, .offset.5_2, NULL, NULL>
gimple_assign <plus_expr, .iter.0_24, 0, .offset.7_23, NULL>
gimple_assign <trunc_div_expr, _25, .iter.0_24, 100, NULL>
gimple_assign <trunc_mod_expr, _26, .iter.0_24, 100, NULL>
gimple_assign <nop_expr, i_27, _26, NULL, NULL>
gimple_assign <nop_expr, j_28, _25, NULL, NULL>
gimple_assign <integer_cst, a[j_28][i_27], 123, NULL, NULL>
[...]
Resolving the "a[j][i] = 123" we'll need to look into later.
As Marc noted above, with that changed into "*(*(a + j) + i) = 123", we
[...]
int i;
int j;
long unsigned int _1;
long unsigned int _2;
sizetype _3;
sizetype _4;
sizetype _5;
int * _6;
[...]
signed int .offset.5_8;
[...]
unsigned int .offset.7_29;
unsigned int .iter.0_30;
unsigned int _31;
unsigned int _32;
[...]
# gimple_phi <.offset.5_8, .offset.5_27(8), .offset.5_36(9)>
gimple_assign <nop_expr, .offset.7_29, .offset.5_8, NULL, NULL>
gimple_assign <plus_expr, .iter.0_30, 0, .offset.7_29, NULL>
gimple_assign <trunc_div_expr, _31, .iter.0_30, 100, NULL>
gimple_assign <trunc_mod_expr, _32, .iter.0_30, 100, NULL>
gimple_assign <nop_expr, i_33, _32, NULL, NULL>
gimple_assign <nop_expr, j_34, _31, NULL, NULL>
gimple_assign <nop_expr, _1, j_34, NULL, NULL>
gimple_assign <mult_expr, _2, _1, 100, NULL>
gimple_assign <nop_expr, _3, i_33, NULL, NULL>
gimple_assign <plus_expr, _4, _2, _3, NULL>
gimple_assign <mult_expr, _5, _4, 4, NULL>
gimple_assign <pointer_plus_expr, _6, &a, _5, NULL>
gimple_assign <integer_cst, *_6, 123, NULL, NULL>
[...]
Here, unless I'm confused, "_4" is supposed to be equal to ".iter.0_30",
but "match.pd" doesn't agree yet. Note the many "nop_expr"s here, which
I suppose I have not yet figured out how to handle? I tried some things
but couldn't get it to work. Apparently the existing instances of the
"(match (nop_convert @0)" and "Basic strip-useless-type-conversions /
strip_nops" rules also don't handle these; should they? Or are the types
here in fact mixed up too much?
There are sign-extensions to larger types involved which are not no-op
ones.

So it's a little bit more complicated than the case you are catching
with the pattern.
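
A concrete instance, assuming 32-bit "int" and 64-bit "unsigned long"
(minimal sketch; the divisor 1 is chosen so that the quotient exceeds
INT_MAX):

#include <stdio.h>

int main (void)
{
  unsigned int x = 0xF0000000u;
  /* x / 1 > INT_MAX, so the conversion to int yields a negative value
     (implementation-defined; modulo-wrapping on GCC).  */
  int j = (int) (x / 1);
  unsigned long lhs = (unsigned long) j * 1 + (unsigned long) (x % 1);
  unsigned long rhs = (unsigned long) x;
  /* Prints 0xfffffffff0000000 vs 0xf0000000: sign-extending j makes
     the folded form differ from the original value.  */
  printf ("%#lx vs %#lx\n", lhs, rhs);
  return 0;
}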

Richard.
Post by Thomas Schwinge
I hope to get some time again soon to continue looking into this, but if
anybody got any ideas, I'm all ears.
Grüße
Thomas
Marc Glisse
2018-11-04 20:53:39 UTC
(resent because of mail issues on my end)
Post by Thomas Schwinge
Post by Marc Glisse
I had a quick look at the difference, and a[j][i] remains in this form
throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
j_10 = tmp_17 / 1025;
i_11 = tmp_17 % 1025;
_1 = (long unsigned int) j_10;
_2 = _1 * 1025;
_3 = (sizetype) i_11;
_4 = _2 + _3;
or for a power of 2
j_10 = tmp_17 >> 10;
i_11 = tmp_17 & 1023;
_1 = (long unsigned int) j_10;
_2 = _1 * 1024;
_3 = (sizetype) i_11;
_4 = _2 + _3;
and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
I think that's true).
So there are missing match.pd transformations in addition to whatever
scev/ivdep/other work is needed.
With a very simplistic "match.pd" rule (not yet doing any special-case checking etc.):
diff --git gcc/match.pd gcc/match.pd
index b36d7ccb5dc3..4c23116308da 100644
--- gcc/match.pd
+++ gcc/match.pd
@@ -5126,3 +5126,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
{ wide_int_to_tree (sizetype, off); })
{ rhs_tree; })))))))))
+
+/* Given:
+
+     j = in / N_I
+     i = in % N_I
+
+   ..., fold:
+
+     out = j * N_I + i
+
+   ..., into:
+
+     out = in
+*/
+
+/* As long as we only consider N_I being INTEGER_CST (which is always the
+   second argument?), we probably don't need ":c" variants? */
+
+(simplify
+ (plus:c
+  (mult:c
+   (trunc_div @0 INTEGER_CST@1)
+   INTEGER_CST@1)
+  (trunc_mod @0 INTEGER_CST@1))
+ (convert @0))
You should only specify INTEGER_CST@1 on the first occurrence; the others
can be just @1. (You may be interested in @@1 at some point, but that
gets tricky.)
Post by Thomas Schwinge
int f1(int in)
{
int j = in / N_I;
int i = in % N_I;
int out = j * N_I + i;
return out;
}
f1 (int in)
{
int out;
int i;
int j;
int _1;
int _6;
gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
gimple_assign <mult_expr, _1, j_3, 100, NULL>
gimple_assign <plus_expr, out_5, i_4, _1, NULL>
gimple_assign <ssa_name, _6, out_5, NULL, NULL>
gimple_return <_6>
}
f1 (int in)
{
int out;
int i;
int j;
int _1;
gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
gimple_assign <mult_expr, _1, j_3, 100, NULL>
gimple_return <in_2(D)>
}
(The three dead "gimple_assign"s get eliminated later on.)
So, that works.
However, it doesn't work yet for the original construct that I'd ran
[...]
int i;
int j;
[...]
signed int .offset.5_2;
[...]
unsigned int .offset.7_23;
unsigned int .iter.0_24;
unsigned int _25;
unsigned int _26;
[...]
unsigned int .iter.0_32;
[...]
# gimple_phi <.offset.5_2, .offset.5_21(8), .offset.5_30(9)>
gimple_assign <nop_expr, .offset.7_23, .offset.5_2, NULL, NULL>
gimple_assign <plus_expr, .iter.0_24, 0, .offset.7_23, NULL>
gimple_assign <trunc_div_expr, _25, .iter.0_24, 100, NULL>
gimple_assign <trunc_mod_expr, _26, .iter.0_24, 100, NULL>
gimple_assign <nop_expr, i_27, _26, NULL, NULL>
gimple_assign <nop_expr, j_28, _25, NULL, NULL>
gimple_assign <integer_cst, a[j_28][i_27], 123, NULL, NULL>
[...]
Resolving the "a[j][i] = 123" we'll need to look into later.
As Marc noted above, with that changed into "*(*(a + j) + i) = 123", we
[...]
int i;
int j;
long unsigned int _1;
long unsigned int _2;
sizetype _3;
sizetype _4;
sizetype _5;
int * _6;
[...]
signed int .offset.5_8;
[...]
unsigned int .offset.7_29;
unsigned int .iter.0_30;
unsigned int _31;
unsigned int _32;
[...]
# gimple_phi <.offset.5_8, .offset.5_27(8), .offset.5_36(9)>
gimple_assign <nop_expr, .offset.7_29, .offset.5_8, NULL, NULL>
gimple_assign <plus_expr, .iter.0_30, 0, .offset.7_29, NULL>
gimple_assign <trunc_div_expr, _31, .iter.0_30, 100, NULL>
gimple_assign <trunc_mod_expr, _32, .iter.0_30, 100, NULL>
gimple_assign <nop_expr, i_33, _32, NULL, NULL>
gimple_assign <nop_expr, j_34, _31, NULL, NULL>
gimple_assign <nop_expr, _1, j_34, NULL, NULL>
gimple_assign <mult_expr, _2, _1, 100, NULL>
gimple_assign <nop_expr, _3, i_33, NULL, NULL>
gimple_assign <plus_expr, _4, _2, _3, NULL>
gimple_assign <mult_expr, _5, _4, 4, NULL>
gimple_assign <pointer_plus_expr, _6, &a, _5, NULL>
gimple_assign <integer_cst, *_6, 123, NULL, NULL>
[...]
Here, unless I'm confused, "_4" is supposed to be equal to ".iter.0_30",
but "match.pd" doesn't agree yet. Note the many "nop_expr"s here, which
I suppose I have not yet figured out how to handle? I tried some things
but couldn't get it to work. Apparently the existing instances of the
"(match (nop_convert @0)" and "Basic strip-useless-type-conversions /
strip_nops" rules also don't handle these; should they? Or are the types
here in fact mixed up too much?
"(match (nop_convert @0)" defines a shortcut so some transformations can
use nop_convert to detect some specific conversions, but it doesn't do
anything by itself. "Basic strip-useless-type-conversions" strips
conversions that are *useless*, essentially from a type to the same
type. If you want to handle true conversions, you need to do that
explicitly, see the many transformations that use convert? convert1?
convert2? and specify for which particular conversions the
transformation is valid. Finding out the right conditions to detect
these conversions is often the most painful part of writing a match.pd
transformation.
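
To illustrate the kind of side condition involved, in C terms rather
than match.pd (untested sketch): widening an unsigned operand is always
safe, whereas a signed intermediate type is only safe if the quotient
provably fits:

#include <stdint.h>

/* Always valid: zero-extension commutes with the unsigned identity.  */
uint64_t fold_safe (uint32_t x)
{
  return (uint64_t) (x / 100) * 100 + (uint64_t) (x % 100);  /* == x */
}

/* Mirrors the GIMPLE above (unsigned int -> int -> unsigned long):
   valid only because x / 100 <= 42949672 always fits in int -- exactly
   the range fact such a transformation's condition would have to
   establish.  */
uint64_t fold_needs_ranges (uint32_t x)
{
  int j = (int) (x / 100);
  int i = (int) (x % 100);
  return (uint64_t) j * 100 + (uint64_t) i;                  /* == x */
}
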
Post by Thomas Schwinge
I hope to get some time again soon to continue looking into this, but if
anybody got any ideas, I'm all ears.
--
Marc Glisse
Jakub Jelinek
2018-10-12 19:52:13 UTC
Post by Thomas Schwinge
int a[N_J][N_I];
#pragma acc loop collapse(2)
for (int j = 0; j < N_J; ++j)
for (int i = 0; i < N_I; ++i)
a[j][i] = 0;
For e.g.
int a[128][128];

void
foo (int m, int n)
{
  #pragma omp for simd collapse(2)
  for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
      a[i][j]++;
}
we emit in the inner loop:
<bb 8> :
  i = i.0;
  j = j.1;
  _1 = a[i][j];
  _2 = _1 + 1;
  a[i][j] = _2;
  .iter.4 = .iter.4 + 1;
  j.1 = j.1 + 1;
  D.2912 = j.1 < n.7 ? 0 : 1;
  i.0 = D.2912 + i.0;
  j.1 = j.1 < n.7 ? j.1 : 0;

<bb 9> :
  if (.iter.4 < D.2902)
    goto <bb 8>; [87.50%]
  else
    goto <bb 10>; [12.50%]
to make it more vectorization-friendly (though in this particular case it
isn't vectorized either) and to avoid the expensive % and / operations
inside the inner loop. Without -fopenmp it does vectorize only the inner
loop; there is no collapse.
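
In plain C, that conditional-increment scheme is roughly (my rendering of
the dump above; i and j start at their initial values):

for (int iter = 0; iter < m * n; ++iter)
  {
    a[i][j]++;
    ++j;
    i += (j < n) ? 0 : 1;  /* carry into the outer IV */
    j = (j < n) ? j : 0;   /* wrap the inner IV */
  }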

Jakub
Richard Biener
2018-10-15 08:55:26 UTC
Post by Jakub Jelinek
Post by Thomas Schwinge
int a[N_J][N_I];
#pragma acc loop collapse(2)
for (int j = 0; j < N_J; ++j)
for (int i = 0; i < N_I; ++i)
a[j][i] = 0;
For e.g.
int a[128][128];
void
foo (int m, int n)
{
#pragma omp for simd collapse(2)
for (int i = 0; i < m; i++)
for (int j = 0; j < n; j++)
a[i][j]++;
}
i = i.0;
j = j.1;
_1 = a[i][j];
_2 = _1 + 1;
a[i][j] = _2;
.iter.4 = .iter.4 + 1;
j.1 = j.1 + 1;
D.2912 = j.1 < n.7 ? 0 : 1;
i.0 = D.2912 + i.0;
j.1 = j.1 < n.7 ? j.1 : 0;
if (.iter.4 < D.2902)
goto <bb 8>; [87.50%]
else
goto <bb 10>; [12.50%]
to make it more vectorization-friendly (though in this particular case it
isn't vectorized either) and to avoid the expensive % and / operations
inside the inner loop. Without -fopenmp it does vectorize only the inner
loop; there is no collapse.
Yeah. Note this still makes the IVs not analyzable since i now effectively
becomes wrapping in the inner loop. For some special values we might
get away with a wrapping CHREC in a bit-precision type but we cannot
represent wrapping at some (possibly non-constant) value.
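
(Concretely, in the expansion above i.0 advances by either 0 or 1 per
inner iteration, depending on whether j.1 just wrapped at n.7, so neither
IV is an affine {base, +, step} function of the collapsed counter.)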

So - collapsing loops is a bad idea. Why's that done anyways?

Richard.
Post by Jakub Jelinek
Jakub
Jakub Jelinek
2018-10-15 09:11:13 UTC
Post by Richard Biener
Yeah. Note this still makes the IVs not analyzable since i now effectively
becomes wrapping in the inner loop. For some special values we might
get away with a wrapping CHREC in a bit-precision type but we cannot
represent wrapping at some (possibly non-constant) value.
So - collapsing loops is a bad idea. Why's that done anyways?
Because the standards (both OpenMP and OpenACC) mandate it if one uses
collapse(2) or more. The semantics is that that many nested loops form one
larger iteration space, which is then handled according to the rules of
the particular construct. Sometimes it can be very beneficial, sometimes
less so, but e.g. with OpenMP the user has the option to say what they
want. They can e.g. do:
#pragma omp distribute
for (int i = 0; i < M; i++)
  #pragma omp parallel for
  for (int j = 0; j < N; j++)
    #pragma omp simd
    for (int k = 0; k < O; k++)
      do_something (i, j, k);
and that way distribute the outermost loop, parallelize the middle one and
vectorize the innermost one, or they can do:
#pragma omp distribute parallel for simd collapse (3)
for (int i = 0; i < M; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < O; k++)
      do_something (i, j, k);
and let the implementation split the M x N x O iteration space itself (or
use clauses to say how exactly it is done). Say if O is very large and N
small and there are many cores, it might be more beneficial to parallelize
it more, etc.
If we come up with some way to help the vectorizer with the collapsed loop,
whether in a form of some loop flags, or internal fns, whatever, I'm all for
it.

Jakub
Richard Biener
2018-10-15 09:30:56 UTC
Post by Jakub Jelinek
Post by Richard Biener
Yeah. Note this still makes the IVs not analyzable since i now effectively
becomes wrapping in the inner loop. For some special values we might
get away with a wrapping CHREC in a bit-precision type but we cannot
represent wrapping at some (possibly non-constant) value.
So - collapsing loops is a bad idea. Why's that done anyways?
Because the standards (both OpenMP and OpenACC) mandate it if one uses
collapse(2) or more. The semantics is that that many nested loops form one
larger iteration space, which is then handled according to the rules of
the particular construct. Sometimes it can be very beneficial, sometimes
less so, but e.g. with OpenMP the user has the option to say what they
want. They can e.g. do:
#pragma omp distribute
for (int i = 0; i < M; i++)
#pragma omp parallel for
for (int j = 0; j < N; j++)
#pragma omp simd
for (int k = 0; k < O; k++)
do_something (i, j, k);
and that way distribute the outermost loop, parallelize the middle one and
#pragma omp distribute parallel for simd collapse (3)
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < O; k++)
do_something (i, j, k);
and let the implementation split the M x N x O iteration space itself (or
use clauses to say how exactly it is done). Say if O is very large and N
small and there are many cores, it might be more beneficial to parallelize
it more, etc.
If we come up with some way to help the vectorizer with the collapsed loop,
whether in a form of some loop flags, or internal fns, whatever, I'm all for
it.
But isn't _actual_ collapsing an implementation detail? That is, isn't it
enough to interpret clauses in terms of the collapse result?

That is, can we delay the actual collapsing until after vectorization
for example?

Richard.
Post by Jakub Jelinek
Jakub
Jakub Jelinek
2018-10-15 09:45:47 UTC
Post by Richard Biener
But isn't _actual_ collapsing an implementation detail?
No, it is required by the standard and in many cases it is very much
observable.
#pragma omp parallel for schedule(nonmonotonic: static, 23) collapse (2)
for (int i = 0; i < 64; i++)
  for (int j = 0; j < 16; j++)
    a[i][j] = omp_get_thread_num ();
The standard says that from the logical iteration space 64 x 16, the
first 23 iterations go to the first thread (i.e. i=0, j=0..15 and i=1,
j=0..6), then the next 23 iterations go to the second thread, etc.
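
In other words (sketch): thread t's first chunk is logical iterations k
in [23*t, 23*t + 23), further chunks being dealt out round-robin across
the team, and the original indices are recovered via exactly the div/mod
pair under discussion:

for (int k = 23 * t; k < 23 * (t + 1); ++k)
  a[k / 16][k % 16] = t;  /* i = k / 16, j = k % 16 */
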
In other constructs, e.g. the new loop construct, it is a request to
distribute, parallelize and vectorize as much as possible, with an
optional guarantee of no cross-iteration dependencies at all, but even in
that case using the source loops might not always be a win, e.g. the loop
nest could be 5 loops deep and the iteration space might be diagonal or
otherwise not exactly rectangular.
Post by Richard Biener
That is, can we delay the actual collapsing until after vectorization
for example?
No. We can come up with some way to propagate some of the original info to
the vectorizer if it helps (or teach the vectorizer to recognize whatever
we produce), but the mandatory transformation needs to be done
immediately, before optimizations make it impossible.

Jakub
Richard Biener
2018-10-15 13:17:44 UTC
Post by Jakub Jelinek
Post by Richard Biener
But isn't _actual_ collapsing an implementation detail?
No, it is required by the standard and in many cases it is very much
observable.
#pragma omp parallel for schedule(nonmonotonic: static, 23) collapse (2)
for (int i = 0; i < 64; i++)
for (int j = 0; j < 16; j++)
a[i][j] = omp_get_thread_num ();
The standard says that from the logical iteration space 64 x 16, the
first 23 iterations go to the first thread (i.e. i=0, j=0..15 and i=1,
j=0..6), then the next 23 iterations go to the second thread, etc.
In other constructs, e.g. the new loop construct, it is a request to
distribute, parallelize and vectorize as much as possible, with an
optional guarantee of no cross-iteration dependencies at all, but even in
that case using the source loops might not always be a win, e.g. the loop
nest could be 5 loops deep and the iteration space might be diagonal or
otherwise not exactly rectangular.
But then you could do

for (int i = si1; i < n1; i++)
  for (int j = sj1; j < m1; j++)
    {
      a[i][j] = omp_get_thread_num ();
    }
if (m_tail1)
  for (int j = 0; j < m_tail1; j++)
    ...

with appropriate start/end for the i/j loop and the "epilogue" loop?
Post by Jakub Jelinek
Post by Richard Biener
That is, can we delay the actual collapsing until after vectorization
for example?
No. We can come up with some way to propagate some of the original info to
the vectorizer if it helps (or teach the vectorizer to recognize whatever
we produce), but the mandatory transformation needs to be done
immediately, before optimizations make it impossible.
The issue is that with refs like

  a[i % m] = a[(i + 1) % m];

you do not know whether you have a backward or a forward dependence,
so I do not see how you could perform loop vectorization. That implies
that one option might be to have the OMP lowering unroll & interleave
loops when asked for SIMD so that the SLP vectorizer could pick up
things?
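
(For instance, with m == 4: iteration 0 writes a[0] and reads a[1], which
iteration 1 then overwrites -- an anti-dependence of distance 1 -- while
iteration 3 reads a[0], written back at iteration 0 -- a flow dependence
of distance 3. Both direction and distance vary with the runtime value
of m.)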

But then how is safelen() defined in the context of collapse()?

Richard.
Post by Jakub Jelinek
Jakub