Discussion:
Gimpel lowering question.
Michael Eager
2018-11-28 17:00:56 UTC
Permalink
I have a small test case which generates poor quality code on my target.
Here is the original:

if (cond1 == 2048 || cond2 == 8)
{
x = x + y;
}
return x;

This ends up generating a series of instructions to compute a flag with
the result of the condition followed by a single compare with zero and
a jump. Better code would be two compares and two jumps.

The gimple is

_1 = cond1 == 2048;
_2 = cond2 == 8;
_3 = _1 | _2;
if (_3 != 0) goto <D.1464>; else goto <D.1465>;
...

so this poor code sequence is essentially a correct translation.

On MIPS, for the same test case, the gimple is different:

if (cond1 == 2048) goto <D.1491>; else goto <D.1493>;
<D.1493>:
if (cond2 == 8) goto <D.1491>; else goto <D.1492>;
<D.1491>:
...

which generates the better code sequence.

Can someone give me a pointer where to find where the lowering
pass decides to break up a condition into multiple tests? Is
there a target configuration option which I have overlooked or
maybe set incorrectly?
--
Michael Eager ***@eagerm.com
1960 Park Blvd., Palo Alto, CA 94306
Jeff Law
2018-11-28 17:10:36 UTC
Permalink
Post by Michael Eager
I have a small test case which generates poor quality code on my target.
  if (cond1 == 2048 || cond2 == 8)
    {
      x = x + y;
    }
  return x;
This ends up generating a series of instructions to compute a flag with
the result of the condition followed by a single compare with zero and
a jump.  Better code would be two compares and two jumps.
The gimple is
  _1 = cond1 == 2048;
  _2 = cond2 == 8;
  _3 = _1 | _2;
  if (_3 != 0) goto <D.1464>; else goto <D.1465>;
...
so this poor code sequence is essentially a correct translation.
  if (cond1 == 2048) goto <D.1491>; else goto <D.1493>;
  if (cond2 == 8) goto <D.1491>; else goto <D.1492>;
...
which generates the better code sequence.
Can someone give me a pointer where to find where the lowering
pass decides to break up a condition into multiple tests?  Is
there a target configuration option which I have overlooked or
maybe set incorrectly?
BRANCH_COST, which comes into play during generation of the initial
trees as well in passes which try to optimize branchy code into
straighter code.

jeff
Michael Eager
2018-11-28 17:47:16 UTC
Permalink
Post by Jeff Law
Post by Michael Eager
I have a small test case which generates poor quality code on my target.
  if (cond1 == 2048 || cond2 == 8)
    {
      x = x + y;
    }
  return x;
This ends up generating a series of instructions to compute a flag with
the result of the condition followed by a single compare with zero and
a jump.  Better code would be two compares and two jumps.
The gimple is
  _1 = cond1 == 2048;
  _2 = cond2 == 8;
  _3 = _1 | _2;
  if (_3 != 0) goto <D.1464>; else goto <D.1465>;
...
so this poor code sequence is essentially a correct translation.
  if (cond1 == 2048) goto <D.1491>; else goto <D.1493>;
  if (cond2 == 8) goto <D.1491>; else goto <D.1492>;
...
which generates the better code sequence.
Can someone give me a pointer where to find where the lowering
pass decides to break up a condition into multiple tests?  Is
there a target configuration option which I have overlooked or
maybe set incorrectly?
BRANCH_COST, which comes into play during generation of the initial
trees as well in passes which try to optimize branchy code into
straighter code.
Thanks. I did look at BRANCH_COST, and played with the values, but it
didn't seem to have any affect. I'll take a closer look.
--
Michael Eager ***@eagerm.com
1960 Park Blvd., Palo Alto, CA 94306
Jeff Law
2018-11-28 17:51:42 UTC
Permalink
Post by Jeff Law
Post by Michael Eager
I have a small test case which generates poor quality code on my target.
   if (cond1 == 2048 || cond2 == 8)
     {
       x = x + y;
     }
   return x;
This ends up generating a series of instructions to compute a flag with
the result of the condition followed by a single compare with zero and
a jump.  Better code would be two compares and two jumps.
The gimple is
   _1 = cond1 == 2048;
   _2 = cond2 == 8;
   _3 = _1 | _2;
   if (_3 != 0) goto <D.1464>; else goto <D.1465>;
...
so this poor code sequence is essentially a correct translation.
   if (cond1 == 2048) goto <D.1491>; else goto <D.1493>;
   if (cond2 == 8) goto <D.1491>; else goto <D.1492>;
...
which generates the better code sequence.
Can someone give me a pointer where to find where the lowering
pass decides to break up a condition into multiple tests?  Is
there a target configuration option which I have overlooked or
maybe set incorrectly?
BRANCH_COST, which comes into play during generation of the initial
trees as well in passes which try to optimize branchy code into
straighter code.
Thanks.  I did look at BRANCH_COST, and played with the values, but it
didn't seem to have any affect.  I'll take a closer look.
I'd start by looking at the state the trees during gimplification then
walk forward or backward as necessary.

Jeff
Andrew Pinski
2018-11-28 22:37:39 UTC
Permalink
Post by Michael Eager
Post by Jeff Law
Post by Michael Eager
I have a small test case which generates poor quality code on my target.
if (cond1 == 2048 || cond2 == 8)
{
x = x + y;
}
return x;
This ends up generating a series of instructions to compute a flag with
the result of the condition followed by a single compare with zero and
a jump. Better code would be two compares and two jumps.
The gimple is
_1 = cond1 == 2048;
_2 = cond2 == 8;
_3 = _1 | _2;
if (_3 != 0) goto <D.1464>; else goto <D.1465>;
...
so this poor code sequence is essentially a correct translation.
if (cond1 == 2048) goto <D.1491>; else goto <D.1493>;
if (cond2 == 8) goto <D.1491>; else goto <D.1492>;
...
which generates the better code sequence.
Can someone give me a pointer where to find where the lowering
pass decides to break up a condition into multiple tests? Is
there a target configuration option which I have overlooked or
maybe set incorrectly?
BRANCH_COST, which comes into play during generation of the initial
trees as well in passes which try to optimize branchy code into
straighter code.
Thanks. I did look at BRANCH_COST, and played with the values, but it
didn't seem to have any affect. I'll take a closer look.
Look at LOGICAL_OP_NON_SHORT_CIRCUIT . By defualt, it is defined as:
(BRANCH_COST (optimize_function_for_speed_p (cfun), \
false) >= 2)
But MIPS defines it as 0.

Changing LOGICAL_OP_NON_SHORT_CIRCUIT to 0 will disable this optimization.

LOGICAL_OP_NON_SHORT_CIRCUIT controls both places where (cond1 == 2048
|| cond2 == 8) would remove the branch.

NOTE I think MIPS defines LOGICAL_OP_NON_SHORT_CIRCUIT incorrectly but
that is a different story.

Thanks,
Andrew Pinski
Post by Michael Eager
--
1960 Park Blvd., Palo Alto, CA 94306
Michael Eager
2018-11-29 18:28:50 UTC
Permalink
Post by Andrew Pinski
Post by Michael Eager
Post by Jeff Law
Post by Michael Eager
I have a small test case which generates poor quality code on my target.
if (cond1 == 2048 || cond2 == 8)
{
x = x + y;
}
return x;
This ends up generating a series of instructions to compute a flag with
the result of the condition followed by a single compare with zero and
a jump. Better code would be two compares and two jumps.
The gimple is
_1 = cond1 == 2048;
_2 = cond2 == 8;
_3 = _1 | _2;
if (_3 != 0) goto <D.1464>; else goto <D.1465>;
...
so this poor code sequence is essentially a correct translation.
if (cond1 == 2048) goto <D.1491>; else goto <D.1493>;
if (cond2 == 8) goto <D.1491>; else goto <D.1492>;
...
which generates the better code sequence.
Can someone give me a pointer where to find where the lowering
pass decides to break up a condition into multiple tests? Is
there a target configuration option which I have overlooked or
maybe set incorrectly?
BRANCH_COST, which comes into play during generation of the initial
trees as well in passes which try to optimize branchy code into
straighter code.
Thanks. I did look at BRANCH_COST, and played with the values, but it
didn't seem to have any affect. I'll take a closer look.
(BRANCH_COST (optimize_function_for_speed_p (cfun), \
false) >= 2)
But MIPS defines it as 0.
Changing LOGICAL_OP_NON_SHORT_CIRCUIT to 0 will disable this optimization.
LOGICAL_OP_NON_SHORT_CIRCUIT controls both places where (cond1 == 2048
|| cond2 == 8) would remove the branch.
NOTE I think MIPS defines LOGICAL_OP_NON_SHORT_CIRCUIT incorrectly but
that is a different story.
Thanks,
Andrew Pinski
I set BRANCH_COST to 1 for both predicted and non-predicted branches.
This generates the shorter sequence with a compare, but I'm concerned
that it will adversely affect code generation elsewhere. Branches are
higher cost than simple instructions.

Looking at the code generated with BRANCH_COST > 1, it doesn't do what
I indicated above. This did not reduce the number of branches.
In both cases there are two branches in this short test case. When
BRANCH_COST > 1, there are three instructions to do a compare vs one
when BRANCH_COST = 1. I'm at a loss to see where there is any benefit.

I'll take a look at the LOGICAL_OP_NON_SHORT_CIRCUIT settings and see
if this makes a difference.
--
Michael Eager ***@eagerm.com
1960 Park Blvd., Palo Alto, CA 94306
Michael Eager
2018-11-30 19:09:06 UTC
Permalink
Post by Michael Eager
Post by Jeff Law
Post by Michael Eager
I have a small test case which generates poor quality code on my target.
    if (cond1 == 2048 || cond2 == 8)
      {
        x = x + y;
      }
    return x;
This ends up generating a series of instructions to compute a flag with
the result of the condition followed by a single compare with zero and
a jump.  Better code would be two compares and two jumps.
The gimple is
    _1 = cond1 == 2048;
    _2 = cond2 == 8;
    _3 = _1 | _2;
    if (_3 != 0) goto <D.1464>; else goto <D.1465>;
...
so this poor code sequence is essentially a correct translation.
    if (cond1 == 2048) goto <D.1491>; else goto <D.1493>;
    if (cond2 == 8) goto <D.1491>; else goto <D.1492>;
...
which generates the better code sequence.
Can someone give me a pointer where to find where the lowering
pass decides to break up a condition into multiple tests?  Is
there a target configuration option which I have overlooked or
maybe set incorrectly?
BRANCH_COST, which comes into play during generation of the initial
trees as well in passes which try to optimize branchy code into
straighter code.
Thanks.  I did look at BRANCH_COST, and played with the values, but it
didn't seem to have any affect.  I'll take a closer look.
   (BRANCH_COST (optimize_function_for_speed_p (cfun), \
                 false) >= 2)
But MIPS defines it as 0.
Changing  LOGICAL_OP_NON_SHORT_CIRCUIT to 0 will disable this
optimization.
LOGICAL_OP_NON_SHORT_CIRCUIT controls both places where (cond1 == 2048
|| cond2 == 8) would remove the branch.
NOTE I think MIPS defines LOGICAL_OP_NON_SHORT_CIRCUIT incorrectly but
that is a different story.
Thanks,
Andrew Pinski
I set BRANCH_COST to 1 for both predicted and non-predicted branches.
This generates the shorter sequence with a compare, but I'm concerned
that it will adversely affect code generation elsewhere.  Branches are
higher cost than simple instructions.
Looking at the code generated with BRANCH_COST > 1, it doesn't do what
I indicated above.  This did not reduce the number of branches.
In both cases there are two branches in this short test case.  When
BRANCH_COST > 1, there are three instructions to do a compare vs one
when BRANCH_COST = 1.  I'm at a loss to see where there is any benefit.
I'll take a look at the LOGICAL_OP_NON_SHORT_CIRCUIT settings and see
if this makes a difference.
Setting LOGICAL_OP_NON_SHORT_CIRCUIT to zero generates the better code
sequence. Thanks for the pointer.

I stepped through the code at the two places where
LOGICAL_OP_NON_SHORT_CIRCUIT is tested, with it set both to 0 and 1, and
it is not clear what the code is trying to do. Or what this parameter
is really supposed to mean. I'll have to go back and look again.

Setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 or 1 does not change the
number of branches. Why this would depend on BRANCH_COST is not clear,
and especially why it would depend on the costs of branches which are
predicted to fail makes no sense to me.

Is there a use case which would help this make some sense?
--
Michael Eager ***@eagerm.com
1960 Park Blvd., Palo Alto, CA 94306
Richard Biener
2018-11-30 07:39:02 UTC
Permalink
Post by Jeff Law
Post by Michael Eager
I have a small test case which generates poor quality code on my target.
if (cond1 == 2048 || cond2 == 8)
{
x = x + y;
}
return x;
This ends up generating a series of instructions to compute a flag with
the result of the condition followed by a single compare with zero and
a jump. Better code would be two compares and two jumps.
The gimple is
_1 = cond1 == 2048;
_2 = cond2 == 8;
_3 = _1 | _2;
if (_3 != 0) goto <D.1464>; else goto <D.1465>;
...
so this poor code sequence is essentially a correct translation.
if (cond1 == 2048) goto <D.1491>; else goto <D.1493>;
if (cond2 == 8) goto <D.1491>; else goto <D.1492>;
...
which generates the better code sequence.
Can someone give me a pointer where to find where the lowering
pass decides to break up a condition into multiple tests? Is
there a target configuration option which I have overlooked or
maybe set incorrectly?
BRANCH_COST, which comes into play during generation of the initial
trees
Sth I don't like very much... maybe we can revisit removing
the few cases in fold-const.c (thus GENERIC folding) and rely
on later GIMPLE passes instead plus on RTL expansion to
do the reverse transform.

Not for GCC 9 of course.
Post by Jeff Law
as well in passes which try to optimize branchy code into
straighter code.
jeff
Jeff Law
2018-11-30 16:08:34 UTC
Permalink
Post by Richard Biener
Post by Jeff Law
Post by Michael Eager
I have a small test case which generates poor quality code on my target.
if (cond1 == 2048 || cond2 == 8)
{
x = x + y;
}
return x;
This ends up generating a series of instructions to compute a flag with
the result of the condition followed by a single compare with zero and
a jump. Better code would be two compares and two jumps.
The gimple is
_1 = cond1 == 2048;
_2 = cond2 == 8;
_3 = _1 | _2;
if (_3 != 0) goto <D.1464>; else goto <D.1465>;
...
so this poor code sequence is essentially a correct translation.
if (cond1 == 2048) goto <D.1491>; else goto <D.1493>;
if (cond2 == 8) goto <D.1491>; else goto <D.1492>;
...
which generates the better code sequence.
Can someone give me a pointer where to find where the lowering
pass decides to break up a condition into multiple tests? Is
there a target configuration option which I have overlooked or
maybe set incorrectly?
BRANCH_COST, which comes into play during generation of the initial
trees
Sth I don't like very much... maybe we can revisit removing
the few cases in fold-const.c (thus GENERIC folding) and rely
on later GIMPLE passes instead plus on RTL expansion to
do the reverse transform.
Not for GCC 9 of course.
Agreed, 100%. That was supposed to be where Kai's work was headed, but
we never seemed to make significant headway. Getting BRANCH_COST out
of the tree generation/folding would be a good thing at many levels.

There'll be fallout, of course.

jeff
Jakub Jelinek
2018-11-30 16:14:11 UTC
Permalink
Post by Jeff Law
Post by Richard Biener
Sth I don't like very much... maybe we can revisit removing
the few cases in fold-const.c (thus GENERIC folding) and rely
on later GIMPLE passes instead plus on RTL expansion to
do the reverse transform.
Not for GCC 9 of course.
Agreed, 100%. That was supposed to be where Kai's work was headed, but
we never seemed to make significant headway. Getting BRANCH_COST out
of the tree generation/folding would be a good thing at many levels.
There'll be fallout, of course.
Note, for the GCC 9 timeframe, for PR85368 I wrote today
--param logical-op-non-short-circuit={0,1} support that allows
to override LOGICAL_OP_NON_SHORT_CIRCUIT, because clearly it is impossible
to handle it in testcases otherwise, just -mbranch-cost= and the defaults
aren't good enough, as some targets override LOGICAL_OP_NON_SHORT_CIRCUIT
and figuring out all the details is too hard in tcl.

I prefer a param to make it clear that it won't last too long if we
implement the above for GCC 10 or 11.

Jakub
Jeff Law
2018-11-30 16:18:11 UTC
Permalink
Post by Jakub Jelinek
Post by Jeff Law
Post by Richard Biener
Sth I don't like very much... maybe we can revisit removing
the few cases in fold-const.c (thus GENERIC folding) and rely
on later GIMPLE passes instead plus on RTL expansion to
do the reverse transform.
Not for GCC 9 of course.
Agreed, 100%. That was supposed to be where Kai's work was headed, but
we never seemed to make significant headway. Getting BRANCH_COST out
of the tree generation/folding would be a good thing at many levels.
There'll be fallout, of course.
Note, for the GCC 9 timeframe, for PR85368 I wrote today
--param logical-op-non-short-circuit={0,1} support that allows
to override LOGICAL_OP_NON_SHORT_CIRCUIT, because clearly it is impossible
to handle it in testcases otherwise, just -mbranch-cost= and the defaults
aren't good enough, as some targets override LOGICAL_OP_NON_SHORT_CIRCUIT
and figuring out all the details is too hard in tcl.
Yea, that sounds quite reasonable. The target conditions in those tests
are just insane and obviously unmaintainable. A param to control
behavior seems like a good idea.

The param also makes it easier to experiment with this stuff as we try
to kill the BRANCH_COST and LOGICAL_OP_NON_SHORT_CIRCUIT bits early.
jeff
Loading...