Discussion:
Some thoughts and questions about the data flow infrastructure
Vladimir Makarov
2007-02-12 20:33:17 UTC
Permalink
On Sunday I accidentally had a chat about the df infrastructure on
IRC. I've got some thoughts which I'd like to share.

I have liked the df infrastructure code from day one for its clarity.
Unfortunately users don't see it and probably don't care about it.
From my point of view the df infrastructure has a design flaw: it
extracts a lot of information about the RTL and keeps it on the side.
That does not make the code fast. It would be ok if we got better code
quality. Danny told me that they get 1.5% better code using df. That
is a really big improvement (about half a year of work for the whole
compiler team according to Proebsting's law). IMHO, it could justify
the promised 5% compiler slowdown.
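
(For context: Proebsting's law posits that compiler-optimization work
as a whole doubles program performance roughly every 18 years, i.e.
about 2^(1/18) - 1, or 4%, per year. At that rate a 1.5% code-quality
gain corresponds to roughly 1.5/4, i.e. about 0.4 years, of whole-team
effort, which is where the half-year figure comes from.)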

I've checked the branch at the last merge point, revision 121802 (done
Feb 10), on SPECINT2000 on a 2.66GHz Core2 Duo (-mtune=generic -O2 were
used) and found that there is no improvement but a degradation of about
0.5% (1888 for the branch vs 1896 for mainline at the branch point;
this is without twolf, which is broken for now). The code size is also
about 0.45% bigger on the branch. Getting 0.5% worse code and an 11.5%
compile-time slowdown (308sec vs 275sec for compiling SPECINT2000) does
not seem reasonable, especially if we have no alternative faster path
(even if the old live analysis is a mess).

Someone could object that this infrastructure makes implementing new
optimizations easier. But I've heard the opinion that RTL
optimizations should be moved to SSA, and that only a few optimizations
should remain on RTL, such as code selection, insn scheduling, and RA.

Even rewriting the current optimizations on top of the new data flow
infrastructure makes the situation worse, because it will not be easy
to get rid of the data flow infrastructure later if part of the flaw is
in the df interface itself. So it might create problems in the future.

I especially did not like David Edelsohn's phrase "and no new
private dataflow schemes will be allowed in gcc passes". It was not
his first such statement. Such phrases kill competition, which is
bad for gcc. What if a new specialized scheme is faster? What if
somebody decides to write another, better df infrastructure from
scratch to solve the coming df infrastructure problems?

So could somebody from the steering committee be kind enough to answer me:

Is it (prohibiting the use of other dataflow schemes) a steering
committee decision?

If so, should the steering committee be making such technical decisions?
Some people on the steering committee are not experts in the gcc
compiler, and some of them are not experts in this part of the compiler.

I am not in opposition to the merge if it satisfies the merge criteria.
People have done a lot of work, and it is too late; I should have
opposed the criteria when they were discussed. Sorry, I missed that
discussion, if there was one. I am just raising some questions and
saying that more work will be needed on the df infrastructure even
after the merge.

Vlad
Steven Bosscher
2007-02-12 20:50:06 UTC
Permalink
Post by Vladimir Makarov
I have liked the df infrastructure code from day one for its clarity.
Unfortunately users don't see it and probably don't care about it.
From my point of view the df infrastructure has a design flaw: it
extracts a lot of information about the RTL and keeps it on the side.
That does not make the code fast.
It also does not make the code slow. And the data it extracts and keeps
on the side could be used to simplify many algorithms in gcc (most
notably cprop, mode switching, and regmove). There is tremendous
potential for speedups in RTL passes if they start using the df
register caches instead of traversing the PATTERN of every insn.
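
To make that concrete, here is a minimal sketch of what such a
per-insn cache looks like (hypothetical C, not the real df API;
insn_refs, insn_ref_cache, etc. are made-up names for illustration):

/* Per-insn register references, collected once when the insn is
   scanned, so that passes need not re-walk the insn's PATTERN.  */
struct reg_ref
{
  unsigned int regno;           /* number of the referenced register */
};

struct insn_refs
{
  struct reg_ref *defs;         /* registers set by the insn */
  int n_defs;
  struct reg_ref *uses;         /* registers read by the insn */
  int n_uses;
};

/* One entry per insn, indexed by insn uid.  */
extern struct insn_refs *insn_ref_cache;

/* Does the insn with uid INSN_UID read register REGNO?  A linear scan
   of a small flat array instead of a recursive rtx walk.  */
static int
insn_uses_reg_p (int insn_uid, unsigned int regno)
{
  struct insn_refs *r = &insn_ref_cache[insn_uid];
  int i;

  for (i = 0; i < r->n_uses; i++)
    if (r->uses[i].regno == regno)
      return 1;
  return 0;
}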

Gr.
Steven
Vladimir N. Makarov
2007-02-13 04:16:00 UTC
Permalink
Post by Steven Bosscher
Post by Vladimir Makarov
I have liked the df infrastructure code from day one for its clarity.
Unfortunately users don't see it and probably don't care about it.
From my point of view the df infrastructure has a design flaw: it
extracts a lot of information about the RTL and keeps it on the side.
That does not make the code fast.
It also does not make the code slow. And the data it extracts and keeps
on the side could be used to simplify many algorithms in gcc (most
notably cprop, mode switching, and regmove). There is tremendous
potential for speedups in RTL passes if they start using the df
register caches instead of traversing the PATTERN of every insn.
Some passes scan the RTL only once. For example, the insn scheduler
now does it once to build the data dependencies. In that case df scans
the RTL once plus writes and reads the df structures, and so always
loses in speed.

I know some work is being done on incremental df analysis. It could
decrease the time for rescanning the RTL between passes. Let us hope
for that.
Ian Lance Taylor
2007-02-13 06:21:50 UTC
Permalink
Post by Vladimir N. Makarov
I know some work is being done on incremental df analysis. It could
decrease the time for rescanning the RTL between passes. Let us hope
for that.
My understanding is that on dataflow-branch the DF information is now
fully incremental.


I don't really grasp where you are headed with this criticism.
Criticism can be a good thing. But I think it would help me, at
least, if you were more clear about what you see as the alternative.
Moving all the RTL passes into tree-ssa might be a good idea. But it
would be orders of magnitude more work than the dataflow work has
required, so it is not comparable.

We could constrain the dataflow work to be a 100% improvement on all
platforms before it can be committed. But that would be extremely
difficult, because they would continually be catching up to problems
introduced in mainline. For example, I slowed them down by a few days
as they fit my recent lower-subreg.c patch into the DF framework. So
I think that imposing such a requirement would be unwise. I believe
it would be tantamount to saying that we can never make a major
infrastructure change. For tree-ssa (admittedly a much bigger change)
we accepted slowdowns when it was committed because we believed that
the end result would be better.

I think that you should channel your concern into looking at the code
on dataflow-branch and seeing how it can be improved, or determining
that it is hopeless. I don't think this is a discussion which can be
usefully carried on in the abstract.

Ian
Mark Mitchell
2007-02-12 21:03:29 UTC
Permalink
Post by Vladimir Makarov
On Sunday I accidentally had a chat about the df infrastructure on
IRC. I've got some thoughts which I'd like to share.
I have liked the df infrastructure code from day one for its clarity.
Unfortunately users don't see it and probably don't care about it.
You're right that users don't care what the guts of the compiler look
like. That's exactly right: they care only about the speed of the
compiler, the speed of the generated code, correctness of the generated
code, overall quality, etc.

However, my understanding (as someone who's not an expert on the DF code
base) is that, as you say, the new stuff is much tidier. I understood
the objective to be not so much that DF itself would directly improve
the generated code, but rather that it would provide much-needed
infrastructure that would allow future improvements. This is a lot like
TREE-SSA which, by itself, wasn't so much about optimization as it was
about providing a platform for optimization.

Obviously, making sure that the DF code isn't actively hurting us when
it goes into the compiler is part of the agreed-upon criteria. But, I
don't think it needs to markedly help at this point, as long as people
are comfortable with the interfaces and functionality it provides.
Post by Vladimir Makarov
So could somebody from the steering committee be kind enough to answer me:
Is it (prohibiting the use of other dataflow schemes) a steering
committee decision?
No, the Steering Committee has not considered this issue. As you
suggest, I don't think it's something the SC would want to consider;
it's a technical issue. I'd certainly think that it would be good for
all passes to use the DF infrastructure.

However, if there were really some special case where we could do
markedly better without it, and no feasible way to give the special-case
information to the core DF code, then I'm sure everyone would agree that
it made sense to use something different. But, that would be only in
extraordinary situations, rather than having lots of reinvention of the
same infrastructure.
--
Mark Mitchell
CodeSourcery
***@codesourcery.com
(650) 331-3385 x713
Vladimir N. Makarov
2007-02-13 04:25:21 UTC
Permalink
Post by Mark Mitchell
Post by Vladimir Makarov
On Sunday I accidentally had a chat about the df infrastructure on
IRC. I've got some thoughts which I'd like to share.
I have liked the df infrastructure code from day one for its clarity.
Unfortunately users don't see it and probably don't care about it.
You're right that users don't care what the guts of the compiler look
like. That's exactly right: they care only about the speed of the
compiler, the speed of the generated code, correctness of the generated
code, overall quality, etc.
However, my understanding (as someone who's not an expert on the DF code
base) is that, as you say, the new stuff is much tidier. I understood
the objective to be not so much that DF itself would directly improve
the generated code, but rather that it would provide much-needed
infrastructure that would allow future improvements. This is a lot like
TREE-SSA which, by itself, wasn't so much about optimization as it was
about providing a platform for optimization.
IMHO, it is not the right analogy. The right analogy would be a new RTL
(low-level IR). I know the insn scheduler and RA well, and I would not
exaggerate the DF importance that way, at least for these tasks.
Post by Mark Mitchell
Obviously, making sure that the DF code isn't actively hurting us when
it goes into the compiler is part of the agreed-upon criteria. But, I
don't think it needs to markedly help at this point, as long as people
are comfortable with the interfaces and functionality it provides.
Post by Vladimir Makarov
So could somebody from the steering committee be kind enough to answer me:
Is it (prohibiting the use of other dataflow schemes) a steering
committee decision?
No, the Steering Committee has not considered this issue. As you
suggest, I don't think it's something the SC would want to consider;
it's a technical issue. I'd certainly think that it would be good for
all passes to use the DF infrastructure.
However, if there were really some special case where we could do
markedly better without it, and no feasible way to give the special-case
information to the core DF code, then I'm sure everyone would agree that
it made sense to use something different. But, that would be only in
extraordinary situations, rather than having lots of reinvention of the
same infrastructure.
Thanks for the clarification, Mark.
Mark Mitchell
2007-02-13 04:45:29 UTC
Permalink
Post by Vladimir N. Makarov
Post by Mark Mitchell
However, my understanding (as someone who's not an expert on the DF code
base) is that, as you say, the new stuff is much tidier. I understood
the objective to be not so much that DF itself would directly improve
the generated code, but rather that it would provide much-needed
infrastructure that would allow future improvements. This is a lot like
TREE-SSA which, by itself, wasn't so much about optimization as it was
about providing a platform for optimization.
IMHO, it is not the right analogy. The right analogy would be a new RTL
(low-level IR). I know the insn scheduler and RA well, and I would not
exaggerate the DF importance that way, at least for these tasks.
I was not trying to suggest that DF is necessarily as sweeping a change
as TREE-SSA. Certainly, it's not a complete change to the
representation.

The sense in which I meant to suggest similarity is that DF is
infrastructure. It's not that DF is a new optimization; it's that it's
a facilitator of optimizations. The point is to get more accurate
information to the various RTL passes, in a more consistent way, so that
they can do their jobs better. And, to be able to adjust those passes
to use newer, better algorithms, which depend on easy access to dataflow
information. And, perhaps, to permit the creation of new optimization
passes that will only be useful because more accurate information is
available.

My point is that I don't think that the right criteria for DF is whether
it makes the generated code faster. So long as it doesn't make the
generated code slower, and so long as the APIs seem well designed, and
so long as it doesn't make the compiler itself too much slower, I think
it's a win.
--
Mark Mitchell
CodeSourcery
***@codesourcery.com
(650) 331-3385 x713
Vladimir Makarov
2007-02-13 16:25:33 UTC
Permalink
Wow, I got so many emails. I'll try to answer them in one email in
order not to repeat myself.
Post by Mark Mitchell
I was not trying to suggest that DF is necessarily as sweeping a change
as TREE-SSA. Certainly, it's not a complete change to the
representation.
It is not as sweeping a change as Tree-SSA. Tree-SSA is designed and
sharpened for global optimizations. Ken knows that well because he
is one of the inventors of SSA.

Let us look at the major RTL optimizations: combiner, scheduler, RA.
Do we need a global analysis for building def-use and use-def chains?
We don't need it for the combiner (only bb scope), we don't need it for
the scheduler (only a DAG region), and we don't need it at all for
reload. So building and using def-use/use-def chains is definitely
overkill and will make the compiler slower. A lot of work was done in
the scheduler to keep the dependencies manageable, because scheduling
is a quadratic algorithm; addressing that in the infrastructure just
means moving all this code there.

Some algorithms would benefit from a global analysis for def-use and
use-def chains (like gcse after reload, or the webizer which is
switched off by default), but it can be done without the fat structures
of the data flow infrastructure.

We need more accurate life analysis (liveness with availability). That
could be fixed in the current life analysis. They mention other
inaccuracies, whose importance I doubt, but those can be fixed there as
well. I agree the life analysis is a mess, but you can rewrite it
without introducing fat structures to calculate the relations.
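
For reference, the computation being discussed is the classical
backward problem: live_in(b) = use(b) | (live_out(b) & ~def(b)), where
live_out(b) is the union of live_in over b's successors. A toy solver
over at most 32 registers, just to show the shape (the real thing of
course uses per-block bitmaps):

/* Iterate the liveness equations to a fixed point.  Each block keeps
   USE (regs read before written in the block) and DEF (regs written)
   as 32-bit masks, plus up to two successor indices (-1 if absent).  */
struct bb_live
{
  unsigned int use, def;
  int succ[2];
  unsigned int live_in, live_out;
};

static void
compute_liveness (struct bb_live *bb, int n_blocks)
{
  int changed = 1;

  while (changed)
    {
      int b, i;

      changed = 0;
      /* Reverse order converges faster for a backward problem,
         but any order gives the same fixed point.  */
      for (b = n_blocks - 1; b >= 0; b--)
        {
          unsigned int out = 0, in;

          for (i = 0; i < 2; i++)
            if (bb[b].succ[i] >= 0)
              out |= bb[bb[b].succ[i]].live_in;
          in = bb[b].use | (out & ~bb[b].def);
          if (in != bb[b].live_in || out != bb[b].live_out)
            {
              bb[b].live_in = in;
              bb[b].live_out = out;
              changed = 1;
            }
        }
    }
}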

So I think they did not investigate the drawbacks of the df with its
fat structures, which imho are the reason for the slowness. They just
took the existing code of Michael Hayes without investigating
alternative slimmer representations. And a possible alternative
slimmer representation might change the interface as well.

So what remains is a duplicated, maybe faster, representation of insn
operands, whose creation time and memory cost are not justified.

That is not how tree-SSA was designed and developed over its 5 years.

Once again, I am not against a new df infrastructure. I am against
rushing it into the mainline. I understand that there are a lot of
things still to be investigated.

They are promising blue sky, but I don't see it now. Maybe I am
blind.

I understand we are sometimes under pressure from our employers.
I feel it too. But maybe one year is not enough for such work.
Post by Ian Lance Taylor
I don't really grasp where you are headed with this criticism.
Criticism can be a good thing. But I think it would help me, at
least, if you were more clear about what you see as the alternative.
I mentioned alternatives:

o investigating what analysis is really necessary for the major RTL passes
o fixing the life analysis code
o rewriting the life analysis code
o investigating a slimmer representation (maybe attaching it to the
rtl reg/subreg, although that is complicated because of pseudo-register
sharing before reload); see the sketch after this list
o trying to rewrite e.g. gcse after reload and seeing what really
happens
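
To make the "slimmer representation" point concrete, here is one
possible shape (purely hypothetical C, invented for illustration, not
a worked-out proposal):

/* All uses of each pseudo kept as one flat, per-register array built
   in a single scan, instead of a heavyweight ref object per operand
   linked into several chains.  */
struct reg_use
{
  int insn_uid;                 /* insn containing the use */
  int opno;                     /* which operand of that insn */
};

struct reg_use_list
{
  struct reg_use *uses;         /* uses in instruction order */
  int n_uses;
};

/* Indexed by pseudo register number; a def-use walk becomes a linear
   scan of one small array rather than a pointer chase.  */
extern struct reg_use_list *uses_of_reg;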

Now I am saying where this is headed. Many users skipped the 4.0
release. 4.1 was a good release. Imho 4.2 is probably another
candidate for skipping. I don't want to make 4.3 one more such
candidate because of the new DF infrastructure. I think producing two
releases in a row to be avoided is a luxury we cannot afford.

Changing the df without providing the older path means that probably
some ports will be broken. Including just infrastructure which makes
the compiler 5% slower is not reasonable either. I think such a change
should be included into the mainline only right after the transition
from stage 3 to stage 1. People will have more time to fix the broken
ports and maybe write something which shows the potential of the new
infrastructure.

I think it is too late to include the new df into the mainline. I
think achieving the criteria will take even more time.

I ask the steering committee to reconsider its decision on the
inclusion of the DF infrastructure in 4.3. It should be done as the
first step of stage 1 for the 4.4 release, of course only if it
achieves the merge criteria.
Richard Kenner
2007-02-13 16:40:37 UTC
Permalink
Post by Vladimir Makarov
Let us look at the major RTL optimizations: combiner, scheduler, RA.
Do we need a global analysis for building def-use and use-def chains?
We don't need it for the combiner (only bb scope),
Combine would work *far better* if it had some uniform data structure. For
one thing, that means it could work across blocks, which would be very
valuable. For another, there's a tremendous amount of highly-complex code
to do the sorts of tracking that are best done with a uniform infrastructure.
Post by Vladimir Makarov
we don't need it for the scheduler (only a DAG region)
Yes, but again, it would work better if it didn't.
Post by Vladimir Makarov
we don't need it at all for reload.
Again, a tremendous amount of kludgery in reload would be removed by
having accurate and better-tracked information.
David Edelsohn
2007-02-13 16:40:55 UTC
Permalink
Do you realize how confrontational your emails sound? Have you
considered asking about the technical reasoning and justification instead
of making unfounded assertions? Do you want everyone to refute your
incorrect facts point by point?

David
Vladimir Makarov
2007-02-13 17:14:06 UTC
Permalink
Post by David Edelsohn
Do you realize how confrontational your emails sound? Have you
considered asking about the technical reasoning and justification instead
of making unfounded assertions? Do you want everyone to refute your
incorrect facts point by point?
David, I am sorry if you feel that way. I am a bit confrontational.
But how can I not be, when I am alone trying to express concerns in
which I really believe?
Jeffrey Law
2007-02-13 17:27:13 UTC
Permalink
Post by Vladimir Makarov
Post by David Edelsohn
Do you realize how confrontational your emails sound? Have you
considered asking about the technical reasoning and justification instead
of making unfounded assertions? Do you want everyone to refute your
incorrect facts point by point?
David, I am sorry if you feel that way. I am a bit confrontational.
But how can I not be, when I am alone trying to express concerns in
which I really believe?
I think everyone would be best served if they realized none of this is
personal -- it's about technical decisions in an effort to improve GCC.

Jeff
David Edelsohn
2007-02-13 19:16:58 UTC
Permalink
Jeff> I think everyone would be best served if they realized none of this is
Jeff> personal -- it's about technical decisions in an effort to improve GCC.

GCC development is far from perfect. The recent model generally
seems to be effective, although there is plenty of room for improvement.

I am not trying to discourage feedback and comments, but
complaining is easy -- throwing stones is easy. The question is: How are
we going to improve it and/or fix it?

There are many things that we may want to do, but what can we do?

And most importantly: Are the alternatives really better or just
look better from the outside?


This discussion has strayed from the specific dataflow topic, but
for that specific project (and other projects), I would encourage the GCC
community to get involved and improve the design and implementation of the
feature.

Waiting for perfect is not a good strategy. Competition is good,
but GCC developers generally do not reject offers of assistance. A lot of
separate, duplicative projects may not get done, while compromising on a
single design with everyone contributing to the implementation has a
better chance of completion. In the long run, we get a single,
functional, complete, good but imperfect, and easier to maintain feature.

David
Steven Bosscher
2007-02-13 17:11:24 UTC
Permalink
Post by Vladimir Makarov
Wow, I got so many emails. I'll try to answer them in one email in
Let us look at the major RTL optimizations: combiner, scheduler, RA.
...PRE, CPROP, SEE, RTL loop optimizers, if-conversion, ... It is easy
to make your arguments look valid if you take it as a proposition that
only register allocation and scheduling ought to be done on RTL.

The reality is that GIMPLE is too high level (by design) to catch many
useful transformations performed on RTL. Think CSE of lowered
addresses, expanded builtins, code sequences generated for bitfield
operations and expensive instructions (e.g. mul, div). So we are
going to have more RTL optimizers than just regalloc and sched.

Many RTL optimizations still matter very much (disable some of them
and test SPEC again, if you're unconvinced). Having a uniform
dataflow framework for those optimizations is IMHO a good thing.
Post by Vladimir Makarov
Do
we need a global analysis for building def-use and use-def chains? We
don't need it for the combiner (only bb scope)
It seems to me that this limitation is only there because when combine
was written, the idea of "global dataflow information" was in the
"future work" section for most practical compilers. So, perhaps
combine, as it is now, does not need DU/UD chains. But maybe we can
improve passes like this if we re-implement them in, or migrate them
to a better dataflow framework.

Gr.
Steven
Vladimir Makarov
2007-02-13 19:22:36 UTC
Permalink
Post by Steven Bosscher
Post by Vladimir Makarov
Wow, I got so many emails. I'll try to answer them in one email in
Let us look at the major RTL optimizations: combiner, scheduler, RA.
...PRE, CPROP, SEE, RTL loop optimizers, if-conversion, ... It is easy
to make your arguments look valid if you take it as a proposition that
only register allocation and scheduling ought to be done on RTL.
The reality is that GIMPLE is too high level (by design) to catch many
useful transformations performed on RTL. Think CSE of lowered
addresses, expanded builtins, code sequences generated for bitfield
operations and expensive instructions (e.g. mul, div). So we are
going to have more RTL optimizers than just regalloc and sched.
Many RTL optimizations still matter very much (disable some of them
and test SPEC again, if you're unconvinced). Having a uniform
dataflow framework for those optimizations is IMHO a good thing.
Steven, I agree with you. I am not against a df infrastructure; a
well-defined and efficient one is always a good thing. As I wrote, I
even use the proposed DF in my RA project.

I am just trying to argue that the proposed df infrastructure is not
ready and might create serious problems for this release and for future
development because it is slow. Danny is saying that the beauty of the
infrastructure is that you improve it in just one place. I agree with
this partially. I am only afraid that the solution for a faster
infrastructure (e.g. another, slimmer data representation) might change
the interface considerably. I am not sure that I can convince anyone of
this. But I am more worried about the 4.3 release, and I really believe
that the inclusion of the data flow infrastructure should be the first
step of stage 1, to give people more time to solve at least some
problems.

In saying that, I have hurt the feelings of people who put a lot of
effort into the infrastructure, like Danny, Ken, and Seongbae, and I am
sorry for that.
Post by Steven Bosscher
Post by Vladimir Makarov
Do
we need a global analysis for building def-use and use-def chains? We
don't need it for the combiner (only bb scope)
It seems to me that this limitation is only there because when combine
was written, the idea of "global dataflow information" was in the
"future work" section for most practical compilers. So, perhaps
combine, as it is now, does not need DU/UD chains. But maybe we can
improve passes like this if we re-implement them in, or migrate them
to a better dataflow framework.
The combiner is an older approach to code selection. It was designed
by the same authors (Fraser and Proebsting) before they designed BURG.
I remember even intermediate approaches where a minimal cover of the
tree by subtrees representing the machine insns was sought with
context-free grammar parsers. Modern code selection like BURG (a
dynamic programming approach which tries to find a real *minimal cost*
cover) works on a tree of IR insns in one BB (or, in the more complex
case, on a DAG; in that case a non-optimal solution is used). I even
have my own tool for this, NONA: http://cocom.sf.net . Although it
might be good research to make it work on insns from different BBs.
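
For readers unfamiliar with the technique, here is a toy version of
the bottom-up dynamic programming that BURG/iburg-style generators
perform (two nonterminals and three operators, invented purely for
illustration; real generators derive the rule tables from a machine
description):

#include <limits.h>

enum op { OP_CONST, OP_REG, OP_ADD };
enum nt { NT_REG, NT_IMM, NT_MAX };

struct node
{
  enum op op;
  struct node *kid[2];          /* children, NULL if absent */
  int cost[NT_MAX];             /* min cost to reduce to each nt */
};

/* Label the tree bottom-up: every node gets the minimum cost of
   reducing it to each nonterminal, computed from its children.  */
static void
label (struct node *n)
{
  int i, nt;

  for (i = 0; i < 2; i++)
    if (n->kid[i])
      label (n->kid[i]);

  for (nt = 0; nt < NT_MAX; nt++)
    n->cost[nt] = INT_MAX / 2;  /* "infinite": no cover found yet */

  switch (n->op)
    {
    case OP_CONST:
      n->cost[NT_IMM] = 0;      /* rule imm: CONST, cost 0 */
      break;
    case OP_REG:
      n->cost[NT_REG] = 0;      /* rule reg: REG, cost 0 */
      break;
    case OP_ADD:
      {
        int l = n->kid[0]->cost[NT_REG];
        int r_reg = n->kid[1]->cost[NT_REG];
        int r_imm = n->kid[1]->cost[NT_IMM];

        /* rule reg: ADD (reg, reg), cost 1 (add instruction) */
        if (l + r_reg + 1 < n->cost[NT_REG])
          n->cost[NT_REG] = l + r_reg + 1;
        /* rule reg: ADD (reg, imm), cost 1 (add-immediate form) */
        if (l + r_imm + 1 < n->cost[NT_REG])
          n->cost[NT_REG] = l + r_imm + 1;
        break;
      }
    }

  /* Chain rule reg: imm, cost 1 (load-immediate instruction).  */
  if (n->cost[NT_IMM] + 1 < n->cost[NT_REG])
    n->cost[NT_REG] = n->cost[NT_IMM] + 1;
}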

The problem is that to use the modern approach you need another
description of the insns (with a one pattern - one machine insn
relation) in a tree representation with a given cost for each tree.
And it is a huge amount of work to rewrite the current machine
descriptions even just for this.
David Edelsohn
2007-02-13 19:54:18 UTC
Permalink
Vlad> I am just trying to argue that the proposed df infrastructure is not
Vlad> ready and might create serious problems for this release and for future
Vlad> development because it is slow. Danny is saying that the beauty of the
Vlad> infrastructure is that you improve it in just one place. I agree with
Vlad> this partially. I am only afraid that the solution for a faster
Vlad> infrastructure (e.g. another, slimmer data representation) might change
Vlad> the interface considerably. I am not sure that I can convince anyone of
Vlad> this. But I am more worried about the 4.3 release, and I really believe
Vlad> that the inclusion of the data flow infrastructure should be the first
Vlad> step of stage 1, to give people more time to solve at least some
Vlad> problems.

DF has been successfully tested on many more targets than
originally requested by the GCC SC. The original requirement for targets
was the same as for the Tree-SSA merge. Tree-SSA continued to be cleaned
up, fixed, and improved after it was merged. Tree-SSA performance
improved by the time of the release and was not required to be perfect on
day one.

DF will be good when merged and will continue to improve on
mainline in Stage 2. GCC previously has not had a requirement that a
patch be committed at the beginning of Stage 1.

We understand your concerns, but unsubstantiated assertions like
"might create serious problems..." are not very helpful or convincing
arguments. You are selectively quoting other developers and pulling their
comments out of context to support your objections.

Why, specifically, is the df infrastructure not ready? Have you
investigated the current status? Have you looked at the design documents,
implementation, and comments? Have you followed the mailing list
discussions and patches?

Why is it unacceptable for it to mature further on mainline like
Tree-SSA?

Why is it better to delay merging an announced, planned and
approved project that the developers believe is ready, which will impose
the complexity and work of maintaining a branch with invasive changes for
a full release cycle? It took a long time to fix all of the current users
of dataflow and recent mainline patches continue to introduce new bugs.

Why are the discussions about the current performance, known
performance problems, and specific plans for performance improvement
throughout the rest of the release cycle insufficient to address your
concerns?

David
Vladimir Makarov
2007-02-13 20:40:14 UTC
Permalink
Post by David Edelsohn
Vlad> I am just trying to argue that the proposed df infrastructure is not
Vlad> ready and might create serious problems for this release and for future
Vlad> development because it is slow. Danny is saying that the beauty of the
Vlad> infrastructure is that you improve it in just one place. I agree with
Vlad> this partially. I am only afraid that the solution for a faster
Vlad> infrastructure (e.g. another, slimmer data representation) might change
Vlad> the interface considerably. I am not sure that I can convince anyone of
Vlad> this. But I am more worried about the 4.3 release, and I really believe
Vlad> that the inclusion of the data flow infrastructure should be the first
Vlad> step of stage 1, to give people more time to solve at least some
Vlad> problems.
DF has been successfully tested on many more targets than
originally requested by the GCC SC. The original requirement for targets
was the same as for the Tree-SSA merge. Tree-SSA continued to be cleaned
up, fixed, and improved after it was merged. Tree-SSA performance
improved by the time of the release and was not required to be perfect on
day one.
DF will be good when merged and will continue to improve on
mainline in Stage 2. GCC previously has not had a requirement that a
patch be committed at the beginning of Stage 1.
We understand your concerns, but unsubstantiated assertions like
"might create serious problems..." are not very helpful or convincing
arguments. You are selectively quoting other developers and pulling their
comments out of context to support your objections.
There were too many of them. I hope you are not saying there is not a
single reason for my concerns.
Post by David Edelsohn
Why, specifically, is the df infrastructure not ready? Have you
investigated the current status? Have you looked at the design documents,
implementation, and comments? Have you followed the mailinglist
discussions and patches?
I did investigate the current status of the infrastructure on the
future mainstream processor Core2 (> 11% slower compiler, worse code,
and bigger code size). That is the reason why I started this. I have
known this infrastructure sufficiently well for a long time, and I have
always had concerns about the fat representation.
Post by David Edelsohn
Why is it unacceptable for it to mature further on mainline like
Tree-SSA?
Two releases in a row to be avoided. Not one real experiment of
rewriting an RTL optimization to figure out how def-use chains will
work.
Post by David Edelsohn
Why is it better to delay merging an announced, planned and
approved project that the developers believe is ready, which will impose
the complexity and work of maintaining a branch with invasive changes for
a full release cycle? It took a long time to fix all of the current users
of dataflow and recent mainline patches continue to introduce new bugs.
Why are the discussions about the current performance, known
performance problems, and specific plans for performance improvement
throughout the rest of the release cycle insufficient to address your
concerns?
Ok, at least I tried. I hope the 5% compiler time degradation
criterion is still in effect.
Steven Bosscher
2007-02-13 20:46:09 UTC
Permalink
Post by Vladimir Makarov
Post by David Edelsohn
Why is it unacceptable for it to mature further on mainline like
Tree-SSA?
Two releases in a row to be avoided. Not one real experiment of
rewriting an RTL optimization to figure out how def-use chains will
work.
Vlad, this FUD-spreading is beginning to annoy me. Please get your
view of the facts in order.

There *are* passes rewritten in the new framework to figure out how
this will work. In fact, some of those passes existed even before the
rest of the backend was converted to the new dataflow scheme. Existing
on trunk even now: fwprop, see, web, loop-iv. New on the branch: at
least auto-inc-dec.

Gr.
Steven
Vladimir Makarov
2007-02-13 21:04:56 UTC
Permalink
Post by Steven Bosscher
Post by Vladimir Makarov
Post by David Edelsohn
Why is it unacceptable for it to mature further on mainline like
Tree-SSA?
Two releases in a row to be avoided. Not one real experiment of
rewriting an RTL optimization to figure out how def-use chains will
work.
Vlad, this FUD-spreading is beginning to annoy me. Please get your
view of the facts in order.
There *are* passes rewritten in the new framework to figure out how
this will work. In fact, some of those passes existed even before the
rest of the backend was converted to the new dataflow scheme. Existing
on trunk even now: fwprop, see, web, loop-iv. New on the branch: at
least auto-inc-dec.
Sorry for that. Maybe I missed something. I should have been more
prepared before starting this discussion. I'll look at what is going
on with more attention. I only know that web was originally written on
DF, so the comparison (I mean speed, resources) is not possible.
David Edelsohn
2007-02-13 20:55:24 UTC
Permalink
Vlad> I did investigate the current status of the infrastructure on the
Vlad> future mainstream processor Core2 (> 11% slower compiler, worse code,
Vlad> and bigger code size). That is the reason why I started this.

You do not believe that this is a concern of others? You do not
believe that this will be addressed after the merge?

This could be an example of GCC optimization becoming more
aggressive with increased accuracy, causing increased register pressure
problems, which is particularly detrimental to GCC for IA-32.

Why don't we analyse and fix any problems instead of trying to
keep GCC's infrastructure weak and stupid to cover up its inadequacies?
Complaining about and blocking the merge of df does not solve the problem,
it only delays it.

David
Vladimir Makarov
2007-02-13 21:07:43 UTC
Permalink
Post by David Edelsohn
Vlad> I did investigate the current status of the infrastructure on the
Vlad> future mainstream processor Core2 (> 11% slower compiler, worse code,
Vlad> and bigger code size). That is the reason why I started this.
You do not believe that this is a concern of others? You do not
believe that this will be addressed after the merge?
This could be an example of GCC optimization becoming more
aggressive with increased accuracy, causing increased register pressure
problems, which is particularly detrimental to GCC for IA-32.
Why don't we analyse and fix any problems instead of trying to
keep GCC's infrastructure weak and stupid to cover up its inadequacies?
Complaining about and blocking the merge of df does not solve the problem,
it only delays it.
Once again, I agree to stop this discussion. This is my last email on
the thread. I just hope the thread was not a waste of people's time.
If it was, I apologize for that.
Paolo Bonzini
2007-02-13 20:02:21 UTC
Permalink
Post by Vladimir Makarov
I even have my own tool for this, NONA: http://cocom.sf.net . Although
it might be good research to make it work on insns from different BBs.
Of course, instruction selection is usually done intra-BB. However,
some analyses that combine performs, such as nonzero_bits and
num_sign_bit_copies, could perhaps be extended to use the new framework.
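
For readers who haven't looked at those functions: the flavor of
analysis is "known bits" propagation. A hypothetical C fragment, just
to show the idea (the real nonzero_bits works on rtx and machine modes
and handles far more cases):

/* Return a mask of bits that may be nonzero in the result, given
   may-be-nonzero masks A and B for the operands.  Any bit clear in
   the returned mask is known to be zero.  */
typedef unsigned long long mask_t;

enum toy_op { TOY_AND, TOY_IOR, TOY_ZERO_EXTEND8, TOY_ASHIFT };

static mask_t
nonzero_mask (enum toy_op op, mask_t a, mask_t b, int shift)
{
  switch (op)
    {
    case TOY_AND:
      /* A result bit can be set only if it may be set in both args.  */
      return a & b;
    case TOY_IOR:
      /* (Also correct for XOR.)  Set only if set in either arg.  */
      return a | b;
    case TOY_ZERO_EXTEND8:
      /* Zero-extending the low byte: upper bits are known zero.  */
      return a & 0xff;
    case TOY_ASHIFT:
      /* Left shift by a known constant amount.  */
      return a << shift;
    default:
      /* Anything we don't understand: every bit may be nonzero.  */
      return ~(mask_t) 0;
    }
}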
Post by Vladimir Makarov
The problem is that to use the modern approach you need another
description of the insns (with a one pattern - one machine insn
relation) in a tree representation with a given cost for each tree.
And it is a huge amount of work to rewrite the current machine
descriptions even just for this.
This is not really necessary. I have a stripped-down version of NONA
working on RTL. It is possible to build the tree using a mechanism
similar to subst in combine, simplify the resulting expression, and use
NONA (which actually implements the same algorithm as iburg) to split it.

Paolo
Vladimir Makarov
2007-02-13 20:47:06 UTC
Permalink
Post by Paolo Bonzini
Post by Vladimir Makarov
The problem is that to use the modern approach you need another
description of the insns (with a one pattern - one machine insn
relation) in a tree representation with a given cost for each tree.
And it is a huge amount of work to rewrite the current machine
descriptions even just for this.
This is not really necessary. I have a stripped-down version of NONA
working on RTL. It is possible to build the tree using a mechanism
similar to subst in combine, simplify the resulting expression, and
use NONA (which actually implements the same algorithm as iburg) to
split it.
I tried to do this 7 years ago without success. I am glad that you
have been successful. Thank you for doing this, Paolo. I never thought
that this code, created 15 years ago, could be used anywhere else
besides one internal production compiler.
Paolo Bonzini
2007-02-14 06:07:00 UTC
Permalink
Post by Vladimir Makarov
Post by Paolo Bonzini
Post by Vladimir Makarov
The problem is that to use the modern approach you need another
description of the insns (with a one pattern - one machine insn
relation) in a tree representation with a given cost for each tree.
And it is a huge amount of work to rewrite the current machine
descriptions even just for this.
This is not really necessary. I have a stripped-down version of NONA
working on RTL. It is possible to build the tree using a mechanism
similar to subst in combine, simplify the resulting expression, and
use NONA (which actually implements the same algorithm as iburg) to
split it.
I tried to do this 7 years ago without success. I am glad that you
have been successful. Thank you for doing this, Paolo. I never thought
that this code, created 15 years ago, could be used anywhere else
besides one internal production compiler.
I never said I finished. :-) In particular, my code does not do
combine-style simplifications.

Most current machine descriptions only have a one pattern - one machine
insn relation post-reload, which can be a problem. However, the current
RTX_COSTS are much more precise than 7 years ago, to the point that
combine actually uses them, and this somewhat offsets the problem.

Paolo
Steven Bosscher
2007-02-13 20:05:20 UTC
Permalink
Post by Vladimir Makarov
I am just trying to argue that the proposed df infrastructure is not
ready and might create serious problems for this release and for future
development because it is slow. Danny is saying that the beauty of the
infrastructure is that you improve it in just one place. I agree with
this partially. I am only afraid that the solution for a faster
infrastructure (e.g. another, slimmer data representation) might change
the interface considerably. I am not sure that I can convince anyone of
this. But I am more worried about the 4.3 release, and I really believe
that the inclusion of the data flow infrastructure should be the first
step of stage 1, to give people more time to solve at least some
problems.
I recall this wonderful quote of just a few days ago, which perfectly
expresses my feelings about the proposed merge of the dataflow branch
for GCC 4.3:

"I would hope that the
community would accept the major structural improvement, even if it is
not a 100% complete transition, and that we can then work on any
remaining conversions in the fullness of time."
-- Mark Mitchell, 11 Feb 2007 [1]

:-D

Gr.
Steven





[1] http://gcc.gnu.org/ml/gcc-patches/2007-02/msg01012.html
Seongbae Park
2007-02-13 21:09:03 UTC
Permalink
On 2/13/07, Vladimir Makarov <***@redhat.com> wrote:
...
Post by Vladimir Makarov
I am only afraid that the solution for a faster infrastructure
(e.g. another, slimmer data representation) might change the interface
considerably. I am not sure that I can convince anyone of this. But I
am more worried about the 4.3 release, and I really believe that the
inclusion of the data flow infrastructure should be the first step of
stage 1, to give people more time to solve at least some problems.
Vlad,

I'm really interested in hearing what aspect of the current interface
is not right. Can you give us just a rough sketch of what the slimmer
(hence better) data representation would look like, and why the current
interface won't be ideal for it? It's not too late - if there's
anything we can do to change the interface so that it can accommodate a
potentially better implementation in the future, I won't object to it -
it's just that I haven't talked to you to figure out what you have in
mind.
Post by Vladimir Makarov
In saying that, I have hurt the feelings of people who put a lot of
effort into the infrastructure, like Danny, Ken, and Seongbae, and I am
sorry for that.
No feelings hurt. Thanks for all your feedback.
More eyes, especially experienced ones like yours, can only help.
Also, thanks for trying out DF on Core2
- we need to look more closely at why those regressions are there and
what we can do to fix them; before such an evaluation, I can't tell
whether this is a serious fundamental problem or some easily fixable
things that we haven't gotten to yet.
--
#pragma ident "Seongbae Park, compiler, http://seongbae.blogspot.com"
Richard Kenner
2007-02-14 12:35:06 UTC
Permalink
Post by Vladimir Makarov
The combiner is an older approach to code selection.
Combine can be considered both as code selection and as optimization.
Likewise CSE. In many cases, especially now that we have the tree
optimizers, CSE does more code selection (best choice of operand) than
CSE proper. So you could say that CSE and combine are, in some sense,
doing the same thing but with different data structures.

Well before GCC 4.x there was an attempt that a few of us worked on to try to
move the algebraic simplifications from combine to simplify-rtx.c, just like
we put tree folding in fold-const.c. At that point, combine just becomes
"bookkeeping". The problem with that approach was what to do with such
things as nonzero_bits and num sign bit copies. But if those (and more!)
become part of a global DF infrastructure, then they are available in *all*
RTL passes.

Next, one can finish the task of moving the simplifications to simplify-rtx.c
(since the global information will always be around) and much of the rest of
combine is replaced by the incremental update part of DF.

That means we could merge CSE and combine and do a *lot* more than we ever
were able to do before while having the code for both passes be much simpler.
Paolo Bonzini
2007-02-14 12:54:36 UTC
Permalink
Post by Richard Kenner
Well before GCC 4.x there was an attempt that a few of us worked on to try to
move the algebraic simplifications from combine to simplify-rtx.c, just like
we put tree folding in fold-const.c. At that point, combine just becomes
"bookkeeping". The problem with that approach was what to do with such
things as nonzero_bits and num sign bit copies.
Actually, simplify-rtx.c now uses nonzero_bits and num_sign_bit_copies:
these ask combine for the value in the case of pseudos, via the "RTL
hooks" mechanism. The same RTL hooks mechanism allows combine to try
using its (clobber (const_int 0)) idiom in gen_lowpart, for example. It
was done in 4.1 or 4.2, I don't remember which, and it allowed quite a
few simplifications to be moved from combine_simplify_rtx to
simplify-rtx.c.

It is possible to move a lot more, indeed. The problems are mainly
these two:

1) what to do with (clobber (const_int 0)). This should not be so much
of a problem thanks to validate_change, but I'd be wary of having such
CLOBBER rtx-en in REG_EQUAL notes!

2) a lot of this is hard to do incrementally. For example, it is
hard to move any one of simplify_and_const_int, make_compound_operation,
expand_compound_operation, force_to_mode, simplify_shift_const without
moving the others. This means that the patches would be huge. Or it
could be possible to move simplify_comparison, but it is a 1200-line
function, so the almost-mechanical changes needed to move it would also
be very hard to review.

The problem with fold_rtx is the other way round; it's hard to move the
lookup logic into simplify-rtx.c even using RTL hooks.
Post by Richard Kenner
That means we could merge CSE and combine and do a *lot* more than we ever
were able to do before while having the code for both passes be much simpler.
Unfortunately, we should not forget CSE's secondary functionality, which
is to act as garbageman for the GCSE pass.

Paolo
Richard Kenner
2007-02-14 14:36:26 UTC
Permalink
Post by Paolo Bonzini
these ask combine for the value in the case of pseudos, via the "RTL
hooks" mechanism.
Right. That was certainly a step (and was discussed a while ago), but doing
it more globally would make it even easier.
Post by Paolo Bonzini
1) what to do with (clobber (const_int 0)). This should not be so much
of a problem thanks to validate_change, but I'd be wary of having such
CLOBBER rtx-en in REG_EQUAL notes!
Just return NULL. The philosophy of simplify_rtx is different from
combine's. In the former, you're being asked "can you simplify this?",
while in the latter there's a lot of intertwining between
simplification and substitution. So you can just return "no, I can't
simplify it" and combine will then do what it wants with that result.
Post by Paolo Bonzini
2) a lot of this is hard to do incrementally. For example, it is
hard to move any one of simplify_and_const_int, make_compound_operation,
expand_compound_operation, force_to_mode, simplify_shift_const without
moving the others.
Indeed. And you can't just move them intact because of the change of
philosophy. I originally wrote those routines and so am the most
familiar with them. Once the infrastructure is ready, I probably
should take a look at what's involved.
Post by Paolo Bonzini
Post by Richard Kenner
That means we could merge CSE and combine and do a *lot* more than we ever
were able to do before while having the code for both passes be much simpler.
Unfortunately, we should not forget CSE's secondary functionality, which
is to act as garbageman for the GCSE pass.
Right, but much of that functionality is more in the arena of operand selection
than true "cse".
Paolo Bonzini
2007-02-14 14:59:55 UTC
Permalink
[trimming down the Cc list]
Post by Richard Kenner
Post by Paolo Bonzini
1) what to do with (clobber (const_int 0)). This should not be so much
of a problem thanks to validate_change, but I'd be wary of having such
CLOBBER rtx-en in REG_EQUAL notes!
Just return NULL. The philosophy of simplify_rtx is different from
combine's. In the former, you're being asked "can you simplify this?",
while in the latter there's a lot of intertwining between
simplification and substitution. So you can just return "no, I can't
simplify it" and combine will then do what it wants with that result.
Yes, one possibility is to use an RTL hook for this too. By default
you would return NULL (and this would propagate up); in combine you
could override it to return the CLOBBER.

To some extent, simplify-rtx.c could *know* about CLOBBER. It would
just not return it except in combine.

Paolo
Richard Kenner
2007-02-14 15:17:23 UTC
Permalink
Post by Paolo Bonzini
Yes, one possibility is to use an RTL hook for this too. By default
you would return NULL (and this would propagate up); in combine you
could override it to return the CLOBBER.
I really don't see why. Look at when combine calls the simplify routines now.
If they return zero, it uses the original value. If the simplify routines
do more, that will still be true. I don't see the point in preserving
the CLOBBER kludge in an environment that no longer needs it.
Post by Paolo Bonzini
To some extent, simplify-rtx.c could *know* about CLOBBER.
It *could*, but I think it would be much cleaner if it *didn't*.
Paolo Bonzini
2007-02-14 15:27:57 UTC
Permalink
Post by Richard Kenner
Post by Paolo Bonzini
Yes, one possibility is to use a RTX hook for this too. By default you
would return NULL (and this would propagate up); in combine you could
override it to return the CLOBBER.
I really don't see why. Look at when combine calls the simplify routines now.
If they return zero, it uses the original value.
Some of the combine simplifications (you obviously know this) work by
"hoping" that the CLOBBER is simplified away. I don't think you can
preserve all their power if you propagate NULL. In most cases you can
replace CLOBBER with NULL, but I don't think that's possible everywhere.

Paolo
Richard Kenner
2007-02-14 16:35:45 UTC
Permalink
Post by Paolo Bonzini
Some of the combine simplifications (you obviously know this) work by
"hoping" that the CLOBBER is simplified away. I don't think you can
preserve all their power if you propagate NULL. In most cases you can
replace CLOBBER with NULL, but I don't think that's possible everywhere.
Yeah, but those are quite rare, and I think each could probably be
handled some other way on a case-by-case basis.

David Edelsohn
2007-02-12 21:26:24 UTC
Permalink
Vlad> I especially did not like David Edelsohn's phrase "and no new
Vlad> private dataflow schemes will be allowed in gcc passes". It was not
Vlad> his first such statement. Such phrases kill competition, which is
Vlad> bad for gcc. What if a new specialized scheme is faster? What if
Vlad> somebody decides to write another, better df infrastructure from
Vlad> scratch to solve the coming df infrastructure problems?

First, "another better df infrastructure" is not a private
dataflow scheme. If someone else wants to rewrite it from scratch and do
a better job, be my guest. Most commercial compilers rewrite their
infrastructure every 5 to 10 years; GCC accretes kludges. The other
commercial compilers learn from their algorithms and previous design to
implement a new, maintainable infrastructure that meets the needs of all
algorithms.

Second, this has nothing to do with competition. As I and others
explained on the IRC chat, the new df is general infrastructure. If you
can speed it up more, that's great. If you need another dataflow problem
solved, add it to the infrastructure. GCC is not served well by five (5)
different dataflow solvers, each with its own quirks, bugs, duplicative
memory and duplicative maintenance. It would be a waste to improve GCC's
infrastructure and then have the hard work undermined by recreating the
duplication without good justification.

Third, I am disappointed that you chose to make this argument
personal.

David
Vladimir Makarov
2007-02-12 21:39:57 UTC
Permalink
Post by David Edelsohn
Third, I am disappointed that you chose to make this argument
personal.
David, I really apologize for making it personal. We are all one
community and we are all trying to make gcc a better compiler.
Richard Kenner
2007-02-12 23:44:56 UTC
Permalink
Post by David Edelsohn
Vlad> I especially did not like David Edelsohn's phrase "and no new
Vlad> private dataflow schemes will be allowed in gcc passes". It was not
Vlad> his first such statement. Such phrases kill competition, which is
Vlad> bad for gcc. What if a new specialized scheme is faster? What if
Vlad> somebody decides to write another, better df infrastructure from
Vlad> scratch to solve the coming df infrastructure problems?
First, "another better df infrastructure" is not a private
dataflow scheme. Second, this has nothing to do with competition. As I
and others explained on the IRC chat, the new df is general
infrastructure. If you can speed it up more, that's great. If you need
another dataflow problem solved, add it to the infrastructure. GCC is
not served well by five (5) different dataflow solvers, each with its
own quirks, bugs, duplicative memory and duplicative maintenance.
I agree. "Competition" and making things "better" is not always the
right approach for a project the size of GCC; in fact, it's more often
*not* the right approach. In a project like GCC, stability is more
important. Once something is written, it's obsolete. But changing
something in the infrastructure (or duplicating it) affects stability,
and the gain is rarely worth it.
Vladimir N. Makarov
2007-02-13 03:50:17 UTC
Permalink
Post by Richard Kenner
Post by David Edelsohn
Vlad> I especially did not like David Edelsohn's phrase "and no new
Vlad> private dataflow schemes will be allowed in gcc passes". It was not
Vlad> his first such statement. Such phrases kill competition, which is
Vlad> bad for gcc. What if a new specialized scheme is faster? What if
Vlad> somebody decides to write another, better df infrastructure from
Vlad> scratch to solve the coming df infrastructure problems?
First, "another better df infrastructure" is not a private
dataflow scheme. Second, this has nothing to do with competition. As I
and others explained on the IRC chat, the new df is general
infrastructure. If you can speed it up more, that's great. If you need
another dataflow problem solved, add it to the infrastructure. GCC is
not served well by five (5) different dataflow solvers, each with its
own quirks, bugs, duplicative memory and duplicative maintenance.
I agree. "Competition" and making things "better" is not always the
right approach for a project the size of GCC; in fact, it's more often
*not* the right approach. In a project like GCC, stability is more
important. Once something is written, it's obsolete. But changing
something in the infrastructure (or duplicating it) affects stability,
and the gain is rarely worth it.
In general, I agree with this. I am just objecting to the strong words
"not allowed". I think it should be decided in each concrete case.
Richard Kenner
2007-02-13 13:17:11 UTC
Permalink
Post by Vladimir N. Makarov
In general, I agree with this. I am just objecting to the strong words
"not allowed". I think it should be decided in each concrete case.
I agree that "not allowed" was a poor choice of words in a project
such as GCC and, although we, as a project, do tend to make decisions
one at a time, it's hard for me to see how there could be justification
for a "private" dataflow structure, even if "better" - but one never
knows ...
Kenneth Zadeck
2007-02-13 00:06:25 UTC
Permalink
Post by Vladimir Makarov
On Sunday I accidentally had a chat about the df infrastructure on
IRC. I've got some thoughts which I'd like to share.
I have liked the df infrastructure code from day one for its clarity.
Unfortunately users don't see it and probably don't care about it.
From my point of view the df infrastructure has a design flaw: it
extracts a lot of information about the RTL and keeps it on the side.
That does not make the code fast. It would be ok if we got better code
quality. Danny told me that they get 1.5% better code using df. That
is a really big improvement (about half a year of work for the whole
compiler team according to Proebsting's law). IMHO, it could justify
the promised 5% compiler slowdown.
Vlad,

I think that different people can have different perspectives.

You have been working on improving the register allocation for several
years, but very little has come of it because the reload
infrastructure does not lend itself to being integrated with modern
register allocators. You have spent several years of work without
touching the underlying problem that reload is generally going to
defeat almost any effort to get good benefits out of a new register
allocator. I do not want to denigrate your work in any way, but at
the end of the day, any new register allocator will be compromised by
the existing reload implementation.

I am interested in bringing the rest of the back end into the modern
world. While some of the passes can and should be moved into the ssa
middle end of the compiler, there are several optimizations that can
only be done after the details of the target have been fully exposed.

My experience with trying to do this was that the number one problem
was that the existing dataflow is in many cases wrong or too
conservative, and that it was not flexible enough to accommodate the
most modern optimization techniques. So rather than hack around the
problem, I decided to attack the bad infrastructure problem first, and
open the way for myself and the others who work on the back end to
benefit from that infrastructure and get the rest of the passes into
shape.

There are certainly performance issues here. There are limits on how
much I, and the others who have worked on this, have been able to
change before we do our merge. So far, only those passes that were
directly hacked into flow, such as dce and auto-inc-dec detection, have
been rewritten from the ground up to fully utilize the new framework.

However, it had gotten to the point where the two frameworks really
could not coexist. Both implementations expect to work in an
environment where the information is maintained from pass to pass, and
doing that with two systems was not workable. So the plan accepted by
the steering committee accommodates the wholesale replacement of the
dataflow analysis, but even after the merge there will still be many
passes that will be changed.

I would have liked to have the df information more tightly integrated
into the rtl rather than it being on the side. It is cumbersome to
keep this information up to date. However, the number of places in
the backends that depend on the existing rtl data structure APIs makes
such a replacement very difficult.

I do believe that by the time we merge the branch, we will be down to
a 5% compile time regression. While I would like this number to be 0%
or negative, I personally believe that having precise and correct
information is worth it, and that over time we will be able to remove
that 5% penalty.

As far as the other regressions, these will be dealt with very soon.

Kenny
Vladimir N. Makarov
2007-02-13 02:40:40 UTC
Permalink
Post by Kenneth Zadeck
Vlad,
I think that different people can have different perspectives.
You have been working on improving the register allocation for several
years, but very little has come of it because the reload
infrastructure does not lend itself to being integrated with modern
register allocators. You have spent several years of work without
touching the underlying problem that reload is generally going to
defeat almost any effort to get good benefits out of a new register
allocator. I do not want to denigrate your work in any way, but at
the end of the day, any new register allocator will be compromised by
the existing reload implementation.
Ken, to be exact I've been working for a bit more than 2 years on the
register allocator itself. Probably you don't know that I attacked exactly
the underlying problem - the reload. If you look at the YARA branch, it is a
register allocator without reload. And I am really worried that
little has come of it. But getting rid of reload is such a complex
problem. I cannot work on it for a few more years. I need some results
too.

Using the experience I've got from the YARA branch, I've created another
register allocator (the IRA branch) to make it ready for gcc-4.4. IRA
still uses reload. But maybe I have higher standards. I don't want
to include the code for the sake of inclusion. The code generated by IRA
is no worse than that from the current register allocator. The code size
is smaller for most platforms. For some platforms I get better
generated code (up to 4% on SPECINT2000 in 32-bit mode: to be exact, 1930
vs 1850 for Core2 according to this weekend's benchmarking).

Actually I could make IRA ready for gcc-4.3. It works for x86, x86_64,
itanium, ppc, sparc, s390, arm. It is optional, so other platforms can
use the current register allocator. But I don't want to rush.

But still you are right that reload is compromising the generated code.
If you are interested in my opinion, the df infrastructure is a tiny part
of the RA implementation problem (and, as I understand it, of the insn
scheduler and code selection too). Actually IRA uses the df
infrastructure, but it can easily be switched to the old life analysis.
Post by Kenneth Zadeck
I am interested in bringing the rest of the back end into the modern
world. While some of the passes can and should be moved into the ssa
middle end of the compiler, there are several optimizations that can
only be done after the details of the target have been fully exposed.
Bringing the rest of the back end into the modern world is a very
challenging task. If you really want it, imho you should attack RTL and
the machine descriptions. But this task is an order of magnitude more
difficult than introducing tree-SSA.
Post by Kenneth Zadeck
My experience with trying to do this was that the number one problem
was that the existing dataflow is in many cases wrong or too
conservative and that it was not flexible enough to accommodate many
of the most modern optimization techniques. So rather than hack around the
problem, I decided to attack the bad infrastructure problem first, and
open the way for myself and the others who work on the back end to
benefit from that infrastructure to get the rest of the passes into shape.
I am not against a new DF infrastructure, a more accurate one. I am
against a slower infrastructure.
Post by Kenneth Zadeck
There are certainly performance issues here. There are limits on
how much I, and the others who have worked on this, have been able to
change before we do our merge. So far, only those passes that were
directly hacked into flow, such as dce, and auto-inc-dec detection
have been rewritten from the ground up to fully utilize the new
framework. However, it had gotten to the point where the two
frameworks really should not coexist. Both implementations expect
to work in an environment where the information is maintained from
pass to pass and working with two systems was not workable. So the
plan accepted by the steering committee accommodates the wholesale
replacement of the dataflow analysis but even after the merge, there
will still be many passes that will be changed.
Does it mean that the compiler will be even slower?
Post by Kenneth Zadeck
I would have liked
to have the df information more tightly integrated into the rtl
rather than it being on the side. It is cumbersome to keep this
information up to date. However, the number of places in the
backends that depend on the existing RTL data structure APIs makes
such a replacement very difficult. I do believe that by the time
that we merge the branch, we will be down to a 5% compile time
regression. While I would like this number to be 0% or negative, I
personally believe that having precise and correct information is
worth it and that over time we will be able to remove that 5%
penalty. As far as the other regressions, these will be dealt with
very soon.
Great!
Steven Bosscher
2007-02-13 05:24:08 UTC
Permalink
Post by Vladimir N. Makarov
Post by Kenneth Zadeck
There are certainly performance issues here. There are limits on
how much I, and the others who have worked on this, have been able to
change before we do our merge. So far, only those passes that were
directly hacked into flow, such as dce, and auto-inc-dec detection
have been rewritten from the ground up to fully utilize the new
framework. However, it had gotten to the point where the two
frameworks really should not coexist. Both implementations expect
to work in an environment where the information is maintained from
pass to pass and working with two systems was not workable. So the
plan accepted by the steering committee accommodates the wholesale
replacement of the dataflow analysis but even after the merge, there
will still be many passes that will be changed.
Does it mean that the compiler will be even slower?
No, it will mean the compiler will be faster. Sooner if you help. You
seem to believe that the DF infrastructure is fundamentally slower
than flow is. I believe that there are other reasons for the current
differences in compile time.

AFAICT the current compile time slowdowns on the dataflow branch are due to:

* bitmaps bitmaps bitmaps. We badly need a faster bitmap implementation.

* duplicate work on insn scanning (see the sketch after this list):
  1. DF scans all insns and makes accurate information available.
  2. Many (most) passes see it and think, "Hey, I can do that myself!",
     and they rescan all insns for no good reason.
  The new passes that use the new infrastructure are among the fastest in
  the RTL path right now. The slow passes are the passes doing their own
  thing (CSE, GCSE, regmove, etc.).

* duplicate work between passes (minor):
- on the trunk, regmove can make auto increment insns
- on the df branch, the auto-inc-dec pass makes those
transformations redundant

* earlier availability of liveness information:
- On the trunk we compute liveness for the first time just before combine
- On the dataflow branch, we have liveness already after the first CSE pass
Updating it between CSE and combine over ~20 passes is probably costly
compared to doing nothing on the trunk. (I believe having cfglayout mode
early in the compiler will help reduce this cost thanks to no iterations in
cleanup_cfg)
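
As a toy illustration of the insn-scanning point above (invented names,
not the branch's actual API): once one scanner has built the use counts,
a later pass can answer its question from the cache instead of walking
every insn again.

#define MAX_REGS 128

struct insn { int uses[4]; int n_uses; };      /* toy stand-in for RTL */

struct scan_cache { int use_count[MAX_REGS]; };

/* The df-style approach: one scan over all insns, done up front.  */
static void
scan_all (const struct insn *insns, int n, struct scan_cache *cache)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < insns[i].n_uses; j++)
      cache->use_count[insns[i].uses[j]]++;
}

/* A later pass answers "is this register used?" from the cache in O(1)
   instead of rescanning every insn the way cse/gcse/regmove each do.  */
static int
reg_used_p (const struct scan_cache *cache, int regno)
{
  return cache->use_count[regno] > 0;
}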

Maybe I overestimate the cost of some of these items, and maybe I'm
missing a few items. But the message is the same: There is still
considerable potential for speeding up GCC using the new dataflow
infrastructure.

Gr.
Steven
Kenneth Zadeck
2007-02-13 14:04:49 UTC
Permalink
Post by Vladimir N. Makarov
Post by Kenneth Zadeck
Vlad,
I think that different people can have different perspectives.
You have been working on improving the register allocation for several
years, but very little has come of it because the reload
infrastructure does not lend itself to being integrated with modern
register allocators. You have spent several years of work without
touching the underlying problem that reload is generally going to
defeat almost any effort to get good benefits out of a new register
allocator. I do not want to denigrate your work in any way, but at
the end of the day, any new register allocator will be compromised by
the existing reload implementation.
Ken, to be exact I've been working for a bit more than 2 years on the
register allocator itself. Probably you don't know that I attacked exactly
the underlying problem - the reload. If you look at the YARA branch, it is a
register allocator without reload. And I am really worried that
little has come of it. But getting rid of reload is such a complex
problem. I cannot work on it for a few more years. I need some results
too.
Using the experience I've got from the YARA branch, I've created another
register allocator (the IRA branch) to make it ready for gcc-4.4. IRA
still uses reload. But maybe I have higher standards. I don't want
to include the code for the sake of inclusion. The code generated by IRA
is no worse than that from the current register allocator. The code size
is smaller for most platforms. For some platforms I get better
generated code (up to 4% on SPECINT2000 in 32-bit mode: to be exact, 1930
vs 1850 for Core2 according to this weekend's benchmarking).
Actually I could make IRA ready for gcc-4.3. It works for x86, x86_64,
itanium, ppc, sparc, s390, arm. It is optional, so other platforms can
use the current register allocator. But I don't want to rush.
But still you are right that reload is compromising the generated code.
If you are interested in my opinion, the df infrastructure is a tiny part
of the RA implementation problem (and, as I understand it, of the insn
scheduler and code selection too). Actually IRA uses the df
infrastructure, but it can easily be switched to the old life analysis.
Post by Kenneth Zadeck
I am interested in bringing the rest of the back end into the modern
world. While some of the passes can and should be moved into the ssa
middle end of the compiler, there are several optimizations that can
only be done after the details of the target have been fully exposed.
Bringing the rest of the back end into the modern world is a very
challenging task. If you really want it, imho you should attack RTL and
the machine descriptions. But this task is an order of magnitude more
difficult than introducing tree-SSA.
It is a hard project and you are right that replacing rtl would be
better. However, I do not know how to do that, either from a logistical
point of view or from the point of view of having a better replacement
that covers all of the platforms as well.

However, there are a lot of sins in the back end, and a large number of
them are either being directly addressed by this replacement or are now
accessible. The addition of df will allow others to introduce better
technology.
Post by Vladimir N. Makarov
Post by Kenneth Zadeck
My experience with trying to do this was that the number one problem
was that the existing dataflow is in many cases wrong or too
conservative and that it was not flexible enough to accommodate many
of the most modern optimization techniques. So rather than hack around the
problem, I decided to attack the bad infrastructure problem first, and
open the way for myself and the others who work on the back end to
benefit from that infrastructure to get the rest of the passes into shape.
I am not against a new DF infrastructure, a more accurate one. I am
against a slower infrastructure.
Post by Kenneth Zadeck
There are certainly performance issues here. There are limits on
how much I, and the others who have worked on this, have been able to
change before we do our merge. So far, only those passes that were
directly hacked into flow, such as dce, and auto-inc-dec detection
have been rewritten from the ground up to fully utilize the new
framework. However, it had gotten to the point where the two
frameworks really should not coexist. Both implementations expect
to work in an environment where the information is maintained from
pass to pass and working with two systems was not workable. So the
plan accepted by the steering committee accommodates the wholesale
replacement of the dataflow analysis but even after the merge, there
will still be many passes that will be changed.
Does it mean that the compiler will be even slower?
Post by Kenneth Zadeck
I would have liked
to have the df information more tightly integrated into the rtl
rather than it being on the side. It is cumbersome to keep this
information up to date. However, the number of places in the
backends that depend on the existing RTL data structure APIs makes
such a replacement very difficult. I do believe that by the time
that we merge the branch, we will be down to a 5% compile time
regression. While I would like this number to be 0% or negative, I
personally believe that having precise and correct information is
worth it and that over time we will be able to remove that 5%
penalty. As far as the other regressions, these will be dealt with
very soon.
Algorithmically, the core of the dataflow branch (the scanning, the
setting up of the problems, and the solution finder) is pretty good.
People with the skill set and the interest are still going to be able to
shave a little time here and there, but that is not likely to be where
the big performance gains are going to come from.

The likely source of many of the performance issues is that we are
still having to maintain some of the older, outdated data structures.
Regs_ever_live is the poster child of this. In theory regs_ever_live is
easy: it is just the set of hard registers that are used. In practice
this is a disaster to keep track of, because it was only updated
occasionally and its values are "randomly" changed by the backends in
totally undocumented ways. Maintaining regs_ever_live requires a lot of
special mechanism that slows down the incremental scanning. To the
extent that someone just wants to know whether some register is used, df
has accessible counters for each reg that give the exact number of uses
and defs. However, because of the manual fiddling of regs_ever_live, it
is impossible to just replace one structure with another. It is quite
likely that 1% of the slowdown can be charged to having to maintain this
duplicate structure.

These older data structures will, with time, be replaced with direct
access to df's "information on the side", but for now it is just too
much work to try to fix them all for the merge.
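
A toy contrast (invented names, nothing from GCC itself): the
regs_ever_live style is a flag array that the backends may flip by hand,
so its meaning drifts, while the df style keeps exact counters that only
the scanner updates and that a query simply reads.

#define MAX_REGS 128

/* Old style: a flag anyone may set or clear at any time, in
   undocumented ways.  */
static char regs_ever_live_flag[MAX_REGS];

/* df style: exact counters that only the (re)scanner updates.  */
static int reg_def_count[MAX_REGS];
static int reg_use_count[MAX_REGS];

static void
scanner_record_def (int regno) { reg_def_count[regno]++; }

static void
scanner_record_use (int regno) { reg_use_count[regno]++; }

/* "Is this register ever used?" becomes a pure query; there is no
   separate bit to keep in sync by hand.  */
static int
reg_ever_used_p (int regno)
{
  return reg_def_count[regno] + reg_use_count[regno] > 0;
}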
Post by Vladimir N. Makarov
Great!
Richard Kenner
2007-02-13 14:29:38 UTC
Permalink
Post by Kenneth Zadeck
Regs_ever_live is the poster child of this. In theory regs_ever_live is
easy: it is just the set of hard registers that are used. In practice
this is a disaster to keep track of, because it was only updated
occasionally and its values are "randomly" changed by the backends in
totally undocumented ways. Maintaining regs_ever_live requires a lot of
special mechanism that slows down the incremental scanning.
The history here, by the way, is that it was originally very simple and just
supposed to provide a "quick and easy" way of having a conservative view of
which registers *weren't* ever used. So it was set when a register might
possibly be used. That was indeed easy.

But then people wanted to be able to know *for sure* which registers were
used, so mechanisms were added to clear it out when we knew a register
*wasn't* used, which added the complexity you mention.

This is a problem with a lot of the ad-hoc structures used: they were
originally meant for one specific purpose and often used very locally and
were reasonably well-designed for that purpose, but then were used more
globally and/or for other purposes and they weren't quite so well designed
for that purpose anymore, but nobody went to the trouble to change them.

I strongly support a new, common infrastructure that will allow all of these
older structures to be replaced. But the history is important in my opinion
because it means that we need to think as generally as possible and to ensure
we come up with as broad a structure as possible, in order both to replace
the current structures and to support many other uses in the future. From
what I understand, the current mechanism does that, but I think it's
important to keep this criterion in mind when evaluating any possible
"competitors".
Kenneth Zadeck
2007-02-13 15:29:01 UTC
Permalink
Post by Richard Kenner
Post by Kenneth Zadeck
Regs_ever_live is the poster child of this. In theory regs_ever_live is
easy: it is just the set of hard registers that are used. In practice
this is a disaster to keep track of, because it was only updated
occasionally and its values are "randomly" changed by the backends in
totally undocumented ways. Maintaining regs_ever_live requires a lot of
special mechanism that slows down the incremental scanning.
The history here, by the way, is that it was originally very simple and just
supposed to provide a "quick and easy" way of having a conservative view of
which registers *weren't* ever used. So it was set when a register might
possibly be used. That was indeed easy.
But then people wanted to be able to know *for sure* which registers were
used, so mechanisms were added to clear it out when we knew a register
*wasn't* used, which added the complexity you mention.
This is a problem with a lot of the ad-hoc structures used: they were
originally meant for one specific purpose and often used very locally and
were reasonably well-designed for that purpose, but then were used more
globally and/or for other purposes and they weren't quite so well designed
for that purpose anymore, but nobody went to the trouble to change them.
I strongly support a new, common infrastructure that will allow all of these
older structures to be replaced. But the history is important in my opinion
because it means that we need to think as generally as possible and to ensure
we come up with as broad a structure as possible, in order both to replace
the current structures and to support many other uses in the future. From
what I understand, the current mechanism does that, but I think it's
important to keep this criterion in mind when evaluating any possible
"competitors".
I very much understand the "I just need to make this one little tweak"
road to hell. By definition, this was no one's plan. It will most
likely take us all of stage 2 to understand and replace all of the
little tweaks with respect to regs_ever_live.
Daniel Berlin
2007-02-13 04:56:05 UTC
Permalink
Post by Vladimir Makarov
On Sunday I had accidentally chat about the df infrastructure on
IIRC. I've got some thoughts which I'd like to share.
I like df infrastructure code from the day one for its clearness.
Unfortunately users don't see it and probably don't care about it.
With my point of view the df infrastructure has a design flaw. It
extracts a lot of information about RTL and keep it on the side. It
does not make the code fast. It would be ok if we got a better code
quality. Danny told me that they have 1.5% better code using df.
I also said we hadn't run numbers in about a year and a half, because
it wasn't our main goal anymore.

...
Post by Vladimir Makarov
especially if we have no alternative faster path (even if the old live
analysis is a mess).
I also pointed out that df's merge criterion is no more than a 5%
compile time degradation.
So what are you worried about here?

Life analysis isn't just a mess, it's inaccurate, and intertwined
with dead store elimination and dead code elimination.
Post by Vladimir Makarov
Even rewriting the current optimizations on the new data flow
infrastructure makes the situation worse, because it will not be easy
to get rid of the data flow infrastructure, since the flaw is probably
partly in the df interface.
What?
The flaw we have now is that every pass creates its own
data structures and dataflow, and it is completely impossible to make
dataflow faster without rewriting every single pass.

With DF, you could make every single pass faster simply by improving ... DF!

If the data structures it has don't work well enough for a given pass, of
course, you can add your own as df problems and results.
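
For readers who have not seen one, here is a minimal generic sketch (not
the branch's code; all names are invented) of the kind of solver df
centralizes: backward liveness over basic blocks, iterated to a fixed
point. The private schemes in the older passes reimplement variations of
this loop.

#include <string.h>

#define MAX_SUCC 2
#define WORDS    4              /* bitset covering 128 registers */

struct bb {
  unsigned long use[WORDS], def[WORDS];    /* local, per-block facts */
  unsigned long live_in[WORDS], live_out[WORDS];
  int succ[MAX_SUCC], n_succ;
};

/* Iterate the classic backward liveness equations to a fixed point:
   live_out(b) = union of live_in over the successors of b
   live_in(b)  = use(b) | (live_out(b) & ~def(b))  */
static void
solve_liveness (struct bb *bbs, int n)
{
  int changed = 1;
  while (changed)
    {
      changed = 0;
      for (int b = n - 1; b >= 0; b--)  /* reverse order suits a backward problem */
        {
          unsigned long out[WORDS] = { 0 }, in[WORDS];

          /* Confluence: merge the successors' live_in sets.  */
          for (int s = 0; s < bbs[b].n_succ; s++)
            for (int w = 0; w < WORDS; w++)
              out[w] |= bbs[bbs[b].succ[s]].live_in[w];

          /* Transfer function for this block.  */
          for (int w = 0; w < WORDS; w++)
            in[w] = bbs[b].use[w] | (out[w] & ~bbs[b].def[w]);

          if (memcmp (in, bbs[b].live_in, sizeof in) != 0)
            changed = 1;
          memcpy (bbs[b].live_in, in, sizeof in);
          memcpy (bbs[b].live_out, out, sizeof out);
        }
    }
}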
Post by Vladimir Makarov
So it might create problems in the future.
In particular I did not like David Edelsohn's phrase "and no new
private dataflow schemes will be allowed in gcc passes". It was not
his first such statement. Such phrases kill competition, which
is bad for gcc. What if a new specialized scheme is faster? What
if somebody decides to write another, better df infrastructure from
scratch to solve the coming df infrastructure problems?
If you want to rewrite DF, please do.
But honestly, GCC has enough separate solvers that simply are not
faster than the df branch's solver anymore. We know; we replaced a lot
of them.

And that's the thing. We had to go and replace every single one of
these, when if they had just used df's solver in the first place (and
taken the 1-2% slowdown they probably faced), they would all just have
been sped up.
Worse, some of these solvers were buggy or inaccurate, so now that we
give them better information, faster, we have to go fix bugs that never
would have existed had they reused the infrastructure we provided.
This is, in fact, a lot of what has taken up df branch time: fixing
bugs that fixing the dataflow exposed.
Post by Vladimir Makarov
I am not in opposition to the merge if it satisfies the merge criteria.
People have done a lot of work. It is too late. I should have opposed
the criteria when they were discussed. Sorry, I missed the discussion,
if there was such a discussion. I am just raising some questions and
saying that more work will be needed on the df infrastructure even after
the merge.
There is always more work to be done.

BTW, I'll happily remove DF when all that is left of RTL is the
scheduler, RA, and instruction selector.
Hell, I'll throw a party.

But i wouldn't hold your breath for this to happen. :)
--Dan
Steven Bosscher
2007-02-13 21:34:42 UTC
Permalink
Post by Vladimir Makarov
Getting 0.5% code degradation and 11.5% slowness
(308 sec vs 275 sec for compiling SPECINT2000) does not seem reasonable
Just to be sure: Did you build with --disable-checking for both
compilers? I often find myself comparing compilers with checking
enabled, so, you know, just checking... ;-)
Thanks,

Gr.
Steven