Loop fusion.

Discussion:

Loop fusion.

Toon Moene

2015-04-22 16:59:35 UTC

L.S.,

Last week, a colleague of mine from Meteo France held a talk at the
yearly meeting of all researchers working on HARMONIE (see
http://hirlam.org) discussing the performance of our code when compiled
with each of the supported compilers on the Cray XC30 at ECMWF
(http://www.ecmwf.int/en/computing/our-facilities).

In the context of GCC this is relevant, because one of the three
compilers is gfortran (version 4.9.2).

One of his slides discussed the differences in optimizations that the
three compilers offer; I was surprised to learn that GCC/gfortran
doesn't do loop fusion *at all*. Note, I discussed loop fusion (among
other optimizations) at LinuxExpo 99 (http://moene.org/~toon/nwp.ps)
which, unsurprisingly, was held 16 years ago :-)

Why is loop fusion important, especially in Fortran 90 and later programs ?

Because without it, every array assignment is a single loop nest,
isolated from related, same-shape assignments.

Consider this (artificial, but typical) example [updating atmospheric
quantities after the computation of the rate of change during a time
step of the integration]:

SUBROUTINE UPDATE_DT(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, &
& NLON, NLAT, NLEV, TSTEP)
...
REAL, DIMENSION(NLON, NLAT, NLEV) :: T, U, V, Q, DTDT, DUDT, DVDT, DQDT
...
T = T + TSTEP*DTDT ! Update temperature
U = U + TSTEP*DUDT ! Update east-west wind component
V = V + TSTEP*DVDT ! Update north-south wind component
Q = Q + TSTEP*DQDT ! Update specific humidity
...
END

This generates four consecutive 3 deep loop nests over NLEV, NLAT, NLON.
Of course, it would be much more efficient if this were just one loop
nest, as Fortran 77 programmers would write it:

DO JLEV = 1, NLEV
DO JLAT = 1, NLAT
DO JLON = 1, NLON
T(JLON, JLAT, JLEV) = T(JLON, JLAT, JLEV) + TSTEP*DTDT(JLON,
JLAT, JLEV)
U(JLON, JLAT, JLEV) = U(JLON, JLAT, JLEV) + TSTEP*DUDT(JLON,
JLAT, JLEV)
V(JLON, JLAT, JLEV) = V(JLON, JLAT, JLEV) + TSTEP*DVDT(JLON,
JLAT, JLEV)
Q(JLON, JLAT, JLEV) = Q(JLON, JLAT, JLEV) + TSTEP*DQDT(JLON,
JLAT, JLEV)
ENDDO
ENDDO
ENDDO

After a loop fusion optimization pass the Fortran 90 and the Fortran 77
code should result in the same assembler output.

Is this something the Graphite infrastructure could help with ? From the
wiki documentation I get the impression that it only works on single
loop nests, but I must confess that I am not familiar with the
nomenclature in its description ...

Would it be hard to write a loop fusion pass otherwise ?

Kind regards,

--
Toon Moene - e-mail: ***@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

Toon Moene

2015-04-22 20:05:15 UTC

Permalink

Post by Toon Moene
Why is loop fusion important, especially in Fortran 90 and later programs ?
Because without it, every array assignment is a single loop nest, isolated
from related, same-shape assignments.

Why is this a bad thing? When you're talking about single-node
machines, separate loops is probably faster if your arrays are large
enough: better cache locality and easier to vectorize.
Loop fusion is only a win if you iterate through the same array
variables. Writing such a pass is not so hard for the simple, most
common cases. The front end could do some of the rewriting from
F90-style array assignments to fused loops if it notices consecutive
array assignments/operations on the same variables.

It could well be that my artificial example was not what my colleague
measured ...

Indeed, I thought about the front end doing this, but that would limit
it to those that the front end could recognize; on the other hand, that
might be the right limitation.

Thanks !

Richard Biener

2015-04-23 08:29:02 UTC

Permalink

Post by Toon Moene

It could well be that my artificial example was not what my colleague
measured ...
Indeed, I thought about the front end doing this, but that would limit it to
those that the front end could recognize; on the other hand, that might be
the right limitation.

Generally loop fusion is a nice-to-have thing but as Steven points out
your example isn't particularly well-suited here. In fact loop distribution
(which performs fission) would undo such fusion.

But it also points to infrastructure we already have - loop distribution
can do the reverse transform so doing fusion there as well looks
natural (given you might fuse parts of a loop with parts of another
loop but keep parts separate as well - all based on data locality
analysis loop distribution performs).

But usually fusion requires some enablement transform(s) to make
the loop iteration spaces match.

Oh, and it's probably time to commit the patches I developed some
years ago to make loop distribution handle loop nests...

Oh, and I'm probably going to look into this and related transforms
(that enablement stuff) for GCC 6.

Richard.

Post by Toon Moene
Thanks !
--
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news