Discussion:
GCC performance questions
Gunnar von Boehn
2015-11-04 15:07:09 UTC
Dear List,

I'm new to this list and have some questions.
Looking at the code GCC generates for ARMv8, we noticed some areas with room for performance improvement.
I assume that you may already have noticed these items.

For example:

1) We noticed that when writing typical DGEMM-like code, GCC emits unnecessary DUP instructions

2) GCC seems unwilling to use LDP loads

3) For optimal FPU performance on some A57 cores, it is necessary to interleave instructions working on ODD and EVEN registers

GCC does not seem to support this properly. Here a different instruction interleaving can sometimes yield a 100% performance increase.

4) Some work loops benefit greatly from interleaving FPU instructions and loads.

GCC seems to like to rearrange the code so that most or all loads are placed at the top of the loop.
This can significantly reduce the performance of a well-written work loop.


I have no patches to fix this.
But I can produce C code and ASM output that show these performance issues.

Please tell me what the recommended next step is.
Are all these items known already, or shall I provide code examples to further explain them?


Kind regards
Gunnar von Boehn
Jim Wilson
2015-11-04 20:32:04 UTC
On Wed, Nov 4, 2015 at 7:07 AM, Gunnar von Boehn
Post by Gunnar von Boehn
Looking at the code GCC generates for ARMv8, we noticed some areas with
room for performance improvement.
I assume that you may already have noticed these items.
There is a known problem that the current register allocator doesn't
handle partial overlap very well. Both aarch64 and aarch32 use the
same register set for FP and SIMD/neon, which results in lots of
partial overlaps, which can confuse the register allocator into using
unnecessary temporaries. Otherwise, I don't think that we have any
major problems, other than the fact that vectorization is a hard
problem to solve, and we do have lots of examples showing non-optimal
code generation in certain cases.
Post by Gunnar von Boehn
1) We noticed that when writing typical DGEMM-like code, GCC emits
unnecessary DUP instructions
This could be the known register allocation problem. It is hard to
say more without a testcase.
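
As an illustration of the kind of testcase meant here, a minimal DGEMM-like kernel might look like the sketch below. The function name dgemm_like and the row-major layout are assumptions made for the example, not code from the original report. The scalar a[i*n+k] is reused across a whole row of B, which is the pattern where a vectorizing compiler may broadcast it with a DUP.

```c
#include <stddef.h>

/* Hypothetical DGEMM-like kernel: C += A * B for n x n row-major
   matrices. The name and layout are assumptions for illustration. */
void dgemm_like(size_t n, const double *restrict a,
                const double *restrict b, double *restrict c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            /* One scalar of A is multiplied with a full row of B. */
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Compiling this with something like gcc -O3 -mcpu=cortex-a57 -S and reading the inner loop is one way to check whether a DUP appears and whether it sits inside the loop.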
Post by Gunnar von Boehn
2) GCC seems unwilling to use LDP loads
There is support for generating LDP/STP, but since this usually
involves combining unrelated data, it is done in a peephole pass and
may not be triggering as often as we would like, as the peephole
optimization only works well if you get lucky register allocation and
instruction scheduling that creates peephole optimization
opportunities. Can't say more without a testcase.
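
A small testcase for this could be as simple as the following sketch, in which two adjacent 64-bit fields are loaded from the same base address. The struct and function names are hypothetical. Whether GCC actually merges the two loads into one LDP depends on the compiler version, register allocation and scheduling, as described above.

```c
#include <stdint.h>

/* Hypothetical layout: two adjacent 64-bit fields at offsets 0 and 8. */
struct pair {
    int64_t lo;
    int64_t hi;
};

/* Both fields are read from the same base pointer, so the two loads
   are candidates for being combined into a single LDP. */
int64_t sum_pair(const struct pair *p)
{
    return p->lo + p->hi;
}
```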
Post by Gunnar von Boehn
3) For optimal FPU performance on some A57 cores, it is necessary to
interleave instructions working on ODD and EVEN registers
GCC does not seem to support this properly. Here a different instruction
interleaving can sometimes yield a 100% performance increase.
A patch was added to GCC 6 for this. It looks like it has been
backported into the Linaro gcc-5.x sources. From git log on the
Linaro gcc-5.x tree:

commit 9c9ff2bc6885aa07d55ecef8248c08a8e14ff9b6
Author: Christophe Lyon <***@linaro.org>
Date: Mon Oct 5 15:17:57 2015 +0200

gcc/
Backport from trunk r222512.
2015-04-28 Thomas Preud'homme <***@arm.com>

PR target/63503
* config.gcc: Add cortex-a57-fma-steering.o to extra_objs for
aarch64-*-*.
* config/aarch64/t-aarch64: Add a rule for cortex-a57-fma-steering.o.
* config/aarch64/aarch64.h (AARCH64_FL_USE_FMA_STEERING_PASS): Define.
(AARCH64_TUNE_FMA_STEERING): Likewise.
* config/aarch64/aarch64-cores.def: Set
AARCH64_FL_USE_FMA_STEERING_PASS for cores with dynamic steering of
FMUL/FMADD instructions.
* config/aarch64/aarch64.c (aarch64_register_fma_steering): Declare.
(aarch64_override_options): Include cortex-a57-fma-steering.h. Call
aarch64_register_fma_steering () if AARCH64_TUNE_FMA_STEERING is true.
* config/aarch64/cortex-a57-fma-steering.h: New file.
* config/aarch64/cortex-a57-fma-steering.c: Likewise.

Change-Id: I92e0e8d06fc5212e8856d6d5f9c7c6b83a737ca8

There are a number of related changes after this one. I don't know
how well this works as I haven't tried using it.
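
For anyone who wants to try it, a sketch of code with two independent multiply-accumulate chains is shown below. The function name dot2 is hypothetical, and it is an assumption that this shape is representative of what the pass targets. GCC may contract each multiply-add into an FMADD, and the registers chosen for the two accumulators are what one would inspect in the assembly.

```c
#include <stddef.h>

/* Hypothetical dot product with two independent accumulator chains.
   Each chain is a sequence of dependent multiply-adds. */
double dot2(const double *x, const double *y, size_t n)
{
    double even = 0.0;
    double odd = 0.0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        even += x[i] * y[i];         /* chain 1 */
        odd += x[i + 1] * y[i + 1];  /* chain 2 */
    }
    if (i < n)                       /* leftover element when n is odd */
        even += x[i] * y[i];

    return even + odd;
}
```

Building with something like gcc -O2 -mcpu=cortex-a57 -S on a compiler that contains the patch, and comparing against one that does not, is one way to see whether the pass changed the accumulator registers.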
Post by Gunnar von Boehn
4) Some work loops benefit greatly from interleaving FPU instructions and loads.
GCC seems to like to rearrange the code so that most or all loads are placed
at the top of the loop.
This can significantly reduce the performance of a well-written work loop.
This isn't an ARM specific problem, and within the ARM family, it is
target dependent, as it depends on how the instruction scheduler hooks
have been written for the target you are optimizing for. I know for
some of the cortex parts, there was an effort to report fewer
load/store pipes than exist, so that gcc would not schedule all loads
at the start of a loop. I don't know how effective it is though.
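
A sketch of a loop that could be used to observe this is given below. The function name axpy_interleaved and the unroll factor of four are assumptions for the example. The source deliberately alternates loads and FP arithmetic, but the instruction scheduler is free to reorder them, so the generated assembly has to be checked to see where the loads end up.

```c
#include <stddef.h>

/* Hypothetical work loop: y[i] += alpha * x[i], unrolled by four.
   Loads and FP operations alternate in the source on purpose. */
void axpy_interleaved(size_t n, double alpha,
                      const double *restrict x, double *restrict y)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        double x0 = x[i];
        double y0 = y[i];
        y[i] = y0 + alpha * x0;

        double x1 = x[i + 1];
        double y1 = y[i + 1];
        y[i + 1] = y1 + alpha * x1;

        double x2 = x[i + 2];
        double y2 = y[i + 2];
        y[i + 2] = y2 + alpha * x2;

        double x3 = x[i + 3];
        double y3 = y[i + 3];
        y[i + 3] = y3 + alpha * x3;
    }

    /* Remainder elements when n is not a multiple of four. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

Comparing the output of gcc -O2 -S with and without -fno-schedule-insns and -fno-schedule-insns2 gives a rough idea of how much of the load placement comes from the scheduler.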
Post by Gunnar von Boehn
Please tell me what the recommended next step is.
Are all these items known already, or shall I provide code examples to
further explain them?
You can try filing bug reports into the FSF bugzilla at
http://gcc.gnu.org/bugzilla or the Linaro bugzilla at
http://bugs.linaro.org. Bugs filed into the FSF bugzilla will get
better visibility, as all ARM gcc developers will see them. The
problems you are reporting are mostly hard problems that may not be
fixed for a while, and these kinds of problems are probably better
reported into the FSF bugzilla. Issues specific to Linaro should of
course go into the Linaro bugzilla. You can try giving us testcases
here, but if it isn't something we can fix in a few minutes, then it
is better if it goes into bugzilla.

Jim
