Discussion:
Compiler selection for log10/Qsort on ARM64
Bharat Bhushan
2017-02-07 15:50:57 UTC
Hi All,

I am working on log10/qsort benchmarks on an ARM64 (ARMv8) processor, and I
want to check whether anyone here has experience with these benchmarks.
I am looking for the compiler version and optimization options that give the
best results on these benchmarks (in my case I see that -O3 gives the best
numbers).

I have tried GCC 4.9 and GCC 6.2 with the log10 benchmark, and my observations
are:

1) With GCC 4.9 - 140 us

2) With GCC 6.2 - 150 us


My compilation flags are "-O3 -ftree-vectorize -funroll-all-loops --param
max-inline-insns-auto=550 --param case-values-threshold=30
-falign-functions=32 -ftracer"


So it seems like GCC 6.2 is better. Am I missing something? Should I use
some better compiler flags?


Thanks

-Bharat
Jim Wilson
2017-02-07 18:07:19 UTC
On Tue, Feb 7, 2017 at 7:50 AM, Bharat Bhushan wrote:
Post by Bharat Bhushan
I am working on log10/qsort benchmarks on ARM64 (ARMv8) processor,
I want to check if we have experience with these benchmarks.
We have experience with things like SPEC and Coremark, which are
compiler performance benchmarks. log10/qsort sounds like glibc
functions. Are you testing glibc performance? That would perhaps
depend more on the glibc version than the compiler version.
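
Since the result may depend on both the compiler and the C library, it can help to record both versions next to every measurement. A minimal sketch, assuming a glibc-based system; `describe_toolchain` is a hypothetical helper name, while `__VERSION__` and `gnu_get_libc_version()` are provided by GCC and glibc respectively:

```c
#include <stdio.h>
#include <gnu/libc-version.h> /* glibc-specific header */

/*
 * Hypothetical helper: write a one-line description of the compiler that
 * built this file and of the glibc found at run time into buf.
 * Returns what snprintf returns (the length the full string needs,
 * excluding the terminating NUL).
 */
int describe_toolchain(char *buf, size_t len)
{
#ifdef __VERSION__
    const char *compiler = __VERSION__; /* predefined by GCC and Clang */
#else
    const char *compiler = "unknown";
#endif
    return snprintf(buf, len, "compiler: %s; glibc: %s",
                    compiler, gnu_get_libc_version());
}
```

Printing this string at the start of each benchmark run makes it easier to tell whether a difference comes from the compiler or from the library.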
Post by Bharat Bhushan
Actually i am looking for a compiler version which gives best results with
these benchmarks and specific compiler optimization (in my case is see O3
gives best numbers) ?
What exactly are you trying to optimize? If you want best performance
for your application, then you try every compiler version and every
option and use the combination that gives the best performance. We
toolchain developers only care about performance of the latest
version, and if it isn't the best performing one, then we try to fix
it. If you want best performance for the most people, then you
concentrate on -O2 results as that is what most people use. I can't
give a better answer without more specifics of what exactly you are
trying to do.
Post by Bharat Bhushan
I have tried GCC-4.9 and GCC-6.2 with log10 benchmark and my observations
1) With gcc 4.9 - 140 us
2) With GCC 6.2 - 150 us
My compilation flags are "-O3 -ftree-vectorize -funroll-all-loops --param
max-inline-insns-auto=550 --param case-values-threshold=30
-falign-functions=32 -ftracer"
So it seems like gcc-6.2 is better, am i missing something, should i use
some better compiler flags?
Usually for benchmarks, a faster runtime is a better result, so it
looks like gcc-4.9 is giving the better result. If that is a gcc-6
bug, then it should be reported so we can try to fix it. However, you
are using a lot of options, and some of those options aren't the
default because they don't always give the best results. The
usefulness of some uncommon optimization options can vary from one gcc
release to the next. You may need to use different sets of gcc
options with different gcc versions to get the best results. But
again, as mentioned above, this all depends on what exactly you are
trying to do, and you haven't given us enough info to understand that.
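
To compare compiler versions and option sets on equal terms, the same small timing harness can be rebuilt with each combination. The following is only a sketch under stated assumptions: `bench_unary` and `baseline` are hypothetical names, the original benchmark code is not shown in this thread, and the harness assumes a POSIX system with `clock_gettime`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* clock_gettime under strict -std=c99/c11 */
#endif
#include <math.h>   /* declares log10 for callers of bench_unary */
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical no-op function: timing it estimates the loop overhead. */
double baseline(double x) { return x; }

/*
 * Hypothetical harness: call fn(1.0 + i) for i = 0 .. iters-1 and return
 * the average time per call in nanoseconds. The sum of all results is
 * stored through sink so the calls cannot be optimized away.
 *
 * Example: bench_unary(log10, 1000000, &sink);   (link with -lm)
 */
double bench_unary(double (*fn)(double), unsigned long iters, double *sink)
{
    volatile double start = 1.0; /* volatile: blocks constant folding */
    double sum = 0.0;
    uint64_t t0 = now_ns();
    for (unsigned long i = 0; i < iters; i++)
        sum += fn(start + (double)i);
    uint64_t t1 = now_ns();
    *sink = sum;
    return iters ? (double)(t1 - t0) / (double)iters : 0.0;
}
```

Timing `baseline` first and subtracting it from the `log10` figure gives a rough per-call cost that does not include the loop itself. Note that when `log10` comes from glibc, the compiler flags mostly affect the harness, not the function being measured.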

Jim
Bharat Bhushan
2017-02-09 16:20:14 UTC
Thanks Jim,

When I use -mtune and/or -mcpu with GCC 6.2, I see almost the same number
as with GCC 4.9.

Thanks
-Bharat
Adhemerval Zanella
2017-02-07 18:14:16 UTC
Post by Bharat Bhushan
Hi All,
I am working on log10/qsort benchmarks on ARM64 (ARMv8) processor,
I want to check if we have experience with these benchmarks.
Actually i am looking for a compiler version which gives best results with these benchmarks and specific compiler optimization (in my case is see O3 gives best numbers) ?
1) With gcc 4.9 - 140 us
2) With GCC 6.2 - 150 us
My compilation flags are "-O3 -ftree-vectorize -funroll-all-loops --param max-inline-insns-auto=550 --param case-values-threshold=30 -falign-functions=32 -ftracer"
So it seems like gcc-6.2 is better, am i missing something, should i use some better compiler flags?
It is really hard to give you any advice without the actual code to check what
exactly you are measuring. Are you using a custom log10 implementation or
the glibc one?

The compiler options look like what you would expect for a mathematical
workload; however, I would profile and check whether both
'-funroll-all-loops' and '--param max-inline-insns-auto=550 --param
case-values-threshold=30' are actually helping in this case. All of them
tend to increase code size, which may or may not put pressure on the
icache; it really depends on the workload and data flow.

In any case, it would be good to profile the code to find exactly where
the hotspot is and, based on the code and its characteristics, check whether
any other flags or other kinds of optimization (PGO, IPA) can help you.
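
As an illustration of the kind of code one might profile, here is a minimal qsort timing sketch. It is an assumption-laden example, not the benchmark discussed in this thread: `bench_qsort` and `cmp_u32` are hypothetical names, and the input is a fixed pseudo-random sequence so that every build sorts the same data. The comments show one plausible way to try PGO with GCC's `-fprofile-generate` and `-fprofile-use` options.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* clock_gettime under strict -std=c99/c11 */
#endif
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/*
 * Possible PGO workflow with GCC (file and program names are hypothetical):
 *   gcc -O2 -fprofile-generate bench.c -o bench && ./bench
 *   gcc -O2 -fprofile-use      bench.c -o bench
 * Only this file is recompiled; qsort itself lives inside the C library,
 * so the flags affect the comparator and the harness, not qsort's body.
 */

/* Comparator for qsort: orders uint32_t values ascending. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical harness: sort n xorshift32-generated values with qsort.
 * Returns the elapsed time in nanoseconds, or -1.0 if allocation fails.
 * *sorted_ok is set to 1 when the result is in nondecreasing order.
 */
double bench_qsort(size_t n, uint32_t seed, int *sorted_ok)
{
    uint32_t *v = malloc((n ? n : 1) * sizeof *v);
    if (!v)
        return -1.0;

    uint32_t s = seed ? seed : 1u; /* xorshift32 state must be nonzero */
    for (size_t i = 0; i < n; i++) {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        v[i] = s;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    qsort(v, n, sizeof *v, cmp_u32);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sorted_ok = 1;
    for (size_t i = 1; i < n; i++)
        if (v[i - 1] > v[i])
            *sorted_ok = 0;

    free(v);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e9
         + (double)(t1.tv_nsec - t0.tv_nsec);
}
```

Because the comparator is called through a function pointer from inside the library, much of the measured time is likely to be outside the reach of the flags listed above, which is one reason to profile before tuning options.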