Discussion:
question on aarch64 libm
Virendra Kumar Pathak
2016-01-18 17:54:33 UTC
Hi Linaro Toolchain Group,

I have a few questions on glibc+libm w.r.t. aarch64.
If possible, please provide some insight, otherwise kindly redirect me to
the concerned person/forum.

1. It seems from the community patches that ARM/Linaro is optimizing glibc
functions such as memcpy/memmove and the string functions for aarch64.
However, it looks like some of these patches (e.g. memcpy/memmove) are still
not merged in glibc. Any comment on their availability in glibc?
e.g. https://www.sourceware.org/ml/libc-alpha/2015-12/msg00341.html


2. On the same note, is there any plan for optimizing/tuning libm functions
(e.g. trigonometric) for aarch64?
I could not find any matching patches on the review board. Please correct me
if I am wrong.

3. It looks like ARM has released an independent version of libm for certain
trigonometric functions:
https://github.com/ARM-software/optimized-routines
Is there any plan for these optimizations to go into glibc's libm? Any comment
on their performance improvement over GNU libm?

Thanks in advance for your time.
--
with regards,
Virendra Kumar Pathak
Adhemerval Zanella
2016-01-18 18:36:17 UTC
Hi Virendra,
Post by Virendra Kumar Pathak
Hi Linaro Toolchain Group,
I have a few questions on glibc+libm w.r.t. aarch64.
If possible, please provide some insight, otherwise kindly redirect me to the concerned person/forum.
1. It seems from the community patches that ARM/Linaro is optimizing glibc functions such as memcpy/memmove and the string functions for aarch64.
However, it looks like some of these patches (e.g. memcpy/memmove) are still not merged in glibc. Any comment on their availability in glibc?
e.g. https://www.sourceware.org/ml/libc-alpha/2015-12/msg00341.html
This is mainly due to a lack of review. For optimization patches, the arch
maintainer usually has the final say. It is now too late for 2.23, but we
will focus on making it available for 2.24.

Besides this memcpy patch, some other string functions (memchr) and some
generic ones (strpbrk, etc.) are also stalled, either for lack of review
or for lack of follow-up on review comments.
Post by Virendra Kumar Pathak
2. On the same note, is there any plan for optimizing/tuning libm functions (e.g. trigonometric) for aarch64?
I could not find any matching patches on the review board. Please correct me if I am wrong.
No one has posted any patches or started a discussion about it. The complex
functions in libm are usually coded in C to be platform neutral, with
some specific functions being optimized (rounding, etc.). x86_64 also has
assembly implementations for some specific routines (exp, log, ...),
but I do not have numbers on how fast they are relative to their C
counterparts (it might also be the case that the speedup is not high
enough to justify keeping the assembly).

The current rule of thumb in GLIBC is to avoid arch-specific assembly
routines as far as possible and to work with platform-neutral C
implementations, with possible arch hooks on performance-sensitive paths
(check Siddhesh's recent sincos performance improvements).
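
As an illustration of a platform-neutral C routine with an arch hook, here is a
minimal sketch of a rounding function. It is not glibc code: the name
round_to_integral is hypothetical, the generic path assumes the default
round-to-nearest mode and plain double arithmetic (no excess precision, no
-ffast-math), and the AArch64 branch assumes a GCC-compatible compiler.

```c
/* Round x to an integral value (ties to even in the default rounding mode).
   Sketch only: round_to_integral is a hypothetical name, not a glibc symbol.  */
double round_to_integral (double x)
{
#if defined (__aarch64__) && defined (__GNUC__)
  /* Arch hook: one AArch64 instruction does the whole job.  */
  double r;
  __asm__ ("frintx %d0, %d1" : "=w" (r) : "w" (x));
  return r;
#else
  /* Platform-neutral path: adding and then subtracting 2^52 discards the
     fraction bits.  Assumes round-to-nearest and no excess precision.  */
  const double two52 = 4503599627370496.0;
  double ax = x < 0.0 ? -x : x;
  if (!(ax < two52))
    return x;                   /* already integral, infinite, or NaN */
  volatile double t = ax + two52;   /* volatile keeps the add from being folded */
  double r = t - two52;
  return x < 0.0 ? -r : r;
#endif
}
```

The point of this layout is that every port gets a correct result from the C
path, and only the ports that care add a hook.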

For very performance-critical paths we also have the option of adding
specific builds with more aggressive optimization flags, selected through
IFUNC (for instance one for A57 and another for A72, if that turns out
to be the case).
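
The selection idea can be sketched in portable C as follows. This only
imitates IFUNC with a function pointer that is resolved once; a real IFUNC is
resolved by the dynamic loader (GCC exposes it through the ifunc function
attribute). All names here (poly, resolve_poly, detect_cpu, cpu_kind) are
hypothetical, and detect_cpu is a stub.

```c
/* Portable imitation of IFUNC-style dispatch: pick one variant once, then
   call through a function pointer.  All names are hypothetical.  */

enum cpu_kind { CPU_GENERIC, CPU_A57, CPU_A72 };

/* Stand-ins for one C source built several times with different tuning
   flags.  The bodies are identical on purpose; only the build would differ.  */
static double poly_generic (double x) { return (2.0 * x + 3.0) * x + 1.0; }
static double poly_a57 (double x)     { return (2.0 * x + 3.0) * x + 1.0; }
static double poly_a72 (double x)     { return (2.0 * x + 3.0) * x + 1.0; }

typedef double (*poly_fn) (double);

/* Plays the role of the IFUNC resolver.  */
static poly_fn resolve_poly (enum cpu_kind cpu)
{
  switch (cpu)
    {
    case CPU_A57: return poly_a57;
    case CPU_A72: return poly_a72;
    default:      return poly_generic;
    }
}

/* Stub: a real resolver would inspect hwcap bits or CPU id registers.  */
static enum cpu_kind detect_cpu (void) { return CPU_GENERIC; }

static poly_fn poly_impl;       /* resolved on first use; not thread-safe */

double poly (double x)
{
  if (poly_impl == 0)
    poly_impl = resolve_poly (detect_cpu ());
  return poly_impl (x);
}
```

Callers only ever see poly; which variant runs is decided once, so the cost
after the first call is an indirect call.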

If none of these options is the best way to improve performance,
platform-specific implementations are still a good option (libmvec is
currently basically a lot of x86_64 assembly implementations).
Post by Virendra Kumar Pathak
3. It looks like ARM has released an independent version of libm for certain trigonometric functions:
https://github.com/ARM-software/optimized-routines
Is there any plan for these optimizations to go into glibc's libm? Any comment on their performance improvement over GNU libm?
Regarding licensing I do not foresee any issues: GLIBC is LGPL 2.1 or later,
so it may be combined with code from an LGPL version 3 library, with the
combined work as a whole falling under the terms of the GPLv3 [1] (Apache 2.0,
the license ARM used in this project, is compatible with LGPL 3.0). I am far
from being a license lawyer, so someone please correct me if I am wrong.

Now on the technical side, I think it is feasible; however, it will require
a lot of work to adapt these functions to fit the GLIBC project.

The first thing is the requirements: GLIBC currently requires GCC 4.7 as the
minimum compiler, whereas the optimized-routines project itself requires 4.8.
I noted that mpfr and mpc are used exclusively in its testing framework.

The second thing is to add these implementations for ARM/AArch64 with the
correct names and infrastructure. The downside is that it will make
ARM/AArch64 deviate from the other ports, requiring further maintenance
because of the different optimizations.

Another thing is to check the implementations against GLIBC's own test cases,
which add some tests regarding exceptions, rounding, etc. Any deviation will
require a fix and/or a bug report.

Finally, GLIBC developers will certainly ask for either improvements to the
benchmark testsuite or numbers showing that these implementations are better
than the current ones. It will also require some precision/speed analysis.
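
For the precision side, errors in libm results are usually expressed in ULPs
(units in the last place). Below is a minimal sketch of one common way to
count the ULP distance between a computed result and a reference value. The
names ordered_bits and ulp_distance are hypothetical, this is not the glibc
test harness, and it assumes IEEE 754 binary64 doubles and non-NaN inputs.

```c
#include <stdint.h>
#include <string.h>

/* Map a double to an integer that increases monotonically with its value,
   so that adjacent doubles map to adjacent integers.  */
static int64_t ordered_bits (double x)
{
  int64_t i;
  memcpy (&i, &x, sizeof i);            /* reinterpret the bit pattern */
  return i < 0 ? INT64_MIN - i : i;     /* fold negative values below zero */
}

/* Number of representable doubles between a and b (0 if they are equal,
   1 if they are neighbours).  NaN inputs are not handled.  */
uint64_t ulp_distance (double a, double b)
{
  int64_t ia = ordered_bits (a);
  int64_t ib = ordered_bits (b);
  /* Subtract as unsigned so that widely separated values cannot overflow.  */
  return ia >= ib ? (uint64_t) ia - (uint64_t) ib
                  : (uint64_t) ib - (uint64_t) ia;
}
```

A comparison run would call ulp_distance (candidate (x), reference (x)) over
many inputs and report the maximum, where the reference comes from a
higher-precision computation.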

[1] http://www.gnu.org/licenses/license-list.en.html
_______________________________________________
linaro-toolchain mailing list
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
Siddhesh Poyarekar
2016-01-19 05:49:49 UTC
On 19 January 2016 at 00:06, Adhemerval Zanella
Post by Adhemerval Zanella
No one has posted any patches or started a discussion about it. The complex
functions in libm are usually coded in C to be platform neutral, with
some specific functions being optimized (rounding, etc.). x86_64 also has
assembly implementations for some specific routines (exp, log, ...),
but I do not have numbers on how fast they are relative to their C
counterparts (it might also be the case that the speedup is not high
enough to justify keeping the assembly).
A correction here: i686 has a lot of assembly math implementations,
x86_64 doesn't. The last x86_64 asm implementation was sincos which
was removed because it was not accurate enough for our project goals.
The i686 asm versions (and for other archs, I think alpha and m68k)
are there because nobody cares enough about their precision. The i686
functions for example are known to not be precise for the entire input
domain.
Post by Adhemerval Zanella
The current rule of thumb in GLIBC is to avoid arch-specific assembly
routines as far as possible and to work with platform-neutral C
implementations, with possible arch hooks on performance-sensitive paths
(check Siddhesh's recent sincos performance improvements).
The general rule here is to more or less guarantee that the algorithm
does not lose precision regardless of the language it is written in.
However if you want the community also to support it actively, writing
it in C is your best bet.
Post by Adhemerval Zanella
For very performance-critical paths we also have the option of adding
specific builds with more aggressive optimization flags, selected through
IFUNC (for instance one for A57 and another for A72, if that turns out
to be the case).
This is the cheapest way to squeeze out some performance, provided
that the compiler is tuned correctly. This is in fact what we do in
x86_64 with ifunc implementations for avx, sse2 and fma4.

Siddhesh
Adhemerval Zanella
2016-01-19 12:34:22 UTC
Post by Siddhesh Poyarekar
On 19 January 2016 at 00:06, Adhemerval Zanella
Post by Adhemerval Zanella
No one has posted any patches or started a discussion about it. The complex
functions in libm are usually coded in C to be platform neutral, with
some specific functions being optimized (rounding, etc.). x86_64 also has
assembly implementations for some specific routines (exp, log, ...),
but I do not have numbers on how fast they are relative to their C
counterparts (it might also be the case that the speedup is not high
enough to justify keeping the assembly).
A correction here: i686 has a lot of assembly math implementations,
x86_64 doesn't. The last x86_64 asm implementation was sincos which
was removed because it was not accurate enough for our project goals.
The i686 asm versions (and for other archs, I think alpha and m68k)
are there because nobody cares enough about their precision. The i686
functions for example are known to not be precise for the entire input
domain.
I do see some x86_64 specialized implementations being used currently
(sysdeps/x86_64/fpu/s_{sin,cos}f.S for instance). The sincos implementation
is still used (sysdeps/x86_64/fpu/s_sincosf.S).

What you are referring to as dropped by glibc is the use of the
fsin/fcos/fsincos Intel instructions, which show a ridiculous error
range depending on the inputs [1].

[1] https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/
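
A small sketch of why inputs near pi are so sensitive, assuming the error
comes from argument reduction with too few bits of pi, which is how the
article in [1] explains the fsin behaviour. The function names are
hypothetical and the constants are the usual two-part split of pi into
doubles; this is not how fsin or glibc reduce arguments in detail.

```c
/* pi split into two doubles: PI_HI is pi rounded to double precision,
   PI_LO is the rounded remainder pi - PI_HI.  */
static const double PI_HI = 3.141592653589793;
static const double PI_LO = 1.2246467991473532e-16;

/* Reduction with a single-double pi: for x close to pi the subtraction
   cancels every significant bit and the true remainder is lost.  */
double reduce_naive (double x)
{
  return x - PI_HI;
}

/* Reduction with the extra bits of pi: the low part restores the
   remainder that the naive version drops.  */
double reduce_two_part (double x)
{
  return (x - PI_HI) - PI_LO;
}
```

Since sin(x) is approximately -(x - pi) for x near pi, the naive reduction
gives 0 for x = PI_HI while the two-part reduction gives about 1.22e-16, so
the relative error of the naive result is 100%.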
Siddhesh Poyarekar
2016-01-19 12:52:11 UTC
On 19 January 2016 at 18:04, Adhemerval Zanella
Post by Adhemerval Zanella
I do see some x86_64 specialized implementations being used currently
(sysdeps/x86_64/fpu/s_{sin,cos}f.S for instance). The sincos implementation
is still used (sysdeps/x86_64/fpu/s_sincosf.S).
What you are referring to as dropped by glibc is the use of the
fsin/fcos/fsincos Intel instructions, which show a ridiculous error
range depending on the inputs [1].
The sincos implementation for x86_64 is the generic one; it is the
sincosf (single float) that has an assembly implementation. However
you're right otherwise; I had overlooked everything but the ieee754
double implementations of the transcendentals.

Siddhesh