[ACTIVITY] Week 25

Yvan Roux

2016-06-27 08:39:38 UTC

== Progress ==
o Extended Validation (1/10)
- Benchmarking job babysitting.

o Upstream GCC (4/10)
- ARMv8.1 libatomic: Analysis completed.
No gain expected in implementing an Ifunc'ed version of the library.
- Working on a potential fix for the __sync builtins.

o Misc (5/10)
* Various meetings and discussions.
* Internal appraisal

== Plan ==
o Continue on-going tasks (__sync, benchmarking)

Yvan Roux

2016-06-27 19:04:52 UTC

Hi Andrew,

On 27 June 2016 at 19:32, Pinski, Andrew <***@cavium.com> wrote:
>> No gain expected in implementing an Ifunc'ed version of the library.
>
> How did you prove that? What hardware did you run this on to prove it?
> Also have you thought at least doing an ifunc version for 128bit atomics?

Up to 64 bits, the calls to the libatomic routines are inlined, and the
ARMv8.1 CAS and load-operate instructions are used when the application
is built for the ARMv8.1 architecture. For 128 bits, a call to the
library is made, which uses the same LL/SC implementation with or
without LSE support, as the CAS and load-operate instructions don't
support this data size.
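
As a rough illustration of the two cases above (a sketch, not code taken
from libatomic): the helpers below use the GCC __atomic builtins on
8-byte data and query __atomic_always_lock_free to see which sizes the
compiler promises to handle with lock-free instructions. The names
counter_add, swap_if_equal, inline_8 and inline_16 are made up for this
example. On AArch64, compiling with something like
"gcc -O2 -march=armv8.1-a -S" should let you inspect the generated
instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared counter, used only for this illustration. */
static uint64_t counter;

/* 8-byte atomic add; returns the value held before the addition.
   When built for ARMv8.1, the compiler is expected to expand this
   inline with an LSE load-operate instruction rather than an LL/SC
   loop (check the assembly for your compiler version). */
uint64_t counter_add(uint64_t n)
{
    return __atomic_fetch_add(&counter, n, __ATOMIC_SEQ_CST);
}

/* 8-byte strong compare-and-swap; returns 1 if *p held `expected`
   and was updated to `desired`, 0 otherwise. */
int swap_if_equal(uint64_t *p, uint64_t expected, uint64_t desired)
{
    assert(p != NULL);
    return __atomic_compare_exchange_n(p, &expected, desired, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Compile-time queries: does the target always have lock-free
   instructions for this size? The size must be a constant, hence one
   helper per size. The 16-byte answer depends on target and flags. */
int inline_8(void)  { return __atomic_always_lock_free(8, 0); }
int inline_16(void) { return __atomic_always_lock_free(16, 0); }
```

If inline_16() returns 0, a 16-byte __atomic operation may be compiled
as a call into libatomic, which is the case discussed above.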

I don't have ARMv8.1 hardware, so I did the analysis on the generated
assembly. Do you have a use case on your side where an ifunc version
would be useful? I'm not aware of an algorithm that can effectively
replace the LL/SC implementation with a narrower CAS; do you have any
pointers? Maybe CASP can be used in some cases, I'll investigate it.

Thanks
Yvan

> Thanks,
> Andrew

Ramana Radhakrishnan

2016-06-29 15:03:31 UTC

I'm curious about what workloads / benchmarks you considered for this activity. The traditional SPEC benchmarks don't really trigger anything in libatomic, so where do we see the improvements, or the lack of them?

regards
Ramana

Yvan Roux

2016-06-30 11:47:03 UTC

Hi Ramana,

On 29 June 2016 at 17:03, Ramana Radhakrishnan
<***@arm.com> wrote:
> I'm curious about what workloads / benchmarks you considered for this activity. The traditional SPEC benchmarks don't really trigger anything in libatomic, so where do we see the improvements, or the lack of them?

First, let me clarify the purpose of this task, which was to evaluate
and implement ARMv8.1 support in libatomic, not to evaluate the
performance of the ARMv8.1 architecture. Sorry if that wasn't clear in
this short weekly format.

Given this objective, I didn't consider benchmarking for this activity.
My plan was to:

1. Verify the support of the new ARMv8.1 atomic instructions in the
__atomic builtins.
2. Familiarize myself with the libatomic code base and build system,
and check that the builtins are used.
3. Enable and implement the ifunc version of the library if needed.

My observations and conclusions are:

1. The __atomic builtins already fully support the new atomic
instructions, and generate cas, swp and ld<op> as needed on data types
up to 8 bytes.
2. libatomic uses the atomic builtins properly, so building the library
for the ARMv8.1 architecture, or enabling multilib on AArch64, generates
a libatomic which contains the expected code.
3. I don't see any benefit in implementing an ifunc version of the
library which would decide at runtime whether to use the LSE version:
the versions up to 8 bytes are expanded inline at compile time, and
the 16-byte versions are the same with or without LSE support. Maybe I
am missing some use case or lack some background on libatomic usage
here, and I'd be happy if you could give me some input.
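
To make point 3 concrete, here is a minimal sketch of the kind of
runtime selection an ifunc'ed library would perform. It uses a plain
function pointer instead of the real ifunc attribute so that it stays
portable. The names fetch_add_llsc, fetch_add_lse, cpu_has_lse and
resolve_fetch_add are hypothetical, and both variants share the same
body here; in a real multi-version library each would be built from a
separate file with different -march flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>                 /* getauxval, AT_HWCAP */
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1 << 8)        /* LSE bit in the Linux arm64 hwcaps */
#endif
#endif

typedef uint64_t (*fetch_add_fn)(uint64_t *, uint64_t);

/* Stand-in for the variant built without LSE (LL/SC loop). */
static uint64_t fetch_add_llsc(uint64_t *p, uint64_t n)
{
    return __atomic_fetch_add(p, n, __ATOMIC_SEQ_CST);
}

/* Stand-in for the variant built with LSE enabled. */
static uint64_t fetch_add_lse(uint64_t *p, uint64_t n)
{
    return __atomic_fetch_add(p, n, __ATOMIC_SEQ_CST);
}

/* Returns 1 if the kernel reports LSE atomics; 0 on any other
   platform (assumption: only AArch64 Linux is probed in this sketch). */
static int cpu_has_lse(void)
{
#if defined(__aarch64__) && defined(__linux__)
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    return 0;
#endif
}

/* Plays the role of an ifunc resolver: picks one variant, once. */
fetch_add_fn resolve_fetch_add(void)
{
    fetch_add_fn chosen = cpu_has_lse() ? fetch_add_lse : fetch_add_llsc;
    assert(chosen != NULL);
    return chosen;
}
```

A caller would resolve once and reuse the pointer, for example
"fetch_add_fn f = resolve_fetch_add(); old = f(&x, 1);". With a real
ifunc, the dynamic linker does this binding at load time. If both
variants end up containing the same code, as observed for the 16-byte
routines, the selection buys nothing.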

Regarding the 16-byte version, as I said, I recently saw that LSE
contains a CASP instruction, which might be used to implement a
128-bit compare-exchange builtin. But if I understand the discussion
in this bugzilla correctly, it might be better to wait for a new
version of the architecture which contains a proper 128-bit
instruction.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814

Thanks
Yvan