Discussion:
Some questions about the gcc __sync intrinsics
Edward Nevill
2016-03-09 13:02:26 UTC
Permalink
Hi,

I have been comparing the stock gcc 5.2 and the Linaro 5.2 (Linaro GCC
5.2-2015.11-1) and have noticed a difference with the __sync
intrinsics.

Here is the simple test case

--- cut here ---
int add_int(int add_value, int *dest)
{
return __sync_add_and_fetch(dest, add_value);
}
--- cut here ---

Compiling with the stock gcc 5.2 (-S -O3) I get

---------
add_int:
.L2:
ldaxr w2, [x1]
add w2, w2, w0
stlxr w3, w2, [x1]
cbnz w3, .L2
mov w0, w2
ret
---------

Wheras with Linaro gcc 5.2 I get

---------
add_int:
.L2:
ldxr w2, [x1]
add w2, w2, w0
stlxr w3, w2, [x1]
cbnz w3, .L2
dmb ish
mov w0, w2
ret
---------

Why the extra (unnecessary?) memory barrier?

Also, is it worthwhile putting a prfm before the ldaxr. EG

add_int:
prfm pst1strm, [x1]
.L2:
ldaxr w2, [x1]

See the following thread

http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html

All the best,
Ed
Yvan Roux
2016-03-09 13:22:48 UTC
Permalink
Hi Ed,
Post by Edward Nevill
Hi,
I have been comparing the stock gcc 5.2 and the Linaro 5.2 (Linaro GCC
5.2-2015.11-1) and have noticed a difference with the __sync
intrinsics.
Here is the simple test case
--- cut here ---
int add_int(int add_value, int *dest)
{
return __sync_add_and_fetch(dest, add_value);
}
--- cut here ---
Compiling with the stock gcc 5.2 (-S -O3) I get
---------
ldaxr w2, [x1]
add w2, w2, w0
stlxr w3, w2, [x1]
cbnz w3, .L2
mov w0, w2
ret
---------
Wheras with Linaro gcc 5.2 I get
---------
ldxr w2, [x1]
add w2, w2, w0
stlxr w3, w2, [x1]
cbnz w3, .L2
dmb ish
mov w0, w2
ret
---------
Why the extra (unnecessary?) memory barrier?
This is because Linaro gcc-5-branch is in sync with FSF gcc-5-branch
which contains a fix for this PR :
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

As explained in the bugzilla and the patch submission the restriction
are stonger on __sync builtins than on __atomic ones.

https://gcc.gnu.org/ml/gcc-patches/2015-05/msg01989.html
Post by Edward Nevill
Also, is it worthwhile putting a prfm before the ldaxr. EG
prfm pst1strm, [x1]
ldaxr w2, [x1]
See the following thread
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html
All the best,
Ed
_______________________________________________
linaro-toolchain mailing list
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
Edward Nevill
2016-03-10 11:28:36 UTC
Permalink
Hi Yvan,
Post by Yvan Roux
Hi Ed,
Post by Edward Nevill
Hi,
Why the extra (unnecessary?) memory barrier?
This is because Linaro gcc-5-branch is in sync with FSF gcc-5-branch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
As explained in the bugzilla and the patch submission the restriction
are stonger on __sync builtins than on __atomic ones.
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg01989.html
Thanks for that, obviously we should be using the __atomic versions in
Java as I dont believe Java requires a memory barrier.

One curiosity, I recompiled for 8.1 LSE using the Linaro gcc 5.2 and I
got the following

/usr/local/linare-gcc-5.2/bin/gcc -S -O3 -march=armv8-a+lse test.c

add_int:
ldaddal w0, w0, [x1]
add w2, w0, w0
mov w0, w2
ret

Why no memory barrier here? As far as I am aware the ldaddal has only
acquire and release semantics, it does not implement a full barrier?

All the best,
Ed.
Edward Nevill
2016-03-10 18:42:07 UTC
Permalink
Post by Edward Nevill
/usr/local/linare-gcc-5.2/bin/gcc -S -O3 -march=armv8-a+lse test.c
ldaddal w0, w0, [x1]
add w2, w0, w0
mov w0, w2
ret
Am I going mad, or does this just return the contents of the memory
location * 2.

ldaddal w0, w0, [x1]

Returns the original contents of [x1] in w0.

add w2, w0, w0

doubles it.

mov w0, w2

returns it.

I think __sync_add_and_fetch should return the updated contents.

Here is the test C code again.

int add_int(int add_value, int *dest)
{
return __sync_add_and_fetch(dest, add_value);
}

Regards,
Ed.
Yvan Roux
2016-03-10 18:52:15 UTC
Permalink
Post by Edward Nevill
Post by Edward Nevill
/usr/local/linare-gcc-5.2/bin/gcc -S -O3 -march=armv8-a+lse test.c
ldaddal w0, w0, [x1]
add w2, w0, w0
mov w0, w2
ret
Am I going mad, or does this just return the contents of the memory
location * 2.
ldaddal w0, w0, [x1]
Returns the original contents of [x1] in w0.
add w2, w0, w0
doubles it.
mov w0, w2
returns it.
I think __sync_add_and_fetch should return the updated contents.
Here is the test C code again.
int add_int(int add_value, int *dest)
{
return __sync_add_and_fetch(dest, add_value);
}
Hmm, it is not the code I get with our latest release candidate (the
release should be out next week), it gives the code:

add_int:
ldaddal w0, w3, [x1]
add w2, w3, w0
mov w0, w2
ret
Post by Edward Nevill
Regards,
Ed.
Yvan Roux
2016-03-10 19:59:40 UTC
Permalink
Post by Yvan Roux
Post by Edward Nevill
Post by Edward Nevill
/usr/local/linare-gcc-5.2/bin/gcc -S -O3 -march=armv8-a+lse test.c
ldaddal w0, w0, [x1]
add w2, w0, w0
mov w0, w2
ret
Am I going mad, or does this just return the contents of the memory
location * 2.
ldaddal w0, w0, [x1]
Returns the original contents of [x1] in w0.
add w2, w0, w0
doubles it.
mov w0, w2
returns it.
I think __sync_add_and_fetch should return the updated contents.
Here is the test C code again.
int add_int(int add_value, int *dest)
{
return __sync_add_and_fetch(dest, add_value);
}
Hmm, it is not the code I get with our latest release candidate (the
ldaddal w0, w3, [x1]
add w2, w3, w0
mov w0, w2
ret
Ed, just for info it was fixed by Andrew in trunk and backported in
our branch end of December.

https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01962.html
Post by Yvan Roux
Post by Edward Nevill
Regards,
Ed.
Pinski, Andrew
2016-03-09 20:10:03 UTC
Permalink
Post by Edward Nevill
Also, is it worthwhile putting a prfm before the ldaxr. EG
There is already a thread upstream about this:
https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00508.html

I reject adding this to -mcpu=generic as it will hurt ThunderX more than it will help. Prfm is single issue for ThunderX so it causes an extra cycle for each case.
For the kernel, the pfrm really should be patched out for ThunderX; I will propose a patch for that later on.
The way ThunderX implements ldxr/stxr is much simpler than say Cortex-a57/a72 because the inner and outer domains are the same aka there is will only be one coherent point, the L2. It is also why ThunderX exposes so many race conditions (well and the timeout for write going to the coherent point is around 1024 cycles if there was no flush).

Thanks,
Andrew Pinski

-----Original Message-----
From: linaro-toolchain [mailto:linaro-toolchain-***@lists.linaro.org] On Behalf Of Edward Nevill
Sent: Wednesday, March 9, 2016 5:02 AM
To: Linaro Toolchain Mailman List <linaro-***@lists.linaro.org>
Subject: Some questions about the gcc __sync intrinsics

Hi,

I have been comparing the stock gcc 5.2 and the Linaro 5.2 (Linaro GCC
5.2-2015.11-1) and have noticed a difference with the __sync intrinsics.

Here is the simple test case

--- cut here ---
int add_int(int add_value, int *dest)
{
return __sync_add_and_fetch(dest, add_value); }
--- cut here ---

Compiling with the stock gcc 5.2 (-S -O3) I get

---------
add_int:
.L2:
ldaxr w2, [x1]
add w2, w2, w0
stlxr w3, w2, [x1]
cbnz w3, .L2
mov w0, w2
ret
---------

Wheras with Linaro gcc 5.2 I get

---------
add_int:
.L2:
ldxr w2, [x1]
add w2, w2, w0
stlxr w3, w2, [x1]
cbnz w3, .L2
dmb ish
mov w0, w2
ret
---------

Why the extra (unnecessary?) memory barrier?

Also, is it worthwhile putting a prfm before the ldaxr. EG

add_int:
prfm pst1strm, [x1]
.L2:
ldaxr w2, [x1]

See the following thread

http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html

All the best,
Ed
_______________________________________________
linaro-toolchain mailing list
linaro-***@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
Loading...