Discussion:
gcc-linaro-5.1 vs gcc-linaro-4.8
Xiaofeng Ren
2016-01-05 10:29:11 UTC
Hello All,
I found a difference between gcc-linaro-5.1 and gcc-linaro-4.8 while running the lmbench benchmark on our LS1043 (Cortex-A53).
With gcc-linaro-4.8, gcc generates Advanced SIMD instructions (such as ld1), whereas gcc-linaro-5.1 does not. This will cause a big performance gap between gcc-4.8 and gcc-5.1 in the lmbench memory bandwidth "fcp" test (bw_mem program).

My compiler flags are "-O3 -mcpu=cortex-a53". I also tried several other flag combinations ("-O3 -mcpu=cortex-a53+fp+simd", "-O2 -ftree-vectorize -mcpu=cortex-a53", "-O3 -ftree-vectorize -mcpu=cortex-a53"), but none of them changed the result.

The gcc-5.1 toolchain was downloaded from the following link:

https://snapshots.linaro.org/openembedded/sources/gcc-linaro-5.1-snapshot-2015.06-1-x86_64_aarch64-linux-gnu.tar.xz

Can I have your comments on this?


Thanks
Ron
Bernie Ogden
2016-01-05 10:36:22 UTC
Hello,

I'm not sure from your message whether you have observed a
performance gap, or are expecting to observe one. Have you seen a
performance gap?

Regards,

Bernie
_______________________________________________
linaro-toolchain mailing list
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
Xiaofeng Ren
2016-01-05 10:42:59 UTC
Hello Bernie,
Thanks for your quick response.

Yes, I observed a performance gap. Here is the data I measured on our LS1043A platform:

fcp for L1 cache with gcc-4.8: 5196.12 MB/s
fcp for L1 cache with gcc-5.1: 2983.11 MB/s

Here is part of the assembly code for the fcp function:

gcc-5.1:
40110c: 3dc00c6c ldr q12, [x3,#48]
401110: 3dc0106b ldr q11, [x3,#64]
401114: 3dc0146a ldr q10, [x3,#80]
401118: 3dc01869 ldr q9, [x3,#96]
40111c: 3dc01c68 ldr q8, [x3,#112]
401120: 3dc0207f ldr q31, [x3,#128]
401124: 3dc0247e ldr q30, [x3,#144]
401128: 3dc0287d ldr q29, [x3,#160]
40112c: 3dc02c7c ldr q28, [x3,#176]
401130: 3dc0307b ldr q27, [x3,#192]
401134: 3dc0347a ldr q26, [x3,#208]
401138: 3dc03879 ldr q25, [x3,#224]
40113c: 3dc03c78 ldr q24, [x3,#240]
401140: 3dc04077 ldr q23, [x3,#256]
401144: 3dc04476 ldr q22, [x3,#272]
401148: 3dc04875 ldr q21, [x3,#288]
40114c: 3dc04c74 ldr q20, [x3,#304]
401150: 3dc05073 ldr q19, [x3,#320]
401154: 3dc05472 ldr q18, [x3,#336]
401158: 3dc05871 ldr q17, [x3,#352]
40115c: 3dc05c70 ldr q16, [x3,#368]
401160: 3dc06067 ldr q7, [x3,#384]
401164: 3dc06466 ldr q6, [x3,#400]
401168: 3dc06865 ldr q5, [x3,#416]
40116c: 3dc06c64 ldr q4, [x3,#432]
401170: 3dc07063 ldr q3, [x3,#448]
401174: 3dc07462 ldr q2, [x3,#464]
401178: 3dc07861 ldr q1, [x3,#480]
40117c: 3dc07c60 ldr q0, [x3,#496]
401180: 3dc0006f ldr q15, [x3]
401184: 91080063 add x3, x3, #0x200

gcc-4.8:
40135c: 4cdf78af ld1 {v15.4s}, [x5], #16
401360: 4c40790d ld1 {v13.4s}, [x8]
401364: 4c4078ae ld1 {v14.4s}, [x5]
401368: 9100c048 add x8, x2, #0x30
40136c: 91010045 add x5, x2, #0x40
401370: 4c40790c ld1 {v12.4s}, [x8]
401374: 4c4078ab ld1 {v11.4s}, [x5]
401378: 91014048 add x8, x2, #0x50
40137c: 91018045 add x5, x2, #0x60
401380: 4c40790a ld1 {v10.4s}, [x8]
401384: 4c4078a9 ld1 {v9.4s}, [x5]
401388: 9101c048 add x8, x2, #0x70
40138c: 91020045 add x5, x2, #0x80
401390: 4c407908 ld1 {v8.4s}, [x8]
401394: 4c4078bf ld1 {v31.4s}, [x5]
401398: 91024048 add x8, x2, #0x90
40139c: 91028045 add x5, x2, #0xa0
4013a0: 4c40791e ld1 {v30.4s}, [x8]
4013a4: 4c4078bd ld1 {v29.4s}, [x5]
4013a8: 9102c048 add x8, x2, #0xb0
4013ac: 91030045 add x5, x2, #0xc0


Best Regards
Ron

Kugan
2016-01-05 10:50:46 UTC
Hi Ron,
Is it possible to create a compilable testcase with "fcp" so that we can
reproduce the above? It need not be an executable test-case.
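
A compile-only test case along the following lines might be enough to compare the two compilers. This is a minimal sketch, not the lmbench source: the function name fcp_like, the int element type, the __restrict qualifiers, and the 128-element unroll are all assumptions. The unroll factor is chosen so that one pass copies 512 bytes, matching the "add x3, x3, #0x200" step in the gcc-5.1 listing.

```c
#include <stddef.h>

/* Hypothetical stand-in for the lmbench "fcp" kernel; NOT the lmbench source. */
#define COPY4(i)   dst[(i)] = src[(i)]; dst[(i) + 1] = src[(i) + 1]; \
                   dst[(i) + 2] = src[(i) + 2]; dst[(i) + 3] = src[(i) + 3];
#define COPY16(i)  COPY4(i) COPY4((i) + 4) COPY4((i) + 8) COPY4((i) + 12)
#define COPY64(i)  COPY16(i) COPY16((i) + 16) COPY16((i) + 32) COPY16((i) + 48)

/*
 * Copies n ints from src to dst, 128 ints (512 bytes) per iteration.
 * Elements past the last full 128-int block are not copied.
 * __restrict is the GCC spelling of restrict; it is an assumption added
 * here so the compiler may move loads ahead of stores.
 */
void fcp_like(int *__restrict dst, const int *__restrict src, size_t n)
{
    const int *end = src + (n - n % 128);

    while (src < end) {
        COPY64(0)
        COPY64(64)
        src += 128;
        dst += 128;
    }
}
```

Compiling this with each toolchain, for example with "-O3 -mcpu=cortex-a53 -S", gives assembly output that can be compared side by side. Whether it shows the same difference as the real bw_mem code is not guaranteed.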

Thanks,
Kugan
Xiaofeng Ren
2016-01-05 13:52:30 UTC
Hello Kugan,
Thanks a lot for your support.

I have attached the source code and the corresponding assembly generated by gcc-4.8 and gcc-5.1. The compiler flag is "-O3".


Best Regards
Ron

Jim Wilson
2016-01-05 23:48:32 UTC
Post by Xiaofeng Ren
40110c: 3dc00c6c ldr q12, [x3,#48]
40135c: 4cdf78af ld1 {v15.4s}, [x5], #16
The ld1 and ldr instructions are effectively equivalent: both load
16-byte values into fp/simd registers.

I see a difference in the scheduling though. The gcc-4.8 output has a
series of shift/add/store instructions while the gcc-5.1 output has a
series of shift instructions followed by a series of store
instructions. The gcc-5.1 output will serialize the code as these are
simd shifts which can only execute one at a time, and stores can only
execute one at a time. I see that gcc-4.8 has no cortex-a53 pipeline
description, so we appear to be getting good code by accident. The
gcc-5.1 has a cortex a53 scheduler, but it doesn't handle simd
instructions, so it isn't scheduling them correctly. I see that there
was a change added in November
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00025.html
that adds a new a53 pipeline description, and this one does handle
simd instructions. With current sources, I see some shifts,
alternating shifts and stores, and then the last of the stores. This
should give better performance than the gcc-5.1 code. I haven't tried
testing it on hardware.
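
To illustrate the first point: a 16-byte copy written with GCC's generic vector extension is a single source-level operation that an AArch64 compiler may lower to either ldr q or ld1. This is only a sketch; the type name v4si and the function copy_v4si are hypothetical, and which instruction is emitted depends on the compiler version and flags.

```c
#include <stddef.h>

/* Four 32-bit ints in one 16-byte vector (GCC/Clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Each iteration is one 16-byte load and one 16-byte store. */
void copy_v4si(v4si *dst, const v4si *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

The visible difference in the two listings is addressing: ldr q takes an immediate offset from one base register, while this form of ld1 takes only a base register, so the gcc-4.8 code interleaves add instructions to form each address.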

Jim
Xiaofeng Ren
2016-01-06 00:19:09 UTC
Hello Jim,
I appreciate your comments.
I will manually apply that patch on my side and test it.
BTW, may I know which released Linaro gcc version includes that patch? Maybe I can download it and try it quickly.
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00025.html


Best Regards
Ron

Jim Wilson
2016-01-06 02:45:05 UTC
Post by Xiaofeng Ren
Hello Jim,
I appreciate your comments.
I will manually apply that patch on my side and test it.
BTW, may I know which released Linaro gcc version includes that patch? Maybe I can download it and try it quickly.
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00025.html
It was backported to our gcc-5 branch on Nov 24 by Yvan. This is
after the latest release 2015-11 was made. The patch is in the
December snapshot, but I think that is a source only release.
http://snapshots.linaro.org/components/toolchain/gcc-linaro/5.3-2015.12/
You would have to build your own toolchain from that, perhaps by using abe.

Jim
Xiaofeng Ren
2016-01-06 02:47:32 UTC
Jim,
Thanks a lot for your clarification.


Best Regards
Ron

Rob Savoye
2016-01-06 02:53:18 UTC
Is it in the 2015.11-1 release?
- rob -

Jim Wilson
2016-01-06 03:32:56 UTC
Post by Rob Savoye
Is it in the 2015.11-1 release ?
I found gcc-linaro-5.2-2015.11-1-rc1.tar.xz on the snapshot site. I
don't see the patch in this release either. This release seems to have
gcc sources identical to the 2015-11 release, other than a patch to
change the Linaro version number.

Jim
