Discussion:
SVE routines for cortex-strings
Richard Henderson
2018-05-31 04:08:46 UTC
Permalink
I spoke with Ramana about these at HKG18, and I'm finally getting back to
these. I have routines for

-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S
-rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S
-rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S
-rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S
-rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S
-rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S
-rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S
-rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S
-rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S
-rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S
-rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S

The tests pass when run under Foundation Platform 11.3. What is the best way
to submit these for review and upstreaming? There's nothing in the git README
about an upstream mailing list...

FWIW, my code is at

https://github.com/rth7680/cortex-strings/tree/rth/sve


r~
Siddhesh Poyarekar
2018-05-31 08:12:24 UTC
Permalink
Post by Richard Henderson
I spoke with Ramana about these at HKG18, and I'm finally getting back to
these. I have routines for
-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S
-rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S
-rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S
-rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S
-rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S
-rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S
-rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S
-rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S
-rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S
-rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S
-rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S
The tests pass when run under Foundation Platform 11.3. What is the best way
to submit these for review and upstreaming? There's nothing in the git README
about an upstream mailing list...
FWIW, my code is at
https://github.com/rth7680/cortex-strings/tree/rth/sve
I believe the canonical procedure to submit patches to cortex-strings
is via gerrit: review.linaro.org. If you have the git-review plugin
installed it should be as simple as running the 'git review' command.

Siddhesh
Richard Henderson
2018-05-31 17:37:58 UTC
Permalink
Post by Siddhesh Poyarekar
I believe the canonical procedure to submit patches to cortex-strings
is via gerrit: review.linaro.org. If you have the git-review plugin
installed it should be as simple as running the 'git review' command.
Thanks. I believe I've now successfully done so.
Please let me know if my procedure is off; this is
my first time using gerrit.


r~
Siddhesh Poyarekar
2018-05-31 17:41:18 UTC
Permalink
Post by Richard Henderson
Thanks. I believe I've now successfully done so.
Please let me know if my procedure is off; this is
my first time using gerrit.
You need to add reviewers to your review requests. I'd suggest adding
at least Adhemerval, Christophe and Maxim at the minimum; they may
then suggest more reviewers if necessary.

Siddhesh
Richard Sandiford
2018-06-01 09:32:42 UTC
Permalink
Post by Richard Henderson
I spoke with Ramana about these at HKG18, and I'm finally getting back to
these. I have routines for
-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S
-rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S
-rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S
-rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S
-rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S
-rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S
-rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S
-rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S
-rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S
-rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S
-rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S
The tests pass when run under Foundation Platform 11.3. What is the best way
to submit these for review and upstreaming? There's nothing in the git README
about an upstream mailing list...
FWIW, my code is at
https://github.com/rth7680/cortex-strings/tree/rth/sve
Thanks for doing these. One general comment is that the routines
tend to use the FFR result even in the case where no potential
fault is detected. Although it's not as obvious as it could be
from some of the published documentation, the architecturally-
preferred approach is instead to have the "normal" case depend only
on the flags set by RDFFRS, not on the FFR itself. E.g. the typical
structure would be something like this:

---------------------------------------------------------------------
SETFFR
loop:
...[A] first-faulting and non-faulting loads predicated...
... on some predicate Pg...
RDFFRS Pn.B, Pg/Z
B.NLAST recovery
...[B] code that ignores Pn and operates on all lanes active in Pg...
continue:
...[C] any code that can naturally be shared without using Pn...
...[D] branch back to loop when appropriate...

...

recovery:
SETFFR
...[E] code that operates only on the lanes active in Pn...
B continue
---------------------------------------------------------------------

Also, using INCB, INCH, INCW and INCD is architecturally preferred over
INCP in cases where either could be used. So if the above loop has a
pointer or byte index Xm, and if Pg is all-true, it would be better to do:

---------------------------------------------------------------------
SETFFR
loop:
...[A] first-faulting and non-faulting loads predicated...
... on some predicate Pg...
RDFFRS Pn.B, Pg/Z
B.NLAST recovery
INCB Xm
...[B] code that ignores Pn and operates on all lanes active in Pg...
continue:
...[C] any code that can naturally be shared without using Pn...
...[D] branch back to loop when appropriate...

...

recovery:
SETFFR
INCP Xm, Pn.B
...[E] code that operates only on the lanes active in Pn...
B continue
---------------------------------------------------------------------

If [B] is too complex to copy to [E], an alternative approach is to do
something like this:

---------------------------------------------------------------------
SETFFR
loop:
PTRUE Pg.B or WHILELO Pg.B, Xi, Xj // set Pg for this iteration
...[A] first-faulting and non-faulting loads predicated on Pg...
RDFFRS Pn.B, Pg/Z
B.NLAST recovery
INCB Xm
continue:
...[B] code that ignores Pn and operates on all lanes active in Pg...
...[D] branch back to loop when appropriate...

...

recovery:
SETFFR
INCP Xm, Pn.B
MOV Pg.B, Pn.B
B continue
---------------------------------------------------------------------

The common case then has the extra overhead of resetting Pg in each
iteration, but if [B] is complex (or if Pg is being reset anyway),
then that shouldn't matter.

The idea is that the B.NLAST should be highly predictable,
so it's usually not necessary to wait for the FFR value to become
available. And in practice, getting a precise FFR predicate is likely
to be a slow operation (to the extent that this is an ISA-level
principle rather than a uarch optimisation).

Thanks,
Richard
Richard Sandiford
2018-06-01 16:59:39 UTC
Permalink
Post by Richard Sandiford
Post by Richard Henderson
I spoke with Ramana about these at HKG18, and I'm finally getting back to
these. I have routines for
-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S
-rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S
-rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S
-rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S
-rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S
-rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S
-rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S
-rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S
-rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S
-rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S
-rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S
The tests pass when run under Foundation Platform 11.3. What is the best way
to submit these for review and upstreaming? There's nothing in the git README
about an upstream mailing list...
FWIW, my code is at
https://github.com/rth7680/cortex-strings/tree/rth/sve
Thanks for doing these. One general comment is that the routines
tend to use the FFR result even in the case where no potential
fault is detected. Although it's not as obvious as it could be
from some of the published documentation, the architecturally-
preferred approach is instead to have the "normal" case depend only
on the flags set by RDFFRS, not on the FFR itself.
Is it possible to elaborate on the reasons for that?
In some cases, the usage of ffr are so trivial (e.g. strlen)
that I can use the unpredicated RDFFR instead of the predicated.
Yeah, RDFFRS is preferred even then. The reason is that the FFR
accumulates information across multiple loads (all LDFFs and LDNFs
since the last SETFFR) and so it might not be able to issue until all
those loads have completed. Using the other structure should mean that
the normal path can be issued in a similar way to loops that use
ordinary loads.
Post by Richard Sandiford
Also, using INCB, INCH, INCW and INCD is architecturally preferred over
INCP in cases where either could be used. So if the above loop has a
This is much easier to understand -- that adding a near constant is preferred
to popcount over up to 256-bits. Avoiding INCP for strlen rearranges the loop
in just the sort of ways you say are preferred for RDFFR.
Yeah.
Post by Richard Sandiford
The idea is that the B.NLAST should be highly predictable,
so it's usually not necessary to wait for the FFR value to become
available. And in practice, getting a precise FFR predicate is likely
to be a slow operation (to the extent that this is an ISA-level
principle rather than a uarch optimisation).
Interesting. From the language that I read, which was essentially
"first-faulting loads can fail for any reason", I assumed that it
happened more often than what you're implying here -- that it should
be on the rare side.
That's more a trade-off between giving software a guarantee about the
specific situations in which a potential fault is and isn't flagged
vs. giving implementations the leeway to do something efficient.
It's not really saying how frequent the "potential fault" case is.

Thanks,
Richard
Richard Henderson
2018-06-01 17:24:04 UTC
Permalink
[ How odd. I could find no record that the message to which you are replying
was ever actually sent. I assumed I simply closed the window instead of
hitting send. And thus I just now sent a second reply. ]
Post by Richard Sandiford
Is it possible to elaborate on the reasons for that?
In some cases, the usage of ffr are so trivial (e.g. strlen)
that I can use the unpredicated RDFFR instead of the predicated.
Yeah, RDFFRS is preferred even then. The reason is that the FFR
accumulates information across multiple loads (all LDFFs and LDNFs
since the last SETFFR) and so it might not be able to issue until all
those loads have completed. Using the other structure should mean that
the normal path can be issued in a similar way to loops that use
ordinary loads.
Ok. I'll reorg the routines along these lines.


r~
Peter Maydell
2018-06-04 15:16:53 UTC
Permalink
Post by Richard Henderson
[ How odd. I could find no record that the message to which you are replying
was ever actually sent. I assumed I simply closed the window instead of
hitting send. And thus I just now sent a second reply. ]
Our mailing list server apparently sat on it over the weekend
for some reason:

Received: by lists.linaro.org (Postfix, from userid 109) id 43DCB619F3; Mon,
4 Jun 2018 14:58:52 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on ip-10-142-244-252
X-Spam-Level:
X-Spam-Status: No, score=-1.9 required=5.0
tests=BAYES_00,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL
autolearn=disabled version=3.4.0
Received: from [127.0.0.1] (localhost [127.0.0.1]) by lists.linaro.org
(Postfix) with ESMTP id 1F544617CD; Mon,
4 Jun 2018 14:58:40 +0000 (UTC)
X-Original-To: linaro-***@lists.linaro.org
Delivered-To: linaro-***@lists.linaro.org
Received: by lists.linaro.org (Postfix, from userid 109) id 750F361775; Fri,
1 Jun 2018 15:53:02 +0000 (UTC)
Received: from mail-pl0-f67.google.com (mail-pl0-f67.google.com
[209.85.160.67]) by lists.linaro.org (Postfix) with ESMTPS id
2067E617C0 for <linaro-***@lists.linaro.org>; Fri,
1 Jun 2018 15:53:01 +0000 (UTC)

thanks
-- PMM
Richard Henderson
2018-06-01 17:18:34 UTC
Permalink
Post by Richard Sandiford
Thanks for doing these. One general comment is that the routines
tend to use the FFR result even in the case where no potential
fault is detected. Although it's not as obvious as it could be
from some of the published documentation, the architecturally-
preferred approach is instead to have the "normal" case depend only
on the flags set by RDFFRS, not on the FFR itself.
Clearly it would be interesting to read the microarch docs once they are
available. This is not the result I would have imagined.
Post by Richard Sandiford
RDFFRS Pn.B, Pg/Z
B.NLAST recovery
So the takeaway is that, if the branch is predicted untaken, and we don't use
Pn in the predicted path, then FFR is speculatively unused.
Post by Richard Sandiford
Also, using INCB, INCH, INCW and INCD is architecturally preferred over
INCP in cases where either could be used.
This is much more directly understandable. I guess it's the sort of thing
where the real cycle count for INCP is going to depend on the actual width of
the implementation, whereas INCB is always going to be a simple add.
Post by Richard Sandiford
The idea is that the B.NLAST should be highly predictable,
so it's usually not necessary to wait for the FFR value to become
available. And in practice, getting a precise FFR predicate is likely
to be a slow operation (to the extent that this is an ISA-level
principle rather than a uarch optimisation).
I'm surprised about FFR being quite so slow. And I guess from the language
elsewhere in the manual -- more or less that first-fault reads can fail at any
time for any reason -- I expected them to fail more often than you are implying
that they should.

I suppose if they only fail on tlb misses, and the OS is using 64k pages, then
the failure rate must be below 0.5%.

Thanks. This is the sort of stuff that's missing from the public manual so far.


r~
Richard Henderson
2018-06-04 18:30:45 UTC
Permalink
Yes, using cortex-strings as the bridge is indeed the preferred way. I am
- Adding ifunc variants and add them as default if system advertise SVE
support.
- Or adding as an extra option (as Szabolcs has proposed sometime ago
for LSE), where one would need to build and extra glibc library and
loader will select it if the case.
When bringing them to glibc, I would use ifuncs. SVE is isolatable to a few
routines.

LSE becomes scattered throughout libc, which is why an extra binary was
proposed. Although for LSE it might work just as well to pull the atomic
operations out to helper functions and ifunc those as well. It really all
depends on the relative costs.


r~
Ramana Radhakrishnan
2018-06-08 09:23:25 UTC
Permalink
Post by Richard Henderson
Yes, using cortex-strings as the bridge is indeed the preferred way. I am
- Adding ifunc variants and add them as default if system advertise SVE
support.
- Or adding as an extra option (as Szabolcs has proposed sometime ago
for LSE), where one would need to build and extra glibc library and
loader will select it if the case.
When bringing them to glibc, I would use ifuncs. SVE is isolatable to a few
routines.
LSE becomes scattered throughout libc, which is why an extra binary was
proposed. Although for LSE it might work just as well to pull the atomic
operations out to helper functions and ifunc those as well. It really all
depends on the relative costs.
Amen,.

Ramana
Post by Richard Henderson
r~
_______________________________________________
linaro-toolchain mailing list
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Ramana Radhakrishnan
2018-06-18 09:24:36 UTC
Permalink
Do you plan to adapt them to glibc?
Ramana suggested starting "upstream" in cortex-strings and copying to glibc
from there, as a general work-flow preference. I believe this has been the
case for all of the other aarch64 string routines.
For now, I'm not sure it's useful to copy to glibc until we have hardware
against which we can run make bench -- given that would be the very first thing
any reviewer would ask.
Sorry about getting into this a bit late.

To be honest, I would prefer that we tried to get this into glibc sooner
rather than later.

I don't think the make bench discussion is totally relevant in this
case. We'd like to see what the performance of this stuff is and it's
always goign to be a battle to get things benchmarked perfectly on
hardware and there is unlikely to be one implementation in the Arm world
unlike other architectures. It's under an ifunc and gives folks
something to reason with, if we are really that concerned about it's
performance today lets put it under a tunable that allows us to turn it
off easily ?



regards
Ramana
r~
_______________________________________________
linaro-toolchain mailing list
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
Siddhesh Poyarekar
2018-06-18 09:39:03 UTC
Permalink
On Mon, 18 Jun 2018 at 14:54, Ramana Radhakrishnan
Post by Ramana Radhakrishnan
I don't think the make bench discussion is totally relevant in this
case. We'd like to see what the performance of this stuff is and it's
always goign to be a battle to get things benchmarked perfectly on
hardware and there is unlikely to be one implementation in the Arm world
+1, I don't think there's a risk in pushing these patches without a
benchmark result. When implementations appear we can run benchmarks
and tweak and choose accordingly. In the worst case if we find that
*everyone* botched up their SVE implementations then we can just drop
that code in some years :)
Post by Ramana Radhakrishnan
unlike other architectures. It's under an ifunc and gives folks
something to reason with, if we are really that concerned about it's
performance today lets put it under a tunable that allows us to turn it
off easily ?
Existing tunables can do the job just fine:

- glibc.tune.cpu can be used to emulate a different processor if
that's suitable for the core
- glibc.tune.hwcap_mask can be used to mask out HWCAP_SVE completely

Siddhesh

Loading...