Post by Richard Henderson
I spoke with Ramana about these at HKG18, and I'm finally getting back to
these. I have routines for
-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S
-rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S
-rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S
-rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S
-rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S
-rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S
-rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S
-rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S
-rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S
-rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S
-rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S
The tests pass when run under Foundation Platform 11.3. What is the best way
to submit these for review and upstreaming? There's nothing in the git README
about an upstream mailing list...
FWIW, my code is at
https://github.com/rth7680/cortex-strings/tree/rth/sve
Thanks for doing these. One general comment is that the routines
tend to use the FFR result even in the case where no potential
fault is detected. Although it's not as obvious as it could be
from some of the published documentation, the architecturally-
preferred approach is instead to have the "normal" case depend only
on the flags set by RDFFRS, not on the FFR itself. E.g. the typical
structure would be something like this:
---------------------------------------------------------------------
SETFFR
loop:
...[A] first-faulting and non-faulting loads predicated...
... on some predicate Pg...
RDFFRS Pn.B, Pg/Z
B.NLAST recovery
...[B] code that ignores Pn and operates on all lanes active in Pg...
continue:
...[C] any code that can naturally be shared without using Pn...
...[D] branch back to loop when appropriate...
...
recovery:
SETFFR
...[E] code that operates only on the lanes active in Pn...
B continue
---------------------------------------------------------------------
Also, using INCB, INCH, INCW and INCD is architecturally preferred over
INCP in cases where either could be used. So if the above loop has a
pointer or byte index Xm, and if Pg is all-true, it would be better to do:
---------------------------------------------------------------------
SETFFR
loop:
...[A] first-faulting and non-faulting loads predicated...
... on some predicate Pg...
RDFFRS Pn.B, Pg/Z
B.NLAST recovery
INCB Xm
...[B] code that ignores Pn and operates on all lanes active in Pg...
continue:
...[C] any code that can naturally be shared without using Pn...
...[D] branch back to loop when appropriate...
...
recovery:
SETFFR
INCP Xm, Pn.B
...[E] code that operates only on the lanes active in Pn...
B continue
---------------------------------------------------------------------
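As a concrete illustration of the structure above, here is a minimal
strlen-style sketch. The register allocation (x0 = string pointer,
x1 = byte index Xm, p2 = all-true Pg, p0 = Pn) and the numeric labels are
assumptions made for this example, not taken from the routines under
review, and the code has not been run here.
---------------------------------------------------------------------
	setffr				// FFR := all-true
	ptrue	p2.b			// Pg: all-true, loop invariant
	mov	x1, #0			// Xm: running byte index
0:	ldff1b	z0.b, p2/z, [x0, x1]	// [A] first-faulting load under Pg
	rdffrs	p0.b, p2/z		// Pn := FFR & Pg, and set the flags
	b.nlast	2f			// last lane not loaded: recover
	incb	x1			// whole vector valid, so step by VL
	cmpeq	p1.b, p2/z, z0.b, #0	// [B] test every lane active in Pg
	b.none	0b			// [D] no NUL found: next vector
	decb	x1			// NUL found: undo the speculative INCB
1:	brkb	p0.b, p2/z, p1.b	// [C] lanes before the first NUL
	incp	x1, p0.b		// add the count of those lanes
	mov	x0, x1
	ret
2:	cmpeq	p1.b, p0/z, z0.b, #0	// [E] test only lanes active in Pn
	b.any	1b			// NUL among the valid lanes: finish
	setffr				// reset the FFR before retrying
	incp	x1, p0.b		// advance by the number of valid lanes
	b	0b
---------------------------------------------------------------------
On the fall-through path p0 is consumed only via the flags that RDFFRS
sets; its value is read only after the branch to 2 is taken.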
If [B] is too complex to copy to [E], an alternative approach is to do
something like this:
---------------------------------------------------------------------
SETFFR
loop:
PTRUE Pg.B or WHILELO Pg.B, Xi, Xj // set Pg for this iteration
...[A] first-faulting and non-faulting loads predicated on Pg...
RDFFRS Pn.B, Pg/Z
B.NLAST recovery
INCB Xm
continue:
...[B] code that ignores Pn and operates on all lanes active in Pg...
...[D] branch back to loop when appropriate...
...
recovery:
SETFFR
INCP Xm, Pn.B
MOV Pg.B, Pn.B
B continue
---------------------------------------------------------------------
The common case then has the extra overhead of resetting Pg in each
iteration, but if [B] is complex (or if Pg is being reset anyway),
then that shouldn't matter.
The idea is that the B.NLAST should be highly predictable,
so it's usually not necessary to wait for the FFR value to become
available. And in practice, getting a precise FFR predicate is likely
to be a slow operation (to the extent that this is an ISA-level
principle rather than a uarch optimisation).
Thanks,
Richard