Bug 569 - svp64 register predicates vs BE arrays of bits
Summary: svp64 register predicates vs BE arrays of bits
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: PC Other
Importance: --- major
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 213
Reported: 2021-01-06 20:17 GMT by Alexandre Oliva
Modified: 2022-02-10 20:10 GMT

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Alexandre Oliva 2021-01-06 20:17:50 GMT
In big-endian mode, bit arrays (which don't exist in C, but do exist in other languages) and bit fields are allocated from the most significant bit to the least significant bit, which is the opposite of the little-endian order.

Even with implicit byte-reversal on load, nothing bit-reverses the value, and bit-reversal is what would be required for predicates to be represented in such a natural way in BE.

We probably have to document this constraint on BE, and state that programmers must arrange for bits to land in the predicate register so that the LSBit (2^0) holds the predicate for vector element 0, and so on.

This can be as simple as encoding the value to be loaded into the predicate register as an integral type, computed by ORing the bits, each one shifted left by the index of the vector element they apply to.  Holding an integer value computed this way in memory, in the cpu-configured endianness, whatever its width, and loading it into the predicate register with that width, will yield the intended predication.
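
A minimal sketch of the encoding described above, in Python for illustration only. The helper names `predicate_mask` and `store_then_load` are hypothetical and not part of any SVP64 tooling; the second helper only models "store at some width and endianness, then load at the same width and endianness".

```python
def predicate_mask(active_elements):
    """OR together one bit per active element: element i maps to bit 2**i."""
    mask = 0
    for index in active_elements:
        if not 0 <= index < 64:
            raise ValueError("a 64-bit predicate register covers elements 0..63")
        mask |= 1 << index
    return mask


def store_then_load(value, width_bytes, byteorder):
    """Model holding the integer in memory and loading it back unchanged."""
    memory = value.to_bytes(width_bytes, byteorder)
    return int.from_bytes(memory, byteorder)


# Elements 0, 2 and 3 are active.
mask = predicate_mask([0, 2, 3])
print(bin(mask))
```

As long as the store and the load agree on width and endianness, the integer value, and therefore the predicate, is the same in both modes.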

Bit arrays indexed by vector element width, or 1-bit-fields declared in the same order as vector elements, will only correspond to the intended vector elements if the CPU is in LE mode.

In BE mode, it does NOT help to configure the specific predicate type as being in LE mode.  In order to be usable for predication, it needs a mixed-endianness representation, with bytes laid out as big-endian, because of byte-reversal on load, but with bits laid out as in little-endian, because of the absence of bit-reversal.
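
To illustrate why byte-reversal alone is not enough, here is a small sketch using a 16-bit value. It assumes, as the description states, that a BE bit array places element 0 in the most significant bit; the function names are made up for this example.

```python
def pack_msb_first(bits, width=16):
    """Pack a bit array with element 0 in the most significant bit."""
    value = 0
    for index, bit in enumerate(bits):
        if bit:
            value |= 1 << (width - 1 - index)
    return value


def byte_reverse(value, width=16):
    """Swap the byte order of a width-bit integer."""
    return int.from_bytes(value.to_bytes(width // 8, "big"), "little")


def bit_reverse(value, width=16):
    """Reverse the order of all bits of a width-bit integer."""
    return int(format(value, "0{}b".format(width))[::-1], 2)


bits = [1, 0, 1, 1] + [0] * 12      # elements 0, 2 and 3 active
wanted = 0b1101                     # element i in bit 2**i
be_packed = pack_msb_first(bits)    # element 0 lands in the top bit

print(hex(be_packed), hex(byte_reverse(be_packed)), hex(bit_reverse(be_packed)))
```

Only the full bit-reversal recovers the mask with element 0 in the LSBit; the byte-reversed value still has the bits within each byte in the wrong order.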
Comment 1 Luke Kenneth Casson Leighton 2021-01-06 21:54:09 GMT
interesting anomaly, good catch.

ok so integer predication is defined as scalar only. i.e. only ONE integer register is utilised, 64 bits in length, NOT a vector of integers.

this meshes well with, or more specifically *is* the reason why VL is restricted to 64.

we are NOT repeat NOT going to change the definition of scalar integers, so the behaviour is DEFINED as how v3.0B integers are defined.

if a user were to load an integer into a register then the lowest arithmetic bit (numbered 63 by IBM using their MSB0 convention, sigh) would be that bit used as the first predicate bit.

i say "lowest arithmetic bit" to mean, according to v3.0B standard scalar behaviour, "the bit which, if the same register were to have the constant 1 placed into it with addi rN, r0, 1 this would result in the MSB0-numbered bit 63 containing a 1 and all other bits would be zero"
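
The relationship between the MSB0 numbering and the "lowest arithmetic bit" can be sketched as follows. This is an illustrative model of a 64-bit register as a plain Python integer; the helper names are hypothetical.

```python
WIDTH = 64


def msb0_to_lsb0(msb0_index, width=WIDTH):
    """Convert an MSB0 bit number (0 = most significant) to a shift amount."""
    return width - 1 - msb0_index


def msb0_bit(reg, msb0_index, width=WIDTH):
    """Read one bit of a register value using MSB0 numbering."""
    return (reg >> msb0_to_lsb0(msb0_index, width)) & 1


def predicate_bit(reg, element):
    """Predicate for a vector element: element 0 is the lowest arithmetic bit."""
    return (reg >> element) & 1


reg = 1  # the constant 1, as in the addi example above
print(msb0_bit(reg, 63), predicate_bit(reg, 0))
```

Under this model MSB0 bit 63 and the predicate bit for element 0 are the same bit.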

if a user used the wrong ld operation (ld rather than ldbrx) then that is their lookout: i.e. it is an error on their part.

this also affects (or, doesn't) the new cr-to-int transfer routines that have to be added to help with predication in general.  these should be modelled after how mtcr and mfcr work in the scalar v3.0B spec.

i will go through the relevant pages and make sure there are sections on this.

https://libre-soc.org/openpower/sv/cr_int_predication/
Comment 2 Luke Kenneth Casson Leighton 2021-01-06 22:55:43 GMT
(In reply to Luke Kenneth Casson Leighton from comment #1)

> https://libre-soc.org/openpower/sv/cr_int_predication/

arrrg this is going to drive me nuts.

i need some urgent help verifying the section added to confirm that it is correct.

this is the area which took 4 MONTHs to track down bugs and required 3 weeks of investigation and help from Ben and Paul.
Comment 3 Luke Kenneth Casson Leighton 2021-01-06 23:04:38 GMT
(In reply to Alexandre Oliva from comment #0)


> Even with implicit byte-reversal on load, this won't bit-reverse, 
 
we are not going to change OpenPOWER v3.0B scalar behaviour.

the perspective that IBM defined Scalar v3.0B Integer registers to behave differently when a Logical operation (AND) is performed vs an ALU operation (ADD) depending on whether LE or BE is set is just not how it works, and if we try to do that it will make understanding and acceptance impossible.

the user must use the correct LD operation - ldbrx or ld - and the user must expect that the ordering of bits follows the v3.0B defined conventions.
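
A simplified sketch of that choice, modelling only the byte order of an 8-byte load: `ld` is assumed to read in the current endian mode and `ldbrx` in the opposite one. The function `load_doubleword` is a hypothetical model, not simulator code.

```python
def load_doubleword(memory, big_endian_mode, byte_reversed=False):
    """Model ld (byte_reversed=False) or ldbrx (byte_reversed=True)."""
    # A byte-reversed load in BE mode reads the bytes as little-endian,
    # and vice versa.
    use_big = big_endian_mode != byte_reversed
    return int.from_bytes(memory[:8], "big" if use_big else "little")


mask = 0b1101
stored_le = mask.to_bytes(8, "little")   # predicate saved by LE code

# In BE mode a plain ld picks up the wrong value; ldbrx recovers the mask.
print(hex(load_doubleword(stored_le, big_endian_mode=True)))
print(hex(load_doubleword(stored_le, big_endian_mode=True, byte_reversed=True)))
```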

we are NOT going to add in implicit bytereversal that changes v3.0B Scalar behaviour (explicit, maybe, implicit absolutely not)

everything will follow from that inviolate hard rule.

this will be challenging due to IBM's use of MSB0 conventions, even harder when it comes to CR numbering, but tough luck for us: we deal with it.  sigh
Comment 5 Jacob Lifshay 2021-01-07 01:31:03 GMT
(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> 
> > https://libre-soc.org/openpower/sv/cr_int_predication/
> 
> arrrg this is going to drive me nuts.
> 
> i need some urgent help verifying the section added to confirm that it is
> correct.
> 
> this is the area which took 4 MONTHs to track down bugs and required 3 weeks
> of investigation and help from Ben and Paul.

Umm, where you have:
CR{7-n} = CR[32+n*4:35+n*4]

Assuming the above CR bit numbers are in MSB0 form, I think that gets CR registers reversed:

The spec says the following (OpenPower ISA v3.1 section 2.3.1):
For all fixed-point instructions in which Rc=1, and for
addic., andi., and andis., the first three bits of CR Field
0 (bits 32:34 of the Condition Register)

You have mistakenly put CR7 in bits 32:34
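
For reference, a sketch of the field placement that the quoted spec text implies, assuming the 32-bit scalar CR is held as an integer whose most significant bit is MSB0 bit 32. The helper `cr_field` is hypothetical.

```python
def cr_field(cr, n):
    """Return the 4 bits of CR Field n, i.e. MSB0 bits 32+4n .. 35+4n."""
    if not 0 <= n < 8:
        raise ValueError("the scalar CR has fields 0..7")
    shift = 4 * (7 - n)      # Field 0 is the most significant nibble
    return (cr >> shift) & 0xF


cr = 0x80000001
print(hex(cr_field(cr, 0)), hex(cr_field(cr, 7)))
```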
Comment 6 Jacob Lifshay 2022-02-09 08:48:35 GMT
LLVM currently assumes that vectors (like SIMD) of bits are laid out such that bit ordering always matches byte ordering (it doesn't have a concept of different bit ordering vs endian) -- BE starts from the MSB as the first vector element, going to the LSB as the last vector element -- LE starts from the LSB as the first vector element through to the MSB as the last vector element.

I found that in LLVM's code for converting between bitvectors and their memory layout (the implementation of bitcasting for vector constants, to be specific). That was a while ago and I was only just reminded of it, so I don't have a link to LLVM's code right now (too tired to look at the moment).

Because that assumption is baked into LLVM, probably spread throughout the code, making it quite difficult to split out bit order as independent from byte order, we will probably want to take the path of least resistance and change SVP64 to have bitmasks be MSB0 in BE, and LSB0 in LE.
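
The layout described in this comment can be modelled as below. This is only a sketch of the behaviour as the comment states it, not a transcription of LLVM's code; `pack_i1_vector` is a hypothetical name, and the N-element vector is assumed to map onto an N-bit integer.

```python
def pack_i1_vector(bits, big_endian):
    """Pack a vector of 1-bit elements into an integer of len(bits) bits."""
    n = len(bits)
    value = 0
    for i, bit in enumerate(bits):
        if bit:
            # BE: element 0 at the MSB. LE: element 0 at the LSB.
            position = (n - 1 - i) if big_endian else i
            value |= 1 << position
    return value


elements = [1, 0, 1, 1, 0, 0, 0, 0]          # elements 0, 2 and 3 set
le_value = pack_i1_vector(elements, big_endian=False)
be_value = pack_i1_vector(elements, big_endian=True)
print(format(le_value, "08b"), format(be_value, "08b"))
```

The two results are bit-reversals of each other, which is the mismatch with an LSB0-only predicate definition.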
Comment 7 Jacob Lifshay 2022-02-09 08:51:42 GMT
we ran into the llvm bit vs byte endian issue in Rust's project portable-simd:
https://github.com/rust-lang/portable-simd/pull/239#issuecomment-1033484051
Comment 8 Jacob Lifshay 2022-02-09 08:55:21 GMT
changing importance to major because of llvm issue
Comment 9 Luke Kenneth Casson Leighton 2022-02-09 09:35:50 GMT
(In reply to Jacob Lifshay from comment #5)

> Umm, where you have:
> CR{7-n} = CR[32+n*4:35+n*4]
> 
> Assuming the above CR bit numbers are in MSB0 form, 

that would be an incorrect assumption. i define the operator CR{nn}
to be "the actual number of the CR Field".  CR0 == CR{n} where n==0.
Comment 10 Luke Kenneth Casson Leighton 2022-02-09 10:17:04 GMT
(In reply to Jacob Lifshay from comment #6)

> Because that assumption is baked into LLVM, probably spread throughout the
> code, making it quite difficult to split out bit order as independent from
> byte order, we will probably want to take the path of least resistance and
> change SVP64 to have bitmasks be MSB0 in BE, and LSB0 in LE.

unfortunately, if i understand correctly, it is quite insane and deeply
problematic to follow this assumption.  let's walk through an example:

* CRs are used as predicates
* BE mode is set
* the predicate is set to say use the EQ bit
* the operation to be executed is on a vectorised "crand"
  instruction where the expectation is to combine the results
  of the crand instruction for further use in predicate masks
* VL is set to 3

what happens here, if i understand correctly, is:

* the predicate mask comprising bits CR0.EQ, CR1.EQ, and CR2.EQ
  is constructed in **REVERSED** order because this is what LLVM
  expects
* the crand operation extracts (say)
     CR8.SO  CR9.SO  CR10.SO  and ANDs them with
     CR16.GT CR17.GT CR18.GT  applying the **REVERSED** predicate mask
     CR2.EQ  CR1.EQ  CR0.EQ   storing the result in
     CR32.LT CR33.LT CR34.LT

then the next instruction, a cror, which is expecting to then use
the CR32-CR33 results as its incoming predicate, must first bit-reverse
them?  but even before that, the CR0-CR2 had to be bit-reversed.
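
A generic sketch of the concern, using a plain element loop rather than the real crand semantics. The functions and the list-based "registers" are assumptions made for illustration; the point is only that a reversed mask selects different elements unless the vector order is reversed too.

```python
def predicated_and(a, b, dest, mask, vl):
    """dest[i] = a[i] & b[i] for each element whose predicate bit is set."""
    result = list(dest)
    for i in range(vl):
        if (mask >> i) & 1:          # element i uses mask bit 2**i
            result[i] = a[i] & b[i]
    return result


def reverse_mask(mask, vl):
    """Reverse the low vl bits of the mask."""
    return sum(((mask >> i) & 1) << (vl - 1 - i) for i in range(vl))


vl = 3
a, b, dest = [1, 1, 1], [1, 1, 1], [0, 0, 0]
mask = 0b001                          # only element 0 enabled

print(predicated_and(a, b, dest, mask, vl))
print(predicated_and(a, b, dest, reverse_mask(mask, vl), vl))
```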

now, should we instead "fix" this by inverting the ordering of the
Vectors so that in BE mode they go VL-1..0 by default whereas in
LE mode they go 0..VL-1 by default? this will do people's heads in.

and, strictly, should we re-order the definition of the bit-numbering
SO GT EQ LT in BE mode so that it now becomes inverted?

this would indeed meet the strict definition required by LLVM.

but such a definition then creates insanity at the GPR/FPR level:
it's not so much any one single operation that is problematic,
it's the interaction *between* operations where things become
deeply problematic, and if flipping the elwidths half way through
that introduces a whole new dimension of complexity, even just
to consider let alone implement.

overall it is just easier to say "LE and BE apply to memory *ONLY*,
the GPR/FPR and CR regfile contents are strictly off limits:
CRs are already defined and do not change; GPR/FPR is strictly
defined as a LE-byte-addressable SRAM at ALL times"

i.e. as far as the hardware is concerned, the only presence of
BE byte-swapping is in the LD/ST operations, hooking an XOR
gate into ldbrx.
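
A minimal model of that rule, assuming a regfile held as one flat little-endian byte array with the endian mode applied only at load time. The class `RegFile` and its methods are hypothetical, written purely to illustrate the idea.

```python
class RegFile:
    """GPRs modelled as a byte-addressable, LE-ordered SRAM."""

    def __init__(self, num_regs=4):
        self.sram = bytearray(8 * num_regs)

    def write_reg(self, reg, value):
        self.sram[8 * reg:8 * reg + 8] = value.to_bytes(8, "little")

    def load_doubleword(self, reg, memory, big_endian):
        # The only place the endian mode matters: interpreting memory bytes.
        value = int.from_bytes(memory[:8], "big" if big_endian else "little")
        self.write_reg(reg, value)

    def read_element(self, reg, index, elwidth_bits):
        # Elements are always addressed in LE order inside the regfile.
        size = elwidth_bits // 8
        start = 8 * reg + index * size
        return int.from_bytes(self.sram[start:start + size], "little")


rf = RegFile()
value = 0x1122334455667788
rf.load_doubleword(0, value.to_bytes(8, "big"), big_endian=True)
rf.load_doubleword(1, value.to_bytes(8, "little"), big_endian=False)
print(hex(rf.read_element(0, 0, 8)), hex(rf.read_element(0, 1, 32)))
```

In this model element 0 holds the least significant bits at every elwidth, whichever endian mode was used for the load.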

byte-reversing here, byte-reversing there, byte-reversing everywhere
is just too much. it will be literally months to review.

if we had completely separate Vector Register files and completely
separate Vector Predicate Mask register files i would say "yes, no
problem".  [but, as you are aware, that then requires a whole stack
of MV/copy instructions]

however because of the retro-fitting on top of an *existing* scalar
regfile (similar to the original MMX) it's just too much.

my feeling is that when it comes to adding LLVM support to SVP64
it is going to be radically different and yet radically simpler
from every other Vector ISA, because of the for-looping.

i fully expect the "for-looping-on-scalars" concept to hit LLVM
in the exact same surprisingly-elegant way that it has in hardware,
drastically simplifying how it is added.

and if SVP64 is damaged by fitting with how SIMD and
other Vector ISAs have been done (with their explicit intrinsics),
that job will be made far harder.

remember: if we follow how things are done for other Vector ISAs
in LLVM, we have ONE AND A HALF MMMILLLLION vector intrinsics.

auto-generating a header file with 1.5 million intrinsics is flat-out
insane.

therefore we *have* to go back to first principles in LLVM (and gcc)
and hit them with a lower-level-conceptual rethink, propagating the
"for-looping-on-scalars" right the way down to the IR representation.
ultimately i expect that to also drastically simplify the competing
SIMD and Vector ISAs implementations but that's not our problem/focus.
Comment 11 Jacob Lifshay 2022-02-09 18:08:59 GMT
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #6)
> 
> > Because that assumption is baked into LLVM, probably spread throughout the
> > code, making it quite difficult to split out bit order as independent from
> > byte order, we will probably want to take the path of least resistance and
> > change SVP64 to have bitmasks be MSB0 in BE, and LSB0 in LE.
> 
> unfortunately, if i understand correctly, it is quite insane and deeply
> problematic to follow this assumption.  let's walk through an example:

no, it's just integer predicates that are bit-reversed. CR predicates are already logically laid out as a vector of bits rather than an integer, so they are already in vector element order and don't need reversing.
Comment 12 Luke Kenneth Casson Leighton 2022-02-09 18:22:48 GMT
consider a case where processing of data requires LSB0 bit zero of
each element to become, ultimately, part of a predicate mask.  the most
logical thing to do is a Vectorised CMPI operation and the vector
of CR Field results treated directly as a predicate mask.

if however BE is involved then at least one reversal instruction is required

consider also the crweird instructions which transfer between integers and
CR fields: these too would become damaged by bit and/or byte reversal.

little-endian has an extremely important arithmetic property: the bytes
are in natural incrementing order, so they may be taken in pairs, quads, or
any other multiple, and regardless of that multiple the numerical RADIX
significance is still preserved.

BE does *not* hold this same property.  when constructing big integer
math libraries it is *not* possible to sequentially store an array of 64 bit 
numbers representing the large number then follow up with a typecast to
an array of 32-bit: you *have* to perform word-swapping on pairs of 32bit
numbers first to get the sequence back.
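
The effect can be reproduced with a short sketch. It assumes the large number is held as an array of 64-bit limbs with the least significant limb first, which is an interpretation of the paragraph above rather than something it states outright; the helper names are hypothetical.

```python
import struct


def reinterpret_as_32(limbs64, prefix):
    """Store 64-bit limbs to memory, then re-read the bytes as 32-bit words."""
    raw = struct.pack("{}{}Q".format(prefix, len(limbs64)), *limbs64)
    return list(struct.unpack("{}{}I".format(prefix, 2 * len(limbs64)), raw))


def value_of(limbs, bits):
    """Numerical value of a least-significant-first limb array."""
    return sum(limb << (bits * i) for i, limb in enumerate(limbs))


limbs64 = [0x1111111122222222, 0x3333333344444444]   # least significant first
le32 = reinterpret_as_32(limbs64, "<")
be32 = reinterpret_as_32(limbs64, ">")

# Swapping each pair of 32-bit words repairs the BE case.
be32_swapped = [w for i in range(0, len(be32), 2) for w in (be32[i + 1], be32[i])]

print([hex(w) for w in le32])
print([hex(w) for w in be32])
```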

given the interchangeability between predicates and data it is simply
not safe or sane to attempt anything other than treating the regfile as
a byte-addressable LE-ordered SRAM.

having such a dedicated property ensures that changing elwidths does not
require such byteswapping instructions.

i appreciate that LLVM may have made some assumptions about SIMD, but tough.
when we have the prerequisite USD 25 million to do a decent job of adding
SVP64 to LLVM this can be addressed, and LLVM assumptions sorted out.
it is good to be *aware* of the limitations, because there will be no
surprises in budgeting to sort it out.
Comment 13 Jacob Lifshay 2022-02-09 18:31:19 GMT
(In reply to Luke Kenneth Casson Leighton from comment #10)
the problem is if we don't use the already existing vector support in LLVM, it makes our problem more than 10x larger, from just needing to add SVP64 to the powerpc target which is relatively simple (probably less than 30kloc) into needing to rewrite/duplicate major portions of LLVM and Clang and Rustc and Flang and all other language frontends (probably 100k to millions of loc) cuz we'd need to rewrite/duplicate everything everywhere that touches vectors: optimizations, frontends, backends, etc.

imho that approach is patently absurd unless you are able to spend >$100M on dozens of programmers over several years.

also, programmers everywhere will hate us if we don't map the pre-existing llvm vectors to svp64 instructions because then they'll have to rewrite their code if they want it to go fast on svp64 -- they'll probably mostly ignore svp64 and we will have mostly failed in our mission to make cpu vectors easier to work with.
> 
> and if SVP64 is damaged by fitting with how SIMD and
> other Vector ISAs have been done (with their explicit intrinsics),
> that job will be made far harder.
> 
> remember: if we follow how things are done for other Vector ISAs
> in LLVM, we have ONE AND A HALF MMMILLLLION vector intrinsics.

not actually, llvm intrinsics can have constant and/or metadata arguments, allowing you to share one intrinsic between many different operations.

Also, when first starting out, imho we should add the functionality to llvm that is basically what RISC-V V implements (mapping llvm's architecture-independent fixed/scalable length vector operations to SVP64 instructions, allowing us to reuse the 10s of millions of lines of code spread across the ecosystem that targets architecture-independent fixed-length and/or scalable vectors, and mostly leave the rest for later).
Comment 14 Luke Kenneth Casson Leighton 2022-02-09 18:32:12 GMT
(In reply to Jacob Lifshay from comment #11)

> no, it's just integer predicates that are bit-reversed. CR predicates are
> already logically laid out as a vector of bits rather than an integer, so
> they are already in vector element order and don't need reversing.

the fields are nibbles (4 bit) and if there is to be numbering reversal then
it should be consistently applied, even to nibbles.

that it is quite insane to consider should give a clue that reversal of integer
predicates is equally as insane.

i repeat: if we had totally separate predicate mask regfiles this would not be
an issue, i would be agreeing with you 100%, that the predicates could and
should respect the current MSB ordering.

it is the fact that the regfile is MMX-esque (8/16/32/64 within a 64 bit GPR)
that makes the idea of doing anything other than strictly treating the regfile
as byte-addressable LE-ordered SRAM completely insane.

i am not kidding when i say that it would take many months to do a full audit
of the implications of this idea.

it is simply too much.
Comment 15 Luke Kenneth Casson Leighton 2022-02-09 18:39:28 GMT
if you are volunteering to write the Grant Application and develop
the simulator and the HDL, after first finishing the delivery of the
commitments we are already obliged to complete, then great.
Comment 16 Jacob Lifshay 2022-02-09 19:04:14 GMT
(In reply to Luke Kenneth Casson Leighton from comment #12)
> consider a case where processing of data requires LSB0 bit zero of
> each element to become, ultimately, part of a predicate mask.  the most
> logical thing to do is a Vectorised CMPI operation and the vector
> of CR Field results treated directly as a predicate mask.
> 
> if however BE is involved then at least one reversal instruction is required

no reversals are required because you never touch an integer predicate, you go from vector of ints (not a predicate) to the cmpi into a vector of cr bits (not an integer predicate so bitreversal doesn't happen) to whatever instruction is predicated where the cr vector is used as a predicate (not an integer predicate so bitreversal doesn't happen)
> 
> consider also the crweird instructions which transfer between integers and
> CR fields: these too would become damaged by bit and/or byte reversal.

those instructions would bitreverse the integers in BE mode


> when constructing big integer
> math libraries it is *not* possible to sequentially store an array of 64 bit 
> numbers representing the large number then follow up with a typecast to
> an array of 32-bit: you *have* to perform word-swapping on pairs of 32bit
> numbers first to get the sequence back.

actually, it's perfectly possible to have that property in BE, simply store the bigint in BE:

e.g. a 256-bit bigint:

0xFEDCBA9876543210FEDCBA9876543210FEDCBA9876543210FEDCBA9876543210

is stored as an array of bytes:

[
0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
]

which can be grouped in 32-bit chunks:

[
0xFEDCBA98, 0x76543210, 0xFEDCBA98, 0x76543210,
0xFEDCBA98, 0x76543210, 0xFEDCBA98, 0x76543210,
]

> 
> given the interchangeability between predicates and data it is simply
> not safe or sane to attempt anything other than treating the regfile as
> a byte-addressable LE-ordered SRAM.

imho it is perfectly safe/sane to treat the regfile as a byte-addressable LE/BE sram, such that the interpretation of the bytes follows memory BE/LE mode (with the addition that the whole SRAM is byteswapped when switching between cpu BE/LE mode to preserve 64-bit reg values for backward compatibility -- that switching can be completely ignored at the user application level). it requires slightly more hardware, but greatly simplifies software.
> 
> having such a dedicated property ensures that changing elwidths does not
> require such byteswapping instructions.

actually, in BE mode, changing elwidths would require separate byteswapping instructions if we followed your regfile-is-only-LE plan. if we followed lxo and my regfile-endian-matches-memory-endian plan, no separate byteswapping instructions need to be added by the programmer.
> 
> i appreciate that LLVM may have made some assumptions about SIMD, but tough.
> when we have the prerequisite USD 25 million to do a decent job of adding
> SVP64 to LLVM this can be addressed, and LLVM assumptions sorted out.
> it is good to be *aware* of the limitations, because there will be no
> surprises in budgeting to sort it out.
Comment 17 Jacob Lifshay 2022-02-09 19:07:21 GMT
(In reply to Luke Kenneth Casson Leighton from comment #15)
> if you are volunteering to write the Grant Application and develop
> the simulator and the HDL, after first finishing the delivery of the
> commitments we are already obliged to complete, then great.

for what exactly? byteswapping the regfile into memory order?
Comment 18 Luke Kenneth Casson Leighton 2022-02-10 12:56:11 GMT
(In reply to Jacob Lifshay from comment #17)
> (In reply to Luke Kenneth Casson Leighton from comment #15)
> > if you are volunteering to write the Grant Application and develop
> > the simulator and the HDL, after first finishing the delivery of the
> > commitments we are already obliged to complete, then great.
> 
> for what exactly? byteswapping the regfile into memory order?

all of it. the full and comprehensive instruction analysis, modifications
required to the spec, everything.  it's so large that i'd suggest you
write the Grant at the full EUR 50,000.

(In reply to Jacob Lifshay from comment #16)

> no reversals are required because you never touch an integer predicate, you
> go from vector of ints (not a predicate) to the cmpi into a vector of cr
> bits (not an integer predicate so bitreversal doesn't happen) to whatever
> instruction is predicated where the cr vector is used as a predicate (not an
> integer predicate so bitreversal doesn't happen)

to say that the problem may be "avoided" by not using precisely and exactly
one of the beneficial features of SVP64 is... well, i'm not sure what to say

> actually, it's perfectly possible to have that property in BE, simply store
> the bigint in BE:

the example that you gave when picking 32-bit means that
when setting elwidth to 64-bit the words in each 64-bit
element are in the wrong order.

it does not matter which example is picked (any N, M where
N and M are both 16 or greater).

this is the *known* problem of big-endian mode.

> compatibility -- that switching can be completely ignored at the user
> application level). it requires slightly more hardware, but greatly
> simplifies software.

"slightly more hardware" - which needs a full audit and a deeply-comprehensive
review that we absolutely do not have any time or resource to do right now.

please focus on priority tasks.
Comment 19 Jacob Lifshay 2022-02-10 20:04:52 GMT
(In reply to Luke Kenneth Casson Leighton from comment #18)
> (In reply to Jacob Lifshay from comment #17)
> > (In reply to Luke Kenneth Casson Leighton from comment #15)
> > > if you are volunteering to write the Grant Application and develop
> > > the simulator and the HDL, after first finishing the delivery of the
> > > commitments we are already obliged to complete, then great.
> > 
> > for what exactly? byteswapping the regfile into memory order?
> 
> all of it. the full and comprehensive instruction analysis, modifications
> required to the spec, everything.  it's so large that i'd suggest you
> write the Grant at the full EUR 50,000.

ok, I can work on that, though I'd likely put it off till later, as you mentioned. Just wanted to make sure you were onboard with the idea of Libre-SOC working on that.

> (In reply to Jacob Lifshay from comment #16)
> 
> > no reversals are required because you never touch an integer predicate, you
> > go from vector of ints (not a predicate) to the cmpi into a vector of cr
> > bits (not an integer predicate so bitreversal doesn't happen) to whatever
> > instruction is predicated where the cr vector is used as a predicate (not an
> > integer predicate so bitreversal doesn't happen)
> 
> to say that the problem may be "avoided" by not using precisely and exactly
> one of the beneficial features of SVP64 is... well, i'm not sure what to say

my point is that predicates are only bitreversed when in integer registers (or in memory). CRs are already kinda a Vector<i1, N>, stored in logical element order, one bit per CR, so no bitreversing needs to happen, because bitreversing happens whenever the cpu is translating between the type Vector<i1, N> and its memory layout. Integer predicates are bitreversed because they are Vector<i1, N> but converted to the in-memory layout, then the in-memory bytes are interpreted as a 64-bit integer.
> 
> > actually, it's perfectly possible to have that property in BE, simply store
> > the bigint in BE:
> 
> the example that you gave when picking 32-bit means that
> when setting elwidth to 64-bit the words in each 64-bit
> element are in the wrong order.

how so, afaict it's the right order.

Example with a different 256-bit number so you can distinguish 64-bit parts:
8-bit words:
[0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xfe,
0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10, 0x13,
0x57, 0x9b, 0xdf, 0x02, 0x46, 0x8a, 0xce, 0xca,
0x86, 0x42, 0x0f, 0xdb, 0x97, 0x53, 0x1a, 0xa5]
16-bit words:
[0x1234, 0x5678, 0x9abc, 0xdefe, 0xdcba, 0x9876, 0x5432, 0x1013,
0x579b, 0xdf02, 0x468a, 0xceca, 0x8642, 0x0fdb, 0x9753, 0x1aa5]
32-bit words:
[0x12345678, 0x9abcdefe, 0xdcba9876, 0x54321013,
0x579bdf02, 0x468aceca, 0x86420fdb, 0x97531aa5]
64-bit words:
[0x123456789abcdefe, 0xdcba987654321013,
0x579bdf02468aceca, 0x86420fdb97531aa5]
128-bit words:
[0x123456789abcdefedcba987654321013, 0x579bdf02468aceca86420fdb97531aa5]
256-bit words:
[0x123456789abcdefedcba987654321013579bdf02468aceca86420fdb97531aa5]
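
The listings above can be regenerated with a short script, shown here as an illustrative sketch. It stores the whole 256-bit number most significant byte first and regroups the same bytes at each width; the helper `group` is a made-up name.

```python
value = 0x123456789abcdefedcba987654321013579bdf02468aceca86420fdb97531aa5
raw = value.to_bytes(32, "big")      # whole bigint stored in BE order


def group(data, width_bits):
    """Split the byte string into big-endian words of the given width."""
    size = width_bits // 8
    return [int.from_bytes(data[i:i + size], "big")
            for i in range(0, len(data), size)]


for width in (8, 16, 32, 64, 128, 256):
    words = group(raw, width)
    print("{}-bit words:".format(width))
    print([format(w, "#0{}x".format(width // 4 + 2)) for w in words])
```

Note that in this layout index 0 holds the most significant word at every width, which is the opposite convention to a least-significant-first limb array.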