Bug 216 - LOAD STORE buffer needed
Summary: LOAD STORE buffer needed
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on: 257 465
Blocks: 383
Reported: 2020-03-11 20:28 GMT by Luke Kenneth Casson Leighton
Modified: 2023-09-12 02:34 BST (History)
4 users

See Also:
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments
screenshot of memory dependency cell (69.44 KB, image/png)
2020-03-13 20:30 GMT, Luke Kenneth Casson Leighton
screenshot of memory dependency matrix (332.39 KB, image/png)
2020-03-13 20:31 GMT, Luke Kenneth Casson Leighton
over-run 80 chars (10.86 KB, image/png)
2020-05-25 11:36 BST, Luke Kenneth Casson Leighton

Description Luke Kenneth Casson Leighton 2020-03-11 20:28:00 GMT
http://bugs.libre-riscv.org/show_bug.cgi?id=215#c5

discussion and advice from comp.arch
https://groups.google.com/forum/#!topic/comp.arch/cbGAlcCjiZE
Comment 1 Luke Kenneth Casson Leighton 2020-03-11 22:54:37 GMT
(In reply to Jacob Lifshay from http://bugs.libre-riscv.org/show_bug.cgi?id=215#c5)
> (In reply to Luke Kenneth Casson L
> > whew.
> > 
> > so that's 128-bit-wide for _textures_... that's on the *load* side.  are
> > there any simultaneous (overlapping) "store" requirements? are the
> > code-loops tight enough to require simultaneous 128-bit LD *and* 128-bit ST?
> 
> yes and no -- there is code that will benefit from simultaneous loads and
> stores (memcpy and probably most other code that has both loads and stores
> in a loop), however it isn't strictly necessary.

ok.

> It will be highly beneficial to support multiple simultaneous 8, 16, 32, or
> 64-bit loads to a single cache line all being able to complete
> simultaneously independently of alignment in that cache line. 

yes i have been thinking about this for a day since getting the LDSTCompUnit operational.  it currently uses a "fake" memory block (for testing) and the next step is to look at this.

> Also
> misaligned loads that cross cache lines (and possibly page boundaries),
> though those don't need to complete in a single cache access.

ok

> All the above also applies to stores, though they can be a little slower
> since they are less common.
> 
> I realize that that will require a really big realignment network, however
> the performance advantages I think are worth it.

yes.  there is no point having high-performance computation if we are no good at getting data in and out.
 
> For a scheduling algorithm for loads that are ready to run (6600-style
> scheduler sent to load/store unit for execution, no conflicting stores
> in-front, no memory fences in-front), we can have a queue of memory ops and
> each cycle we pick the load at the head of the queue and then search from
> the head to tail for additional loads that target the same cache line
> stopping at the first memory fence, conflicting store, etc. Once those loads
> are selected, they are removed from the queue (probably by marking them as
> removed) and sent thru the execution pipeline.
> 
> We can use a similar algorithm for stores.

right. ok. thanks to Mitch Alsup, i have a Memory Dependency Matrix that takes care of discerning the loads and stores and preserving them as batches.  we could, if we wanted, do not only TSO: it can handle cross-SMP in a way that makes atomic memory either entirely moot or dead trivial.  this i need to do a little more research on.

anyway the point is: LOADs as a batch are already identified and hold up any STOREs, and vice-versa.

> To find the required loads, we can use a network based on recursively
> summarizing chunks of the queue entries' per-cycle ready state, then
> reversing direction from the summary back to the queue entries to tell the
> entries which, if any, execution port they will be running on this cycle.
> There is then a mux for each execution port in the load pipeline to move the
> required info from the queue to the pipeline. The network design is based on
> the carry lookahead network for a carry lookahead adder, which allows taking
> O(N*log(N)) space and O(log(N)) gate latency.

with the Memory DMs already in place, taking care of separating LOAD from STORE batches, would it be necessary to do that? (i don't know the answer).

also the ordering is not important (Weak Memory Model) and so a buffer will *only* have LOADs in or STOREs but not both, and it becomes possible to analyse the buffer and see if a batch is present.

actually if we had a (really) small CAM that mirrored the way the L1 cache worked, that might work.  2 maybe 4 lines or something.

then you can detect when the lines are fully occupied...

hmmm needs thought

> Loads/Stores that cross a cache boundary can be split into 2 load/store ops
> when sent to the queue and loads reunited when they both complete. They
> should be relatively rare, so we can probably support reuniting only 1 op
> per cycle.

yes.
 
> RMW Atomic ops and fences can be put in both load and store queues where
> they are executed once they reach the head of both queues.

for these, luckily, Mitch told me you simply set *both* the MemRead *and* MemWrite hazard, and the fascinating result is that they have to be atomic: they lock out all other LDs and STs and, because of the way the DMs work, the order is preserved as well.

so there will *be* no overlap between atomics, LDs, or STs.

now, whether these all go into the same queue, i don't know.

the main thing is, if we have a queue it has to basically not just be a queue, it has to be a cache as well.
Comment 2 Jacob Lifshay 2020-03-11 23:08:39 GMT
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Jacob Lifshay from
> http://bugs.libre-riscv.org/show_bug.cgi?id=215#c5)
> > For a scheduling algorithm for loads that are ready to run (6600-style
> > scheduler sent to load/store unit for execution, no conflicting stores
> > in-front, no memory fences in-front), we can have a queue of memory ops and
> > each cycle we pick the load at the head of the queue and then search from
> > the head to tail for additional loads that target the same cache line
> > stopping at the first memory fence, conflicting store, etc. Once those loads
> > are selected, they are removed from the queue (probably by marking them as
> > removed) and sent thru the execution pipeline.
> > 
> > We can use a similar algorithm for stores.
> 
> right. ok. thanks to Mitch Alsup, i have a Memory Dependency matrix that
> takes care of discerning and preserving the loads and stores into batches. 
> we could if we wanted to do not only TSO, it can handle cross-SMP in a way
> that makes atomic memory either entirely moot or dead trivial.  this i need
> a little more research on.
> 
> anyway the point is: LOADs as a batch are already identified and hold up any
> STOREs, and vice-versa.

Ok. The algorithm I proposed can still be used for scheduling loads/stores inside each batch and deciding which cache line to access each cycle, since I think the memory dependency matrix currently handles deciding which ops to execute each cycle by relying on a slow and simple FIFO algorithm. The queue in the algorithm I proposed can be the same HW as the memory dependency matrix, so it would end up as a hybrid algorithm.

The recursive summary mechanism inspired by carry lookahead will still be useful.
Comment 3 Luke Kenneth Casson Leighton 2020-03-12 01:54:19 GMT
(In reply to Jacob Lifshay from comment #2)
>
> Ok. The algorithm I proposed can still be used for scheduling loads/stores
> inside each batch and deciding which cache line to access each cycle, since
> I think the memory dependency matrix currently handles deciding which ops to
> execute each cycle by relying on a slow and simple FIFO algorithm. 

nono, it's a bitlevel matrix, capable of multi issue execution and resolving all conflicts and dependencies in a single cycle.

the ordering is preserved by cascading bits being raised, just like in the Register-style DMs.

so yes it has the same *characteristics* as a FIFO but it is not a sequential binary-numbered array, it is *unary* numbering, and the bits (SR Latches) specify "row X is in front of column Y".  actually, "row X is a memory LD hazard on column Y"

> The queue
> in the algorithm I proposed can be the same HW as the memory dependency
> matrix, so it would end up as a hybrid algorithm.

i am very wary of making any kind of changes to the Dependency Matrices.  it took months to get an understanding of them, enough to implement correctly.

using them as-is, as a component: no problem.  modifying them? no - not at this stage, we simply haven't got time.

> The recursive summary mechanism inspired by carry lookahead will still be
> useful.

can you do this piece? if we spec it up. i can point you at the code it needs to connect to (minerva L1 cache and LDSTCompUnit).
Comment 4 Luke Kenneth Casson Leighton 2020-03-12 11:13:46 GMT
(In reply to Luke Kenneth Casson Leighton from comment #3)

> i am very wary of making any kind of changes to the Dependency Matrices.  it
> took months to get an understanding of them, enough to implement correctly.

(and it was 10 months ago.  sigh)

the load/store system is a *multi*-step process: there is the "normal" (Register-FU and FU-FU) hazard system, which takes care of the "usual" register-based dependencies that all ALUs have (including the Branch Unit), *and* there are the inter-address LD/ST hazard Dependency Matrices.

the memory-function-unit matrices are augmented versions of (they differ from) the "normal" register-style DMs in that they compare 10 bits of the address as a way to see if there was a conflict, the logic being deliberately "overzealous".

you remember we discussed that last year, jacob, because 10 bits happens to hit the 1 MEGABYTE boundary of Vulkan Texture blocks?

what i considered doing was having, instead of a straight 10-bit address-line comparator, shifting that up by 4 bits and having a *bit-map* comparator for the 4 LSBs.

in other words a byte LOAD would set 1 bit in the 16 bits, a half-word 2 bits, and so on.  (actually to cover misaligned LOADs it would need to be a minimum 23 bits and/or split into *two* separate LOADs).

let us assume that the bitmap width is exactly the same as the cache line width (16 bits representing 16 byte wide cache lines).

this bitmap - if filled by multiple LOADs - can have a popcount carried out on it, and that tells us which "line" in the buffer has the highest number of bytes to perform LOADs.

if the entire bitmap is all 1s, that is a full cache line of LOADs.
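
for illustration, here is a rough plain-python sketch of that bitmap-plus-popcount idea (names invented here, not repo code; 16-byte cache lines and power-of-two lengths assumed):

    def byte_map(addr, length):
        # position within the 16-byte cache line, then length-worth of 1s
        offs = addr & 0xF
        return ((1 << length) - 1) << offs    # e.g. len=2 at offs=3 -> 0b11000

    def merge_and_popcount(ops):
        # ops: list of (addr, length) already known to hit the same line
        merged = 0
        for addr, length in ops:
            merged |= byte_map(addr, length) & 0xFFFF   # clip to one line
        return merged, bin(merged).count("1")

    merged, count = merge_and_popcount([(0x1000, 8), (0x1008, 8)])
    assert merged == 0xFFFF and count == 16    # a full cache line of LOADs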
Comment 5 Luke Kenneth Casson Leighton 2020-03-12 13:35:26 GMT
hm.

misaligned is tricky.  am just reading POWER 3.0B, book III, 6.5.8 (page 1074), it says that quad-word (lq, stq) is allowed to raise an "alignment" exception.  there's several others as well (including when write-through is required or cacheing is inhibited) however at least if we don't have to deal with 8-byte misalignment that makes things a *lot* simpler.
Comment 6 Luke Kenneth Casson Leighton 2020-03-12 17:29:04 GMT
ok i think i have a data structure that will help.  it's a (very small) CAM, with extra "things".

also it critically depends upon a pre-analysis of the LD/ST operation, breaking it down from this form, where unary is defined as (1<<N):

FunctionUnit# Addr Length

into:

unary(FU#){0}  Addr[4:]   unary(Addr[0:3]     )  (start=0, mask=000bbbb)
unary(FU#){1}  Addr[4:]+1 unary(Addr[0:3]&mask)  (start=M, mask=bbb0000)

when a write is misaligned across a cache line, it results in the *second* operation (unary(FU#){1}) being activated, where there would be two separate and distinct LOADs / STOREs coming from *one* FunctionUnit.

the (start,mask) is so as to be able to shift and mask the data as it comes in (or out) of the register.  this because when it's split into two, obviously you have to know which portion of each split covers which part of the data register.
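
for illustration, a rough python sketch of that breakdown (16-byte lines assumed; the register start/mask bookkeeping is left out, and the names are invented here):

    def split_ldst(fu, addr, length):
        # one (FU#, addr, len) becomes an aligned half and an optional
        # misaligned half, each with its own unary row-id and byte-mask
        offs = addr & 0xF
        mask = ((1 << length) - 1) << offs          # up to 23 bits wide
        lo = (1 << (2*fu),     addr >> 4,       mask & 0xFFFF)  # unary(FU#){0}
        hi = (1 << (2*fu + 1), (addr >> 4) + 1, mask >> 16)     # unary(FU#){1}
        return lo, (hi if (mask >> 16) else None)

    lo, hi = split_ldst(fu=2, addr=0x10E, length=8)  # crosses a line boundary
    # lo covers bytes 14-15 of line 0x10, hi covers bytes 0-5 of line 0x11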


let us assume that there are 4 LOAD/STORE Function Units.

there is a table (a CAM) with **EIGHT** rows, with *unary* row-matching by:

unary(FU#0){0}
unary(FU#0){1}
unary(FU#1){0}
unary(FU#1){1}
unary(FU#2){0}
unary(FU#2){1}
unary(FU#3){0}
unary(FU#3){1}

and in each column there is:

* MSBs of address - address[4:]
* *bitmask* of bytes covered by address[0:3]
* (up to) 128 bit data to be LD/ST'd

basically it's a very very small L0 cache, and because it's so small we can do row-addressing in *unary*.

the key to resolving misalignment is that initial break-down of the "original" LD/ST into a *PAIR* (plus a bitmask)

the bit that is a bit... awful is: there will be *four* Load-Store Units creating and contending for up to *EIGHT* simultaneous reads/writes to this CAM, worst-case (every single one of the operations crosses a cache-line boundary)

that's far too much.

i think it is more reasonable to limit that to only four, however even a four-way port-contention system is still one hell of a lot.

writing to this store/buffer/CAM: with four ports (four ways in CAM terminology) any one LOAD/STORE will be able to write to two separate addresses in the CAM

AND

up to *four* LOAD/STORE Units will be able to read/write to the *same* address (same row).

the Memory itself will need individual byte-level enable lines.

then, the "flush" from this small CAM will be able to read/write entire lines out to the L1 Cache.  each line will kinda be like one single-entry FIFO.
Comment 7 Luke Kenneth Casson Leighton 2020-03-13 11:50:52 GMT
jacob, could i ask you the favour of pseudo-coding up the algorithm you
described?  bear in mind, the DMs will have taken care of separating
out all LOADs, all STOREs, and all AMOs into batches.

then we can do a comparison.
Comment 8 Jacob Lifshay 2020-03-13 18:37:01 GMT
(In reply to Luke Kenneth Casson Leighton from comment #7)
> jacob, could i ask you the favour of pseudo-coding up the algorithm you
> described?  bear in mind, the DMs will have taken care of separating
> out all LOADs, all STOREs, and all AMOs into batches.
> 
> then we can do a comparison.

Yeah, I can do that.
Comment 9 Luke Kenneth Casson Leighton 2020-03-13 20:16:34 GMT
here's some links to existing code:

the MemFunctionUnits:
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/memfu.py;hb=HEAD
this is the "top level" of the mem-matrices, so quite boring

still boring:
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_fu_matrix.py;hb=HEAD

ok *now* we are cooking:
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_dependence_cell.py;hb=HEAD

however this is really the key bit:
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;hb=HEAD

that is where you get a Pascal's Triangle of address-comparators
*only in the bottom 10 bits*.

the actual algorithm is described in Mitch's book chapters, 11.4.12
i'll attach screenshots for the diagrams in a second.

the code i would like to connect to is here:
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/minerva/units/loadstore.py;hb=HEAD

CachedLoadStoreUnit

and here is where the LOADs/STOREs actually get fired off:
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst.py;h=206f44876b00b6c1d94716e624a03e81208120d4;hb=HEAD#l277

here, rather than have that code *arbitrarily* fire off the LD/ST, and *assume* it will complete in a single cycle:

* stwd_mem_o must be PREVENTED from asserting UNTIL it is absolutely known that the store will succeed
* load_mem_o must likewise be prevented from asserting until the load is known to succeed

thus, it is perfectly fine for LOAD (or STORE) to take more than one cycle (just like any other Function Unit).

yes, i know, if the LD/ST cannot complete (raises an exception) this is a type of "result", which must result in the cancellation of the instruction (and all downstream instructions), and an exception raised.

this is TODO however it is important that the information propagate back through the LOAD/STORE buffer as a potential "result": "LOAD could not take place due to exception".
Comment 10 Luke Kenneth Casson Leighton 2020-03-13 20:30:48 GMT
Created attachment 36 [details]
screenshot of memory dependency cell

this cell is the basis of the LOAD-STORE hazard detection.  its design and purpose is to separate out sequences of LOADs from sequences of STOREs as well as separate out "atomic" operations which are defined as both LOAD *and* STORE being activated (requested) at the same time.
Comment 11 Luke Kenneth Casson Leighton 2020-03-13 20:31:53 GMT
Created attachment 37 [details]
screenshot of memory dependency matrix

this screenshot simply shows a group of Memory Dependency Cells.
Comment 12 Luke Kenneth Casson Leighton 2020-03-15 21:19:49 GMT
jacob i'm extending the LD/ST-address-conflict matrix to unary-overlap in the last LSB 4 bits of the address.
an unary map (LD byte == 0b0000 0000 0000 0001, LD dword = 0b0000 0000 1111 1111)
then shifted up by the LSBs of the address, and split into two (if there is
a misalignment), means that two bit-level comparisons detect if the LDs overlap
in the last 4 bits of the address.

the current system already compares bits 4 thru 11 as a *binary* address.
by combining with the unary system it gives a full guaranteed detection
of overlap, based not just on the address but on the length of the operation.
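
roughly, the combined per-pair test then looks like this (plain-python sketch, invented names):

    def ldst_conflict(addr_a, len_a, addr_b, len_b):
        # unary byte-maps from the 4 LSBs plus the length...
        map_a = ((1 << len_a) - 1) << (addr_a & 0xF)
        map_b = ((1 << len_b) - 1) << (addr_b & 0xF)
        # ...combined with the binary compare of the middle bits
        # (bits 4 thru 11 in the existing addr_match code)
        same_line = (addr_a >> 4) == (addr_b >> 4)
        return same_line and (map_a & map_b) != 0

    assert ldst_conflict(0x100, 4, 0x102, 2)       # byte ranges intersect
    assert not ldst_conflict(0x100, 4, 0x104, 4)   # adjacent, no overlap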

however that leaves the top bits (11 thru 39 or 48, whatever we decide the VM page size to be).

and that can be taken care of by a 4 or 8 entry mini-L1-cache-aka-LD/ST-buffer

however what is nice is that the mini-L1-cache can:

a) use the fact that the address-conflict matrix has *already* detected (compared) addresses in bits 0-11

b) find those addresses which *are* equal (in bits 4 and above)

c) make sure the upper bits (11+) are also equal

d) use a merge of the *unary* maps of bits 0..3 to detect which LDs/STs are covering the same L1 cache line, for all addresses detected in (b+c)

the unary map "merge" is where a popcount will tell us which is the best one to put straight through to the L1 cache.  if all 16 bits (when merged) are set, we know that's a full cache line.

the nice thing is, it's not actually that complicated.  we don't want to make the *entire* L1 cache line this (hence a mini-pre-L1-cache) because it would mean a popcount function on absolutely every single line of the L1 cache, and a lot more besides.

plus, an "optimisation" is that for those address bits 4-11, well, because we've already compared them, instead of doing the comparison again, we can use the *conflict matrix row number* as a substitute.  i think.

i have to check that.
Comment 13 Luke Kenneth Casson Leighton 2020-03-16 17:25:53 GMT
could everyone just have a look at this and see if i've got it right:
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;h=0312504eb818c5e77d8a85d3948f733663b34b2e;hb=182e4cd214022a8ce0a498c2ceeca223a282f329

the idea is as described: turn the last 4 bits of the address, plus the
length (1,2,4,8 bytes) into a bitmap, equal in length to a cache line
(actually, 2 cache lines) i.e. 2x 16-bits

then you do *unary* bit-wise compare on those, but *binary* compare on
the remainder of the address.

the tricky bit is that the roll-over (misaligned on a 16-byte boundary)
will need *three* compares:

* address-equal, compare 1st cache line (16 LSBs of the bitmap)
* 1st address PLUS ONE equal to *2nd* address:
  - compare 8 bytes of 1st address's SECOND cache line (bits 16-23) with
    first 8 bytes of 2nd address's FIRST cache line (bits 0-7)
* reverse - *2nd* address PLUS ONE equal to 1st address
  - likewise reverse on bitmap.

confusing as hell, hence why it needs checking.
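
to help with the checking, a plain-python sketch of those three compares (each op carries a 2x16-bit bitmap, i.e. 32 bits; names invented here):

    def twin_overlap(line1, map1, line2, map2):
        # line* = binary cache-line address, map* = 32-bit byte-map
        # (bits 0-15 = first line, bits 16-31 = spill-over into the next)
        if line1 == line2:                        # 1st: same line
            return (map1 & map2 & 0xFFFF) != 0
        if line1 + 1 == line2:                    # 2nd: op1 spill vs op2 start
            return ((map1 >> 16) & map2 & 0xFF) != 0
        if line2 + 1 == line1:                    # 3rd: the reverse case
            return ((map2 >> 16) & map1 & 0xFF) != 0
        return False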
Comment 14 Luke Kenneth Casson Leighton 2020-03-23 10:21:03 GMT
ok i have a pair of classes now:

* LDSTSplitter - this is designed to connect directly to LDSTCompUnit
* TwinPartialAddrBitmap - designed to be connected to LDSTSplitter

the idea is:

* misaligned LDs/STs that cross cache-line boundaries get "split" into
  a *pair* of LD/ST addresses plus corresponding "byte-map".

* these "byte-maps" are based on the bottom LSBs of the address *and*
  the length (byte-LD = len=1 ==> 0b1, halfword-LD = len=2 ==> 0b11
  which is then *SHIFTED UP* by the bottom LSBs

* any byte-map that goes over a cache line results in the SECOND
  of the two addresses "activating" from the LDSTSplitter

* key thing to note: the bytemaps *already encode* the position in the
  L1 cache, and can be used directly as "byte read/write-enable" lines on
  the underlying Memory.  no further shifting or offsetting is required
  at the Cache: the bytemaps can be wired *directly* into nmigen Memory
  as byte-enables.

* TwinPartialAddrBitmap takes *pairs* of addresses (plus length bytemaps)
  and identifies which addresses are clashing.  there are extra addresses
  included (actually, excluded) because only the LSBs are checked: this
  is fine.


important to note: LDSTSplitter has clock-synchronous signalling with
combinatorial bypass on it.  in other words, if the downstream LDs/STs
signal that they are ready immediately, LDSTSplitter will *also* respond
immediately (in that same clock cycle).  however if one of the LDs/STs
is not ready yet (cache miss for example) then LDSTSplitter will take
note of that and wait until the downstream sources set their "valid"
flag(s).

errors are simply merged.  this is still TODO properly.
Comment 15 Luke Kenneth Casson Leighton 2020-03-24 13:55:22 GMT
https://ascslab.org/research/briscv/downloads.html

i wonder... i wonder if it would work if we had multiple L1 caches,
with an N-way crossbar on each (routing to be organised by binary
address bits [4:4+log2(N)-1]) then have a L2 cache underneath that?

or, just actually have multiple L1 caches with a coherent L2 cache
underneath?
Comment 16 Luke Kenneth Casson Leighton 2020-03-24 14:22:08 GMT
http://sadve.cs.illinois.edu/Publications/isca00.pdf

this paper describes how to partition a cache so that media workloads
do not completely trash the L1 cache.  we will almost certainly need
to implement this because GPU workloads are READ-PROCESS-WRITE, and
even having the data in the L1 cache *at all* is completely pointless
as it's never going to be accessed more than once.
Comment 17 Jacob Lifshay 2020-03-24 18:00:58 GMT
If we're having multiple L1 caches, we should implement them to appear as if there is a single L1 cache -- no extra instructions are needed to flush one cache before accessing the other. This greatly simplifies the programming model, and Kazan currently depends on the memory having a normal programming model (no cache flushes required for an operation to be visible on the same thread).

For GPU workloads, they still have some accesses (the majority for a tiled architecture -- what Kazan is) that benefit from a L1 cache. basically, it splits the screen into small squares/rectangles (bins) and accumulates a list of rendering operations (binning) then for each square/rectangle (bin) it renders all the operations inside it in sequence, so the pixels are repeatedly stored and loaded until that bin is done rendering.

Texture access also benefits from a cache since nearby pixels are accessed together and often the accesses are overlapping.
Comment 18 Luke Kenneth Casson Leighton 2020-03-24 20:19:12 GMT
(In reply to Jacob Lifshay from comment #17)
> If we're having multiple L1 caches, we should implement them to appear as if
> there is a single L1 cache -- no extra instructions are needed to flush one
> cache before accessing the other,


yes, agreed.  it appears, from the advice on comp.arch, that it is normal to "stripe" the L1 cache, using e.g. bit 4 of the address (when using 128 bit cache lines) to select odd or even halves.

if there are two separate LD/ST buffers, one for each half, this doubles the best-case data throughput at a cost of 40% misses in the general case (due to each half being, duh, half the size).

however yes when it comes to flushing the cache these halves can, must and will appear to be coherent as if they were a single L1 cache.


> For GPU workloads, they still have some accesses (the majority for a tiled
> architecture -- what Kazan is) that benefit from a L1 cache. basically, it
> splits the screen into small squares/rectangles (bins) and accumulates a
> list of rendering operations (binning) then for each square/rectangle (bin)
> it renders all the operations inside it in sequence, so the pixels are
> repeatedly stored and loaded until that bin is done rendering.
> 
> Texture access also benefits from a cache since nearby pixels are accessed
> together and often the accesses are overlapping.

ok.

ok so those 1 MB texture buffers you mentioned before are accessed multiple times in near-succession?
Comment 19 Jacob Lifshay 2020-03-24 20:27:37 GMT
(In reply to Luke Kenneth Casson Leighton from comment #18)
> (In reply to Jacob Lifshay from comment #17)
> > For GPU workloads, they still have some accesses (the majority for a tiled
> > architecture -- what Kazan is) that benefit from a L1 cache. basically, it
> > splits the screen into small squares/rectangles (bins) and accumulates a
> > list of rendering operations (binning) then for each square/rectangle (bin)
> > it renders all the operations inside it in sequence, so the pixels are
> > repeatedly stored and loaded until that bin is done rendering.
> > 
> > Texture access also benefits from a cache since nearby pixels are accessed
> > together and often the accesses are overlapping.
> 
> ok.
> 
> ok so those 1 MB texture buffers you mentioned before are accessed multiple
> times in near-succession?

Yes. Though several different textures can be accessed in an interleaved fashion.
Comment 20 Luke Kenneth Casson Leighton 2020-04-19 10:51:06 BST
the other thing, an aspect of the memory hazard and addr matching matrices:

* loads are separated completely from stores
* stores likewise
* atomic memory operations likewise (by marking them as both LD *and* ST)
* the *batch* order is what is preserved
* only one atomic ever gets sent through (batch of QTY 1).
* due to the expansion of the last 4 bits into a bitmap as part of addr_match, no operations in any one "batch" will have overlapping bytes.  ever.

therefore there is no need - at all - to preserve "order" in any way shape or form (hence why i said that we do not need a "queue" per se, we need more a L0 Cache/buffer).

(this wording - L0 cache/buffer - came out of the discussion on comp.arch with Ivan. 8 entries is likely to be sufficient for us)
 
to emphasise:

* *at no time* will the DMs put out a mixed batch of LDs, STs or Atomics.
* acknowledging, clearing and completing each batch before moving onto the next is absolutely critical.
* except for IO and Atomics (which will only be sent down one at a time anyway) the order of LDs (or STs) can be entirely arbitrary.

also we can get away with full IO memory order preservation by treating them as "Atomic".  this will result in them only being sent one at a time, in strict order.

if the exact same dependency system were used at the L2 layer we could do full SMP TSO with no difficulty.
Comment 21 Luke Kenneth Casson Leighton 2020-04-19 10:57:00 BST
(In reply to Jacob Lifshay from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #18)

> > ok so those 1 MB texture buffers you mentioned before are accessed multiple
> > times in near-succession?
> 
> Yes. Though several different textures can be accessed in an interleaved
> fashion.

ok that's fine.

so what will happen is: because the addr match is on the last 4k (11 bits), anything that matches on the same 4k LSBs, even though the MSBs do not match, gets delayed a couple of cycles and, after the false positive clears through the memory system, the "not-actually-conflicting" LDST's turn will come next.

Mitch's analysis showed that in practice this is such a rare occurrence that the penalty is not worth being concerned about.
Comment 22 Luke Kenneth Casson Leighton 2020-04-19 16:22:14 BST
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/

i just updated this page, a new section "L0 Cache/Buffer".  it contains a diagram i did a couple weeks ago:

https://libre-soc.org/3d_gpu/architecture/6600scoreboard/600x-mem_l0_to_l1_bridge.png

the actual algorithm is incredibly simple:

* priority picker picks one (and only one) address (per L0 cache/buffer)
* for all rows greater than the one picked, match against all MSBs of
  the address, bits 5 and above.
* all matching rows, OR the 16-bit bitmap representing {LSBs 0-3 plus LD/ST-len} 

those ORed bitmaps become the "byte read/write-enable" lines on a given
L1 cache line.

that's it - that's all there is to it.
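
in plain python the whole merge step is roughly this (sketch, invented names; ORing the data is safe because, per the DMs, no two ops in a batch have overlapping bytes):

    def merge_to_cache_line(reqs):
        # reqs: list of (valid, addr, bytemask16, data128) rows for one bank
        for i, (valid, addr, mask, data) in enumerate(reqs):
            if not valid:
                continue                  # priority-pick the first valid row
            out_mask, out_data = mask, data
            for v2, a2, m2, d2 in reqs[i+1:]:
                # bit 4 picks the bank, so compare bits 5 and above only
                if v2 and (a2 >> 5) == (addr >> 5):
                    out_mask |= m2        # ORed byte-enables for the L1 line
                    out_data |= d2
            return out_mask, out_data
        return 0, 0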

the "complex" bit is the N-in/out multiplexing from 16-in on the 8 LD/ST
FunctionUnits (2 ports per FU because 1 is for "normal" requests and the
2nd is for misaligned addresses)

however i just realised that if we can accept an increase in size of the L0
cache/buffer from 8 to 16 entries - or to limit the number of LD/ST Function
Units to 6 - then we can instead simply have one *dedicated* "entry" for
each and every FU, and the entire MASSIVE 16-to-4 multiplexer completely
disappears.

i'll draw it out.

the caveat: we now have a 16-entry (or 12-entry) L0 Cache/Buffer, which
unfortunately requires up to a 16-way CAM on the *ENTIRE* address from
bits 5 and upwards.

it *might* be possible to inherit (yet again) from the addr_match.py classes,
which already has a full triangular comparison of all-against-all in 
bits 4 thru 11.
Comment 23 Luke Kenneth Casson Leighton 2020-04-19 18:46:26 BST
https://libre-soc.org/3d_gpu/mem_l0_to_l1_bridge_v2.png

ok done. it works like this:

* each FU (0-7) produces 2 LD/ST addresses which are broken up like this:

  addr1[0..3] addr1[4] addr1[5..11] addr1[12..48]
  addr2[0..3] addr2[4] addr2[5..11] addr2[12..48]

  where the relationship between addr1 and addr2 is:

       addr1[5..11] + 1 == addr2[5..11]

* addr1[0..3], in combination with the LD/ST len (1/2/3/4) is turned into
  a bytemap mask, 24 bits in length.  this bytemask is broken down into
  two halves:

      bytemask1, bytemask2 = map_addr(addr, LDST_len) [0..15], [16..23]

  i.e. anything that does not fit fully into bytemask1 is a "misaligned"
  LD/ST and the remainder overflows into bytemask2.

* if addr[4..11] == 0b111111111 and bytemask2 is non-zero, this indicates
  a "misaligned major page fault".

   this is a situation that we are *not* going to deal with (and it has
   been catered for in the 3.0B spec)

* all 16 FU LD/ST re-encodings of the (addr, LDST_len) are lined up in a
  table.  this table breaks down, alternating between:

  * FU # aligned     or FU # misaligned
  * addr1[5:11]      or addr2[5:11]
  * addr[12:48] for *BOTH*
  * bytemap1[0:15]   or bytemap2[0:15]
  * data1[0:15]      or data2[0:15]

note that addr[4] is *not* included in this because it is used to select
whether L1 cache bank #0 or #1 is to be used.

the algorithm for merging of LD/STs into *one single L1 cache line* is:

1). With a PriorityPicker find the index (row) of the first valid LD/ST request

2). For all entries after that row, compare Addr[5:11] and Addr[12:48].

3). If "match" on both, OR the byte-mask for that row onto the output.

that's it.  that's really all there is to it.


one thing that's important to note: there are only actually *eight* comparisons of addr[12:48] needed (not 16), because the addr[12:48] is *identical* for every *pair* of rows.

that however is still *seven* potential 36-bit CAM hits (seven lots of 36-bit XOR gates).  which is a hell of a lot.

if we could somehow use the L1 "tag" in place of Addr[12:48], that would save a huge amount of power.  unfortunately, every way i can think of that would get the tag *into* L0 is either equally power-consuming, or results in multi-cycle latency.

if we could reliably use a hash instead, i would suggest it.  however, unfortunately, the consequences of a collision are too detrimental.

the "sensible" option that does not have too detrimental an effect on performance is: reduce the number of LD/ST FUs to 6.  that would result in only 12 rows.
Comment 24 Luke Kenneth Casson Leighton 2020-04-22 11:53:04 BST
i just realised something:

* misaligned LD/STs will create a pair of LD/STs (on each FU) where it
  is *guaranteed* that Addr[4] for the LO-HI request will be mutually
  exclusive.  if Addr[4] is 0 for the LO request of the pair, Addr[4]
  is *guaranteed* to be 1 for the HI request of the pair

* therefore splitting into L0 left/right LO/HI isn't going to work...
  unless we accept that the LO/HI requests actually *cross over*!

FU0 LO -> L0 left  if Addr[4] == 0  else L0 right
FU0 HI -> L0 right if Addr[4] == 0 (its own Addr[4] will be 1) else L0 left

this has the strong disadvantage of not keeping the addresses that
we *know* have the same bits in the upper parts [12:48].

the other way to do it therefore is for the L0 to be only 8 rows
deep, comprising *two* 128-bit entries, i.e. to be integrated
L0-odd and L0-even where previously it was two separate L0 caches.

two L1 caches would then connect to that.

i'll think it through.
Comment 25 Luke Kenneth Casson Leighton 2020-04-22 13:17:07 BST
done.

https://libre-soc.org/3d_gpu/twin_l0_cache_buffer.jpg

the key is that there are only 8 rows (not 16) however each row contains
*two* byte-masks, *two* addr[5:11] and data sets, but *ONE* addr[12:48]
per row.

multiplexing is exclusively limited to 2-in 2-out, on a per-row basis.
each FU having two ports need *only* perform multiplexing on its *own row*
(not to any other rows), using addr[4] to select which bank the two data
sets are to go into.

the thing is that because the two data sets from each FU are consecutive
addresses (one aligned, one mis-aligned, both coming from the exact same
LD/ST), we *know* that they will both go simultaneously into the row:
it's just a matter of which one goes into which bank.

so we need a 2-in 2-out crossbar, no multiplexing, no arbitration or
delay-gates needed.
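
a rough sketch of that per-row routing decision (plain python, invented names):

    def route_pair(lo_req, hi_req):
        # lo_req/hi_req: (addr, bytemask, data) from one FU; hi may be None.
        # Addr[4] of the two halves is guaranteed mutually exclusive, so
        # they never contend for the same bank of the row.
        even, odd = None, None
        for req in (lo_req, hi_req):
            if req is None:
                continue
            if (req[0] >> 4) & 1:    # addr[4] selects odd or even bank
                odd = req
            else:
                even = req
        return even, odd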

i'll do the writeup now
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
Comment 26 Luke Kenneth Casson Leighton 2020-04-25 13:40:46 BST
Writeup of the interface / API needed for LDST to talk to the LDST Computational Unit.  Additional signals such as length, type, atomic etc. will come through as well.
the signals below are listed in time-sequence order

(edit this comment as necessary):

LD
--

* out: busy
* in: is_ld
* in: address
* in: len (1/2/4/8)
* in: go_addr
* in: go_die
* out: addr_ok (no exception will occur)
* out: addr_exc (exception type)

* out: ld_data
* out: ld_data_ok (to be raised for 1 cycle)


ST
--

* out: busy
* in: is_st
* in: address
* in: len (1/2/4/8)
* in: go_addr
* in: go_die
* out: addr_ok (no exception will occur)
* out: addr_exc (exception type)

* in: st_data
* in: go_st (raised for 1 cycle, must complete)

Notes
-----

* both LD/ST are identical up to address ok/exception point

* LDSTCompUnit must wait for busy signal to deassert before proceeding
  with signalling go_addr

* go_die pulls a hard reset (at any time), cancelling any outstanding LD, ST or address validation.

* addr is valid and will not change from the time that go_addr is asserted until a reset occurs.  (therefore addr does not need to be internally latched)

* within the L0 Cache/Buffer, before address validation (and TLB, and L1 cache line reservation) have occurred, neither LD nor ST data transfer may take place.

* beyond addr_ok the operation MUST be successful.  L1 Cache, TLB etc MUST all have been done

* after addr_ok sent, LD data can be placed as soon as ready on output bus without requiring "permission" from a go_ld signal.

* for a LD, LDSTCompUnit will wait for EITHER addr exception OR a ld_data_ok.

* on any address-related exception, addr_exc is asserted HI for a single-cycle signal: both LDSTCompUnit and L0 Cache/Buffer go back to reset.

* ld_data_ok MUST NOT be asserted if addr_exc was asserted

* ld_data_ok must remain asserted for 1 clock and ld_data must be valid on that same clock.  ld_data latching is the caller's responsibility

* L0 Cache/Buffer goes back to reset once ld_data_ok goes low again.

* ST on the other hand must have the ADDR checking (and L1 cache line loading) done separate from the write.

* ST therefore has two phases: addr ok (or not) followed by waiting for permission to carry out the actual store.

* the ST data in *will* be valid at the same clock where go_st is raised.

* go_st will *not* be raised by the LDSTCompUnit until *after* addr_ok/exc

* again: no exceptions can occur after the addr_ok signal is raised.

* go_st will remain asserted for one cycle. st_data *MUST* be accepted (and stored)

* L0 Cache/Buffer must return to reset after go_st goes back to low.
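
a minimal nmigen sketch of that signal set, purely for reference (field names taken from the lists above, with _i/_o suffixes added here; the actual PortInterface class in the soc repository may differ, and the widths are assumptions):

    from nmigen import Record

    def make_port_interface(regwid=64, addrwid=48, name="ldst_port0"):
        # one LD/ST port as described above
        return Record([("is_ld_i",      1),
                       ("is_st_i",      1),
                       ("busy_o",       1),
                       ("go_addr_i",    1),
                       ("go_die_i",     1),
                       ("addr_i",       addrwid),
                       ("len_i",        4),       # 1/2/4/8 bytes
                       ("addr_ok_o",    1),       # no exception will occur
                       ("addr_exc_o",   1),       # exception (type simplified)
                       ("ld_data_o",    regwid),
                       ("ld_data_ok_o", 1),       # raised for exactly 1 cycle
                       ("st_data_i",    regwid),
                       ("go_st_i",      1)],      # 1 cycle, must complete
                      name=name)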
Comment 27 Luke Kenneth Casson Leighton 2020-04-26 12:28:04 BST
i may need to implement a very basic (but working) version of the API in
http://bugs.libre-riscv.org/show_bug.cgi?id=216#c26
effectively a thin "wrapper" around nmigen Memory rdport and wrport.

it's not quite a perfect match for nmigen rdport/wrport as the
address phase needs latching in, separately from how r/w data is handled.

i'll make sure that the (decoded) opcode is passed over because it
contains crucial information (not just ld/st-length).
Comment 28 Luke Kenneth Casson Leighton 2020-04-27 12:56:51 BST
i've added the "PortInterface" which connects between the LDST Computational
Unit and the L0 Cache/Buffer.

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=80f66556795c80603cf7b33140adc27e01e37155

this is the "interface" (API) which is implemented in comment #26 and
i put here as well:
https://libre-soc.org/3d_gpu/architecture/memory_and_cache/

i've also added a "subset" of the POWER decode flags that come directly
from decoding the instruction, only passing those that are relevant to
actual LD/ST operations:

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=66381f20d789aaf74511c2d6a892093940f34ce5

if there's anything missing from that it means that first the actual
instruction needs to be added to the POWER decoder *first*.
Comment 29 Luke Kenneth Casson Leighton 2020-04-29 22:42:57 BST
jacob on looking at minerva they did something which i think would be really useful to copy.

they created "cacheless" memory classes and "cached" versions conforming to the same interface.

this allows an incremental approach where testing of one component is not critically dependent on another that hasn't been written yet.
Comment 30 Luke Kenneth Casson Leighton 2020-05-01 14:27:55 BST
https://www.youtube.com/watch?v=6Yiyw4abJpE

a walkthrough video which explains the components (the ones that
already exist) and explains what's needed.  crucially, the existing
components provide certain guarantees which mean that, by the time
it comes to the L0 Cache/Buffer, what it has to do is a ridiculously
simple job.

i forgot to mention: the "selection", after the PriorityPicker, of the
units which are "above" that first-picked row - we actually already
have some code which does that: michael wrote it, for the LE/GE/EQ
"PartitionedSignal" comparators.

the class is called "RippleLSB":
https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cmp/ripple.py;h=032512047f2e58c3ce50bc869e4a064738647943;hb=cec6f51e8b2a1ba2c2f83965be7219885ea163f0#l13

this can be moved to nmutil.
Comment 31 Luke Kenneth Casson Leighton 2020-05-01 14:33:18 BST
(In reply to Luke Kenneth Casson Leighton from comment #30)

> the class is called "RippleLSB":
> https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cmp/
> ripple.py;h=032512047f2e58c3ce50bc869e4a064738647943;
> hb=cec6f51e8b2a1ba2c2f83965be7219885ea163f0#l13
> 
> this can be moved to nmutil.

done

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cmp/ripple.py;hb=HEAD
Comment 32 Luke Kenneth Casson Leighton 2020-05-04 11:21:30 BST
jacob i've added documentation of the requirements of PortInterface:
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=6404cf302734700514e7553bdf6c5a5df5badeb8

i'm doing a "dummy" one at the moment which will implement this API,
only allowing the one (and only one) Memory read/write per cycle.
this makes it clear exactly where the "merging" needs to take place.

putting the (twin/quad) L1 Caches in place once the L0CacheBuffer
has a single-cycle 128-bit-wide interface should be dead straightforward.
Comment 33 Luke Kenneth Casson Leighton 2020-05-04 21:11:22 BST
okaaay, it's messy but functional, and, importantly, illustrates the API, jacob.

i've not put in actual LD/ST width yet, so the tests "pass" based on storing
a 16-bit value into a hard-coded-16-bit-wide "Memory", the important thing
is the signalling and API conformance, illustrating the timing that's
required (and expected) by LDSTCompUnit.

at least this allows me to move on to the (new, ridiculously-redrawn)
LDSTCompUnit.

we'll need to coordinate when it comes to adding twin ports (for misaligned)
however it may just be ok there to assume the misaligned port is never
used by LDSTCompUnit as the 2nd port is being added to L0CacheBuffer, get
the unit tests working for L0CacheBuffer, then add it to LDSTCompUnit, and
*then* add unit tests to LDSTCompUnit that try using it.
Comment 34 Jacob Lifshay 2020-05-06 02:09:52 BST
What do you think of the idea to write a single-op-at-a-time memory interface with all the caches & stuff to work with the rest of the SoC?

it will be much simpler to get right and not take as much time, then, after that works, we can work on the much more complex multi-op version. That way, we will have something that works even if the multi-op version doesn't end up working in time for the october tape out.
Comment 35 Jacob Lifshay 2020-05-06 02:11:51 BST
The proposed single-op-at-a-time interface would work through PortInterface, since you already wrote a lot of the code needed to work with it.
Comment 36 Luke Kenneth Casson Leighton 2020-05-06 09:29:36 BST
(In reply to Jacob Lifshay from comment #35)
> The proposed single-op-at-a-time interface would work through PortInterface,
> since you already wrote a lot of the code needed to work with it.

the code i wrote is sufficient to get me to the point where i can test LDSTCompUnit.

it already does exactly what you propose.

as a concept, it is "technically correct" and would actually "not corrupt memory"

however its performance will be such that it is, ultimately, a waste of effort.

adding dual 128 bit memory interfaces would be 10% utilised, because they would receive only one 8-64 bit request.

we might as well just use minerva's L1 code directly, in full, with only one interface, the only change being to expand it from 32 bit to 64.

in addition, misaligned requests would not be handled at the hardware level, leaving us non-compliant with the POWER spec.

bottom line is that the current code is for testing purposes and prototyping.  its usefulness in particular is as a "mock test" i.e. to ensure that a single LDSTCompUnit "works" and that is why i wrote it, *not* to actually put it into a production chip.

it can in no way be used to *properly* test the parallel capability of the engine because it will only do one request at a time.

if we want to be taken seriously and want the performance of a GPU we need the twin buffer implemented in the next week.

i will have the LDSTCompUnit done by then and we will need to begin testing of parallel LD/STs.
Comment 37 Luke Kenneth Casson Leighton 2020-05-06 09:42:34 BST
(In reply to Jacob Lifshay from comment #34)
> What do you think of the idea to write a single-op-at-a-time memory
> interface with all the caches & stuff to work with the rest of the SoC?

as intermediary steps this is a good idea.   the code i wrote yesterday already does exactly that (and is why i did it. it "works". performance sucks.)

however stopping at a level that does not meet the requirements of keeping those twin 128 bit L1 caches fully occupied on every cycle is not.


 
> it will be much simpler to get right and not take as much time, then, after
> that works, we can work on the much more complex multi-op version. That way,
> we will have something that works even if the multi-op version doesn't end
> up working in time for the october tape out.

yes.

although it will suck.

strictly speaking, just as in minerva, if you look at the code there exists "cache-bypassed" LoadStore code.

by using that code directly we also have the option to switch off the L1 cache, making unit testing much easier.

and to emphasise, again (i have mentioned this about four times but have not received a response): the minerva L1 code and its associated LDST compile-time-reconfigurable Interface do *not* need duplicating.  they are good enough as-is and can simply be expanded to 64 bit then quadrupled up, or expanded to 128 bit and doubled up, with 128-to-64-bit Wishbone Arbiters dropped on the front.

with only two well-defined interfaces to connect to (multiple PortInterfaces on one side and multiple minerva LDSTCache interfaces on the other) this task is much simpler and more straightforward than it seems.
Comment 38 Luke Kenneth Casson Leighton 2020-05-06 13:39:24 BST
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=65a2218914d3f9a82d161c4cb946e5332d49b666

i've added a TODO class "DualPortSplitter" - actually it should
probably kinda be that LDSTSplitter should be adapted to conform
to PortInterface.

can you do that task, jacob? it's a simple self-contained task that's
an important step - no need to examine any of the other code.
Comment 39 Luke Kenneth Casson Leighton 2020-05-23 16:17:16 BST
see 
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-May/007196.html


a new class in soc/experiment/l0_cache.py is needed which
implements that core algorithm: the one that uses a PriorityPicker to
select one of the "active" LD/STs, then use that to OR-select an
output.

please assume the following inputs:

* an array of 8 Signals of length 8 bits which tell you which
addresses have matched against which other addresses.
  ( we call this the "address-match" array)
* an array of 8 Records containing {16 bytes data, 16 "enable" lines}

and the following output:

* a Record containing {16 bytes data, 16 "enable" lines}

in the address-match array, you want to find the first non-zero line.
once you have that index, use it to OR the 16 bytes and 16 "enables"
from the 8 records, into the output.
Comment 40 Luke Kenneth Casson Leighton 2020-05-23 16:23:16 BST
Tobias: i've added a stub class "DataMerger".  please can you focus on implementing
that.  it should be approximately... 30 lines of code, maximum.  start by committing
a class with the Record Layout to be used by the input (Nx) and output (1x)
Comment 41 Luke Kenneth Casson Leighton 2020-05-24 20:29:45 BST
On Sun, May 24, 2020 at 8:23 PM Tobias Platen <libre-soc@platen-software.de> wrote:
>
> this evening:
>
> I began working on the elaboratable function of the DataMerger class, which I still haven't commited yet.
> If all bits in addr_array_i are zero the output will be zero too as nothing needs to be merged.

that would be correct.  it would indicate that the LD/ST unit is idle.


> I'm now asking myself whether the interface is correct, and how the DataMerger is used.

it is used as in the diagram in the video.  it is also the piece marked
"Priority Picker" in this diagram:

https://libre-soc.org/3d_gpu/180nm_single_core_testasic_memlayout.jpg
Comment 42 Luke Kenneth Casson Leighton 2020-05-24 20:39:25 BST
(In reply to Luke Kenneth Casson Leighton from comment #41)

> > I began working on the elaboratable function of the DataMerger class, which I still haven't commited yet.
> > If all bits in addr_array_i are zero the output will be zero too as nothing needs to be merged.
> 
> that would be correct.  it would indicate that the LD/ST unit is idle.

hmmm... the 2D NxN addr_array_i indicates address matches.  if we assume that, in
the array, the diagonal (X=Y i.e. addr[n,n]) will always be set for any given
LD/ST, then there will always be one active LD/ST.

this is kind-of "the address matches against itself".

thus, if there is only one LD/ST, and it is from unit 5, then addr_array_i[5,5]
will be TRUE

thus, yes, we can assume that when all bits in addr_array_i are zero, the LD/ST
unit is idle.
Comment 43 Luke Kenneth Casson Leighton 2020-05-25 11:36:10 BST
Created attachment 63 [details]
over-run 80 chars

tobias, hi, code-review:

* please keep to under 80 chars.  the best way to do this is to make sure
  that the editor window is only 80 chars wide.

* please do not make multi-purpose commits.  the commit contains about a
  hundred lines of undocumented whitespace changes... *AND* contains
  code changes

* range(0, array_size) does not need the 0,.  just range(array_size)

* the 2D array can actually be done as a 1D array of Signals of length N

        for i in range(0, array_size):
            ul2 = []
            for j in range(0, array_size):
                ul2.append(Signal())

   can just be:

        for i in range(array_size):
            ul2.append(Signal())

* Signals all need to be reset_less=True (with very few exceptions)

* Signals in loops do not end up being named properly ($1, $2) which
  gets really annoying.

  add a parameter name="addr_match_%d" % j

* the traditional industry-standard short-name in hardware for "byte enable"
  is just "en".

        layout = (('data', 128),
                  ('en', 16)
                  )
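
putting those points together, the loop would end up something like this (sketch only):

    from nmigen import Signal

    def make_addr_match(array_size):
        # 1D array of N-bit match signals, each named and reset_less
        return [Signal(array_size, reset_less=True,
                       name="addr_match_%d" % i)
                for i in range(array_size)]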
Comment 44 Luke Kenneth Casson Leighton 2020-05-25 14:55:49 BST
excellent, tobias.  i added some docstrings, please do git pull and carry on.
i just had to redo the docstring i just wrote after noticing (only now) several
days of using a MacOS terminal at 86x30 not 80x30.  sigh :)

anyway: the docstring should now show clearly what's needed.

commit db13193e20441c6988e7a1b3969979461136eca3
Author: Tobias Platen <tplaten@posteo.de>
Date:   Mon May 25 15:41:49 2020 +0200

    refactoring (see #216 Comment 43)
Comment 45 Luke Kenneth Casson Leighton 2020-05-26 13:50:12 BST
tobias, i just noticed in soc.regfile.regfile.py, this code:

    def elaborate(self, platform):
        m = Module()
        for (regs, p) in self._rdports:
            #print (p)
            ror = treereduce(list(regs))
            m.d.comb += p.data_o.eq(ror)

that's pretty much exactly what is needed for DataMerger.  the ORing of
all of the data together (from only those with a "match" in the selected row,
though), is "it".

however bear in mind, both the enable *and* data signals need to be merged
out.  you may need to do them separately: one treereduce for the en and another
for the data, both "optionally" controlled by the match on the selected row.
Comment 46 Luke Kenneth Casson Leighton 2020-05-27 12:10:54 BST
tobias, the DataMerger class looks good as a first cut: it looks like it
does the job.

couple of things:

----

+            for j in range(self.addr):

this should be self.array_size



----

+                with m.If(self.addr_match_i[j]>0):

instead of > 0, bool() can (should) be used:

                with m.If(self.addr_match_i[j].bool()):


-----

+                with m.If(self.addr_match_i[idx][j] && valid):

ah.  in nmigen, you can't use boolean operators (C-style && or || are not
python syntax, and python's own "and"/"or" cannot be overloaded) because
there is no operator-overload for "bool".  a PEP was actually raised
about this and, rather short-sightedly, turned down.

you therefore have to use the bit-wise operators, and make sure
that the things that you are comparing are only 1 bit wide.

                with m.If(self.addr_match_i[idx][j] & valid):

------

+            self.data_o.eq(0)

this line is not needed.  combinatorial circuits default to zero unless
you set some of the Signal initialisation arguments.



-----

+                    self.data_o.eq(self.data_i[j]|self.data_o)

this isn't python programming, unfortunately: you can't treat Signals in
a loop as "sequentially-accumulating variables".

this is why i said to use treereduce.  or, actually, now you can import
ortreereduce, from soc.regfile.regfile.

so do this instead:

            l = []
            for j in range(self.array_size):
                select = self.addr_match_i[idx][j] & valid
                l.append(Mux(select, self.data_i[j], 0))
            self.data_o.eq(ortreereduce(l))

see how that works? because hardware is effectively "statically-assigned
variables", you can't really do dynamic-accumulation.  or... you can, but
it will create a massive chain of gates and the completion time will be
awful.

:)

so, use a Mux-in-a-loop, and tree-reduce them.  strictly speaking we
could hypothetically use the 2nd parameter of nmutil.utils.treereduce
to actually do the Mux there, reduce things down to maybe one or two
lines of code, however that's getting obtuse and "too clever" :)
too clever means the code becomes unreadable.
Comment 47 Luke Kenneth Casson Leighton 2020-05-28 13:35:44 BST
hi tobias, saw the latest commit, getting there.

------
            comb, sync = m.d.comb, m.d.sync

sync is not needed


-------
            #(1) pick a row
            m.submodules.pick = pick = PriorityEncoder(self.array_size)
            for j in range(self.array_size):
                with m.If(self.addr_match_i[j].bool()):
                    pick.i.eq(pick.i||(1<<j))

again: as previous comment, this is not python (and the comb += is missing
at the front, which means that the nmigen AST is discarded, doing nothing)

everything is parallel hardware, not sequential python.  the nmigen AST
captures that parallel hardware definition.

so, here, we want each bit of the PriorityPicker input to connect to one
bit of the test of whether a line in addr_match_i is empty/non-empty.
therefore, we need to assign addr_match_i[j].bool() to each bit of pick.i:

            # assign each priority-picker input bit the test whether the
            # address match line is non-empty.
            for j in range(self.array_size):
                comb += pick.i[j].eq(self.addr_match_i[j].bool())

simple as that.



-------
            self.data_o.eq(ortreereduce(l))


import is missing.


-----

fix those and it'll be time to move on to a unit test.
Comment 48 Luke Kenneth Casson Leighton 2020-05-28 13:38:44 BST
----------

hmmm...

            l = []
            for j in range(self.array_size):
                select = self.addr_match_i[idx][j] & valid
                l.append(Mux(select, self.data_i[j], 0))
            self.data_o.eq(ortreereduce(l))

this would be better / clearer:

            with m.If(valid):
                l = []
                for j in range(self.array_size):
                    select = self.addr_match_i[idx][j]
                    l.append(Mux(select, self.data_i[j], 0))
                self.data_o.eq(ortreereduce(l))


-----

return m is missing.
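
for reference, pulling the corrections from comments 46-48 together, the class would end up roughly like this.  this is a sketch only, under the assumption that addr_match_i / data_i / data_o look as described in comment 39; the real DataMerger in soc/experiment/l0_cache.py is the authority, and its ortreereduce helper (see comment 50) replaces the plain reduce used here:

    from functools import reduce
    from operator import or_
    from nmigen import Module, Signal, Record, Array, Mux
    from nmigen.lib.coding import PriorityEncoder

    class DataMergerSketch:
        def __init__(self, array_size=8):
            self.array_size = array_size
            # NxN address-match matrix; an Array so that the picked row
            # index (a Signal) can select one of its rows
            self.addr_match_i = Array(Signal(array_size, reset_less=True,
                                             name="addr_match_%d" % i)
                                      for i in range(array_size))
            layout = (('data', 128), ('en', 16))
            self.data_i = Array(Record(layout, name="data_i_%d" % i)
                                for i in range(array_size))
            self.data_o = Record(layout, name="data_o")

        def elaborate(self, platform):
            m = Module()
            comb = m.d.comb
            # (1) pick the first row whose address-match line is non-empty
            m.submodules.pick = pick = PriorityEncoder(self.array_size)
            for j in range(self.array_size):
                comb += pick.i[j].eq(self.addr_match_i[j].bool())
            valid = ~pick.n
            idx = pick.o
            # (2) OR together every record selected by the picked row
            with m.If(valid):
                l = []
                for j in range(self.array_size):
                    select = self.addr_match_i[idx][j]
                    l.append(Mux(select, self.data_i[j], 0))
                comb += self.data_o.eq(reduce(or_, l))
            return m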
Comment 49 Tobias Platen 2020-05-28 14:07:30 BST
I've added a unittest for DataMerger and I get the following import error:
cannot import name 'treereduce' from 'nmutil.util'
I git pulled the nmutil repo but there were no changes.
Comment 50 Luke Kenneth Casson Leighton 2020-05-28 14:26:22 BST
(In reply to Tobias Platen from comment #49)
> I've added a unittest for DataMerger and I get the following import error:
> cannot import name 'treereduce' from 'nmutil.util'
> I git pulled the nmutil repo but there were no changes.

see earlier comment.

"this is why i said to use treereduce.  or, actually, now you can import
ortreereduce, from soc.regfile.regfile."
Comment 51 Luke Kenneth Casson Leighton 2020-05-28 21:34:10 BST
(In reply to Luke Kenneth Casson Leighton from comment #50)

> "this is why i said to use treereduce.  or, actually, now you can import
> ortreereduce, from soc.regfile.regfile."

saw commit d42161569bfaad5f0703f67d41634fd460719c1b

excellent.  (btw remember to do whitespace and actual code-changes separately)

(In reply to Luke Kenneth Casson Leighton from comment #48)

> -----
> 
> return m is missing.

return m is still missing.  this will prevent the module from working,
because the AST - which was constructed by the elaborate function - is
entirely blank (None), due to the return result being implicitly None.
Comment 52 Luke Kenneth Casson Leighton 2020-05-29 17:17:37 BST
tobias: the DataMerger class is coming along well.  now that the unit
test passes (doesn't break anything) feel free to enable it by default,
and then use it to confirm, incrementally, from that point on, that
what you commit will now at least "not break anything" :)

btw do give each DataMergerRecord() a unique name.  this so that when
people inspect the graphviz, each DataMergerRecord will have a prefix
{name}_data etc. by which they can be identified correctly with the
python source code.

next bit: unit test.  after that: formal proof.
Comment 53 Luke Kenneth Casson Leighton 2020-05-30 20:40:17 BST
commit 23b682daa011ace5cfbfbe2b60ece7369cbe403c
Author: Tobias Platen <tplaten@posteo.de>
Date:   Sat May 30 21:11:02 2020 +0200

    unit test for DataMerger

looks good, tobias.  i added some use of "Settle()" which should be
used (as we just learned thanks to Cesar highlighting it) after any
combinatorial logic changes, rather than doing a full one-cycle "yield"
Comment 54 Tobias Platen 2020-06-02 19:26:16 BST
Before running formal verification, I first had to upgrade nmigen, because my code works using records. When using an old version of nmigen, I got the following error. 
> Object (rec <unnamed> data en) cannot be used as a key in value collections

The second one is a bug in nmigen when it invokes sby with an empty working directory, which is the current directory in this case. I set it to "." with the following lines added to the code.

+        if(len(spec_dir)==0):
+            spec_dir="."
Comment 55 Luke Kenneth Casson Leighton 2020-06-04 17:49:19 BST
proof_datamerger.py:

                for b in range(0,8):

just range(8).  there is no need to specify a start range of 0

            comb += dut.addr_array_i[j].eq(AnySeq(dut.array_size))
            comb += dut.data_i[j].eq(AnySeq(16+128))

AnySeq is for synchronous data where the value changes on every clock.
use AnyConst instead which will not change.
Comment 56 Tobias Platen 2020-06-05 20:17:04 BST
fixed
Comment 57 Luke Kenneth Casson Leighton 2020-06-05 20:40:37 BST
(In reply to Tobias Platen from comment #56)
> fixed

fantastic.  you'll see i added some comments in the DualPortInterface.  i think
what might need to be done is:

* to have the the incoming PortInterface.oper_i.data_len
  be one unary bit only (0b1, 0b10, 0b100, 0b1000)

* however for the outgoing ones, due to the "misalignment", they
  can be set to 0b11, 0b111, 0b101, and so on.

actually this is ok! :)  it is a 4-bit field after all.

so basically:

* one LD/ST comes in
* it will be routed to EITHER left (bit 4 ==0) OR right (bit 4==1)
* misalignment possibility results in a *SPLIT* LD/ST that **ALWAYS**
  goes out the **OTHER** port.
* if both ports are to be used, the data from both has to be
  recombined.

now, there are a few things that are different from LDSTSplitter:

1) LDSTSplitter rather hilariously treats the data as 16-bit, not 16-byte.
   this is easily fixable by using Repl on the mask and multiplying by 8
   and so on

2) LDSTSplitter does *NOT* do the correct bit-4 routing.  it is HARDCODED
   to ALWAYS assign non-misaligned data to port 0

            with m.If(mask2 == mzero):
                comb += self.valid_o.eq(self.sld_valid_o[0]) <-- wrong

   fixing that is slightly complicated in that ld1 and ld2 need to dynamically
   swap over.

3) LDSTSplitter although it creates the mask correctly (using LenExpand)
   does not "re-create" the *binary* version of that, which is the key
   information that needs to go back into PortInterface.oper_i.data_len
   along with the "altered" addresses.
Comment 58 Luke Kenneth Casson Leighton 2020-06-08 00:42:53 BST
worked out that this:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/minerva/wishbone.py;hb=HEAD

is exactly what we need to do the arbitration of odd/even dual ports.
actually there could be four 64-bit wishbone buses, each seeking
arbitration onto the one 64-bit memory bus:

* LSB 64-bit half of the 128-bit ODD L0CacheBuffer Port interface
* MSB 64-bit half of the 128-bit ODD L0CacheBuffer Port interface
* LSB 64-bit half of the 128-bit EVEN L0CacheBuffer Port interface
* MSB 64-bit half of the 128-bit EVEN L0CacheBuffer Port interface

the detection as to whether any given one of these should be enabled
is simply done by ORing the relevent sections of the 16-bit byte-enable
column line (wen / ren).  this comes directly from the "en" lines
from DataMerger.

* half (8) of the Datamerger byte-enable lines will go to enabling
  one of the 64-bit Wishbone interfaces
* the other half (8) will go to the other one.
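
the enable-derivation itself is tiny - something like this nmigen fragment (sketch, invented names):

    from nmigen import Module, Signal

    def derive_bus_enables(m, en_even, en_odd):
        # en_even/en_odd: the 16 byte-enable lines from DataMerger for the
        # even and odd L0 ports; each half ORed down to one bus-enable
        comb = m.d.comb
        even_lsb = Signal(name="even_lsb_en")
        even_msb = Signal(name="even_msb_en")
        odd_lsb  = Signal(name="odd_lsb_en")
        odd_msb  = Signal(name="odd_msb_en")
        comb += even_lsb.eq(en_even[0:8].bool())   # any byte in the low half
        comb += even_msb.eq(en_even[8:16].bool())
        comb += odd_lsb.eq(en_odd[0:8].bool())
        comb += odd_msb.eq(en_odd[8:16].bool())
        return even_lsb, even_msb, odd_lsb, odd_msb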
Comment 59 Luke Kenneth Casson Leighton 2020-06-22 12:01:39 BST
i've moved the responsibility for byte-reversal from L0CacheBuffer into LDSTCompUnit.
this means that PortInterface can be simplified (no need to have the full
CompLDSTOpSubset) and it begins to look much more like minerva LoadStoreInterface
Comment 60 Luke Kenneth Casson Leighton 2020-06-22 13:13:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #59)

> this means that PortInterface can be simplified (no need to have the full
> CompLDSTOpSubset) and it begins to look much more like minerva
> LoadStoreInterface

[master bafa9f0] remove CompLDSTOpSubset, replace with just data_len.

done.  quite straightforward (very quick) as there is not much information
needed to be passed over.

whilst LoadStoreInterface has x_mask (byte-enable lines) and expects data
to be wishbone-bus-aligned, i'm recommending that we keep PortInterface
binary-encoded and process it on-demand through LenExpand to create the
x_mask (byte-enable lines).

the reason is that it's a lot more extra wires to pass around a 16-bit expanded
bit-mask, and we have a large number of PortInterfaces to route.