Bug 215 - evaluate minerva for base in libre-soc
Summary: evaluate minerva for base in libre-soc
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks:
 
Reported: 2020-03-11 14:31 GMT by Luke Kenneth Casson Leighton
Modified: 2020-06-30 19:46 BST
CC List: 2 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2020-03-11 14:31:19 GMT
https://github.com/lambdaconcept/minerva

looks really good: clean design, uses wishbone to the L1 caches and
to the main core.  this will help when it comes to adding SMP down
the line.  the decoder is also very clean.

the only non-obvious bit is how the core works.  source/sink are
created in class "Stage" (which is fine), and the inter-stage
transfer layouts are the same (the source on the previous stage
equals the sink on the next stage), so that part is clear: the bit
that's not obvious is what-gets-connected-to-what.

i think this is because "sinks" are set up at the start of core.py
whilst "sources" are set up much further down.  in the libre-soc
pipeline code, the layouts are done via objects, and the modules
"take care" of placing data into the "output" inherently.  here,
it's messy, and the separation makes understanding difficult.
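
to make the connection pattern concrete, here's a rough sketch in
nmigen (*not* minerva's actual code: class names, layouts and signal
names are made up) of the source/sink idea being described:

    from nmigen import Elaboratable, Module, Record

    XFER_LAYOUT = [("pc", 32), ("insn", 32), ("valid", 1)]

    class Stage:
        def __init__(self, sink_layout=None, source_layout=None):
            if sink_layout is not None:
                self.sink = Record(sink_layout)      # data in from previous stage
            if source_layout is not None:
                self.source = Record(source_layout)  # data out to next stage

    class Core(Elaboratable):
        def __init__(self):
            self.fetch = Stage(source_layout=XFER_LAYOUT)
            self.decode = Stage(XFER_LAYOUT, XFER_LAYOUT)
            self.execute = Stage(sink_layout=XFER_LAYOUT)

        def elaborate(self, platform):
            m = Module()
            # the "what-connects-to-what" part: each stage's source
            # feeds the next stage's sink, and because the layouts are
            # identical the .eq() lines up field-for-field.
            m.d.comb += [
                self.decode.sink.eq(self.fetch.source),
                self.execute.sink.eq(self.decode.source),
            ]
            return m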

other than that, though, it's pretty damn good.
Comment 1 Jacob Lifshay 2020-03-11 14:36:09 GMT
will need adjusting to make the datapath between the core and L1 wider -- 64-bit at the very least, 128-bit or wider preferred.
Comment 2 Luke Kenneth Casson Leighton 2020-03-11 14:50:38 GMT
(In reply to Jacob Lifshay from comment #1)
> will need adjusting to make the datapath between the core and L1 wider --
> 64-bit at the very least, 128-bit or wider preferred.

yes.  four LD/STs @ 32-bit is the minimum viable data width to the L1 cache,
realistically.  preferably four LD/STs @ 64 bit.

this is a monster we're designing!

address widths also need to be updated: i'm going to suggest parameterising
them because we might not have time to do an MMU (compliant with the
POWER ISA), just have to see how it goes.
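
as a rough illustration (hypothetical, not existing code) of what
parameterising the widths might look like in nmigen:

    from nmigen import Elaboratable, Module, Signal

    class LoadStoreInterface(Elaboratable):
        # widths are constructor parameters rather than hard-coded, so
        # the address width can be decided later (or reduced if there's
        # no MMU) without touching the rest of the core.  names made up.
        def __init__(self, addr_wid=48, data_wid=64, n_ldsts=4):
            self.addrs = [Signal(addr_wid, name="addr%d" % i)
                          for i in range(n_ldsts)]
            self.rddata = [Signal(data_wid, name="rddata%d" % i)
                           for i in range(n_ldsts)]
            self.wrdata = [Signal(data_wid, name="wrdata%d" % i)
                           for i in range(n_ldsts)]

        def elaborate(self, platform):
            m = Module()
            # ... actual LD/ST and cache logic would go here ...
            return m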
Comment 3 Jacob Lifshay 2020-03-11 15:06:54 GMT
do note that compressed texture decoding needs to be able to load 128-bit wide values (a single compressed texture block), so our scheduling circuitry should be designed to support that. They should always be aligned, so we won't need to worry about that in the realignment network.
Comment 4 Luke Kenneth Casson Leighton 2020-03-11 15:37:46 GMT
(In reply to Jacob Lifshay from comment #3)
> do note that compressed texture decoding needs to be able to load 128-bit
> wide values (a single compressed texture block), 

okaaay.

> so our scheduling circuitry
> should be designed to support that. They should always be aligned, so we
> won't need to worry about that in the realignment network.

whew.

so that's 128-bit-wide for _textures_... that's on the *load* side.  are there any simultaneous (overlapping) "store" requirements? are the code-loops tight enough to require simultaneous 128-bit LD *and* 128-bit ST?
Comment 5 Jacob Lifshay 2020-03-11 19:25:51 GMT
(In reply to Luke Kenneth Casson Leighton from comment #4)
> (In reply to Jacob Lifshay from comment #3)
> > do note that compressed texture decoding needs to be able to load 128-bit
> > wide values (a single compressed texture block), 
> 
> okaaay.
> 
> > so our scheduling circuitry
> > should be designed to support that. They should always be aligned, so we
> > won't need to worry about that in the realignment network.
> 
> whew.
> 
> so that's 128-bit-wide for _textures_... that's on the *load* side.  are
> there any simultaneous (overlapping) "store" requirements? are the
> code-loops tight enough to require simultaneous 128-bit LD *and* 128-bit ST?

yes and no -- there is code that will benefit from simultaneous loads and stores (memcpy and probably most other code that has both loads and stores in a loop); however, it isn't strictly necessary.

It will be highly beneficial to support multiple 8, 16, 32, or 64-bit loads to a single cache line all completing simultaneously, independently of their alignment within that cache line. The same goes for misaligned loads that cross cache lines (and possibly page boundaries), though those don't need to complete in a single cache access.

All the above also applies to stores, though they can be a little slower since they are less common.

I realize that will require a really big realignment network; however, I think the performance advantages are worth it.
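
As a tiny software model of the behaviour being asked for (not RTL; the
64-byte line size and function name are just for illustration):

    LINE_BYTES = 64

    def service_loads(cache_line, loads):
        """cache_line: one 64-byte line; loads: list of (offset, size_in_bytes)."""
        assert len(cache_line) == LINE_BYTES
        results = []
        for offset, size in loads:
            assert 0 <= offset and offset + size <= LINE_BYTES, "must not cross the line"
            # in hardware this byte-select is the realignment network;
            # here it is just a slice.
            results.append(int.from_bytes(cache_line[offset:offset + size], "little"))
        return results

    # e.g. an 8-bit, a 16-bit and a misaligned 64-bit load, same line, same cycle:
    line = bytes(range(64))
    print(service_loads(line, [(3, 1), (10, 2), (17, 8)]))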

For a scheduling algorithm for loads that are ready to run (the 6600-style scheduler has sent them to the load/store unit for execution, with no conflicting stores in front and no memory fences in front), we can have a queue of memory ops; each cycle we pick the load at the head of the queue and then search from head to tail for additional loads that target the same cache line, stopping at the first memory fence, conflicting store, etc. Once those loads are selected, they are removed from the queue (probably by marking them as removed) and sent through the execution pipeline.

We can use a similar algorithm for stores.
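
A rough python model of that queue scan (not RTL; the field names,
64-byte line size and 4-port limit are purely illustrative, and the
"no conflicting stores ahead of the head load" condition is assumed
to have been checked already by the scheduler):

    LINE = 64

    def pick_loads(queue, max_ports=4):
        """queue: list of dicts with 'kind' in {'load','store','fence'}, 'addr', 'done'."""
        picked = []
        head_line = None
        for idx, op in enumerate(queue):
            if op["done"]:
                continue                    # already issued ("removed" by marking)
            if op["kind"] == "fence":
                break                       # nothing may be picked past a fence
            line = op["addr"] // LINE
            if op["kind"] == "store":
                if head_line is not None and line == head_line:
                    break                   # conflicting store: stop the scan
                continue                    # unrelated store: keep scanning
            # it's a load
            if head_line is None:
                head_line = line            # head-of-queue load sets the target line
            if line == head_line:
                picked.append(idx)
                if len(picked) == max_ports:
                    break
        for idx in picked:
            queue[idx]["done"] = True       # remove by marking, as described above
        return picked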

To find the required loads, we can use a network based on recursively summarizing chunks of the queue entries' per-cycle ready state, then reversing direction from the summary back to the queue entries to tell the entries which, if any, execution port they will be running on this cycle. There is then a mux for each execution port in the load pipeline to move the required info from the queue to the pipeline. The network design is based on the carry lookahead network of a carry lookahead adder, which gives O(N*log(N)) space and O(log(N)) gate latency.
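
A software model of the port-assignment part (a log-depth prefix count
over the per-entry ready bits, the same summarise-then-propagate shape
as a carry-lookahead network; the function name and entry count are
made up for the example):

    def assign_ports(ready, n_ports):
        """ready: list of 0/1 per queue entry; returns port index or None per entry."""
        n = len(ready)
        partial = list(ready)          # running inclusive counts of ready bits
        step = 1
        while step < n:                # O(log N) levels, O(N) adders per level
            partial = [partial[i] + (partial[i - step] if i >= step else 0)
                       for i in range(n)]
            step *= 2
        # exclusive prefix count: number of ready entries strictly before i
        prefix = [partial[i] - ready[i] for i in range(n)]
        # entry i gets port prefix[i] if it is ready and a port is left over
        return [prefix[i] if ready[i] and prefix[i] < n_ports else None
                for i in range(n)]

    # e.g. 8 queue entries, 2 load ports:
    print(assign_ports([0, 1, 0, 1, 1, 0, 0, 1], 2))
    # -> [None, 0, None, 1, None, None, None, None]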

Loads/stores that cross a cache-line boundary can be split into 2 load/store ops when sent to the queue, with the two halves of a load reunited when both complete. They should be relatively rare, so we can probably support reuniting only 1 op per cycle.
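
For reference, the split itself is simple; a hypothetical helper
(64-byte line size assumed):

    LINE = 64

    def split_op(addr, size):
        """return a list of (addr, size) pieces, each contained in one cache line."""
        end = addr + size
        line_end = (addr // LINE + 1) * LINE
        if end <= line_end:
            return [(addr, size)]                  # no split needed
        return [(addr, line_end - addr),           # first piece, up to end of line
                (line_end, end - line_end)]        # second piece, start of next line

    print(split_op(0x1000 + 60, 8))   # crosses: [(0x103c, 4), (0x1040, 4)]
    print(split_op(0x1000 + 8, 8))    # contained: [(0x1008, 8)]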

RMW Atomic ops and fences can be put in both load and store queues where they are executed once they reach the head of both queues.