Bug 257 - Implement demo Load/Store queueing algorithm
Summary: Implement demo Load/Store queueing algorithm
Status: DEFERRED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: All
OS: All
Importance: --- enhancement
Assignee: Jacob Lifshay
URL:
Depends on:
Blocks: 216
Reported: 2020-03-13 18:40 GMT by Jacob Lifshay
Modified: 2023-09-12 02:34 BST
CC: 2 users

See Also:
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Jacob Lifshay 2020-03-13 18:40:38 GMT
see http://bugs.libre-riscv.org/show_bug.cgi?id=216#c7
Comment 1 Jacob Lifshay 2020-03-18 05:23:18 GMT
Luke, it looks like you probably have something that works just as well as this would, so I'm resolving the bug.
Comment 2 Luke Kenneth Casson Leighton 2020-03-18 06:59:02 GMT
(In reply to Jacob Lifshay from comment #1)
> Luke, it looks like you probably have something that works just as well as
> this would, so I'm resolving the bug.

maayybee... i don't know.  it's taking me considerable time to piece together.
Comment 3 Luke Kenneth Casson Leighton 2020-03-20 11:03:18 GMT
(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Jacob Lifshay from comment #1)
> > Luke, it looks like you probably have something that works just as well as
> > this would, so I'm resolving the bug.
> 
> maayybee... i don't know.  it's taking me considerable time to piece
> together.

nope, struggling.  and the N-to-N comparisons for the high MSBs are going to kill power usage.  took me a while to realise what you were saying about using
an LRU (we have a PLRU implementation here):
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/TLB/ariane/plru.py;hb=HEAD

can you take a look?  assume that each LD/ST buffer has *two* "ports",
one that is 16 bytes wide with address port0_addr and is normally used,
the other is *8* bytes wide (actually, 7), is *always* at port0_addr+16,
and deals with misalignments.

actually, that's a red herring.  to clarify: it's safe to assume that
misalignments have  been "dealt with" by being pre-split into separate
"aligned" ports.

also assume that addresses+len are expressed as:

* top bits numbered 4 and up as a binary address
* LSBs 0 to 3 as a 16-bit unary "mask" which *also* encapsulates data length

(with no misalignments, which means that some mask bit groups will be
non-power-2 long)
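
as a quick illustration, here's a minimal sketch in plain Python (not
nmigen, function name hypothetical) of that encoding, assuming the access
has already been pre-split so it never crosses a 16-byte boundary:

def encode_addr_len(addr, length):
    """split byte-address + length into (binary bits 4+, 16-bit unary mask)"""
    assert (addr & 0xf) + length <= 16, "must already be pre-split/aligned"
    high = addr >> 4                             # bits 4 and up, plain binary
    mask = ((1 << length) - 1) << (addr & 0xf)   # unary mask, encodes len too
    return high, mask

# e.g. a 3-byte access at 0x125 -> high=0x12, mask=0b0000000011100000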

also assume that it is *guaranteed* - by this class - that there will be
ZERO overlaps.  as in, absolutely guaranteed.

https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;h=0312504eb818c5e77d8a85d3948f733663b34b2e;hb=43415447fa3601f27b7994076823cea4cc3323a5#l143

it's the top MSBs of the address that are not covered by that class
(because comparing all of them there would be too much).

i figured, rather than a ring buffer, a PriorityPicker could identify QTY 1
address to put through, then all other address MSBs go through... the PLRU?

i get lost at this point.
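
(for illustration, a tiny plain-Python sketch of the "pick QTY 1" part;
the real PriorityPicker in soc.git is an nmigen module, this just shows
the intent of keeping only the lowest-numbered live request:)

def priority_pick(requests):
    return requests & -requests    # keep only the lowest-numbered set bit

# e.g. requests = 0b01101100 -> picks 0b00000100; that entry's address
# MSBs would then be broadcast to all other live entries for comparison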
Comment 4 Jacob Lifshay 2020-03-20 13:06:15 GMT
ok, reopening
Comment 5 Luke Kenneth Casson Leighton 2020-03-20 18:34:48 GMT
(In reply to Jacob Lifshay from comment #4)
> ok, reopening

thanks.  whew.  oh.  one other thing: if it can be N-way-ported (assume that the L1 cache will at some point be dual-ported, minimum) that would be handy.  if there are several LDs, on 16-byte boundaries for example, we don't want the L1 cache to be the bottleneck just because it's only possible to read/write one cache line at a time.

i'm keeeenly aware that this implies we would need a maaassive 256-bit-wide data bus to the L1 cache.

oink.

welcome to GPUs :)
Comment 6 Jacob Lifshay 2020-03-20 19:07:41 GMT
Was assuming that the L1 cache has a single cache-line-sized R or W port, though having 2 ports makes it even better.

Remember that we probably want another cache-line-sized port for communication with the L2 and the rest of the SoC.
Comment 7 Luke Kenneth Casson Leighton 2020-03-20 19:53:41 GMT
(In reply to Jacob Lifshay from comment #6)
> Was assuming that the L1 cache has a single cache-line-sized R or W port,
> though having 2 ports makes it even better.

here's the code i'd like to drop-in-place (literally), i.e. just use it.

https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/minerva/units/loadstore.py;hb=HEAD

that's already set up for "masking" - bear in mind it's only 32-bit-wide
so the x_mask signal is only 4 bits: we need to expand that to 16-bits
and it will fit perfectly with the 16-bit "mask" from the PartialAddressBitmap
class.

so the other end of the interface, jacob, is the LoadStoreUnitInterface.  a test can be written which bypasses the cache, for now, to make life a bit simpler.



> Remember that we probably want another cache-line-sized port for
> communication with the L2 and the rest of the SoC.

one step at a time :)

wait... ahh... everything memory-wise goes through the L1 cache. the cores are "slaves" of the wishbone bus (not masters).  all communication is done by way of memory-addressing (all peripherals are memory-addressable).

even "management" of the L2 cache (and L1 "flush" commands) etc. should ideally be done by way of memory-addressable "registers".

or, we have a special side-channel wishbone bus that communicates "messages".  basically drops a pre-determined data structure (a nmigen Record) onto the side-channel WB bus at a known "address", it gets routed to where is needed, and the recipient decodes it as "commands".

but as far as communication with peripherals is concerned, they're *all* memory-addressable.  all of them.  any "setup" is done by way of "Registers"... which are also simply memory-addressable... over the WB memory bus.

for some of those, because they are I/O, caching is not permitted.  so we have to be on the lookout for that.
Comment 8 Luke Kenneth Casson Leighton 2020-03-20 20:21:40 GMT
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/minerva/cache.py;hb=HEAD

they set the cache up to be a maximum of 2-way.  however that's not the same as the number of read/write ports.  arg.

ok i found a diagram which explains it:
https://libre-riscv.org/3d_gpu/architecture/set_associative_cache.jpg

it shouuuld be relatively straightforward to augment that to be dual-ported,
although let's not do that straight away.
Comment 9 Luke Kenneth Casson Leighton 2020-03-21 00:48:09 GMT
hang on a minute...

if we use a very small but wide L1 cache, with the following characteristics, does it do the job?
* 8 sets
* 16 way set associative
* byte per way

the 16 bit mask *is* the 16 way selector

then i think, if you make an N-in/out multiplexor to write to each byte, the job's done.

errr...
Comment 10 Jacob Lifshay 2020-03-21 01:56:02 GMT
I've been thinking that we would have a cache with a cache line size of 64 bytes (or 32 if we really have to).

we want the cache line size to be the same for all our caches since it greatly simplifies cache to cache transfers.

So, for a 32kB L1 cache, it would have 512 lines. If we had it be 8-way set associative, then there would be 64 sets and each set would have 8 lines.
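
A quick arithmetic check of those numbers:

cache_bytes = 32 * 1024
line_bytes = 64
ways = 8
lines = cache_bytes // line_bytes    # 512 cache lines
sets = lines // ways                 # 64 sets of 8 lines each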

Having an L1 cache much smaller than 16kB would be quite detrimental due to the excessive cache latency and lack of bandwidth to the L2.

I'm going to be unavailable till at least Sat night, then will work on publishing v0.2 of algebraics (updating transitive dependencies for simple-soft-float due to PyO3 v0.9 being released) then work on this.
Comment 11 Luke Kenneth Casson Leighton 2020-03-21 04:12:53 GMT
(In reply to Jacob Lifshay from comment #10)
> I've been thinking that we would have a cache with a cache line size of 64
> bytes (or 32 if we really have to).

eek.  ok.  right.  64 / 4 = 16: that gives four 16-byte stripes of 1R1W Memory inside the cache. that way we do not need multi-ported SRAM in the cache.

if we have byte enable lines the 16 bit mask can be passed to each byte.

> we want the cache line size to be the same for all our caches since it
> greatly simplifies cache to cache transfers.

ok

> So, for a 32kB L1 cache, it would have 512 lines.

ye gods.

> If we had it be 8-way set
> associative, then there would be 64 sets and each set would have 8 lines.

4 way makes more sense (a little easier) as the byte enable mask is 16 bits.

however 8 way does as well, because each 16-bit byte-enable mask is split in two.

> Having a L1 cache much smaller than 16kB would be quite detrimental due to
> the excessive cache latency and lack of bandwidth to the L2.

yes.

> I'm going to be unavailable till at least Sat night, then will work on
> publishing v0.2 of algebraics (updating transitive dependencies for
> simple-soft-float due to PyO3 v0.9 being released) then work on this.

ok.

the earlier idea of a tiny pre-L1 cache may, on reflection, be unnecessary.

i drew out a diagram a couple days ago, it basically involved:

* a PriorityPicker to find the first entry in the AddressMatrix that is "live"

* that then says which entry is to be used to broadcast the high address bits (14 to 48) to all other live LD/STs

* bear in mind all "live" addresses have already had to compute a comparison on bits 4 thru 13 of the address; we also include *these* (ANDed) in the comparison

* anything that succeeds in matching bits 4 to 13 from the AddressMatcher matrix *and* gives a hit on the Priority-Picked bits 14 to 48 has its 16-bit "mask" permitted to go through an ORer to the L1 cache byte-enables on the Memory.

a refinement of that is to allow up to *four* of those to be picked, one to be routed to each "way".

that is one of the things that i wrote the MultiPriorityPicker for.
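
(for illustration, a plain-Python sketch of the flow in the bullets above;
names are hypothetical and the real thing would be combinational nmigen:)

def merge_for_l1(entries):
    # entries: list of (live, addr_bits_14_up, addr_bits_4_13, mask16)
    live = [e for e in entries if e[0]]
    if not live:
        return None
    _, hi, mid, _ = live[0]                 # PriorityPicker: first live entry
    byte_enable = 0
    for _, e_hi, e_mid, e_mask in live:
        if e_hi == hi and e_mid == mid:     # broadcast bits 14-48 hit, AND
            byte_enable |= e_mask           # bits 4-13 match: OR in its mask
    return (hi, mid), byte_enable           # one L1 line plus byte-enables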
Comment 12 Luke Kenneth Casson Leighton 2020-03-21 15:28:01 GMT
hmmm.  connected to LDSTCompAlu we need something that, after splitting the address and len into two, can "reconstruct" the LD/ST.

i'm also going to suggest that we have a preanalysis phase on LD/STs for VL and make it SIMD-like, up to 64 bit.

so if a VL=4 LD comes in on a halfword operation it is turned into a 64 bit request.

the only thing to watch out for there is mem faults.  the last 16 bits failing should *not* cause the other 3 16 bit LDs in that "batch" to fail.

we can incrementally add this as an optimisation, later
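
(hypothetical sketch of that pre-analysis, plain Python: VL consecutive
halfword LDs collapse into one 64-bit request, keeping per-element masks
so a fault on one 16-bit slice need not fail the other three:)

def batch_elements(base_addr, el_bytes, vl):
    assert el_bytes * vl <= 8, "batch no wider than 64 bits"
    el_masks = [((1 << el_bytes) - 1) << (i * el_bytes) for i in range(vl)]
    combined = 0
    for m in el_masks:
        combined |= m
    return base_addr, combined, el_masks   # one request plus element masks

# e.g. batch_elements(0x1000, 2, 4) -> (0x1000, 0xff, [0x3, 0xc, 0x30, 0xc0])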
Comment 13 Luke Kenneth Casson Leighton 2020-03-21 15:36:52 GMT
(In reply to Luke Kenneth Casson Leighton from comment #12)
> hmmm connected to LDSTCompAlu we need something that, after splitting the
> address and len into two, can "reconstruct" the LD/ST

this basically adds 16 to the real address and masks off the top bytes that overrun a 16 byte cache line.

there need to be two latches which are in "passthrough" mode (in case the twin data is available immediately); if one is not available, wait until it is before responding "ok".

also needs to pass through exceptions back to the LDSTCompAlu

the interesting thing is, the interface is almost certainly identical to the minerva LoadStoreInterface class.
Comment 14 Luke Kenneth Casson Leighton 2020-03-22 11:37:52 GMT
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_split.py;h=40f4569c38defa7fbf31b68f186f064d2df3e732;hb=f58721bc872b0e0b487a14dfa6a816774b433892

putting together a basic LD/ST address-splitter.  however it includes signalling which will connect directly to CompLDST.

basically, the address+len expands to a bit-mask that could be mis-aligned,
so the "normal" half is sent truncated on a cache-line boundary and the "top"
half is split into a *second* LD/ST.

what that means is that the LDSTCompUnit now needs to wait for *TWO* LD/STs
to complete.

therefore, we need valid/data signalling for *two* LD/STs and, on successful
completion of *both*, the data needs "re-assembly" based on the *two* masks
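
(a minimal sketch of that split, plain Python with hypothetical names:
one address+len expands to a bit-mask, the "normal" half is truncated at
the 16-byte line boundary and the overrun becomes a second LD/ST at +16:)

def split_ldst(addr, length):
    offs = addr & 0xf
    mask = ((1 << length) - 1) << offs    # full, possibly misaligned, mask
    lo_mask = mask & 0xffff               # truncated at the line boundary
    hi_mask = mask >> 16                  # overrun: goes to a second LD/ST
    lo = (addr & ~0xf, lo_mask)
    hi = ((addr & ~0xf) + 16, hi_mask)
    return lo, (hi if hi_mask else None)

# e.g. an 8-byte LD at offset 12 -> lo mask 0xf000 plus a second LD/ST at
# addr+16 with mask 0x000f; the LDSTCompUnit waits for both to complete and
# re-assembles the data using the two masks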
Comment 15 Luke Kenneth Casson Leighton 2020-03-24 11:16:58 GMT
*facepalm*... i managed to get all the way to a successful unit test
before noticing that i'd set up the code to LD/ST *bits*... not bytes :)

am currently adding ST splitting: drawing this diagram helped:

https://libre-riscv.org/3d_gpu/ld_st_splitter.png
Comment 16 Luke Kenneth Casson Leighton 2020-03-24 11:54:17 GMT
https://groups.google.com/forum/#!topic/comp.arch/cbGAlcCjiZE

started a discussion here, see what happens.
Comment 17 Luke Kenneth Casson Leighton 2020-03-25 13:07:22 GMT
https://libre-riscv.org/3d_gpu/l0_to_l1_n_way_cache_buffer.png

i will add this and a description here, shortly:
https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/

the idea here is to have 6 to 8 LD/ST Function Units, each with
two "ports" (one exclusively for misaligned LDs/STs), which means
that we then have 12 to 16 such "ports" trying to get access to
the L0 Cache/Buffer.

that will require a Multi-Priority-Picker ("and here's one we prepared
earlier....")
https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/800x-multi_priority_picker.png

this can be set up to be 16-in and produce 4x unique exclusive 16-out
"selectors", none of which will overlap.

that in turn allows us to OR/MUX any one of those 16 LD/ST FUs to
any one of the 4 "ports" on the L0 Cache.

from there, we can split into two halves using bit 4 as the "selector"
bit, and use bits 5 and above as the "Addr hit".

a priority picker then picks *one* row and broadcasts its address to
all other rows.  if there is a "hit" (same address bits 5 and upwards)
then the data-select bytemask - and data itself - is "merged" into
the output as a "single cache line", to be sent directly to the
L1 cache.

whew.
Comment 18 Luke Kenneth Casson Leighton 2020-03-26 17:06:04 GMT
oh.  one important thing that came out of the discussion on comp.arch:
the number of wires involved, if we are not careful, is MENTAL.

let's imagine we pass the expanded (shifted, masked) data and op-length
out from each FunctionUnit.  the data, previously a maximum of 64 bit,
is now a full cache-line wide (because it has been shifted into position
for the cache write), and the operation, previously encoded as len=1,2,4,8
(we could use 2 bits for that) and in binary address form LSBs 0-3, is
now a 16 bit mask...

oh times two, one for the aligned and one for the misaligned LD/ST.

so that's 128 bits of data per port, 256 for the two ports, plus 16x2=32
for the masks: 288 wires per LD/ST FU, of which there will be 8.

that's a staggering 2,304 wires *just for the data*

eep.
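
(quick sanity-check of those numbers:)

data_bits_per_port = 128    # one 16-byte cache line, post-shift
mask_bits_per_port = 16
ports_per_fu = 2            # one aligned, one misaligned
per_fu = ports_per_fu * (data_bits_per_port + mask_bits_per_port)  # 288
total = per_fu * 8          # 8 LD/ST FUs -> 2304 wires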

if we do the splitting *after* the Muxing through the 4-port Muxer(s),
*even if the splitting was needed for input into the Address Clash Matrix*,
we can halve that number of wires to "only" around 1120.

kiinda important :)
Comment 19 Jacob Lifshay 2020-03-26 20:58:58 GMT
I was thinking that the load/store FUs would just split and submit both halves to a queue-like structure, and the FU would just track both halves independently until they complete, then merge them. If a load/store doesn't need to be split, then it would only take one queue entry.

That queue can have all the dependency matrix and scheduling stuff be part of it and it only needs enough entries to keep the pipeline full -- it doesn't need double entries to handle split load/stores since the FUs will just stall if the queue is too full.
Comment 20 Luke Kenneth Casson Leighton 2020-03-26 21:23:08 GMT
(In reply to Jacob Lifshay from comment #19)
> I was thinking that the load/store FUs would just split and submit both
> halves to a queue-like structure that and the FU would just track both
> halves independently until they complete, then merge them. If a load/store
> doesn't need to be split, then it would only take one queue entry.

yes - only submit one queue entry, however there's a second split, in order
to double the maximum throughput (without needing dual-ported L1 cache SRAM):
use bit 4 to select odd-even banks.

however that selection needs to be back at the start:
https://libre-riscv.org/3d_gpu/l0_to_l1_n_way_cache_buffer.png

so we need *two* multi-in 4-out PriorityPickers, where the inputs are
actually done like this:

for i in range(16):
   if addr[i][4] == 0:
      multi_picker_even[i] = ldst_requested[i]
   else:
      multi_picker_odd[i] = ldst_requested[i]

*then* followed by:

mpick_even = MultiPriorityPicker(multi_picker_even, 4)
mpick_odd  = MultiPriorityPicker(multi_picker_odd, 4)

*then* followed by:

for i in range(4):
    # connect each picked 16-bit enable to an L0 buffer port: this is an
    # nmigen connection (.eq), not a "==" comparison
    comb += L0_cache_buffer_even.port[i].eq(mpick_even.out_en[i]) # each 16-bit wide
    comb += L0_cache_buffer_odd .port[i].eq(mpick_odd .out_en[i]) # ditto

that L0_cache_buffer_odd/even, yes it can be a Queue, however a FIFO Queue
has a bit of a problem: you can't normally get access to the entries in it,
whereas we need to *merge* entries (the bytemasks) that happen to have the
same address bits 5 and upwards.

> That queue can have all the dependency matrix and scheduling stuff be part
> of it and it only needs enough entries to keep the pipeline full -- it
> doesn't need double entries to handle split load/stores since the FUs will
> just stall if the queue is too full.

part of the discussion on comp.arch was that we *may* want to prioritise
aligned LDs/STs over misaligned ones.

this can easily be done by putting the misaligned LDs/ST requests at the
*end* of the PriorityPicker.

however, again, if there happens to be a sequential batch of LDs/STs
which happen to be misaligned because they are a massive sequential batch of
8-byte LDs that were offset accidentally by 4 bytes, we do *not* want
misaligned LDs/STs to be deprioritised, because they will actually merge
quite nicely with the other LDs/STs.

therefore (and i'm recommending we do this later) we actually need to
prioritise which entry should be sent from the L0 cache/buffer/queue
by the number of bits that "merge" at the bytemask level.  do this with
a "popcount" (or pseudo-popcount) on every entry in the L0 cache/buffer/queue
that detects, at the very minimum, which are full cache-lines after merging
the byte-masks which all have the same binary addresses [bits 5 and upwards].

*not* by "whether it is first in the queue".

so ultimately, a queue (FIFO queue) is not the right data structure to use.
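
(rough software sketch of that prioritisation, hypothetical names; in
hardware this would be a popcount, or pseudo-popcount, per merged entry:)

def pick_entry_to_send(entries):
    # entries: list of (addr_bits_5_up, bytemask) - one per L0 buffer row
    merged = {}
    for addr, mask in entries:
        merged[addr] = merged.get(addr, 0) | mask   # merge same-line masks
    # prefer the line whose merged byte-mask has the most bytes enabled
    # (a full cache line merges to all-ones)
    return max(merged, key=lambda a: bin(merged[a]).count("1"))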

if you'd like to write a round-robin Picker, great! :)  however i don't
think it makes a huge amount of difference, initially, whether it's
Priority-picked or RR-picked.  the PP and Multi-PP code exists :)
Comment 21 Jacob Lifshay 2020-04-22 03:58:07 BST
Wishbone as-is is not sufficient for handling the interconnect between the L1 and L2 caches, or to any other device that can act as a bus master and is cache coherent (such as other CPUs, OmniXtend, or cache-coherent DMA). The reason is that it has no concept of ownership of a particular cache block (or of caching at all), which is needed to support a cache coherence protocol.

Wishbone also has the drawback of memory bandwidth being limited by the bus having a max width of 64 bits (ends up as 6.4GB/s peak theoretical bandwidth at 800MHz), we are likely to want more bandwidth between cores and the L2 cache.

Minerva's memory interface appears not to be designed to handle cache coherence at all, so it would need to be redesigned if we wanted to use it -- which I don't think we should, since we would need to rewrite the majority of Minerva's memory interface code.

Using wishbone for devices that don't need to be cache coherent and don't need large memory bandwidth (less than 1GB/s or so) is still a good idea.
Comment 22 Luke Kenneth Casson Leighton 2020-04-22 09:30:51 BST
AXI4 does not have cache coherency either, and yet coherency is still achieved: it is done by passing messages over the "control bus", using home-grown protocols in every case that i have examined (including ariane and OpenPITON).

we do not need to be fully compliant with the wishbone spec internally.

it is perfectly fine to modify the code so that it has the required bus bandwidth for internal use.

when external connections to other peoples' code is needed (peripherals) then and only then will we need to comply with the spec.
Comment 23 Luke Kenneth Casson Leighton 2020-04-22 09:39:55 BST
also, we're not doing an SMP core for the october 2020 deadline, because we are only doing a single core.

therefore L1 and L2 SMP cache coherency is not needed.

we *need* to urgently get moving on this, we are seriously behind.

the fastest way to do that is to not spend months writing code that already exists.

the fastest way to get the job done and meet that hard deadline is to only spend days or weeks using existing code.

minerva *already* uses wishbone.

in addition, there is the "soc" code from enjoy-digital (litex) which has testing infrastructure and connectivity to peripherals already done.

we *need* test infrastructure and peripheral connectivity.

we *cannot* spend weeks or months writing new test infrastructure when it already exists, just because we're using a completely new Bus Protocol (TileLink) that completely isolates and cuts us off from that test infrastructure: it is all written to use Wishbone.

after the October 2020 deadline, when SMP cache coherency is a goal, we can look at this again.

right now, we have to shelve SMP, and get the job done.  it's only six months and at least two of those are going to be full-time working on the layout.
Comment 24 Jacob Lifshay 2020-04-22 09:54:45 BST
(In reply to Luke Kenneth Casson Leighton from comment #23)
> also, we're not doing an SMP core for the october 2020 deadline, because we
> are only doing a single core.
> 
> therefore L1 and L2 SMP cache coherency is not needed.

Ok, we can write the code to basically have whatever cache coherency stuff gets written when getting the memory interface working and not worry about the L2 or other cores existing at all, treating it as a writethrough transparent cache. The design I'm working on is integrated closely enough with the L1 cache that it will still need at least a little of the coherence protocol to support atomics, but it can be built to have a WishBone interface adaptor.
Comment 25 Luke Kenneth Casson Leighton 2020-04-22 10:48:47 BST
(In reply to Jacob Lifshay from comment #24)
> (In reply to Luke Kenneth Casson Leighton from comment #23)
> > also, we're not doing an SMP core for the october 2020 deadline, because we
> > are only doing a single core.
> > 
> > therefore L1 and L2 SMP cache coherency is not needed.
> 
> Ok, we can write the code to basically have whatever cache coherency stuff
> gets written when getting the memory interface working and not worry about
> the L2 or other cores existing at all, treating it as a writethrough
> transparent cache.

basically, yes.  we're only going to be running at 25 mhz (possibly 100 mhz)
because there's no PLL, and no clock gating.

therefore yes even the L2 can be cut.

> The design I'm working on is integrated closely enough
> with the L1 cache that it will still need at least a little of the coherence
> protocol to support atomics, 

remember, i said already that atomic memory operations are taken care of
by the Address / Memory Dependency Matrices.

they are already split out into individual (single) operations, each one
isolated from (batches of) LOADs and (batches of) STOREs.

so there's strictly speaking no need to implement any kind of coherence
protocol, because as things stand it's never going to be active.

yes it's over-zealous, it is however just a side-effect of how the
address-matching / memory-dependency matrices work.


> but it can be built to have a WishBone
> interface adaptor.

ok good.  does it implement this algorithm?

https://bugs.libre-soc.org/show_bug.cgi?id=216#c23
https://libre-soc.org/3d_gpu/mem_l0_to_l1_bridge_v2.png
Comment 26 Luke Kenneth Casson Leighton 2020-04-22 14:06:43 BST
L1 cache concept, go for it - that's really straightforward and standard,
although it's important to appreciate that 16 *byte*-level read/write-enable
lines will come through from the Function Units.

if you can also do a non-standard-wishbone-128 to standard-wishbone-64
bit "funnel", that analyses the 16 byte-level read/write-enable lines,
splitting them into two halves covering the LO-64-bit and HI-64-bit
that would fit well.

actually to be honest at the 128-bit-level it doesn't even need to be
Wishbone, which may make design easier.

another "easier" way may be to simply have *FOUR* 64-bit Wishbone
interfaces:

* L1 left LO-64
* L1 left HI-64
* L1 right LO-64
* L1 right HI-64

then we "jam" those down onto a single 64-bit Wishbone interface
using a standard Wishbone Bus arbiter


L0 cache/buffer, give it a little time for some review / thought
as i just found a mistake in the L0 cache/buffer design and had
to adjust it:
https://libre-soc.org/3d_gpu/twin_l0_cache_buffer.jpg
Comment 27 Luke Kenneth Casson Leighton 2020-04-22 14:38:01 BST
(In reply to Luke Kenneth Casson Leighton from comment #26)

> another "easier" way may be to simply have *FOUR* 64-bit Wishbone
> interfaces:
> 
> * L1 left LO-64
> * L1 left HI-64
> * L1 right LO-64
> * L1 right HI-64
> 
> then we "jam" those down onto a single 64-bit Wishbone interface
> using a standard Wishbone Bus arbiter

i was going to say that those could all be 4 independent L1 caches,
however both left ones and both right ones have the exact same address,
so it would be a waste of power.

*however*... the 16-bit byte read/write-enable lines back at L0 can be
split into two then ORed together to produce a *single* "LO-64" and "HI-64"
L1 cache line read/write.
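
(minimal sketch of that split-and-OR in plain Python, names hypothetical:
the 16 byte-enables from the left and right L0 ports are ORed and split
into LO-64 and HI-64 halves for the two 64-bit Wishbone accesses:)

def split_and_or(bytemask_left, bytemask_right):
    merged = bytemask_left | bytemask_right
    lo = merged & 0x00ff     # bytes 0-7  -> LO-64 cache line half
    hi = merged >> 8         # bytes 8-15 -> HI-64 cache line half
    return lo, hi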
Comment 28 Jacob Lifshay 2020-04-29 12:16:25 BST
since I'm effectively stalled on the algorithm I've been working on (it should work, I'm just not good at writing it out properly) and the one luke's been working on should work well enough (bug 216, I think), I will be switching to just implementing that one.

If there's more time later, I can try again and reopen this bug.
Comment 29 Luke Kenneth Casson Leighton 2020-04-29 22:39:48 BST
if you can record some brief notes, after the oct 2020 deadline, or if we have less time pressure, it will be good to come back to this.
Comment 30 Jacob Lifshay 2020-05-06 01:33:43 BST
deferred for after oct 2020 tape out