Bug 126 - Make Div core conditional (enable signal)
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU)
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 48
Reported: 2019-07-29 21:20 BST by Luke Kenneth Casson Leighton
Modified: 2022-07-17 14:16 BST (History)
2 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2019-07-29 21:20:23 BST
Div Core is generating spurious output based on input that always gets passed in.
This should be stopped as it makes debugging harder and also takes power.
One simple way is to extend the operation enum so that there is a zero (NOP).
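A minimal sketch of that idea, assuming a hypothetical operation enum (the names `DivOp`, `NOP`, `UDIV_REM`, `SQRT_REM` and `core_active` are invented here for illustration, not the project's actual identifiers):

```python
from enum import IntEnum, unique

@unique
class DivOp(IntEnum):
    """Hypothetical operation enum with an explicit zero NOP, so that an
    all-zero (idle) input no longer selects a real operation."""
    NOP = 0       # default / idle: core performs no computation
    UDIV_REM = 1
    SQRT_REM = 2

def core_active(op):
    """The core would gate its work on the operation not being NOP."""
    return op != DivOp.NOP
```

With this numbering, a zeroed input bus decodes to NOP instead of accidentally selecting a real operation.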
Comment 1 Jacob Lifshay 2019-07-29 23:20:25 BST
the way to save power is to make the pipeline registers not change their value, not by adding conditional logic to completely combinatorial blocks. this is because saving power occurs in cmos when the smallest number of gates switches from their previous state to a new state. switching all the gates to 0 or 1 actually uses lots of power in the clock cycle transitioning from in-use to idle.
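The register-holding behaviour described above can be sketched as a toy Python model (all names are illustrative, not the project's actual API):

```python
class PipeReg:
    """Toy model of an enable-gated pipeline register: on each clock it
    captures the input only when `valid` is set, otherwise it holds its
    previous value -- the CMOS power-saving behaviour described above,
    where the fewest gates toggle between states."""
    def __init__(self, init=0):
        self.q = init

    def clock(self, d, valid):
        if valid:           # only switch the flip-flops for real data
            self.q = d
        return self.q       # idle cycles re-present the old value

reg = PipeReg()
reg.clock(0x1234, valid=True)      # real data is captured
held = reg.clock(0, valid=False)   # idle: register holds its value
```

The point is that the hold happens at the register, outside the combinatorial block, which stays oblivious.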

In my opinion, this should be done outside the div core, it should be done in fpdiv/pipeline.py, which is where the pipeline registers are actually created.

if debugging, all you have to do is view the data-valid signal as well as the data signals.
Comment 2 Luke Kenneth Casson Leighton 2019-07-30 00:27:53 BST
(In reply to Jacob Lifshay from comment #1)
> the way to save power is to make the pipeline registers not change their
> value, not by adding conditional logic to completely combinatorial blocks.

They're combinatorial... however they are sometimes connected to other combinatorial blocks and sometimes connected via pipeline classes, and the combinatorial blocks *do not know which*.

Therefore, there are two choices:

1. Have logic that makes decisions inside the blocks (out_do_z)

2. Completely redesign absolutely every single combinatorial block, of which there must now be over 50, to have an indicator (o_valid) which tells the downstream block if the output is to be switched...

... and then have some way of identifying when StageChain is used and when pipeline registers are used.

Total mess, huge amount of work, and getting rid of the clean isolation and transparent data independence in the process.

InputMux and OutputMux are bad enough as it is, reading ctx.muxid. These are literally the only classes that break the rule of total separation between data and data passing, and even then, there is no combinatorial block that ever does anything other than pass ctx through.

That rule is just about to get broken with early in and early out bypassing, but that's another story.

> this is because saving power occurs in cmos when the smallest number of
> gates switches from their previous state to a new state. switching all the
> gates to 0 or 1 actually uses lots of power in the clock cycle transitioning
> from in-use to idle.

It means adding an external "stall" capability, in effect.  6600 pipelines *never* stall. Saving power however is a good reason to break that rule.

Or, in the case of the ReservationStations system: if the (unary) muxid is nonzero, data shall be passed along the register pipeline chain, and left alone if not.

That would do the trick. Only data that had a (unary) muxid would be propagated.  It would require that the muxid be always (unconditionally) propagated down the registers, whilst the data was *conditionally* propagated.
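A behavioural sketch of that scheme, with invented names (this is a model of the idea, not the ReservationStations code):

```python
class MuxidGatedStage:
    """Model of the scheme above: the (unary) muxid is propagated
    unconditionally every clock, while the data register is only
    updated when the incoming muxid is nonzero."""
    def __init__(self):
        self.muxid = 0
        self.data = 0

    def clock(self, muxid_in, data_in):
        if muxid_in != 0:        # real operation in flight: pass data on
            self.data = data_in
        self.muxid = muxid_in    # muxid always moves down the registers
        return self.muxid, self.data

s = MuxidGatedStage()
s.clock(0b0010, 99)   # valid entry: data propagates
s.clock(0, 123)       # bubble: muxid clears, data is left alone
```

Only entries carrying a nonzero muxid disturb the data registers; bubbles cost only the (narrow) muxid register toggling.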


> In my opinion, this should be done outside the div core, it should be done
> in fpdiv/pipeline.py, which is where the pipeline registers are actually
> created.

Sort-of, by using SimpleHandshake, yes. This needs to be replaced with a unary-muxid-understanding class with cancellation capability.  Will discuss on the other bugreport, bug #101.


> if debugging, all you have to do is view the data-valid signal as well as
> the data signals,

What happens in fpmul and fpadd is that they check out_do_z.  Zeros get propagated down the chain. Nothing changes.

So I add signals to view in gtkwave on every sequential phase, prenorm, align, processing, postnorm, output.

Then I set the same input values in fpmux.py

The data propagates and is CLEARLY visible, no confusion, because, clearly, if it's nonzero, it got there because the previous pipeline registers put it there.

This is NOT the case with DivCore.

DivCore has the operator permanently active.

Zero input starts IMMEDIATELY generating spurious data, which immediately begins propagating down the pipeline.  I wasted... about an hour trying to work out why on earth I was seeing 0x7000000 0x7f0000 0x7ff000 0x7ffff00 going down the pipeline in the Div stages.

Finally I realised it was because operator defaults to zero, which ACTIVATES actual DIV computation.
Comment 3 Jacob Lifshay 2019-07-30 00:58:32 BST
(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Jacob Lifshay from comment #1)
> > the way to save power is to make the pipeline registers not change their
> > value, not by adding conditional logic to completely combinatorial blocks.
> 
> They're combinatorial... however they are sometimes connected to other
> combinatorial blocks and sometimes connected via pipeline classes and the
> combinatorial blocks *do not know which*
> 
> Therefore, there are two choices:
> 
> 1. Have logic that makes decisions inside the blocks (out_do_z)

the problem with this approach is that it is a pseudo-solution to saving power. the combinatorial blocks don't know the previous state of the outputs (they have no flip-flops) so can't reproduce it.

> 2. Completely redesign absolutely every single combinatorial block, of which
> there must now be over 50, to have an indicator (o_valid) which tells the
> ongoing block if the output is to be switched...

you wouldn't need to redesign everything if the code reading o_valid assumes it's 1 when o_valid doesn't exist.
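That default-to-valid convention can be sketched in plain Python (the attribute lookup is the whole trick; `read_o_valid` and the stage classes are invented names):

```python
def read_o_valid(stage):
    """Backwards-compatible convention suggested above: if a
    combinatorial block does not define o_valid, treat it as
    always-valid (1), so existing blocks need no redesign."""
    return getattr(stage, "o_valid", 1)

class LegacyStage:
    """No o_valid attribute: assumed always valid."""
    pass

class GatedStage:
    """Explicitly signals that its output is not currently valid."""
    o_valid = 0
```

The ~50 existing blocks keep working unmodified; only blocks that opt in to gating grow the extra attribute.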

> ... and then have some way of identifying when StageChain is used and when
> pipeline registers are used.

stagechain just copies o_valid from the last stage -- simple.

> Total mess, huge amount of work, and getting rid of the clean isolation and
> transparent data independence in the process.

In my opinion, having all the combinatorial blocks required to be conditional is a more flagrant violation of cleanliness.

> InputMux and OutputMux are bad enough as it is, reading ctx.muxid. These are
> literally the only classes that break the rule of total separation between
> data and data passing, and even then, there is no combinatorial block that
> ever does anything other than pass ctx through.
> 
> That rule is just about to get broken with early in and early out bypassing,
> but that's another story.

note that for the div pipe, since the first compute stage is in the same pipeline stage as the actual numbers are input, early-in would not be needed for integers or fp.

> > this is because saving power occurs in cmos when the smallest number of
> > gates switches from their previous state to a new state. switching all the
> > gates to 0 or 1 actually uses lots of power in the clock cycle transitioning
> > from in-use to idle.
> 
> It means adding an external "stall" capability, in effect.  6600 pipelines
> *never* stall. Saving power however is a good reason to break that rule.

just think of it as stages that don't have valid data have X for the data lines, the X is just selected to be the previous value that was in the flip-flops since that saves power.

> Or, in the case of the ReservationStations system: if the (unary) muxid is
> nonzero, data shall be passed along the register pipeline chain, and left
> alone if not.
> 
> That would do the trick. Only data that had a (unary) muxid would be
> propagated.  It would require that the muxid be always (unconditionally)
> propagated down the registers, whilst the data was *conditionally*
> propagated.
> 
> 
> > In my opinion, this should be done outside the div core, it should be done
> > in fpdiv/pipeline.py, which is where the pipeline registers are actually
> > created.
> 
> Sort-of, by using SimpleHandshake, yes. This needs to be replaced with a
> unary-muxid-understanding class with cancellation capability.  Will discuss
> on the other bugreport.
> 
> 
> > if debugging, all you have to do is view the data-valid signal as well as
> > the data signals,
> 
> What happens in fpmul and fpadd, they check out_do_z.  Zeros get propagated
> down the chain. Nothing changes.

if you want divcore to output zeros for zero inputs, just swap the values in DivPipeCoreOperation so SqrtRem is numbered 0. this works since sqrt(0) == 0. div doesn't do that since it solves for the largest q that satisfies dividend >= divisor * q; when divisor is 0, obviously the largest q is all 1s.

this will also have the benefit of finding cases where we are still using magic numbers instead of the named values, since those places will break.
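A sketch of the renumbering and of the div-by-zero reasoning above (the enum values and the member name RSqrtRem are assumptions for illustration; `largest_q` is a brute-force check, not the hardware algorithm):

```python
from enum import IntEnum

class DivPipeCoreOperation(IntEnum):
    """Proposed renumbering: SqrtRem takes value 0 so an all-zero idle
    input selects sqrt, and sqrt(0) == 0 keeps the pipeline quiet."""
    SqrtRem = 0    # sqrt(0) == 0: zeroed inputs give zeroed outputs
    UDivRem = 1
    RSqrtRem = 2   # assumed third member, for illustration

def largest_q(dividend, divisor, width):
    """Div solves for the largest q in [0, 2**width) satisfying
    dividend >= divisor * q; with divisor == 0 every q qualifies,
    so the answer is all 1s."""
    q = 0
    for candidate in range(2 ** width):
        if dividend >= divisor * candidate:
            q = candidate
    return q
```

So a zeroed idle input into a div-numbered-zero core produces all-1s garbage, whereas a sqrt-numbered-zero core stays at zero.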
Comment 4 Luke Kenneth Casson Leighton 2019-07-30 01:04:52 BST
p.s. in case it wasn't obvious: what you said about the data not changing was important. Previously I wanted a no-stall, non-conditional pipeline design; that has to change, with the unary muxid being what decides whether the pipeline registers change or not.
Comment 5 Luke Kenneth Casson Leighton 2019-07-30 01:25:23 BST
(In reply to Jacob Lifshay from comment #3)
> (In reply to Luke Kenneth Casson Leighton from comment #2)

Arg, I just managed to delete something I wrote and can't get it back. It was to do with how StageChain has no knowledge of or access to the Data *Handling* side.

This is thoroughly deliberate: it is precisely this which allows combinatorial blocks to be chained either with StageChain *or* as Pipeline blocks, and for the very same combinatorial blocks to NEVER have to worry about which or how they are used.

The moment that the combinatorial blocks are given access to o_valid (etc) that's it, it's game over, they can never be combinatorially chained together with StageChain, not without a total code rewrite.

> note that for the div pipe, since the first compute stage is in the same
> pipeline stage as the actual numbers are input, early-in would not be needed
> for integers or fp.

Except that the specialcases need to be bypassed, as does the "denorm" stage (which deals with subnormals). See diagram here http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpdiv/pipeline.py;h=499bd79c4373ce7ecfe2ada31f12fd39c58f4830;hb=refs/heads/master

The first pipeline stage is what needs bypassing, dropping the INT data directly into the DIV pipe.

> > It means adding an external "stall" capability, in effect.  6600 pipelines
> > *never* stall. Saving power however is a good reason to break that rule.
> 
> just think of it as stages that don't have valid data have X for the data
> lines, the X is just selected to be the previous value that was in the
> flip-flops since that saves power.

Yehyeh. Good perspective.

> > 
> > What happens in fpmul and fpadd, they check out_do_z.  Zeros get propagated
> > down the chain. Nothing changes.
> 
> if you want divcore to output zeros for zero inputs, just swap the values in
> DivPipeCoreOperation so SqrtRem is numbered 0. this works since sqrt(0) ==
> 0. div doesn't do that since it solves for the largest q that satisfies
> dividend >= divisor * q, when divisor is 0, obviously the largest q is all
> 1s.

Okaay. That would work.

> this will also have the benefit of finding cases where we are still using
> magic numbers instead of the named values, since those places will break.

Shouldn't be any.
Comment 6 Luke Kenneth Casson Leighton 2019-07-30 01:36:25 BST
(In reply to Jacob Lifshay from comment #3)
> (In reply to Luke Kenneth Casson

> stagechain just copies o_valid from the last stage -- simple.

StageChain is at the stage level, not the Data Handling API level.

StageChain has zero knowledge of o_valid (etc) because those are for *register* handling, *not* combinatorial block handling.

There are two APIs.

1. Stage API. It handles combinatorial blocks ONLY.

2. Data Handling API. It handles o_valid etc and does pipeline registers ONLY.

The combination of those two is what gives the pipeline API. Aside from the annoying muxid exception, they are entirely opaque to each other.

If we added o_valid etc to the *Stage* API, which are currently in the Data *Handling* API and completely separated from and absolutely nothing to do with the *Stage* API, it would be a massive redesign of the entire pipeline API, breaking several fundamental design aspects.
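The separation of the two APIs can be sketched behaviourally (class and method names here are a simplified model, not the project's exact Stage API signatures):

```python
class AddOne:
    """Stage API sketch: purely combinatorial, no registers, no
    knowledge of o_valid."""
    def process(self, i):
        return i + 1

class StageChain:
    """Chains combinatorial stages; the result is itself just a Stage,
    still with zero knowledge of validity or registers."""
    def __init__(self, stages):
        self.stages = stages
    def process(self, i):
        for s in self.stages:
            i = s.process(i)
        return i

class PipeRegister:
    """Data Handling API sketch: owns the register AND the o_valid
    signal, wrapping a Stage which never sees either."""
    def __init__(self, stage):
        self.stage, self.q, self.o_valid = stage, 0, 0
    def clock(self, i, i_valid):
        if i_valid:
            self.q = self.stage.process(i)
        self.o_valid = i_valid

chain = StageChain([AddOne(), AddOne()])   # combinatorial chaining...
reg = PipeRegister(chain)                  # ...or the same stage, registered
reg.clock(5, 1)
```

The same combinatorial block works unmodified in both roles precisely because o_valid lives only on the Data Handling side.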

Anyway. I think the idea of swapping SQRT with DIV would do the trick I was looking for.
Comment 7 Jacob Lifshay 2019-07-30 01:45:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #3)
> > (In reply to Luke Kenneth Casson Leighton from comment #2)
> 
> Arg I just managed to delete sonething I wrote and can't get it back. It was
> to do with how StageChain has no knowledge of or access to the Data
> *Handling* side.

fun :(

> This is thoroughly deliberate: it is precisely this which allows
> combinatorial blocks to be chained either with StageChain *or* as Pipeline
> blocks, and for the very same combinatorial blocks to NEVER have to worry
> about which or how they are used.
> 
> The moment that the combinatorial blocks are given access to o_valid (etc)
> that's it, it's game over, they can never be combinatorially chained
> together with StageChain, not without a total code rewrite.

ok. yeah, having o_valid just go through the pipeline blocks would be better.

> > note that for the div pipe, since the first compute stage is in the same
> > pipeline stage as the actual numbers are input, early-in would not be needed
> > for integers or fp.
> 
> Except that the specialcases need to be bypassed, as does the "denorm" stage
> (which deals with subnormals). See diagram here
> http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpdiv/
> pipeline.py;h=499bd79c4373ce7ecfe2ada31f12fd39c58f4830;hb=refs/heads/master
> 
> The first pipeline stage is what needs bypassing, dropping the INT data
> directly into the DIV pipe.

I would move both FPDivStage0Mod and DivPipeSetupStage into the same stage as denormalization. that pipeline stage would handle sign flipping for signed integers as a parallel combinatorial unit.

I would also move DivPipeFinalStage and FPDivStage2Mod into the same stage as normalization, that pipeline stage would handle sign flipping and special cases (overflow and div-by-zero) for integers.

this allows all the interior stages to be only DivPipeCoreCalculate stages, since those are the ones with the big propagation times.
Comment 8 Luke Kenneth Casson Leighton 2019-07-30 03:18:25 BST
(In reply to Jacob Lifshay from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > 
> > The moment that the combinatorial blocks are given access to o_valid (etc)
> > that's it, it's game over, they can never be combinatorially chained
> > together with StageChain, not without a total code rewrite.
> 
> ok. yeah, having o_valid just go through the pipeline blocks would be better.

It means moving stuff to reduce gate latency is reeeal easy.

> > The first pipeline stage is what needs bypassing, dropping the INT data
> > directly into the DIV pipe.
> 
> I would move both FPDivStage0Mod and DivPipeSetupStage into the same stage
> as denormalization. 

Hm, hm, the subnormal alignment (making sure the MSB is 1) is a count-leading-zeros plus a single clock multishifter. It's quite a high gate latency.

> that pipeline stage would handle sign flipping for
> signed integers as a parallel combinatorial unit.

That's another possible approach, not discussed before: combinatorial stages that (similar to how muxid works) generate decision-routing output that results in data going to *completely different* blocks.

> I would also move DivPipeFinalStage and FPDivStage2Mod into the same stage
> as normalization, that pipeline stage would handle sign flipping and special
> cases (overflow and div-by-zero) for integers.

Separate block, routed based on ctx.operator.

> this allows all the interior stages to be only DivPipeCoreCalculate stages,
> since those are the ones with the big propagation times.

With INT pipes (FP32/64) on radix 16, that's divisible.

On radix 8 (3 bits at a time) neither INT nor FP will divide evenly, leaving room to put pre and post processing with no gate latency penalties.

Much as I would like all this to be automated, I don't think it can be. We just have to patiently go through and plan the interconnect.

With INT, FP, SIMD, Cancellation and sharing of the alignment/pre/post-normalisation *and* doing multiple operations (ADD, MUL, DIV) it's going to be a handful :)