Bug 267 - The efficiency of adder/subtractor
Summary: The efficiency of adder/subtractor
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU)
Version: unspecified
Hardware: PC Windows
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks:
 
Reported: 2020-03-26 22:05 GMT by Jock Tanner
Modified: 2020-04-07 15:44 BST

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:



Description Jock Tanner 2020-03-26 22:05:29 GMT
From the task http://bugs.libre-riscv.org/show_bug.cgi?id=217

(In reply to Luke Kenneth Casson Leighton from comment #42)
> (In reply to Jock Tanner from comment #41)
> > I think most ALUs
> > just negate and add instead of subtract. I think this would use a bit less
> > logic, and negation in itself is also a useful operation.
> 
> yes. in the "real" alu we have... mmm... not sure, actually.  probably -
> 
> do me a favour and raise a bugreport as a reminder to look up what
> rocketchip does?

Actually I was thinking about what Harris & Harris suggest in their famous book: you only need an inverter on one operand input of the adder, and the two's complement of that operand is completed via the carry input.
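
A rough Python model of the idea (purely illustrative, names are mine):

# one adder handles both add and subtract: for subtraction, invert the
# second operand and set carry-in to 1, which completes ~b + 1 = -b.
def add_sub(a, b, sub, width=64):
    mask = (1 << width) - 1
    b_in = (~b & mask) if sub else (b & mask)   # optional inverter on operand b
    cin = 1 if sub else 0                       # carry-in finishes the two's complement
    total = (a & mask) + b_in + cin
    return total & mask, total >> width         # result, carry-out

assert add_sub(10, 3, sub=True)[0] == 7
assert add_sub(3, 10, sub=True, width=8)[0] == (3 - 10) & 0xff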
Comment 1 Luke Kenneth Casson Leighton 2020-03-26 23:10:07 GMT
(In reply to Jock Tanner from comment #0)

> Actually I was thinking about what Harris&Harris suggested in their famous
> book. You just have to have an inverter on one operand input of an adder.
> The two's complement of that operand can be achieved via the carry input.

yes, interestingly, michael and i are reviewing various implementations
in the simulator (microwatt, qemu, pear_pc), and they all seem to do this:
http://bugs.libre-riscv.org/show_bug.cgi?id=186#c130

it looks like power was extremely well designed, and microwatt as well,
to basically "translate" opcodes into "everything is an ADD, but you
optionally do add1, add0, add-carry, occasionally invert op1,
oh and farm out carry to different places, or set carry to 0, or 1" etc. etc.

it's quite neat.
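
roughly, in (invented, illustrative) python the decode amounts to a little table
like this - not the real microwatt tables, just the shape of the idea:

# everything is an ADD: each op just selects whether an operand is
# inverted and where carry-in comes from ('0', '1', or 'CA' = carry flag).
# op names loosely follow power mnemonics; fields are simplified.
OPS = {
    'add':   (False, '0'),
    'subf':  (True,  '1'),    # subtract-from: b - a == ~a + b + 1
    'adde':  (False, 'CA'),   # add extended: a + b + CA
    'subfe': (True,  'CA'),   # ~a + b + CA
}

def alu(op, a, b, ca=0, width=64):
    mask = (1 << width) - 1
    invert_a, cin_sel = OPS[op]
    a_in = (~a & mask) if invert_a else (a & mask)
    cin = ca if cin_sel == 'CA' else int(cin_sel)
    total = a_in + (b & mask) + cin
    return total & mask, total >> width   # result, carry-out

assert alu('subf', 5, 20)[0] == 15        # 20 - 5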
Comment 2 Jock Tanner 2020-04-07 06:23:01 BST
Another thing that is slightly bugging me here is lookahead carry units. Should we reconfigure them "on the fly" together with the adders/subtractors? Is that viable? Multiplexers can introduce enough delay to make the whole lookahead scheme inefficient compared to simple serial carry propagation.
Comment 3 Luke Kenneth Casson Leighton 2020-04-07 10:48:44 BST
(In reply to Jock Tanner from comment #2)

> Another thing that is slightly bugging me here is lookahead carry units.

you'll need to send me some links so i can take a look at what you mean.

> Should we reconfigure them "on the fly" together with adders/subtractors? Is
> that viable? Multiplexers can introduce enough delay to make the whole
> looking-ahead thing inefficient comparing to simple serial carry propagation.

if you can put some links here as to what you're comparing (descriptions,
tutorials, gate-level diagrams) i can get up to speed and follow the
conversation.
Comment 4 Jock Tanner 2020-04-07 12:39:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> if you can put some links here as to what you're comparing (descriptions,
> tutorials, gate-level diagrams) i can get up to speed and follow the
> conversation.

I mostly rely on the college textbook by David & Sarah Harris, “Digital Design and Computer Architecture”. I have only read it in Russian, though. The book has a very detailed explanation of the carry-lookahead adder, with truth tables and schematics.

Some of this can also be found on Wikipedia [1].

Very roughly speaking, an N-bit ripple-carry adder takes a*N transistors and b*N time for calculation, while a straight-out carry-lookahead adder takes a^N transistors and b time. AFAIK in practical ALU design they always make a compromise between the two. And I just wonder what kind of compromise could be suitable for a *reconfigurable*, high-speed ALU.

I imagine something cascadable, like 74181 bit-slice ALUs + 74182 carry generators [2], but with 'plexors in between. But the speed of such a design may be questionable. (A toy model of the 4-bit lookahead itself follows the links below.)

[1] https://en.wikipedia.org/wiki/Adder_(electronics)
[2] https://en.wikipedia.org/wiki/74181
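
To make the lookahead logic concrete, here is a toy Python model of a flat 4-bit carry-lookahead stage (74182-ish); purely illustrative, names are mine:

# all four carries are computed from generate/propagate terms instead of
# rippling; in hardware each carry flattens into a two-level AND/OR.
def cla4(a, b, cin):
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]        # generate
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]      # propagate
    c = [cin]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))   # written as a recurrence here,
                                         # flattened (no ripple) in hardware
    s = sum((((a >> i) ^ (b >> i) ^ c[i]) & 1) << i for i in range(4))
    return s, c[4]                       # 4-bit sum, carry-out

assert cla4(0b1011, 0b0110, 0) == (0b0001, 1)   # 11 + 6 = 17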
Comment 5 Luke Kenneth Casson Leighton 2020-04-07 12:48:38 BST
(In reply to Jock Tanner from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > if you can put some links here as to what you're comparing (descriptions,
> > tutorials, gate-level diagrams) i can get up to speed and follow the
> > conversation.
> 
> I mostly rely on a college textbook by David & Sarah Harris, named “Digital
> Design and Computer Architecture”. I read it only in Russian though. The
> book has a very detailed explanation of carry-lookahead adder, with truth
> tables and schematics.

interesting.  well, what we will be relying on is the yosys "synth" command
turning things from the high-level "we want an N-bit adder" into the low-level
"these 4-bit adders etc. etc. get created".

we are pretty much trusting yosys "synth" to have the optimal, state-of-the-art,
industry-recognised "thing".


> Very roughly speaking, N-bit ripple-carry adder takes a*N transistors and
> b*N time for calculation, while straight-out carry-lookahead adder takes a^N
> transistors and b time. AFAIK in practical ALU design they always make a
> compromise between the two. And I just wonder what kind of compromise could
> be suitable for *reconfigurable*, high-speed ALU.

i honestly have no idea! :)

the basic units will actually be a 64+8-bit add (yes, 72-bit).

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_mul_add/adder.py;h=e1849b4d25fc5ec4fc4473cc02b32f49dd5912b2;hb=77f150cff440bed025e2487b6f0fcda9f529290b#l108

see the comment description there of how it works

the "partitions" are done via the insertion of those "extra bits".

those "extra bits" will be set to 0 to create a partition, and "1"
to get a roll-over.


so we're not *actually* "partitioning" into completely separate 8-bit adders.

therefore i would hazard a guess that the (underlying 72-bit) adders are
properly optimised by yosys "synth" command to give a decent fast adder,
thus giving us the "best of both worlds".

in other words, a quick analysis leads me to believe that what we're using
is *not* adversely impacted just because it's a partitioned adder.

it has the *functionality* of a partitioned adder but is in fact a 72-bit
"straight" (normal) add.
Comment 6 Jock Tanner 2020-04-07 14:15:46 BST
The idea is so simple yet so out-of-the-box! I feel like I'm in a Western movie and my partner is laying out a get-rich-quick scheme:

− And where's the catch?
− There is no catch.

BTW is it patented?

(In reply to Luke Kenneth Casson Leighton from comment #5)
> https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/
> part_mul_add/adder.py;h=e1849b4d25fc5ec4fc4473cc02b32f49dd5912b2;
> hb=77f150cff440bed025e2487b6f0fcda9f529290b#l108
> 
> see the comment description there of how it works
> 
> the "partitions" are done via the insertion of those "extra bits".
> 
> those "extra bits" will be set to 0 to create a partition, and "1"
> to get a roll-over.
> 
> 
> so we're not *actually* "partitioning" into completely separate 8-bit adders.
> 
> therefore i would hazard a guess that the (underlying 72-bit) adders are
> properly optimised by yosys "synth" command to give a decent fast adder,
> thus giving us the "best of both worlds".
Comment 7 Jacob Lifshay 2020-04-07 14:36:20 BST
(In reply to Jock Tanner from comment #4)
> Very roughly speaking, N-bit ripple-carry adder takes a*N transistors and
> b*N time for calculation, while straight-out carry-lookahead adder takes a^N
> transistors and b time. AFAIK in practical ALU design they always make a
> compromise between the two. And I just wonder what kind of compromise could
> be suitable for *reconfigurable*, high-speed ALU.
> 
> I imagine something cascadable, like 74181 bit-slice ALUs + 74182 carry
> generators [2], but with 'plexors in between. But the speed of such a design
> may be questionable.

I haven't checked which specific design yosys produces for an adder, but I know there are divide-and-conquer carry-lookahead designs with O(N*log(N)) area and O(log(N)) delay (ignoring delay due to wire length), so that's much better than the above a^N value.

I found a paper with an overview of different carry lookahead adder designs:
https://www.ijrte.org/wp-content/uploads/papers/v7i5s4/E10920275S419.pdf

In fact, I'm planning on using the same kind of recursive divide-and-conquer structure for purposes other than addition, such as finding the first set bit or computing a prefix sum.
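
A rough Python sketch of the parallel-prefix scan idea (illustrative only, not what yosys actually generates); carry lookahead is the same scan applied to (generate, propagate) pairs instead of plain numbers:

# log2(N) layers of an associative "combine" give every prefix result;
# the same structure works for prefix-sum, first-set-bit (prefix-OR),
# and carry lookahead ((generate, propagate) pairs).
def parallel_prefix(values, combine):
    out = list(values)
    dist = 1
    while dist < len(out):                            # log2(N) layers
        nxt = list(out)
        for i in range(dist, len(out)):
            nxt[i] = combine(out[i - dist], out[i])   # combine lower element with upper
        out = nxt
        dist *= 2
    return out

# prefix sum
assert parallel_prefix([3, 1, 4, 1, 5], lambda l, r: l + r) == [3, 4, 8, 9, 14]

# first set bit: prefix-OR, then the first position where it becomes 1
bits = [0, 0, 1, 0, 1, 0]
seen = parallel_prefix(bits, lambda l, r: l | r)
assert seen == [0, 0, 1, 1, 1, 1] and seen.index(1) == 2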
Comment 8 Jacob Lifshay 2020-04-07 14:38:46 BST
(In reply to Jock Tanner from comment #6)
> The idea is so simple yet so out-of-the-box! I feel like I'm in a Western
> movie, while my partner is devising a get-rich-fast scheme:
> 
> − And where's the catch?
> − There is no catch.
> 
> BTW is it patented?

Not that I'm aware of; I invented it.
Comment 9 Luke Kenneth Casson Leighton 2020-04-07 15:44:36 BST
(In reply to Jock Tanner from comment #6)
> The idea is so simple yet so out-of-the-box! I feel like I'm in a Western
> movie, while my partner is devising a get-rich-fast scheme:
> 
> − And where's the catch?
> − There is no catch.

a lot of what we're doing is like that.  the 6600 scoreboards are *very*
badly misunderstood and drastically underestimated right across the
entire industry (thanks largely to Patterson's book on RISC architectures).

consequently what's in Thornton's book "Design of a Computer" (google it)
has had to be "reinvented" and often comes attached with unnecessary baggage
such as "register renamers".

then again there are areas where we are "functional" but need serious
improvement.  we need a Dadda multiplier algorithm for example, because
we currently have a Wallace Tree one.
https://bugs.libre-soc.org/show_bug.cgi?id=136