Bug 137 - NLNet 2019 Video Acceleration Proposal
Summary: NLNet 2019 Video Acceleration Proposal
Status: CONFIRMED
Alias: None
Product: Libre Shakti M-Class
Classification: Unclassified
Component: Milestones
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-riscv.org/vpu/ https://...
Depends on: 159
Blocks:
Reported: 2019-09-23 09:36 BST by Luke Kenneth Casson Leighton
Modified: 2020-01-25 18:24 GMT

See Also:
NLnet milestone: NLNet.2019.Video
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for completion of task (excludes budget allocated to subtasks): 0
parent task for budget allocation:
child tasks for budget allocation:


Attachments

Description Luke Kenneth Casson Leighton 2019-09-23 09:36:54 BST
To add video acceleration to the Libre RISC-V SoC, upstreamed into
ffmpeg, gstreamer, libswscale, libh264, libh265 and other libraries.
https://libre-riscv.org/nlnet_2019_video/

https://libre-riscv.org/vpu/

Audio
* bug #NN, MP3
* bug #NN, AC3
* bug #NN, Vorbis
* bug #NN, Opus

Video
* bug #NN, MJPEG (JPEG)
* bug #NN, MPEG1/2
* bug #NN, MPEG4 ASP (xvid)
* bug #NN, H.264
* bug #NN, H.265
* bug #NN, VP8
* bug #NN, VP9
* bug #NN, AV1

Opcodes
* rgb/bgr24 (TBD in 3D GPU or in this one?)
* rgbx/bgrx/xrgb/xbgr32 (TBD in 3D GPU or in this one?)
* nv12 (TBD in 3D GPU or in this one?)
* nv21 (TBD in 3D GPU or in this one?)
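
As an illustration of what the nv12/nv21 opcodes would compute, here is a small python reference model. The colour matrix below is BT.601 full-range, which is an assumption: the real choice of matrix and range is per-codec and still TBD, like the opcodes themselves.

```python
# Sketch of what an "nv12 -> rgb" conversion opcode would compute for one
# 2x2 block of pixels.  Assumes BT.601 full-range coefficients; the actual
# matrix (BT.601/BT.709, limited/full range) would be decided per-codec.

def clamp8(x):
    """Saturate to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, int(round(x))))

def yuv_to_rgb(y, u, v):
    """Convert one luma sample plus shared chroma to an (r, g, b) triple."""
    u -= 128
    v -= 128
    r = clamp8(y + 1.402 * v)
    g = clamp8(y - 0.344136 * u - 0.714136 * v)
    b = clamp8(y + 1.772 * u)
    return (r, g, b)

def nv12_block_to_rgb(y_plane, uv_plane):
    """NV12: full-resolution Y plane plus a half-resolution interleaved
    UV plane.  One (U, V) pair is shared by a 2x2 block of Y samples."""
    u, v = uv_plane[0], uv_plane[1]
    return [yuv_to_rgb(y, u, v) for y in y_plane]

# Grey pixel: Y=128, U=V=128 (no chroma) gives mid-grey RGB.
print(nv12_block_to_rgb([128, 128, 128, 128], [128, 128]))
```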

Simulator
* bug #NN discuss and add opcode(s) proposed by lauri
* bug #NN set up unit tests for opcodes under simulator

note: this is where the iterative loop comes in.  there will be several rounds, adding different opcodes to try out.
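
A sketch of what one such unit test could look like. The simulator API does not exist yet, so the "opcode" here is a stand-in: a python reference model of an 8-bit unsigned saturating add (a typical video-codec primitive), checked with plain unittest.

```python
# Hypothetical shape of an opcode unit test under the simulator.  The
# reference model below would eventually be compared against the
# simulator's execution of the real opcode.
import unittest

def ref_addu8_sat(a, b):
    """Reference model: 8-bit unsigned saturating add."""
    return min(a + b, 255)

class TestAddu8Sat(unittest.TestCase):
    def test_no_overflow(self):
        self.assertEqual(ref_addu8_sat(100, 100), 200)

    def test_saturates(self):
        self.assertEqual(ref_addu8_sat(200, 100), 255)

    def test_edges(self):
        self.assertEqual(ref_addu8_sat(0, 0), 0)
        self.assertEqual(ref_addu8_sat(255, 255), 255)

# run with:  python3 -m unittest <this file>
```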

Hardware
* bug #NN implement opcodes in hardware
* bug #NN run unit tests under FPGA
* bug #NN run full OS (VLC?) demo under FPGA

todo, edit this comment and list a series of tasks to assign budgets to.  then, create bug reports for each.  see bug #48 for a template

TODO, subdivide these into smaller tasks (discuss below) so that reasonably accurate budgetary amounts can be assigned to them.  slight overestimation (10 to 15% or so) is recommended (and acceptable).
Comment 1 cand 2020-01-24 08:27:26 GMT
https://libre-riscv.org/vpu/

Audio
* bug #NN, MP3
* bug #NN, AC3
* bug #NN, Vorbis
* bug #NN, Opus

Video
* bug #NN, MJPEG (JPEG)
* bug #NN, MPEG1/2
* bug #NN, MPEG4 ASP (xvid)
* bug #NN, H.264
* bug #NN, H.265
* bug #NN, VP8
* bug #NN, VP9
* bug #NN, AV1

todo, edit this comment and list a series of tasks to assign budgets to.  then, create bug reports for each.  see bug #48 for a template

TODO, subdivide these into smaller tasks (discuss below) so that reasonably accurate budgetary amounts can be assigned to them.  slight overestimation (10 to 15% or so) is recommended (and acceptable).
Comment 2 cand 2020-01-24 08:48:06 GMT
Each codec then has these phases:
- research
- for each hotspot, implementation
- for each target library, upstreaming

HW implementations of new instructions would be later, once the instructions are known.
Comment 3 Luke Kenneth Casson Leighton 2020-01-24 09:07:07 GMT
(In reply to cand from comment #2)
> Each codec then has these phases:
> - research
> - for each hotspot, implementation
> - for each target library, upstreaming

ok great, do you have an estimate of time (and the budget you'd like to receive) for each? 1 week research, 2 weeks impl, 3 days upstream coordination, that sort of thing?

we can subdivide later (3 sub-bugs per top-level bug) if you would like part-payment; however, that is for later.

the focus now is to identify toplevel and assign budgets. 

> HW implementations of new instructions would be later, once the instructions
> are known.

yes.  or, more to the point: you advise us what you would like, then we implement them in a simulator (we also have to budget for how to run the code under that, btw - it may be that we only run a subset of the code, say, only the algorithm or a unit test rather than full VLC or something)

then after the cycles/sec is confirmed *then* we implement that opcode in hw and finally actually run under an FPGA.  this will be much later, at the end of the process.
Comment 4 cand 2020-01-24 09:22:26 GMT
Each codec is of different complexity. The audio codecs usually only have a single hotspot, while at the other end AV1 has several dozen. I'll do a quick pass later, to get rough figures on those.

I thought the simulator would be part of the implementation loop?
Comment 5 Luke Kenneth Casson Leighton 2020-01-24 10:06:38 GMT
(In reply to cand from comment #4)
> Each codec is of different complexity. The audio codecs usually only have a
> single hotspot, while at the other end AV1 has several dozen.

thought so.

> I'll do a
> quick pass later, to get rough figures on those.

great.
 
> I thought the simulator would be part of the implementation loop?

hmmm yes, however think about it: several CODECs will share the same opcodes.  you don't make a YUV2RGB opcode for VP9 and a different one for MPEG :)

so i was kinda leaning towards them being on their own (aggregated) iterative cycle, if you know what i mean.

if we can get a rough idea in advance of the sorts of opcodes needed, it would be very handy.  bear in mind that for the most part they need to be "scalar" in nature, because the Vector System adds a hardware for-loop on top *of* scalar operations.

then those can also be analysed to estimate a simulator implementation timescale, a hw timescale, and a budget as well.

we are not going to be able to predict exactly everything here, that is what the iterations are for.  we just need a start.
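
the scalar-plus-hardware-for-loop idea above can be sketched like this (names are illustrative only, not actual Libre RISC-V mnemonics):

```python
# Sketch of the "hardware for-loop on top *of* scalar operations" idea:
# the ISA defines only a scalar primitive; the Vector System applies it
# element-by-element under a vector length (VL) setting.  The same scalar
# op then serves every codec - no per-codec YUV2RGB or clip opcodes.

def scalar_clip8(x):
    """Scalar primitive: clamp a value into 0..255 (pixel reconstruction
    needs this in essentially every codec)."""
    return max(0, min(255, x))

def vectorised(scalar_op):
    """The hardware for-loop: lift any scalar op to VL elements."""
    def vec_op(values, vl):
        return [scalar_op(v) for v in values[:vl]]
    return vec_op

vclip8 = vectorised(scalar_clip8)
print(vclip8([-5, 0, 130, 300, 999], vl=4))  # -> [0, 0, 130, 255]
```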
Comment 6 cand 2020-01-24 11:23:26 GMT
Weren't the colorspace conversions part of the GPU milestone? That's what I understood from the ML earlier.
Comment 7 Luke Kenneth Casson Leighton 2020-01-24 11:29:50 GMT
(In reply to cand from comment #6)
> Weren't the colorspace conversions part of the GPU milestone? That's what I
> understood from the ML earlier.

yes good point, so we need to make sure not to double-allocate budget.
Comment 8 cand 2020-01-24 19:49:17 GMT
Rough relative complexities:

MP3                     1       1%
AC3                     1       1%
Vorbis                  1       1%
Opus                    1       1%

MJPEG (JPEG)            2       2%
MPEG1/2                 2       2%
MPEG4 ASP (xvid)        4       5%
H.264                   10      11%
H.265                   20      23%
VP8                     8       9%
VP9                     10      11%
AV1                     28      32%

This doesn't translate well to budget, though; there's no sense in spending a third of it on AV1. Perhaps a more sensible goal would be to target the largest hotspots of each, with only smaller budget differences due to complexity.

Another point to consider is that while ffmpeg is the prime lib, parts of accel code made for ffmpeg aren't really usable in the various standalone libs. Different structures, etc. In order to not write things twice, some decisions need to be made on which upstreams particularly matter.
Comment 9 Luke Kenneth Casson Leighton 2020-01-25 11:30:49 GMT
(In reply to cand from comment #8)

> This doesn't translate well to budget though, no sense in spending a third
> on AV1. Perhaps a more sensible goal would be to target the largest hot
> spots of each, with only smaller budget differences due to complexity.

yes.  and, during later iterations, do some more.
 
> Another point to consider is that while ffmpeg is the prime lib, parts of
> accel code made for ffmpeg aren't really usable in the various standalone
> libs. Different structures, etc. In order to not write things twice, some
> decisions need to be made on which upstreams particularly matter.

well, ultimately, gstreamer has an ffmpeg plugin, ffmpeg has a gstreamer plugin, vdpau has a vaapi plugin, vaapi has a vdpau plugin: it's all circular [1] and up its own backside [2], so whichever we pick is good :)

whichever route would be easiest for you, let's go with that.

[1] yes i managed to install both vdpau and vaapi recursively, once, whoops...
[2] the beatles "yellow submarine" film demonstrates this well
Comment 10 cand 2020-01-25 18:24:13 GMT
Okay, then I'd say ffmpeg for everything except av1 (dav1d) and jpeg (libjpeg-turbo).

Time and budget: your earlier comment of 1 week research, 2 weeks impl, 3 days upstream coordination is fairly on point for one hotspot (or a couple of smaller ones). For the later iterations only the impl phase would be budgeted.

I'd say 400e/wk, so 400 for research, 800 for one impl iteration, and 240 for the upstream part. I don't know how difficult the fpga side is, or how much should be budgeted for that; IIRC you also said the entire amount has to be used this year, or it'd be lost. This is a starting point for discussion, anyway.
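
A quick arithmetic check of those figures, assuming a 5-day working week at the quoted 400 EUR/week rate:

```python
# Per-hotspot budget, first iteration, at 400 EUR/week (5-day week assumed).
RATE_PER_WEEK = 400

research = 1 * RATE_PER_WEEK            # 1 week  -> 400
impl     = 2 * RATE_PER_WEEK            # 2 weeks -> 800
upstream = int((3 / 5) * RATE_PER_WEEK) # 3 days  -> 240

print(research, impl, upstream)         # 400 800 240
print(research + impl + upstream)       # 1440 EUR per hotspot
```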