Bug 137 - NLNet 2019 Video Acceleration Proposal
Summary: NLNet 2019 Video Acceleration Proposal
Status: CONFIRMED
Alias: None
Product: Libre Shakti M-Class
Classification: Unclassified
Component: Milestones
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-riscv.org/vpu/ https://...
Depends on: 159
Blocks:
Reported: 2019-09-23 09:36 BST by Luke Kenneth Casson Leighton
Modified: 2020-01-25 18:24 GMT

See Also:
NLnet milestone: NLNet.2019.Video
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for completion of task (excludes budget allocated to subtasks): 0
parent task for budget allocation:
child tasks for budget allocation:


Attachments

Description Luke Kenneth Casson Leighton 2019-09-23 09:36:54 BST
To add video acceleration to the Libre RISC-V SoC, upstreamed into
ffmpeg, gstreamer, libswscale, libh264, libh265 and other libraries.
https://libre-riscv.org/nlnet_2019_video/

https://libre-riscv.org/vpu/

Audio
* bug #NN, MP3
* bug #NN, AC3
* bug #NN, Vorbis
* bug #NN, Opus

Video
* bug #NN, MJPEG (JPEG)
* bug #NN, MPEG1/2
* bug #NN, MPEG4 ASP (xvid)
* bug #NN, H.264
* bug #NN, H.265
* bug #NN, VP8
* bug #NN, VP9
* bug #NN, AV1

Opcodes
* rgb/bgr24 (TBD in 3D GPU or in this one?)
* rgbx/bgrx/xrgb/xbgr32 (TBD in 3D GPU or in this one?)
* nv12 (TBD in 3D GPU or in this one?)
* nv21 (TBD in 3D GPU or in this one?)
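
As an illustration of what the nv12/nv21 opcodes would compute, here is a small python reference model. The colour matrix below is BT.601 full-range, which is an assumption: the real choice of matrix and range is per-codec and still TBD, like the opcodes themselves.

```python
# Sketch of what an "nv12 -> rgb" conversion opcode would compute for one
# 2x2 block of pixels.  Assumes BT.601 full-range coefficients; the actual
# matrix (BT.601/BT.709, limited/full range) would be decided per-codec.

def clamp8(x):
    """Saturate to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, int(round(x))))

def yuv_to_rgb(y, u, v):
    """Convert one luma sample plus shared chroma to an (r, g, b) triple."""
    u -= 128
    v -= 128
    r = clamp8(y + 1.402 * v)
    g = clamp8(y - 0.344136 * u - 0.714136 * v)
    b = clamp8(y + 1.772 * u)
    return (r, g, b)

def nv12_block_to_rgb(y_plane, uv_plane):
    """NV12: full-resolution Y plane plus a half-resolution interleaved
    UV plane.  One (U, V) pair is shared by a 2x2 block of Y samples."""
    u, v = uv_plane[0], uv_plane[1]
    return [yuv_to_rgb(y, u, v) for y in y_plane]

# Grey pixel: Y=128, U=V=128 (no chroma) gives mid-grey RGB.
print(nv12_block_to_rgb([128, 128, 128, 128], [128, 128]))
```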

Simulator
* bug #NN discuss and add opcode(s) proposed by lauri
* bug #NN set up unit tests for opcodes under simulator

note: this is where the iterative loop comes in.  there will be several rounds, adding different opcodes to try out.
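
A sketch of what one such unit test could look like. The simulator API does not exist yet, so the "opcode" here is a stand-in: a python reference model of an 8-bit unsigned saturating add (a typical video-codec primitive), checked with plain unittest.

```python
# Hypothetical shape of an opcode unit test under the simulator.  The
# reference model below would eventually be compared against the
# simulator's execution of the real opcode.
import unittest

def ref_addu8_sat(a, b):
    """Reference model: 8-bit unsigned saturating add."""
    return min(a + b, 255)

class TestAddu8Sat(unittest.TestCase):
    def test_no_overflow(self):
        self.assertEqual(ref_addu8_sat(100, 100), 200)

    def test_saturates(self):
        self.assertEqual(ref_addu8_sat(200, 100), 255)

    def test_edges(self):
        self.assertEqual(ref_addu8_sat(0, 0), 0)
        self.assertEqual(ref_addu8_sat(255, 255), 255)

# run with:  python3 -m unittest <this file>
```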

Hardware
* bug #NN implement opcodes in hardware
* bug #NN run unit tests under FPGA
* bug #NN run full OS (VLC?) demo under FPGA

todo, edit this comment and list a series of tasks to assign budgets to.  then, create bug reports for each.  see bug #48 for a template

TODO, subdivide these into smaller tasks (discuss below) so that reasonably accurate budgetary amounts can be assigned to them.  slight overestimation (10 to 15% or so) is recommended (and acceptable).
Comment 1 cand 2020-01-24 08:27:26 GMT
https://libre-riscv.org/vpu/

Audio
* bug #NN, MP3
* bug #NN, AC3
* bug #NN, Vorbis
* bug #NN, Opus

Video
* bug #NN, MJPEG (JPEG)
* bug #NN, MPEG1/2
* bug #NN, MPEG4 ASP (xvid)
* bug #NN, H.264
* bug #NN, H.265
* bug #NN, VP8
* bug #NN, VP9
* bug #NN, AV1

todo, edit this comment and list a series of tasks to assign budgets to.  then, create bug reports for each.  see bug #48 for a template

TODO, subdivide these into smaller tasks (discuss below) so that reasonably accurate budgetary amounts can be assigned to them.  slight overestimation (10 to 15% or so) is recommended (and acceptable).
Comment 2 cand 2020-01-24 08:48:06 GMT
Each codec then has these phases:
- research
- for each hotspot, implementation
- for each target library, upstreaming

HW implementations of new instructions would be later, once the instructions are known.
Comment 3 Luke Kenneth Casson Leighton 2020-01-24 09:07:07 GMT
(In reply to cand from comment #2)
> Each codec then has these phases:
> - research
> - for each hotspot, implementation
> - for each target library, upstreaming

ok great, do you have an estimate of time (and the budget you'd like to receive) for each? 1 week research, 2 weeks impl, 3 days upstream coordination, that sort of thing?

we can subdivide later (3 sub-bugs per top-level bug) if you would like part-payment; however, that is for later.

the focus now is to identify toplevel and assign budgets. 

> HW implementations of new instructions would be later, once the instructions
> are known.

yes.  or, more to the point: you advise us what you would like, then we implement them in a simulator (we also have to budget for how to run the code under that, btw - it may be that we only run a subset of the code, say, only the algorithm or a unit test rather than full VLC or something)

then after the cycles/sec is confirmed *then* we implement that opcode in hw and finally actually run under an FPGA.  this will be much later, at the end of the process.
Comment 4 cand 2020-01-24 09:22:26 GMT
Each codec is of different complexity. The audio codecs usually only have a single hotspot, while at the other end AV1 has several dozen. I'll do a quick pass later, to get rough figures on those.

I thought the simulator would be part of the implementation loop?
Comment 5 Luke Kenneth Casson Leighton 2020-01-24 10:06:38 GMT
(In reply to cand from comment #4)
> Each codec is of different complexity. The audio codecs usually only have a
> single hotspot, while at the other end AV1 has several dozen.

thought so.

> I'll do a
> quick pass later, to get rough figures on those.

great.
 
> I thought the simulator would be part of the implementation loop?

hmmm yes, however think about it: several CODECs will share the same opcodes.  you don't make a YUV2RGB opcode for VP9 and a different one for MPEG :)

so i was kinda leaning towards them being on their own (aggregated) iterative cycle, if you know what i mean.

if we can get a rough idea in advance of the sorts of opcodes needed, it would be very handy.  bear in mind that for the most part they need to be "scalar" in nature, because the Vector System adds a hardware for-loop on top *of* scalar operations.

then those can also be analysed to estimate a simulator implementation timescale, a hw timescale, and a budget as well.

we are not going to be able to predict exactly everything here, that is what the iterations are for.  we just need a start.
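
the scalar-plus-hardware-for-loop idea above can be sketched like this (names are illustrative only, not actual Libre RISC-V mnemonics):

```python
# Sketch of the "hardware for-loop on top *of* scalar operations" idea:
# the ISA defines only a scalar primitive; the Vector System applies it
# element-by-element under a vector length (VL) setting.  The same scalar
# op then serves every codec - no per-codec YUV2RGB or clip opcodes.

def scalar_clip8(x):
    """Scalar primitive: clamp a value into 0..255 (pixel reconstruction
    needs this in essentially every codec)."""
    return max(0, min(255, x))

def vectorised(scalar_op):
    """The hardware for-loop: lift any scalar op to VL elements."""
    def vec_op(values, vl):
        return [scalar_op(v) for v in values[:vl]]
    return vec_op

vclip8 = vectorised(scalar_clip8)
print(vclip8([-5, 0, 130, 300, 999], vl=4))  # -> [0, 0, 130, 255]
```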
Comment 6 cand 2020-01-24 11:23:26 GMT
Weren't the colorspace conversions part of the GPU milestone? That's what I understood from the ML earlier.
Comment 7 Luke Kenneth Casson Leighton 2020-01-24 11:29:50 GMT
(In reply to cand from comment #6)
> Weren't the colorspace conversions part of the GPU milestone? That's what I
> understood from the ML earlier.

yes good point, so we need to make sure not to double-allocate budget.
Comment 8 cand 2020-01-24 19:49:17 GMT
Rough relative complexities:

MP3                     1       1%
AC3                     1       1%
Vorbis                  1       1%
Opus                    1       1%

MJPEG (JPEG)            2       2%
MPEG1/2                 2       2%
MPEG4 ASP (xvid)        4       5%
H.264                   10      11%
H.265                   20      23%
VP8                     8       9%
VP9                     10      11%
AV1                     28      32%

This doesn't translate well to budget, though; there's no sense in spending a third of it on AV1. Perhaps a more sensible goal would be to target the largest hotspots of each, with only smaller budget differences due to complexity.

Another point to consider is that while ffmpeg is the prime lib, parts of accel code made for ffmpeg aren't really usable in the various standalone libs. Different structures, etc. In order to not write things twice, some decisions need to be made on which upstreams particularly matter.
Comment 9 Luke Kenneth Casson Leighton 2020-01-25 11:30:49 GMT
(In reply to cand from comment #8)

> This doesn't translate well to budget though, no sense in spending a third
> on AV1. Perhaps a more sensible goal would be to target the largest hot
> spots of each, with only smaller budget differences due to complexity.

yes.  and, during later iterations, do some more.
 
> Another point to consider is that while ffmpeg is the prime lib, parts of
> accel code made for ffmpeg aren't really usable in the various standalone
> libs. Different structures, etc. In order to not write things twice, some
> decisions need to be made on which upstreams particularly matter.

well, ultimately, gstreamer has an ffmpeg plugin, ffmpeg has a gstreamer plugin, vdpau has a vaapi plugin, vaapi has a vdpau plugin: it's all circular [1] and up its own backside [2], so whichever we pick is good :)

whichever route would be easiest for you, let's go with that.

[1] yes i managed to install both vdpau and vaapi recursively, once, whoops...
[2] the beatles "yellow submarine" film demonstrates this well
Comment 10 cand 2020-01-25 18:24:13 GMT
Okay, then I'd say ffmpeg for everything except av1 (dav1d) and jpeg (libjpeg-turbo).

Time and budget: your earlier comment of 1 week research, 2 weeks impl, 3 days upstream coordination is fairly on point for one hotspot (or a couple of smaller ones). For the later iterations only the impl phase would be budgeted.

I'd say 400e/wk, so 400 for research, 800 for one impl iteration, and 240 for the upstream part. I don't know how difficult the fpga side is, or how much should be budgeted for that; IIRC you also said the entire amount has to be used this year, or it'd be lost. This is a starting point for discussion, anyway.
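
A quick arithmetic check of those figures, assuming a 5-day working week at the quoted 400 EUR/week rate:

```python
# Per-hotspot budget, first iteration, at 400 EUR/week (5-day week assumed).
RATE_PER_WEEK = 400

research = 1 * RATE_PER_WEEK            # 1 week  -> 400
impl     = 2 * RATE_PER_WEEK            # 2 weeks -> 800
upstream = int((3 / 5) * RATE_PER_WEEK) # 3 days  -> 240

print(research, impl, upstream)         # 400 800 240
print(research + impl + upstream)       # 1440 EUR per hotspot
```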