Trying to improve WireGuard performance on RISC-V
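Background: a while ago I entered a competition, porting gVisor to riscv64. I saw that MilkV had released a board called Jupiter (SpaceMIT X60, with RVV 1.0 support), so I bought one to use as a test environment. It still had not arrived by the competition deadline, and when it finally showed up I was slacking off after the competition. Seeing how many extensions the board supports, I wanted to tinker with something. The result: on the SpaceMIT X60, running the speed-test script locally, throughput went from 270 Mbit/s to 550 Mbit/s. The modified SpaceMIT X60 kernel tree is on my GitHub.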
Work log
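The work itself is simple: I acted as a porter. I took the chacha20 and poly1305 assembly modules optimized for RISC-V instructions from openssl and cryptogams, put them into the Linux kernel, and enabled the relevant support so that WireGuard can use them.
For chacha20, the kernel already has a version that uses the vector crypto extension, but the X60 does not have that extension. So I took the vector-only implementation from this GitHub issue, put it into the kernel, enabled CRYPTO_ARCH_HAVE_LIB_CHACHA, and wrote the corresponding functions so that WireGuard ends up using the optimized chacha20 module. poly1305 was handled the same way, except that the cryptogams implementation does not depend on the V extension.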
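For reference, the operation that these assembly modules speed up is the ChaCha20 quarter round. The block below is a plain scalar C sketch of it, not the vector code that went into the kernel; the names rotl32 and chacha_quarter_round are my own and only serve the illustration.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));
}

/*
 * One ChaCha quarter round on four words of the 16-word state.
 * a, b, c, d are indices into x. A vector implementation performs
 * the same add / xor / rotate sequence on several blocks at once.
 */
static void chacha_quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);
    x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);
    x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);
    x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);
}
```

A full ChaCha20 block applies this to the four columns and then the four diagonals of the state, ten times over, which is why the rotate and add throughput of the target CPU matters so much here.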
One problem I ran into with the chacha20 module was memory alignment:
static void do_chacha20_v(u32 *state, const u8 *src, u8 *dst,
int bytes)
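This function is what actually calls the optimized quarter round, but the address of the source data src passed in is not guaranteed to be 4-byte aligned, so the later vlsseg8e32.v that reads the source data raises a load misaligned exception. Following what this discussion says, I tried a few things: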
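- Use memcpy to copy the data into an internal array buffer, which is aligned. I tried copying 1, 4, 8, and 16 blocks at a time; 8 blocks gave the best result.
- Try the cryptogams implementation, which includes handling for unaligned data. In practice it did not seem to perform well (as that discussion says, the simplest and most elegant fix is still memcpy).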
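The memcpy approach can be sketched roughly as follows. This is a userspace illustration, not the kernel patch: aligned_xor_blocks is a hypothetical stand-in for the vector routine (it only XORs with a constant so the sketch runs anywhere), and crypt_via_bounce and BOUNCE_BLOCKS are names I made up. The only things taken from the text above are the 4-byte alignment requirement and the 8-block copy size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHACHA_BLOCK_SIZE 64
#define BOUNCE_BLOCKS 8 /* 8 blocks per copy was the best of 1/4/8/16 tried */

/*
 * Hypothetical stand-in for the vector routine. Like the real one it
 * requires 4-byte-aligned pointers; here it just XORs every word with
 * a constant instead of a real keystream.
 */
static void aligned_xor_blocks(const uint32_t *src, uint32_t *dst, size_t nblocks)
{
    assert(((uintptr_t)src & 3) == 0 && ((uintptr_t)dst & 3) == 0);
    for (size_t i = 0; i < nblocks * (CHACHA_BLOCK_SIZE / 4); i++)
        dst[i] = src[i] ^ 0xA5A5A5A5u;
}

/*
 * Process whole 64-byte blocks from a possibly misaligned src into a
 * possibly misaligned dst by staging them in an aligned buffer.
 * Returns the number of bytes processed; a trailing partial block is
 * left untouched in this sketch.
 */
static size_t crypt_via_bounce(const uint8_t *src, uint8_t *dst, size_t bytes)
{
    /* A uint32_t array is at least 4-byte aligned. */
    uint32_t buf[BOUNCE_BLOCKS * CHACHA_BLOCK_SIZE / 4];
    size_t done = 0;

    while (bytes - done >= CHACHA_BLOCK_SIZE) {
        size_t nblocks = (bytes - done) / CHACHA_BLOCK_SIZE;
        if (nblocks > BOUNCE_BLOCKS)
            nblocks = BOUNCE_BLOCKS;
        size_t chunk = nblocks * CHACHA_BLOCK_SIZE;

        memcpy(buf, src + done, chunk);        /* misaligned -> aligned */
        aligned_xor_blocks(buf, buf, nblocks); /* aligned fast path */
        memcpy(dst + done, buf, chunk);        /* aligned -> destination */
        done += chunk;
    }
    return done;
}
```

The staging buffer size is a trade-off: a larger one amortizes the per-call overhead of the vector routine over more blocks, while a smaller one keeps the copies short. The measurements above favored 8 blocks on this board; other hardware may differ.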
Results
Test environment: MilkV Jupiter (SpaceMIT M1); the test script is this one
Before optimization:
Connecting to host 169.254.200.2, port 5201
[ 5] local 169.254.200.1 port 43166 connected to 169.254.200.2 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 30.9 MBytes 259 Mbits/sec 0 310 KBytes
[ 5] 1.00-2.00 sec 32.3 MBytes 271 Mbits/sec 0 339 KBytes
[ 5] 2.00-3.00 sec 34.8 MBytes 292 Mbits/sec 0 339 KBytes
[ 5] 3.00-4.00 sec 35.1 MBytes 294 Mbits/sec 0 339 KBytes
[ 5] 4.00-5.00 sec 35.4 MBytes 297 Mbits/sec 0 339 KBytes
[ 5] 5.00-6.00 sec 34.5 MBytes 289 Mbits/sec 0 339 KBytes
[ 5] 6.00-7.00 sec 35.0 MBytes 293 Mbits/sec 0 358 KBytes
[ 5] 7.00-8.00 sec 29.3 MBytes 246 Mbits/sec 0 406 KBytes
[ 5] 8.00-9.00 sec 32.4 MBytes 272 Mbits/sec 0 406 KBytes
[ 5] 9.00-10.00 sec 29.4 MBytes 246 Mbits/sec 0 406 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 329 MBytes 276 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 328 MBytes 275 Mbits/sec receiver
iperf Done.
With the RISC-V optimized chacha20 module:
Connecting to host 169.254.200.2, port 5201
[ 5] local 169.254.200.1 port 50852 connected to 169.254.200.2 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 36.4 MBytes 304 Mbits/sec 0 345 KBytes
[ 5] 1.00-2.00 sec 35.3 MBytes 296 Mbits/sec 0 345 KBytes
[ 5] 2.00-3.00 sec 34.5 MBytes 289 Mbits/sec 0 381 KBytes
[ 5] 3.00-4.00 sec 34.0 MBytes 285 Mbits/sec 0 401 KBytes
[ 5] 4.00-5.00 sec 36.7 MBytes 308 Mbits/sec 0 419 KBytes
[ 5] 5.00-6.00 sec 37.7 MBytes 316 Mbits/sec 0 419 KBytes
[ 5] 6.00-7.00 sec 37.3 MBytes 313 Mbits/sec 0 446 KBytes
[ 5] 7.00-8.00 sec 36.9 MBytes 310 Mbits/sec 0 446 KBytes
[ 5] 8.00-9.00 sec 36.4 MBytes 306 Mbits/sec 0 446 KBytes
[ 5] 9.00-10.00 sec 33.7 MBytes 283 Mbits/sec 0 446 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 359 MBytes 301 Mbits/sec 0 sender
[ 5] 0.00-10.01 sec 358 MBytes 300 Mbits/sec receiver
With the optimized chacha20 module, and with the kernel's CONFIG_PREEMPT_NONE option enabled following this article:
Connecting to host 169.254.200.2, port 5201
[ 5] local 169.254.200.1 port 47056 connected to 169.254.200.2 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 57.4 MBytes 482 Mbits/sec 0 369 KBytes
[ 5] 1.00-2.00 sec 60.4 MBytes 507 Mbits/sec 0 453 KBytes
[ 5] 2.00-3.00 sec 62.5 MBytes 524 Mbits/sec 0 453 KBytes
[ 5] 3.00-4.00 sec 63.8 MBytes 535 Mbits/sec 0 453 KBytes
[ 5] 4.00-5.00 sec 63.8 MBytes 535 Mbits/sec 0 496 KBytes
[ 5] 5.00-6.00 sec 64.8 MBytes 543 Mbits/sec 0 496 KBytes
[ 5] 6.00-7.00 sec 65.9 MBytes 553 Mbits/sec 0 522 KBytes
[ 5] 7.00-8.00 sec 66.9 MBytes 561 Mbits/sec 0 522 KBytes
[ 5] 8.00-9.00 sec 65.7 MBytes 551 Mbits/sec 0 522 KBytes
[ 5] 9.00-10.00 sec 66.2 MBytes 555 Mbits/sec 0 522 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 637 MBytes 535 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 636 MBytes 533 Mbits/sec receiver
iperf Done.
Finally, with the optimized chacha20 and poly1305 modules and the CONFIG_PREEMPT_NONE option enabled:
Connecting to host 169.254.200.2, port 5201
[ 5] local 169.254.200.1 port 50708 connected to 169.254.200.2 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 61.5 MBytes 515 Mbits/sec 0 369 KBytes
[ 5] 1.00-2.00 sec 63.9 MBytes 536 Mbits/sec 0 422 KBytes
[ 5] 2.00-3.00 sec 64.8 MBytes 543 Mbits/sec 0 441 KBytes
[ 5] 3.00-4.00 sec 65.7 MBytes 551 Mbits/sec 0 441 KBytes
[ 5] 4.00-5.00 sec 67.0 MBytes 562 Mbits/sec 0 460 KBytes
[ 5] 5.00-6.00 sec 66.5 MBytes 558 Mbits/sec 0 482 KBytes
[ 5] 6.00-7.00 sec 68.0 MBytes 570 Mbits/sec 0 554 KBytes
[ 5] 7.00-8.00 sec 67.1 MBytes 563 Mbits/sec 0 581 KBytes
[ 5] 8.00-9.00 sec 68.6 MBytes 576 Mbits/sec 0 581 KBytes
[ 5] 9.00-10.00 sec 66.0 MBytes 553 Mbits/sec 0 581 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 659 MBytes 553 Mbits/sec 0 sender
[ 5] 0.00-10.01 sec 657 MBytes 551 Mbits/sec receiver
iperf Done.
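To compare the four runs, the sender totals can be turned into ratios. The small helper below is only a convenience sketch (the function names are mine). It relies on iperf3 reporting transfers in MBytes of 2^20 bytes and bitrates in Mbits/sec of 10^6 bits, which is consistent with the numbers above: 329 MBytes over 10 s comes out at about 276 Mbits/sec.

```c
/*
 * Convert an iperf3 transfer total (MBytes, 2^20 bytes each) over a
 * given interval into Mbits/sec (10^6 bits each).
 */
static double mbytes_to_mbits_per_sec(double mbytes, double seconds)
{
    return mbytes * 1048576.0 * 8.0 / seconds / 1e6;
}

/* Ratio of a bitrate after a change to the bitrate before it. */
static double speedup(double after_mbits, double before_mbits)
{
    return after_mbits / before_mbits;
}
```

Using the sender bitrates from the logs, speedup(301, 276) is about 1.09, speedup(535, 276) about 1.94, and speedup(553, 276) about 2.00. In these runs, the optimized chacha20 alone gave roughly 9%, most of the gain appeared once CONFIG_PREEMPT_NONE was enabled, and poly1305 added a few percent on top.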