Trying to improve WireGuard performance on RISC-V

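Background: a while ago I took part in a competition, porting gVisor to riscv64. I saw that MilkV had released a board called Jupiter (SpaceMIT X60, with RVV 1.0 support), so I bought one to use as a test environment. It did not arrive before the deadline, and when it finally showed up I was slacking off after the competition. Since the board supports quite a few extensions, I wanted to tinker with something. The result: on the SpaceMIT X60, running the speed test script locally, throughput went from about 270 Mbit/s to about 550 Mbit/s. The modified SpaceMIT X60 kernel tree is on my GitHub.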

Work log

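The main work is actually simple: I acted as a porter. I took the RISC-V optimized assembly implementations of chacha20 and poly1305 from openssl and cryptogams, put them into the Linux kernel, and enabled the relevant support so that WireGuard can use them.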

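The kernel already has a chacha20 version that relies on the vector crypto extension, but the X60 does not have vector crypto. So I put the vector-only implementation from this GitHub issue into the kernel, enabled CRYPTO_ARCH_HAVE_LIB_CHACHA, and wrote the corresponding functions so that wireguard ends up using the optimized chacha20 module. poly1305 was handled the same way, except that the cryptogams implementation does not depend on the V extension.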
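For reference, this is the scalar quarter round that any chacha20 implementation, vectorized or not, has to compute. It is a minimal C sketch of the standard operation and not the assembly that went into the kernel; the names rotl32 and chacha_quarter_round are made up for illustration.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (n must be in 1..31). */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter round on four state words.
 * Hypothetical helper for illustration only. */
static void chacha_quarter_round(uint32_t *a, uint32_t *b,
                                 uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

A vector implementation runs the same add, xor and rotate sequence on several blocks at once, one block per vector lane.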

One problem I ran into with the chacha20 module was memory alignment:

static void do_chacha20_v(u32 *state, const u8 *src, u8 *dst,
                      int bytes)

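This function is what actually calls the optimized quarter round, but the address of the source data src passed in is not necessarily 4-byte aligned, so the later vlsseg8e32.v load of the source data raises a load misaligned exception. Following what this discussion says, I tried a few things: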

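  1. Use memcpy to copy the data into an internal array buffer that is aligned. I tried copying 1, 4, 8, and 16 blocks at a time, and 8 blocks gave the best result.
  2. Try the cryptogams implementation, which includes its own handling of unaligned data. In practice it did not seem to perform well. (As that discussion says, the simplest and most elegant fix is still memcpy.)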
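The memcpy approach from item 1 can be sketched in plain userspace C as follows. This is only an illustration under assumptions: process_aligned is a hypothetical stand-in for the vectorized routine (it XORs with a constant and is not ChaCha20), and process_any, BLOCK_SIZE and BOUNCE_BLOCKS are names invented here, not taken from the kernel patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE    64  /* one ChaCha20 block in bytes */
#define BOUNCE_BLOCKS 8   /* blocks copied per batch (assumed from the text) */

/* Hypothetical stand-in for the vector routine. Like the real one,
 * it requires a 4-byte aligned src. It is NOT ChaCha20. */
static void process_aligned(const uint8_t *src, uint8_t *dst, size_t bytes)
{
    assert(((uintptr_t)src & 3) == 0);
    for (size_t i = 0; i < bytes; i++)
        dst[i] = src[i] ^ 0x5a;
}

/* Accepts any src alignment: misaligned input goes through an
 * aligned bounce buffer, BOUNCE_BLOCKS blocks at a time. */
static void process_any(const uint8_t *src, uint8_t *dst, size_t bytes)
{
    /* uint32_t storage guarantees at least 4-byte alignment. */
    uint32_t bounce[BOUNCE_BLOCKS * BLOCK_SIZE / sizeof(uint32_t)];

    if (((uintptr_t)src & 3) == 0) {
        process_aligned(src, dst, bytes);  /* fast path, no copy */
        return;
    }
    while (bytes > 0) {
        size_t n = bytes < sizeof(bounce) ? bytes : sizeof(bounce);

        memcpy(bounce, src, n);
        process_aligned((const uint8_t *)bounce, dst, n);
        src += n;
        dst += n;
        bytes -= n;
    }
}
```

The batch size trades copy overhead against stack usage: a larger bounce buffer means fewer calls into the vector routine but more stack space, which is scarce in the kernel.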

Results

Test environment: MilkV Jupiter (SpaceMIT M1). The test script is this one.

Before optimization:

Connecting to host 169.254.200.2, port 5201
[  5] local 169.254.200.1 port 43166 connected to 169.254.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  30.9 MBytes   259 Mbits/sec    0    310 KBytes
[  5]   1.00-2.00   sec  32.3 MBytes   271 Mbits/sec    0    339 KBytes
[  5]   2.00-3.00   sec  34.8 MBytes   292 Mbits/sec    0    339 KBytes
[  5]   3.00-4.00   sec  35.1 MBytes   294 Mbits/sec    0    339 KBytes
[  5]   4.00-5.00   sec  35.4 MBytes   297 Mbits/sec    0    339 KBytes
[  5]   5.00-6.00   sec  34.5 MBytes   289 Mbits/sec    0    339 KBytes
[  5]   6.00-7.00   sec  35.0 MBytes   293 Mbits/sec    0    358 KBytes
[  5]   7.00-8.00   sec  29.3 MBytes   246 Mbits/sec    0    406 KBytes
[  5]   8.00-9.00   sec  32.4 MBytes   272 Mbits/sec    0    406 KBytes
[  5]   9.00-10.00  sec  29.4 MBytes   246 Mbits/sec    0    406 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   329 MBytes   276 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   328 MBytes   275 Mbits/sec                  receiver

iperf Done.

With the RISC-V optimized chacha20 module:

Connecting to host 169.254.200.2, port 5201
[  5] local 169.254.200.1 port 50852 connected to 169.254.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  36.4 MBytes   304 Mbits/sec    0    345 KBytes
[  5]   1.00-2.00   sec  35.3 MBytes   296 Mbits/sec    0    345 KBytes
[  5]   2.00-3.00   sec  34.5 MBytes   289 Mbits/sec    0    381 KBytes
[  5]   3.00-4.00   sec  34.0 MBytes   285 Mbits/sec    0    401 KBytes
[  5]   4.00-5.00   sec  36.7 MBytes   308 Mbits/sec    0    419 KBytes
[  5]   5.00-6.00   sec  37.7 MBytes   316 Mbits/sec    0    419 KBytes
[  5]   6.00-7.00   sec  37.3 MBytes   313 Mbits/sec    0    446 KBytes
[  5]   7.00-8.00   sec  36.9 MBytes   310 Mbits/sec    0    446 KBytes
[  5]   8.00-9.00   sec  36.4 MBytes   306 Mbits/sec    0    446 KBytes
[  5]   9.00-10.00  sec  33.7 MBytes   283 Mbits/sec    0    446 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   359 MBytes   301 Mbits/sec    0             sender
[  5]   0.00-10.01  sec   358 MBytes   300 Mbits/sec                  receiver

With the optimized chacha20 module, and with the kernel's CONFIG_PREEMPT_NONE option enabled as described in this article:

Connecting to host 169.254.200.2, port 5201
[  5] local 169.254.200.1 port 47056 connected to 169.254.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  57.4 MBytes   482 Mbits/sec    0    369 KBytes
[  5]   1.00-2.00   sec  60.4 MBytes   507 Mbits/sec    0    453 KBytes
[  5]   2.00-3.00   sec  62.5 MBytes   524 Mbits/sec    0    453 KBytes
[  5]   3.00-4.00   sec  63.8 MBytes   535 Mbits/sec    0    453 KBytes
[  5]   4.00-5.00   sec  63.8 MBytes   535 Mbits/sec    0    496 KBytes
[  5]   5.00-6.00   sec  64.8 MBytes   543 Mbits/sec    0    496 KBytes
[  5]   6.00-7.00   sec  65.9 MBytes   553 Mbits/sec    0    522 KBytes
[  5]   7.00-8.00   sec  66.9 MBytes   561 Mbits/sec    0    522 KBytes
[  5]   8.00-9.00   sec  65.7 MBytes   551 Mbits/sec    0    522 KBytes
[  5]   9.00-10.00  sec  66.2 MBytes   555 Mbits/sec    0    522 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   637 MBytes   535 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   636 MBytes   533 Mbits/sec                  receiver

iperf Done.

Finally, with the optimized chacha20 and poly1305 modules and CONFIG_PREEMPT_NONE enabled:

Connecting to host 169.254.200.2, port 5201
[  5] local 169.254.200.1 port 50708 connected to 169.254.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  61.5 MBytes   515 Mbits/sec    0    369 KBytes
[  5]   1.00-2.00   sec  63.9 MBytes   536 Mbits/sec    0    422 KBytes
[  5]   2.00-3.00   sec  64.8 MBytes   543 Mbits/sec    0    441 KBytes
[  5]   3.00-4.00   sec  65.7 MBytes   551 Mbits/sec    0    441 KBytes
[  5]   4.00-5.00   sec  67.0 MBytes   562 Mbits/sec    0    460 KBytes
[  5]   5.00-6.00   sec  66.5 MBytes   558 Mbits/sec    0    482 KBytes
[  5]   6.00-7.00   sec  68.0 MBytes   570 Mbits/sec    0    554 KBytes
[  5]   7.00-8.00   sec  67.1 MBytes   563 Mbits/sec    0    581 KBytes
[  5]   8.00-9.00   sec  68.6 MBytes   576 Mbits/sec    0    581 KBytes
[  5]   9.00-10.00  sec  66.0 MBytes   553 Mbits/sec    0    581 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   659 MBytes   553 Mbits/sec    0             sender
[  5]   0.00-10.01  sec   657 MBytes   551 Mbits/sec                  receiver

iperf Done.
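To put the four runs side by side, here is a small C sketch that computes the relative gain from the sender-side averages reported above (276, 301, 535 and 553 Mbits/sec). The helper names gain_percent and print_summary are made up for illustration.

```c
#include <stdio.h>

/* Relative improvement of `after` over `before`, in percent. */
static double gain_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Prints each configuration with its gain over the baseline run. */
static void print_summary(void)
{
    /* Sender-side averages from the iperf3 runs above, in Mbits/sec. */
    static const struct { const char *name; double mbps; } runs[] = {
        { "baseline",                        276.0 },
        { "chacha20",                        301.0 },
        { "chacha20 + PREEMPT_NONE",         535.0 },
        { "chacha20 + poly1305 + PREEMPT_NONE", 553.0 },
    };
    size_t count = sizeof(runs) / sizeof(runs[0]);

    for (size_t i = 0; i < count; i++)
        printf("%-38s %6.0f Mbits/sec  %+6.1f%%\n", runs[i].name,
               runs[i].mbps, gain_percent(runs[0].mbps, runs[i].mbps));
}
```

Calling print_summary() from main shows that most of the gain in these runs coincides with enabling CONFIG_PREEMPT_NONE, with the optimized chacha20 and poly1305 modules adding smaller steps on top.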
