ceph-fuse Performance Tuning: Background
When benchmarking with FIO, we found that with bs > 128k and -iodepth=1 throughput would not go up: a single client topped out at roughly 4 MB/s, which limits CephFS usage inside Docker. Note that with -iodepth=0 (i.e., with caching in effect) a single client reached about 23 MB/s, and we also ruled out containerized Ceph as a factor.
Test results:
[root@cephfs103 test99]# fio -direct=1 -ioengine=libaio -group_reporting -thread -rw=write -name=cephfs -runtime=1000s -filename=/share-fs/export/test99/test512m -bs=4m -size=512m -numjobs=4 -iodepth=4
cephfs: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=4
...
fio-2.12
Starting 4 threads
cephfs: Laying out IO file(s) (1 file(s) / 512MB)
Jobs: 1 (f=1): [_(2),W(1),_(1)] [97.0% done] [0KB/32702KB/0KB /s] [0/7/0 iops] [eta 00m:18s]
cephfs: (groupid=0, jobs=4): err= 0: pid=19209: Tue Mar 6 17:23:16 2018
write: io=2048.0MB, bw=3593.4KB/s, iops=0, runt=583671msec
slat (msec): min=1188, max=9891, avg=4531.13, stdev=1449.52
clat (usec): min=8, max=20555K, avg=13325163.34, stdev=3302525.98
lat (msec): min=2716, max=25982, avg=17856.30, stdev=3656.44
clat percentiles (msec):
| 1.00th=[ 2769], 5.00th=[ 9110], 10.00th=[ 9503], 20.00th=[10683],
| 30.00th=[11600], 40.00th=[12649], 50.00th=[13304], 60.00th=[13960],
| 70.00th=[14746], 80.00th=[16057], 90.00th=[16712], 95.00th=[16712],
| 99.00th=[16712], 99.50th=[16712], 99.90th=[16712], 99.95th=[16712],
| 99.99th=[16712]
bw (KB /s): min= 8175, max= 8342, per=100.00%, avg=8213.09, stdev=23.94
lat (usec) : 10=0.59%, 20=0.20%
lat (msec) : >=2000=99.22%
cpu : usr=0.01%, sys=0.02%, ctx=17442, majf=0, minf=58034
IO depths : 1=0.8%, 2=1.6%, 4=97.7%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=512/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
WRITE: io=2048.0MB, aggrb=3593KB/s, minb=3593KB/s, maxb=3593KB/s, mint=583671msec, maxt=583671msec
Only about 4 MB/s:
aggrb=3593KB/s
Cause
With the ceph client log enabled (/var/ceph/log/client.log), two consecutive log lines printed by the write path in client.cc (below) show that each write handed down by fuse advances the offset by a fixed 128K = 131072 bytes (117440512 - 117309440 = 131072):
2018-03-10 11:01:29.197588 7f028e0cd700 7 client.84103 wrote to 117309440, leaving file size at 536870912
2018-03-10 11:01:29.271777 7f028f0cf700 7 client.84103 wrote to 117440512, leaving file size at 536870912
libfuse does support -o big_writes -o max_write=4194304 to raise the amount of data written per I/O, but testing showed no improvement. My analysis is that max_write only governs the application-facing side; the link between libfuse and the FUSE kernel module is controlled by additional parameters, and libfuse's channel buffer size is hard-coded (rather unhelpful; I plan to raise an issue with libfuse). Ultimately it comes down to these two parameters:
- linux-3.10.0-xxx.el7/fs/fuse/fuse_i.h
#define FUSE_MAX_PAGES_PER_REQ 32 // maximum number of pages a single FUSE request may use
- libfuse-fuse_2_9_2/lib/fuse_kern_chan.c
#define MIN_BUFSIZE 0x21000 // size of the channel buffer between user space and /dev/fuse (an extra 4K beyond the payload is needed for request management)
Each request page is the default 4K, so 32 x 4K = 128K; hence each request that fuse submits to the CephFS client covers at most 128K. To make full use of that buffer, libfuse's MIN_BUFSIZE is set to 128K + 4K (I think libfuse could derive this value dynamically, and plan to submit an issue to the libfuse community). With this configuration, each 4M front-end FIO I/O gets split into 32 back-end 128K I/Os, which is what keeps the overall bandwidth down. The arithmetic is sketched below.
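A minimal C sketch of the size arithmetic, assuming the stock values shown above (32 pages, 4K page size, MIN_BUFSIZE 0x21000) and the patched values used later in this article (1024 pages, MIN_BUFSIZE 0x401000):

#include <stdio.h>

/* Stock values from fuse_i.h and fuse_kern_chan.c. */
#define PAGE_SIZE_BYTES      4096
#define STOCK_MAX_PAGES      32          /* FUSE_MAX_PAGES_PER_REQ       */
#define STOCK_MIN_BUFSIZE    0x21000     /* 128K payload + 4K management */

/* Values after the patches described in the Solution section. */
#define PATCHED_MAX_PAGES    1024
#define PATCHED_MIN_BUFSIZE  0x401000    /* 4M payload + 4K management   */

int main(void)
{
    printf("stock max request   : %d KB\n",
           STOCK_MAX_PAGES * PAGE_SIZE_BYTES / 1024);                 /* 128  */
    printf("stock channel buffer: %d KB\n", STOCK_MIN_BUFSIZE / 1024);    /* 132  */
    printf("patched max request : %d KB\n",
           PATCHED_MAX_PAGES * PAGE_SIZE_BYTES / 1024);               /* 4096 */
    printf("patched channel buf : %d KB\n", PATCHED_MIN_BUFSIZE / 1024);  /* 4100 */
    printf("128K requests per 4M front-end write: %d\n",
           4 * 1024 * 1024 / (STOCK_MAX_PAGES * PAGE_SIZE_BYTES));    /* 32   */
    return 0;
}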
Solution
- Download the kernel source matching the client's running kernel; I used linux-3.10.0-327.el7
Download from:
https://buildlogs.centos.org/c7.1511.00/kernel/20151119220809/3.10.0-327.el7.x86_64/
or: https://github.com/torvalds/linux
Parameter change: in fs/fuse/fuse_i.h, set FUSE_MAX_PAGES_PER_REQ to 1024, i.e. at most 4M per request (see the sketch at the end of this step).
Build:
fuse.ko can be built on its own, but that route tends to run into problems, so I rebuilt the whole kernel.
make menuconfig
make clean
make bzImage
make modules
make modules_install
Replace /usr/lib/modules/3.10.0-xxx.el7.x86_64/kernel/fs/fuse/fuse.ko with the newly built fuse.ko.
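A sketch of what the change looks like in fuse_i.h, assuming the stock definition quoted earlier (the surrounding code may differ between kernel builds):

/* fs/fuse/fuse_i.h */
/* was 32 (32 * 4K = 128K); 1024 pages * 4K/page = 4M per FUSE request */
#define FUSE_MAX_PAGES_PER_REQ 1024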
- Download the matching libfuse source
Download from:
libfuse; I used fuse 2.9.2
Parameter change:
In libfuse-fuse_2_9_2/lib/fuse_kern_chan.c, change MIN_BUFSIZE to #define MIN_BUFSIZE 0x401000 (see the sketch at the end of this step).
Build:
./configure; make; make install
Replace /usr/lib64/libfuse.so.2.9.2 with the newly built libfuse.so.2.9.2.
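The corresponding sketch for fuse_kern_chan.c; 0x401000 is the 4M payload plus the 4K of management overhead mentioned above:

/* libfuse-fuse_2_9_2/lib/fuse_kern_chan.c */
/* was 0x21000 (128K + 4K); now 0x401000 = 4M + 4K */
#define MIN_BUFSIZE 0x401000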
- Reboot the node and remount ceph-fuse (an optional sanity-check sketch follows the mount command below)
ceph-fuse -o big_writes -o max_write=4194304 -k /etc/ceph/ceph.client.admin.keyring -m 10.118.202.181:6789 /mycephfs
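Before re-running fio, an optional standalone sanity check, a minimal C sketch assuming the mount point /mycephfs and an arbitrary test file name: it issues a single 4 MB O_DIRECT write and reports how long it took. With the patches in place the write should go down as one FUSE request rather than 32:

#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define WRITE_SIZE (4 * 1024 * 1024)   /* one 4 MB write */

int main(void)
{
    void *buf;
    /* O_DIRECT needs an aligned buffer. */
    if (posix_memalign(&buf, 4096, WRITE_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0xab, WRITE_SIZE);

    /* Hypothetical test file on the ceph-fuse mount. */
    int fd = open("/mycephfs/direct_write_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = write(fd, buf, WRITE_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("wrote %zd bytes in %.1f ms\n", n, ms);

    close(fd);
    free(buf);
    return 0;
}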
- Test
[root@ceph201 ceph]# fio -filename=/mycephfs/fio_4M-1 -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=4M -size=1g -numjobs=1 -runtime=300 -group_reporting -name=testceph
testceph: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=1
fio-2.12
Starting 1 thread
testceph: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/28672KB/0KB /s] [0/7/0 iops] [eta 00m:00s]
testceph: (groupid=0, jobs=1): err= 0: pid=2862: Mon Mar 12 22:03:57 2018
write: io=1024.0MB, bw=37925KB/s, iops=9, runt= 27649msec
slat (msec): min=70, max=299, avg=108.00, stdev=35.54
clat (usec): min=0, max=9, avg= 1.52, stdev= 0.90
lat (msec): min=70, max=299, avg=108.00, stdev=35.54
clat percentiles (usec):
| 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 3],
| 99.00th=[ 3], 99.50th=[ 8], 99.90th=[ 9], 99.95th=[ 9],
| 99.99th=[ 9]
bw (KB /s): min=16384, max=49152, per=100.00%, avg=37981.09, stdev=7286.95
lat (usec) : 2=60.16%, 4=39.06%, 10=0.78%
cpu : usr=0.17%, sys=0.04%, ctx=515, majf=0, minf=4
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=256/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=1024.0MB, aggrb=37924KB/s, minb=37924KB/s, maxb=37924KB/s, mint=27649msec, maxt=27649msec
Now about 38 MB/s:
aggrb=37924KB/s
References:
http://www.cnblogs.com/yunnotes/archive/2013/04/19/3032497.html