Ceph-Fuse Performance Tuning


Background

While benchmarking with FIO we found that with bs > 128k and -iodepth=1 the throughput simply would not scale: a single client topped out at roughly 4 MB/s, which hampered the use of CephFS with Docker. Note that with -direct=0 (i.e., buffered I/O going through the page cache) a single client reaches roughly 23 MB/s; running Ceph itself in containers was also ruled out as a factor.

Test result:

[root@cephfs103 test99]# fio -direct=1  -ioengine=libaio -group_reporting -thread -rw=write -name=cephfs -runtime=1000s -filename=/share-fs/export/test99/test512m -bs=4m -size=512m -numjobs=4 -iodepth=4
cephfs: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=4
...
fio-2.12
Starting 4 threads
cephfs: Laying out IO file(s) (1 file(s) / 512MB)
Jobs: 1 (f=1): [_(2),W(1),_(1)] [97.0% done] [0KB/32702KB/0KB /s] [0/7/0 iops] [eta 00m:18s]
cephfs: (groupid=0, jobs=4): err= 0: pid=19209: Tue Mar  6 17:23:16 2018
write: io=2048.0MB, bw=3593.4KB/s, iops=0, runt=583671msec
slat (msec): min=1188, max=9891, avg=4531.13, stdev=1449.52
clat (usec): min=8, max=20555K, avg=13325163.34, stdev=3302525.98
lat (msec): min=2716, max=25982, avg=17856.30, stdev=3656.44
clat percentiles (msec):
|  1.00th=[ 2769],  5.00th=[ 9110], 10.00th=[ 9503], 20.00th=[10683],
| 30.00th=[11600], 40.00th=[12649], 50.00th=[13304], 60.00th=[13960],
| 70.00th=[14746], 80.00th=[16057], 90.00th=[16712], 95.00th=[16712],
| 99.00th=[16712], 99.50th=[16712], 99.90th=[16712], 99.95th=[16712],
| 99.99th=[16712]
bw (KB  /s): min= 8175, max= 8342, per=100.00%, avg=8213.09, stdev=23.94
lat (usec) : 10=0.59%, 20=0.20%
lat (msec) : >=2000=99.22%
cpu          : usr=0.01%, sys=0.02%, ctx=17442, majf=0, minf=58034
IO depths    : 1=0.8%, 2=1.6%, 4=97.7%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued    : total=r=0/w=512/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
WRITE: io=2048.0MB, aggrb=3593KB/s, minb=3593KB/s, maxb=3593KB/s, mint=583671msec, maxt=583671msec

Only around 4 MB/s:
aggrb=3593KB/s

Cause

With the Ceph client log enabled (/var/ceph/log/client.log), two consecutive prints from the write function in client.cc (shown below) reveal that every write handed down by FUSE advances the offset by a fixed 128 KiB = 131072 bytes:

2018-03-10 11:01:29.197588 7f028e0cd700  7 client.84103 wrote to 117309440, leaving file size at 536870912
2018-03-10 11:01:29.271777 7f028f0cf700  7 client.84103 wrote to 117440512, leaving file size at 536870912
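
As a quick check of that deduction, here is a tiny standalone C snippet of my own (the offsets are copied from the two log lines above) confirming that consecutive writes advance by exactly 128 KiB:

    #include <stdio.h>

    int main(void) {
        /* Offsets copied from the two client.log lines above. */
        long prev = 117309440;
        long curr = 117440512;
        printf("delta = %ld bytes = %ld KiB\n", curr - prev, (curr - prev) / 1024);
        /* Prints: delta = 131072 bytes = 128 KiB */
        return 0;
    }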

Although libfuse accepts -o big_writes -o max_write=4194304 to raise the amount of data written per I/O, the test results did not improve. My reading is that max_write only governs the application-facing side, while the path between libfuse and the FUSE kernel module is capped by further parameters, and libfuse's channel buffer size is hard-coded (rather unhelpful; I plan to file an issue with libfuse). Ultimately the following two parameters are the culprits:

  • linux-3.10.0-xxx.el7/fs/fuse/fuse_i.h
    #define FUSE_MAX_PAGES_PER_REQ 32 // maximum number of pages a single FUSE request may use
  • libfuse-fuse_2_9_2/lib/fuse_kern_chan.c
    #define MIN_BUFSIZE 0x21000 // size of the channel buffer between user space and /dev/fuse (an extra 4 KiB is reserved for the request header)

A kernel page is 4 KiB by default, so a single FUSE request carries at most 32 * 4 KiB = 128 KiB, which means each write submitted by FUSE to CephFS can cover at most 128 KiB. To make full use of that request size, libfuse fixes its channel buffer MIN_BUFSIZE at 128 KiB + 4 KiB (I believe this could instead be derived dynamically, and will raise an issue with the libfuse community). With this configuration, a single 4 MiB I/O from FIO is split into 32 back-end 128 KiB I/Os, and the aggregate bandwidth never gets off the ground.
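
A minimal C sketch of that arithmetic, using the stock values quoted above and assuming the usual 4 KiB page size (PAGE_SIZE_ASSUMED is my own constant, not a kernel symbol):

    #include <stdio.h>

    /* Stock values from the two files listed above; 4 KiB pages assumed. */
    #define PAGE_SIZE_ASSUMED      4096L
    #define FUSE_MAX_PAGES_PER_REQ 32L        /* kernel: fs/fuse/fuse_i.h */
    #define MIN_BUFSIZE            0x21000L   /* libfuse: lib/fuse_kern_chan.c */

    int main(void) {
        long max_req = FUSE_MAX_PAGES_PER_REQ * PAGE_SIZE_ASSUMED;
        printf("max payload per FUSE request: %ld KiB\n", max_req / 1024);      /* 128 */
        printf("libfuse channel buffer:       %ld KiB\n", MIN_BUFSIZE / 1024);  /* 132 = 128 + 4 for the header */
        printf("one 4 MiB fio write becomes   %ld FUSE requests\n",
               (4L * 1024 * 1024) / max_req);                                   /* 32 */
        return 0;
    }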

Solution

    1. Download the kernel source matching the client that does the mount; I used linux-3.10.0-327.el7.
      Download from:
      https://buildlogs.centos.org/c7.1511.00/kernel/20151119220809/3.10.0-327.el7.x86_64/
      or: https://github.com/torvalds/linux
      Change the parameter: set FUSE_MAX_PAGES_PER_REQ in fs/fuse/fuse_i.h to 1024, i.e. at most 4 MiB per request (see the sketch after this list).
      Build:
      fuse.ko can be built on its own, but that route caused enough trouble that I rebuilt the whole kernel.
      make menuconfig
      make clean
      make bzImage
      make modules
      make modules_install
      Replace /usr/lib/modules/3.10.0-xxx.el7.x86_64/kernel/fs/fuse/fuse.ko with the newly built fuse.ko.
    2. Download the matching libfuse source.
      Download from:
      libfuse; I used fuse 2.9.2 (the fuse_2_9_2 source tree).
      Change the parameter:
      in libfuse-fuse_2_9_2/lib/fuse_kern_chan.c change MIN_BUFSIZE to #define MIN_BUFSIZE 0x401000 (4 MiB + 4 KiB; also shown in the sketch after this list).
      Build:
      ./configure; make; make install
      Replace /usr/lib64/libfuse.so.2.9.2 with the newly built libfuse.so.2.9.2.
    3. Reboot the node and remount ceph-fuse:
      ceph-fuse -o big_writes -o max_write=4194304 -k /etc/ceph/ceph.client.admin.keyring -m 10.118.202.181:6789 /mycephfs
    4. Test
        [root@ceph201 ceph]# fio -filename=/mycephfs/fio_4M-1 -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=4M -size=1g -numjobs=1 -runtime=300 -group_reporting -name=testceph  
        testceph: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=1
        fio-2.12
        Starting 1 thread
        testceph: Laying out IO file(s) (1 file(s) / 1024MB)
        Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/28672KB/0KB /s] [0/7/0 iops] [eta 00m:00s]
        testceph: (groupid=0, jobs=1): err= 0: pid=2862: Mon Mar 12 22:03:57 2018
        write: io=1024.0MB, bw=37925KB/s, iops=9, runt= 27649msec
        slat (msec): min=70, max=299, avg=108.00, stdev=35.54
        clat (usec): min=0, max=9, avg= 1.52, stdev= 0.90
        lat (msec): min=70, max=299, avg=108.00, stdev=35.54
        clat percentiles (usec):
        |  1.00th=[    1],  5.00th=[    1], 10.00th=[    1], 20.00th=[    1],
        | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    1],
        | 70.00th=[    2], 80.00th=[    2], 90.00th=[    2], 95.00th=[    3],
        | 99.00th=[    3], 99.50th=[    8], 99.90th=[    9], 99.95th=[    9],
        | 99.99th=[    9]
        bw (KB  /s): min=16384, max=49152, per=100.00%, avg=37981.09, stdev=7286.95
        lat (usec) : 2=60.16%, 4=39.06%, 10=0.78%
        cpu          : usr=0.17%, sys=0.04%, ctx=515, majf=0, minf=4
        IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
        submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
        complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
        issued    : total=r=0/w=256/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
        latency   : target=0, window=0, percentile=100.00%, depth=1
      
        Run status group 0 (all jobs):
        WRITE: io=1024.0MB, aggrb=37924KB/s, minb=37924KB/s, maxb=37924KB/s, mint=27649msec, maxt=27649msec
      

      Reaching about 39 MB/s, roughly ten times the original throughput:
      aggrb=37924KB/s
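
For reference, a sketch of what the two edits from steps 1 and 2 end up looking like; this is not a drop-in patch (the surrounding code in the real files is omitted), and it again assumes 4 KiB pages:

    /* linux-3.10.0-xxx.el7/fs/fuse/fuse_i.h */
    #define FUSE_MAX_PAGES_PER_REQ 1024    /* was 32; 1024 * 4 KiB = 4 MiB per request */

    /* libfuse-fuse_2_9_2/lib/fuse_kern_chan.c */
    #define MIN_BUFSIZE 0x401000           /* was 0x21000; 4 MiB of payload + 4 KiB header */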

Reference:
http://www.cnblogs.com/yunnotes/archive/2013/04/19/3032497.html

