**Environment:** Physical machine
Bug symptoms:
The file manager shows a loading spinner forever and never recovers.

Exporting logs hangs as well:

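Frequency: in my environment, this specific sequence reproduces it almost every time.
Steps to reproduce:
1. On Feiniu NAS B, enable NFS and share two folders.
2. On Feiniu NAS A, mount B over NFS.
3. Shut down device B, either with poweroff or from the UI.
4. On Feiniu NAS A, open the file manager.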
Workaround:
Just power NAS B back on, haha.
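Root cause:
When the file manager opens, filemanager queries information about the NFS mount point /vol02/1000-4-50d80283. For some reason, even though the network connection to the server is already down, the NFS daemon does not handle this correctly and keeps waiting for a response from the NFS server.
Diagnostic information and process:
1. filemanager process information: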
root@workstation:/home/ywh# ps -aux | grep file
message+ 997 0.0 0.0 9936 4968 ? Ss 23:09 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root 1089 0.0 0.0 15464 6636 ? S<s 23:09 0:00 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach
root 1184 0.3 1.2 1638200 198808 ? S<Lsl 23:09 0:02 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
postgres 1378 0.0 0.5 239932 87368 ? Ss 23:09 0:00 /usr/lib/postgresql/15/bin/postgres -D /var/lib/postgresql/15/main -c config_file=/etc/postgresql/15/main/postgresql.conf
root 1751 0.0 0.1 722620 15964 ? Ssl 23:09 0:00 /usr/trim/bin/filestor_service
root 1767 0.0 0.0 72404 11972 ? Ssl 23:09 0:00 /usr/trim/bin/trim_file_monitor
root 2035 0.0 0.0 0 0 ? I< 23:09 0:00 [kworker/R-cifsfileinfoput]
ywh 4564 0.0 0.0 21956 10120 ? S 23:22 0:00 /usr/trim/bin/filemanager 61 OtmlGUrKOWjDyGdbyYC0LWjOWy7g4pjO6NBqgaERWN8= 1000 1001 ywh -1 114597414633472 h8BC0BypSzCL0UahdSQXvQ== 1 {"reqid":"6839cd35000000000000000003bd","req":"file.lsDir"}
ywh 4568 0.0 0.0 21956 10120 ? S 23:22 0:00 /usr/trim/bin/filemanager 61 OtmlGUrKOWjDyGdbyYC0LWjOWy7g4pjO6NBqgaERWN8= 1000 1001 ywh -1 114597414633472 ZUOFPyJxSySaDNC4SDBurA== 1 {"reqid":"6839cd35000000000000000003bc","req":"file.ls"}
root 4593 0.0 0.0 6336 2168 pts/1 S+ 23:22 0:00 grep --color=auto file
2. filemanager user-space stack:
(gdb) thread apply all bt
Thread 1 (Thread 0x7f29172b3a00 (LWP 4564) "filemanager"):
#0 0x00007f291751681a in fstatat64 () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2917a44db3 in util::is_same_inode(char const*, char const*) () from /lib/x86_64-linux-gnu/libnfile.so
#2 0x00007f2917a4688c in init_mount_cache(mount_info_t**, unsigned int) () from /lib/x86_64-linux-gnu/libnfile.so
#3 0x00007f2917a4318d in list_avail_vol(unsigned int**) () from /lib/x86_64-linux-gnu/libnfile.so
#4 0x0000560a0a36fbea in req_lsdir_userroot(PPJson::Value&, PPJson::MutValue&, PPJson::MutDocument&) ()
#5 0x0000560a0a3623bb in main ()
(gdb) disas 0x00007f291751681a
Dump of assembler code for function fstatat64:
0x00007f2917516810 <+0>: mov %ecx,%r10d
0x00007f2917516813 <+3>: mov $0x106,%eax
0x00007f2917516818 <+8>: syscall
=> 0x00007f291751681a <+10>: cmp $0xfffff000,%eax
0x00007f291751681f <+15>: ja 0x7f2917516828 <fstatat64+24>
0x00007f2917516821 <+17>: xor %eax,%eax
0x00007f2917516823 <+19>: ret
0x00007f2917516824 <+20>: nopl 0x0(%rax)
0x00007f2917516828 <+24>: mov 0xda5b1(%rip),%rdx # 0x7f29175f0de0
0x00007f291751682f <+31>: neg %eax
0x00007f2917516831 <+33>: mov %eax,%fs:(%rdx)
0x00007f2917516834 <+36>: mov $0xffffffff,%eax
0x00007f2917516839 <+41>: ret
End of assembler dump.
(gdb) x/s $rsi
0x7f2916647218: "/vol02/1000-4-50d80283"
3. Kernel stack of the thread:
root@workstation:/home/ywh# cat /proc/4564/stack
[<0>] request_wait_answer+0xfe/0x220 [fuse]
[<0>] __fuse_simple_request+0xd3/0x290 [fuse]
[<0>] fuse_do_getattr+0xf2/0x1f0 [fuse]
[<0>] vfs_statx_path+0x9f/0xe0
[<0>] vfs_statx+0x97/0xf0
[<0>] vfs_fstatat+0x77/0xa0
[<0>] __do_sys_newfstatat+0x26/0x60
[<0>] do_syscall_64+0x4b/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
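To check several processes for the same symptom, the kernel stack text can be parsed instead of read by eye. Below is a minimal sketch, assuming the `/proc/<pid>/stack` line format shown above. The helper names `parse_kernel_stack` and `blocked_in_fuse` are made up for illustration and are not part of any Feiniu tool. The sample input is the stack from this report.

```python
import re

# One frame looks like: "[<0>] request_wait_answer+0xfe/0x220 [fuse]"
# The trailing "[module]" part is optional (built-in kernel symbols have none).
FRAME_RE = re.compile(
    r"^\[<[0-9a-fx]+>\]\s+(?P<sym>[^+\s]+)\+\S+(?:\s+\[(?P<mod>[^\]]+)\])?$"
)

def parse_kernel_stack(text):
    """Return a list of (symbol, module_or_None), innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("sym"), match.group("mod")))
    return frames

def blocked_in_fuse(text):
    """True if the innermost frame belongs to the fuse kernel module."""
    frames = parse_kernel_stack(text)
    return bool(frames) and frames[0][1] == "fuse"

# Stack copied from the report (PID 4564).
SAMPLE_STACK = """\
[<0>] request_wait_answer+0xfe/0x220 [fuse]
[<0>] __fuse_simple_request+0xd3/0x290 [fuse]
[<0>] fuse_do_getattr+0xf2/0x1f0 [fuse]
[<0>] vfs_statx_path+0x9f/0xe0
[<0>] vfs_statx+0x97/0xf0
[<0>] vfs_fstatat+0x77/0xa0
[<0>] __do_sys_newfstatat+0x26/0x60
[<0>] do_syscall_64+0x4b/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
"""

if __name__ == "__main__":
    for symbol, module in parse_kernel_stack(SAMPLE_STACK):
        print(symbol, module or "(kernel)")
    print("blocked in FUSE:", blocked_in_fuse(SAMPLE_STACK))
```

On a live system the text would come from reading `/proc/<pid>/stack` as root. Here the innermost frames show the getattr request was handed to the FUSE daemon and the kernel is waiting for its answer.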
4. User-space FUSE mount:
mount.trim_nfs on /vol02/1000-4-50d80283 type fuse.mount.trim_nfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
FUSE NFS daemon:
root@workstation:/home/ywh# ps -aux | grep trim_nfs
root 2533 0.0 0.0 531760 3944 ? Ssl 23:09 0:00 /sbin/mount.trim_nfs nfs://172.20.10.7/fs/1000/nfs /vol02/1000-4-50d80283 -o rw
5. Stack of mount.trim_nfs:
(gdb) bt
#0 0x00007f9defefb21f in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x0000559dbf60681b in wait_for_nfs_reply ()
#2 0x0000559dbf606a1b in nfs_mount ()
#3 0x0000559dbf5f7452 in NfsConnection::open (this=this@entry=0x7f9dd0000c90, url=...) at /usr/include/c++/12/bits/basic_string.h:233
#4 0x0000559dbf5f8d3e in NfsConnectionPool::get (this=0x559dc2abf7f8) at NfsConnectionPool.cpp:133
#5 0x0000559dbf5ec1b9 in RpcContext::obtainConnection (this=0x7f9dd0000c00) at /workspace/SystemServices/mount_nfs/nfusr/RpcContext.h:101
#6 NfsClient::getattrWithContext (this=0x559dc2abf7f0, ctx=0x7f9dd0000c00) at NfsClient.cpp:344
#7 0x00007f9df028bc83 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#8 0x00007f9df028c093 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#9 0x00007f9df02890e4 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#10 0x00007f9defe881f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00007f9deff0889c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) info registers rdi rsi rdx
rdi 0x7f9decc4b8a0 140316258908320
rsi 0x1 1
rdx 0x64 100
(gdb) x/4wx $rdi
0x7f9decc4b8a0: 0x00000006 0x00000004 0x00000000 0x00000000
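The registers above are the arguments of `poll(fds, nfds, timeout)`: `rsi` = 1 entry, `rdx` = 0x64 = 100 ms, and `rdi` points at a `struct pollfd`. The two words dumped from `rdi` can be decoded as follows. This is a small sketch that assumes the Linux x86_64 layout (`int fd; short events; short revents;`) and the Linux values of the poll flag bits. `decode_pollfd` is an illustrative name.

```python
import struct

# Linux poll(2) event bits, as defined in <poll.h> on Linux.
POLL_BITS = {
    0x001: "POLLIN",
    0x002: "POLLPRI",
    0x004: "POLLOUT",
    0x008: "POLLERR",
    0x010: "POLLHUP",
    0x020: "POLLNVAL",
}

def flag_names(mask):
    """Translate a poll event bitmask into a list of flag names."""
    return [name for bit, name in POLL_BITS.items() if mask & bit]

def decode_pollfd(word0, word1):
    """Decode the two little-endian 32-bit words that make up one struct pollfd."""
    raw = struct.pack("<2I", word0, word1)
    fd, events, revents = struct.unpack("<ihh", raw)
    return fd, flag_names(events), flag_names(revents)

if __name__ == "__main__":
    # Values taken from `x/4wx $rdi` in the report.
    fd, events, revents = decode_pollfd(0x00000006, 0x00000004)
    print(f"fd={fd} events={events} revents={revents}")
```

With the values from the dump this gives fd 6, events `POLLOUT`, and no returned events. In other words, the daemon is polling socket 6 for writability in 100 ms slices inside `nfs_mount`. That pattern would be consistent with a connection attempt or an unsent request that never completes, though the dump alone does not confirm which.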
6. Socket behind file descriptor 0x00000006:
root@workstation:/home/ywh# ls -l /proc/2533/fd
total 0
lrwx------ 1 root root 64 May 30 23:10 0 -> /dev/null
lrwx------ 1 root root 64 May 30 23:10 1 -> /dev/null
lrwx------ 1 root root 64 May 30 23:10 2 -> /dev/null
l-wx------ 1 root root 64 May 30 23:10 3 -> /usr/trim/logs/mountmgr.log
lrwx------ 1 root root 64 May 30 23:10 4 -> /dev/fuse
lrwx------ 1 root root 64 May 30 23:10 5 -> /run/sys_apps/com.trim.mountmgr.pid
lrwx------ 1 root root 64 May 30 23:10 6 -> 'socket:[50289]'
lrwx------ 1 root root 64 May 30 23:10 8 -> 'socket:[12774]'
7. Socket 50289 no longer shows up in the ss -x output (only 12774 is listed):
root@workstation:/home/ywh# ss -xanp | grep -w "50289"
root@workstation:/home/ywh# ss -xanp | grep -w "12774"
u_dgr ESTAB 0 0 * 12774 * 7933 users:(("mount.trim_nfs",pid=2533,fd=8))
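One caveat on this step: `ss -x` only lists Unix domain sockets, so a missing entry there does not by itself prove that inode 50289 is gone. If fd 6 is the TCP connection to the NFS server, it would appear in `/proc/net/tcp` instead. Below is a rough sketch of looking up a socket inode in that table, assuming the usual `/proc/net/tcp` column layout on a little-endian machine. The sample row is synthetic and is not taken from the affected NAS.

```python
import socket
import struct

TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def decode_v4(hex_addr):
    """Turn '0100007F:0801' into '127.0.0.1:2049' (little-endian host assumed)."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return f"{ip}:{int(port_hex, 16)}"

def find_tcp_socket(table_text, inode):
    """Return (local, remote, state) for the row owning `inode`, or None."""
    for line in table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Column 9 (0-based) holds the socket inode.
        if len(fields) > 9 and fields[9] == str(inode):
            state = TCP_STATES.get(fields[3].upper(), fields[3])
            return decode_v4(fields[1]), decode_v4(fields[2]), state
    return None

# Synthetic example row (hypothetical inode 12345), for illustration only.
SAMPLE_TABLE = (
    "  sl  local_address rem_address   st tx_queue rx_queue tr tm->when "
    "retrnsmt   uid  timeout inode\n"
    "   0: 0100007F:0801 00000000:0000 0A 00000000:00000000 00:00000000 "
    "00000000     0        0 12345 1 0000000000000000 100 0 0 10 0\n"
)

if __name__ == "__main__":
    print(find_tcp_socket(SAMPLE_TABLE, 12345))
    # On the affected machine one would pass the real table and inode, e.g.:
    #   find_tcp_socket(open("/proc/net/tcp").read(), 50289)
```

If the lookup on the real table returns a row, its state column would show whether the daemon is still trying to reach NAS B or is holding a dead connection.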
8. Other observations:
ls, stat, and cd all hang for a long time before returning.
root@workstation:/home/ywh# stat /vol02/1000-4-50d80283
stat: cannot statx '/vol02/1000-4-50d80283': No route to host
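Since a plain stat on the mount point can block this long, a caller that must stay responsive could run the check in a child process and give up after a deadline. The sketch below shows that general idea. It is not how filemanager or libnfile work, and `probe_path` is a hypothetical helper. It also assumes the blocked child can still be killed, which is normally the case for a process waiting on a FUSE request.

```python
import subprocess
import sys

# The child only performs the stat; its exit code tells us whether it succeeded.
_CHILD_CODE = "import os, sys; os.stat(sys.argv[1])"

def probe_path(path, timeout_s=2.0):
    """Stat `path` in a child process.

    Returns "ok" if the stat succeeded, "error" if it failed quickly
    (for example ENOENT or EHOSTUNREACH), and "timeout" if the child
    was still blocked when the deadline expired.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD_CODE, path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # on expiry the child is killed, then TimeoutExpired is raised
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "error"

if __name__ == "__main__":
    for candidate in ("/", "/vol02/1000-4-50d80283"):
        print(candidate, "->", probe_path(candidate))
```

A caller could then skip or grey out any mount point that reports "timeout" instead of blocking the whole directory listing on it.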
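Contact: (Group 12, Heisenberg)
Log files:
I exported two core files so the root cause can be analyzed further.
Feiniu private cloud share [coredump]: click the link to download the files, or open it in the app to save them to your NAS: https://s.fnnas.net/s/d692219708014e4fac, password: 129716
If attachments are too large to upload, log files can be provided through a Feiniu external share link or Baidu Netdisk.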