Multipath starts failing after copying ~30-40 GB of data
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: Recovered to normal mode
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: remaining active paths: 1
Dec 15 01:57:53 test.example.com kernel: sd 1:0:2:20: [sdeu] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[root@test log]# multipath -ll |grep -i fail
|- 1:0:0:15 sdq 65:0 Failed ready running
- 3:0:0:15 sdai 66:32 Failed ready running
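To keep an eye on which paths drop out while the copy is running, something like the following could be used instead of grepping by hand. It is only a rough sketch: it assumes the usual `multipath -ll` path-line layout (H:C:T:L, sd device, major:minor, then three state columns), and `failed_paths` is a hypothetical helper name, not part of any multipath tooling.

```python
import re

# Path line layout assumed here (as in the output above):
#   |- 1:0:0:15 sdq 65:0 Failed ready running
# i.e. H:C:T:L, device node, major:minor, dm state, checker state, online state.
PATH_LINE = re.compile(
    r"(?P<hctl>\d+:\d+:\d+:\d+)\s+(?P<dev>sd[a-z]+)\s+(?P<majmin>\d+:\d+)\s+"
    r"(?P<dm_state>\w+)\s+(?P<checker>\w+)\s+(?P<online>\w+)"
)

def failed_paths(multipath_ll_output):
    """Return a list of (hctl, dev) for every path whose dm state is 'failed'."""
    failed = []
    for line in multipath_ll_output.splitlines():
        match = PATH_LINE.search(line)
        if match and match.group("dm_state").lower() == "failed":
            failed.append((match.group("hctl"), match.group("dev")))
    return failed

# The two path lines quoted in the question.
SAMPLE = """|- 1:0:0:15 sdq 65:0 Failed ready running
- 3:0:0:15 sdai 66:32 Failed ready running"""

for hctl, dev in failed_paths(SAMPLE):
    print(hctl, dev)
```

Note that the two failed paths sit on different hosts (1:0:0:15 and 3:0:0:15), which the H:C:T:L column makes easy to spot.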
We are using the default multipath.conf.
HBA driver version: 8.07.00.26.06.8-k
HBA model: QLogic Corp. ISP8324-based 16Gb Fibre Channel to PCI Express Adapter
OS: CentOS 64-bit/2.6.32-642.6.2.el6.x86_64
Hardware: Intel/HP ProLiant DL380 Gen9
We have already verified this solution and checked with EMC, and everything looks fine: https://access.redhat.com/solutions/438403
More information
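> No dropped or errored packets on the network side
> Filesystem is mounted with noatime,nodiratime
> Filesystem is ext4 (already tried xfs, same errors)
> LVM is in striped mode (started with the linear option, then converted to striped)
> THP is already disabled
> echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
> Whenever multipath starts failing, processes go into D state
> System firmware has been upgraded
> Tried the latest version of the qlogic driver
> Tried different schedulers (noop, deadline, cfq)
> Tried a different tuned profile (enterprise-storage)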
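For the THP item in the list above, it can be worth confirming that the setting actually stuck after a reboot. The kernel marks the active mode in that sysfs file with square brackets, so a small check could look like this. It is a sketch only; `active_thp_mode` and `read_thp_mode` are hypothetical helper names, and the sample string in the comment is an assumption about what the file contains on this box.

```python
THP_PATH = "/sys/kernel/mm/redhat_transparent_hugepage/enabled"

def active_thp_mode(sysfs_text):
    """Return the bracketed (active) mode, e.g. 'never' from 'always madvise [never]'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None  # no bracketed entry found

def read_thp_mode(path=THP_PATH):
    """Read the sysfs file on the host and return the active mode."""
    with open(path) as handle:
        return active_thp_mode(handle.read())
```

Calling `read_thp_mode()` on the affected host should return `never` if the echo above is still in effect.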
Vmcore collected during the issue
I was able to collect a vmcore while the issue was occurring:
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.6.2.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 36
        DATE: Fri Dec 16 00:11:26 2016
      UPTIME: 01:48:57
LOAD AVERAGE: 0.41, 0.49, 0.60
       TASKS: 1238
    NODENAME: test.example.com
     RELEASE: 2.6.32-642.6.2.el6.x86_64
     VERSION: #1 SMP Wed Oct 26 06:52:09 UTC 2016
     MACHINE: x86_64  (2297 Mhz)
      MEMORY: 511.9 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000018"
         PID: 15840
     COMMAND: "kjournald"
        TASK: ffff884023446ab0  [THREAD_INFO: ffff88103def4000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)
After enabling debug mode on the qlogic side:
qla2xxx [0000:0b:00.0]-3822:5: FCP command status: 0x2-0x0 (0x70000) nexus=5:1:0 portid=1f0160 oxid=0x800 cdb=2a200996238000038000 len=0x70000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d42580 cp=ffff88276d249480.
qla2xxx [0000:84:00.0]-3822:7: FCP command status: 0x2-0x0 (0x70000) nexus=7:0:3 portid=450000 oxid=0x4de cdb=2a20098a5b0000010000 len=0x20000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d421c0 cp=ffff8880237e0880.
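The cdb= field in those debug lines can be unpacked to see what the failing commands were. Opcode 0x2a is WRITE(10), so both entries are writes. Below is a minimal sketch of the decoding; `decode_write10` is a hypothetical helper, and the 512-byte block size is an assumption (it does agree with the len= values in the log).

```python
import struct

WRITE_10 = 0x2A  # SCSI WRITE(10) opcode

def decode_write10(cdb_hex, block_size=512):
    """Decode a 10-byte WRITE(10) CDB given as a hex string.

    block_size is an assumption; 512 matches the len= fields in the log above.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Big-endian layout: opcode, flags, 32-bit LBA, group number,
    # 16-bit transfer length (in blocks), control byte.
    opcode, flags, lba, group, blocks, control = struct.unpack(">BBIBHB", cdb)
    if opcode != WRITE_10:
        raise ValueError("not a WRITE(10) CDB: opcode 0x%02x" % opcode)
    return {
        "lba": lba,
        "blocks": blocks,
        "byte_count": blocks * block_size,
    }

for cdb_hex in ("2a200996238000038000", "2a20098a5b0000010000"):
    info = decode_write10(cdb_hex)
    print(cdb_hex, hex(info["lba"]), info["blocks"], hex(info["byte_count"]))
```

The first CDB asks for 0x380 (896) blocks and the second for 0x100 (256) blocks; at 512 bytes per block that is 0x70000 and 0x20000 bytes, which lines up with len=0x70000 and len=0x20000 in the two log lines.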
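Can you give me information about the server's firmware version?
Is EMC PowerPath actually installed? If so, check here.
Do you have the HP Management Agents installed? If so, are you able to post the output of hplog -v?
Do you see anything in the ILO4 log? Is the ILO accessible?
Can you describe all of the PCIe cards installed in the system's slots?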
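For RHEL6-specific tuning, I strongly recommend using XFS, running tuned-adm profile enterprise-storage and making sure your filesystems are mounted with nobarrier (the tuned profile should take care of that).
For the volumes, make sure you are using the dm (multipath) devices rather than /dev/sdX. See: https://access.redhat.com/solutions/1212233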
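A quick way to sanity-check the dm-versus-sdX point is to look at the device names LVM reports for its physical volumes, for example from `pvs --noheadings -o pv_name`. The sketch below just flags names that look like raw SCSI disks; `raw_sd_devices` is a hypothetical helper and the device names in the example are made up for illustration.

```python
import re

# /dev/sdX, /dev/sdXY, optionally with a partition number.
RAW_SD = re.compile(r"^/dev/sd[a-z]+\d*$")

def raw_sd_devices(pv_names):
    """Return the PV names that point at a single-path /dev/sdX device."""
    return [name.strip() for name in pv_names if RAW_SD.match(name.strip())]

# Hypothetical PV list; on a real host this would come from LVM.
example_pvs = ["/dev/mapper/mpatha", "/dev/sdq", "  /dev/mapper/mpathb"]
print(raw_sd_devices(example_pvs))
```

Anything this returns is a PV sitting directly on one path instead of on the multipath map, which is what the linked article warns about.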
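Looking at what you have presented so far and at the checks listed on Redhat's support site (and the description here), I can't rule out the possibility of an HBA failure or a PCIe riser problem. There could also be an issue on the VMAX side.
Can you swap PCIe slots and try again? Can you swap cards and try again?
Is the firmware on the HBA current? Here is the most recent package as of December 2016.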
Firmware 6.07.02
BIOS 3.21A
DID_ERROR typically indicates the driver software detected some type of hardware error via an anomaly within the returned data from the HBA. A hardware or SAN-based issue is present within the storage subsystem such that received fibre channel response frames contain invalid or conflicting information that the driver is not able to use or reconcile. Please review the system's hardware, switch error counters, etc. to see if there is any indication of where the issue might lie. The most likely candidate is the HBA itself.
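For reference, the hostbyte=DID_ERROR in the kernel message and the 0x70000 in the qla2xxx debug output appear to be the same thing seen two ways. The Linux SCSI result word packs four bytes (status, message, host, driver from low to high), and host byte 0x07 is DID_ERROR. A small decoder, assuming the parenthesised value in the debug line is that result word (`decode_result` is a hypothetical helper):

```python
# Host byte codes as defined in the Linux kernel's include/scsi/scsi.h.
HOST_BYTE = {
    0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT", 0x04: "DID_BAD_TARGET", 0x05: "DID_ABORT",
    0x06: "DID_PARITY", 0x07: "DID_ERROR", 0x08: "DID_RESET",
    0x09: "DID_BAD_INTR", 0x0A: "DID_PASSTHROUGH", 0x0B: "DID_SOFT_ERROR",
    0x0C: "DID_IMM_RETRY", 0x0D: "DID_REQUEUE",
    0x0E: "DID_TRANSPORT_DISRUPTED", 0x0F: "DID_TRANSPORT_FAILFAST",
}

def decode_result(result):
    """Split a 32-bit SCSI result word into its four component bytes."""
    host = (result >> 16) & 0xFF
    return {
        "status_byte": result & 0xFF,
        "msg_byte": (result >> 8) & 0xFF,
        "host_byte": HOST_BYTE.get(host, "unknown (0x%02x)" % host),
        "driver_byte": (result >> 24) & 0xFF,
    }

print(decode_result(0x70000))
```

With status, message and driver bytes all zero, the only thing being reported is the host-level error, which fits the explanation above that the driver rejected what came back from the HBA rather than the target returning a SCSI error.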