hpacucli ctrl all show detail Smart Array P410 in Slot 1 Bus Interface: PCI ...
The HP tools do not report any problems.
This is an ordinary ext3 partition with the block size set to 2k, and it checks out fine. fsck output:
fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
The file's inode looks fine as well:
  File: `name.xxx'
  Size: 3126962    Blocks: 6124       IO Block: 4096   regular file
Device: 6851h/26705d    Inode: 64579729    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-07-28 09:02:59.000000000 -0400
Modify: 2014-07-28 09:02:59.000000000 -0400
Change: 2014-07-28 09:02:59.000000000 -0400
One of the things I cannot do is copy the file:
> cp /long_path/name.xxx .
cp: reading `/long_path/name.xxx': Input/output error
To find out where the problem is, I ran dd to copy the file:
> dd if=/long_path/name.xxx bs=2048 of=test
dd: reading `/long_path/name.xxx': Input/output error
222+0 records in
222+0 records out
454656 bytes (455 kB) copied, 0.042867 seconds, 10.6 MB/s
So I guess the problem is in file block 223.
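The arithmetic behind that guess can be sketched as follows. This is only an illustration: the function name is made up here, and it assumes dd's `bs` equals the filesystem block size (2048 in this case). Keep in mind that the kernel reads through the page cache in larger units, so the truly bad block may be a neighbour of the one computed.

```python
def first_unread_block(bytes_copied: int, block_size: int) -> int:
    """Zero-based index of the first file block dd failed to read.

    Hypothetical helper: every complete record that dd copied is one
    good block, so the next block is where the read error occurred.
    """
    return bytes_copied // block_size


# Values taken from the dd output above.
index = first_unread_block(454656, 2048)
print(index)      # 222 (zero-based), i.e. the 223rd block of the file
```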
debugfs should help locate that block on disk:
debugfs -R "stat name.xxx" /dev/sdf
debugfs 1.39 (29-May-2006)
Inode: 64579729   Type: regular   Mode: 0644   Flags: 0x0   Generation: 2900468317
User: 0   Group: 0   Size: 3126962
File ACL: 0   Directory ACL: 0
Links: 1   Blockcount: 6124
Fragment:  Address: 0   Number: 0   Size: 0
ctime: 0x53d64a03 -- Mon Jul 28 09:02:59 2014
atime: 0x53d64a03 -- Mon Jul 28 09:02:59 2014
mtime: 0x53d64a03 -- Mon Jul 28 09:02:59 2014
BLOCKS:
(0):130402311,(1-4):130402844-130402847,(5-6):130484033-130484034,(7):130484036,(8-10):130484049-130484051,(11):130484055,(IND):130761221,(12-13):130761222-130761223,(14):130763791,(15):130763942,(16):130765268,(17-23):130838937-130838943,(24-46):130853946-130853968,(47-48):130855126-130855127,(49):130855215,(50-53):130856428-130856431,(54-104):130856533-130856583,(105-341):130856748-130856984,
... [MORE BLOCKS] ....
TOTAL: 1531
So I guess the problematic data is in block 130856866.
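For anyone who wants to do that lookup mechanically, here is a rough sketch that maps a file block index to a filesystem block using the extent list printed by debugfs. The helper names are invented for this example, and the regular expression only understands the `(a-b):x-y` and `(a):x` forms seen above; entries such as `(IND)` are skipped.

```python
import re

# Matches "(105-341):130856748-130856984" as well as "(7):130484036".
EXTENT = re.compile(r"\((\d+)(?:-(\d+))?\):(\d+)")


def parse_extents(blocks_line):
    """Return (first_logical, last_logical, first_physical) tuples."""
    extents = []
    for match in EXTENT.finditer(blocks_line):
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        extents.append((first, last, int(match.group(3))))
    return extents


def physical_block(extents, logical):
    """Map a zero-based file block index to a filesystem block number."""
    for first, last, phys in extents:
        if first <= logical <= last:
            return phys + (logical - first)
    raise KeyError(f"file block {logical} is not in the listed extents")


# A fragment of the BLOCKS line from the debugfs output above.
sample = "(IND):130761221,(54-104):130856533-130856583,(105-341):130856748-130856984"
extents = parse_extents(sample)
print(physical_block(extents, 223))   # 130856866
print(physical_block(extents, 222))   # 130856865
```

Depending on whether "block 223" is counted from zero or from one, the neighbouring block 130856865 is a candidate too, which is why both are printed.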
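How can I get more information about that block? I ran badblocks and it did list bad blocks. My guess is that I have to multiply the block number by 2 (the filesystem block size is 2K, while badblocks uses 1K by default). Also, badblocks checks the disk rather than the partition, so maybe I should add some offset (there is only one partition on that disk, so maybe not).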
> fdisk -l /dev/sdf

Disk /dev/sdf: 2000.3 GB, 2000365379584 bytes
255 heads, 63 sectors/track, 243197 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

           Device Boot   Start      End      Blocks  Id  System
/dev/cciss/c0d5p1   *        1   243197  1953479871  83  Linux
I also thought about using smartd. What should I be looking for?
Error counter log:
          Errors Corrected by          Total       Correction   Gigabytes     Total
              ECC          rereads/    errors      algorithm    processed     uncorrected
          fast | delayed   rewrites    corrected   invocations  [10^9 bytes]  errors
read:        0      1457          0   2887405961             0     65948.712           18
write:       0         0          0            0             0     15056.493            0
verify:      0         1          0    361901613             0      3591.720            0

Non-medium error count: 226

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err  [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->        -     34479       16845361  [0x3 0x11 0x0]
# 2  Background short  Completed                    -        44              -  [- - -]
# 3  Background short  Completed                    -        39              -  [- - -]
# 4  Background long   Completed                    -         6              -  [- - -]

Long (extended) Self Test duration: 18500 seconds [308.3 minutes]

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 34541:56 [2072516 minutes]
    Number of background scans performed: 1139, scan progress: 38.18%
    Number of background medium scans performed: 1139

   #  when      lba(hex)          [sk,asc,ascq]  reassign_status
   1  19215:06  0000000000014c61  [3,11,0]       Recovered via rewrite in-place
   2  19215:07  0000000000014c66  [3,0]          Recovered via rewrite in-place
   3  19413:28  0000000001010a31  [3,0]          Require Write or Reassign Blocks command
   4  19943:24  000000000001ea99  [3,0]          Recovered via rewrite in-place
   5  20152:23  00000000000232b8  [3,0]          Recovered via rewrite in-place
   6  31229:34  810000004087f984  [3,0]          Require Write or Reassign Blocks command
   7  33021:51  810000004087ba85  [3,0]          Require Write or Reassign Blocks command
   8  33021:51  000000004087ba9f  [3,0]          Require Write or Reassign Blocks command
   9  33021:52  000000004087bad6  [3,0]          Require Write or Reassign Blocks command
  10  33029:43  000000004087baa5  [3,0]          Require Write or Reassign Blocks command
  11  33055:27  000000004087bac3  [3,0]          Require Write or Reassign Blocks command
  12  33244:40  810000004087f9d6  [3,0]          Require Write or Reassign Blocks command
  13  33431:58  990000004087f105  [0,0]          Reassignment by disk Failed
  14  33480:17  00000000463d7713  [3,0]          Require Write or Reassign Blocks command
  15  33480:19  00000000463d7723  [3,0]          Require Write or Reassign Blocks command
  16  33480:20  00000000463d7725  [3,0]          Require Write or Reassign Blocks command
  17  33480:28  81000000463d774e  [3,0]          Require Write or Reassign Blocks command
  18  33686:17  8100000044e50edc  [3,0]          Require Write or Reassign Blocks command
  19  34154:17  81000000432bef27  [3,0]          Require Write or Reassign Blocks command
  20  34463:43  810000001f32decd  [3,0]          Require Write or Reassign Blocks command
  21  34463:43  0000000030080000  [3,0]          Require Write or Reassign Blocks command
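How do I tie the smartctl output (or the output of any other SMART run) to my initial problem?
Shouldn't the hard disk's own software fix this kind of thing as well?
BTW, I found the following link helpful for understanding the 'debugfs -R' output. Perhaps the link will be useful to others.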
UPDATE
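Investigating further, I found that operations touching the problematic inode (such as the cp command above) trigger the following line in the kernel log: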
kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3
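In SCSI terms, sense key 0x3 means MEDIUM ERROR, which is the same key that shows up in the SK column of the SMART logs above. A small sketch for pulling the key out of such a log line follows; the function name is made up for this example, and the table only lists the first few standard sense keys.

```python
import re

# First few standard SCSI sense keys; the table is deliberately incomplete.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
}


def sense_key_name(log_line):
    """Return the name of the sense key in a cciss log line, or None."""
    match = re.search(r"sense key = (0x[0-9a-fA-F]+)", log_line)
    if match is None:
        return None
    return SENSE_KEYS.get(int(match.group(1), 16), "unknown")


line = "kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3"
print(sense_key_name(line))   # MEDIUM ERROR
```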
Solution
Take your block number, multiply it by 4 and add one:
(130856866 * 4) + 1 = 523427465
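This gives the sector that is reporting the I/O error. The block size is 2k and a sector is 512 bytes, hence the factor of 4. The extra one accounts for the starting sector offset of the partition.
To correlate this with SMART, we need to convert the value to hexadecimal.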
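The same conversion as a small sketch, with the partition start left as a parameter. The function and constant names are invented for this example, and the default offset of 1 is simply the value used in this answer, not something read from the disk.

```python
FS_BLOCK_SIZE = 2048   # ext3 block size reported in the question
SECTOR_SIZE = 512      # sector size assumed in this answer


def block_to_sector(fs_block, partition_start=1):
    """Convert a filesystem block number to a sector on the whole device.

    partition_start is the partition's first sector; 1 is the value
    assumed in this answer and should be checked against the real layout.
    """
    sectors_per_block = FS_BLOCK_SIZE // SECTOR_SIZE   # 4
    return fs_block * sectors_per_block + partition_start


print(block_to_sector(130856866))       # 523427465
print(block_to_sector(130856866, 63))   # 523427527
```

The second call is only there to show how much the result depends on the offset: a partition listed as starting at cylinder 1 under 255 heads and 63 sectors/track often begins at sector 63, and `fdisk -lu` would print the real start sector. That is an assumption on top of the answer, so treat either result as approximate.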
$ printf "0x%x\n" 523427465
0x1f32de89
Now, when you compare that with what SMART shows, one line is suspiciously close:
20 34463:43 810000001f32decd [3,0] Require Write or Reassign Blocks command
How far off is it?
$ bc -l
bc 1.06.95
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
obase=16
ibase=16
1F32DECD-1F32DE89
44
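That is 0x44 = 68 sectors, or 34816 bytes, so the two addresses are only somewhere between 32768 and 34816 bytes apart. We cannot say, however, which of the four sectors that make up the block is the damaged one.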
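The comparison can be repeated against several entries of the background scan log at once. The sketch below assumes, as this answer implicitly does, that only the low 32 bits of the `lba(hex)` column are the address and that the leading byte (0x81, 0x99) is some kind of flag; the helper names are made up, and only a few entries from the log are listed.

```python
SECTOR_SIZE = 512


def low32(lba_hex):
    """Low 32 bits of an lba(hex) value (assumed to be the real address)."""
    return int(lba_hex, 16) & 0xFFFFFFFF


def nearest_entry(target, lba_hex_values):
    """Return the logged LBA string closest to the target sector."""
    return min(lba_hex_values, key=lambda h: abs(low32(h) - target))


# A few entries copied from the background scan results log.
scan_lbas = [
    "810000004087f984",
    "00000000463d7713",
    "810000001f32decd",
    "0000000030080000",
]
target = 0x1f32de89   # sector computed from the filesystem block

closest = nearest_entry(target, scan_lbas)
delta = low32(closest) - target
print(closest, delta, delta * SECTOR_SIZE)   # 810000001f32decd 68 34816
```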
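If I had to hazard a guess, I would say there are probably a good number of blocks around the same address that will report I/O errors (assuming the RAID stripe size is 32k or thereabouts).
Also, a read may not reveal the problem if the RAID fetches the block from another disk. A write has to propagate to every disk in a RAID1 setup, so a write can fail where a read succeeds. Furthermore, if we assume the RAID card's block size is 32k, we can also assume that both the damaged block and the one SMART reports were damaged by whatever happened on that platter. It is just that the SMART test read the first 32k from the good disk and the next 32k from the bad one.
Modern hard disks keep "reserve sectors" and remap damaged sectors like these to a new sector location. Given that you are seeing this now, along with the "Reassignment by disk Failed" message, I would say the disk has run out of them.
As for doing something about it, that is a little tricky. LBA addressing is an abstraction over the real disk underneath. You need to identify which disk is causing the problem, fail it out of the RAID array and replace it.
In any case, you have a bad disk and you should replace it as soon as possible.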