linux – mdadm RAID5 array rebuild stuck at 99.9%

I recently installed three new disks in my QNAP TS-412 NAS.

These three new disks were to be combined with the already-present disk into a 4-disk RAID5 array, so I started the migration process.

After several attempts (each taking roughly 24 hours), the migration seemed to work, but left the NAS unresponsive.

At that point I reset the NAS. Everything went downhill from there:

> The NAS booted, but flagged the first disk as failed and dropped it from all arrays, leaving them crippled.
> I ran checks on the disk and could not find any problems (which would be odd anyway, since it is almost new).
> The management interface offered no recovery options, so I figured I would just do it by hand.

Using mdadm I successfully rebuilt all of the QNAP-internal RAID1 arrays (/dev/md4, /dev/md13 and /dev/md9), leaving only the RAID5 array, /dev/md0.
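One way to sanity-check the remaining members before touching /dev/md0 again is to read their md superblocks. This is only a sketch and not part of the original steps; the device names follow the layout above:

mdadm --examine /dev/sd[abcd]3
mdadm --examine /dev/sd[abcd]3 | grep -E 'Events|Array State|Device Role'

If the event counters and the reported array state agree across sdb3, sdc3 and sdd3, the three surviving members still describe the same array.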

I have now tried this several times, using these commands:

mdadm -w /dev/md0

(After /dev/sda3 was dropped from the array, the NAS had mounted it read-only. The array cannot be modified in RO mode, so it has to be set back to read-write first.)

mdadm /dev/md0 --re-add /dev/sda3

After that, the rebuild starts.
It stalls at 99.9%, however, while the system is extremely slow and/or unresponsive (logging in over SSH fails most of the time).

The current state of things:

[admin@nas01 ~]# cat /proc/mdstat                            
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md4 : active raid1 sdd2[2](S) sdc2[1] sdb2[0]
      530048 blocks [2/2] [UU]

md0 : active raid5 sda3[4] sdd3[3] sdc3[2] sdb3[1]
      8786092608 blocks super 1.0 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
      [===================>.]  recovery = 99.9% (2928697160/2928697536) finish=0.0min speed=110K/sec

md13 : active raid1 sda4[0] sdb4[1] sdd4[3] sdc4[2]
      458880 blocks [4/4] [UUUU]
      bitmap: 0/57 pages [0KB], 4KB chunk

md9 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      530048 blocks [4/4] [UUUU]
      bitmap: 2/65 pages [8KB], 4KB chunk

unused devices: <none>

(It has now been sitting at 2928697160/2928697536 for hours.)
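For a closer look at the stalled resync, the md sysfs attributes can be polled alongside /proc/mdstat. This is a sketch under the assumption that the NAS kernel exposes the standard md sysfs files:

cat /sys/block/md0/md/sync_action      # reports "recover" while a member is being rebuilt
cat /sys/block/md0/md/sync_completed   # sectors done / total, the same counters as in mdstat
cat /proc/mdstat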

[admin@nas01 ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 01.00.03
  Creation Time : Thu Jan 10 23:35:00 2013
     Raid Level : raid5
     Array Size : 8786092608 (8379.07 GiB 8996.96 GB)
  Used Dev Size : 2928697536 (2793.02 GiB 2998.99 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Jan 14 09:54:51 2013
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 99% complete

           Name : 3
           UUID : 0c43bf7b:282339e8:6c730d6b:98bc3b95
         Events : 34111

    Number   Major   Minor   RaidDevice State
       4       8        3        0      spare rebuilding   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3

After checking /mnt/HDA_ROOT/.logs/kmsg, it turns out that the actual problem appears to be /dev/sdb3:

<6>[71052.730000] sd 3:0:0:0: [sdb] Unhandled sense code
<6>[71052.730000] sd 3:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08
<6>[71052.730000] sd 3:0:0:0: [sdb] Sense Key : 0x3 [current] [descriptor]
<4>[71052.730000] Descriptor sense data with sense descriptors (in hex):
<6>[71052.730000]         72 03 00 00 00 00 00 0c 00 0a 80 00 00 00 00 01 
<6>[71052.730000]         5d 3e d9 c8 
<6>[71052.730000] sd 3:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
<6>[71052.730000] sd 3:0:0:0: [sdb] CDB: cdb[0]=0x88: 88 00 00 00 00 01 5d 3e d9 c8 00 00 00 c0 00 00
<3>[71052.730000] end_request: I/O error, dev sdb, sector 5859367368
<4>[71052.730000] raid5_end_read_request: 27 callbacks suppressed
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246784 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246792 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246800 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246808 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246816 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246824 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246832 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246840 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246848 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246856 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.

The above sequence repeats at a steady rate for various (random?) sectors in the 585724xxxx range.
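Assuming smartmontools is available on the NAS (it is not part of every QNAP firmware), the health of sdb and the reported sector can be checked directly; the sector number below is the one from the end_request line above:

smartctl -a /dev/sdb          # check Reallocated_Sector_Ct, Current_Pending_Sector and the error log
smartctl -t short /dev/sdb    # queue a short self-test; read the result later with smartctl -l selftest /dev/sdb
dd if=/dev/sdb of=/dev/null bs=512 skip=5859367368 count=256   # try to read at the reported sector

If the dd read fails with an I/O error, the drive itself is returning unreadable sectors and the md read errors are not a software artefact.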

My questions:

> Why does it stall so close to the end, while still using enough resources to bog down the system (the md0_raid5 and md0_resync processes are still running)?
> Is there a way to see what is causing it to fail/stall? <- Probably caused by the sdb3 errors.
> How do I complete the operation without losing the 3TB of data? (For example, by skipping the troublesome sectors on sdb3 but keeping the data that is intact?)

Solution

It probably stalls just before the end because it needs some kind of status back from the failing disk and is not getting it.

In any case, all of the data is (or should be) intact on just 3 of the 4 disks.

You say it kicked the faulty disk out of the array, so it should still be running, albeit in degraded mode.

Can you mount it?
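For example, read-only, so that nothing on the degraded array is modified (the mount point is only an example, and the filesystem on md0 is assumed to be ext3/ext4 as is typical for QNAP data volumes):

mkdir -p /mnt/recovery
mount -o ro /dev/md0 /mnt/recovery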

You can force the array to run by doing the following:

> Print out the details of the array: mdadm -D /dev/md0
> Stop the array: mdadm --stop /dev/md0
> Re-create the array with the original creation parameters and force md to accept it: mdadm -C /dev/md0 --assume-clean /dev/sd[abcd]3 (a fully spelled-out sketch follows below)

The latter step is completely safe, as long as:

> you do not write to the array, and
> you used exactly the same creation parameters as before.

That last flag will prevent the rebuild and skip any integrity checks. You should then be able to mount the array and recover your data.
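A fully spelled-out version of the re-create step might look like the sketch below. Every parameter is an assumption derived from the mdadm -D output above (metadata 1.0, RAID level 5, 4 devices, 64K chunk, device order sda3 sdb3 sdc3 sdd3, left-symmetric layout being the raid5 default); if any of them, especially the device order, differs from the original creation parameters, this command can destroy the data instead of recovering it:

mdadm --stop /dev/md0
mdadm -C /dev/md0 --assume-clean -e 1.0 -l 5 -n 4 -c 64 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

mdadm will warn that the devices appear to contain an existing array and ask for confirmation before writing new superblocks. Because of --assume-clean no rebuild is started, and the array can then be mounted read-only to copy the data off.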
