
RAID5 failure: all superblocks lost, unable to reassemble the array

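I have a 4-disk RAID5 array that suffered a drive failure.

When the disk failed, I left the array running in degraded mode and ordered a new disk.

When the new disk arrived 2 days later, the array would no longer mount.

I removed all the drives from the array, connected them to another machine, and ran smartctl. All drives passed.

However, when I try to assemble the array, it says no RAID members are available, and all 4 drives appear to be missing their RAID5 superblocks.

I got some help from a friend who is smarter than I am. We tried assembling the 4 drives in all 24 possible orders using overlays. When assembling from the overlays, some metadata was recognized, but the array could not be mounted.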
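For reference, four devices can be ordered in 4! = 24 ways. The sketch below only builds and prints one candidate command per ordering; it does not run anything. The overlay paths under /dev/mapper are hypothetical, and the chunk size and metadata version are assumptions taken from the mdstat output later in this post. The original data offset would also have to match, which this sketch does not address.

```python
from itertools import permutations

# Hypothetical copy-on-write overlay devices sitting on top of the real disks.
OVERLAYS = [
    "/dev/mapper/sda",
    "/dev/mapper/sdb",
    "/dev/mapper/sdc",
    "/dev/mapper/sdd",
]


def candidate_commands(devices):
    """Return one mdadm --create command string per device ordering."""
    cmds = []
    for order in permutations(devices):
        cmds.append(
            "mdadm --create /dev/md0 --assume-clean --level=5 "
            f"--raid-devices={len(devices)} --chunk=512 --metadata=1.2 "
            + " ".join(order)
        )
    return cmds


if __name__ == "__main__":
    cmds = candidate_commands(OVERLAYS)
    print(len(cmds), "orderings")
    for cmd in cmds:
        print(cmd)
```

Each printed command should only ever be pointed at overlays, never at the real disks, because `--create` writes new metadata.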

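I can't explain how all 4 drives could lose their superblocks at the same time. My only guess is that plugging them into another machine and running smartctl somehow affected the partition structure???

Lots of information follows, including the pasted smartctl output...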


Linux fileserver 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux



Drives 


sudo mdadm --examine /dev/sda
/dev/sda:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
matthew@fileserver:~$ sudo mdadm --examine /dev/sdb
/dev/sdb:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
matthew@fileserver:~$ sudo mdadm --examine /dev/sdc
/dev/sdc:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
matthew@fileserver:~$ sudo mdadm --examine /dev/sdd
/dev/sdd:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
matthew@fileserver:~$



sudo mdadm --examine /dev/sda1
mdadm: No md superblock detected on /dev/sda1.
matthew@fileserver:~$ sudo mdadm --examine /dev/sda2
mdadm: cannot open /dev/sda2: No such file or directory
matthew@fileserver:~$ sudo mdadm --examine /dev/sdb1
mdadm: No md superblock detected on /dev/sdb1.
matthew@fileserver:~$ sudo mdadm --examine /dev/sdc1
mdadm: No md superblock detected on /dev/sdc1.
matthew@fileserver:~$ sudo mdadm --examine /dev/sdd1
mdadm: No md superblock detected on /dev/sdd1.
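A minimal sketch of checking by hand for what `mdadm --examine` looks for, assuming metadata version 1.2 (as the mdstat output later in this post reports). A 1.2 superblock sits 4 KiB from the start of the member device and begins with the little-endian magic number 0xa92b4efc. The function name is made up for illustration. Note that the mdstat output lists whole disks (sdb, sdc, ...) rather than partitions as members, so the whole-disk device is probably the one worth checking.

```python
import struct

MD_SB_MAGIC = 0xA92B4EFC  # md superblock magic number
V1_2_OFFSET = 4096        # metadata 1.2 sits 4 KiB from the start of the member


def has_md_v1_2_superblock(path):
    """Return True if the md magic is found 4 KiB into the device or image."""
    with open(path, "rb") as f:
        f.seek(V1_2_OFFSET)
        raw = f.read(4)
    if len(raw) < 4:
        return False
    (magic,) = struct.unpack("<I", raw)  # v1.x superblocks are little-endian
    return magic == MD_SB_MAGIC
```

This opens the path read-only, so it can be run (as root) against a disk such as /dev/sda or against an image file without modifying anything.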



sudo mdadm --detail /dev/md0
mdadm: cannot open /dev/md0: No such file or directory


matthew@fileserver:~/lsdrv$ sudo mdadm --detail --scan
matthew@fileserver:~/lsdrv$ sudo mdadm --detail --scan /dev/sda
mdadm: /dev/sda does not appear to be an md device
matthew@fileserver:~/lsdrv$ sudo mdadm --detail --scan /dev/sdb
mdadm: /dev/sdb does not appear to be an md device
matthew@fileserver:~/lsdrv$ sudo mdadm --detail --scan /dev/sdc
mdadm: /dev/sdc does not appear to be an md device
matthew@fileserver:~/lsdrv$ sudo mdadm --detail --scan /dev/sdd
mdadm: /dev/sdd does not appear to be an md device



This is an automatically generated mail message from mdadm
running on fileserver

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sdd.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10] 
md0 : active raid5 sdd[5](F) sdc[4] sdb[6] sde[1]
     11720658432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
     bitmap: 0/30 pages [0KB], 65536KB chunk

unused devices: <none>



This is an automatically generated mail message from mdadm
running on fileserver

A FailSpare event had been detected on md device /dev/md0.

It could be related to component device /dev/sdd.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10] 
md0 : active raid5 sdd[5](F) sdc[4] sde[1] sdb[6]
     11720658432 blocks super 1.2 level 5, algorithm 2 [4/3] [UUU_]
     [>....................]  recovery =  3.2% (125694336/3906886144) finish=12064.8min speed=5223K/sec
     bitmap: 29/30 pages [116KB], 65536KB chunk

unused devices: <none>
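As a quick sanity check on the numbers in the two mdstat reports above, the arithmetic can be sketched as follows. It assumes the usual RAID5 relation (usable size is one member less than the member count times the per-member size) and treats the reported speed as constant; the variable names are only illustrative.

```python
def raid5_usable_blocks(members, per_device_blocks):
    """RAID5 keeps one member's worth of parity, so usable = (n - 1) * size."""
    return (members - 1) * per_device_blocks


# Values copied from the mdstat output (1 KiB blocks, K/sec).
per_device = 3906886144
done = 125694336
speed = 5223

usable = raid5_usable_blocks(4, per_device)
percent = 100 * done / per_device
eta_min = (per_device - done) / speed / 60

print("usable blocks:", usable)  # 11720658432, matching the md0 line
print("recovery: %.1f%%" % percent)
print("estimated minutes remaining: %.0f" % eta_min)
```

The computed figures line up with the reported array size and recovery percentage, and the time estimate lands within a couple of minutes of the reported `finish` value.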










Smartctl: 

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     Hitachi HUS724040ALE641
Serial Number:    PAH0PZUW
LU WWN Device Id: 5 000cca 22bce6a11
Firmware Version: MJAOA5F0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jul 11 21:57:32 2021 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (   24) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    (   1) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    65536
  2 Throughput_Performance  P-S---   100   100   054    -    230
  3 Spin_Up_Time            POS---   135   135   024    -    555 (Average 593)
  4 Start_Stop_Count        -O--C-   100   100   000    -    91
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   124   124   020    -    33
  9 Power_On_Hours          -O--C-   099   099   000    -    9146
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    91
192 Power-Off_Retract_Count -O--CK   100   100   000    -    462
193 Load_Cycle_Count        -O--C-   100   100   000    -    462
194 Temperature_Celsius     -O----   176   176   000    -    34 (Min/Max 24/63)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    10
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL     R/O      7  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x20       GPL     R/O      1  Streaming performance log [OBS-8]
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80       GPL     R/W     63  Host vendor specific log
0x81-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xb2       GPL     VS      63  Device vendor specific log
0xc8       GPL     VS     617  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 10 (device log contains only the most recent 4 errors)
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 10 [1] occurred at disk power-on lifetime: 6685 hours (278 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 51 00 00 00 00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  47 00 00 00 01 00 00 00 00 00 12 a0 00     00:17:41.571  READ LOG DMA EXT
  47 00 00 00 01 00 00 00 00 00 00 a0 00     00:17:41.571  READ LOG DMA EXT
  ef 00 10 00 02 00 00 00 00 00 00 a0 00     00:17:41.571  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 00     00:17:41.571  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 00     00:17:41.570  IDENTIFY DEVICE

Error 9 [0] occurred at disk power-on lifetime: 6685 hours (278 days + 13 hours)
  When the command that caused the error occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 51 00 00 00 00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  ec 00 00 00 00 00 00 00 00 00 00 a0 00     00:16:42.626  IDENTIFY DEVICE
  ec 00 00 00 01 00 00 00 00 00 01 a0 ff     00:16:42.322  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00 00 00 00 00 ff     00:16:42.297  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00 00 00 00 00 00     00:16:33.076  IDENTIFY DEVICE
  ec 00 00 00 01 00 00 00 00 00 01 00 ff     00:16:33.047  IDENTIFY DEVICE

Error 8 [3] occurred at disk power-on lifetime: 6685 hours (278 days + 13 hours)
  When the command that caused the error occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 51 00 00 00 00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  ec 00 00 00 00 00 00 00 00 00 00 00 00     00:16:33.076  IDENTIFY DEVICE
  ec 00 00 00 01 00 00 00 00 00 01 00 ff     00:16:33.047  IDENTIFY DEVICE
  ec 00 00 00 01 00 00 00 00 00 01 00 ff     00:16:24.869  IDENTIFY DEVICE
  ec 00 00 00 01 00 00 00 00 00 01 00 ff     00:16:22.862  IDENTIFY DEVICE
  ec 00 00 00 01 00 00 00 00 00 01 00 ff     00:16:22.291  IDENTIFY DEVICE

Error 7 [2] occurred at disk power-on lifetime: 6685 hours (278 days + 13 hours)
  When the command that caused the error occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 51 00 00 00 00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  ec 00 00 00 00 00 00 00 00 00 00 a0 00     00:13:10.514  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00 00 00 00 a0 ff     00:13:10.206  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00 00 00 00 a0 00     00:13:05.119  IDENTIFY DEVICE
  ec 00 00 00 01 00 00 00 00 00 01 a0 ff     00:13:04.815  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00 00 00 00 00 ff     00:13:04.790  IDENTIFY DEVICE

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9144         -
# 2  Vendor (0xb0)       Completed without error       00%     37694         -
# 3  Vendor (0x71)       Completed without error       00%     37694         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    34 Celsius
Power Cycle Min/Max Temperature:     32/34 Celsius
Lifetime    Min/Max Temperature:     24/63 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (24)

Index    Estimated Time   Temperature Celsius
  25    2021-07-11 19:50    43  ************************
 ...    ..( 13 skipped).    ..  ************************
  39    2021-07-11 20:04    43  ************************
  40    2021-07-11 20:05    44  *************************
 ...    ..( 32 skipped).    ..  *************************
  73    2021-07-11 20:38    44  *************************
  74    2021-07-11 20:39    45  **************************
 ...    ..( 60 skipped).    ..  **************************
   7    2021-07-11 21:40    45  **************************
   8    2021-07-11 21:41    46  ***************************
   9    2021-07-11 21:42    45  **************************
  10    2021-07-11 21:43    46  ***************************
  11    2021-07-11 21:44    45  **************************
  12    2021-07-11 21:45    46  ***************************
  13    2021-07-11 21:46    46  ***************************
  14    2021-07-11 21:47    45  **************************
  15    2021-07-11 21:48    45  **************************
  16    2021-07-11 21:49    46  ***************************
 ...    ..(  4 skipped).    ..  ***************************
  21    2021-07-11 21:54    46  ***************************
  22    2021-07-11 21:55     ?  -
  23    2021-07-11 21:56    33  **************
  24    2021-07-11 21:57    33  **************

SCT Error Recovery Control:
           Read: disabled
          Write: disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              91  ---  Lifetime Power-On Resets
0x01  0x018  6      9826823751  ---  Logical Sectors Written
0x01  0x020  6        12801493  ---  Number of Write Commands
0x01  0x028  6     38333927386  ---  Logical Sectors Read
0x01  0x030  6        51828028  ---  Number of Read Commands
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            9144  ---  Spindle Motor Power-on Hours
0x03  0x010  4            9144  ---  Head Flying Hours
0x03  0x018  4             462  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4          352476  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               4  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              34  ---  Current Temperature
0x05  0x010  1              46  N--  Average Short Term Temperature
0x05  0x018  1              50  N--  Average Long Term Temperature
0x05  0x020  1              63  ---  Highest Temperature
0x05  0x028  1              24  ---  Lowest Temperature
0x05  0x030  1              59  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              55  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4             980  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            2899  ---  Number of Hardware Resets
0x06  0x010  4             423  ---  Number of ASR Events
0x06  0x018  4              10  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command Failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

