Solved: E5600 Lockdown LJ Recovery

ufa · ‎2022-05-30

Hi,

we have a controller pair (out of support since the end of April, unfortunately) which went into an LJ lockdown. AFAICS that comes from a logical failure of the Config Database. The condition started around the time of a disk replacement. The storage is organized in disk pools ("declustered arrays"). I did not notice the issue immediately as I was replacing more disks in other storage systems. I also do not completely recall whether the LJ lockdown was the initial state after the failure set in; I realized it only after resetting the controllers. However, they were unreachable via their admin interfaces before the reset.

The meaning of the LJ code is "The controller has insufficient memory to support the current configuration".

We tried resetting the system in different variants (resetting one controller while the other kept running, resetting one controller with the other one removed, each with the newly inserted disk either in or removed). We also tried running lemClearLockdown. The controllers always run into the same lockdown. I suspect the database has picked up some invalid entry which confuses the CRUSH setup; this can also be inferred from the boot messages visible at the serial console, which include:

05/25/22-11:53:06 (tRAID): WARN: Error threshold exceeded: "Client_Crush_Invalid_Memory_Config"
05/25/22-11:53:06 (tRAID): WARN: LEM:Client threshold Exceeded
05/25/22-11:53:06 (tRAID): NOTE: LEM: checkError(); Error threshold exceeded;SUSPENDED_LOCKDOWN;
05/25/22-11:53:06 (tRAID): WARN: sodLockdownCheck: sodSequence update: SuspLockdown
05/25/22-11:53:06 (tRAID): NOTE: Inter-Controller Communication Channels Opened
05/25/22-11:53:06 (tRAID): WARN: Error threshold exceeded: "Client_Crush_Invalid_Memory_Config"
05/25/22-11:53:06 (tRAID): WARN: Client reported lockdown: "Client_Crush_Invalid_Memory_Config"
05/25/22-11:53:06 (tRAID): NOTE: Flashing tray summary fault LED
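For anyone triaging a similar serial console capture, here is a minimal Python sketch that tallies which LEM client is named in the WARN lines. It assumes only the line layout visible in the excerpt above (timestamp, task name in parentheses, level, message). The function name `lockdown_clients` and the regular expressions are illustrative, not part of any NetApp tooling.

```python
import re

# Assumed line layout, taken from the excerpt above:
# 05/25/22-11:53:06 (tRAID): WARN: Error threshold exceeded: "Client_..."
LINE_RE = re.compile(
    r'^(?P<ts>\d{2}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}) '
    r'\((?P<task>[^)]+)\): '
    r'(?P<level>[A-Z]+): '
    r'(?P<msg>.*)$'
)
# A quoted client name such as "Client_Crush_Invalid_Memory_Config"
CLIENT_RE = re.compile(r'"(Client_[A-Za-z0-9_]+)"')


def lockdown_clients(log_text):
    """Return {client_name: count} for WARN lines that quote a client name."""
    counts = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "WARN":
            continue  # skip NOTE lines and anything that does not parse
        for name in CLIENT_RE.findall(m.group("msg")):
            counts[name] = counts.get(name, 0) + 1
    return counts


SAMPLE = """\
05/25/22-11:53:06 (tRAID): WARN: Error threshold exceeded: "Client_Crush_Invalid_Memory_Config"
05/25/22-11:53:06 (tRAID): WARN: LEM:Client threshold Exceeded
05/25/22-11:53:06 (tRAID): NOTE: LEM: checkError(); Error threshold exceeded;SUSPENDED_LOCKDOWN;
05/25/22-11:53:06 (tRAID): WARN: sodLockdownCheck: sodSequence update: SuspLockdown
05/25/22-11:53:06 (tRAID): NOTE: Inter-Controller Communication Channels Opened
05/25/22-11:53:06 (tRAID): WARN: Error threshold exceeded: "Client_Crush_Invalid_Memory_Config"
05/25/22-11:53:06 (tRAID): WARN: Client reported lockdown: "Client_Crush_Invalid_Memory_Config"
05/25/22-11:53:06 (tRAID): NOTE: Flashing tray summary fault LED
"""

result = lockdown_clients(SAMPLE)
print(result)
```

On the excerpt, the only client named is Client_Crush_Invalid_Memory_Config, which points to the disk pool (CRUSH) configuration rather than to the physical memory.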

The memory of the controllers has of course not been changed by us.

While we do have data backups, the controller pair contributes to the storage of file systems with about 11 PB, and if at all possible we would not want to restore from tape as that would probably take very long. We would rather fix the supposed issue with the database.

We found stored database exports from both controllers which were very likely initiated by the StorageManager (located in /var/opt/SM/monitor/dbcapture/, the last ones from Jan 12 this year). I have to find out whether disks were replaced after that; if not (except for the recent one), I hope removing the most recently installed disk and replaying the config from January could do the trick.

However, I do not know how to transfer and activate an exported database dump via the serial line.

If there are other/better ways to overcome our problem, I'd be glad to learn about them.

BTW: we are also trying to get NetApp involved, but as we purchased these systems as OEM systems we are not there yet.

Sincerely

ufa

mos2 · ‎2022-05-31

Hello Ufa,

Just to confirm what is the seven segment display on both controllers?

ufa · ‎2022-05-31

LJ

ufa · ‎2022-06-01

LJ

mos2 · ‎2022-05-31

Hello Ufa,

With this kind of lockdown we will require additional logging to see what's really going on here. Fortunately you're heading in the right direction by getting support involved, as this will also require shell access to troubleshoot.

ufa · ‎2022-06-01

Additional logging via loadDebug? As I said, support for OEMs on a one-off basis appears tricky to impossible 😞

ufa · ‎2022-06-01

I should add to the description that the inserted disk most likely triggering the LJ came from a dismantled E5600 system, i.e. is a used one.

NetApp_MC · ‎2022-06-03

Were there any ECC errors logged against the DIMMs?

ufa · ‎2022-06-08

AFAICS, no.

No mention of ECC errors in excLogShow (on the second controller; I just cannot connect to the first one right now, maybe the serial cable was disconnected; I am not currently at the system's location).

Just that error :

---- Log Entry #9 (Core 0) MAY-25-2022 06:21:19 AM ----

Exception: Page Fault
   errCode:  0  pc:  0xffffffff80591acd   pathCat + 0x2d

Execution of

evfShowVol(0,99,0,0,0,0,0,0,0,0)

resets the controller ("---- Log Entry #17 (Core 0) JUN-08-2022 04:48:57 AM ----
06/08/22-10:19:32 (tShell0): ASSERT: Assertion failed: template N3evf16VolumeCfgManagerE, instance is Null, function getInstance") - which is probably due to the confused Crush configuration.

NetApp_RZ · ‎2022-07-07

Given your last output, and the symptoms, I would agree there is an issue with the crush (disk pool) configuration.

So the question now is: does this array contain data that needs to be saved?
Recovering from this situation can be very tricky, and getting the data back is not always guaranteed, as the metadata for the disk pool configuration is damaged.

If the data does not need to be saved, the configuration can be wiped via the dbmWipeAllAtSOD=1 method as detailed in the following KB:
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Systems/E-Series_Storage_Array/How_to_clear_the_configuration_without_setting_a_breakpoi...

Mainly this part:

VKI_EDIT_OPTIONS
Type I and press Enter to insert a statement.
dbmWipeAllAtSOD=1
Press Enter Twice.
Type + and press Enter to enable option.
Type q and press Enter to quit.
Type y and press Enter to commit changes to NVSRAM

Then, run lemClearLockdown on both controllers.

ufa · ‎2022-07-08

Hi, thanks. Unfortunately, we could not afford to lose the data. However, we got NetApp involved. The solution was that they restored an on-board backup of the config. We had an auto-backup on our admin server from a few months ago, but two disks had been replaced in between, hence it might have been invalid anyway (I suppose running a config using 8+2p redundancy with corrupt sector allocation maps for two disks could be okay if one first does a full scrub, as the redundancy code should be able to correct 2 invalid strips per stripe). However, that on-board backup (on one controller) could be restored. I suppose it is not publicly documented how to access these on-board backups, and I was off when the fix was performed, so I do not have more details. At least the config could be restored and the problem is resolved, but based on precious non-disclosed power-knowledge (o.k., if we had been left on our own, or if the data on the system had not been that precious, I would have restored our external backup and given it a try as laid out above)...
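The counting argument in the parenthesis above (two suspect disks put at most 2 questionable strips in any 8+2p stripe) can be illustrated with a small Python sketch. It is only a toy model under loudly stated assumptions: a hypothetical 60-disk pool, uniformly random placement standing in for the real CRUSH layout, and each stripe putting at most one strip on any given disk. It says nothing about whether the controller firmware would actually accept such a config. Note also that two parities repair 2 strips per stripe only when their positions are known (erasures); with unknown positions only 1 can be corrected.

```python
import random


def max_strips_on_suspect(num_disks, stripe_width, suspect, num_stripes, seed=0):
    """Worst-case number of strips per stripe that land on suspect disks.

    Placement is uniformly random over distinct disks (an assumption,
    not the real CRUSH algorithm).
    """
    rng = random.Random(seed)
    worst = 0
    for _ in range(num_stripes):
        # Each stripe places its strips on stripe_width distinct disks.
        disks = rng.sample(range(num_disks), stripe_width)
        hits = sum(1 for d in disks if d in suspect)
        worst = max(worst, hits)
    return worst


# Hypothetical pool: 60 disks, 8+2p stripes (width 10), two suspect disks.
worst = max_strips_on_suspect(
    num_disks=60, stripe_width=10, suspect={3, 17}, num_stripes=10000
)
print(worst)  # never more than the number of suspect disks
```

Because a stripe never uses the same disk twice in this model, the worst case is bounded by the number of suspect disks, which matches the 2-strip tolerance of 8+2p for known-bad strips.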

NetApp_RZ · ‎2022-07-08

Thanks for the followup on this confirming that the issue has been resolved.
Really good to hear that restoring the OBB took care of the issue.
I do know sometimes we have to get the OBB data patched by development in certain circumstances to correct issues when the metadata really becomes compromised.

Glad to hear this is resolved.