Problem with cluster

testspa2k4 · 2011-01-17

Since I added a new shelf to our FAS2040, the second head (which has an aggregate on the new shelf) randomly disables and re-enables the cluster configuration. The first head works perfectly.

These are the syslog messages:

Wed Jan 12 23:56:00 CET [str02: monitor.globalStatus.nonCritical:warning]: /vol/datastore02 is full (using or reserving 100% of space and 0% of inodes, using 100% of reserve).
Thu Jan 13 00:00:00 CET [str02: kern.uptime.filer:info]: 12:00am up 16 days, 7:23 0 NFS ops, 0 CIFS ops, 393 HTTP ops, 72149082 FCP ops, 5448089 iSCSI ops
Thu Jan 13 00:30:29 CET [str02: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of strpratic01 disabled (status of backup mailbox is uncertain)
Thu Jan 13 00:30:33 CET [str02: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of strpratic01 enabled
Thu Jan 13 01:00:00 CET [str02: kern.uptime.filer:info]:   1:00am up 16 days, 8:23 0 NFS ops, 0 CIFS ops, 393 HTTP ops, 72216191 FCP ops, 5448089 iSCSI ops
Thu Jan 13 01:47:19 CET [str02: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of strpratic01 disabled (status of backup mailbox is uncertain)
Thu Jan 13 01:47:24 CET [str2: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of strpratic01 enabled
Thu Jan 13 02:00:00 CET [str02: kern.uptime.filer:info]:   2:00am up 16 days, 9:23 0 NFS ops, 0 CIFS ops, 393 HTTP ops, 72286700 FCP ops, 5448089 iSCSI ops
Thu Jan 13 03:00:00 CET [str02: kern.uptime.filer:info]:   3:00am up 16 days, 10:23 0 NFS ops, 0 CIFS ops, 393 HTTP ops, 72380904 FCP ops, 5448089 iSCSI ops
Thu Jan 13 03:45:14 CET [str02: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of strpratic01 disabled (status of backup mailbox is uncertain)
Thu Jan 13 03:45:19 CET [str02: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of strpratic01 enabled
Thu Jan 13 04:00:00 CET [str02: kern.uptime.filer:info]:   4:00am up 16 days, 11:23 0 NFS ops, 0 CIFS ops, 393 HTTP ops, 72433063 FCP ops, 5448089 iSCSI ops
Thu Jan 13 05:00:00 CET [str02: kern.uptime.filer:info]:   5:00am up 16 days, 12:23 0 NFS ops, 0 CIFS ops, 393 HTTP ops, 72596309 FCP ops, 5448089 iSCSI ops
Thu Jan 13 05:02:11 CET [str02: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of strpratic01 disabled (status of backup mailbox is uncertain)
Thu Jan 13 05:02:17 CET [str02: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of strpratic01 enabled
Thu Jan 13 05:41:39 CET [str02: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of strpratic01 disabled (status of backup mailbox is uncertain)

How can I solve this issue?

Marco

BrendonHiggins · 2011-01-17

Hi, welcome to the community.

This sounds like your issue:

https://kb.netapp.com/support/index?page=content&id=1010888

Cluster monitor: takeover of filerB disabled (status of backup mailbox is uncertain)
- Example:
  Sat Sep 30 12:18:10 CEST [filerA: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of filerB disabled (status of backup mailbox is uncertain)
- Cause:
  FilerA has disabled clustering because it cannot determine the status of the backup mailbox disk. When a mailbox disk becomes unreadable, the local node cannot accurately determine the state of the partner. If the Interconnect link is still enabled, clustering is disabled because the filer cannot accurately determine whether the drives are still accessible on its partner.

If the error clears within a minute, check whether the disk was updating firmware at the time of the error. During a disk firmware upgrade, the disk will be taken offline briefly, creating the false positive error message.
For FC-AL loops, run the fcadmin link_stats command several times over the course of 5 minutes to see whether errors are incrementing on the loop. For SAS systems, use sasadmin dev_stats to look for incrementing errors. If errors are incrementing, perform further troubleshooting to isolate the problematic component.
If the problem clears without manual intervention, monitor the system for a reoccurrence. It may have been caused by a transient condition that caused the filer to miss an update to the mailbox disk.
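For the "run it several times and see whether errors are incrementing" step, a rough sketch of the comparison is below. It assumes you have already copied the counters from each run into a dictionary by hand; the device names and counter names are made up for illustration and are not the real column names of fcadmin link_stats or sasadmin dev_stats.

```python
def incrementing_counters(snapshots):
    """Compare the first and last snapshot and return the counters that grew.

    Each snapshot is {device: {counter_name: value}}. The result maps
    (device, counter_name) to a (first_value, last_value) tuple.
    """
    first, last = snapshots[0], snapshots[-1]
    grew = {}
    for device, counters in last.items():
        for name, value in counters.items():
            # A device or counter missing from the first run counts as 0.
            before = first.get(device, {}).get(name, 0)
            if value > before:
                grew[(device, name)] = (before, value)
    return grew


if __name__ == "__main__":
    # Hypothetical values typed in from two runs a few minutes apart.
    run_1 = {"disk_a": {"crc_errors": 3, "link_failures": 0}}
    run_2 = {"disk_a": {"crc_errors": 9, "link_failures": 0}}
    for (device, name), (old, new) in incrementing_counters([run_1, run_2]).items():
        print(f"{device} {name}: {old} -> {new}")
```

Anything this reports is a candidate for the "further troubleshooting" the KB mentions. Counters that stay flat across the runs are probably old errors.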

Be sure to check the logs on both filers.
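To see how long takeover stays disabled each time, you could pair up the Disabled and Enabled messages from a saved copy of the log. This is a minimal sketch, assuming the line layout matches what you pasted. The function name takeover_gaps and the year default are my own choices, since the syslog lines carry no year.

```python
import re
from datetime import datetime

# Picks the "Mon DD HH:MM:SS" stamp and the Disabled/Enabled state out of
# cf.fsm.takeoverOfPartner* lines; every other syslog line is ignored.
EVENT_RE = re.compile(
    r"^\w{3} (?P<stamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \w+ "
    r"\[[^:]+: cf\.fsm\.takeoverOfPartner(?P<state>Disabled|Enabled):"
)


def takeover_gaps(lines, year=2011):
    """Return (gaps, still_open).

    gaps is a list of (disabled_at, seconds_until_enabled) tuples.
    still_open is the time of a trailing Disabled event with no matching
    Enabled event, or None.
    """
    gaps = []
    open_since = None
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue
        # The year is an assumption supplied by the caller.
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        if m["state"] == "Disabled":
            if open_since is None:
                open_since = when
        elif open_since is not None:
            gaps.append((open_since, (when - open_since).total_seconds()))
            open_since = None
    return gaps, open_since


if __name__ == "__main__":
    sample = [
        "Thu Jan 13 00:30:29 CET [str02: cf.fsm.takeoverOfPartnerDisabled:notice]: "
        "Cluster monitor: takeover of strpratic01 disabled (status of backup mailbox is uncertain)",
        "Thu Jan 13 00:30:33 CET [str02: cf.fsm.takeoverOfPartnerEnabled:notice]: "
        "Cluster monitor: takeover of strpratic01 enabled",
    ]
    gaps, still_open = takeover_gaps(sample)
    for disabled_at, seconds in gaps:
        print(disabled_at, seconds)
```

In the log you posted, the four completed Disabled/Enabled pairs are 4 to 6 seconds apart, so every one cleared well within a minute. That points at the firmware-update or transient-condition cases in the KB text above.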

sysconfig -v - Shows the firmware versions. The multipath status is at the top of the report.

storage show disk -p - Have you got the cables correct, A to B and B to A?

Link to cable diagrams - http://now.netapp.com/NOW/knowledge/docs/san/fcp_iscsi_config/config_guide_73/frameset.html

Hope it helps

Bren