Solved: ONTAP 8.0 controller in HA pair can't see volumes

aaronswtelogis · ‎2016-02-27

Hello

Running ONTAP 8.0 on a FAS3210, single chassis, dual controller. I created volumes on one node, but after failover they do not appear on the partner.

Waiting for giveback
Disk reTue Apr 3 08:22:38 GMT [ses.giveback.restartAfter:info]: Enclosure Services restarting after release of reservations.
servations have been released
Tue Apr 3 08:22:39 GMT [fmmb.current.lock.disk:info]: Disk 0a.00.12 is a local HA mailbox disk.
Tue Apr 3 08:22:39 GMT [fmmb.current.lock.disk:info]: Disk 0a.00.11 is a local HA mailbox disk.
Tue Apr 3 08:22:39 GMT [fmmb.instStat.change:info]: normal mailbox instance on local side.
Tue Apr 3 08:22:40 GMT [netif.linkDown:info]: Ethernet e1a: Link down, check cable.
Tue Apr 3 08:22:40 GMT [fmmb.current.lock.disk:info]: Disk 0b.01.14 is a partner HA mailbox disk.
Tue Apr 3 08:22:40 GMT [fmmb.current.lock.disk:info]: Disk 0b.01.13 is a partner HA mailbox disk.
Tue Apr 3 08:22:40 GMT [fmmb.instStat.change:info]: normal mailbox instance on partner side.
Tue Apr 3 08:22:40 GMT [cf.fm.partner:info]: Failover monitor: partner 'netapp-ctrl01-dev'
Tue Apr 3 08:22:40 GMT [cf.fm.timeMasterStatus:info]: Acting as time master
Tue Apr 3 08:22:42 GMT [shelf.config.mpha:info]: All attached storage on the system is multi-pathed HA.
Waiting for giveback...(Press Ctrl-C to abort wait)
Waiting for giveback...(Press Ctrl-C to abort wait)Tue Apr 3 08:23:10 GMT [ses.status.temperatureWarning:warning]: DS4243 (S/N SHU0954292N0285) shelf 0 on channel 0a temperature warning for Temperature sensor 1: communication error. Current temperature: <N/A> C (<N/A> F). This module is on the front side of the shelf, at the on the left, on the OPS panel.
Tue Apr 3 08:23:35 GMT [ses.status.displayWarning:warning]: DS4243 (S/N SHU0954292N0285) shelf 0 on channel 0a display warning for Display 1: not installed or failed; display panel failed. This module is on the front side of the shelf, at the at the left.

Waiting for giveback...(Press Ctrl-C to abort wait)Continuing boot...
Tue Apr 3 08:25:39 GMT [coredump.spare.none:info]: No sparecore disk was found.
Tue Apr 3 08:25:40 GMT [raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
Tue Apr 3 08:25:40 GMT [raid.stripe.replay.summary:info]: Replayed 0 stripes.
filter sync'd
Tue Apr 3 08:25:41 GMT [localhost: cf.fm.launch:info]: Launching failover monitor
Tue Apr 3 08:25:41 GMT [localhost: cf.fm.partner:info]: Failover monitor: partner 'netapp-ctrl01-dev'
Tue Apr 3 08:25:41 GMT [localhost: cf.fm.discardNvram:notice]: Failover monitor: node was previously taken over, nvram may be discarded
Tue Apr 3 08:25:41 GMT [localhost: sas.config.mixed.detected:warning]: SAS shelf "1" attached to adapter "0b" contains a mixture of drive types. Mixed configurations are not supported.
RDB-HA ending primary

Tue Apr 3 08:25:42 UTC 2007
add net 127.0.0.0: gateway 127.0.0.1
usage: route add [inet] [host|net] <destination>[&netmask|/prefix] <gateway> <metric>
route add [inet] default <gateway> <metric>
vlan: ten_gig_vif-30 has been created
CIFS local server is running.

login:

netapp-ctrl01-dev> cf status
Cluster enabled, netapp-ctrl00-dev is up.
Interconnect status: up.
netapp-ctrl01-dev> cf monitor
current time: 27Feb2016 20:48:15
UP 00:01:09, partner 'netapp-ctrl00-dev', cluster monitor enabled
Interconnect status: up, takeover capability on-line
partner update TAKEOVER_ENABLED (27Feb2016 20:48:15)
netapp-ctrl01-dev>

netapp-ctrl00-dev> vol status
Volume State Status Options
vol0 online raid_dp, flex root, create_ucode=on
ucs_pool_aus1 online raid_dp, flex create_ucode=on
ucs_pool_aus2 online raid_dp, flex create_ucode=on
isos online raid_dp, flex create_ucode=on
admin_pool_3550 online raid_dp, flex create_ucode=on
dev_pool1 online raid_dp, flex create_ucode=on
dev_pool2 online raid_dp, flex create_ucode=on
netapp-ctrl00-dev>

netapp-ctrl01-dev> vol status
Volume State Status Options
vol0 online raid_dp, flex root, create_ucode=on
testor1 online raid_dp, flex create_ucode=on
testor2 online raid_dp, flex create_ucode=on
netapp-ctrl01-dev>

Any ideas would be greatly welcome.

aborzenkov · ‎2016-02-27

And what do you expect to see? Both controllers have some volumes. Do you mean that some are missing?

aaronswtelogis · ‎2016-02-27

Hello

That's correct. What first alerted me was that some volumes had been deleted as cleanup. When the controller failed over, those volumes were still present, but the *new* storage could not be seen. None of the exports were present.

aaronswtelogis · ‎2016-02-27

Here the controller is being rebooted

[root@aus2cxxn-dev00 ~]# showmount -e 10.139.30.11
Export list for 10.139.30.11:
/vol/dev_pool1 10.139.30.26,10.139.30.28
/vol/dev_pool2 10.139.30.24,10.139.30.30
/vol/isos 10.139.30.0/24
/vol/vol0/home (everyone)
/vol/vol0 (everyone)
/vol/admin_pool_3550 10.139.30.16,10.139.30.18,10.139.30.20,10.139.30.22
/vol/ucs_pool_aus1 10.139.30.12,10.139.30.14
/vol/ucs_pool_aus2 10.139.30.32,10.139.30.34
/vol/testor1 (everyone)
/vol/testor3 (everyone)

Where are the shares?

[root@aus2cxxn-dev00 ~]# showmount -e 10.139.30.11

Export list for 10.139.30.11:
/vol/vol0/home (everyone)
/vol/vol0 (everyone)
[root@aus2cxxn-dev00 ~]#
[root@aus2cxxn-dev00 ~]#

aaronswtelogis · ‎2016-02-27

netapp-ctrl01-dev*> cf monitor all output
cf: Current monitor status (27Feb2016 21:59:51):
partner 'netapp-ctrl00-dev', Interconnect status: up
HA Interconnect Device is Chelsio T3
state UP, time 5841182, event CHECK_FSM, elem ChkMbValid (13)
mirrorConsistencyRequired TRUE
takeoverByPartner 0x3000 <TAKEOVER_ON_REBOOT,TAKEOVER_ON_PANIC>
mirrorEnabled TRUE, lowMemory FALSE, memio UNINIT, killPackets TRUE
degraded FALSE, reservePolicy ALWAYS_AFTER_TAKEOVER, resetDisks TRUE
hw_assist status:
hw_assist Active on netapp-ctrl01-dev. netapp-ctrl01-dev monitoring alerts from partner netapp-ctrl00-dev
hw_assist Active on netapp-ctrl00-dev: netapp-ctrl00-dev monitoring alerts from partner netapp-ctrl01-dev
timeouts:
fast 1000, slow 2500, mailbox 10000, connect 5000
operator 600000, firmware 15000 (recvd 5841182), dumpcore 60000
booting 300000 (recvd 0)
transit timer enabled TRUE, transit 600000 (last 1476771)
mailbox disks:
Disk 0b.01.14 is a local mailbox disk
Disk 0b.01.13 is a local mailbox disk
Disk 0a.00.12 is a partner mailbox disk
Disk 0a.00.11 is a partner mailbox disk
primary state:
version 2, senderSysid 1574172085
cluster_time 1456630963, hbt 5983, node_status TAKEOVER_ENABLED
info 0x3000 <TAKEOVER_ON_REBOOT,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_MAILBOX, abs_time 1456635589, sk_time 5839182
channel_status 0
channel CHANNEL_IC, abs_time 1456635591, sk_time 5841182
channel_status 0
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
backup state:
version 2, senderSysid 1574125760
cluster_time 1456630963, hbt 8263, node_status TAKEOVER_ENABLED
info 0x3000 <TAKEOVER_ON_REBOOT,TAKEOVER_ON_PANIC>
flags 0x20 <FMFLG_AUTOGB_DONE>
channel CHANNEL_MAILBOX, abs_time 1456635591, sk_time 5841236
channel_status 0
Channel Read Ctx:
version 2, senderSysid 1574125760
cluster_time 1456630963, hbt 8263, node_status TAKEOVER_ENABLED
info 0x3000 <TAKEOVER_ON_REBOOT,TAKEOVER_ON_PANIC>
flags 0x20 <FMFLG_AUTOGB_DONE>
channel CHANNEL_IC, abs_time 1456635591, sk_time 5841182
channel_status 0
Channel Read Ctx:
version 2, senderSysid 1574125760
cluster_time 1456630963, hbt 8261, node_status TAKEOVER_ENABLED
info 0x3000 <TAKEOVER_ON_REBOOT,TAKEOVER_ON_PANIC>
flags 0x20 <FMFLG_AUTOGB_DONE>
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
takeoverState FT_NONE, takeoverString 'No takeover information'
givebackState FT_NONE, givebackString 'No giveback information'
givebackRetries 0, givebackRequested FALSE
autoGivebackEnabled FALSE, autoGivebackWasDone FALSE, autoGivebackCifsStopping FALSE
autoGivebackLastVetoCheck 0, autoGivebackAttemptsExceeded FALSE
Maximum primary disk mailbox io times: normal = 710, transition = 0
Maximum backup disk mailbox io times: normal = 296, transition = 0
Num times logs unsynced : 0
Total system uptime: 5841949 msec
Sync state total time : 4746092 msec
Sync state Max time : 4362363 msec

aborzenkov · ‎2016-02-27

You show 7-Mode commands, but you tagged your question with cluster mode. So start by explaining what you actually have.

We have no idea what IP address you are showing or how it relates to your question. If it is a filer address, say so, and say which filer. Or maybe it is an SVM address, if you really have C-Mode.

aaronswtelogis · ‎2016-02-27

Hello

Thanks for the response.

I thought the HA pairs were a cluster.

netapp-ctrl01-dev> cf status
Cluster enabled, netapp-ctrl00-dev is up.
Interconnect status: up.
netapp-ctrl01-dev>

netapp-ctrl01-dev> sysconfig -v

NetApp Release 8.0.2P4 7-Mode: Tue Nov 15 16:16:47 PST 2011
System ID: 1574172085 (netapp-ctrl01-dev); partner ID: 1574125760 (netapp-ctrl00-dev)
System Rev: F3
System Storage Configuration: Multi-Path HA

The host was looking at the available NFS shares while the node serving them was up. When that node was rebooted, the second node took over, and the second listing shows the shares that were available after the reboot.

Ask me anything you like. I'll do my best to provide that information. I appreciate any help.

aaronswtelogis · ‎2016-02-27

This is where the volumes stand at this moment.

My understanding is that if one controller goes away, the remaining active one can take over serving its shares. The drives are definitely there.

netapp-ctrl00-dev> vol status
Volume State Status Options
vol0 online raid_dp, flex root, create_ucode=on
ucs_pool_aus1 online raid_dp, flex create_ucode=on
ucs_pool_aus2 online raid_dp, flex create_ucode=on
isos online raid_dp, flex create_ucode=on
admin_pool_3550 online raid_dp, flex create_ucode=on
dev_pool1 online raid_dp, flex create_ucode=on
dev_pool2 online raid_dp, flex create_ucode=on
testor1 online raid_dp, flex create_ucode=on
testor3 online raid_dp, flex create_ucode=on
netapp-ctrl00-dev> vol status isos
Volume State Status Options
isos online raid_dp, flex create_ucode=on
Volume UUID: 33ac429a-e247-11db-9ca1-00a0981736f8
Containing aggregate: 'aggr0'
netapp-ctrl00-dev> vol online isos
vol online: Volume 'isos' is already online.
netapp-ctrl00-dev>

netapp-ctrl01-dev> vol status
Volume State Status Options
vol0 online raid_dp, flex root, create_ucode=on
netapp-ctrl01-dev>
netapp-ctrl01-dev> vol online isos
vol online: No volume named 'isos' exists.
netapp-ctrl01-dev>

aaronswtelogis · ‎2016-02-28

One other thing.

I've seen conflicting statements about whether the controllers need to be cabled to each other. I can't point to the docs at this time, but the wording is essentially that cabling is not needed for dual-controller, single-chassis setups. That said, do c0a/c0b on each controller need to be cabled to one another?

aborzenkov · ‎2016-02-27

Please show "ifconfig -a" and /etc/exports from each controller.

SeanHatfield · ‎2016-02-28

Where to begin....

How about the date? The clock on netapp-ctrl00-dev is way off. It's set to 2007, four years before this version of ONTAP was released.

Waiting for giveback
Disk reTue Apr 3 08:22:38 GMT
...
Tue Apr 3 08:25:42 UTC 2007

And this shelf appears to have failed

[ses.status.temperatureWarning:warning]: DS4243 (S/N SHU0954292N0285) shelf 0 on channel 0a temperature warning for Temperature sensor 1: communication error. Current temperature: <N/A> C (<N/A> F). This module is on the front side of the shelf, at the on the left, on the OPS panel.
Tue Apr 3 08:23:35 GMT [ses.status.displayWarning:warning]: DS4243 (S/N SHU0954292N0285) shelf 0 on channel 0a display warning for Display 1: not installed or failed; display panel failed. This module is on the front side of the shelf, at the at the left.

The disk population in shelf 1 is unsupported:

Tue Apr 3 08:25:41 GMT [localhost: sas.config.mixed.detected:warning]: SAS shelf "1" attached to adapter "0b" contains a mixture of drive types. Mixed configurations are not supported.

Once you see the login prompt, it's not in failover anymore:

vlan: ten_gig_vif-30 has been created
CIFS local server is running.

login:

While one node is still at "Waiting for giveback", you can look at its resources from the node that took over by prefixing the command with the partner keyword, e.g. "partner vol status":

na7m1a(takeover)> partner vol status
         Volume State           Status                Options
           vol0 online          raid0, flex           root
                                64-bit                
na7m1a(takeover)>

Back in the day, 7-Mode HA pairs used to be called clusters. These days "cluster" refers to cluster mode, but old-style references persist in some corners of the 7-Mode CLI even today. This is some really old code you're running.

In addition to what Andrey mentioned, it would help to see the /etc/rc files and the /etc/hosts files. But at least try setting the clock and running another takeover/giveback (TO/GB). If the time doesn't persist, you need hardware support on the controller as well as the shelf.
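
One way that retest could be run from the consoles is sketched below. Only the commands are shown, not their output. The sketch assumes netapp-ctrl01-dev takes over netapp-ctrl00-dev; the prompts are illustrative, and the argument format for setting the time with "date" should be checked against the man page for this release.

```
netapp-ctrl00-dev> date
netapp-ctrl01-dev> date
netapp-ctrl01-dev> cf takeover
netapp-ctrl01-dev(takeover)> partner vol status
netapp-ctrl01-dev(takeover)> cf giveback
netapp-ctrl00-dev> date
```

The first two "date" calls confirm that both clocks agree before the test, "partner vol status" lists the taken-over node's volumes while the pair is in takeover, and the final "date" on netapp-ctrl00-dev shows whether the time survived the reboot.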

aaronswtelogis · ‎2016-02-28

Hello

I did set the clock last night. There was no quick access to an NTP source, so I did it manually. The client output I posted was captured after the date had been changed.

The fault comes from the power indicator being knocked off in shipping.

To our best knowledge, all of this worked before shipping :).

netapp-ctrl00-dev> rdfile /etc/rc
hostname netapp-ctrl00-dev
ifconfig e0M 10.139.20.40 netmask 255.255.255.0 partner e0M
route add default 10.139.20.1
options our.domain.dev
options dns.enable on
ifgrp create single ten_gig_vif e1a e1b
vlan create ten_gig_vif 30
ifconfig ten_gig_vif-30 10.139.30.11 netmask 255.255.255.0
netapp-ctrl00-dev>

netapp-ctrl00-dev> rdfile /etc/hosts
127.0.0.1 localhost localhost-stack
127.0.10.1 localhost-10 localhost-bsd
127.0.20.1 localhost-20 localhost-sk
10.139.30.11 netapp-ctrl00-dev ten_gig_vif-30
10.139.20.40 netapp-ctrl00-dev-e0M
netapp-ctrl00-dev>

netapp-ctrl01-dev> rdfile /etc/rc
hostname netapp-ctrl01-dev
ifconfig e0M 10.139.20.41 netmask 255.255.255.0 partner e0M
route add default 10.139.20.1
options our.domain.dev
options dns.enable on
ifgrp create single ten_gig_vif e1a e1b
vlan create ten_gig_vif 30
ifconfig ten_gig_vif-30 10.139.30.11 netmask 255.255.255.0

netapp-ctrl01-dev> rdfile /etc/hosts
127.0.0.1 localhost localhost-stack
127.0.10.1 localhost-10 localhost-bsd
127.0.20.1 localhost-20 localhost-sk
10.139.30.11 netapp-ctrl01-dev ten_gig_vif-30
10.139.20.41 netapp-ctrl01-dev-e0M
netapp-ctrl01-dev>

I can't think of anything else. I've looked into WAFL_iron. My understanding is that it might address volume inconsistencies.

I believe you need the original install media if a reinstallation were to be done. We don't have that on hand. It might be possible to reach out to the group that gave this to us.

thanks

aborzenkov · ‎2016-02-28

Your network configuration is invalid. You have the same IP address (10.139.30.11 on ten_gig_vif-30) on both controllers, so it is unpredictable which controller a client will reach. This could *not* have worked before except by sheer luck.

You need to set two different addresses and add partner statements so addresses are available after takeover as well.
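
As a rough sketch of what that could look like, the last line of each /etc/rc might become something like the following. The second address, 10.139.30.13, is a made-up placeholder and has to be replaced with an address that is actually free on that subnet. The "partner ten_gig_vif-30" clause assumes the VLAN interface has the same name on both controllers, as it does in the files posted above.

```
# netapp-ctrl00-dev /etc/rc (last line only): keeps the existing address
ifconfig ten_gig_vif-30 10.139.30.11 netmask 255.255.255.0 partner ten_gig_vif-30

# netapp-ctrl01-dev /etc/rc (last line only): 10.139.30.13 is a hypothetical free address
ifconfig ten_gig_vif-30 10.139.30.13 netmask 255.255.255.0 partner ten_gig_vif-30
```

The /etc/hosts entry on netapp-ctrl01-dev would need the same new address. Clients would then mount each volume from the address of the controller that owns it, and during takeover the surviving controller should answer on both addresses.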

aaronswtelogis · ‎2016-02-28

Hello

Thanks, I will give that a try.

aaronswtelogis · ‎2016-02-28

Hello

Thank you both. This has been an intensive NetApp education.

It'd be nice to provide credit to both of you.