hw_assist test fails 'timeout'

We recently had an issue where our 'hw_assist' IPs were on a network that experienced some downtime, and possibly as a result our filers panicked and then rebooted.

We're still investigating the coredump, but in the meantime we want to connect our filers (filerA and filerB) directly to each other on an unused onboard port (e0a, since e0M isn't in use) and use that for the 'cf.hw_assist.partner.address' IP.

I've already configured e0a on filerA as: 172.16.3.111/24 and e0a on filerB as: 172.16.3.113/24.  Here is how cf.hw_assist is configured on both systems:

filerA

filerA> ifconfig e0a

e0a: flags=0x6f48867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM,NOWINS> mtu 1500

          inet 172.16.3.111 netmask 0xffffff00 broadcast 172.16.3.255

          ether 00:a0:98:0d:eb:30 (auto-1000t-fd-up) flowcontrol full


filerA> ping 172.16.3.111

172.16.3.111 is alive


filerA> options cf.hw_assist                            

cf.hw_assist.enable          on        

cf.hw_assist.partner.address 172.16.3.113

cf.hw_assist.partner.port    4444      

filerA> cf hw_assist status

Local Node(filerA) Status:

          Active: filerA monitoring alerts from partner(filerB)

          port 4444 IP address 172.16.3.111

          Missed keep alive alert from partner(filerB).

                    Last keep alive alert received on

                     Tue Oct  4 16:57:20 PDT 2011

Partner Node(filerB) Status:

          Active: filerB monitoring alerts from partner(filerA)

          port 4444 IP address 172.16.3.113

filerA> cf hw_assist test

cf hw_assist Error: No response from partner(filerB), timed out.

filerA> rlm status

          Remote LAN Module           Status: Online

                    Part Number:        110-00057

                    Revision:           F0

                    Serial Number:      48XXXX

                    Firmware Version:   3.0

                    Mgmt MAC Address:   00:A0:98:10:0C:2B

                    Ethernet Link:      up

                    Using DHCP:         no

          IPv4 configuration:

                    IP Address:         10.100.1.111

                    Netmask:            255.255.255.0

                    Gateway:            10.100.1.2

filerB

filerB> ifconfig e0a

e0a: flags=0x6f48867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM,NOWINS> mtu 1500

          inet 172.16.3.113 netmask 0xffffff00 broadcast 172.16.3.255

          ether 00:a0:98:10:2d:d0 (auto-1000t-fd-up) flowcontrol full


filerB> ping 172.16.3.113

172.16.3.113 is alive



filerB> options cf.hw_assist

cf.hw_assist.enable          on        

cf.hw_assist.partner.address 172.16.3.111

cf.hw_assist.partner.port    4444

filerB> cf hw_assist status

Local Node(filerB) Status:

          Active: filerB monitoring alerts from partner(filerA)

          port 4444 IP address 172.16.3.113

          Missed keep alive alert from partner(filerA).

                    Last keep alive alert received on

                     Tue Oct  4 18:09:02 PDT 2011

Partner Node(filerA) Status:

          Active: filerA monitoring alerts from partner(filerB)

          port 4444 IP address 172.16.3.111

filerB> cf hw_assist test

cf hw_assist Error: No response from partner(filerA), timed out.

filerB> rlm status

          Remote LAN Module           Status: Online

                    Part Number:        110-00057

                    Revision:           F0

                    Serial Number:      48XXXX

                    Firmware Version:   3.0

                    Mgmt MAC Address:   00:A0:98:0F:8C:15

                    Ethernet Link:      up

                    Using DHCP:         no

          IPv4 configuration:

                    IP Address:         10.100.1.113

                    Netmask:            255.255.255.0

                    Gateway:            10.100.1.2

Cluster is currently enabled and up and RLM is configured.  Any ideas as to why the 'cf hw_assist test' fails?  I've set the e0a interface to be trusted.  We're running DOT 8.0.1 7-mode.

Re: hw_assist test fails 'timeout'

Hw_assist requires connectivity between the filer head on one side and the partner's RLM on the other. A direct connection between two onboard ports is therefore not going to work, because the partner's RLM has no path to that link. You would need a small switch to connect the two RLMs and the two dedicated ports together.

Re: hw_assist test fails 'timeout'

Thanks, sounds like the proper way to move forward is:

- rewire / reconfigure e0a (on both filers) with a unique 10.100.1.0/24 address on VLAN1 (same as the RLMs)

- update cf.hw_assist.partner.address

- run 'cf hw_assist test' again
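
As a quick sanity check on the addressing, here is a minimal Python sketch (not anything Data ONTAP itself runs). It only tests whether a filer's hw_assist address sits on the same /24 as the partner's RLM. The RLM and current e0a addresses come from the output above; the planned e0a address 10.100.1.121 is a made-up placeholder. Same-subnet is a simplification: a routed path between the two networks would also satisfy the connectivity requirement.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, prefix_len: int) -> bool:
    """Return True if both addresses fall in the same network of the given prefix length."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefix_len}").network
    return net_a == net_b


# Addresses taken from the 'rlm status' and 'ifconfig e0a' output in this thread.
RLM_FILER_B = "10.100.1.113"
E0A_FILER_A_DIRECT = "172.16.3.111"

# Hypothetical address for the reconfigured e0a; pick any free one in 10.100.1.0/24.
E0A_FILER_A_PLANNED = "10.100.1.121"

if __name__ == "__main__":
    for label, addr in (("direct link", E0A_FILER_A_DIRECT),
                        ("planned", E0A_FILER_A_PLANNED)):
        ok = same_subnet(RLM_FILER_B, addr, 24)
        print(f"{label}: {addr} shares a /24 with partner RLM {RLM_FILER_B}: {ok}")
```

With the back-to-back link the check comes out false, which matches the timeout; with an address in the RLM subnet it comes out true.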

I'll give this a shot and report back.

Re: hw_assist test fails 'timeout'

I just thought I would add to this post to save people some time resolving hw_assist timeout issues with a Service Processor (SP):

 

First, check the SP speed / duplex: type 'sp status' and confirm that the SP has negotiated 100Mb / full. If it has not, reconfigure the SP's network switch ports to auto / auto (speed / duplex).

 

Once this has been done, type 'sp status' again to confirm 100Mb / full duplex. If the output still shows 100Mb / half duplex, type 'sp reboot', then use 'sp status' to confirm that the reboot has completed and that speed / duplex is set correctly.

 

Another cause of timeout messages is an SP that has not been configured properly. This can show up as an SP prompt without a hostname, i.e. 'SP>'. The SP prompt should be 'SP hostname>'.

 

To fix this issue use the following commands:

 

sp status
options sp.setup off
sp setup (using info from sp status)
cf hw_assist test
cf hw_assist status
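
If you are checking many systems, the prompt test described above can be scripted. This is only a rough Python sketch: it assumes you have already captured the prompt line as a string (for example from a console session log), and the function name and return values are my own invention, not part of any NetApp tool.

```python
import re

# 'SP hostname>' is the expected prompt; a bare 'SP>' suggests the SP was never set up.
_CONFIGURED = re.compile(r"^SP\s+\S+>$")
_UNCONFIGURED = re.compile(r"^SP>$")


def classify_sp_prompt(prompt: str) -> str:
    """Classify a captured SP prompt line as 'configured', 'unconfigured', or 'unknown'."""
    line = prompt.strip()
    if _CONFIGURED.match(line):
        return "configured"
    if _UNCONFIGURED.match(line):
        return "unconfigured"
    return "unknown"


if __name__ == "__main__":
    for sample in ("SP filerA>", "SP>", "filerA>"):
        print(f"{sample!r}: {classify_sp_prompt(sample)}")
```

Anything reported as 'unconfigured' is a candidate for the 'sp setup' steps listed above.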

 

I hope this helps.