Hi, I've opened a case on this, but I also wanted to get the community's feedback.
Summary:
Last night, just after initiating two vfiler migrations, we experienced an outage on our new 3170 running 7.3.3.
It dropped off the network around 10:50pm and I could not ssh to it. I logged in via the RLM and found that cf status said up, but all the network and NFS services were down.
I ended up initiating a takeover from its partner, then a giveback after the problem head came up clean.
It was logging messages like these on the RLM console:
ping: wrote 17.6.6.1 64 chars, error=No buffer space available
na01> Wed Nov 10 23:00:07 PST [irt-na01: Java_Thread:info]: Lookup of time.school.edu failed with DNS server 17.6.7.77: No buffer space available.
syslogd: Could not forward message to host 17.2.65.2: No buffer space available
Nov 10 23:00:07 [na01: Java_Thread:info]: Lookup of time.school.edu failed with DNS server 17.6.7.77: No buffer space available.
syslogd: Could not forward message to host 17.2.65.2: No buffer space available
CPU was high before and after the outage.
NFS ops were low at the time of the outage.
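For anyone comparing notes on a similar console capture, here is a minimal sketch in Python that tallies the "No buffer space available" messages by which service reported them and which host it was trying to reach. It assumes the RLM console output has been saved as plain text lines; the function name `tally_enobufs` and the labels `ping`, `dns`, and `syslogd` are hypothetical, chosen only to match the three message shapes shown above. It does not query the filer in any way.

```python
import re
from collections import Counter

ENOBUFS = "No buffer space available"

# Hypothetical labels for the three message shapes seen in the console output.
PATTERNS = [
    ("ping", re.compile(r"^ping: wrote (\S+)")),
    ("dns", re.compile(r"Lookup of \S+ failed with DNS server ([\d.]+)")),
    ("syslogd", re.compile(r"^syslogd: Could not forward message to host ([\d.]+)")),
]


def tally_enobufs(lines):
    """Count ENOBUFS messages, keyed by (service label, target host)."""
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        if ENOBUFS not in line:
            continue  # ignore lines that are not buffer-space errors
        for label, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                counts[(label, match.group(1))] += 1
                break
        else:
            # A buffer-space error in a shape not listed above
            counts[("other", "")] += 1
    return counts


# The five console lines quoted in this post
SAMPLE = [
    "ping: wrote 17.6.6.1 64 chars, error=No buffer space available",
    "na01> Wed Nov 10 23:00:07 PST [irt-na01: Java_Thread:info]: Lookup of time.school.edu failed with DNS server 17.6.7.77: No buffer space available.",
    "syslogd: Could not forward message to host 17.2.65.2: No buffer space available",
    "Nov 10 23:00:07 [na01: Java_Thread:info]: Lookup of time.school.edu failed with DNS server 17.6.7.77: No buffer space available.",
    "syslogd: Could not forward message to host 17.2.65.2: No buffer space available",
]

if __name__ == "__main__":
    for (label, host), n in sorted(tally_enobufs(SAMPLE).items()):
        print(label, host, n)
```

A tally like this only shows that every outbound path (ping, DNS, syslog) hit the same error; it says nothing about which buffer pool was exhausted.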
Questions:
1) Is this network-related buffer space?
2) How can I track the buffer space usage? (Is there an SNMP-exposed metric?)
3) If it's network buffers, is it related to a 10GigE bug? (I thought NetApp had worked those out for the most part.)
Many thanks for any info / experience,
http://vmadmin.info
Fletcher