2010-03-27 05:42 AM
I recently encountered burt 369072 (during intense inode cleaning, the CP coordinator thread fails to yield the processor), which caused a panic on my backup filer. Panic message: 'process on cpu1 hung (coordinator) for 5002 milliseconds! in process coordinator on release NetApp Release 7.3.2P3.'
At the same time as the panic, there was a disk failure. To be able to 'cf giveback' I had to use 'cf giveback -f', which was successful. Since the giveback, I’ve not been able to add a Protection Policy to any of the unprotected datasets. When I attempt to add the backup provisioning policy, my resource pool gets a blue '?' and will not allow the provisioning of the required volumes. I get a message stating:
Reason: Storage system: 'filername'(16689):Active/Active failover: Take Over by partner disabled.
Suggestion: "Storage system: 'filername(16689):Enable Active/Active failover on the partner.
I've verified that CF status is enabled and the partner is up, and I’ve cleared all events in DFM. I'm not sure what else I can do here.
P.S. I can send a screenshot of the exact output to any of the engineering folks who want to have a look; I will not post it here as it contains the FQDN.
2010-03-27 12:16 PM
Your backup provisioning policy is enabled for "Controller Failure Reliability",
i.e. for provisioning only on active-active clusters.
Can you run dfm host discover on both nodes of your active-active pair,
so that dfm can discover that they are in active-active state?
Then close and reopen the NMC and try again.
2010-03-28 06:15 PM
Thanks for the response. I did a 'dfm host discover <filer>' and logged out of NMC as suggested. It would appear that this part worked (Refreshing data from host <filername> (24232) now.). However, I’m faced with the same error when attempting to apply a protection policy to a dataset. I've learned over the last while that DFM is extremely slow at picking up changes. I waited about 30 minutes for the host discover to take effect; should this be sufficient?
2010-03-29 02:17 AM
All dfm monitors have a default monitoring interval.
So changes take time to show up, as monitoring is periodic polling rather than event driven.
NMC also takes time to refresh; for an immediate update, either move to another page to force a refresh or close and reopen NMC.
2010-03-29 08:30 AM
Looking into the DFM Server logs, I see that I’m getting '[dfmserver:ERROR]: Thread 0x102c: cf settings is not 2, instead its 4'.
Speaking with Justin Parisi (NetApp support), we tried a number of different things. Justin mentioned this is similar to burt 382019, where the SNMP traps are not sending the correct info.
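For anyone wanting to check whether this error is still being logged, here is a minimal Python sketch that pulls the message quoted above out of log text. The pattern is built only from that one quoted line, the function name `find_cf_errors` is hypothetical, and the sketch makes no attempt to interpret what the two numbers mean.

```python
import re

# Pattern derived from the single log line quoted in this thread (an assumption
# about the general format; other releases may word it differently).
CF_ERROR = re.compile(
    r"\[dfmserver:ERROR\]: Thread (0x[0-9a-fA-F]+): "
    r"cf settings is not (\d+), instead its (\d+)"
)

def find_cf_errors(log_text):
    """Return (thread, expected, actual) tuples for each matching log line."""
    hits = []
    for line in log_text.splitlines():
        m = CF_ERROR.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return hits

line = "[dfmserver:ERROR]: Thread 0x102c: cf settings is not 2, instead its 4"
print(find_cf_errors(line))
```

Feeding it the contents of the server log after a rediscover shows whether new occurrences are still appearing.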
2010-03-29 08:28 PM
It appears that the root cause of the issue was running out of space. I went below 5% free space on the C:\ drive of the DFM server, causing DFM to stop updating.
I hadn't noticed, but the events I had configured were not triggering emails when thresholds were reached; this was in parallel to the issue of assigning a Protection Policy. Once I realized that I was almost out of space and cleaned up (10-15%), I started receiving alert emails and OM updated with the fact that CF was enabled and up.
During the time that I was running low on disk space, OM stopped gathering data, so I now have a gap in my usage statistics.
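An independent check on the management server's own disk might catch this sooner, since DFM's alerts were not firing while it was short of space. The following is a rough Python sketch, not part of DFM: the 5% floor is simply the level mentioned above, and the function names and the path argument are hypothetical.

```python
import shutil
import sys

def free_percent(path):
    """Return free space on the volume containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def is_low(percent_free, floor=5.0):
    """True when free space is below the floor (5% is an assumed threshold)."""
    return percent_free < floor

if __name__ == "__main__":
    # Pass the DFM install volume as the first argument, e.g. C:\ or /opt/NTAPdfm.
    path = sys.argv[1] if len(sys.argv) > 1 else "."
    pct = free_percent(path)
    status = "LOW" if is_low(pct) else "ok"
    print(f"{path}: {pct:.1f}% free ({status})")
```

Run from a scheduled task or cron job, this reports on the volume without depending on the DFM services being healthy.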
2010-03-31 10:02 AM
The best practice is to set up email alerts for the management station events.
Below is the list of important management station events.
[root@lnx ~]# dfm eventtype list | grep -i free
management-station:enough-free-space Normal dfm.free.space
management-station:filesystem-filesize-limit-reached Error dfm.free.space
management-station:not-enough-free-space Error dfm.free.space
management-station:perf-advisor-enough-free-space Normal dfm.perfAdvisor.free.space
management-station:perf-advisor-not-enough-free-space Error dfm.perfAdvisor.free.space
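To pick out which of these events are worth an alert, the listing can be filtered by its severity column. This is a small Python sketch that assumes each line has three whitespace-separated columns (event name, severity, class), as in the output above; `error_events` is a hypothetical helper name.

```python
def error_events(listing):
    """Return names of events whose severity column is 'Error'.

    Assumes three whitespace-separated columns per line:
    event name, severity, class.
    """
    names = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "Error":
            names.append(parts[0])
    return names

# Sample copied from the listing in this thread.
sample = """\
management-station:enough-free-space Normal dfm.free.space
management-station:filesystem-filesize-limit-reached Error dfm.free.space
management-station:not-enough-free-space Error dfm.free.space
management-station:perf-advisor-enough-free-space Normal dfm.perfAdvisor.free.space
management-station:perf-advisor-not-enough-free-space Error dfm.perfAdvisor.free.space
"""

for name in error_events(sample):
    print(name)
```

For the sample above this prints the three Error-severity events, which are the ones to attach email alerts to.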
Another place to look is the output of dfm about. Output sanitized for posting here.
[root@lnx ~]# dfm about
Version 4.0 (4.0D1)
Serial Number XXXXXXXXXXXX
Administrator Name root
Host Name lnx186-223
Host IP Address XXXXXXXXXXXX
Host Full Name lnx186-223.XXXXXXXXXXXXXXXX
Operations Manager Node limit 999 (currently managing 3)
Provisioning Manager Node Limit 999 (currently managing 0)
Protection Manager Node Limit 999 (currently managing 2)
Operating System Red Hat Enterprise Linux AS release 4 (Nahant Update 4) 2.6.9-42.ELsmp i686
CPU Count 1
System Memory 2026 MB (load excluding cached memory: 58%)
Installation Directory /opt/NTAPdfm
3.94 GB free (31.0%) <<< Here you would find Error instead of free.
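The free-space figure in this output can also be read by a script. Below is a minimal Python sketch that assumes the line looks like the one above ("<number> <unit> free (<percent>%)"); the exact wording of the Error variant is not shown in this thread, so the sketch only reports that no free-space line was found. The name `parse_free` is hypothetical.

```python
import re

# Matches e.g. "3.94 GB free (31.0%)"; the unit list is an assumption.
FREE_LINE = re.compile(r"([\d.]+)\s*(KB|MB|GB|TB)\s+free\s+\(([\d.]+)%\)")

def parse_free(about_output):
    """Return (amount, unit, percent) from the first free-space line, or None."""
    for line in about_output.splitlines():
        m = FREE_LINE.search(line)
        if m:
            return float(m.group(1)), m.group(2), float(m.group(3))
    return None

sample = "Installation Directory /opt/NTAPdfm\n 3.94 GB free (31.0%)"
print(parse_free(sample))
```

A result of None means no free-space line was present, which is the case to investigate by hand.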
The next place is dfm diag | grep -i management, which shows the events that are generated on the management station.
The dfmserver.log also records that monitoring is suspended due to the space issue.