I narrowly averted an outage on one of our most important servers today - a Windows Oracle server. It has several iSCSI LUNs, each in its own volume. Just after midday, DFM started alerting on >90% utilisation of the volume that contains the C:\ drive LUN. The snapshots were huge. I deleted all the snapshots and gave the volume another 50GB to make sure the server didn't stop. Utilisation is now down to ~25%.
But now I have to figure out what caused this.
A regular snapshot and snapmirror update happen at midday, and then the volume utilisation explodes. Here's the graph from DFM: http://imgur.com/2zA1VdP
The volume utilisation at about 2PM:
# df -g |grep por01_vol0
/vol/uraoly_por01_vol0/ 60GB 55GB 4GB 93% /vol/uraoly_por01_vol0/
/vol/uraoly_por01_vol0/.snapshot 0GB 19GB 0GB ---% /vol/uraoly_por01_vol0/.snapshot
And the snapshots:
# snap list uraoly_por01_vol0
%/used %/total date name
---------- ---------- ------------ --------
0% ( 0%) 0% ( 0%) Aug 16 14:37 uraoly-nas03(0118074784)_uraoly_por01_vol0.99312 (snapmirror)
34% (34%) 31% (31%) Aug 16 12:00 hourly.0
34% ( 0%) 31% ( 0%) Aug 16 08:00 hourly.1
34% ( 1%) 31% ( 0%) Aug 16 00:00 nightly.0
34% ( 0%) 32% ( 0%) Aug 15 20:00 hourly.2
35% ( 1%) 32% ( 0%) Aug 15 16:00 hourly.3
35% ( 1%) 32% ( 0%) Aug 15 12:00 hourly.4
35% ( 0%) 32% ( 0%) Aug 15 08:00 hourly.5
35% ( 1%) 33% ( 0%) Aug 15 00:00 nightly.1
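To pull the per-snapshot figures out of that listing, here is a minimal Python sketch. It assumes the parenthesised percentage in each %/used pair is the figure for that snapshot alone and the first number is cumulative from the newest snapshot down, which is how I read the snap list output. `parse_snap_list` and the regex are my own illustration, not a NetApp tool.

```python
import re

# Snapshot rows pasted from the snap list output above (header lines omitted).
SNAP_LIST = """\
0% ( 0%) 0% ( 0%) Aug 16 14:37 uraoly-nas03(0118074784)_uraoly_por01_vol0.99312 (snapmirror)
34% (34%) 31% (31%) Aug 16 12:00 hourly.0
34% ( 0%) 31% ( 0%) Aug 16 08:00 hourly.1
34% ( 1%) 31% ( 0%) Aug 16 00:00 nightly.0
34% ( 0%) 32% ( 0%) Aug 15 20:00 hourly.2
35% ( 1%) 32% ( 0%) Aug 15 16:00 hourly.3
35% ( 1%) 32% ( 0%) Aug 15 12:00 hourly.4
35% ( 0%) 32% ( 0%) Aug 15 08:00 hourly.5
35% ( 1%) 33% ( 0%) Aug 15 00:00 nightly.1
"""

# One row: cumulative %/used, (own %/used), cumulative %/total, (own %/total),
# timestamp, snapshot name.
ROW = re.compile(
    r"^\s*(\d+)%\s*\(\s*(\d+)%\)\s+(\d+)%\s*\(\s*(\d+)%\)\s+"
    r"(\w{3}\s+\d+\s+\d{2}:\d{2})\s+(\S+)"
)


def parse_snap_list(text):
    """Return one dict per snapshot row; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            cum_used, own_used, cum_total, own_total, taken, name = match.groups()
            rows.append({
                "name": name,
                "taken": taken,
                "cum_used_pct": int(cum_used),
                "own_used_pct": int(own_used),
                "cum_total_pct": int(cum_total),
                "own_total_pct": int(own_total),
            })
    return rows


rows = parse_snap_list(SNAP_LIST)
# The snapshot with the largest parenthesised %/used figure.
biggest = max(rows, key=lambda row: row["own_used_pct"])
print(biggest["name"], biggest["taken"], biggest["own_used_pct"])
```

On this listing the largest parenthesised figure belongs to hourly.0, taken at 12:00; every other snapshot shows 0% or 1%.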
Then after I deleted all the snapshots:
# df -g uraoly_por01_vol0
Filesystem total used avail capacity Mounted on
/vol/uraoly_por01_vol0/ 60GB 36GB 23GB 60% /vol/uraoly_por01_vol0/
/vol/uraoly_por01_vol0/.snapshot 0GB 0GB 0GB ---% /vol/uraoly_por01_vol0/.snapshot
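As a sanity check on the two df outputs, a short Python sketch of the arithmetic. It assumes the 0GB total on the .snapshot line means the volume has no snapshot reserve, so snapshot blocks count against the volume's own used space. The dictionary keys are just illustrative names.

```python
# Figures in GB, copied from the two df -g outputs in this post.
BEFORE = {"total_gb": 60, "used_gb": 55, "snapshot_used_gb": 19}
AFTER = {"total_gb": 60, "used_gb": 36, "snapshot_used_gb": 0}


def freed_by_deletion(before, after):
    """Space released in the volume between the two df readings, in GB."""
    return before["used_gb"] - after["used_gb"]


freed_gb = freed_by_deletion(BEFORE, AFTER)
# If the drop in used space equals the old .snapshot usage, the snapshots
# account for the whole difference between the two readings.
print(freed_gb, freed_gb == BEFORE["snapshot_used_gb"])
```

The drop in used space matches the 19GB that the .snapshot line reported before the deletion, so the snapshots account for all of it (to the whole-GB precision df -g gives).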
So, I am trying to work out WHEN whatever activity caused this actually happened. The snapmirror happened at midday, then the volume utilisation exploded.
Am I right in thinking that whatever caused this snapshot growth happened AFTER the midday snapshot, rather than before midday?
There's nothing unusual in the event logs, so I'm going to have to ask the general population whether anyone was making changes on that server at the time, and I want to make sure I have my timeframes right first.
Thanks,
Dean