Solved: Re: Upgrading Ontap to 9.3 with CIFS Auditing enabled may cause a serious issue

Deligatedgeek · ‎2018-04-12

A short while ago we upgraded our NetApps from ONTAP 9.2 to 9.3. The upgrade was done using NetApp's upgrade instructions and was successful, except...

A little over a week later users could no longer save files. I managed to fix the symptom and give users back the ability to save files, but the underlying problem remains.

CIFS Auditing Simple Overview

First create a CIFS/NFS share where the audit logs will be saved, then create an auditing policy that specifies which actions should be logged. The configuration should reference the CIFS/NFS share you created.

Auditing is then enabled on the NetApp. This creates a "hidden" staging volume and an Audit Log Consolidation job.

Finally, set the permissions on the files/folders in a CIFS share that need to be audited.

Now when you perform specific actions on a file (actions that you have listed in the audit policy), data is written to the staging volume. The consolidation job then runs and converts this data into a log file.

This had been working very well. We chose to output XML, which eventually ends up in Elastic and is viewed in Kibana. It worked so well that we started rolling out the permissions across the environment.

Then the upgrade to 9.3

We decided to upgrade ONTAP to 9.3 as there are some nice new features that improve storage capacity efficiency.

We followed the required NetApp upgrade process and it was successful. There were no outages and all services were as they should be. A perfect upgrade.

Then about 10 days later I started receiving emails from OpenNMS about traps it was receiving from the NetApp appliances, quite a deluge of stuff I hadn't seen before. A staging volume was 95% full? Not the kind of emails you want to see on a Friday afternoon! Some investigation showed that a system volume was near full and CIFS access would be affected if it wasn't resolved.

The Staging Volume

One staging volume is created per aggregate. The names are "MDV_aud_" followed by the aggregate UUID. They can only be seen on the command line: the System Manager and OCUM web GUIs will not show them, and OCUM will not generate volume full (near full, etc.) alarms for them.

You can only see issues with them in syslog and in traps sent directly from the NetApp heads. You will get no other warning until you have CIFS issues, probably because staging volumes simply shouldn't fill up.

Further investigation showed that the Log Consolidation Job that creates the actual log files and effectively manages the data on the staging volumes had been deleted by the upgrade process somehow.

Check your staging volumes

To check whether your staging volumes are filling up, run "df -h MDV*" on your NetApp cluster as an account with cluster admin privileges. For example:

toaster::> df -h MDV*
Filesystem                                              total  used   avail capacity Mounted on Vserver
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/          1945MB 1004MB 0MB   52%      ---        toaster
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/.snapshot 102MB  50MB   102MB 49%      ---        toaster

If your capacity is above 80% used, check whether your consolidation job is still running (as below) and call your support if it's missing.

toaster::> job show -name Cons*
There are no entries matching your query.
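
If you want to run this check from a monitoring script rather than by eye, the two outputs above can be parsed with a few lines of Python. This is only a rough sketch: the function names are made up for illustration, it assumes the df output has the same column layout as the sample above, and it assumes a missing job produces exactly the "There are no entries matching your query." message shown. How you collect the output from the cluster is left out.

```python
# Sample text copied from the df output shown in this post.
SAMPLE_DF = """\
Filesystem                                              total  used   avail capacity Mounted on Vserver
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/          1945MB 1004MB 0MB   52%      ---        toaster
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/.snapshot 102MB  50MB   102MB 49%      ---        toaster
"""


def staging_volume_usage(df_text):
    """Return {volume_name: percent_used} for MDV_aud_* rows (hypothetical helper).

    Snapshot rows are skipped. Assumes the capacity column is the fifth
    whitespace-separated field, as in the sample output.
    """
    usage = {}
    for line in df_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or not fields[0].startswith("/vol/MDV_aud_"):
            continue  # header, blank line, or a non-staging volume
        path = fields[0]
        if path.endswith(".snapshot"):
            continue  # ignore the snapshot reserve row
        name = path.strip("/").split("/")[1]
        usage[name] = int(fields[4].rstrip("%"))
    return usage


def volumes_over(df_text, threshold=80):
    """List staging volumes whose used capacity is above the threshold (percent)."""
    usage = staging_volume_usage(df_text)
    return sorted(name for name, pct in usage.items() if pct > threshold)


def consolidation_job_missing(job_show_text):
    """True if 'job show -name Cons*' reported no matching job (assumed message)."""
    return "There are no entries matching your query." in job_show_text


if __name__ == "__main__":
    print(staging_volume_usage(SAMPLE_DF))
    print(volumes_over(SAMPLE_DF))
```

A volume over the threshold together with a missing consolidation job is the combination described in this post, so that is the case worth alerting on.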

The workarounds

In an emergency, there are two workarounds to keep service running: the first is to increase the volume size and the second is to turn off audit guarantee. If you are using auditing for compliance then you only have one option: increase the volume size.

Both options involve diagnostic mode, which should only be used in an emergency. Go back into admin ASAP.

1) Increase staging volume size

When increasing the volume size, remember that the volume lives on the cluster "vserver" (toaster in this example).

toaster::> set diag
toaster::*> volume size -vserver toaster -volume MDV_aud_76a5ebd8a0a41189181cf40336ea04f4 3g
toaster::*> set admin

2) Disable audit guarantee (Will break compliance)

The output below shows the process: check the audit guarantee of vserver svm01, set audit guarantee to false, check it again, and finally leave diag privilege before doing anything else.

toaster::> set diag
toaster::*> vserver audit show -fields audit-guarantee
vserver        audit-guarantee 
-------------- --------------- 
svm01          true

toaster::*> vserver audit modify -vserver svm01 -destination /audit_log -audit-guarantee false
toaster::*> vserver audit show -fields audit-guarantee
vserver        audit-guarantee 
-------------- --------------- 
svm01          false

toaster::*> set admin

Deligatedgeek · ‎2018-04-12

There is now a fix for this, but because of the nature of the fix it would be irresponsible to publish it. Please contact NetApp support.

Bug Ref: 1150270

donlab · ‎2018-07-31

It is not a fix, it's a workaround.

The BURT is not fixed yet.

Deligatedgeek · ‎2018-08-28

If by BURT you mean the reporting engine, I found that one jar file references the old NetApp data directory.

Create a symbolic link from /data to /opt/netapp/data/ and it should work.

bigboas · ‎2019-10-03

I am about to create an audit policy and enable CIFS auditing on a NetApp system that we just upgraded to 9.3P16. I have a few questions:

1. Do the auditing log files have to reside at the root of the SVM or can they reside in a completely different volume that I have configured? If at root, then how much space did you give your root volume to start?

2. Based upon this experience of yours, do you think it would have worked without issue if you had disabled auditing prior to your upgrade? I am curious about this for my own future reference when I go to upgrade again soon and auditing is now enabled.

Thank you!

Lou

donlab · ‎2018-08-28

No, I mean BURT 1150270. There’s no fix yet, only a workaround.
https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1150270