Servers periodically stop writing to CIFS shares after OnTap upgrade to 7.3.2

christien · ‎2010-08-06

Hello all,

I have an issue after upgrading our filers from OnTap 7.2.6 to 7.3.2. We have a 3070 and a 6080 FAS cluster; the 3070 supplies its storage to the servers via FCP, and the 6080 via both FCP and CIFS. While running 7.2.6, everything was working OK on both FAS systems. Since the upgrade, the servers using the CIFS shares on the 6080 just stop writing data to the CIFS share at various points throughout the day, and only a restart of the servers kicks the process back in.

NetApp have been investigating this issue for the past couple of weeks but have not found anything wrong.

Guys looking after the application on the servers say nothing has changed and the only change on the SAN has been the upgrade.

I will attempt to run a cifs terminate and cifs restart the next time this issue occurs, before performing the reboot, to see if this enables writing to the share again. I will then send the details in the log files to NetApp, 2 lots: before and after.

Has anyone else had or seen a similar issue?

Cheers

reinoud7 · ‎2010-08-06

Hi,

I have never seen this before. Some questions that may help to see where the problem is:

Is it one share that has the problem, or all the shares?
Do all the clients have the problem at the same moment? Is there any CIFS activity on the system at that moment?
Can you access that share from your PC at the moment the problem occurs?
Do you use virus scanning on the filer (vscan)?

Depending on the answers, I would play with some options:

no_caching on the cifs share
disable vscan
oplocks off
...

And if this doesn't work, the hard way: a network trace.

A lot of success,

Reinoud

christien · ‎2010-08-06

Hi Reinoud,

Yes, it's a weird one. We carry out upgrades on a test FAS 2020 and this issue did not show there, or the test team were doing it on a Friday afternoon.

We have multiple CIFS shares set up on multiple vfilers, but only one share per vfiler. The error happens across all shares and at different times. We can actually run a "tail follow" command on the application writing the data and see the data suddenly stop. We have had the application developers look into it, but it's the usual situation where they blame the storage and we blame the application.

When the error occurs, the SAN shows all OK; the server can see the share and we can read from and write to it fine. There is no virus scanning on the filer either.
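To pin down exactly when the writes stop, rather than watching a tail by eye, the size of the output file can be sampled and the quiet periods listed. This is only a rough sketch: the function names, the sampling approach and the threshold are assumptions for illustration, not anything the application or the filer provides.

```python
import os
import time


def take_sample(path):
    """Return one (unix_time, size_in_bytes) sample for the file at path."""
    return (time.time(), os.path.getsize(path))


def find_stalls(samples, min_stall):
    """Return (start, end) periods in which the file size did not change.

    samples   -- list of (time_seconds, size_bytes) tuples in time order
    min_stall -- shortest unchanged period, in seconds, worth reporting;
                 it should be larger than the sampling interval
    """
    stalls = []
    if not samples:
        return stalls
    last_change_t, last_size = samples[0]
    last_t = last_change_t
    for t, size in samples[1:]:
        if size != last_size:
            # The size changed again: report the quiet period if it was long enough.
            if t - last_change_t >= min_stall:
                stalls.append((last_change_t, t))
            last_change_t, last_size = t, size
        last_t = t
    # The file may still be stalled when sampling ends.
    if last_t - last_change_t >= min_stall:
        stalls.append((last_change_t, last_t))
    return stalls
```

Calling take_sample() every few seconds and passing the collected list to find_stalls() gives start and end times that can be lined up against a network trace or the filer logs.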

Thanks,

Christien

reinoud7 · ‎2010-08-06

Hi Christien,

If you find the cause, I want to know what it was. It's indeed weird.

What you can do is compare the "options cifs" on your "test" filer (2040) and this one.

Is there no clue in the event viewer of the servers?

Do you use a vif on the filer for your CIFS traffic? If so, you can try to run your traffic on 1 interface (multi mode) or to switch interface (single mode).

No further suggestions at this moment.
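If the "options cifs" output from each filer is saved to a text file, the comparison can be done with a few lines of Python. This is a minimal sketch that assumes each line has the form "option.name value"; that layout and the function names are assumptions, so check them against what your filers actually print.

```python
def parse_options(text):
    """Parse lines assumed to look like '<option.name> <value>' into a dict."""
    options = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        options[name] = value
    return options


def diff_options(first, second):
    """Return {name: (value_in_first, value_in_second)} for options that differ.

    An option missing on one side is reported as None on that side.
    """
    differences = {}
    for name in sorted(set(first) | set(second)):
        if first.get(name) != second.get(name):
            differences[name] = (first.get(name), second.get(name))
    return differences
```

Reading the two saved files, running parse_options() on each and passing the results to diff_options() leaves only the settings that differ between the test filer and the production one.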

Reinoud

christien · ‎2010-11-12

Just thought I'd put an update on this just in case it happens to someone else.

So we called in NetApp, as per the "request" of the application team, as they did not believe us when we stated that the storage was OK. NetApp checked all the CIFS settings and still found nothing wrong. Next, a support call was raised with Microsoft to investigate the CIFS usage on our system; again, no fault found. All the time these support calls have been open, we have put in network probes and have been running network traces to discover why this issue was occurring.

After a lot of headaches, and with the probes being expanded to the wider network, we appear to have found the culprit: a second DC and DNS server...

The original network design has the DC and DNS on a different subnet, which is also specified on the vfilers; this was 7 hops away from the server. Another project installed another DC and DNS server onto the network only 2 hops away; this appears to have been done as a workaround by the TDA and was not known to the "wider" operations team. Even though these boxes are on the same domain, they sit on different subnets. As this new box was only 2 hops away, the election process on the network chose it, and the resulting DNS issues meant that CIFS was trying to authenticate with the wrong DNS server. As the primary DNS server had no knowledge of the "new" DNS server, once all the vfilers had tried to authenticate with the "new" DNS server and failed (12 mins exact), it dropped the CIFS connection. While this was happening, all the CIFS requests were being queued up and then failed.

The quick solution was to tell the primary DNS server what the "new" DNS server was. Adding an A record has meant the CIFS authentication time has gone from an average of 33 seconds to 0.0002 secs, and the "max auth gqueue" has gone from 400 ish to 9 ish; even the 9 is due to the number of vfilers running on our box.

So the issue has been rectified, but we are still investigating what to do with the "new" DNS server, as it's in the wrong location and removing it may impact other solutions.

Just to conclude, the "light bulb moment" came after extending the network trace wider than the server and the SAN onto nearly everything, when we saw DNS authentication issues and a separate subnet every 4 hrs and 12 mins. Wireshark has become a prized tool among us SAN/server techies, who normally left that side to the network team. Yes, it was a server engineer who, after spending weeks looking at traces, found the issue.

We would have liked to interrogate the TDA who designed and implemented this, but he left the company a few weeks after the "new" DNS solution was moved into place.

Hope I don't find another "work around" lurking elsewhere.
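For anyone wanting a quick sanity check of name resolution from a server, timing a few lookups is easy to script. The sketch below is only an illustration: it times the local OS resolver through Python's socket.getaddrinfo, not the filer's own CIFS authentication path, and the function names, the threshold and any host names passed in are assumptions.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo, clock=time.perf_counter):
    """Time one name lookup and return (elapsed_seconds, error).

    error is None on success, otherwise the exception the resolver raised.
    """
    start = clock()
    try:
        resolver(hostname, None)
        error = None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return clock() - start, error


def slow_or_failed(hostnames, threshold=1.0, **kwargs):
    """Return (hostname, elapsed, error) for lookups that fail or exceed threshold seconds."""
    report = []
    for name in hostnames:
        elapsed, error = time_lookup(name, **kwargs)
        if error is not None or elapsed > threshold:
            report.append((name, elapsed, error))
    return report
```

Running slow_or_failed() against the names of the DCs and DNS servers a host is expected to use shows at a glance whether any of them take seconds to resolve or do not resolve at all.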
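A regular interval like that is easy to miss by eye in a long trace. As a rough sketch, assuming the times of the failed requests have already been pulled out of the trace by some other means, the gaps between them can be checked for a repeating period. The function name and the tolerance are assumptions for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def recurring_interval(timestamps, tolerance=timedelta(minutes=1)):
    """Return the typical gap between events if the events look periodic.

    The typical gap is the median gap between consecutive events. If every
    gap is within `tolerance` of it, that gap is returned as a timedelta;
    otherwise None is returned.
    """
    ordered = sorted(timestamps)
    if len(ordered) < 3:
        return None  # fewer than two gaps, nothing to compare
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    typical = median(gaps)
    limit = tolerance.total_seconds()
    if all(abs(gap - typical) <= limit for gap in gaps):
        return timedelta(seconds=typical)
    return None
```

Feeding it the datetime of each failure from a few days of traces would return a timedelta of 4 hours 12 minutes for a pattern like ours, and None for failures that are spread at random.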