Just thought I'd post an update on this in case it happens to someone else.
So we called in NetApp, at the "request" of the application team, as they did not believe us when we stated that the storage was OK. NetApp checked all the CIFS settings and found nothing wrong. Next a support call was raised with Microsoft to investigate the CIFS usage on our system; again, no fault found. All the time these support calls were open we had put in network probes and had been running network traces to discover why this issue was occurring.
After a lot of headaches, and the probes being expanded to the wider network, we appear to have found the culprit: a second DC and DNS server...
The original network design had the DC and DNS server on a different subnet (the one specified on the vfilers), seven hops away from the server. Another project had installed a second DC and DNS server only two hops away; this appears to have been done as a workaround by the TDA and was not known to the wider operations team. Even though these boxes are on the same domain, they sit on different subnets. Because the new box was two hops away, the election process on the network chose it, and the DNS issues meant CIFS was trying to authenticate via the wrong DNS server. As the primary DNS server had no knowledge of the "new" DNS server, each vfiler tried to authenticate via the "new" server, failed (after exactly 12 minutes), and dropped the CIFS connection. While this was happening, all the CIFS requests were being queued up and then failed.
The quick solution was to tell the primary DNS server about the "new" DNS server. Adding an A record has taken CIFS authentication from an average of 33 seconds down to 0.0002 seconds, and the "max auth queue" has gone from around 400 to around 9; even the 9 is down to the number of vfilers running on our box.
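If anyone wants to sanity-check name-resolution latency for themselves, here is a quick sketch (Python, purely illustrative; the DC hostname in the comment is a placeholder, not from our environment) that times a single lookup:

```python
import socket
import time

def resolve_time(hostname):
    """Time one name resolution; a healthy lookup should take milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)  # raises socket.gaierror if resolution fails
    return time.perf_counter() - start

# Substitute your own DC name, e.g. resolve_time("dc01.example.local").
# Anything near our old 33-second average points straight at DNS trouble.
print(f"lookup took {resolve_time('localhost'):.4f}s")
```

Nothing clever, but run on a schedule it would have flagged the problem long before the users did.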
So the issue has been rectified, but we are still investigating what to do with the "new" DNS server, as it's in the wrong location and removing it may impact other solutions.
Just to conclude, the "light bulb moment" came after extending the network trace beyond the server and the SAN to nearly everything, when we saw DNS authentication failures to a separate subnet every 4 hours and 12 minutes. Wireshark has become a prized tool among us SAN/server techies, who normally left that side to the network team. Yes, it was a server engineer who, after spending weeks looking at traces, found the issue.
We would have liked to interrogate the TDA who designed and implemented this, but he left the company a few weeks after the "new" DNS solution was moved into place.
Hope I don't find another "workaround" lurking elsewhere.