ONTAP Hardware
Hi everyone,
I have the strangest behaviour on a V-series appliance.
First, a short introduction to my infrastructure setup is in order.
I have an old SUSE 9 server which is a client for NFS shares on a FAS2020 appliance. This appliance is being retired and the shares are being moved to a new V-series NetApp appliance. I have copied most of the share content onto the new appliance via SMB, using a third-party server.
I then mounted the new NFS shares on the SUSE server, and reading and writing to the mounts works with no issues ... except when I want to copy certain XML files that have been generated by our ERP system.
The strange thing is that I can copy most of the XML files with no issue; only certain XML files hang during the copy. What is worse, the whole mount then hangs too, i.e. the whole ERP/finance system hangs. The server itself remains fully manageable, but disk I/O operations hang. I can recover the system by forcibly unmounting the NFS share, but the copy then either does not finish or finishes with errors.
I blamed this behaviour on a firewall, but the network guys proved me an idiot: the issue persists even when the server and the appliance are directly connected (no firewall in between).
I even did a network sniff on the issue (see attached picture) and it shows that the NetApp is refusing to service the write request, which just ends in retransmissions for eternity ... thus a hang.
The strange thing is ... the whole process works and has worked for a decade now on the old FAS2020 appliance.
Both appliances are being accessed via NFS v3... so no difference there.
I would be grateful for any help at this point!
Oh and I have forgotten to mention this experiment that I performed.
I have tried zipping the XML files and then copying them onto the NFS share ... works with no problem.
Unzipping the files straight to the new destination ... hangs!
I guess the V-series appliance just hates the content of my XML files.
> I guess the V-series appliance just hates the content of my XML files.
Yep, I guess it does. There shouldn't be anything that causes that behaviour. I assume you don't have an FPolicy virus scanner in there either?
It doesn't contain the EICAR string by any chance?
No not really, it is a bog standard XML file containing a financial transaction.
I have visually inspected the "good" XML files and the "bad" XML files ... nothing sticks out.
Naturally I cannot have them uploaded to an online scanner for virus detection, sensitive data and all, but my local Sophos AV does not find anything.
By the way, how do I check whether I actually have an FPolicy virus scanner? And even if I did have one, would it prevent the write on the fly?
Thanks for replying!
OK, I'd try a binary search next: split the file in half with a text editor, then try putting each half on there. If one half fails, split that in half and try again. Repeat until you find out whether there is a string that can magically make ONTAP refuse to write a file, and report back here.
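The halving step can also be done from the shell instead of a text editor. The following is only a minimal sketch: `halve` is a hypothetical helper name, and it assumes `head -c` and `tail -c +N` are available on the client (they are in GNU coreutils).

```shell
#!/bin/sh
# halve FILE: write the first half of FILE to FILE.a and the rest to FILE.b.
# Hypothetical helper for the binary search described above.
halve() {
    file=$1
    size=$(wc -c < "$file")            # total size in bytes
    half=$(( (size + 1) / 2 ))         # first half gets the extra byte if odd
    head -c "$half" "$file" > "$file.a"
    tail -c +"$(( half + 1 ))" "$file" > "$file.b"   # start at byte half+1
}
```

Copy `FILE.a` and `FILE.b` to the NFS mount one at a time, then run `halve` again on whichever piece hangs. If both halves copy cleanly while the whole file still hangs, the trigger is more likely size or transport than a particular string.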
Where was the packet trace taken? Try to get one from ONTAP and see if it is even getting the call to begin with.
The packet trace was taken as a tcpdump from the source server.
Is there a way to take a packet trace on a V-series appliance?
If not, I would have to run a network tap and sniff. That is doable, but I will have to engage my network guys ... after blaming their devices for interfering ... and proving myself wrong.
Any ideas how to perform this?
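For the client half of a matched pair of traces, a capture along these lines may be useful. It is a sketch only: `capture_cmd` is a hypothetical helper that prints the command rather than running it, and the interface `eth0`, the filer name `nas04` and the output path are assumptions, not values confirmed in this thread. The filter is `host` rather than `port 2049` because with NFS over UDP only the first IP fragment of a large request carries the port numbers, so a port filter would drop the remaining fragments.

```shell
#!/bin/sh
# capture_cmd IFACE FILER OUTFILE
# Print (do not run) a tcpdump command that records all traffic to and
# from FILER with full packet contents (-s 0) into OUTFILE.
capture_cmd() {
    printf 'tcpdump -i %s -s 0 -w %s host %s\n' "$1" "$3" "$2"
}

# Hypothetical values; review the printed command, then run it as root
# while reproducing the hanging copy.
capture_cmd eth0 nas04 /tmp/nfs-client.pcap
```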
So, there are two options I can see. One is this is a new bug we haven't found, so you'll need to open a case to get it fixed. The other is to upgrade and see if the issue is resolved. I'd open a case either way so we can identify the specific bug or open a new one. They may ask for debug sktrace logs. Something is broken in ONTAP here.
Please reply with the case number and I can follow up internally once opened. Also please provide both packet traces.
Hi Paul,
I have opened a support case for the issue. The case number is: Case # 2008742244
It has been opened for almost 8 hours now but the status is Unassigned so far.
I can see that it has been passed around by the support people, but I have thus far no feedback from them.
Thanks for looking into this!
Acknowledged. I have followed up and added this thread to the case notes internally.
Please go ahead and upload your traces to https://upload.netapp.com/sg and put in your case #. Ideally both traces were captured at the same time; if they were not, and it is not too much trouble, please recapture them.
Never mind. In the traces you are using UDP. We have a couple of specific bugs: https://mysupport.netapp.com/site/bugs-online/product/ONTAP/BURT/1196031 and https://mysupport.netapp.com/site/bugs-online/product/ONTAP/BURT/1384969
We'd recommend switching to TCP for NFS traffic as UDP NFS has multiple issues and won't likely be fixed until ONTAP 9.9+ (9.9.1 likely).
You can upgrade to a release between the two bugs but that's a narrow window. I would suggest disabling the cluster firewall for now:
::> system services firewall show
::> system services firewall modify -node * -enabled false
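On the Linux client, the switch to TCP is a mount option. As a rough sketch, the function below lists NFS mounts that are currently using UDP, so you know which ones to remount. `nfs_udp_mounts` is a hypothetical helper name; it assumes the usual `/proc/mounts` field layout, and it checks for both the older `udp` spelling and the newer `proto=udp`.

```shell
#!/bin/sh
# nfs_udp_mounts [MOUNTS_FILE]
# Print the mount point of every NFS mount whose options include UDP.
# MOUNTS_FILE defaults to /proc/mounts; its fields are:
#   device  mountpoint  fstype  options  dump  pass
nfs_udp_mounts() {
    awk '$3 == "nfs" && ("," $4 ",") ~ /,(udp|proto=udp),/ { print $2 }' \
        "${1:-/proc/mounts}"
}

# Remount example (assumption: FILER:/EXPORT and the mount point are
# placeholders for your own export and path):
#   umount /mnt/nas04
#   mount -t nfs -o nfsvers=3,tcp,hard FILER:/EXPORT /mnt/nas04
```

Any mount point printed by `nfs_udp_mounts` is a candidate for the remount shown in the comment. Remember to update the matching `/etc/fstab` entry as well, otherwise the mount returns to UDP after a reboot.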
Hi again,
That is a great idea and I just tried it. I regret to say the results are inconclusive.
I split the "bad" file into 3 files and copied them over ... no issues. Then I tried copying the offending file as a whole ... hangs.
Here is the shell output for posterity:
db03:/home/zope/transfer/split # split -b 500 split.xml
db03:/home/zope/transfer/split # ls
. .. split.xml xaa xab xac
db03:/home/zope/transfer/split # cp xaa /mnt/nas04/sharepoint-ns04
db03:/home/zope/transfer/split # cp xab /mnt/nas04/sharepoint-ns04
db03:/home/zope/transfer/split # cp xac /mnt/nas04/sharepoint-ns04
db03:/home/zope/transfer/split # cp split.xml /mnt/nas04/sharepoint-ns04
cp: closing `/mnt/nas04/sharepoint-ns04/split.xml': Input/output error
(The I/O error is an effect of me forcefully unmounting the NFS share; otherwise it just hangs.)
I don't know what to make of the results of this test.
You didn't say if it was ONTAP 9.2+ or 9.1 or older or 7-mode. Search the KB site for "pktt" and "tcpdump" and you'll see the appropriate articles (tcpdump if ONTAP 9.2+).
It is a 9.3 ONTAP.
Thanks for the information!
You're welcome. If it is making it to ONTAP, please take a look because we may have to enable some debugging. That seems odd. I suspect the packet is never making it to ONTAP.
I checked ... I have no fpolicies defined.