Poor NFS synchronous performance on FAS2552

shider · 2015-09-04

I opened support case #2005847149 more than a week ago but haven't received any help to speak of. My reseller has been providing some help, but we have been unable to come to a solution. So here goes:

We are replacing a FAS2040 system with a FAS2552 system. The new system has more disks, uses SSD for cache, and is running on 10GbE; the FAS2040 is running on 1GbE. We primarily use NFS shares to host VMDKs. Our NFS synchronous performance issue was discovered in this configuration but is also easily duplicated on NFSv3 shares. A RHEL server has one NFS share mounted from the FAS2040 (7-Mode) and one mounted from the FAS2552 (cluster mode), with the same mount options, same everything...

FAS2040:

administrator$ time dd if=empty.b1 of=empty.dd oflag=dsync bs=8k
12304+0 records in
12304+0 records out
100794368 bytes (101 MB) copied, 4.76166 s, 21.2 MB/s

real 0m4.806s
user 0m0.000s
sys 0m0.328s

FAS2552:

administrator$ time dd if=empty.b1 of=empty.dd oflag=dsync bs=8k
12304+0 records in
12304+0 records out
100794368 bytes (101 MB) copied, 38.0357 s, 2.6 MB/s

real 0m38.064s
user 0m0.024s
sys 0m0.400s

So the question is: does cluster mode have this much overhead or is there something fundamentally wrong with our FAS2552 or our configuration?

Any help appreciated.
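For readers comparing the two dd runs above: with oflag=dsync each 8k write has to complete before the next one is issued, so the throughput figure is really a per-write latency figure. A minimal sketch of that arithmetic, using only the record count and elapsed times from the dd output (the helper name latency_ms is made up for illustration):

```shell
#!/bin/sh
# Per-write latency implied by a dd run: elapsed seconds / number of records.
# Arguments: record count and elapsed seconds, both copied from the dd output.
latency_ms() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.3f\n", t / n * 1000 }'
}

echo "FAS2040: $(latency_ms 12304 4.76166) ms per 8k write"
echo "FAS2552: $(latency_ms 12304 38.0357) ms per 8k write"
```

That works out to roughly 0.39 ms per write on the FAS2040 against roughly 3.09 ms on the FAS2552, which is the gap the rest of the thread is trying to explain.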

parisi · 2015-09-08

Check out TR-4067:

http://www.netapp.com/us/media/tr-4067.pdf

Specifically, page 15 regarding RPC slots. Try adjusting those to use 128 slots.

Also, be sure you are mounting directly to the data LIF on the node that owns the volume.

Sounds like this isn't an ESX NFS issue but an issue with a RHEL guest mounting from the cluster. Accurate?
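The RPC slot suggestion above can be checked on a Linux client roughly as follows. This is a sketch, not the procedure from TR-4067: it assumes the client uses the standard sunrpc kernel module, whose tcp_max_slot_table_entries parameter is visible under /proc/sys/sunrpc, and the helper name sunrpc_option_line is made up for illustration. It only prints the current value and a candidate modprobe line; it changes nothing.

```shell
#!/bin/sh
# Target slot count from the suggestion above.
TARGET=128
# Standard Linux location for the sunrpc TCP slot table limit
# (assumed to apply to the clients in this thread).
PROC=/proc/sys/sunrpc/tcp_max_slot_table_entries

# Print the modprobe option line that would set the limit to the given value.
sunrpc_option_line() {
    printf 'options sunrpc tcp_max_slot_table_entries=%s\n' "$1"
}

if [ -r "$PROC" ]; then
    echo "current tcp_max_slot_table_entries: $(cat "$PROC")"
else
    echo "sunrpc module not loaded (no $PROC)"
fi
echo "candidate line for /etc/modprobe.d/sunrpc.conf:"
sunrpc_option_line "$TARGET"
```

A module option only takes effect when sunrpc is next loaded, so the shares would need to be remounted (or the client rebooted) before retesting.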

shider · 2015-09-09

Thanks for taking an interest.

The issue (slow synchronous writes) presents itself on RHEL 7.1, 7.0 and an older Ubuntu server. It is also present on our ESX servers with NFS datastores. We don't have the issue from any of those with the older FAS2040 so I assume (dangerous) it isn't client related.

It is present on any of our mount points on any aggregate on any LIF on the new filer. (Yes I triple checked that the LIFs / aggregates / nodes line up.)
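For anyone repeating the LIF check from the client side, a small sketch that lists each NFS mount with the server address it actually uses, so it can be compared against the LIF and node layout on the filer. The function name nfs_mounts is made up for illustration, and the split on ":" assumes IPv4 addresses or hostnames rather than IPv6 literals.

```shell
#!/bin/sh
# Read /proc/mounts-style lines on stdin and print, for each NFS mount:
#   server  export-path  mount-point  mount-options
nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        split($1, src, ":")          # device field is "server:/export"
        print src[1], src[2], $2, $4
    }'
}

if [ -r /proc/mounts ]; then
    nfs_mounts < /proc/mounts
fi
```

The options column also makes it easy to confirm that the two mounts really do use the same settings.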

We do have VSC installed on vCenter with the VAAI plugin installed and have optimized the settings.

Config Advisor did suggest that we turn off flow control, and this has been done on all 10GbE ports. (Made no discernible difference.)

For various reasons we still have MTU set to 1500 for the clients and the filers. Although this may not be optimal, it shouldn't have a roughly 8x impact on performance (from what I've read), and it is the same for the old filer that performs well enough.
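One way to confirm that an MTU of 1500 really holds end to end is a don't-fragment ping sized to fill exactly one packet. A minimal sketch, assuming a Linux client with iputils ping; FILER is a placeholder environment variable for a data LIF address and icmp_payload is a made-up helper name. With FILER unset the script only prints the command it would run.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus the 20-byte IPv4 header minus the 8-byte ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Placeholder: export FILER=<data LIF address> before running the real ping.
FILER=${FILER:-}
size=$(icmp_payload 1500)
echo "ping -M do -c 3 -s $size <filer-lif>"
if [ -n "$FILER" ]; then
    # -M do sets the don't-fragment bit, so an undersized hop shows up as an error.
    ping -M do -c 3 -s "$size" "$FILER"
fi
```

If this ping succeeds against both filers, a mismatched MTU somewhere on the path is unlikely to be the difference between them.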

AutoSupport is turned on, but we do not have OnCommand Unified Manager and Performance Manager set up.

Thanks,

Derek

parisi · 2015-09-09

Right, but the 2040s ran 7-Mode, which has a much different NFS architecture than cDOT. For instance, there is a notion of NFS flow control in cDOT that doesn't exist in 7-Mode.

The RPC slot issue has been known to cause perf issues and could cause a domino effect if another client is eating up all the RPC slots. (bully client)

wlorenzo · 2015-09-08

Sorry to hear that your 2552 is not working as well as it could right now. As stated above, the TR will likely help find what is causing the issue. I know this is presenting on RHEL rather than VMware, but the two are very similar and it is easy to debug from that end as well.

A few things to take a look at:

Do you have Virtual Storage Console (VSC) installed on vCenter? If so have you optimized the hosts and installed the VAAI plugin?

Have you run Config Advisor against the system to make sure there are no issues?

Another thing to check is that you have disabled flow control on the 10GbE interfaces and that the MTU matches on the NetApp/Switch/VMware/Hosts.

Do you have OnCommand Unified Manager and Performance Manager deployed, or AutoSupport turned on?

Regards,

Bill

shider · 2015-09-10

Current thinking is that this is related to http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=896685

This is addressed in 8.3.1. We will be updating tonight to see if it fixes the issue.

Fingers crossed.

parisi · 2015-09-10

The good news is that 8.3.1 is GA today. 🙂

Hope it works out for you!