2015-04-29 12:30 PM
We have a "compute cluster" of about 100 machines that mount a NetApp FAS6280 read-only over NFS. The jobs running on these boxes are analysis/simulation jobs that constantly read data off the NAS.
We recently upgraded all these machines from RHEL 5.7 to RHEL 6.5. We did a "piecemeal" upgrade, usually upgrading five or so machines at a time, every few days. We noticed improved performance on the RHEL 6 boxes. But as the number of RHEL 6 boxes increased, we actually saw performance on the RHEL 5 boxes decrease. By the time we had only a few RHEL 5 boxes left, they were performing so badly as to be effectively worthless.
In parallel with this upgrade process, we observed that the read latency on our FAS6280 skyrocketed. This in turn caused all compute jobs to run slower, as it seemed to move the bottleneck from the client servers' OS to the NetApp. This is somewhat counter-intuitive: each RHEL 6 box performs faster, but the upgrade results in a net performance loss because it creates a bottleneck on our centralized storage.
All indications are that RHEL 6 is much more "aggressive" in how it does NFS reads. Conversely, RHEL 5 was very "polite", to the point that it basically got starved out by the introduction of the 6.5 boxes.
Has anyone seen anything like this? I suspect there are some "deep" or "behind the scenes" changes to the NFS implementation between RHEL 5 and RHEL 6. Or maybe this is due to a change in the TCP stack? Or maybe the scheduler? We've tried a lot of sysctl TCP tunings, various NFS mount options, anything that's obviously different between 5 and 6... But so far we've been unable to find the "smoking gun" that causes the obvious behavior change between the two OS versions.
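
To make the "obviously different between 5 and 6" comparison a bit more systematic, one option is to diff the effective NFS mount options as the kernel reports them in /proc/mounts on a RHEL 5 box and a RHEL 6 box, rather than what was requested in fstab. The following is only a rough sketch in Python: the function names are made up for illustration, and the two sample lines are hypothetical values, not output captured from our machines.

```python
def nfs_mount_options(mounts_text):
    """Return {mount_point: {option: value}} for NFS entries in /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for item in fields[3].split(","):
            # Flag-style options such as "ro" get an empty string as value.
            key, _, value = item.partition("=")
            options[key] = value
        result[fields[1]] = options
    return result


def diff_options(old, new):
    """Return {option: (old_value, new_value)} for options that differ.

    An option missing on one side is reported with None on that side.
    """
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed


# Hypothetical sample lines (assumed values, for illustration only).
RHEL5_SAMPLE = "filer:/vol/data /data nfs ro,vers=3,rsize=32768,wsize=32768,hard,proto=tcp 0 0"
RHEL6_SAMPLE = "filer:/vol/data /data nfs ro,vers=3,rsize=65536,wsize=65536,hard,proto=tcp 0 0"

old = nfs_mount_options(RHEL5_SAMPLE)["/data"]
new = nfs_mount_options(RHEL6_SAMPLE)["/data"]
for name, (before, after) in diff_options(old, new).items():
    print(name, before, "->", after)
```

In practice the text passed to nfs_mount_options() would be the contents of /proc/mounts copied from each host. This only covers mount options; sysctl values would need a separate comparison.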