We have outgrown our 2050 and I have concerns about the sizing of the replacement. We have two aggregates of SAS drives that produce 3000+ IOPS each. The filer drops off the network at 1000 IOPS and the processor is constantly at 100%; cp stats show it is constantly flushing. The network is never above 50 Mbit/s. I have previously achieved 6000+ IOPS using Iometer from several Windows servers with multiple threads. The processor maxes out when certain volumes are being heavily used. When I look at the IOPS on one of the offending volumes there are few reads or writes, but 95% of the IOPS are listed as nfs_other. These volumes also have massive inode counts of 7-9 million, which we are having to increase as time goes on. Does anyone know why these particular volumes are killing our filer, and if we upgrade, how do we know the next filer won't display the same problem? Any comments or suggestions would be appreciated.
From your description, I don't think the issue is your filer, but rather the applications that are accessing those two volumes. The high IOPS and the size of your file systems, combined with the majority of the calls being NFS_other, lead me to believe you have an application, or applications, that are constantly trolling the file system. If you can, enable per-client NFS stats and then run nfsstat -l (that's an ell, not a one). I suspect you will find one or two machines account for the majority of NFS calls. I've seen exactly this kind of situation before, and in my experience it is usually caused by poorly written code in an application or script.
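For reference, on a 7-mode system the sequence would look something like the transcript below. The option name is from memory, so check it against the docs for your ONTAP version before relying on it:

```
filer> options nfs.per_client_stats.enable on
filer> nfsstat -z          # zero the counters so the sample is clean
  ... let the workload run for a while ...
filer> nfsstat -l          # list call counts broken down by client IP
```

Zeroing the counters first matters, otherwise the per-client totals include everything since boot and the current offender can be hidden in the noise.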
NOTE: There is a nice script tool called nfstop in the toolkit on the NOW site. It can help you quite a bit with these kinds of investigations.
I will give this a go, thanks. Later today I will be running Iometer again from Windows 2003 VMs, using NFS mounts. The application does indeed behave the way you described. I was under the impression, however, that an IOP is an IOP, yet these volumes are killing the processor at far less than the maximum IOPS. We cannot change the application's behaviour quickly; it's out of our hands. Will we see the same behaviour if we use iSCSI for these volumes?
I had a one-hour window when users would be unaffected and got 3000 IOPS from the aggregate using Iometer at 50% read / 50% write with Iometer's built-in random profile. The processor hovered around 80%. This leads me to the conclusion that an IOP is not an IOP: how many you get before the filer creaks depends on the characteristics of each operation. The only clue I have is that nfs_other seems to be a NetApp killer. I have read that very small I/Os end up being handled in metadata. Is this the problem? I have a two-hour window tomorrow night to perform more tests. Any ideas?
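One way to see the "an IOP is not an IOP" effect for yourself is a rough sketch like the one below, runnable on any Linux box (the path and the directory/file counts are made up for illustration). It builds a tree of tiny files like the ones on the problem volumes, then walks it touching only metadata. Run against an NFS mount, every stat() in that walk becomes a LOOKUP/GETATTR round-trip, which the filer counts under nfs_other, while almost no file data is actually read:

```shell
# Hypothetical demo: build a tree of many tiny files, then do a
# metadata-only scan of it. Counts are illustrative only.
mkdir -p /tmp/smallfiles
cd /tmp/smallfiles
for d in $(seq 1 100); do
  mkdir -p "dir$d"
  for f in $(seq 1 100); do
    echo x > "dir$d/file$f"      # 100 x 100 = 10,000 two-byte files
  done
done
# The scan below reads no file contents; on an NFS mount it would show
# up on the filer almost entirely as nfs_other (LOOKUP/GETATTR) ops.
find /tmp/smallfiles -type f -exec stat -c '%s' {} + > /dev/null
echo "files created: $(find /tmp/smallfiles -type f | wc -l)"
```

Compare the filer CPU during a scan like this against an Iometer run moving the same number of IOPS as large reads/writes; the per-operation cost is very different even though the IOPS counter looks similar.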
I will provide sysstat results, but will have to do so tomorrow. We are careful to follow NetApp best practices and we do check for aligned volumes in the OSes. The rogue volumes are mounted as direct NFS mounts from physical SLES 10 SP3 machines. I will also post the fstab entries for these volumes. The significant difference with these volumes is that they have huge inode counts, and the data on them consists of many thousands of directories, each containing thousands of small files. My conclusion is based on the differences between these volumes and the directly detrimental impact their relatively low IOPS have on the filer's CPU. The sysstat and other information I post tomorrow should make this a little clearer.
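When comparing fstab entries, it may be worth looking at mount options that reduce attribute traffic for small-file trees. A hypothetical entry is sketched below; the server name, export path and timeout value are made up, not a recommendation for this workload:

```
# /etc/fstab -- illustrative only; actimeo relaxes attribute-cache
# coherency and can show stale file attributes to the client, so test
# carefully before using it in production.
filer:/vol/smallfiles  /mnt/smallfiles  nfs  rw,noatime,actimeo=60  0 0
```

Lengthening the attribute-cache timeout (actimeo) and disabling atime updates (noatime) both cut down on the GETATTR-style calls that land in the nfs_other bucket, at the cost of clients seeing slightly staler metadata.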
I am attaching the sysstat results and NFS client stats for the busiest client IP addresses. 10.xxx.2.5 is a server with one of the volumes in question. The development team were running a "Product Load" at the time these stats were captured.
We now have a loan filer from NetApp that we are trialling prior to the purchase of a new one in the near future. The volumes with high nfs_other IOPS have been temporarily moved to SSDs hosted on separate Linux boxes. Doing this has relieved the pressure on the 2050, and it hasn't dropped off the network since. So, on to the next filer. Our NetApp service provider and a NetApp engineer confirmed that we were maxing out the CPU with small random reads. The 2050 has a single-core 2.4 GHz Celeron processor.