ONTAP Discussions
Hello, I hope I am asking this in the right area. I have a FAS2020 with a single controller, 12 x 300GB 15k SAS drives, NICs aggregated in a vif, and a bunch of servers plus 2 workstations all running Linux, everything on gigabit LAN through a Cisco SG300-28 switch.
The problem is that I am trying to use NFS, namely v4, and I always get 95-100% CPU load during transfers; it gets dramatic on files over 5GB. My client machines (a CentOS 6.4 and an openSUSE 12.3 box used for testing) get stuck and freeze during the copy. With NFS v3 the load is around 60-70%, but that is a no-go because the speed is halved. My filer has a single core, so there is no per-CPU (-M) breakdown for me, and I really think this is plain load. Still, NFS shouldn't stress a filer this much when a single file is copied over from a single station.
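To give an idea of how I am testing: nothing fancy, just a large sequential write onto the mount while watching sysstat on the filer (the file name is only an example):
# on the client, with the export mounted as in the fstab below
dd if=/dev/zero of=/mnt/nfs/StupidNFS/bigfile.bin bs=1M count=8192 conv=fsync
# on the filer console
sysstat 1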
The filer was recently wiped clean, Data ONTAP 7.3.7 installed and configured, and now I am just testing to see why this is happening before moving a few KVM machines onto it. So nothing else is hammering the filer. I have tried many NFS options; for now I have stopped at:
fstab on the client
192.168.1.200:/vol/vol_1  /mnt/nfs/StupidNFS  nfs4  rw,bg,hard,rsize=65536,wsize=65536,timeo=60,actimeo=1,intr,sync,tcp  0 0
options nfs
nfs.acache.persistence.enabled        on
nfs.always.deny.truncate              on
nfs.assist.queue.limit                40
nfs.export.allow_provisional_access   on
nfs.export.auto-update                on
nfs.export.exportfs_comment_on_delete on
nfs.export.harvest.timeout            1800
nfs.export.neg.timeout                3600
nfs.export.pos.timeout                36000
nfs.export.resolve.timeout            6
nfs.hide_snapshot                     off
nfs.ifc.rcv.high                      66340
nfs.ifc.rcv.low                       33170
nfs.ifc.xmt.high                      16
nfs.ifc.xmt.low                       8
nfs.ipv6.enable                       off
nfs.kerberos.enable                   off
nfs.locking.check_domain              on
nfs.max_num_aux_groups                32
nfs.mount_rootonly                    on
nfs.mountd.trace                      off
nfs.netgroup.strict                   off
nfs.notify.carryover                  on
nfs.ntacl_display_permissive_perms    off
nfs.per_client_stats.enable           off
nfs.require_valid_mapped_uid          off
nfs.response.trace                    off
nfs.response.trigger                  60
nfs.rpcsec.ctx.high                   0
nfs.rpcsec.ctx.idle                   360
nfs.tcp.enable                        on
nfs.thin_prov.ejuke                   off
nfs.udp.enable                        on
nfs.udp.xfersize                      32768
nfs.v2.df_2gb_lim                     off
nfs.v2.enable                         on
nfs.v3.enable                         on
nfs.v4.acl.enable                     on
nfs.v4.enable                         on
nfs.v4.id.domain
nfs.v4.read_delegation                on
nfs.v4.write_delegation               on
nfs.webnfs.enable                     off
nfs.webnfs.rootdir                    XXX
nfs.webnfs.rootdir.set                off
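The list above is just the output of "options nfs"; while testing I toggled individual settings with the same command, for example:
options nfs.v4.read_delegation off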
df -h
Filesystem               total       used      avail  capacity  Mounted on
/vol/vol0/              9216MB     1109MB     8106MB       12%  /vol/vol0/
/vol/vol0/.snapshot     1024MB       73MB      950MB        7%  /vol/vol0/.snapshot
/vol/vol_1/              921GB      623MB      920GB        0%  /vol/vol_1/
/vol/vol_1/.snapshot     102GB     8222MB       94GB        8%  /vol/vol_1/.snapshot
/vol/vol_2/              693GB      622MB      692GB        0%  /vol/vol_2/
/vol/vol_2/.snapshot      77GB     8284MB       68GB       11%  /vol/vol_2/.snapshot
sysconfig -r
Aggregate aggr0 (online, raid_dp) (block checksums)
Plex /aggr0/plex0 (online, normal, active, pool0)
RAID group /aggr0/plex0/rg0 (normal)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 0c.00.0 0c 0 0 SA:B 0 SAS 15000 272000/557056000 274845/562884296
parity 0c.00.1 0c 0 1 SA:B 0 SAS 15000 272000/557056000 274845/562884296
data 0c.00.2 0c 0 2 SA:B 0 SAS 15000 272000/557056000 274845/562884296
data 0c.00.3 0c 0 3 SA:B 0 SAS 15000 272000/557056000 274845/562884296
data 0c.00.4 0c 0 4 SA:B 0 SAS 15000 272000/557056000 274845/562884296
data 0c.00.5 0c 0 5 SA:B 0 SAS 15000 272000/557056000 274845/562884296
data 0c.00.6 0c 0 6 SA:B 0 SAS 15000 272000/557056000 274845/562884296
data 0c.00.7 0c 0 7 SA:B 0 SAS 15000 272000/557056000 274845/562884296
data 0c.00.8 0c 0 8 SA:B 0 SAS 15000 272000/557056000 274845/562884296
data 0c.00.9 0c 0 9 SA:B 0 SAS 15000 272000/557056000 274845/562884296
Pool1 spare disks (empty)
Pool0 spare disks
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare 0c.00.10 0c 0 10 SA:B 0 SAS 15000 272000/557056000 274845/562884296 (not zeroed)
spare 0c.00.11 0c 0 11 SA:B 0 SAS 15000 272000/557056000 274845/562884296
aggr status -v
Aggr   State    Status            Options
aggr0  online   raid_dp, aggr     root, diskroot, nosnap=off,
                                  raidtype=raid_dp, raidsize=16,
                                  ignore_inconsistent=off,
                                  snapmirrored=off,
                                  resyncsnaptime=60,
                                  fs_size_fixed=off,
                                  snapshot_autodelete=on,
                                  lost_write_protect=on
       Volumes: vol0, vol_1, vol_2
       Plex /aggr0/plex0: online, normal, active
           RAID group /aggr0/plex0/rg0: normal
nfsstat output (I really don't know how to read this one, any hints?) is attached. See also the 2 files attached for the load. System Manager puts this at roughly 100% CPU load, which is not OK.
The hardware seems fine - no warnings anywhere, all tests passed. I configured NFS, CIFS, SSH... I don't use the filer for anything else, just for testing these days.
What could cause this and what can I do about it? Please tell me what other relevant info I should post to nail this down.
Thank you.
Solved! See The Solution
I think this is the normal maximum write throughput for this platform with IP protocols. FC can get you a bit higher, maybe 90MB/s sequential writes. For sequential reads with IP protocols you get 2x the write throughput; with FC a little more than 2x. This platform is showing its age, and with 1 x 32-bit CPU and 1GB of RAM it just isn't very powerful.
It is an older device, yes. But I am pretty sure that is not the issue: it's not the speed over the LAN and not the writes, which are fine, but the CPU load and the throttling of the clients during a simple copy of one bigger file that give me the headaches. A friend of mine has the same FAS (but with 7200rpm SATA drives... a slower one) and has no problems. I keep suspecting this can be dealt with.
By the way, here is the exportfs output:
/vol/vol_1 -sec=sys,rw,root=192.168.1.50,anon=65535,nosuid
/vol/vol_2 -sec=sys,rw,root=192.168.1.50,anon=65535,nosuid
With NFSv3 I get around 25-30MB/s and lower CPU load (mount reads: nfs (rw,relatime,sync,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=60,retrans=2,sec=sys,mountaddr=192.168.1.200,mountvers=3,mountport=4046,mountproto=tcp,local_lock=none,addr=192.168.1.200)):
> sysstat 1
CPU NFS CIFS HTTP Net kB/s Disk kB/s Tape kB/s Cache
in out read write read write age
56% 715 0 0 24682 676 524 31744 0 0 4s
49% 820 0 0 28237 775 236 39680 0 0 3s
44% 806 0 0 27775 760 259 8682 0 0 4s
57% 743 0 0 25632 703 383 52339 0 0 4s
45% 856 0 0 29511 807 620 24968 0 0 4s
56% 710 0 0 24490 670 288 34494 0 0 4s
What is wrong with NFSv4, where I get at least double the speed but also the high load plus the client problems (mount reads: nfs4 (rw,relatime,vers=4.0,rsize=65536,wsize=65536,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,port=0,timeo=60,retrans=2,sec=sys,clientaddr=192.168.1.50,local_lock=none,addr=192.168.1.200))?
> sysstat 1
CPU NFS CIFS HTTP Net kB/s Disk kB/s Tape kB/s Cache
in out read write read write age
1% 15 0 0 2 2 800 400 0 0 2s
0% 14 0 0 2 1 0 0 0 0 2s
74% 2987 0 0 68615 1689 68 1285 0 0 2s
97% 2697 0 0 61904 1526 2002 76771 0 0 >60
97% 2361 0 0 54236 1338 1112 73964 0 0 >60
98% 2457 0 0 56519 1393 308 67556 0 0 2s
98% 2491 0 0 57144 1410 1271 67860 0 0 2s
98% 2451 0 0 56369 1390 1220 69872 0 0 2s
98% 2358 0 0 54167 1336 1028 76088 0 0 2s
98% 2576 0 0 59152 1458 1125 67430 0 0 2s
Thank you for your time madden!
PS: I zeroed the spares and reallocated everything, the volumes and the aggregate - there was a small problem with aggr0.
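From memory, that was roughly the following (volume names as above; the aggregate pass uses the -A form if I remember right):
disk zero spares
reallocate start -f /vol/vol_1
reallocate start -f /vol/vol_2
reallocate start -A aggr0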
nfsstat -d
Server rpc:
TCP:
calls badcalls nullrecv badlen xdrcall
129782 0 0 0 0
UDP:
calls badcalls nullrecv badlen xdrcall
0 0 0 0 0
IPv4:
calls badcalls nullrecv badlen xdrcall
129782 0 0 0 0
IPv6:
calls badcalls nullrecv badlen xdrcall
0 0 0 0 0
Server nfs:
calls badcalls
129782 0
Server nfs V2: (0 calls)
null getattr setattr root lookup readlink read
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
wrcache write create remove rename link symlink
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0%
Read request stats (version 2)
0-511 512-1023 1K-2047 2K-4095 4K-8191 8K-16383 16K-32767 32K-65535 64K-131071 > 131071
0 0 0 0 0 0 0 0 0 0
Write request stats (version 2)
0-511 512-1023 1K-2047 2K-4095 4K-8191 8K-16383 16K-32767 32K-65535 64K-131071 > 131071
0 0 0 0 0 0 0 0 0 0
Server nfs V3: (0 calls)
null getattr setattr lookup access readlink read
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
write create mkdir symlink mknod remove rmdir
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
rename link readdir readdir+ fsstat fsinfo pathconf
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
commit
0 0%
Read request stats (version 3)
0-511 512-1023 1K-2047 2K-4095 4K-8191 8K-16383 16K-32767 32K-65535 64K-131071 > 131071
0 0 0 0 0 0 0 0 0 0
Write request stats (version 3)
0-511 512-1023 1K-2047 2K-4095 4K-8191 8K-16383 16K-32767 32K-65535 64K-131071 > 131071
0 0 0 0 0 0 0 0 0 0
Server nfs V4: (129782 calls, 388016 ops)
null compound badproc2 access close commit
0 129782 0 0% 23 0% 1 0% 0 0%
create delegpurge delegret getattr getfh link
0 0% 0 0% 0 0% 129283 33% 5 0% 0 0%
lock lockt locku lookup lookupp nverify
0 0% 0 0% 0 0% 8 0% 0 0% 0 0%
open openattr open_confirm open_downgrade putfh putpubfh
1 0% 0 0% 0 0% 0 0% 129130 33% 0 0%
putrootfh read readdir readlink remove rename
162 0% 0 0% 3 0% 0 0% 1 0% 0 0%
renew restorefh savefh secinfo setattr setclntid
329 0% 0 0% 0 0% 0 0% 2 0% 161 0%
setclntid_cfm verify write rlsowner
161 0% 0 0% 128746 33% 0 0%
Read request stats (version 4)
0-511 512-1023 1K-2047 2K-4095 4K-8191 8K-16383 16K-32767 32K-65535 64K-131071 > 131071
0 0 0 0 0 0 0 0 0 0
Write request stats (version 4)
0-511 512-1023 1K-2047 2K-4095 4K-8191 8K-16383 16K-32767 32K-65535 64K-131071 > 131071
0 0 0 1 0 2 1 6 128736 0
Misaligned Read request stats
BIN-0 BIN-1 BIN-2 BIN-3 BIN-4 BIN-5 BIN-6 BIN-7
0 0 0 0 0 0 0 0
Misaligned Write request stats
BIN-0 BIN-1 BIN-2 BIN-3 BIN-4 BIN-5 BIN-6 BIN-7
128745 0 0 0 0 0 0 0
NFS V2 non-blocking request statistics:
null getattr setattr root lookup readlink read
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
wrcache write create remove rename link symlink
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0%
NFS V3 non-blocking request statistics:
null getattr setattr lookup access readlink read
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
write create mkdir symlink mknod remove rmdir
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
rename link readdir readdir+ fsstat fsinfo pathconf
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
NFS reply cache statistics:
TCP:
InProg hits Misses Cache hits False hits
0 129073 0 0
UDP:
In progress Misses Cache hits False hits
0 0 0 0
nfs reply cache size=89600, hash size=1553
flows alloc'd=46, max flows=896
flows used=2, flows free=44
reserve entries=22, nflow LRU=0, grow LRU=0, opinfo releases=0
entry alloc fail=0, reply alloc fail=0, flow alloc fail=0, connection drops=0
Connection drops because of in progress hits:
v3 conn dropped=0
v4 conn dropped, no reconnect=0
num msg=0, too many mbufs=0, rpcErr=0, svrErr=0
no msg queued=0, no msg re-queued(xfc)=0
no msg unqueued=0, no msg discarded=0
no msg dropped=0, no msg unallocated=0
no msg dropped from vol offline=0, no deferred msg processed=0
sbfull queued=0, sbfull unqueued=0, sbfull discarded=0
no mbuf queued=0, no mbuf dropped=0
no mbuf unqueued=0, no mbuf discarded=0
(cumulative) active=0/143 req mbufs=0
tcp no msg dropped=0, no msg unallocated=0
tcp no resets after nfs off=0
tcp input flowcontrol receive=67542, xmit=0
tcp input flowcontrol out, receive=67542, xmit=0
Errors in the blocking export access check = 0
No of RAID errors propagated by WAFL = 0
sockets zapped nfs=0, tcp=0
reply cache entry updated on socket close=0
no delegation=0, read delegation=0, write delegation=0
v4 acls set=0
nfs msgs counts: tot=143, free=64, used=0, VM cb heard=0, VM cb done=0
nfs msgs counts: on assist queue=0, max on assist queue = 1, cut off for assist queue=57
nfs msgs counts: waiting for access resolution=0, cut off for access resolution=57
v4 reply cache opinfo: tot=89957, unallocated=89701, free=56, normal=0, rcache=200
v4 reply cache complex msgs: tot=33733, unallocated=33605, free=28, normal=0, rcache=100
v4 wafl request msgs: tot=133, unallocated=69, free=64, used=0
v1 mount (requested, granted, denied, resolving) = (0, 0, 0, 0)
v1 mount (frozen vol pending, frozen vol exceeded) = (0, 0)
v1 unmount (requested, granted, denied) = (0, 0, 0)
v1 unmount all (requested, granted, denied) = (0, 0, 0)
v2 mount (requested, granted, denied, resolving) = (0, 0, 0, 0)
v2 mount (frozen vol pending, frozen vol exceeded) = (0, 0)
v2 unmount (requested, granted, denied) = (0, 0, 0)
v2 unmount all (requested, granted, denied) = (0, 0, 0)
v3 mount (requested, granted, denied, resolving) = (0, 0, 0, 0)
v3 mount (frozen vol pending, frozen vol exceeded) = (0, 0)
v3 unmount (requested, granted, denied) = (0, 0, 0)
v3 unmount all (requested, granted, denied) = (0, 0, 0)
admin requested rmtab entry flushes = 0
mount service requests (curr, total, max, redriven) = (0, 0, 0, 0)
access cache lookup requests (curr, total, max) = (0, 20, 1)
access cache (hits, partial misses, misses) = (132095, 0, 1)
access cache nodes(found, created) = (132095, 1)
access cache requests (queued, unqueued) = (0, 0)
access cache requests unqueued by (flush, restore) = (0, 0)
access cache read requests (queued, unqueued) = (0, 0)
access cache write requests (queued, unqueued) = (0, 0)
access cache root requests (queued, unqueued) = (0, 0)
access cache expired hits (total, read, write, root) = (0, 0, 0, 0)
access cache inserts (full, partial, dup, subnet, restore) = (0, 0, 0, 0, 1)
access cache refreshes requested (total, read, write, root) = (0, 0, 0, 0)
access cache attribute resolutions requested but not scheduled because we are over the threshold(total, read, write, root) = (0, 0, 0, 0)
access cache refreshes done (total, read, write, root) = (0, 0, 0, 0)
access cache errors (query, insert, no mem) = (0, 0, 0)
access cache nodes (flushed, harvested, harvests failed) = (0, 0, 0)
access cache nodes (allocated, free) = (2000, 1998)
access cache qctx (allocated, free) = (500, 500)
access cache persistence errors (total) = (0)
access cache persistence nodes handled (restored, saved) = (1, 4)
access cache persistence rules deleted (total) = (1)
access cache persistence rules with mismatched schema (total) = (0)
access cache persistence memchunks (allocated, freed) = (6, 6)
assist queue (queued, split mbufs, drop for EAGAIN) = (0, 11144, 0)
NFS re-drive queue(curr, max, total) = (0, 0, 0)
Direct NFS re-drive(memory, webNFS) = (0, 0)
RPCSEC_GSS context limit=0
current context count=0, maximum context count=0
context reclaim callbacks=0, context idle/expired scans=0
vm pressure callbacks=0
contexts created=0, contexts deleted=0
contexts deleted due to vm pressure=0
contexts deleted due to context limit=0
contexts deleted due to idle/expiration=0
requests exceeding timeout=0
Files Causing Misaligned IO's
Later edit by: Horia Negura
Started everything again from scratch. Same issue, but jumbo frames (MTU 9000 on the filer, the switch, and the CentOS client - a rough sketch of the settings follows the stats below) and the priority setting helped by about 10% with the CPU load problem. Now, how can I get rid of the misaligned requests - is this normal with an NFS setup? The reallocate command doesn't help. I keep getting them and they keep increasing; it is obvious because I did a few nfsstat -z runs:
Misaligned Read request stats
BIN-0 BIN-1 BIN-2 BIN-3 BIN-4 BIN-5 BIN-6 BIN-7
1 0 0 0 0 0 0 0
Misaligned Write request stats
BIN-0 BIN-1 BIN-2 BIN-3 BIN-4 BIN-5 BIN-6 BIN-7
349274 0 0 0 0 0 0 0
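For the record, the jumbo frame change was roughly this (the vif and client interface names are just from my setup):
# on the filer, assuming the vif is named vif1
ifconfig vif1 mtusize 9000
# on the CentOS client, assuming eth0 faces the filer
ip link set dev eth0 mtu 9000
# plus jumbo frames enabled on the relevant SG300-28 ports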
What I did (rough CLI equivalents are sketched after this list):
- reinitialised with option 4a (created a 3-disk aggr with one vol0 for the root);
- upgraded the OS to 7.3.7 because of missing files; the following steps were done via System Manager;
- added 7 disks all at once to the aggr;
- resized the default root vol0 down to the recommended minimum of 20GB, as per best practices;
- created 2 flexible volumes in the aggr;
- resized them a bit to fit my needs and disabled thin provisioning on those 2;
- left around 10% free space in the aggr;
- tested again.
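Roughly the CLI equivalents of those System Manager steps (sizes and names from my setup, quoted from memory, so treat this as a sketch):
aggr add aggr0 7
vol size vol0 20g
vol create vol_1 aggr0 921g
vol create vol_2 aggr0 693g
vol options vol_1 guarantee volume
vol options vol_2 guarantee volume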
Is there something wrong with this approach?
The misaligned stats show a count of 4KB (or some multiple of 4KB) IOs and the offset they had into a WAFL block. In your case all IOs are arriving in BIN-0, which means they are aligned. If BIN-1 through BIN-7 were incrementing then you would have unaligned IO.

I think the write performance you observe is normal for this platform. When you do the big file copy in the earlier v4 sysstat you are reaching a CPU bottleneck, and with v3 you were reaching some other bottleneck. Perhaps the sync mount option in your v3 trial caused IOs to be issued serially and throttled the work requested of the filer. In the v4 example, though, CPU resources are exhausted, so any request has to queue for CPU, which drives up response times. Using jumbo frames helped reduce CPU (more data, less metadata, per frame), which frees up CPU for other work. I can't think of anything else you can enable that will reduce CPU. As I mentioned earlier, the v4 write throughput you observed is about what can be expected from this platform. Disk type doesn't matter for this workload because you are hitting a CPU bottleneck. I also mentioned you could get a little more write throughput using FCP; the reason is that with Ethernet protocols the CPU is used for building Ethernet frames and IP packets, whereas with FCP the equivalent work is done in hardware on the Fibre Channel adapter itself.
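Put differently, as I understand it each bin is just the 512-byte offset of an IO's start within a 4KB WAFL block, so you can see which bin a given byte offset would land in with a quick calculation (my own illustration, not a Data ONTAP command):
# bin = (offset mod 4096) / 512
offset=131072; echo $(( (offset % 4096) / 512 ))   # 0 -> BIN-0, aligned
offset=133120; echo $(( (offset % 4096) / 512 ))   # 4 -> BIN-4, unaligned by 2KB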
I really don't think you're missing something. The only suggestion I have is to open a support case and provide them a perfstat for analysis.
Good luck!
Chris
Thank you for clearing this up - I was becoming obsessed with those horror stories about misalignment! I also misunderstood your answer regarding FCP; I thought you were referring solely to the bandwidth gains and forgot about the extra work that TCP/IP throws at the CPU.
This is it then, I will try to overcome my NFSv4 fears.
PS: Do you know exactly what the CPU is? Which Mobile Celeron? Is there any way to find out more about the hardware, the RAM as well, beyond sysconfig -a?
Hi,
If you prefer NFS v3 I think you should be able to get similar performance to v4 with the right mount options. Maybe check the Oracle on NetApp best practices tech report for some Linux + NFS optimization tips: http://www.netapp.com/us/media/tr-3633.pdf
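For example, something along these lines in fstab (untested, just a starting point - the main change versus your current line is dropping sync and the very aggressive timeo/actimeo values):
192.168.1.200:/vol/vol_1  /mnt/nfs/StupidNFS  nfs  rw,bg,hard,proto=tcp,vers=3,rsize=65536,wsize=65536,timeo=600  0 0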
Regarding CPU and memory, I don't think more information is available in any Data ONTAP command output. Maybe if you boot into diags (the menu from Ctrl-C at boot) there is an option to show more details about the components.
Cheers,
Chris