2010-08-08 09:10 AM
Hi, we've aligned all our VMware VMDKs according to the NetApp best practices while tracking the pw.over_limit counter.
Counters that indicate improper alignment (ref: ftp://service.boulder.ibm.com/storage/isv/NS3593-0.pdf):
“There are various ways of determining if you do not have proper alignment. Using perfstat counters, under the wafl_susp section, “wp.partial_writes”, “pw.over_limit”, and “pw.async_read” are indicators of improper alignment. The “wp.partial write” is the block counter of unaligned I/O. If more than a small number of partial writes happen, then IBM® System Storage™ N series with WAFL® (write anywhere file layout) will launch a background read. These are counted in “pw.async_read”; “pw.over_limit” is the block counter of the writes waiting on disk reads.”
Even so, the pw.over_limit counter is still recording a 5-minute average of 14, with 7-10 peaks in the 50-100 range at certain times of the day.
If I look at the clients talking to the NetApp at those times, it's mostly Oracle RAC servers with storage for data and voting disks on NFS.
This leads me to the question: what, if any, are the other possible sources of unaligned IO on NetApp?
All the references I find are about VMware VMDKs, but are there others, like Oracle, which may be doing block IO over NFS?
2010-09-15 03:48 PM
Hi, a follow-up:
We recently modified the Oracle RAC on NetApp NFS workload (disabled a high-IO operation) and immediately noticed a reduction in the pw.over_limit level (13/sec -> 5/sec average) and in its time-of-day pattern.
Are these .nfs files using some kind of block-level IO that is unaligned?
[db-03 ~]$ sudo /sbin/fuser -c /oracrs/vote03/.nfs0000000000b1298500000005
/oracrs/vote03/.nfs0000000000b1298500000005: 5487 13216 13249 14304 14408 14442 14659
[db-03 ~]$ ls -la /oracrs/vote03
drwxrwxrwx 3 oracle oinstall 4096 Jun 20 2008 .
drwxr-xr-x 5 root root 4096 Mar 14 2008 ..
-rw-r----- 1 oracle oinstall 21004288 Mar 17 2008 .nfs0000000000b1298500000005
drwxrwxrwx 3 root root 4096 Sep 14 14:08 .snapshot
-rw-r----- 1 oracle dba 21004288 Sep 14 14:18 irtcrs_dss03.ora
[db-03 ~]$ ps -ef | egrep "5487| 13216 |13249| 14304| 14408 |14442 |14659"
oracle 5487 1 0 Aug11 ? 00:00:00 /export/app/crs/11.1.0/bin/oclskd.bin
oracle 13216 13207 0 Jun02 ? 00:21:38 /export/app/crs/11.1.0/bin/evmd.bin
root 13249 11730 0 Jun02 ? 08:57:02 /export/app/crs/11.1.0/bin/crsd.bin reboot
oracle 14304 13309 0 Jun02 ? 00:00:00 /export/app/crs/11.1.0/bin/diskmon.bin -d -f
oracle 14408 13358 0 Jun02 ? 04:04:41 /export/app/crs/11.1.0/bin/ocssd.bin
oracle 14442 14414 0 Jun02 ? 00:00:00 /export/app/crs/11.1.0/bin/oclsomon.bin
root 14659 1 0 Jun02 ? 00:00:00 /export/app/crs/11.1.0/bin/oclskd.bin
oracle 14715 13216 0 Jun02 ? 00:01:37 /export/app/crs/11.1.0/bin/evmlogger.bin -o /export/app/crs/11.1.0/evm/log/evmlogger.info -l /export/app/crs/11.1.0/evm/log/evmlogger.log
It's hard to find any info on the alignment of Oracle RAC files on NFS.
Anyone with insight to share?
2010-10-27 11:28 PM
We had an Oracle DB outage today and noticed that the partial writes were zero during the outage.
We now have strong evidence that Oracle on NFS is doing some unaligned IO.
Q: What can we do about it?
Open an Oracle case?
2010-10-28 01:08 AM
If you are using NFS directly, NOT storing VMDKs or something similar on it, no unaligned IO is possible, as there are no partitions of any kind involved.
2010-10-28 01:36 PM
It looks like those files are Oracle Clusterware 11g voting disks stored as files on NFS. RAC always does "heartbeats" on those files/disks, in order to avoid instance split-brain problems, etc.
Get output of:
crsctl query css votedisk
mount | grep -i vote
Voting mechanics are internal to RAC, and you can do nothing about them other than bringing those voting disks under iSCSI or FC. Basically, Clusterware might do some read()/write() syscalls on this file, and you cannot control the offset from the start of the file. I would open an Oracle case, but just to confirm that you are supported (something working != something supported), as I'm not sure whether this kind of hack is supported or how 11.2 changed it (if at all).
1) How did you notice that it was filling with zeros?
2) Do you have a statspack/AWR report to confirm your performance findings about "slow storage"? (Of course, in theory this is unrelated to the voting disks, but I guess the vols are on the same aggr?)
3) Slow performance on voting disks would more likely cause serious node evictions from the cluster than long query replies.
4) Please paste the output of e.g. iostat -x -d 5 5 here.
2010-10-28 01:46 PM
We are tracking unaligned IO by the only (or at least the best) method I know: the partial writes over limit counter (pw.over_limit).
We'd like this to be zero for overall NetApp health.
There is no indication that Oracle performance is suffering unduly, but we have been told that any partial writes will affect the performance of the whole system.
The point of starting this thread was to identify other sources of unaligned IO (besides the most common VMware cases).
Oracle seems to be confirmed as one of these sources, although I'd prefer a better method than a global counter to specifically identify the clients.
Ideally there'd be a way to align this IO once identified.
In the VMware case, you shut down your VM and run the NetApp tool mbralign.
Not sure what the Oracle fix would be.
2010-10-28 02:49 PM
First thing: this is pure NFS, so misalignment almost cannot happen (you don't have layers that could introduce it; you are trying to solve a problem that I guess doesn't exist for your users, but it is still fun for both of us to learn).
But OK, let's do a thought experiment: you have an app that requests a read of 100 bytes from some file, from offset 65444 to 65544.
What the OS can do depends on many factors, but let's say you have an NFS read buffer size (mount option rsize) of 8192. The OS is going to get one read() syscall from your app and request the data for those offsets only. Now the request arrives at the NetApp via TCP/UDP, and let's say the NetApp stripes every 64 KB. Because offset 65536 falls inside the requested range, a single 100-byte read via NFS has to read 2 x 64 KB (the range covers 2 stripes) from disk. For me that is not a problem unless it happens all the time (voting disks are mostly idle).
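The thought experiment above can be sketched in a few lines of Python. This is only an illustration under the same assumptions as the post (a 64 KB unit and a 100-byte read starting at offset 65444); the function name units_touched is made up for this example and is not a NetApp or NFS API.

```python
def units_touched(offset, length, unit_size):
    """Return the indices of the fixed-size units that a byte range overlaps."""
    if length <= 0:
        return []
    first = offset // unit_size
    last = (offset + length - 1) // unit_size
    return list(range(first, last + 1))

# The read from the post: 100 bytes starting at offset 65444.
# 64 KB = 65536 bytes, so the range crosses the boundary between unit 0 and unit 1.
stripes = units_touched(65444, 100, 64 * 1024)
print(stripes)  # [0, 1]
```

The same read shifted to start at offset 65536 would touch only unit 1, which is why the offset matters more than the size.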
Message was edited by: jakub.wartak
BTW: please provide the command outputs, so we can confirm that we are really talking about RAC voting disks on NFS files.
2010-10-28 10:13 PM
[root@db-03 ~]# mount | grep -i vote
ora64-vf-01:/vol/ora64net/vote01 on /oracrs/vote01 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=220.127.116.11)
ora64-vf-01:/vol/ora64net/vote02 on /oracrs/vote02 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=18.104.22.168)
ora64-vf-01:/vol/ora64net/vote03 on /oracrs/vote03 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=22.214.171.124)
[root@db-03 ~]# crsctl query css votedisk
0. 0 /oracrs/vote01/crs_dss01.ora
1. 0 /oracrs/vote02/crs_dss02.ora
2. 0 /oracrs/vote03/crs_dss03.ora
Located 3 voting disk(s).
I agree: there would need to be another layer for unaligned IO to arise.
I was thinking maybe these files are handled specially somehow, via block-level IO instead of direct NFS.
2010-10-29 06:58 AM
Data ONTAP is designed to handle a certain amount of unaligned IO (there is likely to be some in most workloads). It is possible that the activity you're observing is small-block IO, also referred to as partial writes. This occurs when the application (Oracle RAC) writes less than 4KB to an existing file over NFS. While a zero in pw.over_limit is desirable, a small number in pw.over_limit may be normal for some workloads. I strongly suggest you open a performance case with NetApp Global Support. They have the resources to sort this out for you.
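As a rough sketch of what "partial write" means here, the Python snippet below counts how many 4KB blocks a write covers only partly. The 4096-byte block size comes from the 4KB figure above; the function name partial_blocks is hypothetical, and this is not how Data ONTAP itself accounts for partial writes.

```python
BLOCK = 4096  # assumed block size in bytes, per the 4KB figure above

def partial_blocks(offset, length, block=BLOCK):
    """Count blocks that a write only partly covers (at most the first and the last)."""
    if length <= 0:
        return 0
    end = offset + length
    first = offset // block
    last = (end - 1) // block
    head_partial = offset % block != 0
    tail_partial = end % block != 0
    if first == last:
        # The whole write sits inside one block.
        return 1 if (head_partial or tail_partial) else 0
    return int(head_partial) + int(tail_partial)

print(partial_blocks(0, 4096))    # 0: exactly one full block
print(partial_blocks(0, 512))     # 1: small write inside one block
print(partial_blocks(512, 4096))  # 2: 4KB write that straddles two blocks
```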
The reason guest filesystem alignment in virtual disks (vmdk, vhd, img, etc) is particularly bad (on any storage system) is that 100% of the workload will be misaligned if the guest filesystem starts misaligned.
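To illustrate the "100% of the workload" point, here is a minimal Python sketch. It assumes the guest and the storage both use 4096-byte blocks and, as a commonly cited example, a guest partition that starts at sector 63 (63 x 512 = 32256 bytes). These numbers and the function name misaligned_fraction are illustrative and do not come from this thread.

```python
def misaligned_fraction(partition_start, block=4096, n_blocks=1000):
    """Fraction of the first n_blocks guest blocks that do not start on a storage block boundary.

    With equal guest and storage block sizes, such a guest block straddles two storage blocks.
    """
    straddling = 0
    for i in range(n_blocks):
        start = partition_start + i * block
        if start % block != 0:
            straddling += 1
    return straddling / n_blocks

print(misaligned_fraction(63 * 512))  # 1.0: every guest block straddles two storage blocks
print(misaligned_fraction(64 * 512))  # 0.0: every guest block is aligned
```

A file written directly over NFS has no such fixed starting offset, so only the individual requests with odd offsets or sizes end up misaligned.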
Hope this helps.
2011-02-25 01:51 AM
Oracle redo log writes are not guaranteed to be aligned; this is discussed in this Oracle whitepaper:
Since all database writes in Oracle are aligned on 4K boundaries (as long as the default block size is at
least 4K), using flash for database tablespaces should never result in slower performance due to
misaligned writes. Redo writes however, are only guaranteed to fall on 512 byte boundaries. Redo
writes also have a quasi-random access pattern when async I/O is employed. These two properties
contribute to performance degradations for some workloads. Data illustrating this is shown in Table
 This has been changed in Oracle 11R2, with the addition of a 'BLOCKSIZE' option to the
'ALTER DATABASE ADD LOGFILE' command. This option, which is (as of October 2010) not
available for Oracle on versions of Solaris, guarantees that redo writes will be a multiple of
BLOCKSIZE, and thus aligned on BLOCKSIZE boundaries. The availability of this option will likely
change the stated conclusions about flash based redo logs.
Perhaps on your Oracle version you can update to make your redo writes aligned.
Also of interest: starting with Data ONTAP 126.96.36.199 (but not yet in the 8.x family), the NFS stats (use the command 'nfsstat -d') have been extended to categorize IO by offset and to show the NFS files with the most misaligned IOs:
Misaligned Read request stats
BIN-0 BIN-1 BIN-2 BIN-3 BIN-4 BIN-5 BIN-6 BIN-7
1474405719 47648 5472 5331 3192 3192 2843 2080
Misaligned Write request stats
BIN-0 BIN-1 BIN-2 BIN-3 BIN-4 BIN-5 BIN-6 BIN-7
302208520 6965622 6541184 6586093 6532810 6558522 6563999 6570036
Files Causing Misaligned IO's
Info on how to interpret the stats is in the man page but it's quite helpful to use this technique to understand (a) how much misaligned IO is occurring and (b) which NFS files receive the most misaligned IOs.
In the above output we can conclude that 87% of our NFS 4k increment writes are aligned and 13% are unaligned, and checking the file list we see that by far the biggest culprits are the redo logs. Now, without looking at pw.over_limit (I don't have wafl_susp output for the snippet above), I can't say whether there'd be much positive effect from reducing these misaligned writes, but in any case you can use the technique above to better understand the workload arriving at the system and where to focus if needed.
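For anyone who wants to reproduce the 87% / 13% split, here is a small Python sketch using the write counts from the output above. It assumes BIN-0 holds the aligned requests and BIN-1 through BIN-7 hold the misaligned ones (check the man page for the exact bin definitions on your release); the function name aligned_share is made up for this example.

```python
# Write counts per bin, copied from the "Misaligned Write request stats" output above.
write_bins = [302208520, 6965622, 6541184, 6586093, 6532810, 6558522, 6563999, 6570036]

def aligned_share(bins):
    """Return (aligned, misaligned) fractions, treating BIN-0 as the aligned bin."""
    total = sum(bins)
    aligned = bins[0] / total
    return aligned, 1.0 - aligned

aligned, misaligned = aligned_share(write_bins)
print(f"aligned: {aligned:.0%}, misaligned: {misaligned:.0%}")
# aligned: 87%, misaligned: 13%
```

Running the same function on the read counts shows that reads are almost entirely aligned, so in this sample the misalignment is a write-side issue.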