Sources of unaligned IO other that Vmware? - pw.over_limit persists

fletch2007 · ‎2010-08-08

Hi, we’ve aligned all our Vmware vmdk’s according to the Netapp best practices while tracking the pw.over_limit counter
see: http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html

Counters that indicate improper alignment ( ref: ftp://service.boulder.ibm.com/storage/isv/NS3593-0.pdf)
“There are various ways of determining if you do not have proper alignment. Using perfstat counters, under the wafl_susp section, “wp.partial_writes“, “pw.over_limit“, and “pw.async_read,“ are indicators of improper alignment. The “wp.partial write“ is the block counter of unaligned I/O. If more than a small number of partial writes happen, then IBM® System StorageTM N series with WAFL® (write anywhere file layout) will launch a background read. These are counted in “pw.async_read“; “pw.over_limit“ is the block counter of the writes waiting on disk reads.”

--

So the pw.over_limit counter is still recording an 5 minute average of 14 with 7-10 peaks in the 50-100 range at certain times of the day.
If I look at the clients talking to the Netapp those times its mostly Oracle RAC servers with storage for data and voting disks on NFS.

This leads me to the question: What if any are the other possible sources for unaligned IO on Netapp?
All references I find are vmware vmdk – but are there others like Oracle which may be doing block IO over NFS?

Many thanks
--
Fletcher Cocquyt
http://vmadmin.info

fletch2007 · ‎2010-09-15

Hi, a followup:

we recently modified the Oracle RAC on Netapp NFS workload (disabled a high IO operation) and immediately noticed a reduction in the pw.over_limit level (13/sec -> 5/sec average) and time of day pattern.

Are these .nfs files using some kind of block level IO that is unaligned?

[db-03 ~]$ sudo /sbin/fuser -c /oracrs/vote03/.nfs0000000000b1298500000005
/oracrs/vote03/.nfs0000000000b1298500000005: 5487 13216 13249 14304 14408 14442 14659
[db-03 ~]$ ls -la /oracrs/vote03
total 41124
drwxrwxrwx 3 oracle oinstall     4096 Jun 20 2008 .
drwxr-xr-x 5 root   root         4096 Mar 14 2008 ..
-rw-r----- 1 oracle oinstall 21004288 Mar 17 2008 .nfs0000000000b1298500000005
drwxrwxrwx 3 root   root         4096 Sep 14 14:08 .snapshot
-rw-r----- 1 oracle dba      21004288 Sep 14 14:18 irtcrs_dss03.ora

[db-03 ~]$ ps -ef | egrep "5487| 13216 |13249| 14304| 14408 |14442 |14659"
oracle    5487     1 0 Aug11 ?        00:00:00 /export/app/crs/11.1.0/bin/oclskd.bin
oracle   13216 13207 0 Jun02 ?        00:21:38 /export/app/crs/11.1.0/bin/evmd.bin
root     13249 11730 0 Jun02 ?        08:57:02 /export/app/crs/11.1.0/bin/crsd.bin reboot
oracle   14304 13309 0 Jun02 ?        00:00:00 /export/app/crs/11.1.0/bin/diskmon.bin -d -f
oracle   14408 13358 0 Jun02 ?        04:04:41 /export/app/crs/11.1.0/bin/ocssd.bin
oracle   14442 14414 0 Jun02 ?        00:00:00 /export/app/crs/11.1.0/bin/oclsomon.bin
root     14659     1 0 Jun02 ?        00:00:00 /export/app/crs/11.1.0/bin/oclskd.bin
oracle   14715 13216 0 Jun02 ?        00:01:37 /export/app/crs/11.1.0/bin/evmlogger.bin -o /export/app/crs/11.1.0/evm/log/evmlogger.info -l /export/app/crs/11.1.0/evm/log/evmlogger.log

Its hard to find any info on the alignment of Oracle RAC files on NFS.

Anyone with insight to share?

thanks

fletch2007 · ‎2010-10-27

We had an Oracle DB outage today and noticed the partial writes were zero during the outage

We now have strong evidence the Oracle on NFS is doing some unaligned IO

Q: What can we do about it?

Open an Oracle case?

thanks

thomas_glodde · ‎2010-10-28

fletch,

if you are using NFS directly, NOT storing vmdks or something on it, there is no unaligned io possible as there are no partitions of any kind involved.

Kind regards

Thomas

jakub_wartak · ‎2010-10-28

It looks those files are Oracle Clusteware 11g voting disks stores as files on NFS, RAC always does "heartbeats" on those files/disks. It is to avoid

instance split brain problems/etc.

Get output of:

crsctl query css votedisk

mount | grep -i vote

Voting mechanics is internal to RAC and you can do nothing about it other than bringing those voting disks under iSCSI or FC. Basically Clusterware might do some read()/write() syscalls to this file and you cannot control the offset from the start of the file. I would open Oracle case but just to confirm that you are supported (something working != something supported) as I'm not sure if this kind of hack is supported and how 11.2 changed it (if at all).

BTW:

1) how you noticed it that it was filling w/ zeros?

2) do you have statspack/AWR report to confirm your performance findings about "slow storage"? (of course in theory this is unrelated to the voting disks, but i guess vols are on the same aggr?)

3) slow performance on voting disks would rather cause serious node evictions from clusters rather long query replies..

4) paste us e.g. iostat -x -d 5 5 here please

fletch2007 · ‎2010-10-28

Hi Jakub,

Please see: http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html

We are tracking unaligned IO by the only method/best method I know (partial writes over limit) counter.

We'd like this to be zero for overall Netapp health.

There is no indication the oracle performance is suffering unduly, but we have been told any partial writes will affect performance of the whole system.

The point of starting this thread was to identify other sources of unaligned IO (besides the most common VMware cases)

Oracle seems to be confirmed as one of these sources - although I'd prefer a more better method than a global counter to specifically identify the clients

Ideally there'd be a way to align this IO once identified.

In the vmware case, you shutdown your vm and run the Netapp tool mbralign.

Not sure what the oracle fix would be?

thanks

jakub_wartak · ‎2010-10-28

First thing, this pure NFS, so misalligment almost cannot happen (you don't have layers that could introduce this unaligment; you are trying to solve problem that I guess doesn't exist for your users - but it is still fun for both of us to learn :)).

But OK, let's do a thinking experiment: you have an app that performs this: requests read of 100 bytes from some file from offset 65444 to 65544.

What OS can do depends on many factors, but let's say you have NFS read size buffer (mount option rsize set to 8192). The OS is going to get one syscall read() request from your app and request the data for those offset only. Now it comes at Neapp via TCP/UDP and let's say that Netapp stripes at every 64k.. so because single 100bytes read via NFS it has to read 2x64kb (offset covers 2 stripes) from disks... for me that is not a problem unless it doesn't happen all the time (voting disks are mostly idle).

Message was edited by: jakub.wartak

BTW: please provide the commands outputs, so we are really talking about RAC voting disks on NFS files.

fletch2007 · ‎2010-10-28

[root@db-03 ~]# mount | grep -i vote
ora64-vf-01:/vol/ora64net/vote01 on /oracrs/vote01 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=17.5.64.166)
ora64-vf-01:/vol/ora64net/vote02 on /oracrs/vote02 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=17.5.64.166)
ora64-vf-01:/vol/ora64net/vote03 on /oracrs/vote03 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=17.5.64.166)

[root@db-03 ~]# crsctl query css votedisk
0.     0    /oracrs/vote01/crs_dss01.ora
1.     0    /oracrs/vote02/crs_dss02.ora
2.     0    /oracrs/vote03/crs_dss03.ora
Located 3 voting disk(s).

I agree - there should need to be another layer for there to arise unaligned IO.

I was thinking maybe these files are handled specially somehow via block level IO instead of direct NFS

thanks

forgette · ‎2010-10-29

Data OnTap is designed to handle a certain number of unaligned IO (there is likely to be some in most workloads). It is possible that the activity you're observing is small block IO, or what is referred to as partial writes. This occurs when the application (Oracle RAC) writes less than 4KB to an existing file over NFS. While a zero in pw.over_limit is desirable, a small number in pw.over_limit may be normal for some workloads. I strongly suggest you open a performance case with NetApp Global Support. They have the resources to sort this out for you.

The reason guest filesystem alignment in virtual disks (vmdk, vhd, img, etc) is particularly bad (on any storage system) is that 100% of the workload will be misaligned if the guest filesystem starts misaligned.

Hope this helps.

madden · ‎2011-02-25

Oracle redo log writes are not guanteed to be aligned and is discussed in this Oracle whitepaper:

Since all database writes in Oracle are aligned on 4K boundaries (as long as the default block size is at
least 4K), using flash for database tablespaces should never result in slower performance due to
misaligned writes. Redo writes however, are only guaranteed to fall on 512 byte boundaries [5]. Redo
writes also have a quasi-random access pattern when async I/O is employed. These two properties
contribute to performance degradations for some workloads. Data illustrating this is shown in Table
6.6.

[5] This has been changed in Oracle 11R2, with the addition of a 'BLOCKSIZE' option to the
'ALTER DATABASE ADD LOGFILE' command. This option, which is (as of October 2010) not
available for Oracle on versions of Solaris, guarantees that redo writes will be a multiple of
BLOCKSIZE, and thus aligned on BLOCKSIZE boundaries. The availability of this option will likely
change the stated conclusions about flash based redo logs.

Perhaps on your Oracle version you can update to make your redo writes aligned.

Also, of interest, starting with Data ONTAP 7.3.5.1 (but not yet in 8.x family yet) the nfs stats (use command 'nfsstat -d') have been extended to categorize IO by offset, and to shows the NFS files with the most misaligned IOs:

Misaligned Read request stats
BIN-0    BIN-1    BIN-2    BIN-3    BIN-4    BIN-5    BIN-6    BIN-7
1474405719 47648    5472     5331     3192     3192     2843     2080
Misaligned Write request stats
BIN-0    BIN-1    BIN-2    BIN-3    BIN-4    BIN-5    BIN-6    BIN-7
302208520 6965622 6541184 6586093 6532810 6558522 6563999 6570036

...

Files Causing Misaligned IO's
[Counter=285899], Filename=DKOOP_E02/DKOOP_E02_NAS1/ooperfbau/ooperfbau_g1_m1.redo
[Counter=323208], Filename=DKOOP_E02/DKOOP_E02_NAS1/oradata2/ooperfbau/ooperfbau_g3_m2.redo
[Counter=257224], Filename=DKOOP_E02/DKOOP_E02_NAS1/oradata2/ooperfbau/ooperfbau_g4_m2.redo
[Counter=319141], Filename=DKOOP_E02/DKOOP_E02_NAS1/ooperfbau/ooperfbau_g3_m1.redo
[Counter=283732], Filename=DKOOP_E02/DKOOP_E02_NAS1/ooperfbau/ooperfbau_g2_m1.redo
[Counter=259950], Filename=DKOOP_E02/DKOOP_E02_NAS1/ooperfbau/ooperfbau_g4_m1.redo
[Counter=280414], Filename=DKOOP_E02/DKOOP_E02_NAS1/oradata2/ooperfbau/ooperfbau_g2_m2.redo
[Counter=288506], Filename=DKOOP_E02/DKOOP_E02_NAS1/oradata2/ooperfbau/ooperfbau_g1_m2.redo
[Counter=605], Filename=DKOOP_E02/DKOOP_E02_NAS1/oraarch/ooperfbau/ooperfbau_1_742556409_0000001756.arch
[Counter=20], Filename=DKOOP_E02/DKOOP_E02_NAS1/oraarch/ooperfbau/ooperfbau_1_742556409_0000001842.arch
[Counter=315170], Filename=DKOOP_E02/DKOOP_E02_NAS1/oradata2/ooperfbau/ooperfbau_g1_m2.redo
[Counter=173], Filename=DKOOP_E02/DKOOP_E02_NAS1/oraarch/ooperfbau/ooperfbau_1_742556409_0000001817.arch
[Counter=304773], Filename=DKOOP_E02/DKOOP_E02_NAS1/ooperfbau/ooperfbau_g2_m1.redo

Info on how to interpret the stats is in the man page but it's quite helpful to use this technique to understand (a) how much misaligned IO is occurring and (b) which NFS files receive the most misaligned IOs.

In the above output we can conclude that 87% of our NFS 4k increment writes are aligned and 13% are unaligned, and checking the file list we see that by far the biggest cuprits are the redo logs. Now without looking at pw.over_limit (I don't have wafl_susp output for this snippet above) I can't say if there'd be much positive affect by reducing these misaligned writes, but in any case you can use the technique above to better understand the workload arriving at the system and where to focus if needed.

Cheers,

Chris

fletch2007 · ‎2011-02-25

Wow, thanks for the very informative reply

We will investigate the options in new Oracle versions to avoid misaligned IO on NFS files

Thanks also for pointing out the new nfsstat options for tracking misaligned IO at the file level!

I've added a comment to point back to this thread:

http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html

madden · ‎2011-02-27

I was talking to some colleagues and they said they've never encountered an environment where the misalignment in a redo logging workload affected system performance. The main reason for this is this workload is sequential and Data ONTAP will collect adjacent writes vs. writing them individually. So if 2k, 4k, 6k was written sequentially to a redo log the system will combine them in memory and flush to disk as two 4k IOs. Now if you have other random misalignment (ex. Oracle data file misalignment, LUN misalignment, etc) than the array will have to do extra work, and in extreme cases you will result in higher write latencies. In those cases the redo log misalignment would be additive to your base alignment problem and it could be beneficial to align the redo log writes if possible, but otherwise there is no benefit to align your redo log IOs.

aborzenkov · ‎2011-02-28

For the sake of correctness ☺

When client writes 2K, DOT must read whole 4K from disk to update partial content. If these 2K are not aligned at WAFL block boundary, DOT must read two whole 4K blocks, which potentially may be spread far away on disk.

Of course, redo logs tend to be relatively small and are written sequentially and so have good chances of being cached anyway; if not, at least UI rate is probably lower than from other sources.

forgette · ‎2011-02-28

Your statement is correct in some cases, but not all.

In the case of sequential writes, the missing pieces of the block we need to fill in are likely on their way. If they aren't (ie: this is the end of this logging activity) there is nothing to read, so Data OnTap simply fills the remainder of the block with zeros. This is why most logging activity, even the unaligned case, wont suffer any performance penalty, nor will it consume additional resources on the controller.

Also, system memory and PAM can reduce the need to read from disk.

irakli_natsvlishvili · ‎2011-09-14

That is great info, thanks!

Couple questions regarding the statisics and most effecient ways to find worst offenders. Could you tell how to corralate data of BIN-X output and the files causing misaligned IO? To use your example:

Misaligned Read request stats
BIN-0    BIN-1    BIN-2    BIN-3    BIN-4    BIN-5    BIN-6    BIN-7
1474405719 47648    5472     5331     3192     3192     2843     2080
Misaligned Write request stats
BIN-0    BIN-1    BIN-2    BIN-3    BIN-4    BIN-5    BIN-6    BIN-7
302208520 6965622 6541184 6586093 6532810 6558522 6563999 6570036

...

Files Causing Misaligned IO's
[Counter=285899], Filename=DKOOP_E02/DKOOP_E02_NAS1/ooperfbau/ooperfbau_g1_m1.redo

Is there a way to point out which misaligned IO from ooperfbau_g1_m1.redo enaded up in of BIN-X? Also, Files Causing Misaligned IO's statistics count all misaligned IOs. Is there a way to find out which is is read and which - write?

Last question - how BIN-X and Files Causing Misaligned IO's output correlate to “wp.partial_writes“, “pw.over_limit“ and “pw.async_read“ statistics?