Network and Storage Protocols

LUN alignment with Solaris ZFS

hdiseguros

Could anyone help me with this issue?

Currently we are suffering from low performance at peak times during business hours.

Searching for a cause, I ran a Perfstat and found in the alignment section that all my Solaris ZFS LUNs show read and write IOPS spread across all buckets, while all my Windows hosts are hitting 80-100% of read/write ops in bucket 4 or 7, even with partition offsets divisible by 4096.

I don't know what to do...

Solaris ZFS example:

lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:read_align_histo.0:12%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:read_align_histo.1:11%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:read_align_histo.2:11%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:read_align_histo.3:9%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:read_align_histo.4:13%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:read_align_histo.5:14%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:read_align_histo.6:11%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:read_align_histo.7:13%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:write_align_histo.0:23%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:write_align_histo.1:11%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:write_align_histo.2:4%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:write_align_histo.3:7%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:write_align_histo.4:2%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:write_align_histo.5:6%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:write_align_histo.6:17%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:write_align_histo.7:4%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:read_partial_blocks:0%
lun:/vol/corretorbancos/bancos-W-DJKoXT6c8S:write_partial_blocks:20%

Windows Host SQL Server example:

lun:/vol/teamworks/db-W-DJKoX2Y8N1:read_align_histo.0:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:read_align_histo.1:100%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:read_align_histo.2:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:read_align_histo.3:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:read_align_histo.4:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:read_align_histo.5:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:read_align_histo.6:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:read_align_histo.7:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:write_align_histo.0:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:write_align_histo.1:99%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:write_align_histo.2:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:write_align_histo.3:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:write_align_histo.4:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:write_align_histo.5:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:write_align_histo.6:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:write_align_histo.7:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:read_partial_blocks:0%
lun:/vol/teamworks/db-W-DJKoX2Y8N1:write_partial_blocks:0%
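For reference, my understanding of these counters (per NetApp's alignment docs): bucket i counts I/Os that start i × 512 bytes past a 4k WAFL boundary, so a healthy LUN shows nearly everything in bucket 0. A minimal sketch of that arithmetic, with an example offset:

```shell
# Map a byte offset to its alignment histogram bucket:
# bucket = (offset mod 4096) / 512; bucket 0 means 4k-aligned.
offset=20480                              # example: start sector 40 * 512 B
bucket=$(( (offset % 4096) / 512 ))
echo "offset $offset falls in bucket $bucket"
```

Read that way, the Windows LUN's 100% in bucket 1 means every I/O starts exactly 512 bytes past a 4k boundary (a constant partition-offset problem), while the ZFS spread across all buckets means the offset varies per I/O.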

Need help

1 ACCEPTED SOLUTION

mitchells

priv set diag; stats show lun


cschnidr

I have a customer with the same problem. The alignment is correct, the starting offset is correct. The problem seems to be that ZFS randomly changes its block sizes as it works, from as small as 512 bytes upward. This issue came into play starting with Solaris 10 Update 7 (Update 3 and 5 show different behaviour). There is no fix for this that I am aware of beyond not using ZFS on Solaris 10 Update 7 and later. You might want to check with a ZFS expert whether this is a bug in ZFS, since it affects the performance of any array. Also, changing the recordsize doesn't seem to help here.

hdiseguros

Thank you, cschnidr. It's not the answer I was hoping for, but it's an answer anyway.

I'm opening a case with Oracle/Sun to check if there is a way to force ZFS to use 4k blocks instead of random 512-byte sectors.

aborzenkov

You could try setting the ZFS recordsize; it is actually recommended if ZFS is used for databases with a fixed block size.

It can be set at runtime but will affect only files created after the value has been changed.
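For a 4k database block that could look like this (the pool/filesystem name `tank/db` is just a placeholder):

```shell
# Match recordsize to the database block size; takes effect only for
# files created after the change (pool/fs name is a placeholder).
zfs set recordsize=4k tank/db
zfs get recordsize tank/db
```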

hdiseguros

We set a 4k recordsize and ran a batch of processes while running Perfstat in a second attempt to fix the problem. It did show improved performance, a bit faster and more responsive, but it didn't fix the alignment problem.

It seems that ZFS treats disks as hard-set 512-byte sectors, regardless of whether it is a real hard drive or a logical LUN, and randomizes its start sector. That would not be a problem on today's hard drives, but with logical LUNs that work with 4k blocks.....

mitchells

There is a TR for ZFS http://media.netapp.com/documents/tr-3603.pdf that has not been updated in 3 years.  If you followed all the steps in the TR, I would try contacting the author of the article to see if they can provide any further assistance.

Thanks,

Mitchell

hdiseguros

Thanks man, will check immediately

mitchells

It might also help to set up your ZIL so that writes have more time to be aligned prior to being flushed.

hdiseguros

Bullseye:

"

There are two ways that you can provision a disk for use with ZFS.

• VTOC - Use the normal VTOC label and provision a single slice for use with ZFS. Create a slice that encompasses the entire disk and provision that single slice to ZFS.

• EFI – Invoke the advanced "format –e" command and label the disk with an EFI label. Create a slice starting from sector 40 so that it is aligned with WAFL® and provision that slice for use with ZFS.

There is no separate LUN type on the storage array for EFI labels or EFI offsets, so use the "Solaris" LUN type and manually align a slice encompassing the entire disk when using EFI labels. The LUN type "Solaris" is also used for the normal VTOC label.

If we provision the entire disk instead of a slice for use with ZFS, then ZFS will format the disk with an EFI label, which will be unaligned with WAFL. For this reason, it is recommended to use whole "aligned" slices (for EFI labels) that encompass the size of the disk for use with ZFS. We will have to manually align the slice and then provision the slice for use with ZFS if we want to use EFI labels. EFI disks add support for disks greater than 1TB. There is no need to "align" the slice if VTOC labels are used; just create a slice that is the entire size of the disk and provision that slice to ZFS."

Thanks a lot Mitchells, you saved the day!
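A quick way to sanity-check the TR's sector numbers: a slice is WAFL-aligned when its start sector is a multiple of 8 (8 × 512 B = 4096 B), so sector 40 passes while the classic EFI default first usable sector of 34 does not. A small sketch of the check (sector values are just illustrations):

```shell
# A start sector is 4k-aligned when (sector * 512) % 4096 == 0,
# i.e. when the sector number is a multiple of 8.
for sector in 34 40; do
  if [ $(( (sector * 512) % 4096 )) -eq 0 ]; then
    echo "sector $sector: aligned"
  else
    echo "sector $sector: misaligned"
  fi
done
```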

mitchells

The solaris_efi LUN type was added to the filer in 7.3.1. You just need to make sure that either you implement the offset manually or the filer implements the offset with the multiprotocol image type.

hdiseguros

I sent an e-mail to Ganesh asking what would be the best or easiest way to properly align ZFS LUNs; he is OOF until tomorrow, so until then tests can be done to be sure.

Now I'm only concerned about my Windows LUNs, which show a different behaviour but are misaligned too.

Do you know if it is possible to gather the read/write histograms with a command, without the need to run Perfstat?

mitchells

priv set diag; stats show lun
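For anyone finding this later: that is run on the storage controller console. Something like the following (the wildcard counter selection at the end is an assumption about your ONTAP version's `stats` syntax):

```shell
# On the storage controller: raise privilege level, then dump the LUN
# counters; the read/write alignment histograms appear in the output.
priv set diag
stats show lun
# If your ONTAP version supports counter wildcards, you can narrow it:
stats show lun:*:read_align_histo
```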

cschnidr

Just for completeness here: yes, alignment really matters!

http://wikis.sun.com/pages/viewpage.action?pageId=186241353

hdiseguros

Thank you man!

We're involving Oracle and Netapp's support to help us with this case.

cschnidr
Some news here:
Somebody filed a bug at opensolaris.org that got closed with the reason "not a bug". However, it seems obvious that ZFS won't perform well on any SAN storage.
Check: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6978878

In addition, NetApp Engineering is working on the topic. The outcome will be posted on the following page. However, it seems that NetApp cannot do anything about it.
Check:
http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=407376

Here is my personal view: ZFS is built for physical disks; don't use it on top of SAN storage. Besides that, our Snap* products don't support it.

aborzenkov

Apparently ZFS has a parameter to force alignment:

http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

If you have the possibility to verify it and report back, that would be great!

brunopad87

Hi there Christoph and aborzenkov.

Thank you very much for your reply. Ricardo and I are co-workers, so we're working together on this issue.

Well, as always Oracle told us that we have to install some patches that have some ZFS fixes.

I'll read the article and see if it works correctly.

Thanks again, guys!

hdiseguros

Hi there,

It worked. We used a modified zpool command on OpenSolaris/OpenIndiana to create the zpool, then exported it and imported it on the Solaris server. We got the modified command from: http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/

Thank you very much guys!
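For reference, on newer ZFS implementations that expose the property directly (e.g. current OpenZFS), the patched-binary step can reportedly be replaced by forcing ashift at pool creation; availability varies by platform, and the pool/device names below are placeholders:

```shell
# Force 4k (2^12) sectors at pool creation, on ZFS builds that accept
# the ashift property; pool and device names are placeholders.
zpool create -o ashift=12 tank c0t0d0
zdb -C tank | grep ashift        # should report ashift: 12
```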

aborzenkov
However, it seems that NetApp cannot do anything about it.

Apparently it can, as long as I can believe publicly visible comments here: http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=554725

Hopefully, it will "just work" ...

And BTW, if it is implemented, it will benefit not only Solaris ZFS. Here is an excellent overview of the SCSI capabilities Linux could use to optimize storage access: http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf

john_giordano

Hi again,

(I posted this same question in a similarly named thread.)

Where to begin?  We are also experiencing this issue/bug/pain/whatever between our Oracle T3-1 server and our NetApp FAS 2020 connected via Fiber Channel.

We have engaged both Oracle and NetApp support, but what is still outstanding is that the two companies can't seem to agree on how to create the zpool.

Right now we have an Oracle 10 DB on the zpool, which we have to destroy and then re-create in order to get proper I/O alignment. As this system runs a critical NMS for our company, we have to be sure about what we are doing here.

The Oracle T3-1 server has been patched per Oracle so that it can communicate with the NetApp using a 4k block size.  Apparently, we have to edit our ssd.conf file, add the line:

ssd-config-list="NETAPP LUN","physical-block-size:4096";

and then reboot the T3-1.

What we are not sure about is the re-creation of the zpool after we copy the data off and then destroy it. Does anyone know for sure how to do this? I have spent hours Googling, reading whitepapers and talking to support people, and it still seems to be a "grey area".

Oracle says we should do it thusly:

zpool create ntapcontroller1lun0 c6t60A9800050336873575A587739787A76d0

Which means it uses the whole disk and no slice. This is fine, as we don't need any other slices on here, and it will also enable ZFS write caching, which sounds good to me. By doing this, the slice starts at sector 256.

His exact quote (and he seemed pretty savvy) was:

"Further ZFS will not use the write cache unless you pass a "whole disk" name

like c6t60A9800050336873575A587739787A76d0 to zpool create (note the

s6 has been removed). When passed a whole disk ZFS will create a slice 0

starting at sector 256 (a multiple of 8 sectors) that will preserve the

4KB zfs io alignment when it reaches the logical unit."
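The engineer's arithmetic checks out: 256 is a multiple of 8, so a slice starting there begins at 256 × 512 = 131072 bytes, an exact multiple of 4096 (a quick check, nothing vendor-specific):

```shell
# Sector 256 starts at an exact 4k boundary: 256 * 512 = 131072 bytes.
echo $(( (256 * 512) % 4096 ))    # prints 0 -> 4k-aligned
```

Note that sector 40 (NetApp's suggestion) and sector 256 (Oracle's) are both multiples of 8, so the two recommendations differ in starting point but not in 4k alignment.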

-------------

NOW, NetApp seems to say that we need to manually create a slice, manually align it so it starts at sector 40 and then do some sort of dance.

I should note the LUN type is Solaris_EFI.

What do you guys think?

Thanks,

jg

Hi,

For anyone out there suffering from this problem of LUN misalignment between their NetApp and Solaris ZFS, there is a write-up I did on how we fixed it here:

http://samurai-sysadmin.com/?p=22

Hope that helps.

Cheers,

jg
