Legacy Product Discussions
I am very unhappy with my $35,000 FAS2020 purchase. Unfortunately, I bought two of them. They are populated with 12 1TB Seagate ES.2 7200 RPM drives.
I am only getting 30MB/s throughput from it, and everyone in my company is complaining about its performance. And NetApp tech support says it's normal!!!??
Even my single-disk Seagate Barracuda 7200.12 (100MB/s read) is faster than it.
Man!!!
Hi,
The Windows DSM supports ALUA, by the way... and it is part of Win2008. We use it with FCP and it seems to do things correctly with MPIO.
If you don't think you need aggregate snapshots, just set the schedule to 0 0 0 ... 'snap sched -A <aggr_name> 0 0 0', then you can reduce the snap reserve down to basically nothing... 'snap reserve -A <aggr_name> 0'. The process is basically the same for volumes. You can also use the option "nosnap on" on volumes. If you use that with aggregates, it just balloons the reserve it needs and you basically lose space.
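For reference, a minimal sequence of those commands on a 7-mode filer, assuming an aggregate named aggr0 and a volume named vol1 (both placeholders):
snap sched -A aggr0 0 0 0
snap reserve -A aggr0 0
snap sched vol1 0 0 0
snap reserve vol1 0
vol options vol1 nosnap on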
Beating the system on raid group restrictions is a bit of a hack (and whoever uses this does so at their own peril), but there's an option for this: 'options raid.raid4.raidsize.override on'. Then you will be able to do an 'aggr options <aggr_name> raid4'... you'll get a "disk missing" error... ignore it, zero the newly freed parity disk, add it to your aggregate. Do a full reallocate on your volumes/luns.
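Spelled out, the sequence I mean looks roughly like this (7-mode syntax from memory, so verify on your release; aggr0 and vol1 are placeholders, and again, at your own peril):
options raid.raid4.raidsize.override on
aggr options aggr0 raidtype raid4
disk zero spares
aggr add aggr0 1
reallocate on
reallocate start -f /vol/vol1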
Mis-aligned disks refers to filesystem alignment between your upper-layer NTFS and the underlying WAFL file system for block access. If their block boundaries aren't aligned, then you cause extra I/O on the WAFL filesystem. There are a number of TRs on file system alignment. Support should have seen any problems there from the perfstat output. You can check yourself too by looking at LUN stats in the perfstat output. It should look something like this:
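(Illustrative values only, with /vol/vol1/lun1 as a placeholder path; in the perfstat these show up as 'stats show lun' counters.)
lun:/vol/vol1/lun1:read_align_histo.0:97%
lun:/vol/vol1/lun1:read_align_histo.6:1%
lun:/vol/vol1/lun1:write_align_histo.0:98%
lun:/vol/vol1/lun1:write_align_histo.6:0%
On a healthy LUN, buckets 1 through 7 should all be at or near zero.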
Having your "read_align_histo" and your "write_align_histo" in bucket "0" is a good thing. If most of your writes are in other buckets or a majority in one of the other buckets, then you are causing more blocks to be read in the WAFL file system than necessary and causing artificially high I/O load.
If your problems come from disk I/O, then you probably won't improve the situation any other way than adding more disks, until you are CPU bound. Do try to upgrade to 7.3.5.1.
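A quick way to see whether you are disk-bound or CPU-bound is the standard sysstat command; watch the CPU and "Disk util" columns while your copy or benchmark is running:
sysstat -x 1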
Good Luck.
It looks to me like it is confirmed "mis-aligned" for sure!
All reads and writes are in bucket 6.
What do you think?
Yeah, your LUN is unaligned.
If it's aligned, it should show most of the I/O at _align_histo.0.
Delete the LUN and recreate it using SnapDrive.
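If you can't use SnapDrive for some reason, a rough sketch of the recreate from the filer console (7-mode syntax; volume, path, size and igroup names are placeholders, and pick the ostype that matches your host and partition scheme, which is discussed further down the thread):
lun create -s 5t -t windows_2008 /vol/bckvol/backup_lun
lun map /vol/bckvol/backup_lun backup_srv_igroup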
Not meaning to be rude, but there are some issues with what you're saying.
The fact that your LUN volumes are filling and disconnecting the LUNs indicates to me that your arrays aren't being effectively managed.
You claim that your other array has 5TB usable with 12x 500GB drives. That's not physically possible; your disks are actually 413GB. Multiply that out by 12 (not accounting for RAID or formatting) and you've got about 4.9TB.
If you want some help, post up your perfstat and I can analyse it for you.
But to be honest, for a single-stream copy over CIFS using SMB1, 30MB/s is about right. That's not a filer limitation, that's an SMB1 limitation.
NetApp arrays are smart, but you have to manage them. Keep an eye on your space usage, snaps etc. I've been using NetApp arrays for over 8 years, and the only time I've ever seen a snapmirror fail is if it runs out of space or someone starts messing around with the snaps.
Hi there,
Actually, the NetApp filer crashed. After I brought it back online, I see it's not full; it still has 1TB left. But I assume it might have cleaned itself up? So I tried again to manually fill up the filer; it goes offline when it's filled and hangs my Windows 2008 server.
For my other array, yes, I get ~4.92TB, so I round it up to 5TB.
BTW, I get more than 413GB; I get like 454GB on each disk.
Here's the link to download my perfstat,
http://www.megaupload.com/?d=HI1QDMF7
TIA
To me, so far, what you're describing sounds a lot like config problems and a lack of maintenance.
Give me 30 minutes and I'll pull your perfstat apart.
So, this other array you're comparing is 12 disks in what would have to be a RAID 0 stripe to give you 5TB usable; it's not really comparing apples to apples, or any other fruit.
What you're describing sounds a little like you have snaps turned on at the volume level (which is not configured right). NetApp are very data-protection centric; they drop the LUN to protect the data. Partially committed writes corrupt datasets, so the better option is to drop the LUN to protect the data; if it's full, there's nowhere to write. It also highlights the need to keep an eye on your storage. You shouldn't be letting it fill up. That's what monitoring is for: to alert you to this situation before it becomes an issue.
This is the spec of my other array
RAID Level: 5
Drives Used for Data: 11 (-1%)
Array Size: 5444 GB (5,446,216 MB)
Drives Used for Redundancy: 1
Chunk Size: 256 KB --> Stripe Size: 2816 KB
Initialized: Yes
So... no, it's not a RAID 0 stripe. It's RAID 5 with 1 hot spare, exactly what I hoped my FAS2020 would be.
Actually I do get 5070.79GB; just to prove it, I have attached a screenshot of my server host. That's a full 5TB. I get 4.95TB after formatting to NTFS, mounted via GPT.
10 data, 1 parity, 1 hot spare: best I/O throughput, maximum utilization of space, it has never failed in the last 6 years, and no monitoring needed (I am the only guy in my company managing the network, backups, VMware, Citrix farm, filers, 100 servers and 50 support nodes, so basically I can only afford to spend less than 5 mins monitoring each device).
It's the best purchase I ever made. I would highly recommend it to anyone, but I shouldn't be bragging too much about my array here at NetApp.com.
I know NetApp sells extremely protection-centric storage, sacrificing speed and storage space. I don't blame NetApp; I just blame the salesperson who sold me something that I didn't ask for in the first place. But since there's no refund, I will just have to do it the hard way and customize my NetApp into something that will meet our requirements.
It seems very weird
The issue with disk sizes is that the marketing figure is based on base-10 math. In reality, like all computer components, it's base-2 math, which means 500GB in base-10 math is actually about 465.7GB in base-2 math (i.e. real terms).
I'd be interested in finding out where they are finding these magic extra gigabytes; might have to do some reading.
If you want a dumb, cheap, set-and-forget storage array, you shouldn't have purchased a NetApp. NetApp are quite bad at cheap and stupid.
NetApp, like any enterprise-class solution, needs to be looked after; that's only if you want to do snapshots, of course. If you don't want them, then turn them off, set your volumes to max size, and off you go.
But IMO that completely removes the value of the NetApp offering.
To be honest, I think the claim that it's protection at the expense of performance is a bit off; for instance, the arrays I work on quite frequently break 3GB/s reads via NFS.
Is there an overhead for snapshots and data protection? Of course there is. The easiest way to explain this would be to put your two arrays side by side, pull the first two disks out of the trays, and then run a performance test on them.
Which one would win?
If you're going to attempt to configure a NetApp, I really recommend getting some training; they are quite simple boxes, but only after you understand them.
I'd be interested in finding out where they are finding these magic extra gigabytes
It does not appear that this array has any spare (using the industry-standard definition of "spare", at least).
Drives Used for Data: 11 (-1%)
Drives Used for Redundancy: 1
So, assuming this is really RAID 5, the array has 12 disks with the effective number of data disks being 11. The rest is due to differences in the meaning of "GB" as used by various vendors.
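Back-of-the-envelope, the posted figures actually reconcile once you keep decimal and binary units straight (assuming, as usual, that the RAID controller reports decimal GB and Windows reports binary units):
11 data disks x 500 GB (decimal)        = about 5,500,000,000,000 bytes of raw data capacity
Controller-reported 5,444 GB            = about 5,444,000,000,000 bytes usable
5,444,000,000,000 / 1,073,741,824       = about 5,070 GiB  -> the "5070.79 GB" in the Windows screenshot
5,444,000,000,000 / 1,099,511,627,776   = about 4.95 TiB   -> the "4.95 TB" shown after NTFS formatting
So there are no magic extra gigabytes; it's the same capacity expressed in different units.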
1. Your system isn't being taxed at all; CPU when you're loading it is about 5%, disk about 15%.
2. You don't need 2 spares on a system that small.
3. What mount options were you using?
4. NFS response times are all very good.
5. By the look of the data, you're doing 4k-8k reads.
6. Given how low the utilisation is on your NICs, the number of discards and drops you're showing is somewhat concerning.
All in all, it looks like the filer is doing what you're asking it to do; I'd try asking it to do more. And I'd probably take a look at the network infrastructure.
Single-host, consumer-grade PC disk benchmarking tools really don't work for testing arrays.
Hi there,
Here's replies to your points
1. Your system isn't being taxed at all; CPU when you're loading it is about 5%, disk about 15%.
Yes, that's what the tech told me.
2. You don't need 2 spares on a system that small.
I don't think I need 2 spares; I asked the tech to remove one and add it to the aggr. He said the filer needs two spares for the two parity disks to fail over to, otherwise the filer will fail when there's a failed disk. I can't argue with him, he's the NetApp expert.
3. What mount options were you using?
I am using a GPT-mounted LUN via FCP to the backup server (5TB), and NFS to my VMware server (0.5TB).
4. NFS response times are all very good.
Good to know.
5. By the look of the data, you're doing 4k-8k reads.
Maybe because it's sitting idle.
6. Given how low the utilisation is on your NICs, the number of discards and drops you're showing is somewhat concerning.
Maybe switching flow control to full will help? Or enabling jumbo frames?
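If you want to try that from the filer side, the 7-mode commands look like this (e0a is a placeholder interface name; put the same lines in /etc/rc so they survive a reboot, and only enable jumbo frames if every device in the path is configured for them):
ifconfig e0a flowcontrol full
ifconfig e0a mtusize 9000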
2. You don't need 2 spares on a system that small.
I don't think I need 2 spares; I asked the tech to remove one and add it to the aggr. He said the filer needs two spares for the two parity disks to fail over to, otherwise the filer will fail when there's a failed disk. I can't argue with him, he's the NetApp expert.
There was probably some misunderstanding.
The number of spares has no relation to the number of parity disks.
Two spares are recommended because that is the minimum number for the disk maintenance center to work. It allows Data ONTAP to proactively replace a suspected disk with a spare and perform a series of tests on the suspected disk; depending on the outcome, the disk could ultimately be failed or returned to operation. This feature decreases reconstruction times (a disk copy is performed much faster) and increases data protection (the disk copy is performed while the raid group is still intact).
Now, if your raid group is in degraded mode (meaning the number of failed disks is 1 for RAID4 or 2 for RAID_DP) and no spare is available, by default the array will shut down in 24 hours to protect against possible data loss due to double disk failures. Again, for RAID_DP one spare is enough, as the raid group is not put in degraded mode with one failed disk. So depending on your support agreement, you may want to have more than one spare if spare delivery takes a long time.
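To see how this is set on your box, a couple of read-only checks (7-mode):
options raid.timeout     (hours the filer will run degraded before shutting down; default is 24)
aggr status -s           (lists available spare disks)
aggr status -r           (shows raid groups, parity disks, and any reconstruction in progress)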
The Storage Management Guide contains a pretty good description of how NetApp handles disk failures.
You can run RAID_DP without spares. Not that I really recommend it...
1. What the tech told you is correct. The system is responding to requests in a timely fashion; NFS response time is like 3ms.
2. You don't need 2 spares unless you want to use the maintenance garage feature, and to be honest you can't afford the disk for it. That last disk can't be added from the GUI, but it can be from the command line (see the sketch at the end of this post).
3. The perfstat you sent me doesn't seem to have any FCP traffic on it.
fcp:fcp:fcp_ops:0/s
fcp:fcp:fcp_latency:0ms
fcp:fcp:fcp_write_data:0b/s
fcp:fcp:fcp_read_data:0b/s
Hard to say; resetting the stats and running it again might show whether that is historical or not. Jumbos I doubt will help.
You'd have to send me a perfstat while the FCP-connected LUN is being tested.
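Re point 2, this is the sort of command-line add I mean (7-mode; the disk name is a placeholder, take the real one from the spare list, and remember that once a disk is added to an aggregate you can't remove it without destroying the aggregate):
aggr status -s
aggr add aggr0 -d 0c.00.11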
Hi Shane,
I have uploaded a perfstat again, taken while IOMeter was running (it's attached to one of the replies to shaun.jurr). So that should be able to show some FCP traffic. Would you be able to see if you can tell whether my filer is mis-aligned?
Thanks in Advance
When using Windows 2008 and any version of DOT 7.2.5 and above, the LUN type 'windows_2008' is available.
The windows_2008 LUN type supports the new geometry that Microsoft uses in Windows 2008 Server.
Default settings can be used when creating partitions, either through the CLI or the MMC.
The LUN will be aligned to a 4K boundary and will not result in performance degradation.
Note: Windows 2008 Server defaults for partition starting offset alignment are:
LUNs less than 4GB in size will align to 64KB
LUNs larger than 4GB will align to 1MB
Hi Ivissupport,
Here's the whole thing
I believe DOT 7.2.3 (my version) also has the option of LUN type "windows_2008".
I went for GPT because I still have a few Windows 2003 servers in our environment that might need to FCP to the filer and access it as a LUN. (The filer is connected to a fibre switch.)
However, if great performance can be achieved by switching to the windows_2008 type, I will consider it. I assume it means the current LUN has to be destroyed?
First
From a quick review of your Ethernet stats, it seems that bus overruns are more than 5% of the total number of frames in an interval, and this is a performance issue (open a new technical case and report this error).
Also, you have flow control set to "none".
What type/vendor/model of ethernet switch do you have?
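You can watch those interface counters directly on the filer too (standard 7-mode commands; -z resets them so you get a clean interval, and e0a is a placeholder interface):
ifstat -z e0a
ifstat e0a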
Second
Go to msinfo32.exe on your Windows machine:
In Windows, click Start > Run, then enter msinfo32.exe to run System Information.
Select Components > Storage > Disks and find your LUN.
Scroll to the bottom and you will see the Partition Starting Offset information.
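If you prefer the command line, the same information is available from the standard Windows wmic tool (run from a command prompt on the server):
wmic partition get Name, StartingOffset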
What is this number?
I think there is also an alignment issue, with the wrong LUN type selected.
The LUN type must be configured as windows_2008.
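To check what type the existing LUN was created with, look at the "Multiprotocol Type" line in the verbose LUN listing (7-mode; the path is a placeholder):
lun show -v /vol/bckvol/backup_lun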
Hi Ivis,
Here's the info
Scroll to the bottom and you will see the Partition Starting Offset information.
What is this number?
Partition Starting Offset 135,266,304 bytes
135MB??
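For what it's worth, here is how that number breaks down (just arithmetic on the value posted; the last point about LUN type is my understanding of how the older "windows" ostypes shift the geometry, so confirm it with support):
135,266,304 / 1,048,576 = 129 MiB exactly (it shows as ~135 "MB" only because 135,266,304 / 1,000,000 = 135.3)
135,266,304 / 4,096     = 33,024 with no remainder, so the partition itself starts on a 4K boundary on the host side
Whether that also lands on a WAFL 4K boundary depends on the LUN's ostype, which is why the read/write_align_histo counters, not the host offset alone, are the real test.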
Also, you have flow control set to "none".
Do you mean on my W2008S server box or the NetApp? On my NetApp I see this...
What type/vendor/model of ethernet switch do you have?
We have an HP J4904A ProCurve Switch 2848 for our gigabit Ethernet network.
We have a QLogic SANbox 5802V to connect our W2008S R2 server box, our two FAS2020s, and two HP tape drives.
We have two QLogic QLE2460-CK 4Gb single-port FC HBAs (PCIe, LC, MMF optic) on our W2008S R2 server box.
How to troubleshoot network connectivity with the Gigabit Ethernet Controller II
Confirm that the Gigabit Ethernet Controller II card is set up and configured properly.
NOTE: Manual and auto negotiation MUST be set on the attached network device (switch or peer network card), not on the filer's Gigabit card. The Gigabit Ethernet Controller II is "self-configuring" when negotiating speed and duplex. To modify network device negotiation, please refer to the network device manual or contact your network administrator.
Since this is a production system, open a new technical case, or re-open the old case, and provide this information.
You need an escalation to verify all of this system configuration, but from experience this configuration isn't normal. Mis-alignment on LUNs can lead to performance degradation.
We have many customers who have experienced this kind of issue.
Also, you have to update your Data ONTAP to a version equal to or greater than 7.2.7 to support interoperability with Windows 2008.
Data ONTAP 7.3.5.1 is the minimum version for Windows 2008 R2 Domain Controllers and Windows Client Configuration.
Your network configuration may be wrong on the switch or storage (flow control must be full).
As you can see below, the Partition Starting Offset on GPT partitions should be either 33,571,840 bytes for LUNs smaller than 16GB or 134,235,136 bytes for LUNs of 16GB or larger.
Your LUN doesn't have this value...
As I said before, you have to escalate this to NetApp to verify that your current system configuration is normal.
Also, on 7.2.2 the correct LUN type for Win2k8 is GPT (windows_gpt), but if you update to v7.3 you have to change this type to windows_2008.
From my experience, you have to configure the Ethernet switch with flow control enabled: with the port mode set to Auto (the default) and flow control enabled on the switch, the result on the storage system is flow control "full".
In some cases you have to set this manually on your Ethernet switch.
Another thing is that the default MTU size on the Series 2800 switches is 1522 bytes.
The Maximum Transmission Unit (MTU) is the maximum size IP packet the switch can receive for Layer 2 packets inbound on a port. The switch drops any inbound packets larger than the MTU allowed on the port. On ports operating at 10 Mbps or 100 Mbps, the MTU is fixed at 1522 bytes. However, ports operating at 1 Gbps or 10 Gbps speeds accept and forward packets of up to 9220 bytes (jumbo frames).
You can either have jumbo frames or set flow control to "Enabled" on your switch, but not both settings enabled.
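On the 2848 the relevant commands are roughly as below (this is from memory of the ProCurve CLI, so please verify against the manual linked below; the port range and VLAN ID are placeholders):
ProCurve(config)# interface 1-48 flow-control
ProCurve(config)# show interfaces brief
(and, as the mutually exclusive alternative, jumbo frames are enabled per VLAN with: vlan 1 jumbo)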
Download your manual for your switch here:
http://ftp.hp.com/pub/networking/software/Mgmt-Oct2005-59906023-Chap10.pdf
and validate the settings