Data Backup and Recovery
Data Backup and Recovery
Hi,
Current situation :
We recently bought a NetApp FAS 3140 Metrocluster with 26 x 500Gb sata disk for each side. This system is
used to store all our VMware server in NFS volumes. We have a few iSCSI connected LUN's for the Exchange
and SQL Server 2008 (requirement for snapmanagers).
Database info :
Since we migrated our WMS and Dynamics databases to the new SQL Server (on the LUN's) we started having
performance problems when a high load was created on the Microsoft Dynamics system. The WMS and
Dynamics database both slowed down below useable limits.
I have contacted support and got a lot of performance tracking information. At the moment I'm monitoring network,
cpu, disk I/O on the SQL Server and the NetApp system and got lot's of data, but can really firgure out what is
causing the problems so far. So I would be interessed in hearing from anyone who has experience with Micrsoft
Dynamics NAV on NetApp volumes and its performance.
What I have tried so far :
The results :
Questions :
Thanks in advance for any extra information on this topic...
Hi and welcome to the Communities!
This is a tricky one, as there may be a gazillion of reasons why performance is poor - e.g. it may be storage, but equally it may be iSCSI connectivity (say a lack of jumbo frames).
In general using SATA disks for running any sort of databases where a lot of random IO activity occurs is not recommended.
Can you post a small sample of your sysstat -x 1 output, ideally taken when performance sucks?
Regards,
Radek
Hi,
The peak loads for today had already past by, but I repeated one of the tests with copying a file to the affected volume.
The output of the command can be seen in this table and look the same as what we see when performance is low.
I will capture again during peak load tomorrow (around 9AM belgian time) and update this info, maybe you already see
something strange on the data below ?
The iSCSI colum is showing latency of the disks ?
6% 560 1 0 561 496 16417 13460 23152 0 0 32 98% 100% :f 29% 0 0 0 0
3% 75 1 0 84 148 571 1760 10204 0 0 32 87% 76% : 42% 0 8 0 0
2% 124 1 0 129 1158 2749 2540 0 0 0 32 95% 0% - 8% 0 4 0 0
7% 46 1 0 153 27884 550 132 0 0 0 32 96% 0% - 4% 0 106 0 0
3% 35 1 0 59 6413 202 72 48 0 0 32 97% 0% - 11% 0 23 0 0
9% 61 1 0 190 29446 712 196 0 0 0 32 96% 0% - 4% 0 128 0 0
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
12% 31 1 0 164 34715 765 1528 3048 0 0 34 98% 11% Mn 13% 0 132 0 0
25% 101 1 0 233 35267 722 4996 144216 0 0 20 98% 100% :s 79% 0 131 0 0
11% 22 1 0 152 34998 683 572 39464 0 0 20 98% 100% :f 45% 0 129 0 0
23% 18 1 0 146 34179 711 2540 131608 0 0 20 97% 99% Mn 81% 0 127 0 0
17% 37 1 0 157 35126 705 1156 91584 0 0 20 98% 100% :v 82% 0 119 0 0
10% 50 1 0 175 29853 605 912 6744 0 0 20 98% 21% Mn 18% 0 124 0 0
24% 82 1 0 100 11843 403 7572 172752 0 0 20 100% 100% :f 88% 0 17 0 0
9% 19 1 0 148 29070 568 824 18704 0 0 20 93% 61% : 48% 0 128 0 0
8% 11 1 0 143 35020 699 188 0 0 0 20 94% 0% - 5% 0 131 0 0
32% 24 1 0 153 34914 749 6176 161456 0 0 20 98% 98% Mf 93% 0 128 0 0
12% 21 1 0 150 34963 734 784 37240 0 0 20 92% 66% : 62% 0 128 0 0
12% 41 1 0 170 34394 708 3356 19608 0 0 20 96% 14% Mn 16% 0 128 0 0
27% 35 1 0 164 35448 711 1548 177400 0 0 20 98% 100% :f 90% 0 128 0 0
9% 29 1 0 161 34982 804 396 168 0 0 20 96% 24% Mn 27% 0 131 0 0
29% 17 1 0 146 34699 675 4032 174432 0 0 20 99% 100% :f 92% 0 128 0 0
10% 53 1 0 164 35225 694 864 22440 0 0 20 94% 57% : 39% 0 110 0 0
2% 49 1 0 72 5830 144 24 48 0 0 20 98% 0% - 11% 0 22 0 0
11% 36 1 0 165 28677 577 3584 16208 0 0 20 98% 16% Mn 14% 0 128 0 0
29% 31 1 0 162 35510 711 3228 181952 0 0 20 98% 98% : 85% 0 130 0 0
10% 29 1 0 158 34903 732 384 1726 0 0 20 95% 5% Mn 13% 0 128 0 0
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
27% 43 1 0 172 36628 807 4963 158817 0 0 20 98% 100% :f 90% 0 128 0 0
11% 55 1 0 184 35113 830 756 36312 0 0 20 95% 84% : 55% 0 128 0 0
14% 50 1 0 165 34250 746 6326 19285 0 0 20 98% 21% Mn 24% 0 114 0 0
23% 38 1 0 56 6389 151 4540 180120 0 0 20 99% 100% :v 87% 0 17 0 0
2% 73 1 0 77 425 525 552 0 0 0 20 93% 3% : 5% 0 3 0 0
1% 56 1 0 61 174 78 120 64 0 0 20 100% 0% - 10% 0 4 0 0
1% 51 1 0 55 73 33 108 0 0 0 20 97% 0% - 3% 0 3 0 0
1% 16 1 0 18 57 41 44 0 0 0 20 100% 0% - 3% 0 1 0 0
1% 43 1 0 47 330 83 120 48 0 0 20 94% 0% - 11% 0 3 0 0
iSCSI column shows the number of iSCSI operations per second.
Quick analysis:
Although you are saying this is not during peak, disks are rather busy I would say (Disk util column), so it *may* simply mean you have too few / too slow spindles to cope with the performance requirement.
Hi,
You really only have S-ATA disks attached? They arent the right disk type for database applications as their read latency is quite high and slow write performance (your issues as it seems).
Besides taht, you hit quite a few M CP types:
M CP caused by low mbufs; writes data to the disk in order to prevent an out of memory buffers situation.
Possibly something netapp global support should take care of.
Kind regards
Thomas
We only have S-ATA disk connected to our NetApp. Out hardware vendor did assure us
that the performance on these disk would do for our implementation. I'm trying to figure
out if this is indead a problem with the SATA disk or if something else is causing the
current issues ...
With what I hear now, adding an extra shelf with smaller, but ofcourse, faster FC disk could
be the simple solution (If the money is made available ..).
I will try to recreate these stats tommorow during a peak load on the SQL Server itself and post
them back at this location. Maybe you guys can then have a look at that data again. I will
capture as much data as possible during that period.
Thanks for the extra information so far.
Hi,
Today I tried to create the logs during the periode that the batch process is running. Unfortunatly the performance problems where not that bad
as the previous days so I don't know if anything will be visible on this output.
NORMAL LOAD @ 8:30 AM :
You can see a lot of NFS traffic at that moment. About 2 hours later this traffic goes below 100ops for almost the rest of the working day.
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
8% 745 1 0 750 2149 16003 15524 12776 0 0 2 88% 88% : 47% 0 4 0 0
8% 1484 1 0 1493 6045 6213 7652 0 0 0 2 89% 0% - 26% 0 8 0 0
10% 1646 1 0 1647 6410 5234 13008 48 0 0 2 85% 0% - 28% 0 0 0 0
12% 2514 1 0 2515 9203 12811 16100 0 0 0 2 80% 0% - 25% 0 0 0 0
8% 1412 1 0 1413 5121 29599 9396 0 0 0 2 91% 0% - 28% 0 0 0 0
5% 487 1 0 489 564 15234 16916 64 0 0 5 73% 0% - 23% 0 1 0 0
3% 337 1 0 338 337 5083 5804 0 0 0 5 62% 0% - 24% 0 0 0 0
3% 275 1 0 276 667 3462 3952 0 0 0 5 69% 0% - 23% 0 0 0 0
4% 279 1 0 280 227 8092 10740 48 0 0 5 61% 0% - 23% 0 0 0 0
4% 313 1 0 314 271 10446 11876 0 0 0 5 64% 0% - 34% 0 0 0 0
7% 525 1 0 528 531 23875 25428 24 0 0 5 79% 7% Tn 31% 0 2 0 0
7% 398 1 0 399 250 17349 23188 13368 0 0 5 86% 100% :n 46% 0 0 0 0
8% 264 1 0 265 191 12111 14456 51560 0 0 5 89% 100% :s 74% 0 0 0 0
8% 491 1 0 492 712 22276 26940 9504 0 0 2 83% 100% :v 60% 0 0 0 0
3% 283 1 0 284 338 5606 8236 48 0 0 2 65% 4% : 30% 0 0 0 0
6% 555 1 0 556 567 15916 19336 16 0 0 5 70% 0% - 32% 0 0 0 0
5% 469 1 0 470 342 22930 23652 0 0 0 5 81% 0% - 22% 0 0 0 0
7% 515 1 0 516 463 25817 27928 48 0 0 2 79% 0% - 27% 0 0 0 0
8% 745 1 0 746 825 36942 35008 0 0 0 5 89% 0% - 30% 0 0 0 0
8% 857 1 0 858 479 7441 21836 9156 0 0 2 83% 100% :f 84% 0 0 0 0
8% 1010 1 0 1012 381 8164 25584 64 0 0 2 65% 53% : 100% 0 1 0 0
15% 1439 1 0 1440 1319 87287 82236 0 0 0 2 97% 0% - 28% 0 0 0 0
15% 1420 1 0 1421 1631 84154 78760 0 0 0 2 98% 0% - 29% 0 0 0 0
15% 1402 1 0 1403 2694 84899 79608 64 0 0 2 98% 0% - 26% 0 0 0 0
8% 651 1 0 652 707 32614 33692 0 0 0 7 70% 0% - 45% 0 0 0 0
9% 1338 1 0 1339 1032 33431 34968 48 0 0 7 76% 0% - 38% 0 0 0 0
19% 2480 1 0 2489 6439 46294 52404 3344 0 0 6 82% 14% Tn 47% 0 8 0 0
17% 2370 1 0 2371 8721 34540 37732 19560 0 0 6 87% 100% :f 71% 0 0 0 0
17% 2902 1 0 2903 7834 53625 48624 5024 0 0 2 86% 58% : 53% 0 0 0 0
14% 2144 1 0 2145 6831 47622 41276 0 0 0 2 83% 0% - 56% 0 0 0 0
10% 1201 1 0 1202 632 50712 37496 0 0 0 1 75% 0% - 47% 0 0 0 0
13% 1451 1 0 1452 890 61862 60376 48 0 0 1 78% 0% - 44% 0 0 0 0
8% 827 1 0 828 591 30466 32964 16 0 0 1 70% 0% - 48% 0 0 0 0
11% 1199 1 0 1200 772 43039 42096 0 0 0 8 65% 0% - 43% 0 0 0 0
10% 859 1 0 860 571 41098 40604 48 0 0 6 68% 0% - 38% 0 0 0 0
7% 627 1 0 628 681 31307 32304 0 0 0 6 71% 0% - 39% 0 0 0 0
8% 582 1 0 583 447 28556 27900 0 0 0 6 68% 0% - 40% 0 0 0 0
11% 875 1 0 876 805 47706 53906 64 0 0 1 84% 0% - 39% 0 0 0 0
HIGHER LOAD @ 09:30 AM :
This is when the Dynamics system was running one of it's dayly batch processes.
Here you can see the iSCSI traffic from the SQL server is doing something, but less then I expected to see.
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
20% 682 1 0 881 1716 82960 83396 13356 0 0 5 72% 75% Tv 61% 0 198 0 0
17% 627 0 0 847 1146 92676 89804 36 0 0 17s 60% 14% : 47% 0 220 0 0
15% 131 2 0 391 864 63381 62456 64 0 0 5 42% 0% - 47% 0 258 0 0
15% 193 1 0 483 809 67962 68804 0 0 0 3s 42% 0% - 50% 0 289 0 0
16% 304 1 0 568 1051 72821 70496 0 0 0 3s 45% 0% - 41% 0 263 0 0
16% 239 1 0 523 840 71232 68920 48 0 0 3s 44% 0% - 50% 0 283 0 0
14% 199 1 0 424 1044 66370 64080 0 0 0 3s 42% 0% - 47% 0 224 0 0
16% 256 1 0 545 1992 74446 72452 16 0 0 5 43% 0% - 41% 0 288 0 0
15% 276 1 0 496 974 66720 65076 48 0 0 3s 43% 0% - 51% 0 219 0 0
16% 244 1 0 466 1193 67027 64692 0 0 0 2s 43% 0% - 42% 0 221 0 0
13% 205 1 0 425 1522 58624 56116 0 0 0 3s 39% 0% - 39% 0 219 0 0
16% 382 1 0 535 1540 55811 60020 13872 0 0 3s 70% 84% Tf 65% 0 152 0 0
15% 153 1 0 389 808 66030 64792 4624 0 0 2s 40% 46% : 53% 0 235 0 0
20% 221 1 0 549 1236 91968 89680 0 0 0 2s 40% 0% - 40% 0 327 0 0
18% 137 1 0 468 1413 82047 79788 48 0 0 2s 37% 0% - 38% 0 330 0 0
18% 308 1 0 581 2641 73683 71204 0 0 0 1s 48% 0% - 61% 0 272 0 0
18% 267 1 0 556 1309 82919 80448 0 0 0 2s 42% 0% - 58% 0 288 0 0
21% 339 1 0 672 1334 94787 91556 64 0 0 1s 40% 0% - 51% 0 332 0 0
18% 185 1 0 492 973 84921 84532 0 0 0 5 39% 0% - 51% 0 306 0 0
19% 183 1 0 507 1328 84116 81860 0 0 0 1s 37% 0% - 42% 0 323 0 0
22% 225 1 0 580 1386 99981 95552 48 0 0 1s 37% 0% - 50% 0 354 0 0
22% 319 1 0 674 1302 107788 105248 0 0 0 1s 43% 0% - 42% 0 354 0 0
21% 139 1 0 454 956 83748 83740 9832 0 0 1s 51% 53% Tn 53% 0 314 0 0
19% 153 1 0 495 905 78429 75816 12284 0 0 1s 49% 100% :f 63% 0 341 0 0
16% 206 1 0 484 832 67512 67864 64 0 0 1s 40% 19% : 52% 0 277 0 0
17% 918 1 0 1207 1191 73375 69785 0 0 0 1s 35% 0% - 70% 0 288 0 0
23% 196 1 0 587 1209 106831 103288 48 0 0 1s 36% 0% - 50% 0 390 0 0
19% 629 1 0 902 1068 81943 79732 16 0 0 1s 38% 0% - 71% 0 272 0 0
23% 203 1 0 626 1382 110019 106493 0 0 0 0s 33% 0% - 46% 0 422 0 0
22% 283 1 0 620 1192 100199 97040 48 0 0 0s 38% 0% - 53% 0 336 0 0
20% 197 1 0 542 1422 92367 89644 0 0 0 1s 34% 0% - 46% 0 344 0 0
22% 215 1 0 571 1723 101591 97360 0 0 0 0s 39% 0% - 59% 0 355 0 0
Do you guys see anything strange on the NetApp's performance ?
thanks !
Quick shot at these stats:
Interestingly enough, disk utilisation today @9:30 is lower with more iSCSI operations, than yesterday (with less iSCSI ops) & we can say the filer is coping just fine.
My informed guess is that disk reads vs. writes make the whole difference - there were mostly reads in today's stats & many writes yesterday.
That would somewhat fit into the nature of SATA drives which doesn't cope well with random writes.
Regards,
Radek
One thing that I didn't see in the thread was the type and size of SQL server. One thing that really affects the amount of disk load SQL places onto the storage system is the amount of RAM in the box. In general, more RAM for SQL means lower IOPS pushed down to the storage subsystem because SQL can be smarter about caching and write ordering and etc.
Are you running x64 SQL? How much RAM is assigned to the VM? These days, I almost always run x64 SQL and then tune RAM based on observed performance. If adding RAM does not help performance, then you can back off a bit. May require some tweaking.
Alex
Our server is running Windows Server 2008 x64 and SQL Server 2008 x64. The machine has 12GB of RAM assigned to it and is running as only VM on that physical server (Server has 24GB, so 12GB unused at the moment, as part of the disaster recovery plan).
The amount of RAM assigned to each Instance is :
So let's way we up the memory of the complet server with 6GB and assign 4GB of our Dynamics database.
We would need to see a drop in IOPS to the NetApp ?
So let's way we up the memory of the complet server with 6GB and assign 4GB of our Dynamics database.We would need to see a drop in IOPS to the NetApp ?
I am a tad skeptical about the outcome, as writes, not reads seem to be the main culprit - for that reason host-side caching (achieved by increasing available memory) not necessarily will improve a lot (if anything).
Having said that, this is a virtual environment, so I see some sense in at least trying this.
Regards,
Radek
Your NetApp machine seems fine, there is no disk utilization above 80%. Adding more memory wont help you with writing, helps with read only usualy.
Your disk read/write amount is around 80-100MB each, which fits exactly to 2 x 1GBit ethernet links. What & how many links do you use for iSCSI traffic? You might hit a link saturation. Have you read the Best Practices for VMware & NetApp iSCSI connectivity?
Besides that, your "sysstat -x 1" does not show iSCSI kB/s in/out statistics, means you are on an older version of data ontap. Consider upgrading to the latest 7.3.4 release.
16GB is a pretty healthy size for SQL. Do you have any idea how large your working set is? If it's less than that, adding more RAM won't help. If you're willing to take a reboot, adding more RAM won't HURT anything. SQL certainly LOVES RAM.
Also, is SQL in Simple Recovery mode or standard?
Increasing your RAM will increase your checkpoint depth which means that DB writes will be held in memory longer. However, Log writes will always continue immediately. Are you writing logs to the same LUN as your database? How many logs are you generating per hour?
What is perfmon reporting as your average disk latency on your logical disk object? I'd love to see a dump of the whole logical disk object. How many IOPS report that you are consuming?
As others have pointed out, you may simply be exceeding the IOPS budget for these relatively slow spindles. Separating log writes from the DB onto a separate aggregate may help you get around this if you haven't done that already.
Alex
Seems like the 12 GB is all allocated to SQL server instances. There need to be some memory for Windows too. With 12 GB total no more than 10 GB should be for SQL server (see http://www.sqlservercentral.com/blogs/glennberry/archive/2009/10/29/suggested-max-memory-settings-for-sql-server-2005_2F00_2008.aspx)
I often hear people complaining about the SAN/storage when their SQL server performs badly. Most of the times it can be fixed by optimizing the Windows, SQL server or connectivity to the storage.
Do you have several vCPU's and files for the databases and tempdb?
A great tool for troubleshooting a SQL server problem is Spotlight from Quest. It gives you a great overview (almost on a manager can understand ) and enables you to drill down where the red lights/problems are. There's a 30 days trial version on their website.
-Carsten
Problem solved : Replaced NetApp with HP P4500 Multi-Site SAN
And that has SAS disks, right?
indeed, 4 x P4500 Controller, with each 12 x 600GB 15K rpm SAS disks,
hey,
dont get me wrong but you had a problem with an apple and now bought an orange to solve it ...?!
No worries, he replaced a wrong sized NetApp with a "storage system" from HP... Poor guy