High read latency on volume - 150-200 ms

iqbalasif_jd · ‎2019-04-15

We have been seeing a lot of performance issues for the past week, especially read latency on the filers and on one volume in particular. The volume is approximately 9 TB and hosts many applications. Earlier, performance was good, with latency within 20 ms, but now it is up to 200 ms. We have migrated a few applications to other filers and did not see the performance improve.

On April 6th we added one entry to the netgroup. Earlier we had a limitation with netgroups on the filers, which was resolved a month ago.

I just want to know what is causing the read performance issue on this volume.

GidonMarcus · ‎2019-04-16

Hi

Do you experience the actual latency and impact on the apps?

You can try to get per-client statistics with the nfsstat -h command to see whether the latency correlates with the netgroup change you made. You can also show us the change you made (or try rolling it back), and we may be able to spot an issue with it.

Also, you didn't provide any throughput information. Did the throughput increase or decrease at the same time? (Decreasing throughput can point to latency figures that were already high before, with the average previously pulled down by a lot of other low-latency requests.)
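
To illustrate the averaging effect described above, here is a minimal sketch with made-up numbers (the op counts and latencies are hypothetical, not taken from this filer). It assumes the reported latency is an op-weighted average: if the slow reads stay the same but most of the fast requests go away, the average rises even though no individual request got slower.

```python
def average_latency_ms(workloads):
    """Op-weighted average latency for a list of (ops_per_sec, latency_ms) pairs."""
    total_ops = sum(ops for ops, _ in workloads)
    if total_ops == 0:
        return 0.0
    return sum(ops * latency for ops, latency in workloads) / total_ops


# Hypothetical figures: (ops per second, latency in ms)
before = [(9000, 2.0), (1000, 200.0)]  # many fast requests hide the slow reads
after = [(500, 2.0), (1000, 200.0)]    # fast requests mostly gone, slow reads unchanged

print(round(average_latency_ms(before), 1))  # 21.8
print(round(average_latency_ms(after), 1))   # 134.0
```

This is why the throughput graphs matter alongside the latency graphs: a drop in total ops together with a rise in average latency may mean the workload mix changed rather than the storage getting slower.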

You can also try monitoring the latency on the aggregate to better understand whether it's protocol latency or whether the underlying system is struggling.

If you want to provide us with more info, I suggest collecting a perfstat in minimal mode (so it doesn't expose your configuration, just statistics).

Gidi

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

iqbalasif_jd · ‎2019-04-16

Hi Gidi,

Thanks for the prompt response and your suggestions.

Q) Do you experience the actual latency and impact on the apps?
A) Yes, many applications are seeing extreme slowness and are not able to work, which is causing production impact.

Q) You can try to get per-client statistics with the nfsstat -h command to see whether the latency correlates with the netgroup change you made. You can also show us the change you made (or try rolling it back), and we may be able to spot an issue with it.
A) When users reported the issue, the first thing we did was roll back the change, but this did not fix the issue either. I have attached the NFS statistics for more details.

Q) Also, you didn't provide any throughput information. Did the throughput increase or decrease at the same time? (Decreasing throughput can point to latency figures that were already high before, with the average previously pulled down by a lot of other low-latency requests.)
A) Actually, there was a spike from 00:30 AM on 04/06/2019 until 06 AM the next day. I have attached the graphs for more details.

Q) You can also try monitoring the latency on the aggregate to better understand whether it's protocol latency or whether the underlying system is struggling.
A) Aggregate total throughput is normally between 2k and 3k, but on the reported date it was high for a few hours. Even though the associated aggregate is now at 2-3k, performance is still bad. The graphs show mostly user reads.

Q) If you want to provide us with more info, I suggest collecting a perfstat in minimal mode (so it doesn't expose your configuration, just statistics).

iqbalasif_jd · ‎2019-04-16

Attached the NFS statistic ....

Esxi_host_cache · ‎2020-05-19

There is a category of software called host-side caching software, which caches VM reads and writes to in-host SSD or RAM. By doing so it masks any issues you might have in the storage subsystem, whether the root cause of the high latency is the storage network, controllers, disks or any other component in the storage IO path. Full disclosure: I work for Virtunet Systems, which has such software for ESXi.

Ontapforrum · ‎2019-04-16

Hi,

I can suggest some pointers, as I have dealt with latency issues on NetApp filers lately. As I see it, this is a 7-mode filer, so you may need more effort to pin down the root cause(s). In cDOT, the way latency is reported has completely changed the game and made it much easier for storage admins to determine at which layer the latency is building up.

For a 7-mode filer and the NFS protocol, I would look at the following items first; later we can delve into latencies at other protocol layers.

1) What's the capacity of the aggregate where this volume is hosted?
Output of: filer>df -Ah

2) What's the average CPU usage on this filer, along with the peaks?
Grafana can give you this figure, or simply run:

filer>sysstat -m 1 [for 2-5 minutes; if the CPU is too high, stop it after a minute]

3) Collect the disk statistics:
filer>priv set diag
filer>statit -b
[let it run for 1 minute]
filer>statit -e

All this information should be enough to come to some conclusion. We also need to ask: is this NFS volume presented as an ESX datastore? In that case the average IO size is between 50 and 64 KB. And if you have a 1500-MTU environment, it means you will need:

IO size = 64 KB
1500 MTU => actual payload per packet up to 1460 bytes (about 1.4 KB)

=> Total packets required to complete a single IO: 64/1.4 = about 45 packets on average before it is acknowledged. NFS applications that read/write with smaller IOs may see low latency even when other factors, such as disk or CPU usage, are high.
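
The packet arithmetic above can be sketched as follows. This is a rough estimate only: it assumes NFS over TCP with plain 20-byte IP and TCP headers (no TCP options) and ignores RPC/NFS header overhead. The function name and the 9000-MTU comparison are illustrative additions, not something measured on this filer.

```python
import math


def packets_per_io(io_size_bytes, mtu=1500, ip_header=20, tcp_header=20):
    """Estimate how many full-size TCP segments one IO of the given size needs."""
    payload = mtu - ip_header - tcp_header  # 1460 bytes for a 1500 MTU
    return math.ceil(io_size_bytes / payload)


print(packets_per_io(64 * 1024))            # 45 packets for a 64 KB IO
print(packets_per_io(4 * 1024))             # 3 packets for a 4 KB IO
print(packets_per_io(64 * 1024, mtu=9000))  # 8 packets with jumbo frames
```

Under these assumptions, a small-IO application needs far fewer packets per operation than a 64 KB datastore IO, which is one reason the two can report very different latencies on the same filer.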

Let us know the output of those items.

Thanks,
-Ash