We've got an ageing IBM N6250 (a rebadged NetApp FAS3250, I think) running Data ONTAP 8.2.1 in 7-Mode.
We want to extract a lot of data from this filer over CIFS - 200TB+ in about 44 shares on a two-headed controller.
Our problem is that the file copies just aren't going fast enough. Individually they work fine, and if I set off a copying job it runs well for a while. But it then collapses - the transfers simply sit there, crawling along at something like 15 seconds to shift a 40KB file.
The copying job is in PowerShell on a Windows Server 2016 VM, doing Copy-Item \\filer\share\file destination. There is 10GbE between the filer and the VM.
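For reference, the job is roughly this shape (the share and destination paths here are placeholders, and the per-file timing log is just an addition worth making so the slow spells can be lined up against filer-side stats later):

    # Enumerate everything on the share and copy file by file, logging how long each copy takes
    $files = Get-ChildItem -Path '\\filer\share' -Recurse -File
    foreach ($f in $files) {
        $sw = [System.Diagnostics.Stopwatch]::StartNew()
        Copy-Item -LiteralPath $f.FullName -Destination 'D:\extract\share' -Force
        $sw.Stop()
        # CSV of path, size, seconds - slow files stand out immediately
        "$($f.FullName),$($f.Length),$($sw.Elapsed.TotalSeconds)" | Add-Content 'D:\extract\copy-timings.csv'
    }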
What sort of things can cause that? I realise this is a vague question, so please ask me for further information and I'll try and supply it.
I'd get a storage-side packet trace, open it in Wireshark, and see where the delays are happening. It's likely not storage-side if the transfers are normally fast, but without some basic perf info it's hard to say.
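If memory serves, 7-Mode can capture on the filer itself with pktt - something along these lines, where the interface name and dump directory are just examples:

    filer> pktt start e0a -d /etc/crash
    ... reproduce a slow copy ...
    filer> pktt stop e0a

The .trc file ends up on the root volume; pull it off and open it in Wireshark to see where the gaps are - client side, on the wire, or in the filer's response times.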
I'd run stats start cifs, let it run for 5 minutes, then stats stop. That will tell you the latency at the controller level. If you see latency there, check the volume level with stats start volume, then stats stop.
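In concrete terms, something like this from the console of each head (run them one at a time, or I believe you can tag each collection with -I <name> and stop it with the same tag):

    filer> stats start cifs
    ... let the copy job run for ~5 minutes ...
    filer> stats stop

    filer> stats start volume
    ... same again ...
    filer> stats stop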
Vol stats are, um, a little more verbose - 3.5MB and 4.0MB of output on the two heads. read_latency is highest on the CIFS volumes being read (bulk data, on slower disk), but there's no instantly obvious difference between the two heads.
So I think the bottleneck is somewhere inside the N series/NetApp itself.
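Rather than wading through the full dump each time, I believe single counters can be pulled with the object:instance:counter form - vol1 here standing in for a real volume name:

    filer> stats show volume:vol1:read_latency
    filer> stats show volume:*:read_latency

The wildcard form is handy for eyeballing every volume on a head at once.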
Outliers? Can you tell me more about this? I know what an outlier is in statistics; I just wanted to know what you mean with regard to my problem.
I'm pretty sure we've got no vscan on there, and I've made sure to exclude it from the servers I'm working with. (The data I'm copying does potentially include malware and I want to retain it perfectly - I scan it and flag it later in the process, but that's after it's off the NetApp.)
We're also not using fpolicy - these CIFS shares are used by a document management system where the permissions are stored in a database and handled by the application. End users have no access to these shares (or indeed any on these filers).
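For completeness, and assuming I'm remembering the 7-Mode console correctly, both are easy to confirm by running the commands bare, since each just reports whether it's enabled and what's configured:

    filer> vscan
    filer> fpolicy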
The higher latency does seem to correspond to the lower performance - i.e. when it's working well, the latency is low.
Obvious question: could it simply be sheer load on the filer?
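The quickest check I know of for that is to watch sysstat on each head while a copy is crawling, e.g.

    filer> sysstat -x 1

which prints CPU, CIFS ops, network throughput and disk utilisation once a second, so it should be obvious if one head is pegged when the transfers stall.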
That should give some output - feel free to attach it here. The volume names can be omitted, but renaming them to something like "vol1" and "vol2" would help, so that if we say vol1 or vol2 you can map it back to your real names.
If you want, you can send the file and a link via PM if security is an issue. One other command if you do:
IBM abandoned their partnership with NetApp a few years ago, and support went with it. Which is a pity, because despite being six or so years old the hardware is still pretty effective (as is the nine-year-old backup filer, though that one does struggle a bit under load). But I reckon NetApp think the sort of people who buy their kit will be willing to pay the money to keep running newer hardware.
I will look at the detailed logging things you've given me, though it might not be instant. However I am starting to wonder if it's something as simple as load. I keep finding other heavy loads on the disks with the CIFS volumes (SCOM agent going mental on a couple of servers, a big database on SATA disk when it should probably be on SAS, a very busy database I can probably move to an SSD based SAN), and I'm slowly working my way through these.
Nobody's yet mentioned cifs.per_client_stats.enable 🙂 That was on, and I turned it off a few days ago (before writing this post). I think I might no longer be getting the dead stops, just go-slows now.
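For anyone who finds this thread later, it's just an option, checked and turned off from the console:

    filer> options cifs.per_client_stats.enable
    filer> options cifs.per_client_stats.enable off

The first form shows the current value, the second turns it off.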
A little more data - not the data I've been asked for, but I thought it might be interesting.
If I create a LUN on the same disks as the CIFS volume on each controller, it seems to perform at the same speed on both.
If I snapmirror to a second NetApp (we've got an old FAS3250 too, also unsupported by NetApp, this time due to being second hand) with fast disk on the destination, a mirror initialize of 50GB or so runs at the same speed from both controllers - 7 minutes or so, roughly 1.3-1.4 Gb/s over the network.
That seems to eliminate networking and the disks as the source of the problem, and points to the protocol itself.
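For reference, the mirror test on each head was roughly the following, with the filer and volume names as placeholders and snapmirror.access already allowing the destination: restrict an empty volume on the destination, then kick off the baseline and time it.

    dest> vol restrict mirror_test
    dest> snapmirror initialize -S sourcefiler:sourcevol mirror_test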