We've got an ageing IBM N6250 (a rebadged NetApp FAS3250, I think) running Data ONTAP 8.2.1 in 7-Mode.
We want to extract a lot of data from this filer over CIFS - 200TB+ in about 44 shares on a two-headed controller.
Our problem is that the file copies just aren't going fast enough. Individually they work fine, and if I set off a copying job it runs well for a while. But it then collapses - the transfers simply sit there, crawling along at something like 15 seconds to shift a 40KB file.
The copying job is in PowerShell on a Windows Server 2016 VM, doing Copy-Item \\filer\share\file destination. There is 10GbE between the filer and the VM.
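For reference, the job is roughly this shape (the share and destination paths here are placeholders, and the per-file timing log is just an addition worth making so the slow spells can be lined up against filer-side stats later):

    # Enumerate everything on the share and copy file by file, logging how long each copy takes
    $files = Get-ChildItem -Path '\\filer\share' -Recurse -File
    foreach ($f in $files) {
        $sw = [System.Diagnostics.Stopwatch]::StartNew()
        Copy-Item -LiteralPath $f.FullName -Destination 'D:\extract\share' -Force
        $sw.Stop()
        # CSV of path, size, seconds - slow files stand out immediately
        "$($f.FullName),$($f.Length),$($sw.Elapsed.TotalSeconds)" | Add-Content 'D:\extract\copy-timings.csv'
    }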
What sort of things can cause that? I realise this is a vague question, so please ask me for further information and I'll try and supply it.
I'd get a storage-side packet trace, open it in Wireshark, and see where the delays are happening. It's likely not storage-side if the transfers are normally fast, but without some basic perf info it's hard to say.
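If memory serves, 7-Mode can capture on the filer itself with pktt - something along these lines, where the interface name and dump directory are just examples:

    filer> pktt start e0a -d /etc/crash
    ... reproduce a slow copy ...
    filer> pktt stop e0a

The .trc file ends up on the root volume; pull it off and open it in Wireshark to see where the gaps are - client side, on the wire, or in the filer's response times.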
I'd run stats start cifs, let it run for 5 minutes, then stats stop. That will tell you the latency at the controller level. If you see latency there, check the volume level with stats start volume, then stats stop.
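In concrete terms, something like this from the console of each head (run them one at a time, or I believe you can tag each collection with -I <name> and stop it with the same tag):

    filer> stats start cifs
    ... let the copy job run for ~5 minutes ...
    filer> stats stop

    filer> stats start volume
    ... same again ...
    filer> stats stop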
Vol stats are, um, a little more verbose - 3.5MB and 4.0MB of output on the two heads. read_latency is highest on the CIFS volumes being read (bulk data, on slower disk), but there's no instantly obvious difference between the two heads.
So I think the bottleneck is somewhere inside the N series/NetApp itself.
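Rather than wading through the full dump each time, I believe single counters can be pulled with the object:instance:counter form - vol1 here standing in for a real volume name:

    filer> stats show volume:vol1:read_latency
    filer> stats show volume:*:read_latency

The wildcard form is handy for eyeballing every volume on a head at once.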
Outliers? Can you tell me more about this? I know what an outlier is in statistics; I just wanted to know what you mean with regard to my problem.
I'm pretty sure we've got no vscan on there, and I've made sure to exclude it from the servers I'm working with. (The data I'm copying does potentially include malware and I want to retain it perfectly - I scan it and flag it later in the process, but that's after it's off the NetApp.)
We're also not using fpolicy - these CIFS shares are used by a document management system where the permissions are stored in a database and handled by the application. End users have no access to these shares (or indeed any on these filers).
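For completeness, and assuming I'm remembering the 7-Mode console correctly, both are easy to confirm by running the commands bare, since each just reports whether it's enabled and what's configured:

    filer> vscan
    filer> fpolicy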
The higher latency does seem to correspond to the lower performance - i.e. when it's working well, the latency is low.
Obvious question: could it simply be sheer load on the filer?
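The quickest check I know of for that is to watch sysstat on each head while a copy is crawling, e.g.

    filer> sysstat -x 1

which prints CPU, CIFS ops, network throughput and disk utilisation once a second, so it should be obvious if one head is pegged when the transfers stall.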
That should give some output - feel free to attach it here. The volume names can be omitted, but renaming them to something like "vol1" and "vol2" would help, so that if we say vol1 or vol2 you can map it back to your real names.
If you want, you can send the file and a link via PM if security is an issue. One other command if you do:
IBM abandoned their partnership with NetApp a few years ago, and support went with it. Which is a pity, because despite being six or so years old the hardware is still pretty effective (as is the nine-year-old backup filer, though that one does struggle a bit under load). But I reckon NetApp think the sort of people who buy their kit will be willing to pay the money to keep running newer hardware.
I will look at the detailed logging things you've given me, though it might not be instant. However I am starting to wonder if it's something as simple as load. I keep finding other heavy loads on the disks with the CIFS volumes (SCOM agent going mental on a couple of servers, a big database on SATA disk when it should probably be on SAS, a very busy database I can probably move to an SSD based SAN), and I'm slowly working my way through these.
Nobody's yet mentioned cifs.per_client_stats.enable 🙂 That was on, and I turned it off a few days ago (before writing this post). I think I might no longer be getting the dead stops, just go-slows now.
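For anyone who finds this thread later, it's just an option, checked and turned off from the console:

    filer> options cifs.per_client_stats.enable
    filer> options cifs.per_client_stats.enable off

The first form shows the current value, the second turns it off.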
A little more data - not the data I've been asked for, but I thought it might be interesting.
If I create a LUN on the same disks as the CIFS volume on each controller, it seems to perform at the same speed on both.
If I snapmirror to a second NetApp (we've got an old FAS3250 too, also unsupported by NetApp, this time due to being second hand) with fast disk on the destination, a mirror initialize of 50GB or so runs at the same speed from both controllers - 7 minutes or so, roughly 1.3-1.4 Gb/s over the network.
That seems to eliminate networking and the disks as the source of the problem, and points to the protocol itself.
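For reference, the mirror test on each head was roughly the following, with the filer and volume names as placeholders and snapmirror.access already allowing the destination: restrict an empty volume on the destination, then kick off the baseline and time it.

    dest> vol restrict mirror_test
    dest> snapmirror initialize -S sourcefiler:sourcevol mirror_test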