ONTAP Discussions

Linux client: find/tree command extremely slow

JoergR
3,971 Views

Hi  folks,

 

After migrating local file systems to an NFS share on NetApp, we found that running, for example, "find /<directory> -type f | wc -l" just to count the number of files in a directory is extremely slow compared to a "normal" local filesystem.

 

Example from local filesystem:

 

tchsl037:~ # time find /alt -type f | wc -l
619084

real 0m3.033s
user 0m1.365s
sys 0m1.768s

 

and from NFS (the same data, copied from the local filesystem above):

tchsl037:~ # time find /usr/sap/BPT -type f | wc -l
619084

real 1m7.174s
user 0m2.078s
sys 0m7.430s

 

Is there a way how to speed this up?

 

Thanks and regards

 

Joerg

6 REPLIES

Stiegelis
3,967 Views

Can you please tell us more about your NetApp system: the model, and which port is used for your NFS LIF? Is the storage on an all-flash system or on SATA disks? Also check whether the LIF is on its home port.
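
 

One way to check that from the cluster shell (a sketch; <svm> and <nfs_lif> below are placeholders for your SVM and LIF names):

::> network interface show -vserver <svm> -lif <nfs_lif> -fields home-node,home-port,curr-node,curr-port,is-home

If is-home shows false, the LIF has failed over and NFS traffic may be taking an indirect path to the node that owns the volume.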

JoergR
3,736 Views

OK, I asked my colleague in storage management and got the following information (FYI - I'm a UNIX admin):

 

We have a 14-node cluster with a mix of models: FAS8060, AFF A700 and FAS9000.

The LIF is on its home port.

The ONTAP version is 9.6P6.

 

The particular share I'm asking about is on SAS disks with Flash Pool and Flash Cache (SSDs for read cache, Flash Pool SSD storage pools).

 

Hope this helps

JoergR
3,733 Views

Thanks so far to all the comments on my question.

 

I will have a look at XCP - that is new to me because (as said) I'm on the other side of the storage, as a UNIX admin.

 

I understand that metadata performance is faster on local disks than on NFS shares, but I did not expect such a huge difference.

 

Regards

 

Joerg

parisi
3,719 Views

The differences are highly dependent on network speed/congestion, client memory/CPU, storage type, and file count.

 

For instance, that find command against a smaller number of files is fast:

 

# time find /mnt/nas -type f | wc -l
39

real 0m0.448s
user 0m0.003s
sys 0m0.020s

 

But never as fast as local:

 

# time find /etc -type f | wc -l
3085

real 0m0.054s
user 0m0.018s
sys 0m0.037s
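
 

If you want to see what the client itself is measuring per NFS operation, the standard client tools can break it down per mount (a sketch; assumes nfs-utils is installed and uses Joerg's mount point):

# per-operation NFS call counts on the client
$ nfsstat -c

# per-operation counts plus average RTT for this specific mount
$ mountstats /usr/sap/BPT

If GETATTR/READDIRPLUS round trips are in the multi-millisecond range, that alone would account for most of the wall-clock difference.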

 

parisi
3,908 Views

If you're just looking for the number of files in a directory, you can use XCP to get the file count.

 

 xcp scan -l -stats NFS-server:/volume/folder/path

 

XCP is free and can be found at http://xcp.netapp.com.

 

Here's a comparison of XCP's speed vs. find.

 

XCP:

 

Total count: 500,005
Directories: 2
Regular files: 500,003
Symbolic links: None
Special files: None
Hard links: None,
multilink files: None,
Space Saved by Hard links (KB): 0
Sparse data: N/A
Dedupe estimate: N/A
Total space for regular files: size: 50.4 GiB, used: 52.4 GiB
Total space for symlinks: size: 0, used: 0
Total space for directories: size: 48.8 MiB, used: 49.0 MiB
Total space used: 52.4 GiB


Xcp command : xcp scan -l -stats 10.193.67.219:/flexgroup_16/files
Stats : 500,005 scanned
Speed : 90.0 MiB in (1.12 MiB/s), 444 KiB out (5.53 KiB/s)
Total Time : 1m20s.
STATUS : PASSED

 

find: 

 

# time find /flexgroup/files -type f | wc -l
500003

real 13m57.454s
user 0m1.886s
sys 0m34.219s

 

With XCP scan, you also get file size info, file age, etc.

 

For example:

 

== Maximum Values ==
Size Used Depth Namelen Dirsize
4.63 GiB 4.65 GiB 2 15 500,002

== Average Values ==
Namelen Size Depth Dirsize
14 106 KiB 2 250,002

== Top Space Users ==
root
52.4 GiB

== Top File Owners ==
root
500,005

== Top File Extensions ==
.dat .log .iso .out
500,000 1 1 1

== Number of files ==
empty <8KiB 8-64KiB 64KiB-1MiB 1-10MiB 10-100MiB >100MiB
1 500,000 1 1

== Space used ==
empty <8KiB 8-64KiB 64KiB-1MiB 1-10MiB 10-100MiB >100MiB
47.7 GiB 42.3 MiB 4.65 GiB

== Directory entries ==
empty 1-10 10-100 100-1K 1K-10K >10K
1 1

== Depth ==
0-5 6-10 11-15 16-20 21-100 >100
500,005

== Accessed ==
>1 year >1 month 1-31 days 1-24 hrs <1 hour <15 mins future
1 100,596 312,452 86,954

== Modified ==
>1 year >1 month 1-31 days 1-24 hrs <1 hour <15 mins future
1 115,870 312,452 71,680

== Changed ==
>1 year >1 month 1-31 days 1-24 hrs <1 hour <15 mins future
1 115,870 312,452 71,680

 

And if you use XCP 1.6 or later, you can use File System Analytics, which keeps a running tally of directories, sizes, and files.

 

In a future release, the file analytics will be available natively in System Manager.

 

Keep in mind that local processing of these operations will always be faster because there's no network contention to deal with. With any network-based protocol, there's a back-and-forth conversation that adds some amount of latency to every request, depending on network health. There's also processing on the NFS server side that adds latency to each request. Locally, it's much faster because there's far less round-trip time to deal with.
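
 

As a rough back-of-the-envelope illustration (the per-call latencies here are assumptions for the sake of arithmetic, not measurements from this system): if each of the ~619,000 files costs on the order of one metadata round trip, the totals land in the right ballpark:

# ~0.1 ms per round trip over NFS vs. ~1 µs from the local cache
$ echo "619084 * 0.0001" | bc      # seconds spent on NFS round trips
61.9084
$ echo "619084 * 0.000001" | bc    # seconds locally
.619084

That lines up roughly with the 1m7s vs. 3s Joerg measured.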

 

The benefit of NFS is performance when *many* clients need to run operations against the same data. That can't be done locally on the same datasets, and if clients connect to other clients to run processes, those clients will bottleneck much sooner than a storage system will.

paul_stejskal
3,884 Views

Tacking on to what Justin said: with a local disk, the metadata sits in the server's RAM, so each call is answered in nanoseconds or microseconds. With NAS, each call is more like milliseconds.

 

Whether it needs a full minute depends on 1) directory depth, 2) the total number of folders, etc. If there are a lot of folders, I wouldn't be surprised.

 

Recommended things to review: 

1) Get a packet trace from the storage end: https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/How_to_capture_packet_traces_(tcpdump)_on_ONTAP_9.2__systems

(tcpdump start -node <nodename> -port <port> -buffer 4096 -address <ip address of nfs client>)

2) Run "qos statistics volume latency show -volume <volume> -vserver <SVM>" on the volume in question.

3) Start the find.

4) Run "tcpdump stop *", pull the trace from the SPI, and stop the qos statistics. (A rough consolidated sketch of the whole sequence is below.)
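
 

Roughly, the sequence looks like this (a sketch; node, port, volume, SVM, and client IP are placeholders, and the ONTAP commands are the ones from the steps/KB above):

On the cluster:

::> tcpdump start -node <nodename> -port <port> -buffer 4096 -address <ip address of nfs client>
::> qos statistics volume latency show -volume <volume> -vserver <SVM>    (leave running in a second session; Ctrl-C when done)

On the client, while the capture is running:

tchsl037:~ # time find /usr/sap/BPT -type f | wc -l

Back on the cluster once the find finishes:

::> tcpdump stop *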

 

Now open the trace in Wireshark. I would expect to see either a single READDIRPLUS call or a bunch of READDIRs per folder. That should give a clue as to where most of the time is being spent.
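
 

If you'd rather do a first pass from the command line, tshark can summarize the same trace (a sketch; assumes NFSv3 and a capture file named trace.pcap - the field names differ for NFSv4, so double-check against your Wireshark version):

# per-procedure call counts and service response times for NFSv3 (RPC program 100003, version 3)
$ tshark -q -r trace.pcap -z rpc,srt,100003,3

# count the READDIRPLUS (procedure 17) and READDIR (procedure 16) calls the client issued
$ tshark -r trace.pcap -Y 'nfs.procedure_v3 == 17 || nfs.procedure_v3 == 16' | wc -l

A very large number of small READDIR/GETATTR calls with healthy per-call response times usually points at round-trip count, rather than storage latency, as the bottleneck.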
