Java file-list-directory-iter performance

michael_england · 2015-02-17

I'm running a Java web service that queries file attributes (in particular creation, modification and access times) using "file-list-directory-iter". It calls the API recursively for each directory it finds, with the work spread across multiple threads. The process works well, but the load it puts on the filer is higher than I'd like. Here's the code snippet:

request = new NaElement("file-list-directory-iter");
request.addNewChild("path", "/vol/"+volumeName+path);
request.addNewChild("max-records", "65536");
response = server.invokeElem(request);
NaElement fileInfo = response.getChildByName("attributes-list");
List<NaElement> fileList = fileInfo.getChildren();
for (NaElement element : fileList) {
    String fileType = element.getChildContent("file-type");
    String fileName = element.getChildContent("name");
    if (fileType.equalsIgnoreCase("directory")) {
        // Skip the current, parent and snapshot directory entries
        if (!(fileName.equals(".") || fileName.equals("..") || fileName.equals(".snapshot"))) {
            directoryList.addDirectory(fileName);
        }
    } else if (fileType.equalsIgnoreCase("file")) {
        MyFile myFile = new MyFile();
        myFile.setFileName(fileName);
        myFile.setFileSize(Long.valueOf(element.getChildContent("file-size")));
        myFile.setBytesUsed(Long.valueOf(element.getChildContent("bytes-used")));
        myFile.setAccessTime(Long.valueOf(element.getChildContent("accessed-timestamp")));
        myFile.setCreateTime(Long.valueOf(element.getChildContent("creation-timestamp")));
        myFile.setModifiedTime(Long.valueOf(element.getChildContent("modified-timestamp")));
        directoryList.addFile(myFile);
    }
}

The problem I'm seeing is that the load on the filer is about 2 IOPs for every file and directory. That means a simple test directory with 60,000 files costs about 120,000 IOPs. It's nice and quick, but if I increase the worker thread count I can completely saturate all CPUs on a 6240 for a while, which isn't ideal.
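To illustrate the thread-count trade-off described above, here is a minimal sketch of capping the number of API calls in flight independently of the worker pool size. It does not reduce the per-file cost, it only bounds how hard the filer is hit at any one moment. The class name ThrottledScanner, the demo method and the sleep standing in for server.invokeElem(request) are all hypothetical, not part of the NetApp SDK.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: limits how many API calls run at the same time.
public class ThrottledScanner {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    public ThrottledScanner(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs one call while holding a permit; blocks if the cap is reached.
    public void call(Runnable apiCall) throws InterruptedException {
        permits.acquire();
        try {
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max); // record highest concurrency seen
            apiCall.run();
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    public int peakConcurrency() {
        return peak.get();
    }

    // Stand-in for the real API call (assumption: it takes a few milliseconds).
    private static void fakeApiCall() {
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs `tasks` fake calls on `workers` threads with at most `limit` in flight.
    public static int demo(int workers, int limit, int tasks) {
        ThrottledScanner scanner = new ThrottledScanner(limit);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    scanner.call(ThrottledScanner::fakeApiCall);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return scanner.peakConcurrency();
    }

    public static void main(String[] args) {
        // 8 worker threads, but never more than 2 calls in flight.
        System.out.println("peak concurrency: " + demo(8, 2, 40));
    }
}
```

In a real scanner the Runnable passed to call would wrap the invokeElem request for one directory, so the worker count could stay high for parsing while the cap protects the array.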

I've also created a PowerShell script that does similar work, although over a CIFS connection rather than through the API. This seems to cost about 1 IOP for every 10 files, so 60,000 files need about 6,000 IOPs in total, which is much more efficient. The problem is that PowerShell spends minutes churning through the data. I'd rather use the Java code, as it offers me more flexibility (security, NFS or CIFS) and I can run it as a REST-based web service rather than a Windows-only script, but I'm not sure it will scale without seriously impacting performance on the arrays once it is deployed widely.
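The two cost figures above can be put side by side with a few lines of arithmetic. This is only a sketch of the estimate, using the observed ratios from the post (2 IOPs per entry through the API, 1 IOP per 10 entries over CIFS). The class and method names are made up for illustration.

```java
// Hypothetical helper for comparing the two observed cost ratios.
public class IopEstimate {
    // Total operations when every entry costs a fixed number of operations.
    public static long perEntryCost(long entries, long opsPerEntry) {
        return entries * opsPerEntry;
    }

    // Total operations when one operation covers several entries (rounded up).
    public static long batchedCost(long entries, long entriesPerOp) {
        return (entries + entriesPerOp - 1) / entriesPerOp;
    }

    public static void main(String[] args) {
        long entries = 60_000;
        long api = perEntryCost(entries, 2);   // API path: 2 IOPs per entry
        long cifs = batchedCost(entries, 10);  // CIFS path: 1 IOP per 10 entries
        // prints "120000 vs 6000 (20x)"
        System.out.println(api + " vs " + cifs + " (" + (api / cifs) + "x)");
    }
}
```

On these numbers the API path costs roughly 20 times as many operations as the CIFS path for the same directory.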

Has anyone else played with file-list-directory-iter? Or is there a way to improve the Java code so that it has less impact on the array?