I saw some references and I know software such as IBM's TSM and Bacula use the snapdiff API for backups, but I wasn't able to find any documentation on it; I downloaded and browsed the documentation for all SDKs but was unable to find anything.
Can anyone point me to such documentation? Or is this API not publicly available?
Implementing the SnapDiff API is for approved partners only. While it is possible to use the SnapDiff API via the NMSDK, any such implementation would make you unsupported.
Please be aware of the SnapDiff Clause in the NMSDK EULA:
SNAPDIFF API LICENSE GRANT. The SDK may also include the NetApp SnapDiff APIs and accompanying API documentation (collectively, “SnapDiff”), for which the following license terms apply. NetApp grants You a limited, terminable, revocable, nonexclusive license, with no right of sublicense, to implement SnapDiff, in Your Licensee Application, as long as such implementation does not have the functionality to transfer data from a NetApp storage device to a non-NetApp disk or solid state storage device.
Only approved partners are given accompanying documentation and guidelines on using SnapDiff APIs. Their Implementations are also vetted by NetApp Engineering.
Please note a key Provision of the license: Implementations should not have the functionality to transfer data from a NetApp Storage system to non-NetApp disk or solid state storage devices. Such implementations will be in violation of the license.
Also, please note the following:
Unless you are a backup partner that has been explicitly approved for SnapDiff API use, using SnapDiff is UNSUPPORTED. Usage and Performance profiling has NOT been done for the SnapDiff API: this means there is potential for the SnapDiff API to take away system resources from other Data ONTAP subsystems. Thus Client IO could be affected. If you run into such issues on your NetApp systems, you will be UNSUPPORTED.
Please contact me offline via email if you need more clarity.
You got it all wrong in thinking I want to use the SnapDiff API to implement my own backup tool. So let me give the context on why I was researching this:
Let's say you have a 2 TB qtree full of small files arranged in a hierarchy of many directories, and that it would be useful to know the size of each directory in the hierarchy, at least on a daily basis if not instantaneously. The only way to get that is to do a full metadata crawl of the filesystem. Due to the way WAFL works, a full crawl of the qtree competes for resources in terms of IO and CPU with the other clients of the filer, which of course impacts overall performance. Besides this, even if you had the filer all to yourself, the crawl would still take a hefty amount of time (hours). So doing this on the spot is not very useful, hence you think of implementing something else (I'm sure I'm not the only customer thinking of this): you scan the filesystem in the off-peak hours, dump the metadata to a database of your choice, calculate the size of each folder offline, and then make this information available to your customers, for example via a web interface. Depending on the performance impact and business requirements you might want to do this once per week or once per day, etc., but preferably as often as possible so that your customers have up-to-date data.
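To make the full-crawl approach concrete, here is a minimal sketch in Python. It assumes the qtree is mounted on the machine doing the crawl (over NFS, say) and keeps the metadata in plain dictionaries instead of a real database. The function names crawl_sizes and directory_totals are made up for illustration and are not part of any NetApp tool.

```python
import os
from collections import defaultdict

def crawl_sizes(root):
    """Full metadata crawl: return {file_path: size_in_bytes} under root."""
    root = os.path.normpath(root)
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat reads metadata only and does not follow symlinks
                files[path] = os.lstat(path).st_size
            except OSError:
                # file vanished or is unreadable; skip it
                continue
    return files

def directory_totals(files, root):
    """Offline step: roll file sizes up into every ancestor directory."""
    root = os.path.normpath(root)
    totals = defaultdict(int)
    for path, size in files.items():
        d = os.path.dirname(path)
        while True:
            totals[d] += size
            if d == root:
                break
            parent = os.path.dirname(d)
            if parent == d:  # reached the filesystem root, stop
                break
            d = parent
    return dict(totals)
```

The expensive part is crawl_sizes, since it touches every inode; directory_totals works purely on the dumped metadata and can run anywhere.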
Now, imagine you do daily backups of this qtree using SnapVault. If you use a filer strictly for backup, this backup filer is busy only during the off-peak hours, when the backups run, and sits idle the rest of the time. This means you could offload the metadata crawl from the primary filer to this secondary filer, which is good: you no longer impact the performance of the primary filer, and you use the secondary backup filer, which was sitting idle anyway. But it still takes hours. But, hang on ... there is this API called SnapDiff, which would give you a list of the files that changed since the last backup. So if you do a daily backup, theoretically you could crawl the secondary filer only for the files that changed, update your offline metadata database only for those, and then redo your directory size calculations.
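The incremental idea can be sketched as follows. This does not call SnapDiff, whose interface is not publicly documented. It only assumes that some source hands you a list of (path, new_size) pairs for changed files, with new_size set to None for deletions, and that you already hold a path-to-size map and per-directory totals from an earlier full crawl. The record format and the name apply_changes are assumptions made for illustration.

```python
import posixpath

def apply_changes(files, totals, changes, root):
    """Update files and totals in place from a list of changed files.

    changes: iterable of (path, new_size); new_size is None for a deletion.
    Paths are assumed to be slash-separated and located under root.
    """
    root = posixpath.normpath(root)
    for path, new_size in changes:
        old_size = files.get(path, 0)
        if new_size is None:
            files.pop(path, None)
            delta = -old_size
        else:
            files[path] = new_size
            delta = new_size - old_size
        if delta == 0:
            continue
        # push the size difference up to every ancestor directory
        d = posixpath.dirname(path)
        while True:
            totals[d] = totals.get(d, 0) + delta
            if d == root:
                break
            parent = posixpath.dirname(d)
            if parent == d:  # safety stop at "/"
                break
            d = parent
```

The work done here is proportional to the number of changed files times the directory depth, not to the total number of files in the qtree, which is where the time saving would come from.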
Now imagine you have a primary filer with 10 or more of these 2 TB qtrees and you can easily see the time and resources you would spare.
The NetApp Catalog Engine looks to me like something that uses the same API and up to some point the same line of thought, but geared towards backups. Spyglass, about which I read in a USENIX paper, looks pretty close to it, but I see it never made it into an actual product.
Correct me if I'm wrong but it looks to me that this usage is perfectly within the limits of the SDK license. Even if you interpret the metadata that I am crawling and dumping to a database as data transfer, this is still within the license as I am dumping this data on a NetApp filer, of course.
In terms of performance, as long as you schedule the crawl of the secondary filer while it is idle (unless you need to recover a huge amount of data or your daily backups take a long time, the idle time can be measured in hours), there is no performance impact whatsoever, so there would be no need for support from NetApp. And since you only read data, this should not affect the backups in any way. And you know that, due to the way WAFL works, metadata crawls are expensive; otherwise NetApp wouldn't have tried to develop something similar (see Spyglass above).
Let me know if anything of what I said above is incorrect or unfeasible.