Hi Timothy,
You got it all wrong thinking I want to use the SnapDiff API to implement my own backup tool. As such, I will give the context on why I was researching for this:
Let's say you have a 2 TB qtree that is full of small files arranged in a hierarchy of many directories and that it would be useful to know at least on a daily basis if not instantaneous the size of each of the directories in the hierarchy. The only way to get that is to do a full metadata crawl of the filesystem. Due to the way WAFL works, you are doing a full crawl of the qtree, thus competing for resources in termos of IO and CPU with the other clients of the filer which, of course, impacts the overall performance. Besides this, even if you would have the filer for yourself only for this, it will still take a hefty amount of time to do this (hours). So, doing this on the spot is not very useful, hence you think of implementing something else (I'm sure I'm not the only customer thinking of this) - you scan the filesystem in the off-peak hours, dump this metadata to some database of choice, calculate the size of each folder off-line and then you make this information avaialable to your customers via a web interface for example. Depending on the performance impact and business requirements you might want to do this once per week or once per day, etc., but preferably as often as possible so that your customers have up2date data.
Now, imagine you do daily backups of this qtree using snapvault. If you use a filer strictly for backup, this means that this backup filer is used only during the off-peak hours for backups and it sits idle during the rest of the time. This means that you could off-load the metadata crawl from the primary filer to this secondary filer, which is good since you don't impact the performance of the primary filer anymore and you use the secondary backup filer which was sitting idle anyway. But it still takes hours. But, hang on ......... there is this API called SnapDiff, which would give you a list of the files that changed since the last backup, so if you do a daily backup, this means that theoretically you could crawl the secondary filer only for the files that changed and you update your off-line metadata database only for those, then you redo your directory size calculations.
Now imagine you have a primary filer with 10 or more of these 2 TB qtrees and you can easily see the time and resources you would spare.
The NetApp Catalog Engine looks to me like something that uses the same API and up to some point the same line of thought, but geared towards backups. Spyglass, about which I read in a USENIX paper, looks pretty close to it, but I see it never made it into an actual product.
Correct me if I'm wrong but it looks to me that this usage is perfectly within the limits of the SDK license. Even if you interpret the metadata that I am crawling and dumping to a database as data transfer, this is still within the license as I am dumping this data on a NetApp filer, of course.
In terms of performance, as long as you schedule the crawl of the secondary filer while it is idle (unless you need to recover a huge amount of data or your daily backups take a long time, the idle time can be measured in hours), there is no performance impact what so ever, so there would be no need for support from NetApp. And since you only read data, this should not affect the backups in any way. And you know that due to the way WAFL works, metadata crawls are expensive, otherwise NetApp wouldn't have tried to develop something similar, see above - Spyglass.
Let me know if anything of what I said above is incorrect or unfeasible.
Best regards,
Sebastian