ONTAP Discussions

Self organisation of many files at one location

juerg_maier

Hi

Assume I get 5,000 pictures daily.

I plan to name every pic with an MS NewID() unique identifier (36 chars) when it gets inserted and attached to my app elements.

I would like to store all pics in a common location (let's say a dedicated NetApp system).

Over the years NetApp would need to deal with lots of files at that location, and retrieval of up to 10 of the pics should always be fast, including for pics added in the past.

Will I need to add some kind of logic (maintain a service using an index, store files in a folder structure), or will NetApp handle this fairly well on its own?

Thanks

Juerg

5 REPLIES

ccolht

I'm fairly new to NetApp, but this doesn't seem like the kind of solution a filer can provide on its own. The filer provides either raw disk (a LUN) or a file system; nothing in NFS or CIFS natively does this without software support.

Where I've seen systems that work like this, they use either a database of metadata to store the locations of groups of files, or some kind of hashing function to keep related files together. If a group of files will only ever belong to one group, they can be packed together into one 'super' file with some kind of delimiting, so reads and writes are against one file; the app then has to manipulate the super file to access individual elements. Not very flexible, but it can save lots of opens. Hashing would seem like the best bet for quickly finding files, though it requires a naming convention that keys into the hash. Using a database would allow the most flexibility, since you can build queries on the metadata stored for each file. All of this happens above the storage layer, so it wouldn't matter what kind of disk you had as long as it was optimal for the type of database/file system you need.
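Just to illustrate the hashing idea, a minimal Python sketch (the bucket count and root path are made-up examples, nothing NetApp-specific):

import hashlib
import os

def bucket_path(root, filename, buckets=4096):
    # Derive a stable subfolder from a hash of the file name. The same
    # name always maps to the same bucket, so finding a file never needs
    # a directory scan -- just recompute the hash.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return os.path.join(root, "%04d" % bucket, filename)

# e.g. /pics/<bucket>/<guid>.jpg
print(bucket_path("/pics", "2f1d9c5e-7b1a-4c3d-9e8f-0a1b2c3d4e5f.jpg"))

The key point is that the path is computable from the name alone, so the app never has to enumerate a huge directory to find a picture.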

If the pictures are small (or even if they are not), they can be stored as BLOBs in a database. 5,000 new records a day is not that high a rate, and database engines have all kinds of tricks to speed up operations with caching and indexes.
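Purely to illustrate the BLOB idea, a rough sketch with SQLite (the table and column names are invented for the example, and the source file is assumed to exist):

import sqlite3

conn = sqlite3.connect("pictures.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS picture ("
    " pic_id    TEXT PRIMARY KEY,"   # the 36-char unique identifier
    " object_id TEXT,"               # the app element the pic belongs to
    " data      BLOB)"               # the image bytes themselves
)

with open("example.jpg", "rb") as f:   # hypothetical incoming picture
    conn.execute(
        "INSERT INTO picture (pic_id, object_id, data) VALUES (?, ?, ?)",
        ("2f1d9c5e-7b1a-4c3d-9e8f-0a1b2c3d4e5f", "object-42", f.read()),
    )
conn.commit()

Whether this beats plain files depends on image sizes and on how much you lean on the engine's caching and indexing.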

All of this depends on your definition of 'fast'.

BrendonHiggins

Hi, welcome to the community.

I used to work for a mail-order catalogue, and the graphics design department used to break all sorts of storage systems trying to do this.  In the end you will have to stop and look at the workflow that is dumping the images into the file storage.  How are you going to find anything again, and once you have hundreds of thousands of files in a folder (20 days' worth), what is performance going to be like when you try to open that folder?  {poor}

Option A

Create a workflow procedure where the root folder has the year, which contains month folders, which contain day folders, each holding the 5,000 files.  We have a document management system on a NetApp CIFS share with 31,000,000 files (Word, Excel, etc.) in a single volume and it works great, but the FAS3070 is now starting to feel the strain.  Two years ago the number of files killed the old 3rd-party backup system, which could not handle more than 20,000,000 items in a single folder (a database issue in the app).  The backup was also slow because of the large number of small files; NetApp's SnapVault D2D backup solution fixed this, however.  We snap this volume every 15 minutes and SnapMirror it to a 2nd site.  So you could get many years from this solution if you stick with 5,000 per day.  If the number gets large, F5 have systems which can help increase what is possible with a filer, however $$$$.

Option B

Purchase an image management database; it will create the subfolders, link the different volumes together, and manage metadata so you can search the images.

Option B is the way the catalogue people went in the end, but a photo shoot would generate millions of images which needed to be searched by different people and linked into pages of catalogues.

If you go for Option A, create some sort of script which creates the subfolders and copies the files (a rough sketch follows), or train the people who are uploading the files.  However, if you cannot find the images again, is there any point in keeping them?
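For what it's worth, a minimal sketch of such a script in Python (the root share path and incoming file name are only placeholders):

import os
import shutil
from datetime import date

def store_in_date_folder(src_path, root="/vol/pics"):
    # Copy a file into root/YYYY/MM/DD/, creating the folders as needed.
    today = date.today()
    dest_dir = os.path.join(root, "%04d" % today.year,
                            "%02d" % today.month, "%02d" % today.day)
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy2(src_path, dest_dir)

# Drops the incoming picture into root/<year>/<month>/<day>/
store_in_date_folder("incoming/2f1d9c5e-7b1a-4c3d-9e8f-0a1b2c3d4e5f.jpg")

That keeps each day folder at roughly the 5,000-file mark, which is easy on the filer and easy to browse.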

Hope it helps

Bren

juerg_maier

Hi Bren

Thanks for your input.

To make my point a bit clearer: I do have a database where I keep my objects. Users can add pictures to objects on their own; the pics get standardised in resolution and size by the app, and each picture gets this unique ID that I also wanted to use as the file name of the pic. The assignment of pic to object is kept in a DB table, so I do not lose this information. Pics will only be read after querying the DB, so the specific file names will be known.

So the question was: do I need to maintain an additional folder structure (it could, e.g., be the first 8 chars of my uniqueidentifier) to limit the number of files per folder, as too many might slow down file retrieval?
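A quick sketch of what I mean (Python, just for illustration; the share path is made up):

import os

def pic_path(root, guid_filename, prefix_len=8):
    # Put the file under a folder named after the first 8 chars of its GUID.
    # Since the GUID comes back from the DB query anyway, the full path can
    # be rebuilt without ever listing a directory.
    return os.path.join(root, guid_filename[:prefix_len], guid_filename)

# e.g. \\filer\pics\2f1d9c5e\2f1d9c5e-7b1a-4c3d-9e8f-0a1b2c3d4e5f.jpg
print(pic_path(r"\\filer\pics", "2f1d9c5e-7b1a-4c3d-9e8f-0a1b2c3d4e5f.jpg"))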

If you say you were able to keep millions of pics in the same folder, I would rather choose to leave the folder level(s) out.

Thanks

Juerg

BrendonHiggins

It was a pig to find, but here you go:

https://kb.netapp.com/support/index?page=content&actp=LIST&id=3012261

What is the maximum number of subdirectories that a single directory may contain?

Subdirectories may be limited by the number of available inodes and the maxdirsize setting.

Limit of 99,998 directories per sub-directory.

You can face limits on creating a subdirectory within a directory because subdirectories are also files and hence are limited by the following:

  • Maximum size of the parent directory (maxdirsize). An error message is generated when this limit is hit, and an error is returned to the client, which generally interprets the error as a full volume, although this does not necessarily mean that the volume is out of space.
  • Number of files that can exist in a volume (maxfiles)
  • Maximum size of the volume
  • 100K/64K limitation specific to subdirectories

Enter df -i or maxfiles to view the available number of inodes.

Before you go forward with your design, talk to your NetApp SE and get them to ask the NetApp performance people about the numbers you should be using as a limit.  Just because you can doesn't mean you should, etc...

Hope it helps

Bren

madden

NetApp has lots of experience with high file count environments. Any application or NetApp system where one or more of the following conditions is true can be considered to be a high file-count environment:
o) Single directories frequently contain more than 50,000 files each
o) Single volumes contain more than 10 million files each
o) Millions of active files have long file names and/or deep directory trees

An internal tech report (TR-3537 "High File-Count Environment Best Practices") is available to customers under NDA.  Contact your NetApp representative for it.  It discusses not only the WAFL impact of various directory structures, but also other topics including client access (client software can also have issues with many files in a given directory), backup, etc.

Reading your use case, I would say it will qualify as a high file-count environment: not huge, but enough that you can't dump everything into one directory forever without encountering problems.

If you can't get access to the TR, a couple of best-practice tips:

o) Keep the number of files per directory <10,000 – and much less (<1,000 is better) if possible.
o) Keep the subdirectory depth less than 5.

Oftentimes we see customers using a hashing algorithm on some metadata (filename, customer name, something that makes sense in your context) to construct a structure that is easy for the application, NetApp's WAFL, and the client software and protocol stack.
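As a rough illustration of that idea, here's a Python sketch (not taken from the TR; the two-level layout is just one way to stay under the limits above):

import hashlib
import os

def hashed_path(root, filename, levels=2):
    # Spread files over a fixed tree such as root/a3/7f/<file>, using two
    # hex characters of the name's hash per level (256 buckets each).
    # 5,000 files/day over ~65,536 leaf directories keeps every directory
    # far below 10,000 entries for years, with a depth well under 5.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return os.path.join(root, *parts, filename)

# The path is recomputable from the file name alone, so lookups need no index.
print(hashed_path("/vol/pics", "2f1d9c5e-7b1a-4c3d-9e8f-0a1b2c3d4e5f.jpg"))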
