volume direct access and indirect access

netappmagic · ‎2015-11-26

SVM1 has LIF1@node1 and LIF2@node2. If the client connects to LIF1 and the volume is also on NODE1, it gets direct access. However, if the client connects to LIF2 to access that volume, the traffic arrives at NODE2 and has to be forwarded to NODE1. This is indirect access, and indirect access will be slower.

I could use LIF1's IP instead of a DNS name if I know the volume is located on node1, but that lacks flexibility: I would have to remember which volume is on which node, and when a volume is moved, the IP would need to be changed.

My question is, what is a better way to avoid indirect access?

Thank you for sharing!

ontap 8.3

aborzenkov · ‎2015-11-26

What protocol do you use?

netappmagic · ‎2015-11-26

We mostly use NFS, with some CIFS as well. Thanks for the prompt reply.

aborzenkov · ‎2015-11-26

For NFS 4.x it is possible to use referrals or pNFS. Both require client support; referrals redirect the open request to the optimal node, while pNFS can dynamically change the optimal access path.

CIFS supports DFS referrals as well; similar to NFS it happens once on initial open.

You can find more information in the NFS or CIFS Management Guides.

But note that while indirect access may be slower, it does not automatically mean your clients actually notice it for your specific workload. So you should first try to estimate the impact of indirect access before trying to optimize for it.

netappmagic · ‎2015-11-26

What about NFSv3? We don't plan to upgrade to v4 anytime soon.

> your clients actually notice it for your specific workload

Do you mean the slowness caused by indirect access is small, and not as critical as optimizing other components?

Thank you!

aborzenkov · ‎2015-11-26

I am not aware of a solution for NFSv3, sorry (except purely administrative methods of fixed relationships between node/SVM/LIF).

netappmagic · ‎2015-11-26

Then, for instance, the VMware ESXi datastore will have to use LIF1's IP @node1 for the volume?

bobshouseofcards · ‎2015-11-26

On the indirect access slowness concern - yes that's exactly it. Optimize everything else first, as using the cluster backplane is already highly optimized in cDoT.

I had the same concerns and dove into this with a bunch of my NetApp resources. On one side I had folk saying not to worry about the backplane. On the other hand, NetApp put a bunch of redirection features into cDoT that would avoid the backplane. OnCommand Performance Manager includes the backplane as one factor in performance event analysis. Clearly there must be situations where the backplane could be a performance factor, otherwise why bother so much with the redirection and data analysis? So I wanted to know more.

The upshot is this: for all protocols the node that receives the request does all the appropriate logical processing to meet the request. Logical in this case means authentication lookups, mapping of the actual request to the right blocks to be read within a volume, etc. as needed for the request. For read requests, the node meets the request from any data blocks already in its memory cache. If more blocks are needed, the node figures out which blocks are needed. If those blocks are not local to the node, a read request for those blocks is sent along the cluster backplane to the node which does own the blocks. The owning node does the read as appropriate (it might also serve from cache on that node) and returns the data to the request processing node.

Writes are similar in nature with only the changed block information being passed along the backplane. The inter-node communication is highly optimized for just this type of communication. Of course, the cluster backplane is also flush with bandwidth - 20Gbps full duplex between any two nodes minimum, with 40Gbps aggregate recommended for bigger hardware.

So the entire client data request is not shuffled off to a specific node - only the backend disk block data along with cluster control information as needed. Where protocols are doing purely informational stuff, like establishing sessions and other housekeeping, very little has to pass along the backplane. The first node contacted just handles the request and any future requests for that session (assuming some other redirection method is not in play). Client communications are not proxied to another node via the backplane.
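
As a rough illustration of the read path described above, here is a toy Python model. It is only a sketch of the behaviour as this post describes it, not how cDoT is implemented: the `Node` class, the `read_block` function, and the block IDs are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node (not a real ONTAP object)."""
    name: str
    cache: dict = field(default_factory=dict)  # block id -> data held in memory
    disk: dict = field(default_factory=dict)   # block id -> data on aggregates this node owns


def read_block(receiving, owner, block_id):
    """Return (data, source) for a read that arrived on `receiving`.

    `owner` is the node that owns the aggregate holding the volume.
    """
    # The node that received the client request serves from its own cache first.
    if block_id in receiving.cache:
        return receiving.cache[block_id], "receiving node cache"
    # Direct access: the receiving node also owns the blocks.
    if receiving is owner:
        return owner.disk[block_id], "local disk"
    # Indirect access: only the block request crosses the cluster backplane;
    # the owning node may answer from its own cache or from disk.
    if block_id in owner.cache:
        return owner.cache[block_id], "backplane"
    return owner.disk[block_id], "backplane"


node1 = Node("node1", disk={7: b"payload"})
node2 = Node("node2")
print(read_block(node1, node1, 7)[1])  # local disk
print(read_block(node2, node1, 7)[1])  # backplane
```

The point of the model is that the client session stays with the receiving node in both cases; only the block fetch differs.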

Without a good LIF design - as in multiple LIFs spread across the cluster for file level accesses or zoning that limits LIF visibility to multi-pathing for block protocols - it is possible to create a situation where one node is doing most of the protocol work. Chances are you create a CPU utilization issue on that node before you hit any appreciable cluster backplane issue, though I have seen warnings on performance where the backplane was the cause of the slowdown. Remember also that volume moves that cross node boundaries use the cluster backplane. Similarly CIFS ODX style copies might use the backplane also.

cDoT has performance counters to track all the elements of a data request, from the client-facing network stack down to the disk performance. I don't know offhand which counters track the backplane, but they exist. If you don't use OPM you can certainly research the counters and collect such statistics manually.

FYI - I will happily take correction on the cluster backplane communication details where warranted. Or if there is something that can be clarified, please do. I've found it somewhat difficult to get really good technical information at this deep a level within cDoT. Everything above is my understanding of what I've been able to learn.

With regard to NFSv3 - as indicated in a previous reply, doing it manually to direct traffic to the right SVM/volume is the only real option. Create one data LIF per datastore and then set your ESX hosts to access that datastore through a particular IP address. Then when you move the datastore volume to a different node, move the LIF along with it. I recommend that you create a specific subnet (physically separate or VLAN) that is at least a /23 so you can have ~512 IPs/datastores. You might not scale that high in practice. I'm big on having plenty of breathing room.
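
To check the subnet arithmetic, Python's standard `ipaddress` module can count the addresses in a prefix. This is just a sizing sketch; the network `10.192.20.0/23` and the function name `datastore_subnet_capacity` are example values, not anything taken from a real configuration.

```python
import ipaddress


def datastore_subnet_capacity(cidr):
    """Return (total addresses, usable host addresses) for an IPv4 prefix."""
    net = ipaddress.ip_network(cidr)
    total = net.num_addresses
    # For ordinary subnets the network and broadcast addresses are not assignable.
    usable = total - 2 if net.prefixlen < 31 else total
    return total, usable


total, usable = datastore_subnet_capacity("10.192.20.0/23")
print(total, usable)  # 512 510
```

Keep in mind that the ESX hosts' own NFS interfaces need addresses in the same subnet, so the number of LIFs that fit is somewhat below the usable count.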

I hope this helps you.

Bob Greenwald

Lead Storage Engineer

Huron Legal | Huron Consulting Group

NCDA, NCIE - SAN Clustered, Data Protection

Kudos and accepted solutions are always appreciated.

netappmagic · ‎2015-11-27

Thanks, Bob, for such an in-depth analysis. I have two follow-ups.

>>NetApp put a bunch of redirection features into cDoT that would avoid the backplane.

"Redirection" seems not to go through the backplane, which is different from what I thought before. Can you please give me a few examples of cases where NetApp redirects traffic while avoiding the backplane?

>>I recommend that you create a specific subnet (physically separate or VLAN) that is at least a /23 so you can have

>>~512 IP's/datastores. You might not scale that high in practice.

Assume I have a 4-node cluster and 200 datastores. I should create a specific subnet, e.g. 10.192.20.x, and each datastore will then use one of the IPs. What about the cluster side? Should I create 4 LIFs, one per node, or 200 LIFs, 50 per node, to match each datastore? All 400 IPs are in the same subnet/VLAN.

Thanks again, and looking forward to your reply.

bobshouseofcards · ‎2015-11-29

"Redirection" from a client's perspective, involving the initial request, takes a couple of forms depending on protocol. All "redirections" assume that an SVM has multiple data LIFs spread across the nodes that might contain a volume.

CIFS redirections, called "Auto Location" in cDoT, happen at the "base" of a share. Consider a client accessing \\SVM01\Share\Folder\File. The "base" share in this UNC is \\SVM01\Share. Assuming that SVM01 has multiple IPs and they are defined in DNS, a client's initial access to the share could come in on any node. If the volume is on another node, then cDoT can send a DFS style redirection request using the IP address of the LIF defined on the node where the volume exists (say that three times fast). The client can then send future requests directly to the node where the volume lives.

The limitation of Auto Location is that it works only using the base share where data is accessed. If you use junction points and link together a bunch of volumes in a tree like structure, you could logically navigate to a different volume/node combination as you traverse the tree. Auto Location, if triggered, only happens once using the base share no matter how far down the tree the initial access might be. As a result it is possible to defeat this redirection through a poor volume junction point/share structure. Consider a "base share" that contains nothing but junction points to several hundred other volumes. No matter where those other volumes live in the cluster, Auto Location would redirect all client access to the one node that owns the volume where the base share is defined. This design actually aggregates all client access to a single node, making one node do most of the CIFS work. I inherited this exact design. Even with load-sharing mirrors to break that issue, the design guarantees that on average 75% of all client accesses go to the wrong node first in my primary file sharing cluster. This condition is my motivation for digging into all of this backplane/redirection stuff.
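
The base-share limitation can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not NetApp code: the helper names (`base_share`, `referral_node`), the share and volume names, and the node layout are all hypothetical.

```python
def base_share(unc_path):
    r"""Split a UNC path such as \\SVM01\Share\Folder\File into (server, share)."""
    parts = [p for p in unc_path.split("\\") if p]
    if len(parts) < 2:
        raise ValueError(f"not a share path: {unc_path!r}")
    return parts[0], parts[1]


def referral_node(unc_path, share_volume, volume_node):
    """Node a client would be redirected to, judged on the base share only."""
    _, share = base_share(unc_path)
    return volume_node[share_volume[share]]


# Hypothetical layout: the share "Root" lives on vol_root (node1) and holds a
# junction to vol_eng, which lives on node3.
share_volume = {"Root": "vol_root"}
volume_node = {"vol_root": "node1", "vol_eng": "node3"}

# The file sits in vol_eng on node3, but the referral still points at node1.
print(referral_node(r"\\SVM01\Root\eng\design.doc", share_volume, volume_node))  # node1
```

Because only the first two path components are considered, every path under the same base share resolves to the same node, however deep the junctions go.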

NFS redirections are supported under the Parallel NFS (pNFS) capability when using NFS 4.1. Obviously client support is needed. pNFS has a path discovery protocol to allow direct access to the node which controls the underlying data storage while communicating with a central node for metadata and management. I can't speak more than conceptually about it as I have yet to implement it in any real-world scenarios.

Block protocols - both iSCSI and FCP - are similar to pNFS in that for both there are path discovery mechanisms - ALUA, multi-path software, etc. - to discover all the paths that might access a LUN. The path discovery mechanism then chooses the best one available. The "Optimal" path is one directly to the node where the LUN (volume) resides. Multi-path mechanisms are really just another form of access path redirection.
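
Conceptually, the path selection for block protocols looks something like the following Python sketch. It is only an illustration of the idea, not real multi-path software: the `Path` class, the function names, and the LIF names are invented, and the two state labels follow the usual ALUA wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    lif: str   # hypothetical LIF name the path terminates on
    node: str  # node currently hosting that LIF


def classify_paths(paths, lun_owner):
    """Label each path the way ALUA would report it to the host."""
    return {
        p.lif: "active/optimized" if p.node == lun_owner else "active/non-optimized"
        for p in paths
    }


def preferred_paths(paths, lun_owner):
    """Prefer paths to the owning node; fall back to all paths if none exist."""
    optimized = [p for p in paths if p.node == lun_owner]
    return optimized or list(paths)


paths = [Path("iscsi_n1", "node1"), Path("iscsi_n2", "node2")]
print(classify_paths(paths, "node2"))
print([p.lif for p in preferred_paths(paths, "node2")])  # ['iscsi_n2']
```

When the volume holding the LUN moves, the labels change and the host shifts I/O to the new optimized path without any change to the host configuration.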

All of these redirections occur between the client and storage with the intent to have the client send data protocol requests directly to the node that "owns" the volume at any given time.

With respect to your ESX datastore follow-up - the basic idea is one LIF per volume containing datastores. These are on top of any LIFs for general cluster management, node management, and SVM management. So in a four node cluster, you have a cluster management LIF. You have at least one LIF assigned to each "node" SVM for management. You've created a data SVM to hold the user data. That SVM will need a management LIF (best practice). And now you're going to add 200 datastores. The assumption is one datastore per volume, so also 200 volumes here.

For each datastore volume, create a LIF. The LIF's "home" node should be the node which owns the aggregate where the volume is created. The home port can be any appropriate port. If using VLANs, I suggest a separate VLAN for these "datastore" LIFs as compared to management and more general purpose LIFs. It makes housekeeping easier.

This design may mean that you spread out the datastore LIFs 50 per node, but only if you spread out the datastore volumes 50 per node. If you put 110 volumes on one aggregate, then the node that owns that aggregate will also host the 110 matching LIFs. Remember that LIFs are fluid - you can migrate them, transparently to clients, to any available node/port that supports the LIF's network segment. So as you move datastore volumes around (assuming you do), move the LIF to the proper node at the same time as the volume. When a volume move is a permanent choice, I suggest you both move the LIF and update its permanent home node/port. Temporary conditions, such as planned or unplanned node failover, do not require a permanent change to the LIF, as everything will go back to normal after giveback.

A key concept is that LIFs are not network ports. Ports are static on a node. LIFs can live on top of any port that has the right broadcast domain connectivity. Assume each node in the cluster has an interface group a0a. VLAN 100 is available on each interface group at the switch. So you create a VLAN port a0a-100 on each node. The "port" for all the datastore LIFs might always be "a0a-100". The "node" for each datastore LIF will match the node where the volume lives.
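
The bookkeeping for this one-LIF-per-datastore design can be sketched in Python. It is a planning aid only, under assumed names: the `lif_<volume>` naming scheme, the helper functions, and the sample volumes are hypothetical, and the port `a0a-100` is the example VLAN port from the paragraph above.

```python
from collections import Counter


def plan_lifs(volume_node, port="a0a-100"):
    """One LIF per datastore volume, homed on the node that owns the volume."""
    return {
        f"lif_{volume}": {"volume": volume, "home_node": node, "home_port": port}
        for volume, node in volume_node.items()
    }


def lifs_per_node(plan):
    """Count how many datastore LIFs are homed on each node."""
    return Counter(lif["home_node"] for lif in plan.values())


def stale_lifs(plan, volume_node):
    """LIFs whose home node no longer matches where their volume lives."""
    return sorted(
        name for name, lif in plan.items()
        if volume_node[lif["volume"]] != lif["home_node"]
    )


volumes = {"ds001": "node1", "ds002": "node1", "ds003": "node2", "ds004": "node3"}
plan = plan_lifs(volumes)
print(lifs_per_node(plan))

# After a volume move, the matching LIF should be migrated and re-homed too.
volumes["ds001"] = "node2"
print(stale_lifs(plan, volumes))  # ['lif_ds001']
```

The LIF count per node simply follows the volume placement, and a check like `stale_lifs` flags any LIF that was left behind after a volume move.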

Technical Reports TR-4067 and TR-4068 have a ton of best practice and background information on NFS scenarios and NFS with ESX.

I hope this helps you.

heightsnj · ‎2015-12-01

Hello Bob or anybody else

Following on from the discussion, a few more questions come to mind, if you are still checking the thread.

1. Is there any way to tell how much access has been direct versus indirect, or which accesses were which?

2. When I check OCPM on the interconnect LIFs, I cannot pull any data on latency and IOPS; it shows "no data to display", although I can pull graphs on MBps. Why? "No data to display" for latency might be plausible, since there may be no measurable latency when the speed is very fast. But no data on IOPS? Would that be possible?

3. With one LIF per datastore, how do we handle redundancy?

Hope you can shed some light here again.

bobshouseofcards · ‎2015-12-02

To your questions:

1. Yes, there is a way to measure the level of indirect access. Clearly, because OPM can tell you if there are issues on the cluster LIFs, the data must be present. However, to my knowledge, that level of data is not exposed in the OPM views available to the user by direct access to the database. Hence, your options are to dig into the underlying internal database structure of OPM, because the data must be there, or to query the internal cDoT counters directly with the ZAPI performance collection APIs. After all, that's just what OPM does. I cannot tell you which counters apply simply because I haven't dug through the counter list. You can get a list of the available counters through ZAPI performance calls - I believe the total is well over 15000 these days.

2. I'm assuming you refer to the "cluster" LIFs in this part of your question (as opposed to "intercluster" LIFs). Again this gets back to the counters that cDoT collects, and the nature of the measurement. An IOP only has relevance to the requester. For instance, an IOP to disk is how many unique data accesses there are. An IOP to an end user is how many requests the client system makes. They are not necessarily the same. Remember the cluster LIFs, when used for indirect access, are used internal to cDoT. How does an IOP from one node to another node relate to a specific end user IOP? If a SnapMirror to a volume between nodes, a volume move between nodes, and end user data access to that same volume are all going at the same time, there could be a lot of IOPs on the backplane. Logically, cDoT is optimized to group together such moves of data/blocks where possible. Hence tying backplane IOPs to user workload would be a difficult exercise. The same goes for latency. The measurement for "busy" on the backplane is purely throughput - how full is the backplane.

If the processing node is waiting for data from another node, thereby slowing down the user request, it's either going to be because the "remote" node is slow getting the data, or because the cluster backplane is full (throughput) so it's taking a while to get the data onto the backplane. The actual performance per op is already massively optimized by design - hence the operation count and latency per operation really are meaningless numbers in performance analysis. It might be available way down in the counters, but I'm not surprised it isn't bubbled up to the standard user tools.

Standard caveat applies to this discussion - I derive my understanding of the backplane and cluster LIF from public resources, NetApp technical presentations, and some specific digging and discussion with my NetApp contacts. The conclusions and descriptions above are my own, and I will gladly take correction from anyone with better information.

3. Redundancy comes down to a point I made in an earlier response. LIFs are not ports. Network ports provide redundancy. File protocol LIFs do not and don't need to. Multiple LIFs on file protocols are used to increase total potential bandwidth, not redundancy. Remember that a network interface that supports NFS or CIFS can migrate transparently between network ports. So if I have one LIF and the network port which hosts the LIF goes down, the LIF will migrate to another port within the failover group defined for that port. Redundancy achieved.

Key for file protocols is that a client connection is made to a specific end-point. That end-point has to remain stable - it is singular (even in pNFS the control connection is singular). A CIFS or NFS client does not connect to multiple paths to the same endpoint, it only uses the one named connection.

Block protocol LIFs (FC, iSCSI) do not migrate. Redundancy for block relies on multi-path software at the client and does require two separate LIFs distributed between cluster nodes to achieve full redundancy. Since there already exist semantics at the client end to understand and optimally use multiple connections, even multiple to the same endpoint, LIF migration for block protocols serves no purpose and isn't implemented. In the FC case it would be almost impossible to do it well in all use cases simply because of FC switch zoning mechanisms. Simpler to just not bother with LIF migration. In the block case, multiple links provide both redundancy and bandwidth.

In fact, I ran into the file protocol LIF redundancy situation just yesterday. Due to something wonky with a port-channel to one of my nodes, the entire port-channel to node 1 went down. In this particular setup node 1 owns most of the ESX workload to NFS based datastores, 1 LIF per datastore. When the primary networking went down on node 1, all the file protocol LIFs failed over to the node 2 network ports. Client ESX servers just went on. End user CIFS connections just went on. In fact, during the corrective action the port-channel on node 1 was brought down and back up several times. The LIFs bounced back and forth between nodes a couple of times during the fix. All good.
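
The file protocol failover behaviour in point 3 can be modelled roughly as below. This Python sketch only illustrates the idea of a failover group; the function name, the `node:port` strings, and the simple first-available-port rule are assumptions for the example, not the actual cDoT failover policy logic.

```python
def failover_target(current_port, failover_group, port_is_up):
    """Pick the port that should host a file protocol LIF.

    The LIF stays where it is while its port is up; otherwise it moves to the
    first healthy port in its failover group. Returns None if nothing is up.
    """
    if port_is_up.get(current_port, False):
        return current_port
    for port in failover_group:
        if port_is_up.get(port, False):
            return port
    return None


# Hypothetical failover group spanning two nodes on the same VLAN.
group = ["node1:a0a-100", "node2:a0a-100"]

# The port-channel on node 1 goes down, so the LIF lands on node 2.
status = {"node1:a0a-100": False, "node2:a0a-100": True}
print(failover_target("node1:a0a-100", group, status))  # node2:a0a-100
```

The client keeps talking to the same IP address throughout, which is why a single LIF per datastore is still redundant as long as its failover group contains healthy ports.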

I hope this helps you.

netappmagic · ‎2015-12-04

Hi Bob,

Thank you so much for your excellent analysis. We appreciate your sharing and contribution very much!