I have a problem with one of our NetApp filers routing traffic out the wrong interface in some specific instances, and having spent a few days banging my head against the first layer of NetApp support, I thought I'd check here to see if anybody has a good idea or has run into this issue before (I did find another thread about iSCSI traffic going out the wrong interface that sounded similar to my problem, but no solution).
Our setup is a pair of 3240s, netapp-a and netapp-b, which are configured identically (theoretically, at least). They each have e0M interfaces configured, and a multimode vif with a bunch of vlans trunked down. NFS traffic is not routed at all here, so all hosts that mount must have an interface on one of the vlans trunked down.
95% of the time this works fine. However, we have a number of Solaris 10 hosts that cannot mount filesystems from netapp-b but can from netapp-a. When I watch the traffic via snoop and pktt, the mount dialog starts, and at some point netapp-b starts responding on e0M:
pktt on vif-100
18:12:55.396707 vlan 100, p 0, IP nfsclient.ourdomain.org.945 > netapp-b-100.ourdomain.org.nfs: Flags [S], seq 4057553784, win 32781, options [mss 1460,nop,nop,TS val 379612810 ecr 0,nop,wscale 6,nop,nop,sackOK], length 0
18:12:55.396755 vlan 100, p 4, IP netapp-b-100.ourdomain.org.nfs > nfsclient.ourdomain.org.945: Flags [S.], seq 2382367927, ack 4057553785, win 65535, options [mss 1460,nop,nop,sackOK,nop,wscale 1,nop,nop,TS val 204358559 ecr 379612810], length 0
18:12:55.396949 vlan 100, p 0, IP nfsclient.ourdomain.org.945 > netapp-b-100.ourdomain.org.nfs: Flags [.], ack 1, win 32783, options [nop,nop,TS val 379612810 ecr 204358559], length 0
18:12:55.397250 vlan 100, p 0, IP nfsclient.ourdomain.org.1322256560 > netapp-b-100.ourdomain.org.nfs: 112 null
18:12:55.397289 vlan 100, p 4, IP netapp-b-100.ourdomain.org.nfs > nfsclient.ourdomain.org.1322256560: reply ok 32 null
18:12:55.397412 vlan 100, p 0, IP nfsclient.ourdomain.org.945 > netapp-b-100.ourdomain.org.nfs: Flags [.], ack 37, win 32783, options [nop,nop,TS val 379612810 ecr 204358559], length 0
18:12:55.397509 vlan 100, p 0, IP nfsclient.ourdomain.org.945 > netapp-b-100.ourdomain.org.nfs: Flags [F.], seq 117, ack 37, win 32783, options [nop,nop,TS val 379612810 ecr 204358559], length 0
18:12:55.397520 vlan 100, p 4, IP netapp-b-100.ourdomain.org.nfs > nfsclient.ourdomain.org.945: Flags [.], ack 118, win 33580, options [nop,nop,TS val 204358559 ecr 379612810], length 0
18:12:55.397526 vlan 100, p 4, IP netapp-b-100.ourdomain.org.nfs > nfsclient.ourdomain.org.945: Flags [F.], seq 37, ack 118, win 33580, options [nop,nop,TS val 204358559 ecr 379612810], length 0
18:12:55.397663 vlan 100, p 0, IP nfsclient.ourdomain.org.945 > netapp-b-100.ourdomain.org.nfs: Flags [.], ack 38, win 32783, options [nop,nop,TS val 379612810 ecr 204358559], length 0
18:12:55.399980 vlan 100, p 0, IP nfsclient.ourdomain.org.50151 > netapp-b-100.ourdomain.org.sunrpc: UDP, length 56
18:13:10.402218 vlan 100, p 0, IP nfsclient.ourdomain.org.50151 > netapp-b-100.ourdomain.org.sunrpc: UDP, length 56
At the end you can see packets coming in from the client and nothing going out. However, a pktt on e0M shows the replies leaving there instead:
18:12:55.396849 IP netapp-b-100.ourdomain.org.sunrpc > nfsclient.ourdomain.org.50151: UDP, length 28
18:13:10.399088 IP netapp-b-100.ourdomain.org.sunrpc > nfsclient.ourdomain.org.50151: UDP, length 28
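For anyone who wants to check their own traces for the same pattern, here is a minimal Python sketch that compares text captures from two interfaces and reports requests that were answered on a different interface. It assumes tcpdump-style text lines like the ones above; the function names, the `filer_prefix` argument and the `CAPTURES` sample are made up for illustration and are not part of pktt or any NetApp tool.

```python
import re

# Pulls "src > dst" out of a tcpdump-style text line such as
# "... IP client.50151 > filer.sunrpc: UDP, length 56".
FLOW_RE = re.compile(r"IP (\S+) > (\S+?):")


def flows(lines):
    """Return the set of (src, dst) endpoint pairs seen in one capture."""
    seen = set()
    for line in lines:
        match = FLOW_RE.search(line)
        if match:
            seen.add((match.group(1), match.group(2)))
    return seen


def replies_on_wrong_interface(captures, filer_prefix):
    """captures maps an interface name to its capture lines.

    A request is any packet whose destination starts with filer_prefix
    (an assumption about how the filer shows up in the trace). Returns
    sorted (request_if, reply_if, client, filer) tuples for requests whose
    reply is missing on the request interface but present on another one.
    """
    per_if = {name: flows(lines) for name, lines in captures.items()}
    found = []
    for req_if, req_flows in per_if.items():
        for src, dst in req_flows:
            if not dst.startswith(filer_prefix):
                continue  # only client -> filer packets count as requests
            reply = (dst, src)
            if reply in req_flows:
                continue  # answered on the interface it arrived on
            for other_if, other_flows in per_if.items():
                if other_if != req_if and reply in other_flows:
                    found.append((req_if, other_if, src, dst))
    return sorted(found)


# Sample input trimmed from the traces in this post.
CAPTURES = {
    "vif-100": [
        "18:12:55.397250 vlan 100, p 0, IP nfsclient.ourdomain.org.1322256560 > netapp-b-100.ourdomain.org.nfs: 112 null",
        "18:12:55.397289 vlan 100, p 4, IP netapp-b-100.ourdomain.org.nfs > nfsclient.ourdomain.org.1322256560: reply ok 32 null",
        "18:12:55.399980 vlan 100, p 0, IP nfsclient.ourdomain.org.50151 > netapp-b-100.ourdomain.org.sunrpc: UDP, length 56",
    ],
    "e0M": [
        "18:12:55.396849 IP netapp-b-100.ourdomain.org.sunrpc > nfsclient.ourdomain.org.50151: UDP, length 28",
    ],
}

if __name__ == "__main__":
    for hit in replies_on_wrong_interface(CAPTURES, "netapp-b-100."):
        print(hit)
```

With the sample data, only the sunrpc UDP exchange is flagged; the NFS null call is answered on vif-100 and is skipped.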
The default route on the filer is e0M, but also there is a network routing table entry for that subdomain, as well as a routing table entry for the client itself.
When I first ran into this issue I thought it was a client-OS issue, since it only seems to be affecting our Solaris 10 hosts; all of our Linux hosts seem to be fine. But given that this is only happening on one of our filers, and the traffic is obviously going out the wrong interface on the broken one, I'm thinking this is a problem on the NetApp side, either a bug or a configuration error.
Is options ip.fastpath enabled? When enabled, it sends traffic out the interface it came in on (not e0M). However, Solaris mount requests use UDP by default (even for a TCP mount), so they could get routed if there is no route. It sounds like you have a route set, though, so the default route of e0M shouldn't be used in that case.
Do you have a non-default route that would be used for the return traffic? Is fastpath enabled? Is e0M on the same subnet as the NFS interface?
Yes, fastpath is enabled on both heads (one of which is working, one of which has this problem). And yes, there are entries in the routing table for both the network and the host, which are being ignored. And no, the network e0M is on is different from the network in question.
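To make the "the default route shouldn't come into play" argument concrete, here is a minimal sketch of an ordinary longest-prefix-match route lookup in Python. It illustrates generic IP routing, not how Data ONTAP implements it, and the addresses in `ROUTES` are hypothetical documentation addresses, not the real ones from this setup.

```python
import ipaddress


def pick_route(routes, destination):
    """Return (prefix, interface) of the most specific matching route.

    routes is a list of (cidr_prefix, interface_name) pairs. The route
    with the longest prefix that contains the destination wins, so a
    host route beats a network route, which beats the default route.
    """
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, iface in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, iface))
    if not matches:
        return None
    network, iface = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), iface


# Hypothetical table shaped like the one described in this thread:
# default via e0M, plus a network route and a host route via the vif.
ROUTES = [
    ("0.0.0.0/0", "e0M"),
    ("192.0.2.0/24", "vif-100"),
    ("192.0.2.50/32", "vif-100"),
]

if __name__ == "__main__":
    print(pick_route(ROUTES, "192.0.2.50"))    # the NFS client
    print(pick_route(ROUTES, "198.51.100.7"))  # some off-net host
```

Under these assumptions the client always matches the host route on the vif, and only destinations with no more specific entry fall through to e0M.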
Seems like a bug to me, and very similar to what is reported here:
However, given the network design here (not my design) I cannot move the default route to another interface (none of the other interfaces are on vlans that route, yet smtp/snmp/http traffic goes out the mgmt interface and is routed). And really, that just seems like avoiding the real issue, unless I am misunderstanding how routing works. I'd personally rather figure out the config/network/bug issue that is causing the traffic to be mis-routed.
That was a good thread with the same type of issue, but with iSCSI, and changing the default route seemed to be the fix. It was odd, though, since it was a different subnet with a route, and he had also blocked iSCSI over e0M. You could flush and recreate the routes (quick outage) to see if that helps, or do a takeover/giveback. The odd part is that it sounds like both nodes are set up the same, yet it only occurs on one node. It does sound like a bug if everything is the same.
Yeah, it has seemed like a bug to me, but NetApp keeps claiming that the issue is that I have my default route set to e0M (my counter-claim is that the default route shouldn't come into play at all). Unfortunately, I can't really shut down NFS on that interface, although I'm tempted to come in this weekend during our outage window just to try that. If I block NFS on that interface and the problem still persists (i.e. NFS packets still try to go out e0M), then at least I will know I'm in the same boat as the other thread, although no matter what the test reveals, it doesn't help me come up with a permanent solution.
I don't like what I see in the other thread with iSCSI and in this thread with NFS: that blocking the protocol on e0M doesn't stop the protocol. And when a route exists, traffic shouldn't go to the default. It's almost like the UDP response ignores routing. If you had multiple data subnets, each subnet should respond, not e0M.
I'd escalate. Let us know what the answer is, or the bug number when they create the BURT for it.
The only workaround I can think of is to create vFilers with ipspaces and leave e0M by itself on vfiler0. Then your vif can be the default in the vFiler. Brute force, but it should fix it.
Is the Solaris host on the same sub-net as the filer? I suspect not, as otherwise the filer would not attempt to route the mountd UDP reply.
Are you using VLAN tagging on the filer and the client interfaces? Has the network side confirmed that they are all on the same VLAN?
I personally would try adding a static route, if filer and host are not on the same sub-net.
As a point, if you specify the option proto=tcp in the mount request, Solaris should do the mountd portion of the connection over TCP as well, but that is not to say that there would not be asymmetry in the route.
To answer your questions:
1) Yes, the solaris hosts are on the same vlan as the filer. I keep trying to explain to NetApp support that these packets shouldn't be routing at all, but they don't seem to believe me.
2) Yes, I am using vlan tagging on the filer.
3) The route table shows the correct entries for both the client (Solaris box) and the subnet it is on (the filer also has an interface on that net).
4) I have tried using proto=tcp as an arg to mount, with no change in behavior by the filer.
I'll post specifics (route table, ifconfig, etc) when I get my VPN back up and functional.
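The "these packets shouldn't be routing at all" point can be checked mechanically: if the client's address falls inside the subnet of one of the filer's own interfaces, the reply is on-link and needs no gateway. A small Python sketch, with hypothetical interface addresses standing in for the real ifconfig output:

```python
import ipaddress


def on_link_interface(interfaces, client):
    """Return the name of the interface whose subnet contains client.

    interfaces maps an interface name to its address in CIDR form.
    Returns None when the client is not directly connected, meaning the
    reply would have to be sent through a gateway.
    """
    addr = ipaddress.ip_address(client)
    for name, cidr in interfaces.items():
        if addr in ipaddress.ip_interface(cidr).network:
            return name
    return None


# Hypothetical addressing: e0M on a management net, vif-100 on the NFS vlan.
INTERFACES = {
    "e0M": "198.51.100.5/24",
    "vif-100": "192.0.2.10/24",
}

if __name__ == "__main__":
    print(on_link_interface(INTERFACES, "192.0.2.50"))   # client on vlan 100
    print(on_link_interface(INTERFACES, "203.0.113.9"))  # off-net host
```

If the client resolves to the vif here, any reply leaving via e0M contradicts both the connected subnet and the routing table.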
Sorry for the delay in response; I had to put this issue aside for a bit, but picked it back up again about a week ago (broke my hand, and I tell you, being a sys admin with one hand to type with is a PITA). Anyway, I had temporarily mitigated this issue back in December by setting the interfaces.blocked.nfs option to "e0M". As soon as I did that, the asymmetrical routing went away, and all of our clients, regardless of OS, were able to mount all volumes just fine.
I opened up a new ticket with NetApp last week (they auto-closed it back at the end of Dec, when I had my accident). In the meantime, they had auto-removed all of the packet traces and other data I had uploaded for the original ticket, so I planned on clearing the interfaces.blocked.nfs option, and repeating my pktt/tcpdumps. I did this on Saturday, and for the life of me the problem just does not happen any more. I've been going since Saturday night with that option off, I'm back to mounting vol0 on my management host over e0M, and life seems fine. What changed between Dec and now? Nothing- I haven't touched a thing, outside of creating a new volume or two. No networking changes, no upgrades, no reboots/failovers, nada.
So at this point I guess I can't really say much. I've closed the ticket with NetApp, and I'll keep an eye on whether or not this crops up again. If it does, I'll try to remember to follow up to this post.
I suspect it may be the default route.. netstat -rn will show the default route likely set to the interface you want and e0M was down so it couldn’t attach the route… a reboot may change that. ONTAP binds the route to an interface. I just had a head swap last weekend where we had an odd route tied to the wrong interface… the right interface on the same network as the route was just down at the time of the route add… when the interface was brought up the route didn’t work to that interface until we deleted and re-added the route.
So, are you saying that netstat -rn would show the route going out interface X, but instead the traffic would go out interface Y? Because that's what I was seeing: the routing table looked perfect, and in fact the first 6 packets or so coming out of the filer would go out the correct interface, but all of a sudden the packets would switch over to going out e0M (with no change to the routing table).
I upgraded the OS on the filers back in mid-December, and verified the problem existed after the upgrade. The filers have not been rebooted since then (so any changes to the interfaces.blocked.nfs option have been since the last reboot). That said, I should plan on a reboot this weekend just to see if the problem comes back. Good idea.
Message was edited for spelling by: Chad Lake
OK, I finally got a chance to reboot the filer, and the problem did not re-emerge. I tell you, I don't think I've ever been so disappointed to have everything working like it should....
Agreed. Consistency, even when broken, is always better to know. netstat -rn is key to looking at things. In the last 2 weeks I have had a few customers with network issues where their secondary vifs (ifgrps) did not come up and the routing was tied to their other vifs. The only way to fix it is to bring up the correct vif, then delete and re-add the same route (or route -f or a reboot, but that is disruptive) to get it working. A different issue, but similar in how routes bind to an interface.
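As a rough illustration of the "route binds to an interface at add time" behavior described in this thread, here is a toy Python model. It is purely a sketch of the behavior as described in these posts, not ONTAP's actual implementation, and the class and method names are invented.

```python
class ToyRouteTable:
    """Toy model: a route sticks to whichever interface was up when added."""

    def __init__(self):
        self.up = set()    # interfaces that are currently up
        self.routes = {}   # destination -> interface bound at add time

    def set_up(self, iface):
        # Bringing an interface up does NOT rebind existing routes.
        self.up.add(iface)

    def add_route(self, dest, candidates):
        """Bind dest to the first candidate interface that is up right now."""
        for iface in candidates:
            if iface in self.up:
                self.routes[dest] = iface
                return iface
        raise RuntimeError("no candidate interface is up")

    def delete_route(self, dest):
        self.routes.pop(dest, None)


if __name__ == "__main__":
    table = ToyRouteTable()
    table.set_up("vif-b")                        # intended vif-a is still down
    table.add_route("default", ["vif-a", "vif-b"])
    table.set_up("vif-a")                        # comes up later
    print(table.routes["default"])               # still bound to vif-b
    table.delete_route("default")
    table.add_route("default", ["vif-a", "vif-b"])
    print(table.routes["default"])               # rebound after delete and re-add
```

In this model the route only moves to the intended interface after the delete and re-add, which matches the fix described above.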
When I was having problems the routing table always appeared correct. And even the first part of the mount request would go over the correct interface (~10-12 packets or so) until abruptly switching to going out over e0M (with no change in the routing table).
If it crops up again I'll try to remember to come back to this thread with an update. Thanks everybody for their help with this!