Software Development Kit (SDK) and API Discussions

Re: Malformed XML exceptions (how to handle)

aashray

And which SDK version are you using? I'd like to test and find a solution using the same SDK.


POOJA_HP_2013

Hi Coon,

Sorry for the misunderstanding.

Yes, we have observed this exception only with 7.3.4 as of now, as per the case comments. I can confirm this and let you know by tomorrow.

As I mentioned earlier, since this is not reproducible in our local set-up, we are not sure whether it also occurs with other versions.

Kindly let me know if this answers your query.

Regards,

Pooja

coon

Please clarify what you mean by "not reproducible in our local set-up".

You tried against another 7.3.4 instance of Data ONTAP and did not experience the problem?

Did it have enough volumes, snapshots, aggregates (whatever was queried) for the response to exceed 1 MTU? If you open your packet trace in Wireshark and set a filter for XML (this will limit the output to just the packets that are SDK responses), then sort by the length column, you'll see that the packets of length 1513 are incomplete. Select XML in the middle box and the raw output is in the bottom box; with XML highlighted, the bottom box should show you the hex and ASCII of the ZAPI response being truncated.

Your reproduction attempts would need to make sure you have enough objects that the response is larger than 1 MTU (or the TCP MSS).
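
For what it's worth, here is a rough Python sketch of that size check, in case it helps when sizing a repro environment. It assumes a standard 1500-byte Ethernet MTU with 20-byte IPv4 and 20-byte TCP headers and no options (so an MSS of 1460). The function names are made up for illustration, and the sample XML is only a stand-in, not real ZAPI output.

```python
import math
import xml.etree.ElementTree as ET


def tcp_mss(mtu: int = 1500, ip_header: int = 20, tcp_header: int = 20) -> int:
    """Largest TCP payload per segment, assuming no IP or TCP options."""
    return mtu - ip_header - tcp_header


def segments_needed(payload_len: int, mss: int) -> int:
    """Number of TCP segments needed to carry payload_len bytes."""
    return max(1, math.ceil(payload_len / mss))


def is_well_formed(xml_bytes: bytes) -> bool:
    """True if the bytes parse as a complete XML document."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # Stand-in response body; element names are illustrative only.
    item = b"<volume-info><name>vol</name></volume-info>"
    response = b"<netapp><results status='passed'>" + item * 100 + b"</results></netapp>"

    mss = tcp_mss()
    print("bytes:", len(response), "segments:", segments_needed(len(response), mss))
    print("complete response well-formed:", is_well_formed(response))
    # Simulate a response cut off after the first segment.
    print("first segment only well-formed:", is_well_formed(response[:mss]))
```

If the response fits in a single segment in your lab, the repro would not be exercising the multi-packet path at all, which may be why it never fails there.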

POOJA_HP_2013

Hi Coon,

We don't have a 7.3.4 set-up available in our labs. However, we did try with other versions (7.3.1, 7.3.6, 8.0, etc.) but were not able to reproduce it. We are not sure whether it depends on the Data ONTAP version.

Thanks for providing the steps to open the trace packets in Wireshark. I may need to check the configuration details of our lab set-ups to figure out whether we have a big enough configuration to generate responses larger than 1 MTU. I will let you know the details as soon as I have them.

Please do let me know if there is any other information which is required from our end.

Regards,

Pooja

coon

I got a request yesterday from someone that I wanted to relay to you, along with some more details on why it's so difficult for you and us to reproduce this.

Request: Can you run the apitest command again while gathering a packet trace like you did before (to catch an error), but this time simultaneously run a trace on the client running the apitest command? We'd want to be running packet tracing on both sides.

Explanation: We notice that the responses that have problems generally get caught in a bunch of TCP retransmissions. We'd like to have both sides of that conversation (the client issuing the ZAPI command and the Data ONTAP packet trace) in order to understand better what is causing the TCP retransmissions. I suspect that the reason we're not able to duplicate this elsewhere is that the repro environments we're all trying this in have clean networks (no retransmissions). Perhaps you could have your network team investigate the cause of those retransmissions; if they are the reason you're experiencing this, fixing them might resolve the issue completely.

coon

I'm going to list our current suggestions/questions on this:

  1. Collect packet traces from both sides of a failed/truncated API call. Goal: Understand better why the calls that fail often show network retransmissions, and whether those have anything to do with why they fail.
  2. Clarify what you've attempted to reproduce. Exact same hardware? Same network topology? Same API calls that fail?
  3. Clarify what fails in the environment. You have provided here some apitest output that fails with truncated XML. Is that reproducible at will? Does that happen every time you make that API call to that controller? As I understand it, you see errors in your production code in your environment, but we're not certain whether those are the same XML truncation errors, right?
  4. We'd like to understand whether the failures you have observed show any consistency around the actual API calls made. Let me explain why, and I think you'll understand what we're asking. We're trying to tell whether this is a networking problem of some kind (and please know that when I use the term "network" it includes the NIC in the storage controller, the driver for that NIC, and all of the software in Data ONTAP that does network processing) or a problem in the code within Data ONTAP that creates the XML response itself. If it's a networking code problem, it'll happen for all different APIs and not discriminate on which one you call. If it's the part of the code that generates the response, it'll always be the same API calls. I know some API calls are going to generate larger responses and some smaller, but we'd like to know if you are able to discern any patterns from the failures you've seen. This is also related to #3. If aggr-list-space-info succeeds half of the time and fails half of the time, or if it succeeds 95 times and fails 5 times for the exact same length response, then it is more likely a dependent resource causing the problem (like memory within Data ONTAP or the networking code).
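
To make item 4 a bit more concrete, here is a minimal Python sketch of the kind of tally we have in mind. It assumes you can extract (API name, succeeded) pairs from your own application logs; that record format and the function names are hypothetical, and the API names in the usage example are just examples.

```python
from collections import defaultdict


def failure_summary(records):
    """Tally calls and failures per API.

    records: iterable of (api_name, succeeded) pairs taken from your own logs.
    """
    counts = defaultdict(lambda: {"calls": 0, "failures": 0})
    for api_name, succeeded in records:
        counts[api_name]["calls"] += 1
        if not succeeded:
            counts[api_name]["failures"] += 1
    return {
        api: {**c, "failure_rate": c["failures"] / c["calls"]}
        for api, c in counts.items()
    }


def failing_apis(summary):
    """Names of the APIs that failed at least once, sorted."""
    return sorted(api for api, c in summary.items() if c["failures"] > 0)


if __name__ == "__main__":
    # Made-up data: one API fails 5 times out of 100, another never fails.
    records = (
        [("aggr-list-space-info", True)] * 95
        + [("aggr-list-space-info", False)] * 5
        + [("volume-list-info", True)] * 10
    )
    summary = failure_summary(records)
    for api, stats in summary.items():
        print(api, stats)
    print("APIs with failures:", failing_apis(summary))
```

This is only a rough heuristic. Failures spread across many different APIs would point toward the network path, while failures confined to the same one or two calls would point toward response generation, and an intermittent failure rate on a single call with the same response length would point toward a dependent resource.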

Hopefully this makes clear what we'd need in order to investigate this further.

Thanks!
