Problems while collecting data for Switches

sanadmin_stadtdo · ‎2013-12-11

Hi all,

We have been using OnCommand Insight for one month and do not have much experience with the tool yet. The installation and configuration were done by NetApp.

Now we have constant problems collecting data from the switches. Often no data is collected, yet no error is reported either. After stopping and restarting the data source it works again, but only for a few hours or days.

We are using 2 identical SAN fabrics with Cisco switches (pure FC switches like DS-C9509, C9124, C9134 and FCoE switches like Nexus-C5548). Switch port performance is only collected and displayed for 2 of 9 switches in one fabric. In the other fabric (with identical switches) no data is shown at all, and we do not know why.

Has anyone had similar problems, or found a solution?

I would be grateful for any information.

Thanks a lot.

Michael

ostiguy · ‎2013-12-11

Hey Michael,

This is a bit weird. For Cisco devices, OCI uses SNMP for both inventory and performance collection. So, usually both inventory and perf work, or both do not work. Since inventory has apparently worked for both fabrics at times, I am a bit surprised you are not seeing at least some performance data. Nonetheless, this should not be happening - long, consistent outages in collection make me wonder if the OCI server is having a resources problem.

Do you know if your NetApp contacts configured OCI to send OCI ASUP (autosupport)? OCI ASUP can send what we call extended logs, which are a collection of logs and .zip files from OCI's acquisition (data collection) attempts. If you are sending OCI ASUP, could you send me an email at ostiguy at netapp dot com with your OCI site name? You will see it in parentheses in the title bar of your OCI client, and also in the upper right hand corner of the OCI HTTP management portal ( Site: ______ ).

If we are receiving extended logs, I could take a look.

Matt

sanadmin_stadtdo · ‎2013-12-12

Hello Matt,

thanks for your answer.

At the moment the server does not send ASUPs, but this is an internal problem with our mail admins.

Which logs would be helpful? I found some zipped logs on the Insight server under "SANscreen => log" which start with cognos_ or dwh_.

Michael

ostiguy · ‎2013-12-12

Hey Michael,

This is a tricky one, as it might not only be a data collection problem. One question: are the Cisco datasources the only datasources that exhibit this?

Let's start with:

../sanscreen/acq/log

This is where acquisition does all of its logging.

There are some acq*.log* files - these are the master acquisition log files. Copy them to a folder.

Then

foundation_xyz....zip

Where xyz is the OCI datasource name - copy those to the same folder.

Then

\..SANscreen\jboss\server\onaro\log

From this folder:

jboss.log

server.log

performance.log

Copy those 3 files to the same folder.

Create a zip of the folder - if it is under 10MB, you can email me the zip at ostiguy at netapp dot com.

This will allow me to triage your system - the master acquisition logs will let me see what the "SANscreen Acq" Windows service is up to. The 3 server logs will let me know if jboss (.log) is healthy, if the server (.log) is processing acquisition reports correctly, and whether performance (.log) reports are being processed correctly.
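The collection steps above could be scripted roughly as follows. This is only a sketch under assumptions: the function name `bundle_logs`, the directory arguments, and the `foundation_<datasource>*.zip` glob are illustrative and not part of OCI, so adjust the paths to wherever your installation keeps the acquisition and server logs.

```python
import glob
import os
import shutil
import zipfile


def bundle_logs(acq_log_dir, server_log_dir, datasource, out_dir):
    """Copy the logs listed above into out_dir, zip them, return the zip path.

    All four arguments are hypothetical; pass the real locations of the
    acquisition log folder and the server log folder on your system.
    """
    os.makedirs(out_dir, exist_ok=True)
    acq = glob.escape(acq_log_dir)

    # Master acquisition logs (acq*.log*).
    wanted = glob.glob(os.path.join(acq, "acq*.log*"))
    # Archives for one datasource; the name pattern is an assumption.
    wanted += glob.glob(
        os.path.join(acq, "foundation_" + glob.escape(datasource) + "*.zip")
    )
    # The three server logs, skipped silently if one is missing.
    for name in ("jboss.log", "server.log", "performance.log"):
        path = os.path.join(server_log_dir, name)
        if os.path.isfile(path):
            wanted.append(path)

    for path in wanted:
        shutil.copy2(path, out_dir)

    # Write the archive next to out_dir so it does not end up inside itself.
    zip_path = os.path.normpath(out_dir) + ".zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(out_dir)):
            zf.write(os.path.join(out_dir, name), arcname=name)
    return zip_path
```

Checking `os.path.getsize(zip_path)` against the 10MB limit before mailing the result is a sensible last step.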

sanadmin_stadtdo · ‎2013-12-12

Hello Matt,

We have 2 Cisco data sources, 2 NetApp data sources and 1 VMware data source, and only the Cisco data sources cause trouble. The others provide perfect data, including reasonable performance data.

You wrote "if it is under 10MB", but the existing 5 acq logs and 724 foundation logs are over 1GB unpacked! Do you need all of the logs, or just the latest ones from today? I am zipping them now and the archive keeps growing; it is already over 700MB!

Michael

ostiguy · ‎2013-12-12

Yikes.

Just send me a couple of the foundation_.zip files for the Cisco datasources.

Thanks
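With several hundred archives in the folder, picking "a couple" by hand is tedious. A minimal sketch for selecting the most recently modified ones, assuming the archives are named `foundation_<datasource>*.zip` (the helper name `newest_archives` and its arguments are hypothetical):

```python
import glob
import os


def newest_archives(log_dir, datasource, count=2):
    """Return the `count` most recently modified archives for one datasource."""
    pattern = os.path.join(
        glob.escape(log_dir),
        "foundation_" + glob.escape(datasource) + "*.zip",
    )
    paths = glob.glob(pattern)
    # Newest first, judged by file modification time.
    paths.sort(key=os.path.getmtime, reverse=True)
    return paths[:count]
```

Modification time is used here because the naming scheme of the archives is not known for certain; if the names carry a timestamp, sorting by name would work as well.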

ostiguy · ‎2013-12-16

Hey all,

Just want to close this thread out:

Michael had 2 issues:

#1. A pair of switches that were going to be decommissioned. For some reason these were highly unreliable for SNMP communication, despite being on the same Ethernet subnet as the rest of his switches (which would tend to rule out WAN latency as a root cause). Once they were removed from his environment, his datasources were highly reliable. Before that, his datasources would fail with "partial success N-1 of N", because the datasources knew that each fabric had N switches.

#2. A lack of vfc port statistics on the FCoE interfaces of his Cisco Nexus switches. A data source patch resolved it. This patch is NOT going to be part of OCI 6.4.2.0.1, as it was resolved after the freeze date for 6.4.2.0.1. It will be part of a future data source service pack for OCI 6.4.[1-2].