Adding Kubernetes cluster to Cloud Insights

clinton33 · 2020-08-18

Hello,

I am trying to add a k8s cluster to Cloud Insights. I have installed the package on one of the nodes in the k8s cluster and made the modifications called out in the documentation. The docs bounce around a lot and are a tad tough to follow, but I think I have covered all my bases.

I did hit an issue with the configmap initially and saw that in the logs. Once I fixed that typo, the agents came up successfully (delete and recreate the logs). The logs for each pod show that they are up and telegraf is running, with no apparent errors inside the pods in the /var/log/telegraf directory.

I don't see anything populated in CI, so I am unsure where to look next for any indication of an issue. All RS/DS/Pods are healthy, so I was expecting a successful discovery. Have those who have successfully added clusters hit any issues like this, and do you have any tips or log locations to check?

Thanks!

hotz · 2020-08-20

Hi,

It takes a few minutes for the data to become visible after it is first pushed to CI.

After a little while though, you should be able to access the data for queries and dashboards.

clinton33 · 2020-08-20

Thanks hotz,

So interestingly enough, I let this sit for over a day after setting it up and just as I went to get the output to display here, I noticed that after TWO days it finally tried to post something to what looks like the CI instance and failed:

kubectl logs telegraf-ds-dbcvw -n monitoring
2020-08-18T23:04:37Z I! Starting Telegraf 1.14.0
2020-08-18T23:04:37Z I! Loaded inputs: diskio processes kubernetes cpu disk kernel mem swap system
2020-08-18T23:04:37Z I! Loaded aggregators:
2020-08-18T23:04:37Z I! Loaded processors:
2020-08-18T23:04:37Z I! Loaded outputs: http http http http http http http http http
2020-08-18T23:04:37Z I! Tags enabled: agent_node_ip=10.xxx.xxx.xxx agent_node_os=CentOS Linux agent_node_uuid=B3971A42-5913-F4DC-AD49-D5870F9FC27E host=develwk-2 kubernetes_cluster=devel
2020-08-18T23:04:37Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"develwk-2", Flush Interval:10s
2020-08-20T12:23:10Z E! [agent] Error writing to outputs.http: when writing to [https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb] received status code: 502

I didn't expect it to take 2 days to get to this point. Is this expected? I only caught this as I went to respond to your reply, so I now have a specific error to troubleshoot and to ask about: has anyone seen it before?
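To put a number on that delay, the timestamps in the log above can be subtracted directly. The following is a minimal Python sketch, not part of any Cloud Insights tooling; the helper names `parse_line` and `first_error_delay` are made up for illustration, and the two log lines are copied from the pod output above.

```python
from datetime import datetime, timezone

# First and last lines of the pod log shown above.
LOG = """\
2020-08-18T23:04:37Z I! Starting Telegraf 1.14.0
2020-08-20T12:23:10Z E! [agent] Error writing to outputs.http: when writing to [https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb] received status code: 502
"""


def parse_line(line):
    """Split a Telegraf log line into (timestamp, level, message)."""
    stamp, level, message = line.split(" ", 2)
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return when, level.rstrip("!"), message


def first_error_delay(log_text):
    """Return the time between the first log line and the first E! line, or None."""
    entries = [parse_line(line) for line in log_text.splitlines() if line.strip()]
    start = entries[0][0]
    for when, level, _ in entries:
        if level == "E":
            return when - start
    return None


delay = first_error_delay(LOG)
print(delay)  # 1 day, 13:18:33
```

For this pod the first logged error shows up roughly 37 hours after startup, so the gap is long but not a fixed two-day cycle.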

That error seems to be related to the Cloud Insights portal. I didn't see anything in the config maps that stores a user name or password, but I do know the install snippets have API keys with them, and there are a ton of embedded keys in the snippets the instructions ask you to copy over. I'm curious whether those are the issue. I don't know how to rectify those at the moment. Any thoughts on the new error above?

Thanks!

-Keith

clinton33 · 2020-08-20

Checking the other pods, it seems that another pod also took 2 days to send data, but it looks like the main pod trying to send data at the time, because it has multiple error entries:

2020-08-18T23:03:38Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"develwk-3", Flush Interval:10s
2020-08-20T03:15:25Z E! [agent] Error writing to outputs.http: Post https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-08-20T08:41:15Z E! [agent] Error writing to outputs.http: Post https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-08-20T13:46:00Z E! [agent] Error writing to outputs.http: when writing to [https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb] received status code: 502

This seems like an error in where these config files are set up to send/write the data to the CI instance, but I am unsure of the fix at the moment, and the errors don't match anything I can find in the docs thus far.
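It may help to separate the two kinds of failure in that log. A client timeout means no response headers arrived in time, while a 502 means something in front of the service answered on its behalf; a rejected credential would more typically show up as a 401 or 403. The sketch below is a rough Python illustration of that split, using the log lines quoted above; the function name `classify` and the category labels are invented for the example.

```python
import re
from collections import Counter

# Log lines copied from the develwk-3 pod output above.
LOG = """\
2020-08-18T23:03:38Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"develwk-3", Flush Interval:10s
2020-08-20T03:15:25Z E! [agent] Error writing to outputs.http: Post https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-08-20T08:41:15Z E! [agent] Error writing to outputs.http: Post https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-08-20T13:46:00Z E! [agent] Error writing to outputs.http: when writing to [https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb] received status code: 502
"""

STATUS = re.compile(r"received status code: (\d+)")


def classify(line):
    """Label an error line as 'timeout', 'http <code>' or 'other'; None for non-errors."""
    if " E! " not in line:
        return None
    if "Client.Timeout exceeded" in line:
        return "timeout"
    match = STATUS.search(line)
    if match:
        return "http " + match.group(1)
    return "other"


counts = Counter(kind for kind in map(classify, LOG.splitlines()) if kind)
print(dict(counts))  # {'timeout': 2, 'http 502': 1}
```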

hotz · 2020-08-20

No, it should definitely not take two days. But according to https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html, 502 indicates a "Bad Gateway". Are you by any chance using a proxy? If I try to navigate to the URL with a browser, I get a 405, which is "Method Not Allowed". It looks to me like the telegraf agent can't communicate properly with https://mb3627.c02.cloudinsights.netapp.com.
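A browser sends a GET, whereas the Telegraf log lines show the agent sending a POST, so a closer test is to POST to the same URL from the pod or node. Below is a minimal Python sketch of such a probe. The names `build_probe` and `send_probe` are hypothetical, and no API key headers are set because the ones from the install snippet are not shown in this thread; an unauthenticated request will probably be rejected, and the only point is to see whether the service itself answers (a 4xx), a gateway answers (a 502), or nothing answers at all.

```python
import urllib.error
import urllib.request

# URL copied from the Telegraf error messages above.
INGEST_URL = "https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb"


def build_probe(url, method="POST"):
    """Build a request with an empty body; no API key headers are set (assumption)."""
    return urllib.request.Request(url, data=b"", method=method)


def send_probe(request, timeout=10):
    """Return the HTTP status code, or a short description if nothing answered."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still show which layer answered.
        return err.code
    except (urllib.error.URLError, OSError) as err:
        return f"no response: {err}"


probe = build_probe(INGEST_URL)
print(probe.get_method(), probe.full_url)
```

Calling `send_probe(probe)` from inside the pod and again from the worker node would show whether both see the same status for a POST.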

hotz · 2020-08-20

Keith,

So, from the host develwk-2, where the telegraf agent is running, if you do:

curl https://mb3627.c02.cloudinsights.netapp.com/rest/v1/systemInfo

do you get an answer?

-Gerhard

clinton33 · 2020-08-20

Hi Gerhard,

I agree with what you were thinking. Testing the network, I can get out from both the pod itself and the worker node, and requests to that URL from the pod and the node come back OK. My thought is that one of the API keys, or something in that realm, might not be working for the POST?

Here are the outputs:

root@telegraf-ds-npsqs:/# curl https://mb3627.c02.cloudinsights.netapp.com/rest/v1/systemInfo
{
"ociVersion": "8.0.0",
"ociBuildNumber": "local",
"ociServerPatchVersion": "0",
"ociServicePackVersion": "0",
"isCertAuth": false,
"apiVersion": "1.7.0",
"samplesApiVersion": "1.7.0",
"osType": "LINUX",
"isDemoDB": false,
"restMaxTimeFilterDays": 31,
"performanceManualBackupDays": 7,
"performancePhoneHomeDays": 7,
"performanceRetentionDays": 90,
"maxRestDataPointsLimit": 1080,
"tenantId": "94c02d2d-a127-46ff-9022-7d41b15d1d22",
"isSaaS": true,
"adminPortalUrl": "https://gateway.c01.cloudinsights.netapp.com",
"mode": "prod",
"acceptedTermsOfService": false,
"serverTimeUTC": 1597935986863
}root@telegraf-ds-npsqs:/#
root@telegraf-ds-npsqs:/#
root@telegraf-ds-npsqs:/# exit
exit

[root@develwk-2 serviceaccount]# curl https://mb3627.c02.cloudinsights.netapp.com/rest/v1/systemInfo
{
"ociVersion": "8.0.0",
"ociBuildNumber": "local",
"ociServerPatchVersion": "0",
"ociServicePackVersion": "0",
"isCertAuth": false,
"apiVersion": "1.7.0",
"samplesApiVersion": "1.7.0",
"osType": "LINUX",
"isDemoDB": false,
"restMaxTimeFilterDays": 31,
"performanceManualBackupDays": 7,
"performancePhoneHomeDays": 7,
"performanceRetentionDays": 90,
"maxRestDataPointsLimit": 1080,
"tenantId": "94c02d2d-a127-46ff-9022-7d41b15d1d22",
"isSaaS": true,
"adminPortalUrl": "https://gateway.c01.cloudinsights.netapp.com",
"mode": "prod",
"acceptedTermsOfService": false,
"serverTimeUTC": 1597935996137
}

Thanks!

-Keith

hotz · 2020-08-20

Keith,

I'm starting to think that this is a server side issue and it would be best to involve someone from engineering.

I'll open a case for you, so we can check on the tenant side.

-Gerhard

clinton33 · 2020-08-20

Thanks Gerhard,

Much appreciated!

hotz · 2020-08-21

Keith,

Can you check again, please? Engineering released a server-side patch just now; we're hoping that will fix it.

clinton33 · 2020-08-23

Hi Gerhard,

Sorry for the delay. I checked the pods after the back-end change and their logs had these additional messages in them:

Note: again it looks like the interval for this logging is about 2 days, which I don't think is correct. Is there a specific log on the pod or the worker node that I might not be aware of, where I can watch this in real time?

2020-08-18T23:03:38Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"develwk-3", Flush Interval:10s
2020-08-20T03:15:25Z E! [agent] Error writing to outputs.http: Post https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-08-20T08:41:15Z E! [agent] Error writing to outputs.http: Post https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-08-20T13:46:00Z E! [agent] Error writing to outputs.http: when writing to [https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb] received status code: 502
2020-08-22T07:28:25Z E! [agent] Error writing to outputs.http: Post https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-08-22T07:28:25Z E! [agent] Error writing to outputs.http: Post https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-08-22T11:23:45Z E! [agent] Error writing to outputs.http: Post https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Given the back-end changes and some of the previous troubleshooting, I went ahead and deleted/restarted the pods. After that was done, they all look clean, but I suspect I won't see any failures for 2 days, given what I have seen thus far:

[root@develwk-2 ~]# kubectl logs -n monitoring telegraf-ds-xcw8p
2020-08-23T13:57:18Z I! Starting Telegraf 1.14.0
2020-08-23T13:57:18Z I! Loaded inputs: cpu diskio processes kubernetes disk kernel mem swap system
2020-08-23T13:57:18Z I! Loaded aggregators:
2020-08-23T13:57:18Z I! Loaded processors:
2020-08-23T13:57:18Z I! Loaded outputs: http http http http http http http http http
2020-08-23T13:57:18Z I! Tags enabled: agent_node_ip=10.xxx.xxx.45 agent_node_os=CentOS Linux agent_node_uuid=B3971A42-5913-F4DC-AD49-D5870F9FC27E host=develwk-2 kubernetes_cluster=devel
2020-08-23T13:57:18Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"develwk-2", Flush Interval:10s

I did validate that all looks well and that I can curl the systemInfo page of the CI instance like before, so it all seems to be in the same boat as before. I just don't know how or where to look for real-time logs, and my guess is that I won't see those timeout failures for 2 days (unless something shows up in the CI instance, and I don't see anything in there currently).
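On watching the logs live: `kubectl logs` can stream new lines with `--follow`, and `--since` and `--timestamps` narrow and annotate the output. The snippet below is a small Python wrapper around that command, offered as a sketch only; the helper names `follow_cmd` and `follow` are invented, and the pod name and namespace are the ones from the output above. As far as I know, Telegraf only logs successful writes at debug level, so a quiet log does not necessarily mean nothing was sent; setting `debug = true` in the `[agent]` section of the Telegraf config should make each flush visible.

```python
import subprocess


def follow_cmd(pod, namespace="monitoring", since="10m"):
    """Build a `kubectl logs` command that streams new lines as they appear."""
    return ["kubectl", "logs", pod, "-n", namespace,
            "--follow", "--timestamps", f"--since={since}"]


def follow(pod, **kwargs):
    """Print log lines from the pod until interrupted with Ctrl-C."""
    with subprocess.Popen(follow_cmd(pod, **kwargs),
                          stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            print(line, end="")


# Show the command that would be run for the pod from the output above.
print(" ".join(follow_cmd("telegraf-ds-xcw8p")))
```

Running `follow("telegraf-ds-xcw8p")` on a machine with cluster access would then tail that pod; the same command can of course be typed directly into a shell.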

Let me know what you think. Thanks!

hotz · 2020-08-24

Keith,

Are we correct in assuming there is still no data in CI and the issue still persists?