I am trying to add a k8s cluster to Cloud Insights. I have installed the package on one of the nodes in the k8s cluster and made the modifications called out in the documentation. The docs bounce around a lot and are a tad tough to follow, but I think I have covered all my bases. I did hit an issue with the configmap initially and saw it in the logs; once I fixed that typo (and deleted/recreated the pods), the agents came up successfully. The logs for each pod, under /var/log/telegraf inside the pods, show they are up and Telegraf is running with no apparent errors or issues. However, I don't see anything populated in CI, so I am unsure where to look next for any indication of a problem. All RS/DS/pods are healthy, so I was expecting a successful discovery. For anyone who has successfully added clusters so far: did you hit anything like this, and do you have tips or log locations to check?
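For reference, this is roughly the health sweep I've been running. It's a minimal sketch: the `monitoring` namespace and the sample pod listing below are placeholders standing in for your own `kubectl get pods -n monitoring` output.

```shell
# Sketch: flag any telegraf pod that is not Running.
# The sample output below stands in for:
#   kubectl get pods -n monitoring > /tmp/pods.txt
cat > /tmp/pods.txt <<'EOF'
NAME                READY   STATUS    RESTARTS   AGE
telegraf-ds-dbcvw   1/1     Running   0          2d
telegraf-rs-7xk2p   1/1     Running   0          2d
EOF

# Print any pod whose STATUS column is not "Running"; exit non-zero if found.
awk 'NR > 1 && $3 != "Running" { print $1 " is " $3; bad = 1 } END { exit bad }' /tmp/pods.txt \
  && echo "all telegraf pods Running"
```

In my case this comes back clean, which is why I'm puzzled nothing shows up in CI.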
So interestingly enough, I let this sit for over a day after setting it up, and just as I went to grab the output to post here, I noticed that after TWO days it finally tried to post something to what looks like the CI instance, and failed:
kubectl logs telegraf-ds-dbcvw -n monitoring
2020-08-18T23:04:37Z I! Starting Telegraf 1.14.0
2020-08-18T23:04:37Z I! Loaded inputs: diskio processes kubernetes cpu disk kernel mem swap system
2020-08-18T23:04:37Z I! Loaded aggregators:
2020-08-18T23:04:37Z I! Loaded processors:
2020-08-18T23:04:37Z I! Loaded outputs: http http http http http http http http http
2020-08-18T23:04:37Z I! Tags enabled: agent_node_ip=10.xxx.xxx.xxx agent_node_os=CentOS Linux agent_node_uuid=B3971A42-5913-F4DC-AD49-D5870F9FC27E host=develwk-2 kubernetes_cluster=devel
2020-08-18T23:04:37Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"develwk-2", Flush Interval:10s
2020-08-20T12:23:10Z E! [agent] Error writing to outputs.http: when writing to [https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb] received status code: 502
I didn't expect it to take 2 days to get to this point. Is that expected? I only caught this now as I went to respond to your reply, so I think I either have something else to troubleshoot or a specific error to ask whether people have seen before.
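To make any pattern in those failures easier to spot, I've been slicing the log like this. It's a sketch: the sample line below stands in for the full output of `kubectl logs telegraf-ds-dbcvw -n monitoring` saved to a file.

```shell
# Sketch: pull the HTTP status codes out of Telegraf's output-write errors
# so a repeated code (e.g. 502 over and over) stands out at a glance.
# The sample line stands in for:
#   kubectl logs telegraf-ds-dbcvw -n monitoring > /tmp/telegraf.log
cat > /tmp/telegraf.log <<'EOF'
2020-08-20T12:23:10Z E! [agent] Error writing to outputs.http: when writing to [https://mb3627.c02.cloudinsights.netapp.com/rest/v1/lake/ingest/influxdb] received status code: 502
EOF

# Extract the 3-digit status code from each error line and count occurrences.
grep 'Error writing to outputs' /tmp/telegraf.log \
  | sed -n 's/.*received status code: \([0-9]\{3\}\).*/\1/p' \
  | sort | uniq -c
```

So far every error line I have is the same 502 against the ingest URL.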
That error seems to be related to the Cloud Insights portal side. I didn't see anything in the config maps that stores a user/password, but I do know the install snippets ship with API keys, and there are a ton of embedded keys in the snippets the instructions ask you to copy over. I'm curious whether those are the issue. I don't know how to rectify those at the moment. Any thoughts on the new error above?
This seems like an error in how these config files are set up to send/write the data to the CI instance, but I'm unsure of the fix at the moment, and the errors don't match anything I can find in the docs so far.
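One thing that helped me reason about whether the API keys are to blame: in general HTTP terms (this is standard HTTP semantics, not anything CI-specific), a rejected key would normally come back as 401/403, while a 502 points at the gateway/back end behind the ingest URL. A rough sketch of how I'm reading the code:

```shell
# Sketch: rough interpretation of the status code Telegraf reports back.
# (General HTTP status semantics, not CI-specific documentation.)
status=502
case "$status" in
  401|403)      echo "likely an API key / auth problem in the telegraf config" ;;
  404)          echo "likely a wrong ingest URL" ;;
  502|503|504)  echo "likely a back-end / gateway problem on the CI side" ;;
  *)            echo "unexpected status: $status" ;;
esac
# prints: likely a back-end / gateway problem on the CI side
```

If that reading is right, the embedded keys may be fine and the problem is on the receiving end.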
I agree with your thinking. In testing the network, I can get out from both the pod itself and the worker node; reaching that location from the pod and the node seems to come back OK. My thought is that one of the API keys, or something in that realm, might not be working for the POST?
Sorry for the delay. I checked the pods after the back-end change and their logs had these additional messages in them:
Note: again, the interval between these log entries looks like 2 days, which I don't think is correct. Is there a specific real-time log on the pod or the worker node, one I might not be aware of, that I could watch to see this as it happens?
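(I know `kubectl logs -f telegraf-ds-dbcvw -n monitoring` will stream the pod's stdout live; what I'm asking about is whether there's anything beyond that on the node.) To double-check my "2 days" impression, I computed the gap between the first and last timestamps in the log I already have. Sketch, assuming GNU `date`; the two sample lines stand in for the saved pod log:

```shell
# Sketch: measure the gap between the first and last log timestamps to see
# how far apart entries actually are. Assumes GNU date (-d parsing).
cat > /tmp/telegraf.log <<'EOF'
2020-08-18T23:04:37Z I! [agent] Config: Interval:10s, Flush Interval:10s
2020-08-20T12:23:10Z E! [agent] Error writing to outputs.http: received status code: 502
EOF

first=$(date -u -d "$(head -n1 /tmp/telegraf.log | awk '{print $1}')" +%s)
last=$(date -u -d "$(tail -n1 /tmp/telegraf.log | awk '{print $1}')" +%s)
echo "gap: $(( (last - first) / 3600 )) hours"
# prints: gap: 37 hours
```

So it's closer to 37 hours than a clean 2 days, but still nowhere near the 10s flush interval the agent reports.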
Given the back-end changes and some of the previous troubleshooting, I went ahead and deleted/restarted the pods. After that was done, they all looked clean, but based on what I have seen thus far, I suspect I won't see any failures for another 2 days:
I did validate that everything looks good: I can curl the SystemInfo page of the CI instance like before, so it all seems to be in the same state as before. I just don't know how or where to look for real-time logs, and my guess is I won't see those timeout failures for another 2 days (unless something shows up in the CI instance, and I don't see anything there currently).