# nebula
r
Hello nebula users, I'm trying to update k8s nebula from 3.4.0 -> 3.6.0; nebula-operator is updating fine. Storaged and metad are also updated. The problem is with graphd, which is stuck in CrashLoopBackOff, and unfortunately `kubectl logs` does not return any relevant info. Have you encountered a similar problem? Thank you
j
Have you updated the CRDs as well?
Also, if you can exec into it before it crashes, there is a logs folder that may help. But it may be difficult to do.
r
Thank you for the reply. Yes, the CRDs are updated to the latest available version too. The pod is only ready for half a second, so it's very challenging to exec into the pod and see the logs :)
w
The logging was not handled in a stdout/stderr way, so `kubectl logs` won't show it. Any chance you could inspect the crashing graphd pod to find its log PVC and create a temp pod attached to it?
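For anyone following along, this inspection could look roughly like the sketch below. The namespace, pod name, and PVC name are assumptions based on the typical nebula-operator naming visible in this thread; check the actual claim name returned by the first command before creating the temp pod.

```shell
# Find which PVC(s) the crashing graphd pod mounts
kubectl -n nebula get pod nebula-cluster-graphd-2 \
  -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'

# Attach a throwaway pod to the log PVC to read the files
# (claimName below is an assumed example; use the value printed above)
kubectl -n nebula apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: log-inspector
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: logs
      mountPath: /logs
  volumes:
  - name: logs
    persistentVolumeClaim:
      claimName: graphd-log-nebula-cluster-graphd-2
EOF

# Browse the graphd log files
kubectl -n nebula exec -it log-inspector -- sh -c 'ls /logs; tail -n 100 /logs/*'
```

Note that with a ReadWriteOnce PVC the temp pod may need to land on the same node as the crashing pod (or the crashing pod must be scaled down first).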
r
Thanks for the walkthrough, I will try it. I tried to exec into the metad pods (which are running) and I found this type of error:
75 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
, but when I ping the domain name from the pod, it works. And that's even though I have
kubernetesClusterDomain: "cluster.local"
set in nebula-operator.
w
Are all metad pods suffering from this? And how does metad-2 itself look, going from its log?
r
Yes, all metad pods have the same error, but with a different pod address. Log from metad-2:
75 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-0.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
I have enabled logging from graphd, and I received:
GraphDaemon.cpp:110] host not found:nebula-cluster-graphd-2.nebula-cluster-graphd-headless.nebula.svc.cluster.local
w
Could you please check whether those headless services have publishNotReadyAddresses set to true or false? Strange; they should be set to true by the operator, but from this situation it looks like false.
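A quick way to check this across the headless services could be (a sketch; the namespace and service names are assumed from the pod FQDNs earlier in the thread):

```shell
# Print spec.publishNotReadyAddresses for each nebula headless service
for svc in nebula-cluster-metad-headless \
           nebula-cluster-graphd-headless \
           nebula-cluster-storaged-headless; do
  printf '%s: ' "$svc"
  kubectl -n nebula get svc "$svc" \
    -o jsonpath='{.spec.publishNotReadyAddresses}'
  echo
done
```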
r
All headless svc have publishNotReadyAddresses: true
👀 1
w
Are the metad pods crashing? If yes, what do their logs show, other than the service FQDNs being unreachable?
r
The metad pods are in Running state (the logs contain "failed to resolve address..." entries, but that appeared once after the nebula redeploy, and the error is not shown anymore).
We solved it by deleting the nebula-graphd StatefulSet; after it was recreated, the graphd pods started without errors.
❤️ 1
j
Ah yeah, I completely forgot about this. I encountered this same issue before. It's because during the upgrade from 3.4 -> 3.6 the FQDN was changed, but it was not updated in the StatefulSet (because it's immutable, IIRC). So the solution is to delete the StatefulSet; then it will be recreated properly.
❤️ 1
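For future readers, the fix described above would look roughly like this (the namespace, StatefulSet name, and label selector are assumptions based on the names in this thread; verify them in your cluster first):

```shell
# Delete the stale graphd StatefulSet; nebula-operator recreates it
# with the corrected FQDNs from the current CRD spec
kubectl -n nebula delete statefulset nebula-cluster-graphd

# Watch the operator bring the graphd pods back up
kubectl -n nebula get pods -l app.kubernetes.io/component=graphd -w
```

Since the operator owns the StatefulSet, deleting it is safe in the sense that it will be reconciled back; graphd is stateless, so no data is lost, but there is a brief graphd outage while the pods are recreated.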
w
Oh, sorry about this; it should be highlighted in the docs or fixed in the operator. Thanks @Róbert Kuzma @Jeremy Simpson
❤️ 1