# nebula
Nikolay:
Hi everyone, I have an issue with a Nebula cluster on k8s. In the beginning everything works fine, but after several hours Nebula can no longer even generate statistics via Nebula Studio, and it's not possible to rebuild an index either. E_RPC_FAILURE messages appear in the logs, which means queries are hitting the storage_client_timeout_ms limit. At the moment I have no idea what has happened to the cluster. Here is the cluster spec:
```yaml
apiVersion: apps.nebula-graph.io/v1alpha1
kind: NebulaCluster
metadata:
  annotations:
    meta.helm.sh/release-name: nebula-cluster
    meta.helm.sh/release-namespace: edt-test
  creationTimestamp: "2024-01-31T22:30:49Z"
  generation: 4
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 1.7.3
    helm.sh/chart: nebula-cluster-1.7.3
  name: nebula-cluster
  namespace: edt-test
  resourceVersion: "132477201"
  uid: 7f739868-125e-45a7-ad3a-2b1b036dbed3
spec:
  agent:
    image: vesoft/nebula-agent
    resources:
      limits:
        cpu: 200m
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi
    version: latest
  enableBR: false
  enablePVReclaim: false
  exporter:
    annotations: {}
    env: []
    httpPort: 9100
    image: vesoft/nebula-stats-exporter
    labels: {}
    maxRequests: 20
    replicas: 1
    resources:
      limits:
        cpu: 200m
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi
    version: v3.3.0
  graphd:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/cluster: nebula-cluster
                app.kubernetes.io/component: storaged
            topologyKey: kubernetes.io/hostname
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - kafka
            topologyKey: kubernetes.io/hostname
    annotations: {}
    config:
      client_idle_timeout_secs: "1800"
      logtostderr: "true"
      max_sessions_per_ip_per_user: "30000"
      memory_tracker_detail_log: "true"
      memory_tracker_detail_log_interval_ms: "120000"
      minloglevel: "1"
      num_worker_threads: "32"
      redirect_stdout: "false"
      session_idle_timeout_secs: "1800"
      stderrthreshold: "1"
      storage_client_timeout_ms: "360000"
      system_memory_high_watermark_ratio: "0.9"
      timezone_name: UTC+03:00
    env: []
    image: vesoft/nebula-graphd
    labels: {}
    logVolumeClaim:
      resources:
        requests:
          storage: 500Mi
    replicas: 3
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
      requests:
        cpu: "0.1"
        memory: 100Mi
    service:
      externalTrafficPolicy: Local
      type: NodePort
    version: v3.6.0
  imagePullPolicy: Always
  logRotate:
    rotate: 5
    size: 10M
  metad:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/cluster: nebula-cluster
                app.kubernetes.io/component: storaged
            topologyKey: kubernetes.io/hostname
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - kafka
            topologyKey: kubernetes.io/hostname
    annotations: {}
    config:
      logtostderr: "true"
      minloglevel: "1"
      redirect_stdout: "false"
      stderrthreshold: "1"
      timezone_name: UTC+03:00
    dataVolumeClaim:
      resources:
        requests:
          storage: 10Gi
    env: []
    image: vesoft/nebula-metad
    labels: {}
    logVolumeClaim:
      resources:
        requests:
          storage: 500Mi
    replicas: 1
    resources:
      limits:
        cpu: "2"
        memory: 8Gi
      requests:
        cpu: "1"
        memory: 4Gi
    version: v3.6.0
  reference:
    name: statefulsets.apps
    version: v1
  schedulerName: default-scheduler
  storaged:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - kafka
            topologyKey: kubernetes.io/hostname
    annotations: {}
    config:
      enable_partitioned_index_filter: "true"
      enable_rocksdb_statistics: "true"
      logtostderr: "true"
      memory_tracker_detail_log: "true"
      memory_tracker_detail_log_interval_ms: "120000"
      minloglevel: "1"
      num_worker_threads: "64"
      raft_heartbeat_interval_secs: "31"
      raft_rpc_timeout_ms: "500"
      redirect_stdout: "false"
      rocksdb_block_cache: "8192"
      rocksdb_db_options: '{"max_subcompactions":"64","max_background_jobs":"64"}'
      stderrthreshold: "1"
      timezone_name: UTC+03:00
      wal_ttl: "14400"
    dataVolumeClaims:
      - resources:
          requests:
            storage: 100Gi
    enableAutoBalance: false
    enableForceUpdate: false
    env: []
    image: vesoft/nebula-storaged
    labels: {}
    logVolumeClaim:
      resources:
        requests:
          storage: 500Mi
    replicas: 3
    resources:
      limits:
        cpu: "4"
        memory: 12Gi
      requests:
        cpu: "2"
        memory: 6Gi
    version: v3.6.0
    topologySpreadConstraints:
      - topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
```
In the beginning I had 3 metad, 5 storaged, and 5 graphd replicas, but later I reduced metad to 1 replica.
What are the important configs or k8s requirements for a Nebula cluster? I deployed on k8s version v1.22.15.
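For reference, the flag values that the running services actually loaded can be queried from a NebulaGraph console. A minimal sketch; note that `SHOW CONFIGS` only lists flags registered as readable, so whether a given flag such as storage_client_timeout_ms appears depends on the version:

```ngql
-- Flags the running graphd loaded (timeouts, session limits, ...)
SHOW CONFIGS GRAPH;

-- Flags the running storaged loaded (raft timings, RocksDB options, ...)
SHOW CONFIGS STORAGE;
```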
Wenting:
Hi @Nikolay, can you check the service status with `SHOW HOSTS`? Before the service crashed, did you run any query statement? If so, you can add `PROFILE` or `EXPLAIN` in front of the statement and execute it to see which steps of the generated execution plan take the longest. That's one direction you could take to troubleshoot further.
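For example, with the employee query from later in this thread (`PROFILE` executes the statement and reports per-operator statistics, while `EXPLAIN` only prints the plan):

```ngql
-- Run the query and annotate each plan node with execution time and row counts
PROFILE MATCH (v:employee) RETURN v;

-- Show the execution plan only; the query itself is not executed
EXPLAIN MATCH (v:employee) RETURN v;
```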
Nikolay:
Here is the `SHOW HOSTS` output:

```
Host                                                                                    Port  Status  Leader count  Leader distribution  Partition distribution  Version
nebula-cluster-storaged-0.nebula-cluster-storaged-headless.edt-test.svc.cluster.local  9779  ONLINE  3             edt:3                edt:3                   3.6.0
nebula-cluster-storaged-1.nebula-cluster-storaged-headless.edt-test.svc.cluster.local  9779  ONLINE  3             edt:3                edt:3                   3.6.0
nebula-cluster-storaged-2.nebula-cluster-storaged-headless.edt-test.svc.cluster.local  9779  ONLINE  4             edt:4                edt:4                   3.6.0
```
Hi @Wenting, the issue happens even when no statements are running at all. Right now I tried to run a simple query and nothing happens, it just keeps loading:
```ngql
MATCH (v:employee)
RETURN v;
```
```ngql
SUBMIT JOB STATS;
```

Even statistics cannot be generated.
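For reference, the state of a submitted job can be inspected from the console, which helps distinguish a queued or stuck job from a failed one. A sketch; `<job_id>` stands for the ID that `SUBMIT JOB STATS` returns:

```ngql
-- List recent jobs in the current space with status such as QUEUE, RUNNING, FINISHED, FAILED
SHOW JOBS;

-- Inspect a single job and its tasks
SHOW JOB <job_id>;
```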
Wenting:
> In the beginning I had 3 metad, 5 storaged, and 5 graphd replicas, but later I reduced metad to 1 replica.
Sorry, NebulaGraph doesn't support scaling in metad. Would you mind recreating the NebulaCluster?
> Right now I tried to run a simple query and nothing happens, it just keeps loading:
>
> ```ngql
> MATCH (v:employee)
> RETURN v;
> ```
By "just loading", do you see any errors out there on GraphD/StorageD, please?
Nikolay:
Yeah, of course, I recreated it. For now we've decided to install Nebula on physical machines instead of k8s, and everything is OK there, but it's strange that it doesn't work correctly on the k8s cluster.