Troubleshooting

Common issues and solutions for SRExpert.

Installation Issues

Helm Installation Fails

Symptom: helm install command fails

Solutions:

  1. Check the Helm version (v3.x required)
helm version
  2. Verify the repository is added
helm repo list
helm repo add srexpert-helm https://nexus.srexpert.io/repository/srexpert-helm/
helm repo update
  3. Check the namespace exists (create it if missing)
kubectl get namespace srexpert || kubectl create namespace srexpert
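
The Helm version check above can be scripted. The following is a minimal sketch, assuming `helm version --short` prints a string such as `v3.14.2+gc309b6f`. The function names `helm_major` and `require_helm3` are hypothetical helpers for illustration, not part of Helm or SRExpert.

```shell
# Print the major version from a Helm version string such as "v3.14.2+gc309b6f".
helm_major() {
  v="${1#v}"                 # drop the leading "v", if present
  printf '%s\n' "${v%%.*}"   # keep everything before the first dot
}

# Succeed only when the installed Helm client reports major version 3.
require_helm3() {
  [ "$(helm_major "$(helm version --short)")" = "3" ]
}
```

For example, `require_helm3 || echo "Helm v3.x is required"` prints a warning when a different major version is installed.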

Pods Not Starting

Symptom: Pods stuck in Pending or CrashLoopBackOff

Check resources:

kubectl describe pod -n srexpert <pod-name>
kubectl logs -n srexpert <pod-name>

Common causes:

| Issue | Solution |
| --- | --- |
| ImagePullBackOff | Check registry credentials |
| Insufficient resources | Increase node capacity |
| PVC pending | Check StorageClass |
| ConfigMap missing | Verify Helm values |

Image Pull Errors

Symptom: ErrImagePull or ImagePullBackOff

Solutions:

  1. Verify secret exists
kubectl get secret nexus-registry -n srexpert
  2. Check secret configuration
kubectl get secret nexus-registry -n srexpert -o yaml
  3. Recreate secret if needed
kubectl create secret docker-registry nexus-registry \
  --docker-server=registry.srexpert.io \
  --docker-username=YOUR_USER \
  --docker-password=YOUR_PASS \
  -n srexpert
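
The YAML output of the secret shows its data base64-encoded, which makes it hard to confirm the registry server and username. A minimal sketch for decoding it follows, assuming the secret has the standard `kubernetes.io/dockerconfigjson` type that `kubectl create secret docker-registry` produces. The function name `show_pull_secret` is a hypothetical helper. The output contains credentials, so avoid pasting it into tickets or chat.

```shell
# Decode the registry configuration stored in a docker-registry secret.
# Usage: show_pull_secret <secret-name> <namespace>
# Warning: the decoded JSON includes the registry password.
show_pull_secret() {
  kubectl get secret "$1" -n "$2" \
    -o 'jsonpath={.data.\.dockerconfigjson}' | base64 -d
}
```

For example, `show_pull_secret nexus-registry srexpert` prints the decoded JSON so the `auths` entry can be compared with the registry host used in the image references.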

Connection Issues

Cannot Connect to Cluster

Symptom: Cluster shows “Disconnected” status

Check:

  1. Kubeconfig is valid
kubectl --kubeconfig=your-kubeconfig get nodes
  2. Network connectivity
curl -k https://your-cluster-api:6443/healthz
  3. Certificate validity
kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | base64 -d | openssl x509 -noout -dates

API Server Timeout

Symptom: Operations time out

Solutions:

  • Check network latency
  • Increase timeout in settings
  • Verify firewall rules allow port 6443
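
To put a number on network latency, the health endpoint used elsewhere in this guide can be timed with curl. This is a rough sketch only: the function name `api_latency` is hypothetical, the 10-second cap is an arbitrary choice, and a single request is not a substitute for proper network diagnostics.

```shell
# Print the total request time in seconds for the API server health endpoint.
# Usage: api_latency https://your-cluster-api:6443
api_latency() {
  # -k skips TLS verification, -s silences progress output,
  # -o /dev/null discards the body, -w prints only the timing.
  curl -k -s -o /dev/null --max-time 10 -w '%{time_total}\n' "$1/healthz"
}
```

Running `api_latency https://your-cluster-api:6443` a few times from the host where SRExpert runs gives a feel for whether requests are slow or being dropped.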

Authentication Failed

Symptom: 401 Unauthorized errors

Check:

  1. Token hasn’t expired
  2. ServiceAccount exists
  3. RBAC bindings are correct
kubectl auth can-i --list --as=system:serviceaccount:srexpert-system:srexpert

Database Issues

PostgreSQL Not Starting

Symptom: PostgreSQL pod fails to start

Check PVC:

kubectl get pvc -n srexpert
kubectl describe pvc -n srexpert data-srexpert-backend-postgresql-0

Check logs:

kubectl logs -n srexpert srexpert-backend-postgresql-0

Common fixes:

  • Verify StorageClass exists
  • Check storage quota
  • Ensure PV permissions are correct

Connection Refused

Symptom: Backend can’t connect to database

Verify service:

kubectl get svc -n srexpert | grep postgresql

Check password:

kubectl get secret -n srexpert srexpert-database -o jsonpath='{.data.postgres-password}' | base64 -d

UI Issues

Dashboard Not Loading

Symptom: Blank page, or the page never finishes loading

Solutions:

  1. Clear browser cache
  2. Check browser console for errors
  3. Verify backend is running
kubectl get pods -n srexpert -l app.kubernetes.io/name=srexpert-backend
  4. Check ingress configuration
kubectl get ingress -n srexpert

Login Fails

Symptom: Cannot log in

Check:

  1. Backend logs for errors
kubectl logs -n srexpert -l app.kubernetes.io/name=srexpert-backend --tail=100
  2. Cookie settings match domain
  3. CORS configuration is correct

Performance Issues

Slow Response Times

Symptom: UI is slow

Solutions:

  1. Check resource usage
kubectl top pods -n srexpert
  2. Increase resource limits
resources:
  limits:
    cpu: 2000m
    memory: 4Gi
  3. Enable Redis caching

High Memory Usage

Symptom: Pods getting OOMKilled

Solutions:

  1. Increase memory limits
  2. Check for memory leaks in logs
  3. Reduce concurrent operations
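
Before raising limits, it helps to confirm which pods were actually OOM-killed. The sketch below reads each container's last termination reason from the pod status. The function name `oom_killed_pods` is a hypothetical helper, and it only reports the most recent termination of each container.

```shell
# List pods in a namespace that have a container whose last termination
# reason was OOMKilled.
# Usage: oom_killed_pods <namespace>
oom_killed_pods() {
  # Emit "<pod-name><TAB><reasons>" per pod, then keep the matching pod names.
  kubectl get pods -n "$1" \
    -o 'jsonpath={range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

For example, `oom_killed_pods srexpert` prints one pod name per line, or nothing when no container in the namespace was last terminated for exceeding its memory limit.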

Feature-Specific Issues

SRE CLI / AI Assistant Not Responding

The SRE CLI (AI Operations Terminal) is an in-app chat that streams responses over HTTP/SSE. If it does not respond or returns an error, work through the checks below in order.

1. No AI provider configured (most common)

The SRE CLI needs an AI provider with a valid API key before it can answer. If you see a message like “no AI provider configured”, open the AI/provider settings and add a provider and its API key.

2. Provider API key invalid, expired, or rate-limited

If a provider is configured but requests fail:

  • Verify the API key is still valid and has not expired or been revoked
  • Check whether the provider is rate-limiting or returning quota errors
  • Switch to a different configured provider if the current one is unavailable

3. Missing AI permission or plan

The SRE CLI requires the AI feature, which is available on Professional plans and above. Confirm:

  • Your subscription plan includes AI features
  • Your user has the permission required to use the AI assistant

4. Cluster scope

The assistant answers in the context of the currently selected cluster. If responses seem to reference the wrong resources, confirm the correct cluster is selected before sending your query.

Metrics Not Showing

Check:

  1. Prometheus is configured
  2. Metrics server is running in cluster
  3. RBAC allows metrics access
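
The metrics server check can be sketched as follows, assuming the cluster uses the standard metrics-server, which registers the `v1beta1.metrics.k8s.io` APIService. The function name `metrics_api_available` is a hypothetical helper, and this does not cover the Prometheus configuration.

```shell
# Succeed when the metrics.k8s.io API reports the condition Available=True.
metrics_api_available() {
  status="$(kubectl get apiservice v1beta1.metrics.k8s.io \
    -o 'jsonpath={.status.conditions[?(@.type=="Available")].status}')"
  [ "$status" = "True" ]
}
```

For example, `metrics_api_available && kubectl top nodes` runs `kubectl top` only when the metrics API is reported as available.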

Logs Collection

Backend Logs

kubectl logs -n srexpert -l app.kubernetes.io/name=srexpert-backend --tail=500 > backend.log

Frontend Logs

kubectl logs -n srexpert -l app.kubernetes.io/name=srexpert-frontend --tail=500 > frontend.log

All Events

kubectl get events -n srexpert --sort-by='.lastTimestamp' > events.log
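
The three commands above can be combined into one step that gathers everything for a support request. This is a minimal sketch: the function name `collect_support_logs` and the archive layout are illustrative assumptions, not an SRExpert tool.

```shell
# Collect backend logs, frontend logs, and namespace events into one
# directory, then pack the directory into <output-dir>.tar.gz.
# Usage: collect_support_logs <output-dir> [namespace]
collect_support_logs() {
  out="$1"
  ns="${2:-srexpert}"   # default to the namespace used in this guide
  mkdir -p "$out" || return 1
  kubectl logs -n "$ns" -l app.kubernetes.io/name=srexpert-backend --tail=500 > "$out/backend.log"
  kubectl logs -n "$ns" -l app.kubernetes.io/name=srexpert-frontend --tail=500 > "$out/frontend.log"
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' > "$out/events.log"
  tar -czf "$out.tar.gz" -C "$(dirname "$out")" "$(basename "$out")"
}
```

For example, `collect_support_logs ./srexpert-support` writes the three log files into `./srexpert-support` and creates `./srexpert-support.tar.gz` next to it. Review the logs for sensitive data before sharing them.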

Getting Help

Support Channels

Information to Include

When contacting support, include:

  1. SRExpert version
  2. Kubernetes version
  3. Error messages
  4. Steps to reproduce
  5. Relevant logs

Version Information

# SRExpert version
kubectl get deployment -n srexpert srexpert-backend -o jsonpath='{.spec.template.spec.containers[0].image}'
 
# Kubernetes version (the --short flag is deprecated and removed in recent kubectl releases)
kubectl version

Common Error Messages

| Error | Meaning | Solution |
| --- | --- | --- |
| ECONNREFUSED | Can’t reach service | Check service/network |
| 401 Unauthorized | Auth failed | Check credentials |
| 403 Forbidden | No permission | Check RBAC |
| 404 Not Found | Resource missing | Verify resource exists |
| 500 Internal Error | Server error | Check backend logs |
| 503 Service Unavailable | Service down | Check pod status |