Troubleshooting

Common issues and solutions for SRExpert.

Installation Issues

Helm Installation Fails

Symptom: helm install command fails

Solutions:

  1. Check the Helm version (v3.x is required)
helm version
  2. Verify that the repository is added, and add it if it is missing
helm repo list
helm repo add srexpert-helm https://nexus.srexpert.io/repository/srexpert-helm/
helm repo update
  3. Make sure the namespace exists, creating it if necessary
kubectl get namespace srexpert
kubectl create namespace srexpert
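
The Helm version requirement can also be checked from a script. The following is a minimal sketch, assuming the output format of `helm version --short` (for example `v3.14.2+gc309b6f`). The function name `helm_is_v3` is hypothetical and is not part of Helm or SRExpert.

```shell
# helm_is_v3 VERSION_STRING
# Succeeds when the major version in VERSION_STRING is 3.
# Accepts "v3.14.2+gc309b6f" as well as the Helm 2 form "Client: v2.17.0+ga690bad".
helm_is_v3() {
  version="${1#*v}"      # drop everything up to and including the first "v"
  major="${version%%.*}" # keep only the part before the first dot
  [ "$major" = "3" ]
}

# Usage (requires helm on the PATH):
#   helm_is_v3 "$(helm version --short)" || echo "Helm v3.x is required"
```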

Pods Not Starting

Symptom: Pods stuck in Pending or CrashLoopBackOff

Check resources:

kubectl describe pod -n srexpert <pod-name>
kubectl logs -n srexpert <pod-name>

Common causes:

| Issue | Solution |
|---|---|
| ImagePullBackOff | Check registry credentials |
| Insufficient resources | Increase node capacity |
| PVC pending | Check StorageClass |
| ConfigMap missing | Verify Helm values |
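
When several pods are failing at once, the common causes above can be turned into a quick lookup. This is a sketch only: `pod_status_hint` is a hypothetical helper, and the mapping of `Pending` and `CreateContainerConfigError` to the listed causes is an assumption based on how Kubernetes usually reports them.

```shell
# pod_status_hint STATUS
# Prints the first thing to check for a given pod status.
pod_status_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "Check registry credentials" ;;
    Pending)                       echo "Check node capacity and PVC/StorageClass" ;;
    CreateContainerConfigError)    echo "Verify Helm values (ConfigMap or Secret missing)" ;;
    CrashLoopBackOff)              echo "Inspect container logs" ;;
    *)                             echo "Run kubectl describe pod for details" ;;
  esac
}

# Usage (requires cluster access):
#   kubectl get pods -n srexpert --no-headers |
#     while read -r name ready status rest; do
#       echo "$name: $(pod_status_hint "$status")"
#     done
```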

Image Pull Errors

Symptom: ErrImagePull or ImagePullBackOff

Solutions:

  1. Verify that the secret exists
kubectl get secret nexus-registry -n srexpert
  2. Check the secret configuration
kubectl get secret nexus-registry -n srexpert -o yaml
  3. Recreate the secret if needed
kubectl create secret docker-registry nexus-registry \
  --docker-server=registry.srexpert.io \
  --docker-username=YOUR_USER \
  --docker-password=YOUR_PASS \
  -n srexpert
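
The YAML output of the secret is base64-encoded, so a wrong registry host is easy to miss. The sketch below assumes the secret was created with `kubectl create secret docker-registry`, which stores its data under the `.dockerconfigjson` key. `dockerconfig_has_registry` is a hypothetical helper name.

```shell
# dockerconfig_has_registry HOST
# Reads a decoded .dockerconfigjson document on stdin and succeeds when HOST
# appears in it as a quoted string, e.g. {"auths":{"registry.srexpert.io":{...}}}.
dockerconfig_has_registry() {
  grep -qF "\"$1\""
}

# Usage (requires cluster access):
#   kubectl get secret nexus-registry -n srexpert \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d |
#     dockerconfig_has_registry registry.srexpert.io || echo "registry host not found in secret"
```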

Connection Issues

Cannot Connect to Cluster

Symptom: Cluster shows “Disconnected” status

Check:

  1. Kubeconfig is valid
kubectl --kubeconfig=your-kubeconfig get nodes
  2. Network connectivity
curl -k https://your-cluster-api:6443/healthz
  3. Certificate validity
kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | base64 -d | openssl x509 -noout -dates

API Server Timeout

Symptom: Operations timeout

Solutions:

  • Check network latency
  • Increase timeout in settings
  • Verify firewall rules allow port 6443

Authentication Failed

Symptom: 401 Unauthorized errors

Check:

  1. Token hasn’t expired
  2. ServiceAccount exists
  3. RBAC bindings are correct
kubectl auth can-i --list --as=system:serviceaccount:srexpert-system:srexpert
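
To test specific permissions rather than read the full `--list` output, `kubectl auth can-i` also accepts a single verb and resource. The following sketch loops over such pairs. `check_rbac` is a hypothetical helper, and the verb/resource pairs in the usage line are placeholders, since the permissions SRExpert actually needs are not listed here.

```shell
# check_rbac SERVICE_ACCOUNT VERB RESOURCE [VERB RESOURCE ...]
# Prints one line per verb/resource pair with the answer reported by kubectl.
check_rbac() {
  sa="$1"
  shift
  while [ "$#" -ge 2 ]; do
    # kubectl exits non-zero when the answer is "no", so ignore the exit code
    answer="$(kubectl auth can-i "$1" "$2" --as="$sa" 2>/dev/null || true)"
    printf '%s %s: %s\n' "$1" "$2" "${answer:-unknown}"
    shift 2
  done
}

# Usage (requires cluster access; the pairs below are examples only):
#   check_rbac system:serviceaccount:srexpert-system:srexpert get pods list nodes
```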

Database Issues

PostgreSQL Not Starting

Symptom: PostgreSQL pod fails to start

Check PVC:

kubectl get pvc -n srexpert
kubectl describe pvc -n srexpert data-srexpert-backend-postgresql-0

Check logs:

kubectl logs -n srexpert srexpert-backend-postgresql-0

Common fixes:

  • Verify StorageClass exists
  • Check storage quota
  • Ensure the PostgreSQL container has write permission on the PersistentVolume

Connection Refused

Symptom: Backend can’t connect to database

Verify service:

kubectl get svc -n srexpert | grep postgresql

Check password:

kubectl get secret -n srexpert srexpert-database -o jsonpath='{.data.postgres-password}' | base64 -d

UI Issues

Dashboard Not Loading

Symptom: Blank page or loading forever

Solutions:

  1. Clear browser cache
  2. Check browser console for errors
  3. Verify backend is running
kubectl get pods -n srexpert -l app.kubernetes.io/name=srexpert-backend
  4. Check ingress configuration
kubectl get ingress -n srexpert

Login Fails

Symptom: Cannot log in

Check:

  1. Backend logs for errors
kubectl logs -n srexpert -l app.kubernetes.io/name=srexpert-backend --tail=100
  2. Cookie settings match domain
  3. CORS configuration is correct

Performance Issues

Slow Response Times

Symptom: UI is slow

Solutions:

  1. Check resource usage
kubectl top pods -n srexpert
  2. Increase resource limits
resources:
  limits:
    cpu: 2000m
    memory: 4Gi
  3. Enable Redis caching
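
The resource usage check can be narrowed to the pods that are close to their limit. This is a minimal sketch, assuming `kubectl top pods --no-headers` prints three columns (name, CPU, memory) with memory in Mi. `pods_over_memory` is a hypothetical helper, and the threshold is up to you.

```shell
# pods_over_memory THRESHOLD_MI
# Reads `kubectl top pods --no-headers` output on stdin and prints the name and
# memory usage of every pod using more than THRESHOLD_MI mebibytes.
pods_over_memory() {
  awk -v limit="$1" '
    $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)              # strip the unit to compare numerically
      if (mem + 0 > limit + 0) print $1, $3
    }'
}

# Usage (requires cluster access and a running metrics server):
#   kubectl top pods -n srexpert --no-headers | pods_over_memory 3500
```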

High Memory Usage

Symptom: Pods getting OOMKilled

Solutions:

  1. Increase memory limits
  2. Check for memory leaks in logs
  3. Reduce concurrent operations
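
Before raising limits, it helps to confirm which pods were actually OOMKilled. The sketch below filters the last termination reason of each container, using a `kubectl` JSONPath query shown in the usage comment. `oom_killed_pods` is a hypothetical helper name.

```shell
# oom_killed_pods
# Reads "NAME<TAB>REASONS" lines on stdin and prints the names of pods whose
# last container termination reason includes OOMKilled.
oom_killed_pods() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n srexpert \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
#     oom_killed_pods
```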

Feature-Specific Issues

SRE CLI Not Working

Check:

  1. WebSocket connectivity
  2. AI service is enabled
  3. Kubeconfig has exec permissions

AI Assistant Not Responding

Check:

  1. License includes AI features
  2. AI service is enabled
  3. Network allows outbound connections

Metrics Not Showing

Check:

  1. Prometheus is configured
  2. Metrics server is running in cluster
  3. RBAC allows metrics access

Logs Collection

Backend Logs

kubectl logs -n srexpert -l app.kubernetes.io/name=srexpert-backend --tail=500 > backend.log

Frontend Logs

kubectl logs -n srexpert -l app.kubernetes.io/name=srexpert-frontend --tail=500 > frontend.log

All Events

kubectl get events -n srexpert --sort-by='.lastTimestamp' > events.log
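
The three commands above can be wrapped in one function so that all files end up in a single directory for support. This is a sketch under the same label and namespace assumptions as the commands above. `collect_support_logs` is a hypothetical helper name.

```shell
# collect_support_logs OUTPUT_DIR
# Writes backend.log, frontend.log and events.log into OUTPUT_DIR.
collect_support_logs() {
  out_dir="$1"
  mkdir -p "$out_dir" || return 1
  for component in backend frontend; do
    kubectl logs -n srexpert -l "app.kubernetes.io/name=srexpert-$component" \
      --tail=500 > "$out_dir/$component.log" 2>&1
  done
  kubectl get events -n srexpert --sort-by='.lastTimestamp' > "$out_dir/events.log" 2>&1
}

# Usage (requires cluster access):
#   collect_support_logs ./srexpert-logs && tar czf srexpert-logs.tar.gz srexpert-logs
```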

Getting Help

Support Channels

Information to Include

When contacting support, include:

  1. SRExpert version
  2. Kubernetes version
  3. Error messages
  4. Steps to reproduce
  5. Relevant logs

Version Information

# SRExpert version
kubectl get deployment -n srexpert srexpert-backend -o jsonpath='{.spec.template.spec.containers[0].image}'

# Kubernetes version (the --short flag was removed in kubectl 1.28; the default output is already concise)
kubectl version

Common Error Messages

| Error | Meaning | Solution |
|---|---|---|
| ECONNREFUSED | Can’t reach service | Check service/network |
| 401 Unauthorized | Auth failed | Check credentials |
| 403 Forbidden | No permission | Check RBAC |
| 404 Not Found | Resource missing | Verify resource exists |
| 500 Internal Error | Server error | Check backend logs |
| 503 Service Unavailable | Service down | Check pod status |