Troubleshooting
Common issues and solutions for SRExpert.
Installation Issues
Helm Installation Fails
Symptom: helm install command fails
Solutions:
- Check Helm version (v3.x required)
helm version
- Verify repository is added
helm repo list
helm repo add srexpert-helm https://nexus.srexpert.io/repository/srexpert-helm/
helm repo update
- Check namespace exists, and create it if missing
kubectl get namespace srexpert
kubectl create namespace srexpert
Pods Not Starting
Symptom: Pods stuck in Pending or CrashLoopBackOff
Check resources:
kubectl describe pod -n srexpert <pod-name>
kubectl logs -n srexpert <pod-name>
Common causes:
| Issue | Solution |
|---|---|
| ImagePullBackOff | Check registry credentials |
| Insufficient resources | Increase node capacity |
| PVC pending | Check StorageClass |
| ConfigMap missing | Verify Helm values |
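The table above can be turned into a quick triage pass. The following is a minimal sketch, not part of SRExpert: the function names `hint_for_reason` and `list_waiting_pods` are hypothetical, and the hints only cover container waiting reasons. Scheduling problems such as insufficient resources or a pending PVC show up in `kubectl describe pod` events instead.

```shell
# Map a container waiting reason to a first check. Hint wording is illustrative.
hint_for_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "Check registry credentials" ;;
    CrashLoopBackOff)              echo "Check container logs" ;;
    CreateContainerConfigError)    echo "Verify Helm values (ConfigMap or Secret missing)" ;;
    *)                             echo "Run kubectl describe pod for details" ;;
  esac
}

# Print "<pod> <reason> <hint>" (tab-separated) for pods with a waiting container.
# Pods whose containers are all running print nothing.
list_waiting_pods() {
  local ns="${1:-srexpert}"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' |
  while IFS=$'\t' read -r pod reasons; do
    [ -n "$reasons" ] || continue
    # When several containers are waiting, report the first reason only.
    reason="${reasons%% *}"
    printf '%s\t%s\t%s\n' "$pod" "$reason" "$(hint_for_reason "$reason")"
  done
}
```

Run `list_waiting_pods srexpert` to get one line per affected pod, then follow up with `kubectl describe pod` on each.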
Image Pull Errors
Symptom: ErrImagePull or ImagePullBackOff
Solutions:
- Verify secret exists
kubectl get secret nexus-registry -n srexpert
- Check secret configuration
kubectl get secret nexus-registry -n srexpert -o yaml
- Recreate secret if needed (delete the old one first, otherwise the create fails because the secret already exists)
kubectl delete secret nexus-registry -n srexpert --ignore-not-found
kubectl create secret docker-registry nexus-registry \
  --docker-server=registry.srexpert.io \
  --docker-username=YOUR_USER \
  --docker-password=YOUR_PASS \
  -n srexpert
Connection Issues
Cannot Connect to Cluster
Symptom: Cluster shows “Disconnected” status
Check:
- Kubeconfig is valid
kubectl --kubeconfig=your-kubeconfig get nodes
- Network connectivity
curl -k https://your-cluster-api:6443/healthz
- Certificate validity
kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | base64 -d | openssl x509 -noout -dates
API Server Timeout
Symptom: Operations timeout
Solutions:
- Check network latency
- Increase timeout in settings
- Verify firewall rules allow port 6443
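To put a number on "network latency", one option is to time a request to the API server health endpoint. This is a rough sketch under assumptions: `api_latency` and `is_slow` are hypothetical helper names, the 1.0 second threshold is an arbitrary example, and `https://your-cluster-api:6443` is the placeholder address used earlier in this guide.

```shell
# Print the total request time in seconds for <base-url>/healthz.
# -k skips TLS verification, as in the connectivity check above.
# Optional second argument: timeout in seconds (default 10).
api_latency() {
  curl -k -s -o /dev/null --max-time "${2:-10}" -w '%{time_total}' "$1/healthz"
}

# Succeed (exit 0) when the measured time exceeds the threshold; both in seconds.
is_slow() {
  awk -v t="$1" -v max="$2" 'BEGIN { exit !(t > max) }'
}

# Example usage (placeholder address):
#   t="$(api_latency https://your-cluster-api:6443)"
#   if is_slow "$t" 1.0; then echo "API server took ${t}s to respond"; fi
```

If the request times out entirely rather than responding slowly, a firewall blocking port 6443 is the more likely cause.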
Authentication Failed
Symptom: 401 Unauthorized errors
Check:
- Token hasn’t expired
- ServiceAccount exists
- RBAC bindings are correct
kubectl auth can-i --list --as=system:serviceaccount:srexpert-system:srexpert
Database Issues
PostgreSQL Not Starting
Symptom: PostgreSQL pod fails to start
Check PVC:
kubectl get pvc -n srexpert
kubectl describe pvc -n srexpert data-srexpert-backend-postgresql-0
Check logs:
kubectl logs -n srexpert srexpert-backend-postgresql-0
Common fixes:
- Verify StorageClass exists
- Check storage quota
- Ensure PV permissions
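The first of these fixes can be checked from the command line. The sketch below assumes the PVC name shown above; `pvc_storage_class` and `check_pvc_storage_class` are hypothetical helper names, not SRExpert tooling.

```shell
# Print the StorageClass a PVC requests (empty if none was set).
pvc_storage_class() {
  kubectl get pvc "$1" -n "${2:-srexpert}" -o jsonpath='{.spec.storageClassName}'
}

# Succeed when the StorageClass named by the PVC exists in the cluster.
check_pvc_storage_class() {
  local sc
  sc="$(pvc_storage_class "$1" "${2:-srexpert}")"
  if [ -z "$sc" ]; then
    echo "PVC $1 names no StorageClass; check that a default StorageClass exists"
    return 1
  fi
  if kubectl get storageclass "$sc" >/dev/null 2>&1; then
    echo "StorageClass $sc exists"
  else
    echo "StorageClass $sc not found"
    return 1
  fi
}

# Example usage:
#   check_pvc_storage_class data-srexpert-backend-postgresql-0 srexpert
#   kubectl get resourcequota -n srexpert   # shows storage quota, if any is defined
```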
Connection Refused
Symptom: Backend can’t connect to database
Verify service:
kubectl get svc -n srexpert | grep postgresql
Check password:
kubectl get secret -n srexpert srexpert-database -o jsonpath='{.data.postgres-password}' | base64 -d
UI Issues
Dashboard Not Loading
Symptom: Blank page or loading forever
Solutions:
- Clear browser cache
- Check browser console for errors
- Verify backend is running
kubectl get pods -n srexpert -l app.kubernetes.io/name=srexpert-backend
- Check ingress configuration
kubectl get ingress -n srexpert
Login Fails
Symptom: Cannot log in
Check:
- Backend logs for errors
kubectl logs -n srexpert -l app.kubernetes.io/name=srexpert-backend --tail=100
- Cookie settings match domain
- CORS configuration is correct
Performance Issues
Slow Response Times
Symptom: UI is slow
Solutions:
- Check resource usage
kubectl top pods -n srexpert
- Increase resource limits
resources:
  limits:
    cpu: 2000m
    memory: 4Gi
- Enable Redis caching
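This guide does not say which Helm values key carries the resource limits, so the sketch below tries them directly on the Deployment instead. It assumes the backend Deployment is named `srexpert-backend`, as in the Version Information commands; `show_limits` and `set_limits` are hypothetical helper names. A change made this way is overwritten by the next `helm upgrade`, so treat it as a temporary test and move the values into your Helm configuration once they work.

```shell
# Print the current limits of the first container in a Deployment.
show_limits() {
  kubectl get deployment "$1" -n "${2:-srexpert}" \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits}'
}

# Set CPU and memory limits on a Deployment.
# Arguments: <deployment> <namespace> <cpu> <memory>
set_limits() {
  kubectl set resources deployment "$1" -n "$2" --limits="cpu=$3,memory=$4"
}

# Example usage:
#   show_limits srexpert-backend srexpert
#   set_limits srexpert-backend srexpert 2000m 4Gi
```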
High Memory Usage
Symptom: Pods getting OOMKilled
Solutions:
- Increase memory limits
- Check for memory leaks in logs
- Reduce concurrent operations
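Before raising limits, it helps to confirm which pods were actually OOMKilled and how often they restarted. A minimal sketch, with `list_oomkilled` as a hypothetical helper name:

```shell
# Print "<pod> restarts=<count>" (tab-separated) for pods whose last container
# termination reason includes OOMKilled.
list_oomkilled() {
  local ns="${1:-srexpert}"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}' |
  awk -F'\t' '$2 ~ /OOMKilled/ { print $1 "\trestarts=" $3 }'
}

# Example usage:
#   list_oomkilled srexpert
```

For pods with several containers, the restart counts are printed space-separated in container order.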
Feature-Specific Issues
SRE CLI / AI Assistant Not Responding
The SRE CLI (AI Operations Terminal) is an in-app chat that streams responses over HTTP/SSE. If it does not respond or returns an error, work through the checks below in order.
1. No AI provider configured (most common)
The SRE CLI needs an AI provider with a valid API key before it can answer. If you see a message like “no AI provider configured”, open the AI/provider settings and add a provider and its API key.
2. Provider API key invalid, expired, or rate-limited
If a provider is configured but requests fail:
- Verify the API key is still valid and has not expired or been revoked
- Check whether the provider is rate-limiting or returning quota errors
- Switch to a different configured provider if the current one is unavailable
3. Missing AI permission or plan
The SRE CLI requires the AI feature, which is available on Professional plans and above. Confirm:
- Your subscription plan includes AI features
- Your user has the permission required to use the AI assistant
4. Cluster scope
The assistant answers in the context of the currently selected cluster. If responses seem to reference the wrong resources, confirm the correct cluster is selected before sending your query.
Metrics Not Showing
Check:
- Prometheus is configured
- Metrics server is running in cluster
- RBAC allows metrics access
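Two of these checks can be scripted. The sketch below assumes metrics-server exposes the standard `v1beta1.metrics.k8s.io` APIService and tests RBAC for whichever identity your kubeconfig uses; `check_metrics` is a hypothetical helper name. To test the SRExpert service account instead, add an `--as=system:serviceaccount:<namespace>:<name>` flag to the `can-i` call.

```shell
# Report whether the metrics API is registered and readable in a namespace.
check_metrics() {
  local ns="${1:-srexpert}"
  # metrics-server serves this aggregated API; kubectl top depends on it.
  if kubectl get apiservice v1beta1.metrics.k8s.io >/dev/null 2>&1; then
    echo "metrics API registered"
  else
    echo "metrics API missing: install or repair metrics-server"
    return 1
  fi
  # can-i exits non-zero when the action is not allowed.
  if kubectl auth can-i get pods.metrics.k8s.io -n "$ns" >/dev/null 2>&1; then
    echo "RBAC allows reading pod metrics"
  else
    echo "RBAC denies reading pod metrics in $ns"
    return 1
  fi
}

# Example usage:
#   check_metrics srexpert && kubectl top pods -n srexpert
```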
Logs Collection
Backend Logs
kubectl logs -n srexpert -l app.kubernetes.io/name=srexpert-backend --tail=500 > backend.log
Frontend Logs
kubectl logs -n srexpert -l app.kubernetes.io/name=srexpert-frontend --tail=500 > frontend.log
All Events
kubectl get events -n srexpert --sort-by='.lastTimestamp' > events.log
Getting Help
Support Channels
- Documentation: https://docs.srexpert.io
- Support Portal: https://srexpert.atlassian.net/servicedesk/customer/portal/1
- Email: [email protected]
Information to Include
When contacting support, include:
- SRExpert version
- Kubernetes version
- Error messages
- Steps to reproduce
- Relevant logs
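Most of the items above can be gathered in one pass. The following is a sketch, not an official SRExpert tool: `collect_support_bundle` and the default output directory `srexpert-support` are hypothetical names, and the commands are the ones already listed in this guide. Review the collected logs for credentials or other sensitive data before sending them.

```shell
# Collect version info, logs, and events into <out>/ and pack them as <out>.tar.gz.
# Arguments: [namespace] [output directory]
collect_support_bundle() {
  local ns="${1:-srexpert}" out="${2:-srexpert-support}"
  mkdir -p "$out"
  kubectl get deployment -n "$ns" srexpert-backend \
    -o jsonpath='{.spec.template.spec.containers[0].image}' > "$out/srexpert-version.txt" 2>&1
  kubectl version > "$out/kubernetes-version.txt" 2>&1
  kubectl logs -n "$ns" -l app.kubernetes.io/name=srexpert-backend --tail=500 > "$out/backend.log" 2>&1
  kubectl logs -n "$ns" -l app.kubernetes.io/name=srexpert-frontend --tail=500 > "$out/frontend.log" 2>&1
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' > "$out/events.log" 2>&1
  # Archive relative to the parent directory so the tarball has clean paths.
  tar -czf "$out.tar.gz" -C "$(dirname "$out")" "$(basename "$out")"
  echo "$out.tar.gz"
}

# Example usage:
#   collect_support_bundle srexpert ./srexpert-support
```

Error messages and steps to reproduce still need to be written up by hand.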
Version Information
# SRExpert version
kubectl get deployment -n srexpert srexpert-backend -o jsonpath='{.spec.template.spec.containers[0].image}'
# Kubernetes version
kubectl version
Common Error Messages
| Error | Meaning | Solution |
|---|---|---|
| ECONNREFUSED | Can’t reach service | Check service/network |
| 401 Unauthorized | Auth failed | Check credentials |
| 403 Forbidden | No permission | Check RBAC |
| 404 Not Found | Resource missing | Verify resource exists |
| 500 Internal Error | Server error | Check backend logs |
| 503 Service Unavailable | Service down | Check pod status |
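The table can also be applied to a live endpoint. This sketch assumes you supply the URL yourself, since this guide does not document a specific SRExpert health endpoint; `hint_for_status` and `probe` are hypothetical helper names.

```shell
# Map an HTTP status code to the first check from the table above.
# curl prints 000 for %{http_code} when it received no HTTP response at all
# (refused connection, timeout, DNS failure), which covers ECONNREFUSED.
hint_for_status() {
  case "$1" in
    000) echo "Check service/network" ;;
    401) echo "Check credentials" ;;
    403) echo "Check RBAC" ;;
    404) echo "Verify resource exists" ;;
    500) echo "Check backend logs" ;;
    503) echo "Check pod status" ;;
    *)   echo "No entry in the table for status $1" ;;
  esac
}

# Request a URL and print "<status> <hint>".
probe() {
  local code
  code="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1")"
  echo "$code $(hint_for_status "$code")"
}

# Example usage (replace with your own SRExpert URL):
#   probe https://srexpert.example.com/
```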