Examples
Practical examples of using the SRE CLI for common tasks.
Morning Health Check
Start your day by checking cluster status:
“How is my cluster doing?”
Your cluster “production” is healthy:
- Nodes: 5, all Ready
- Pods: 127 running, 2 pending
- CPU: 45% | Memory: 62%
- Alerts: No critical alerts
Pending pods:
- batch-job-xyz - waiting for resources
- cache-warmup-123 - scheduled for later
Investigating a Problem
When you notice something is wrong:
“What pods are failing?”
Found 3 pods with issues:
- api-gateway-7d9f8b6c5-x2k4m (CrashLoopBackOff)
  - Namespace: production
  - Restarts: 12
  - Last error: OOMKilled
- worker-5c6d7e8f9-abc12 (Error)
  - Namespace: production
  - Restarts: 3
  - Last error: Exit code 1
- batch-processor-1a2b3c4d-def56 (Pending)
  - Namespace: jobs
  - Reason: Insufficient memory
“Show me the logs from api-gateway”
Last 50 lines from api-gateway-7d9f8b6c5-x2k4m:
| Timestamp | Level | Message |
|---|---|---|
| 2024-01-15 10:23:45 | ERROR | Memory limit exceeded |
| 2024-01-15 10:23:45 | FATAL | Out of memory, shutting down |
“Why is it running out of memory?”
The api-gateway pod has a memory limit of 256Mi but is trying to use approximately 380Mi. This could be caused by:
- Memory leak in the application
- Increased traffic causing more connections
- Memory limit set too low
Recommendation: Increase the memory limit to at least 512Mi, or investigate the application for memory leaks.
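To see why 512Mi is a reasonable floor, the figures above can be checked with a few lines of Python. This is a minimal sketch, not part of the SRE CLI: `parse_memory` and `headroom` are hypothetical helpers, and the parser only handles the binary suffixes (Ki, Mi, Gi, Ti) used in this example.

```python
# Binary-suffix multipliers used by Kubernetes memory quantities.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '256Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: treat as a plain byte count


def headroom(usage: str, limit: str) -> float:
    """Fraction of the limit left unused; negative means the limit is exceeded."""
    return 1 - parse_memory(usage) / parse_memory(limit)


observed = "380Mi"  # approximate usage reported above
for limit in ("256Mi", "512Mi"):
    print(f"limit={limit}: headroom={headroom(observed, limit):.0%}")
# limit=256Mi: headroom=-48%
# limit=512Mi: headroom=26%
```

With the current limit the pod overshoots by roughly half; at 512Mi it would keep about a quarter of the limit in reserve, assuming usage stays near 380Mi.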
Security Audit
Running a security check:
“Run a security scan on production”
Running security scan on namespace “production”…
CRITICAL: 2 issues
- Image nginx:1.19 has 3 critical CVEs
- Pod db-backup runs as root
HIGH: 5 issues
- 2 pods without resource limits
- 1 service account with cluster-admin
- 2 secrets not encrypted
MEDIUM: 8 issues
- Various configuration best practices
“Tell me more about the critical CVEs”
Image: nginx:1.19
Vulnerabilities:
- CVE-2023-44487 (Critical) - HTTP/2 rapid reset attack
  - Fix: Upgrade to nginx:1.25.3
- CVE-2023-38545 (Critical) - curl heap buffer overflow
  - Fix: Upgrade to nginx:1.25.3
- CVE-2023-38546 (Critical) - curl cookie injection
  - Fix: Upgrade to nginx:1.25.3
Recommendation: Update to nginx:1.25.3 or later.
“Who has cluster-admin access?”
Service accounts with cluster-admin:
- system:admin (expected)
- jenkins-deployer (in jenkins namespace)
- Last used: 2024-01-10
- Bound by: jenkins-admin-binding
The cluster-admin binding for jenkins-deployer may be overly permissive. Consider using namespace-scoped roles instead of cluster-admin for CI/CD.
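A namespace-scoped alternative could look roughly like the following. This is a sketch under stated assumptions: the target namespace ("production"), the generated object names, and the permission set (managing Deployments only) are illustrative choices, not something the CLI reported. The manifests are built as Python dictionaries and printed as JSON, which `kubectl apply -f` accepts.

```python
import json


def namespaced_deployer(sa_name: str, sa_namespace: str, target_namespace: str):
    """Build a Role and RoleBinding that confine a service account to one namespace."""
    role_name = f"{sa_name}-role"  # hypothetical naming scheme
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": target_namespace},
        "rules": [
            {
                # Assumed permission set; trim to what the pipeline really needs.
                "apiGroups": ["apps"],
                "resources": ["deployments"],
                "verbs": ["get", "list", "watch", "create", "update", "patch"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": target_namespace},
        "subjects": [
            {"kind": "ServiceAccount", "name": sa_name, "namespace": sa_namespace}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }
    return role, binding


role, binding = namespaced_deployer("jenkins-deployer", "jenkins", "production")
print(json.dumps([role, binding], indent=2))
```

Because both objects are a Role and a RoleBinding rather than their Cluster-prefixed counterparts, the granted verbs apply only inside the target namespace.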
Compliance Check
Verifying compliance status:
“Check CIS Kubernetes benchmark compliance”
CIS Kubernetes Benchmark v1.8 Results:
| Section | Compliance |
|---|---|
| Control Plane | 85% |
| etcd | 100% |
| Control Plane Config | 90% |
| Worker Nodes | 78% |
| Policies | 72% |
Top failures:
- [FAIL] 5.2.2 - Privileged containers detected (3 pods)
- [FAIL] 5.2.6 - Containers running as root (7 pods)
- [FAIL] 5.7.1 - Default namespace in use (12 pods)
“How do I fix the privileged containers issue?”
3 pods are running with privileged: true:
- kube-proxy (kube-system) - Expected, system component
- csi-driver (storage) - Required for storage driver
- debug-pod (production) - Should be removed or fixed
For debug-pod, remove the privileged flag:
- Edit the deployment
- Remove securityContext.privileged: true
- Add securityContext.allowPrivilegeEscalation: false
This pod doesn’t need privileged access for its function.
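The same change can be expressed as a small transformation on the pod spec. The sketch below is illustrative only: `drop_privileged` is a hypothetical helper, and the container name and image are placeholders rather than details of the real debug-pod.

```python
import copy


def drop_privileged(pod_spec: dict) -> dict:
    """Return a copy of a pod spec with privileged mode removed from every container."""
    hardened = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    for container in hardened.get("containers", []):
        ctx = container.setdefault("securityContext", {})
        ctx.pop("privileged", None)              # remove the privileged flag
        ctx["allowPrivilegeEscalation"] = False  # block privilege escalation
    return hardened


# Placeholder spec standing in for debug-pod.
spec = {
    "containers": [
        {
            "name": "debug",
            "image": "busybox",
            "securityContext": {"privileged": True},
        }
    ]
}

fixed = drop_privileged(spec)
print(fixed["containers"][0]["securityContext"])
# prints {'allowPrivilegeEscalation': False}
```

Working on a copy keeps the original spec available for comparison before the change is applied to the deployment.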
Resource Analysis
Understanding resource usage:
“Which pods are using the most memory?”
Top 5 pods by memory usage:
| Pod | Namespace | Usage | Limit | % |
|---|---|---|---|---|
| postgres-primary-0 | data | 4.2 Gi | 8 Gi | 52% |
| elasticsearch-0 | logging | 3.8 Gi | 4 Gi | 95% |
| redis-master-0 | cache | 2.1 Gi | 4 Gi | 52% |
| api-gateway-xxx | production | 890 Mi | 1 Gi | 87% |
| worker-xxx | production | 756 Mi | 1 Gi | 74% |
Warning: elasticsearch-0 is at 95% of its memory limit.
“What are the memory trends for elasticsearch?”
Memory usage for elasticsearch-0 (last 24 hours):
| Time | Usage | % |
|---|---|---|
| 06:00 | 2.8 Gi | 70% |
| 12:00 | 3.2 Gi | 80% |
| 18:00 | 3.5 Gi | 87% |
| 00:00 | 3.8 Gi | 95% |
The memory usage is steadily increasing. This could indicate:
- Index growth without proper cleanup
- Memory leak
- Increased query load
Recommendation: Check index retention policies and consider increasing the memory limit or adding more replicas.
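A rough way to judge urgency is to fit a straight line through the four samples and extrapolate to the 4 Gi limit. This is a back-of-the-envelope sketch that assumes growth stays linear and that the samples are six hours apart, as in the table; `linear_fit` and `hours_until_limit` are hypothetical helpers, not CLI features.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_limit(hours, usage_gi, limit_gi):
    """Hours after the last sample at which the fitted line reaches the limit."""
    slope, intercept = linear_fit(hours, usage_gi)
    if slope <= 0:
        return None  # usage is flat or falling; no crossing to predict
    return max(0.0, (limit_gi - intercept) / slope - hours[-1])


# Samples from the table: 06:00, 12:00, 18:00, 00:00 as hours since the first one.
hours = [0, 6, 12, 18]
usage_gi = [2.8, 3.2, 3.5, 3.8]

remaining = hours_until_limit(hours, usage_gi, limit_gi=4.0)
print(f"Estimated hours until the 4 Gi limit: {remaining:.1f}")
```

The fitted slope is about 0.055 Gi per hour, which puts the limit only a few hours past the last sample if nothing changes, so the recommendation above is worth acting on promptly.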
Network Analysis
Understanding network configuration:
“What services are exposed externally?”
Externally accessible services:
| Service | Type | External IP | Ports |
|---|---|---|---|
| api-gateway | LoadBalancer | 52.23.145.67 | 443, 80 |
| web-frontend | LoadBalancer | 52.23.145.68 | 443 |
| grafana | NodePort | (any node):30080 | 3000 |
Warning: grafana is exposed via NodePort. Consider using a LoadBalancer with proper authentication or placing it behind an ingress controller.
“What network policies protect the database?”
Network policies for namespace “data”:
- deny-all-ingress
  - Denies all incoming traffic by default
- allow-api-to-postgres
  - Allows: pods with label app=api
  - To: pods with label app=postgres
  - Ports: 5432
- allow-monitoring
  - Allows: namespace kube-system
  - To: all pods
  - Ports: 9090 (metrics)
The database is well protected by a deny-by-default policy: only the traffic explicitly allowed above can reach it.
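The combined effect of these policies can be modelled in a few lines. This is a simplified sketch, not how Kubernetes evaluates NetworkPolicy objects internally: it treats deny-all-ingress as the default outcome, checks the two allow rules in turn, and ignores details such as protocol and the namespace scoping of pod selectors. The `AllowRule` class and `ingress_allowed` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AllowRule:
    name: str
    port: int
    to_labels: dict = field(default_factory=dict)    # empty = all pods
    from_labels: dict = field(default_factory=dict)  # empty = any source labels
    from_namespace: Optional[str] = None             # None = any namespace


def matches(labels: dict, selector: dict) -> bool:
    """True when every key/value in the selector is present in the labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(rules, src_ns, src_labels, dst_labels, port) -> bool:
    """Deny by default; allow only if some rule matches source, target and port."""
    for rule in rules:
        if rule.port != port:
            continue
        if not matches(dst_labels, rule.to_labels):
            continue
        if rule.from_namespace is not None and rule.from_namespace != src_ns:
            continue
        if not matches(src_labels, rule.from_labels):
            continue
        return True
    return False  # nothing matched, so deny-all-ingress applies


rules = [
    AllowRule("allow-api-to-postgres", 5432,
              to_labels={"app": "postgres"}, from_labels={"app": "api"}),
    AllowRule("allow-monitoring", 9090, from_namespace="kube-system"),
]

postgres = {"app": "postgres"}
print(ingress_allowed(rules, "production", {"app": "api"}, postgres, 5432))  # True
print(ingress_allowed(rules, "production", {"app": "web"}, postgres, 5432))  # False
print(ingress_allowed(rules, "kube-system", {}, postgres, 9090))             # True
```

Any request that does not match an allow rule falls through to the final `return False`, which is the deny-by-default behaviour the policies provide.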
Tips for Effective Use
Start Broad, Then Narrow
“What’s wrong with my cluster?” → get overview
“Tell me more about the failing pods” → focus on specific issue
“Show me the logs from that pod” → drill down to details
Use Context
The AI remembers your conversation, so follow-up questions can refer to earlier answers:
“Show me high CPU pods”
“Why is the first one using so much?”
“How can I fix it?”
Ask for Recommendations
“What should I do about this security issue?”
“What’s the best practice here?”
“How can I improve this configuration?”