SRE CLI Examples

Examples

Practical examples of using the SRE CLI for common tasks.

Morning Health Check

Start your day by checking cluster status:

“How is my cluster doing?”

Your cluster “production” is healthy:

  • Nodes: 5, all Ready
  • Pods: 127 running, 2 pending
  • CPU: 45% | Memory: 62%
  • Alerts: No critical alerts

Pending pods:

  • batch-job-xyz - waiting for resources
  • cache-warmup-123 - scheduled for later

Investigating a Problem

When you notice something is wrong:

“What pods are failing?”

Found 3 pods with issues:

  1. api-gateway-7d9f8b6c5-x2k4m (CrashLoopBackOff)

    • Namespace: production
    • Restarts: 12
    • Last error: OOMKilled
  2. worker-5c6d7e8f9-abc12 (Error)

    • Namespace: production
    • Restarts: 3
    • Last error: Exit code 1
  3. batch-processor-1a2b3c4d-def56 (Pending)

    • Namespace: jobs
    • Reason: Insufficient memory

“Show me the logs from api-gateway”

Last 50 lines from api-gateway-7d9f8b6c5-x2k4m:

Timestamp            Level  Message
2024-01-15 10:23:45  ERROR  Memory limit exceeded
2024-01-15 10:23:45  FATAL  Out of memory, shutting down

“Why is it running out of memory?”

The api-gateway pod has a memory limit of 256Mi but is trying to use approximately 380Mi. This could be caused by:

  1. Memory leak in the application
  2. Increased traffic causing more connections
  3. Memory limit set too low

Recommendation: Increase the memory limit to at least 512Mi, or investigate the application for memory leaks.
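Applying that recommendation might look like the following manifest excerpt. The Deployment and container names, the namespace, and the request value are illustrative assumptions, not values read from a real cluster:

```yaml
# Hypothetical excerpt from the api-gateway Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway        # assumed workload name
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: api-gateway
          resources:
            requests:
              memory: "256Mi"   # what the scheduler reserves (assumed)
            limits:
              memory: "512Mi"   # raised from 256Mi per the recommendation
```

Raising only the limit keeps scheduling behavior unchanged while giving the container headroom; if the usage keeps climbing past 512Mi, that points back at a leak rather than an undersized limit.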

Security Audit

Running a security check:

“Run a security scan on production”

Running security scan on namespace “production”…

CRITICAL: 2 issues

  • Image nginx:1.19 has 3 critical CVEs
  • Pod db-backup runs as root

HIGH: 5 issues

  • 2 pods without resource limits
  • 1 service account with cluster-admin
  • 2 secrets not encrypted

MEDIUM: 8 issues

  • Assorted deviations from configuration best practices (see the full report for details)

“Tell me more about the critical CVEs”

Image: nginx:1.19

Vulnerabilities:

  1. CVE-2023-44487 (Critical) - HTTP/2 rapid reset attack

    • Fix: Upgrade to nginx:1.25.3
  2. CVE-2023-38545 (Critical) - curl heap buffer overflow

    • Fix: Upgrade to nginx:1.25.3
  3. CVE-2023-38546 (Critical) - curl cookie injection

    • Fix: Upgrade to nginx:1.25.3

Recommendation: Update to nginx:1.25.3 or later.
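A sketch of the fix, assuming the image is referenced from a Deployment's pod template (the container name here is a placeholder):

```yaml
# Bump the image tag wherever nginx:1.19 is referenced.
spec:
  template:
    spec:
      containers:
        - name: nginx              # assumed container name
          image: nginx:1.25.3      # was nginx:1.19; includes the CVE fixes
```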


“Who has cluster-admin access?”

Service accounts with cluster-admin:

  1. system:admin (expected)
  2. jenkins-deployer (in jenkins namespace)
    • Last used: 2024-01-10
    • Bound by: jenkins-admin-binding

This may be overly permissive. Consider using namespace-scoped roles instead of cluster-admin for CI/CD.
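One way to narrow that access is to replace the cluster-admin binding with a namespace-scoped Role and RoleBinding. The resource list and verbs below are illustrative assumptions; trim them to what the pipeline actually needs:

```yaml
# Hypothetical namespace-scoped replacement for the cluster-admin binding.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jenkins-deployer
  namespace: production          # the namespace Jenkins deploys into
rules:
  - apiGroups: ["", "apps"]
    resources: ["deployments", "services", "configmaps"]   # assumed needs
    verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jenkins-deployer-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: jenkins-deployer
    namespace: jenkins
roleRef:
  kind: Role
  name: jenkins-deployer
  apiGroup: rbac.authorization.k8s.io
```

If Jenkins deploys into several namespaces, repeat the Role and RoleBinding per namespace rather than falling back to a ClusterRoleBinding.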

Compliance Check

Verifying compliance status:

“Check CIS Kubernetes benchmark compliance”

CIS Kubernetes Benchmark v1.8 Results:

Section               Compliance
Control Plane         85%
etcd                  100%
Control Plane Config  90%
Worker Nodes          78%
Policies              72%

Top failures:

  1. [FAIL] 5.2.2 - Privileged containers detected (3 pods)
  2. [FAIL] 5.2.6 - Containers running as root (7 pods)
  3. [FAIL] 5.7.1 - Default namespace in use (12 pods)

“How do I fix the privileged containers issue?”

3 pods are running with privileged: true:

  1. kube-proxy (kube-system) - Expected, system component
  2. csi-driver (storage) - Required for storage driver
  3. debug-pod (production) - Should be removed or fixed

For debug-pod, remove the privileged flag:

  • Edit the deployment
  • Remove securityContext.privileged: true
  • Add securityContext.allowPrivilegeEscalation: false

This pod doesn’t need privileged access for its function.
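The resulting securityContext could look like this sketch (the container name is assumed, and runAsNonRoot is optional extra hardening only if the image supports a non-root user):

```yaml
# Hypothetical securityContext for debug-pod after removing privileged access.
spec:
  containers:
    - name: debug                          # assumed container name
      securityContext:
        privileged: false                  # was privileged: true
        allowPrivilegeEscalation: false    # block setuid/capability escalation
        runAsNonRoot: true                 # optional, if the image allows it
```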

Resource Analysis

Understanding resource usage:

“Which pods are using the most memory?”

Top 10 pods by memory usage:

Pod                 Namespace   Usage   Limit  %
postgres-primary-0  data        4.2 Gi  8 Gi   52%
elasticsearch-0     logging     3.8 Gi  4 Gi   95%
redis-master-0      cache       2.1 Gi  4 Gi   52%
api-gateway-xxx     production  890 Mi  1 Gi   87%
worker-xxx          production  756 Mi  1 Gi   74%

Warning: elasticsearch-0 is at 95% of its memory limit.


“What are the memory trends for elasticsearch?”

Memory usage for elasticsearch-0 (last 24 hours):

Time   Usage   %
06:00  2.8 Gi  70%
12:00  3.2 Gi  80%
18:00  3.5 Gi  87%
00:00  3.8 Gi  95%

The memory usage is steadily increasing. This could indicate:

  1. Index growth without proper cleanup
  2. Memory leak
  3. Increased query load

Recommendation: Check index retention policies and consider increasing the memory limit or adding more replicas.

Network Analysis

Understanding network configuration:

“What services are exposed externally?”

Externally accessible services:

Service       Type          External IP       Ports
api-gateway   LoadBalancer  52.23.145.67      443, 80
web-frontend  LoadBalancer  52.23.145.68      443
grafana       NodePort      (any node):30080  3000

Warning: grafana is exposed via NodePort. Consider using a LoadBalancer with proper authentication or placing it behind an ingress controller.
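Moving grafana behind an ingress controller could look roughly like this. The hostname, TLS secret, namespace, and ingress class are all placeholders; the Service would also need to change from NodePort to ClusterIP:

```yaml
# Hypothetical Ingress fronting the grafana Service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring          # assumed namespace
spec:
  ingressClassName: nginx        # assumed ingress controller
  tls:
    - hosts: ["grafana.example.com"]
      secretName: grafana-tls    # placeholder TLS secret
  rules:
    - host: grafana.example.com  # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
```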


“What network policies protect the database?”

Network policies for namespace “data”:

  1. deny-all-ingress

    • Denies all incoming traffic by default
  2. allow-api-to-postgres

    • Allows: pods with label app=api
    • To: pods with label app=postgres
    • Ports: 5432
  3. allow-monitoring

    • Allows: namespace kube-system
    • To: all pods
    • Ports: 9090 (metrics)

The database is well protected with deny-by-default policy.
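For reference, the two key policies described above could be expressed roughly like this; the label selectors are inferred from the summary rather than read from the cluster:

```yaml
# Sketch of a deny-by-default ingress policy plus a targeted allow rule.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: data
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes: ["Ingress"]   # no ingress rules listed => deny all incoming
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-postgres
  namespace: data
spec:
  podSelector:
    matchLabels:
      app: postgres          # assumed label from the summary
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api       # assumed label from the summary
      ports:
        - protocol: TCP
          port: 5432
```

Because NetworkPolicies are additive, the allow rule punches a specific hole through the deny-all baseline without weakening anything else.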

Tips for Effective Use

Start Broad, Then Narrow

“What’s wrong with my cluster?” → get overview

“Tell me more about the failing pods” → focus on specific issue

“Show me the logs from that pod” → drill down to details

Use Context

The AI remembers your conversation:

“Show me high CPU pods”

“Why is the first one using so much?”

“How can I fix it?”

Ask for Recommendations

“What should I do about this security issue?”

“What’s the best practice here?”

“How can I improve this configuration?”