SRE CLI Examples

Examples

Practical examples of using the SRE CLI for common tasks.

Morning Health Check

Start your day by checking cluster status:

“How is my cluster doing?”

Your cluster “production” is healthy:

  • Nodes: 5, all Ready
  • Pods: 127 running, 2 pending
  • CPU: 45% | Memory: 62%
  • Alerts: No critical alerts

Pending pods:

  • batch-job-xyz - waiting for resources
  • cache-warmup-123 - scheduled for later

Investigating a Problem

When you notice something is wrong:

“What pods are failing?”

Found 3 pods with issues:

  1. api-gateway-7d9f8b6c5-x2k4m (CrashLoopBackOff)

    • Namespace: production
    • Restarts: 12
    • Last error: OOMKilled
  2. worker-5c6d7e8f9-abc12 (Error)

    • Namespace: production
    • Restarts: 3
    • Last error: Exit code 1
  3. batch-processor-1a2b3c4d-def56 (Pending)

    • Namespace: jobs
    • Reason: Insufficient memory

“Show me the logs from api-gateway”

Last 50 lines from api-gateway-7d9f8b6c5-x2k4m:

Timestamp            Level  Message
2024-01-15 10:23:45  ERROR  Memory limit exceeded
2024-01-15 10:23:45  FATAL  Out of memory, shutting down

“Why is it running out of memory?”

The api-gateway pod has a memory limit of 256Mi but is trying to use approximately 380Mi. This could be caused by:

  1. Memory leak in the application
  2. Increased traffic causing more connections
  3. Memory limit set too low

Recommendation: Increase the memory limit to at least 512Mi, or investigate the application for memory leaks.
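Applying that recommendation might look like the following manifest excerpt. The Deployment and container names, the namespace, and the request value are illustrative assumptions, not values read from a real cluster:

```yaml
# Hypothetical excerpt from the api-gateway Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway        # assumed workload name
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: api-gateway
          resources:
            requests:
              memory: "256Mi"   # what the scheduler reserves (assumed)
            limits:
              memory: "512Mi"   # raised from 256Mi per the recommendation
```

Raising only the limit keeps scheduling behavior unchanged while giving the container headroom; if the usage keeps climbing past 512Mi, that points back at a leak rather than an undersized limit.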

Security Audit

Running a security check:

“Run a security scan on production”

Running security scan on namespace “production”…

CRITICAL: 2 issues

  • Image nginx:1.19 has 3 critical CVEs
  • Pod db-backup runs as root

HIGH: 5 issues

  • 2 pods without resource limits
  • 1 service account with cluster-admin
  • 2 secrets not encrypted

MEDIUM: 8 issues

  • Assorted deviations from configuration best practices (see the full report for details)

“Tell me more about the critical CVEs”

Image: nginx:1.19

Vulnerabilities:

  1. CVE-2023-44487 (Critical) - HTTP/2 rapid reset attack

    • Fix: Upgrade to nginx:1.25.3
  2. CVE-2023-38545 (Critical) - curl heap buffer overflow

    • Fix: Upgrade to nginx:1.25.3
  3. CVE-2023-38546 (Critical) - curl cookie injection

    • Fix: Upgrade to nginx:1.25.3

Recommendation: Update to nginx:1.25.3 or later.
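A sketch of the fix, assuming the image is referenced from a Deployment's pod template (the container name here is a placeholder):

```yaml
# Bump the image tag wherever nginx:1.19 is referenced.
spec:
  template:
    spec:
      containers:
        - name: nginx              # assumed container name
          image: nginx:1.25.3      # was nginx:1.19; includes the CVE fixes
```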


“Who has cluster-admin access?”

Service accounts with cluster-admin:

  1. system:admin (expected)
  2. jenkins-deployer (in jenkins namespace)
    • Last used: 2024-01-10
    • Bound by: jenkins-admin-binding

This may be overly permissive. Consider using namespace-scoped roles instead of cluster-admin for CI/CD.
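One way to narrow that access is to replace the cluster-admin binding with a namespace-scoped Role and RoleBinding. The resource list and verbs below are illustrative assumptions; trim them to what the pipeline actually needs:

```yaml
# Hypothetical namespace-scoped replacement for the cluster-admin binding.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jenkins-deployer
  namespace: production          # the namespace Jenkins deploys into
rules:
  - apiGroups: ["", "apps"]
    resources: ["deployments", "services", "configmaps"]   # assumed needs
    verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jenkins-deployer-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: jenkins-deployer
    namespace: jenkins
roleRef:
  kind: Role
  name: jenkins-deployer
  apiGroup: rbac.authorization.k8s.io
```

If Jenkins deploys into several namespaces, repeat the Role and RoleBinding per namespace rather than falling back to a ClusterRoleBinding.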

Compliance Check

Verifying compliance status:

“Check CIS Kubernetes benchmark compliance”

CIS Kubernetes Benchmark v1.8 Results:

Section               Compliance
Control Plane         85%
etcd                  100%
Control Plane Config  90%
Worker Nodes          78%
Policies              72%

Top failures:

  1. [FAIL] 5.2.2 - Privileged containers detected (3 pods)
  2. [FAIL] 5.2.6 - Containers running as root (7 pods)
  3. [FAIL] 5.7.1 - Default namespace in use (12 pods)

“How do I fix the privileged containers issue?”

3 pods are running with privileged: true:

  1. kube-proxy (kube-system) - Expected, system component
  2. csi-driver (storage) - Required for storage driver
  3. debug-pod (production) - Should be removed or fixed

For debug-pod, remove the privileged flag:

  • Edit the deployment
  • Remove securityContext.privileged: true
  • Add securityContext.allowPrivilegeEscalation: false

This pod doesn’t need privileged access for its function.
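The resulting securityContext could look like this sketch (the container name is assumed, and runAsNonRoot is optional extra hardening only if the image supports a non-root user):

```yaml
# Hypothetical securityContext for debug-pod after removing privileged access.
spec:
  containers:
    - name: debug                          # assumed container name
      securityContext:
        privileged: false                  # was privileged: true
        allowPrivilegeEscalation: false    # block setuid/capability escalation
        runAsNonRoot: true                 # optional, if the image allows it
```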

Resource Analysis

Understanding resource usage:

“Which pods are using the most memory?”

Top 10 pods by memory usage:

Pod                 Namespace   Usage   Limit  %
postgres-primary-0  data        4.2 Gi  8 Gi   52%
elasticsearch-0     logging     3.8 Gi  4 Gi   95%
redis-master-0      cache       2.1 Gi  4 Gi   52%
api-gateway-xxx     production  890 Mi  1 Gi   87%
worker-xxx          production  756 Mi  1 Gi   74%

Warning: elasticsearch-0 is at 95% of its memory limit.


“What are the memory trends for elasticsearch?”

Memory usage for elasticsearch-0 (last 24 hours):

Time   Usage   %
06:00  2.8 Gi  70%
12:00  3.2 Gi  80%
18:00  3.5 Gi  87%
00:00  3.8 Gi  95%

The memory usage is steadily increasing. This could indicate:

  1. Index growth without proper cleanup
  2. Memory leak
  3. Increased query load

Recommendation: Check index retention policies and consider increasing the memory limit or adding more replicas.

Network Analysis

Understanding network configuration:

“What services are exposed externally?”

Externally accessible services:

Service       Type          External IP       Ports
api-gateway   LoadBalancer  52.23.145.67      443, 80
web-frontend  LoadBalancer  52.23.145.68      443
grafana       NodePort      (any node):30080  3000

Warning: grafana is exposed via NodePort. Consider using a LoadBalancer with proper authentication or placing it behind an ingress controller.
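Moving grafana behind an ingress controller could look roughly like this. The hostname, TLS secret, namespace, and ingress class are all placeholders; the Service would also need to change from NodePort to ClusterIP:

```yaml
# Hypothetical Ingress fronting the grafana Service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring          # assumed namespace
spec:
  ingressClassName: nginx        # assumed ingress controller
  tls:
    - hosts: ["grafana.example.com"]
      secretName: grafana-tls    # placeholder TLS secret
  rules:
    - host: grafana.example.com  # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
```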


“What network policies protect the database?”

Network policies for namespace “data”:

  1. deny-all-ingress

    • Denies all incoming traffic by default
  2. allow-api-to-postgres

    • Allows: pods with label app=api
    • To: pods with label app=postgres
    • Ports: 5432
  3. allow-monitoring

    • Allows: namespace kube-system
    • To: all pods
    • Ports: 9090 (metrics)

The database is well protected with deny-by-default policy.
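For reference, the two key policies described above could be expressed roughly like this; the label selectors are inferred from the summary rather than read from the cluster:

```yaml
# Sketch of a deny-by-default ingress policy plus a targeted allow rule.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: data
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes: ["Ingress"]   # no ingress rules listed => deny all incoming
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-postgres
  namespace: data
spec:
  podSelector:
    matchLabels:
      app: postgres          # assumed label from the summary
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api       # assumed label from the summary
      ports:
        - protocol: TCP
          port: 5432
```

Because NetworkPolicies are additive, the allow rule punches a specific hole through the deny-all baseline without weakening anything else.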

Tips for Effective Use

Start Broad, Then Narrow

“What’s wrong with my cluster?” → get overview

“Tell me more about the failing pods” → focus on specific issue

“Show me the logs from that pod” → drill down to details

Use Context

The AI remembers your conversation:

“Show me high CPU pods”

“Why is the first one using so much?”

“How can I fix it?”

Ask for Recommendations

“What should I do about this security issue?”

“What’s the best practice here?”

“How can I improve this configuration?”