
Alerts & Notifications

Configure and manage alerts to stay informed about your cluster status.

Overview

The Alerts system monitors your Kubernetes clusters and notifies you when issues occur. Access it from Monitoring > Alert Rules in the sidebar.

Key features:

  • Pre-configured alert templates for common scenarios
  • Custom PromQL-based alert rules
  • 7 notification channels (Email, Slack, Discord, Teams, Telegram, PagerDuty, Webhook)
  • Alert silencing and acknowledgment
  • Escalation policies

[Screenshot: Alert Plugins page showing notification channel configuration for Slack, Discord, Teams, and more]

Alert Templates

SRExpert provides 10 pre-configured alert templates to get you started quickly.

Resource Alerts

| Template | Metric | Default Threshold | Duration | Severity |
| --- | --- | --- | --- | --- |
| High CPU Usage | CPU utilization | > 80% | 5 min | High |
| High Memory Usage | Memory utilization | > 85% | 5 min | High |
| High Disk Usage | Disk utilization | > 85% | 10 min | High |
| Container OOM Killed | OOM events | > 0 | 1 min | Critical |

Availability Alerts

| Template | Metric | Default Threshold | Duration | Severity |
| --- | --- | --- | --- | --- |
| Pod Restart Loop | Restart count | > 5 | 10 min | Critical |
| Pod Not Ready | Ready status | false | 5 min | High |
| Deployment Replicas Mismatch | Replicas delta | > 0 | 5 min | High |

Infrastructure Alerts

| Template | Metric | Default Threshold | Duration | Severity |
| --- | --- | --- | --- | --- |
| Node Not Ready | Node status | != Ready | 5 min | Critical |
| PVC Storage Almost Full | PVC usage | > 80% | 10 min | High |

Application Alerts

| Template | Metric | Default Threshold | Duration | Severity |
| --- | --- | --- | --- | --- |
| High HTTP Error Rate | 5xx error rate | > 5% | 5 min | Critical |

Creating Alert Rules

Using Templates

  1. Go to Monitoring > Alert Rules
  2. Click Create Rule
  3. Select a template from the dropdown
  4. Configure:
    • Target Clusters (required) - Select one or more clusters
    • Namespaces (optional) - Leave empty to monitor all
    • Notification Channels - Select channels to receive alerts
  5. Adjust threshold and duration if needed
  6. Click Create

Custom Rules

Create rules with custom PromQL queries:

  1. Go to Monitoring > Alert Rules
  2. Click Create Rule
  3. Select Custom Rule
  4. Fill in the form:

Basic Information

  • Rule Name - Unique identifier (e.g., high-memory-api)
  • Display Name - Human-readable name
  • Description - What this alert monitors

Query Configuration

  • PromQL Query - The metric query
  • Operator - Comparison: >, <, >=, <=, ==, !=
  • Threshold - The value to compare against

Timing

  • Duration - How long the condition must be true
    • Options: 1m, 5m, 10m, 15m, 30m, 1h, 2h
  • Severity - Alert priority level
    • Critical, High, Medium, Low, Info

Scope

  • Target Clusters (required) - Each cluster must have an active Prometheus datasource
  • Namespaces (optional) - Specific namespaces to monitor

Notifications

  • Channels - Select notification channels
  • Frequency - How often to re-notify
    • Options: Immediate, 1m, 5m, 15m, 30m, 1h

Example: Memory Alert

Create an alert for high memory usage in production:

  • Rule Name: high-memory-production
  • Display Name: High Memory - Production
  • Query: container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
  • Operator: >
  • Threshold: 90
  • Duration: 5m
  • Severity: High
  • Clusters: production-cluster
  • Namespaces: production
  • Channels: slack-ops, email-team
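
Conceptually, the Operator and Threshold fields combine into a simple comparison against the query result. A minimal sketch of that evaluation (illustrative only, not SRExpert's actual implementation):

```python
import operator

# Map the comparison operators offered in the rule form to Python functions.
OPERATORS = {
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}

def rule_fires(current_value: float, op: str, threshold: float) -> bool:
    """Return True when the queried value breaches the threshold."""
    return OPERATORS[op](current_value, threshold)

# With the example rule (operator ">", threshold 90):
print(rule_fires(92.0, ">", 90.0))   # True  -> alert starts pending
print(rule_fires(88.5, ">", 90.0))   # False -> no alert
```

The Duration field adds a second condition on top of this: the comparison must stay true for the whole window (here, 5m) before the alert moves from pending to firing.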

Notification Channels

Configure channels in Monitoring > Contact Points.

[Screenshot: Alert Plugins sidebar showing contact points configuration for alert notifications]

Supported Channels

| Channel | Use Case | Configuration Required |
| --- | --- | --- |
| Email | Team notifications | SMTP server, recipients |
| Slack | Real-time team alerts | Webhook URL |
| Discord | DevOps communities | Webhook URL |
| Microsoft Teams | Enterprise teams | Webhook URL |
| Telegram | Mobile notifications | Bot token, Chat ID |
| PagerDuty | On-call rotation | Integration key |
| Webhook | Custom integrations | Endpoint URL |

Creating a Channel

  1. Go to Monitoring > Contact Points
  2. Click Add Contact Point
  3. Select channel type
  4. Configure settings (varies by type)
  5. Set filters (optional):
    • Severity Filter - Only receive certain severities
    • Cluster Filter - Only from specific clusters
    • Namespace Filter - Only from specific namespaces
  6. Click Test to verify
  7. Click Save

Slack Configuration

  1. Create a Slack Webhook:
    • Go to your Slack workspace settings
    • Create an Incoming Webhook
    • Copy the Webhook URL
  2. In SRExpert:
    • Add Contact Point > Slack
    • Paste the Webhook URL
    • Test and save

Slack messages include:

  • Alert severity with color coding
  • Rule name and description
  • Affected cluster and namespace
  • Current metric value
  • Timestamp
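
Assembling such a message for a Slack Incoming Webhook can be sketched as below; the color mapping and payload layout are illustrative assumptions, not SRExpert's exact format:

```python
import json
from urllib import request

# Assumed severity-to-color mapping for the attachment sidebar.
SEVERITY_COLORS = {"critical": "#e01e5a", "high": "#e8912d",
                   "medium": "#ecb22e", "low": "#2eb67d", "info": "#36c5f0"}

def build_slack_payload(rule: str, severity: str, cluster: str,
                        namespace: str, value: float, threshold: float) -> dict:
    """Build a Slack attachment carrying the fields listed above."""
    return {
        "attachments": [{
            "color": SEVERITY_COLORS.get(severity, "#cccccc"),
            "title": f"[{severity.upper()}] {rule}",
            "text": (f"Cluster: {cluster} | Namespace: {namespace}\n"
                     f"Current value: {value} (threshold: {threshold})"),
        }]
    }

def post_to_slack(webhook_url: str, payload: dict) -> None:
    req = request.Request(webhook_url,
                          data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)  # raises on non-2xx responses

# Usage (webhook URL is a placeholder):
# payload = build_slack_payload("high-cpu-usage", "high",
#                               "production", "default", 85.5, 80)
# post_to_slack("https://hooks.slack.com/services/T000/B000/XXXX", payload)
```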

Email Configuration

Configure SMTP settings:

  • SMTP Host - Your mail server
  • SMTP Port - Usually 587 (TLS) or 465 (SSL)
  • Username/Password - SMTP credentials
  • From Address - Sender email
  • Recipients - Comma-separated email addresses
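
To clarify what each SMTP field does, here is a sketch of how an alert email could be assembled and sent with these settings. SRExpert does this internally; the helper names below are hypothetical:

```python
import smtplib
from email.message import EmailMessage

def build_alert_email(sender: str, recipients: list[str], alert: dict) -> EmailMessage:
    """Assemble the message from the From Address and Recipients fields."""
    msg = EmailMessage()
    msg["From"] = sender                      # From Address
    msg["To"] = ", ".join(recipients)         # comma-separated Recipients
    msg["Subject"] = f"[{alert['severity'].upper()}] {alert['rule_name']}"
    msg.set_content(alert["message"])
    return msg

def send_alert_email(host: str, port: int, username: str, password: str,
                     msg: EmailMessage) -> None:
    """Deliver via the configured SMTP Host/Port with the given credentials."""
    with smtplib.SMTP(host, port) as smtp:    # port 587 expects STARTTLS
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)
```

Port 465 would instead use `smtplib.SMTP_SSL`, which opens the connection over TLS directly rather than upgrading it with STARTTLS.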

PagerDuty Configuration

  1. In PagerDuty, create an Events API v2 integration
  2. Copy the Integration Key
  3. In SRExpert, add the key

Severity mapping:

  • Critical → critical
  • High → error
  • Medium → warning
  • Low/Info → info
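
Applied to a PagerDuty Events API v2 event, the mapping above might look like this sketch (the `routing_key` is the Integration Key you copied; the helper name is hypothetical):

```python
# Severity mapping from the table above.
PD_SEVERITY = {"critical": "critical", "high": "error",
               "medium": "warning", "low": "info", "info": "info"}

def pagerduty_event(routing_key: str, alert: dict) -> dict:
    """Build a minimal Events API v2 trigger payload."""
    return {
        "routing_key": routing_key,           # the copied Integration Key
        "event_action": "trigger",
        "payload": {
            "summary": alert["message"],
            "source": alert["cluster_name"],
            "severity": PD_SEVERITY[alert["severity"]],
        },
    }
```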

Webhook Configuration

Send alerts to any HTTP endpoint:

  • URL - Your endpoint (POST requests)
  • Headers (optional) - Custom headers

Payload format:

{
  "alert_id": "uuid",
  "rule_name": "high-cpu-usage",
  "severity": "high",
  "status": "firing",
  "cluster_name": "production",
  "namespace": "default",
  "current_value": 85.5,
  "threshold_value": 80,
  "message": "CPU usage is at 85.5%",
  "fired_at": "2024-01-15T10:30:00Z"
}
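
A minimal receiver for this payload, sketched with Python's standard library; the port and the one-line log format are arbitrary choices for illustration:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def format_alert(alert: dict) -> str:
    """Render the webhook payload fields as a one-line log entry."""
    return (f"[{alert['severity']}] {alert['rule_name']} on "
            f"{alert['cluster_name']}/{alert['namespace']}: {alert['message']}")

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length))
        print(format_alert(alert))
        self.send_response(200)               # acknowledge receipt
        self.end_headers()

# To run: HTTPServer(("", 8080), AlertHandler).serve_forever()
```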

Managing Alerts

Alert States

| State | Description | Actions |
| --- | --- | --- |
| Firing | Condition is currently met | Acknowledge, Silence |
| Pending | Condition met, waiting for duration | - |
| Resolved | Condition no longer met | - |
| Acknowledged | Someone is handling it | Resolve |
| Silenced | Temporarily muted | Unsilence |
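
The transitions implied by this table can be sketched as a small lookup; the exact rules inside SRExpert may differ:

```python
# (state, event) -> next state; events here are hypothetical names for
# the UI actions and evaluation outcomes described in the table.
TRANSITIONS = {
    ("pending", "condition_held"): "firing",
    ("firing", "acknowledge"): "acknowledged",
    ("firing", "silence"): "silenced",
    ("firing", "condition_cleared"): "resolved",
    ("acknowledged", "resolve"): "resolved",
    ("silenced", "unsilence"): "firing",
}

def next_state(state: str, event: str) -> str:
    """Unknown (state, event) pairs are no-ops."""
    return TRANSITIONS.get((state, event), state)
```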

Acknowledging Alerts

When you start investigating an alert:

  1. Go to Monitoring > Alert Rules or the dashboard
  2. Find the firing alert
  3. Click Acknowledge

This:

  • Stops repeat notifications
  • Records who acknowledged and when
  • Tracks response time metrics

Silencing Alerts

Temporarily mute alerts during maintenance:

  1. Go to Monitoring > Silences
  2. Click Create Silence
  3. Configure:
    • Duration - How long to silence
    • Matchers - Which alerts to silence (by name, cluster, namespace)
    • Comment - Why it’s silenced
  4. Click Create

Use cases:

  • Planned maintenance windows
  • Known issues being fixed
  • Noisy alerts under investigation
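
One plausible reading of matcher semantics, sketched below under the assumption that a silence suppresses an alert only when every matcher field it sets agrees:

```python
def is_silenced(alert: dict, matchers: dict) -> bool:
    """True if every matcher (name, cluster, namespace, ...) matches the alert."""
    return all(alert.get(key) == value for key, value in matchers.items())

# A silence scoped to one cluster and namespace:
silence = {"cluster_name": "production", "namespace": "default"}
print(is_silenced({"rule_name": "high-cpu-usage",
                   "cluster_name": "production",
                   "namespace": "default"}, silence))   # True
print(is_silenced({"rule_name": "high-cpu-usage",
                   "cluster_name": "staging",
                   "namespace": "default"}, silence))   # False
```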

Testing Alert Rules

Before enabling, test your rule:

  1. Find the rule in Alert Rules
  2. Click the menu (⋮) > Test
  3. The system will:
    • Execute the PromQL query
    • Check if it would fire
    • Show you the result

Alert Dashboard

The dashboard at Monitoring > Alert Rules shows:

Overview Cards

  • Active Alerts - Currently firing
  • Critical - Critical severity count
  • High - High severity count
  • Recent - Fired in last hour

Alert List

  • Status icon (firing, resolved, etc.)
  • Severity badge with color
  • Rule name and display name
  • Affected cluster and namespace
  • Current value vs threshold
  • Time since firing
  • Quick actions (Acknowledge, Silence, Resolve)

Filtering

Filter alerts by:

  • Status (Firing, Resolved, Acknowledged, Silenced)
  • Severity (Critical, High, Medium, Low, Info)
  • Cluster
  • Namespace
  • Time range

Alert Rule Actions

From the rule card menu:

| Action | Description |
| --- | --- |
| Edit | Modify rule configuration |
| Test | Test rule execution |
| Pause | Temporarily disable evaluation |
| Resume | Re-enable paused rule |
| Delete | Remove rule permanently |

Severity Guidelines

| Severity | Response | Example |
| --- | --- | --- |
| Critical | Immediate action required | Node down, pod crash loop |
| High | Action within hours | High memory, failing pods |
| Medium | Action within a day | Resource approaching limit |
| Low | Review when convenient | Minor configuration issue |
| Info | Informational only | Certificate expiring soon |

Best Practices

Alert Design

  1. Be specific - Target specific namespaces when possible
  2. Set appropriate durations - Avoid flapping with short durations
  3. Use severity correctly - Reserve Critical for truly urgent issues
  4. Add descriptions - Help on-call engineers understand the alert
  5. Test before enabling - Use the test feature

Channel Configuration

  1. Route by severity - Critical to PagerDuty, Info to Slack
  2. Use rate limiting - Prevent alert storms
  3. Set up redundancy - Have backup channels
  4. Test regularly - Verify channels still work

Reducing Alert Fatigue

  1. Review noisy alerts - Tune thresholds or disable
  2. Combine related alerts - One alert for related issues
  3. Use silencing - During known maintenance
  4. Document runbooks - So alerts lead to action

Alert History

View past alerts:

  1. Go to Monitoring > Alert Rules
  2. Select an alert to see its history
  3. Information includes:
    • When it fired
    • When it resolved
    • Who acknowledged
    • Duration

Alerts are kept for 72 hours before automatic cleanup.

Next Steps