Alerts & Notifications

Configure and manage alerts to stay informed about your cluster status.

Overview

The Alerts system monitors your Kubernetes clusters and notifies you when issues occur. Access it from Monitoring > Alerting in the sidebar. The Alerting page is organized into two tabs: Notification Channels and Alert Rules.

Key features:

  • Pre-configured alert templates for common scenarios
  • Custom PromQL-based alert rules
  • 7 notification channels (Email, Slack, Discord, Teams, Telegram, PagerDuty, Webhook)
  • Alert silencing and acknowledgment
  • Escalation policies (where available)

Alert Templates

SRExpert provides 10 pre-configured alert templates to get you started quickly.

Resource Alerts

| Template | Metric | Default Threshold | Duration | Severity |
| --- | --- | --- | --- | --- |
| High CPU Usage | CPU utilization | > 80% | 5 min | High |
| High Memory Usage | Memory utilization | > 85% | 5 min | High |
| High Disk Usage | Disk utilization | > 85% | 10 min | High |
| Container OOM Killed | OOM events | > 0 | 1 min | Critical |

Availability Alerts

| Template | Metric | Default Threshold | Duration | Severity |
| --- | --- | --- | --- | --- |
| Pod Restart Loop | Restart count | > 5 | 10 min | Critical |
| Pod Not Ready | Ready status | false | 5 min | High |
| Deployment Replicas Mismatch | Replicas delta | > 0 | 5 min | High |

Infrastructure Alerts

| Template | Metric | Default Threshold | Duration | Severity |
| --- | --- | --- | --- | --- |
| Node Not Ready | Node status | != Ready | 5 min | Critical |
| PVC Storage Almost Full | PVC usage | > 80% | 10 min | High |

Application Alerts

| Template | Metric | Default Threshold | Duration | Severity |
| --- | --- | --- | --- | --- |
| High HTTP Error Rate | 5xx error rate | > 5% | 5 min | Critical |

Creating Alert Rules

Using Templates

  1. Go to Monitoring > Alerting and open the Alert Rules tab
  2. Click Create Rule
  3. Select a template from the dropdown
  4. Configure:
    • Target Clusters (required) - Select one or more clusters
    • Namespaces (optional) - Leave empty to monitor all
    • Notification Channels - Select channels to receive alerts
  5. Adjust threshold and duration if needed
  6. Click Create

Custom Rules

Create rules with custom PromQL queries:

  1. Go to Monitoring > Alerting and open the Alert Rules tab
  2. Click Create Rule
  3. Select Custom Rule
  4. Fill in the form:

Basic Information

  • Rule Name - Unique identifier (e.g., high-memory-api)
  • Display Name - Human-readable name
  • Description - What this alert monitors

Query Configuration

  • PromQL Query - The metric query
  • Operator - Comparison: >, <, >=, <=, ==, !=
  • Threshold - The value to compare against

Timing

  • Duration - How long the condition must be true
    • Options: 1m, 5m, 10m, 15m, 30m, 1h, 2h
  • Severity - Alert priority level
    • Critical, High, Medium, Low, Info

Scope

  • Target Clusters (required) - Each selected cluster must have an active Prometheus datasource
  • Namespaces (optional) - Specific namespaces to monitor

Notifications

  • Channels - Select notification channels
  • Frequency - How often to re-notify
    • Options: Immediate, 1m, 5m, 15m, 30m, 1h
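
The form fields above can be mirrored in a small data structure, for example to check rule definitions before entering them. The sketch below is illustrative only: the `CustomRule` class, its field names, and the `problems` method are hypothetical and not part of SRExpert. The option sets are copied from the lists above; treating the rule name and query as mandatory is an assumption.

```python
from dataclasses import dataclass, field

# Option sets copied from the rule form described above.
OPERATORS = {">", "<", ">=", "<=", "==", "!="}
DURATIONS = {"1m", "5m", "10m", "15m", "30m", "1h", "2h"}
SEVERITIES = {"Critical", "High", "Medium", "Low", "Info"}
FREQUENCIES = {"Immediate", "1m", "5m", "15m", "30m", "1h"}


@dataclass
class CustomRule:
    """Hypothetical in-memory model of the Custom Rule form."""
    name: str
    query: str
    operator: str
    threshold: float
    duration: str
    severity: str
    frequency: str
    clusters: list
    namespaces: list = field(default_factory=list)  # optional in the form
    channels: list = field(default_factory=list)

    def problems(self):
        """Return a list of human-readable validation problems."""
        issues = []
        if not self.name.strip():
            issues.append("rule name is required")  # assumption
        if not self.query.strip():
            issues.append("PromQL query is required")  # assumption
        if self.operator not in OPERATORS:
            issues.append(f"unsupported operator: {self.operator}")
        if self.duration not in DURATIONS:
            issues.append(f"unsupported duration: {self.duration}")
        if self.severity not in SEVERITIES:
            issues.append(f"unsupported severity: {self.severity}")
        if self.frequency not in FREQUENCIES:
            issues.append(f"unsupported frequency: {self.frequency}")
        if not self.clusters:
            issues.append("at least one target cluster is required")
        return issues


rule = CustomRule(
    name="high-memory-production",
    query="container_memory_usage_bytes / container_spec_memory_limit_bytes * 100",
    operator=">",
    threshold=90,
    duration="5m",
    severity="High",
    frequency="5m",  # arbitrary pick from the Frequency options
    clusters=["production-cluster"],
    namespaces=["production"],
    channels=["slack-ops", "email-team"],
)
print(rule.problems())  # an empty list means the values pass these checks
```

Returning a list of problems rather than raising on the first one lets you report every invalid field at once.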

Example: Memory Alert

Create an alert for high memory usage in production:

  • Rule Name: high-memory-production
  • Display Name: High Memory - Production
  • Query: container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
  • Operator: >
  • Threshold: 90
  • Duration: 5m
  • Severity: High
  • Clusters: production-cluster
  • Namespaces: production
  • Channels: slack-ops, email-team
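
To see how Operator, Threshold, and Duration interact in this example, here is a minimal sketch of one plausible evaluation model. It assumes the rule fires once the condition has held without interruption for at least the configured duration; SRExpert's actual evaluation logic is not documented here and may differ. The function names and the `(timestamp, value)` sample format are hypothetical.

```python
import operator

# Comparison operators offered in the rule form.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def condition_met(value, op, threshold):
    """Return True when `value <op> threshold` holds."""
    return OPERATORS[op](value, threshold)


def should_fire(samples, op, threshold, duration_seconds):
    """Decide whether a rule would fire under the assumed model.

    `samples` is a list of (unix_timestamp, value) pairs, oldest first.
    The rule fires when the condition has held for every sample in an
    unbroken run ending at the latest sample, and that run spans at
    least `duration_seconds`.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    held_since = None
    # Walk backwards until the condition stops holding.
    for ts, value in reversed(samples):
        if not condition_met(value, op, threshold):
            break
        held_since = ts
    if held_since is None:
        return False
    return latest_ts - held_since >= duration_seconds


# Memory percentage sampled once a minute (made-up values).
samples = [(0, 95), (60, 92), (120, 91), (180, 93), (240, 94), (300, 96)]
print(should_fire(samples, ">", 90, 5 * 60))
```

A single dip below the threshold resets the run, which is why longer durations make a rule less sensitive to brief spikes.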

Notification Channels

Configure channels in Monitoring > Alerting, on the Notification Channels tab. The tab shows summary counters (Total Channels, Active, Inactive, Healthy) and an Add Channel button (or Create First Channel when no channels exist yet).


Supported Channels

| Channel | Use Case | Configuration Required |
| --- | --- | --- |
| Email | Team notifications | SMTP server, port, from/to addresses |
| Slack | Real-time team alerts | Webhook URL |
| Discord | DevOps communities | Webhook URL |
| Microsoft Teams | Enterprise teams | Webhook URL |
| Telegram | Mobile notifications | Bot token, Chat ID |
| PagerDuty | On-call rotation | Integration key |
| Webhook | Custom integrations | Endpoint URL |

Creating a Channel

A 4-step wizard guides you through creating a notification channel:

  1. Go to Monitoring > Alerting and open the Notification Channels tab
  2. Click Add Channel (or Create First Channel)
  3. Step 1 - Select Integration Type: choose one of the 7 integration types (Slack, Microsoft Teams, Discord, Telegram, PagerDuty, Webhook, Email)
  4. Step 2 - Configure Connection: fill in the common fields plus the type-specific connection fields
    • Internal Name - Unique identifier (no spaces)
    • Display Name - Friendly, human-readable name
    • Description - What this channel is for
    • Type-specific fields (e.g. Webhook URL for Slack, SMTP settings for Email)
  5. Step 3 and Step 4: continue with the remaining configuration (such as alert filters and delivery options, where available) and review your settings
  6. Finish the wizard to create the channel, then use Send Test from the channel’s actions menu to verify it

For step-by-step, per-integration instructions, see Notification Plugins.

Slack Configuration

  1. Create a Slack Webhook:
    • Go to your Slack workspace settings
    • Create an Incoming Webhook
    • Copy the Webhook URL
  2. In SRExpert:
    • Add Channel > Slack
    • Paste the Webhook URL
    • Test and save

Slack messages include:

  • Alert severity with color coding
  • Rule name and description
  • Affected cluster and namespace
  • Current metric value
  • Timestamp
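
If you want to confirm that a Slack Webhook URL works before pasting it into SRExpert, you can post a plain test message to it yourself. The sketch below uses only the Python standard library and Slack's basic `{"text": ...}` Incoming Webhook payload. The URL shown is a placeholder, and the helper names are hypothetical. This is not how SRExpert formats its own messages.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack Incoming Webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder URL: replace it with the Webhook URL copied from Slack.
request = build_slack_request(
    "https://hooks.slack.com/services/T000/B000/XXXX",
    "Test message before connecting SRExpert",
)
# Uncomment to actually post the message:
# print(send(request))
```

Building the request separately from sending it makes it easy to inspect the payload without touching the network.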

Email Configuration

Email channels use per-channel SMTP settings, configured in Step 2 of the wizard:

  • SMTP Server - Your mail server host
  • SMTP Port - Usually 587 (STARTTLS) or 465 (implicit TLS/SSL)
  • From Email - Sender address
  • To Email - Recipient address
  • Username (optional) - SMTP username
  • Password (optional) - SMTP password

PagerDuty Configuration

  1. In PagerDuty, create an Events API v2 integration
  2. Copy the Integration Key
  3. In SRExpert, go to Add Channel > PagerDuty and paste the Integration Key

Severity mapping:

  • Critical → critical
  • High → error
  • Medium → warning
  • Low/Info → info
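
If you process the same alerts in your own tooling, the mapping above is easy to express as a lookup table. This is a minimal sketch; the function name is hypothetical, and the target values are the four severities accepted by the PagerDuty Events API v2.

```python
# SRExpert severity -> PagerDuty Events API v2 severity, as listed above.
PAGERDUTY_SEVERITY = {
    "critical": "critical",
    "high": "error",
    "medium": "warning",
    "low": "info",
    "info": "info",
}


def to_pagerduty_severity(severity):
    """Map an SRExpert severity name (any letter case) to a PagerDuty severity."""
    key = severity.strip().lower()
    if key not in PAGERDUTY_SEVERITY:
        raise ValueError(f"unknown severity: {severity!r}")
    return PAGERDUTY_SEVERITY[key]


print(to_pagerduty_severity("High"))
```

Raising on unknown values, rather than silently defaulting to `info`, avoids downgrading an alert whose severity name was misspelled.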

Webhook Configuration

Send alerts to any HTTP endpoint:

  • URL - Your endpoint (POST requests)
  • Headers (optional) - Custom headers

Payload format:

{
  "alert_id": "uuid",
  "rule_name": "high-cpu-usage",
  "severity": "high",
  "status": "firing",
  "cluster_name": "production",
  "namespace": "default",
  "current_value": 85.5,
  "threshold_value": 80,
  "message": "CPU usage is at 85.5%",
  "fired_at": "2024-01-15T10:30:00Z"
}

The exact payload fields may evolve over time. Build integrations defensively and avoid assuming a fixed set of keys.
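
One way to follow that advice in a receiving service is sketched below. It assumes only a small set of keys is mandatory (`rule_name`, `severity`, `status`, which is this sketch's choice, not a documented guarantee), reads everything else with defaults, and keeps unrecognized keys instead of rejecting them. The function name and the shape of the returned dictionary are hypothetical.

```python
import json
from datetime import datetime

# Assumption: the minimum needed to act on an alert.
REQUIRED_KEYS = ("rule_name", "severity", "status")
# Keys seen in the example payload above.
KNOWN_KEYS = {
    "alert_id", "rule_name", "severity", "status", "cluster_name",
    "namespace", "current_value", "threshold_value", "message", "fired_at",
}


def parse_alert(body):
    """Parse a webhook body into a normalized dict, tolerating extra keys."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")

    # Parse the timestamp if present and well-formed; otherwise leave None.
    fired_at = None
    raw_fired_at = data.get("fired_at")
    if isinstance(raw_fired_at, str):
        try:
            fired_at = datetime.fromisoformat(raw_fired_at.replace("Z", "+00:00"))
        except ValueError:
            fired_at = None

    return {
        "alert_id": data.get("alert_id"),
        "rule_name": data["rule_name"],
        "severity": str(data["severity"]).lower(),
        "status": str(data["status"]).lower(),
        "cluster_name": data.get("cluster_name"),
        "namespace": data.get("namespace"),
        "current_value": data.get("current_value"),
        "threshold_value": data.get("threshold_value"),
        "message": data.get("message", ""),
        "fired_at": fired_at,
        # Preserve fields this code does not know about yet.
        "extra": {k: v for k, v in data.items() if k not in KNOWN_KEYS},
    }
```

Keeping unknown fields under `extra` means a payload that gains new keys later still parses, and the new data remains available for logging.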

Managing Alerts

Alert States

| State | Description | Actions |
| --- | --- | --- |
| Firing | Condition is currently met | Acknowledge, Silence |
| Pending | Condition met, waiting for duration | - |
| Resolved | Condition no longer met | - |
| Acknowledged | Someone is handling it | Resolve |
| Silenced | Temporarily muted | Unsilence |
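
For scripts or tests that reason about alert states, the Actions column above can be encoded as a lookup table. The sketch only records which actions each state allows. It does not model which state an action leads to, since that is not specified here. The names are hypothetical.

```python
# Actions available in each alert state, taken from the table above.
ALLOWED_ACTIONS = {
    "firing": {"acknowledge", "silence"},
    "pending": set(),
    "resolved": set(),
    "acknowledged": {"resolve"},
    "silenced": {"unsilence"},
}


def can_apply(state, action):
    """Return True if `action` is listed for `state` (case-insensitive)."""
    return action.lower() in ALLOWED_ACTIONS.get(state.lower(), set())


print(can_apply("Firing", "Acknowledge"))
print(can_apply("Pending", "Silence"))
```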

Acknowledging Alerts

When you start investigating an alert:

  1. Go to Monitoring > Alerting (Alert Rules tab) or the dashboard
  2. Find the firing alert
  3. Click Acknowledge

Acknowledging an alert:

  • Stops repeat notifications
  • Records who acknowledged and when
  • May track response-time metrics (where available)

Silencing Alerts

Silencing lets you temporarily mute alerts during maintenance. From the Alerting view, you can create a silence and configure:

  • Duration - How long to silence
  • Matchers - Which alerts to silence (by name, cluster, namespace)
  • Comment - Why it’s silenced
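
The relationship between Duration and Matchers can be illustrated with a small model. It assumes a silence applies only while its time window is open and only to alerts for which every matcher equals the corresponding alert field exactly. SRExpert's real matching semantics (for example, pattern support) may differ. The `Silence` class and the alert keys `name`, `cluster`, and `namespace` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Silence:
    """Hypothetical model of a silence: matchers, time window, comment."""
    matchers: dict  # e.g. {"cluster": "production-cluster"}
    starts_at: datetime
    duration: timedelta
    comment: str = ""

    def is_active(self, now):
        """True while `now` falls inside the silence window."""
        return self.starts_at <= now < self.starts_at + self.duration

    def matches(self, alert):
        """True when every matcher equals the alert's value for that key."""
        return all(alert.get(key) == value for key, value in self.matchers.items())


def is_silenced(alert, silences, now):
    """True if any active silence matches the alert."""
    return any(s.is_active(now) and s.matches(alert) for s in silences)


now = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
maintenance = Silence(
    matchers={"cluster": "production-cluster", "namespace": "production"},
    starts_at=now - timedelta(minutes=10),
    duration=timedelta(hours=1),
    comment="Planned maintenance window",
)
alert = {
    "name": "high-memory-production",
    "cluster": "production-cluster",
    "namespace": "production",
}
print(is_silenced(alert, [maintenance], now))
```

Requiring all matchers to agree keeps a silence narrow: adding a namespace matcher to a cluster-wide silence reduces what it mutes.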

Use cases:

  • Planned maintenance windows
  • Known issues being fixed
  • Noisy alerts under investigation

Testing Alert Rules

Test a rule before enabling it:

  1. Find the rule in Alert Rules
  2. Click the menu (⋮) > Test
  3. The system will:
    • Execute the PromQL query
    • Check if it would fire
    • Show you the result

Alert Dashboard

The dashboard at Monitoring > Alerting (Alert Rules tab) shows:

Overview Cards

  • Active Alerts - Currently firing
  • Critical - Critical severity count
  • High - High severity count
  • Recent - Fired in last hour

Alert List

  • Status icon (firing, resolved, etc.)
  • Severity badge with color
  • Rule name and display name
  • Affected cluster and namespace
  • Current value vs threshold
  • Time since firing
  • Quick actions (Acknowledge, Silence, Resolve)

Filtering

Filter alerts by:

  • Status (Firing, Resolved, Acknowledged, Silenced)
  • Severity (Critical, High, Medium, Low, Info)
  • Cluster
  • Namespace
  • Time range

Alert Rule Actions

From the rule card menu:

| Action | Description |
| --- | --- |
| Edit | Modify rule configuration |
| Test | Test rule execution |
| Pause | Temporarily disable evaluation |
| Resume | Re-enable paused rule |
| Delete | Remove rule permanently |

Severity Guidelines

| Severity | Response | Example |
| --- | --- | --- |
| Critical | Immediate action required | Node down, pod crash loop |
| High | Action within hours | High memory, failing pods |
| Medium | Action within a day | Resource approaching limit |
| Low | Review when convenient | Minor configuration issue |
| Info | Informational only | Certificate expiring soon |

Best Practices

Alert Design

  1. Be specific - Target specific namespaces when possible
  2. Set appropriate durations - Very short durations cause flapping; choose one long enough to ride out brief spikes
  3. Use severity correctly - Reserve Critical for truly urgent issues
  4. Add descriptions - Help on-call engineers understand the alert
  5. Test before enabling - Use the test feature

Channel Configuration

  1. Route by severity - Critical to PagerDuty, Info to Slack
  2. Limit noise - Use filters and any available rate-limiting options to prevent alert storms
  3. Set up redundancy - Have backup channels
  4. Test regularly - Verify channels still work

Reducing Alert Fatigue

  1. Review noisy alerts - Tune thresholds or disable the rule
  2. Combine related alerts - One alert for related issues
  3. Use silencing - During known maintenance
  4. Document runbooks - So alerts lead to action

Alert History

View past alerts:

  1. Go to Monitoring > Alerting (Alert Rules tab)
  2. Select an alert to see its history
  3. Information includes:
    • When it fired
    • When it resolved
    • Who acknowledged
    • Duration

Historical alerts may be retained for a limited period before automatic cleanup.
