Prometheus Alertmanager
Alerting is handled by the Prometheus Alertmanager. It groups alerts by predefined labels (e.g. "server instances") and routes them to various third-party receivers, e.g. as JSON to a webhook or by email.
A list of supported integrations is available in the Alertmanager documentation.
The alert rules themselves are defined on the Prometheus side, not in the Alertmanager.
They are placed in a separate rule file, e.g. alertrules.yml, which is referenced in the Prometheus configuration.
Example configuration
The following example rule fires when the available memory reported by a Node Exporter stays below the threshold of 10% for five minutes:
groups:
  - name: Node_Exporter.SystemAlerts
    rules:
      - alert: HostOutOfMemory
        # Percentage of memory still available on the host
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        # The condition must hold for 5 minutes before the alert fires
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host out of memory (instance {{ $labels.instance }})
          # Quoted so that ": " inside the text is valid YAML and \n becomes a line break
          description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
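A stricter variant of the same rule can carry a higher severity, which becomes useful once severity-based inhibition is configured in the Alertmanager. The following is only a sketch: the 5% threshold, the 2m duration, and the annotation texts are assumptions, not values from this documentation. The entry would be appended to the rules list of the existing group.

```yaml
groups:
  - name: Node_Exporter.SystemAlerts
    rules:
      # Hypothetical critical variant; same alert name, stricter threshold
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Host almost out of memory (instance {{ $labels.instance }})
          description: "Node memory is nearly exhausted (< 5% left)\n VALUE = {{ $value }}"
```

Both rules share the alert name and differ only in expression and severity label, so an inhibition rule that compares alertname and instance can mute the warning while the critical alert is firing.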
Adjust path
The address of the Alertmanager must also be set in the Prometheus configuration (the Alertmanager listens on localhost:9093 by default), together with the reference to the rule file:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
rule_files:
  - "alertrules.yml"
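If the rules are split across several files, rule_files also accepts glob patterns. A minimal sketch, assuming a hypothetical rules/ directory next to the Prometheus configuration and an explicitly set evaluation interval:

```yaml
global:
  # How often Prometheus evaluates the rules; this interval is an assumption
  evaluation_interval: 1m
rule_files:
  # Hypothetical directory; every matching file is loaded as a rule file
  - "rules/*.yml"
```

The evaluation interval matters for the `for:` clause of a rule: a condition is only re-checked at each evaluation, so a `for: 5m` alert fires once the condition has held across the evaluations within those five minutes.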
Alert processing
Alert processing is configured in the Alertmanager configuration. This is where alert receivers (an SMTP receiver in the example below), grouping, and inhibition rules are defined.
Alerts are grouped by certain labels in the route element, and time intervals can be set per alert group to control how often notifications are sent.
Inhibition rules suppress alerts while other alerts are firing, which makes hierarchical alert structures possible (e.g. by severity). In the example below, a firing critical alert mutes warning alerts that have the same alertname, dev, and instance labels.
global:
  resolve_timeout: 5m
  smtp_smarthost: inubit.nemesys:25
  smtp_from: alertmanager@virtimo.de
route:
  group_by: ['instance','alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'smtp'
receivers:
  - name: 'smtp'
    email_configs:
      - to: alerting@virtimo.de
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
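Routing by label and delivery to a webhook can be sketched as follows. This is an illustrative fragment, not part of the configuration above: the receiver name oncall-webhook, the URL, and the timing values are assumptions, the global section is assumed to stay unchanged, and the matchers syntax requires a reasonably recent Alertmanager version.

```yaml
route:
  receiver: 'smtp'                    # default receiver for all alerts
  group_by: ['instance', 'alertname']
  routes:
    # Child route: only alerts labelled severity="critical" go to the webhook
    - matchers:
        - severity="critical"
      receiver: 'oncall-webhook'
      group_wait: 10s
      repeat_interval: 1h
receivers:
  - name: 'smtp'
    email_configs:
      - to: alerting@virtimo.de
  - name: 'oncall-webhook'            # hypothetical receiver name
    webhook_configs:
      # Hypothetical endpoint; the Alertmanager sends the alert group as JSON via HTTP POST
      - url: 'http://localhost:5001/alerts'
        send_resolved: true
```

Alerts that match no child route fall back to the receiver of the top-level route, so warning alerts would still be delivered by email in this sketch.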