Grafana Alerts Examples: A YAML Guide

by Jhon Lennon

Hey everyone! Today, we're diving deep into the awesome world of Grafana alerts, specifically focusing on how to set them up using YAML. If you're looking to get notified when things go south with your metrics, you've come to the right place, guys. We'll be exploring practical Grafana alerts examples that you can adapt for your own monitoring needs. Forget those confusing UIs for a sec; YAML gives us a clean, version-controllable way to define our alerts, making life so much easier when managing complex setups. So, let's get started and unlock the power of programmatic alerting in Grafana!

Understanding Grafana Alerting

Alright, so what exactly are Grafana alerts, and why should you even care? In a nutshell, Grafana alerting is your vigilant guardian, constantly watching your metrics and dashboards. When a specific condition is met – say, your server's CPU usage spikes to an alarming level, or your website's error rate suddenly skyrockets – Grafana can fire off a notification. This is super crucial, especially in production environments. Imagine your application crashing, and you have no idea until your users start complaining! With Grafana alerts, you get a heads-up before disaster strikes, allowing you to jump in and fix the issue proactively. This proactive approach saves you from downtime, lost revenue, and a whole lot of headaches.

The core of Grafana's alerting system revolves around defining rules that evaluate specific queries against your time-series data. When these rules cross predefined thresholds or meet certain conditions, an alert state is triggered. This state can then transition through different phases: Pending (waiting to confirm the alert condition), Firing (the condition is met and notifications are sent), and Resolved (the condition is no longer met, and Grafana signals that the issue is fixed). This state management ensures you don't get spammed with alerts for temporary blips but are reliably notified of persistent problems.

The flexibility here is key; you can monitor anything Grafana can visualize, from system metrics like CPU, memory, and network I/O, to application-specific metrics like request latency, error counts, and queue lengths, and even business KPIs. The power comes from combining sophisticated query languages (like PromQL for Prometheus, InfluxQL for InfluxDB, or SQL for relational databases) with logical conditions and thresholds. This allows for highly customized Grafana alerts examples tailored to the unique needs of your infrastructure and applications. We'll be focusing on the YAML configuration, which is often used in conjunction with provisioning Grafana resources, especially in containerized environments like Kubernetes, where declarative configurations are the standard. It's all about infrastructure as code, ensuring your alerting setup is reproducible, version-controlled, and easily managed alongside your application deployments.

Why Use YAML for Grafana Alerts?

Now, you might be thinking, "Why bother with YAML for Grafana alerts when I can just click around in the UI?" Great question, guys! While the Grafana UI is fantastic for ad-hoc exploration and setting up simple alerts, using YAML offers some serious advantages, especially as your monitoring needs grow. First off, version control is a game-changer. By defining your alerts in YAML files, you can commit them to a Git repository. This means you have a full history of every change made to your alerting rules, who made them, and when. If a change causes issues, you can easily roll back. Plus, it makes collaboration much smoother – team members can review changes, suggest improvements, and work on alert configurations together.

Secondly, repeatability and scalability. Need to set up the same alert on multiple environments or instances? With YAML, you just copy, paste, and modify a few parameters. This is infinitely faster and less error-prone than manually configuring each one in the UI. It's perfect for managing alerts across diverse infrastructures, from development to staging to production.

Third, automation. YAML configurations are ideal for automated provisioning. You can use tools like Ansible, Terraform, or Grafana's own provisioning mechanisms to automatically deploy your alert rules when you spin up new servers or services. This ensures your monitoring is always up-to-date without manual intervention. Finally, complex logic. While the UI is great for simple thresholds, YAML allows you to express more complex alerting logic, combine multiple conditions, and define intricate evaluation intervals. It gives you fine-grained control over every aspect of your alert definition.

It's about treating your infrastructure, including your alerting rules, as code. This means your alerting setup becomes a part of your application's deployment pipeline, ensuring consistency and reliability. So, while the UI is good for quick wins, YAML is the way to go for robust, scalable, and maintainable alerting strategies. It's about building a solid foundation for your monitoring, making sure you're always in the know when it matters most. This approach aligns perfectly with modern DevOps practices, where automation and infrastructure as code are paramount for efficient and reliable operations. By mastering YAML configurations, you're not just setting up alerts; you're building a resilient monitoring system that scales with your needs.
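
To make the provisioning part concrete: Grafana can load alert rules straight from YAML files dropped into its provisioning directory (typically provisioning/alerting/ inside the Grafana configuration path). The exact wrapper schema varies a little between Grafana versions, so treat the following as a minimal sketch rather than a definitive layout; the individual rule definitions we build throughout this guide would sit under the rules: key.

# provisioning/alerting/infra-rules.yaml (sketch; field names may differ slightly by Grafana version)
apiVersion: 1
groups:
  - orgId: 1
    name: "infra-alerts"        # name of the rule group
    folder: "Infrastructure"    # folder the rules will appear in
    interval: "1m"              # how often the whole group is evaluated
    rules:
      # ...alert rule definitions like the examples below go here...

Because this file lives alongside the rest of your configuration, it can be templated and deployed by Ansible, Terraform, Helm, or whatever tooling you already use.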

Basic Grafana Alert Structure in YAML

Let's get down to business and look at the fundamental structure of a Grafana alert definition in YAML. This will form the basis for our Grafana alerts examples. A typical alert rule definition will include several key components:

  • uid: A unique identifier for the alert rule. This is crucial for referencing and managing your alerts. It's best practice to generate a unique ID yourself rather than letting Grafana auto-generate it if you're provisioning.
  • title: A human-readable name for your alert. This is what you'll see in the alert list and notifications, so make it descriptive!
  • condition: This is the heart of your alert. It specifies the query that Grafana will run and the condition that must be met for the alert to fire. It usually references a query defined within the alert rule.
  • data: This section contains the queries that Grafana will execute. Each query has a refId (a short, unique identifier like 'A', 'B', 'C') which is then referenced in the condition.
    • refId: The reference ID for the query.
    • queryType: The type of query (e.g., range for time series data).
    • relativeTimeRange: Defines the time window for the query (e.g., from: 300 means the last 5 minutes).
    • datasourceUid: The unique identifier for your data source (e.g., Prometheus, InfluxDB).
    • model: This is where the actual query language expression goes. For Prometheus, it would be your PromQL query.
  • noDataState: What Grafana should do if the query returns no data. Options include NoData (default), Alerting, or OK.
  • execErrState: What Grafana should do if there's an error executing the query. Options include Error (default), Alerting, or OK.
  • evaluateFor: How long the condition must be true before the alert transitions to the Firing state. This helps prevent flapping alerts. It's typically a duration string like 5m (5 minutes).
  • evaluateEvery: How often Grafana should evaluate the alert rule. For example, 1m means it checks every minute.
  • labels: Key-value pairs that can be attached to the alert. These are useful for routing notifications and grouping alerts.
  • annotations: More descriptive information about the alert, often used to provide context in notifications. You can include things like summary, description, runbook URLs, etc.
  • folderUid: The unique identifier of the folder where the alert rule should be stored in Grafana.
  • isEnabled: A boolean (true/false) to enable or disable the alert rule.

Here's a simplified, structural example to give you a feel for it:

# Example alert rule structure
- uid: "my-unique-alert-id-123"
  title: "High CPU Usage Alert"
  condition: "A"
  data:
    - refId: "A"
      queryType: "range"
      relativeTimeRange: { from: 300, to: 0 }
      datasourceUid: "my-prometheus-datasource-uid"
      model:
        # PromQL: CPU usage percentage (100 minus the idle percentage)
        expr: "100 - avg by (instance) (rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100"
        hide: false
        intervalMs: 1000
        maxDataPoints: 43200
        refId: "A"
        legendFormat: "{{instance}}"
        queryType: "range"
  noDataState: "NoData"
  execErrState: "Error"
  evaluateFor: "5m"
  evaluateEvery: "1m"
  labels:
    severity: "warning"
  annotations:
    summary: "High CPU usage detected on {{ $labels.instance }}"
    description: "CPU usage is above 80% for the last 5 minutes."
    runbook_url: "http://my-runbook.com/cpu-issues"
  folderUid: "my-alerts-folder-uid"
  isEnabled: true

This structure might look a bit daunting at first, but once you break it down, it's quite logical. Each piece plays a specific role in defining how and when Grafana should alert you. The datasourceUid and the model.expr are where you'll customize the query to match your specific metrics and data source. The evaluateFor and evaluateEvery settings are crucial for tuning the alert's sensitivity. Remember to replace placeholders like my-prometheus-datasource-uid and my-alerts-folder-uid with your actual UIDs and names. Getting these UIDs right is important for Grafana to locate the correct data source and folder. You can usually find these UIDs in the Grafana UI URLs when you're editing the data source or folder.
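
Better yet, you can avoid hunting for UIDs by pinning them yourself when you provision the data source. Grafana's data source provisioning files accept an explicit uid field, so a sketch like the following (the name, url, and uid values are placeholders you'd swap for your own) gives you a stable identifier to reference from every alert rule:

# provisioning/datasources/prometheus.yaml (sketch; values are placeholders)
apiVersion: 1
datasources:
  - name: "Prometheus"
    type: "prometheus"
    access: "proxy"
    url: "http://prometheus:9090"
    uid: "my-prometheus-datasource-uid"   # this is what you reference as datasourceUid above

With the UID pinned at provisioning time, the same alert YAML works unchanged across dev, staging, and production.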

Practical Grafana Alerts Examples in YAML

Now for the fun part, guys! Let's look at some real-world Grafana alerts examples written in YAML that you can adapt. These examples cover common scenarios, and we'll explain the nuances.

Example 1: High CPU Usage Alert (Prometheus)

This is a classic. We want to know when a server's CPU is running hot. This example uses Prometheus as the data source.

- uid: "cpu-usage-high-{{ $labels.instance }}"
  title: "High CPU Usage on {{ $labels.instance }}"
  condition: "A"
  data:
    - refId: "A"
      queryType: "range"
      relativeTimeRange: { from: 600, to: 0 } # Look at the last 10 minutes
      datasourceUid: "prometheus"
      model:
        # Calculate CPU usage percentage. Higher value means less idle time.
        expr: "100 - avg by (instance) (rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100"
        hide: false
        intervalMs: 1000
        maxDataPoints: 43200
        refId: "A"
        legendFormat: "CPU Usage {{ instance }}"
        queryType: "range"
  noDataState: "NoData"
  execErrState: "Error"
  evaluateFor: "5m"
  evaluateEvery: "1m"
  labels:
    severity: "warning"
    team: "infra"
  annotations:
    summary: "High CPU usage on instance {{ $labels.instance }}"
    description: "Instance {{ $labels.instance }} has been experiencing CPU usage above 80% for the last 5 minutes. Current value: {{ $values.A.Value | printf "%.2f" }}%."
    runbook_url: "https://your-wiki.com/runbooks/high-cpu"
  folderUid: "${FOLDER_UID_INFRA}" # Using a templated value for folder UID
  isEnabled: true

Explanation:

  • uid: We're using a placeholder here, cpu-usage-high-{{ $labels.instance }}, to suggest one alert identity per instance. Note that Grafana itself expects a static, unique UID per rule (a single rule produces one alert instance per matching series, distinguished by its labels), so a placeholder like this needs to be expanded by your own provisioning or templating tooling before the file reaches Grafana.
  • title: Similarly, the title dynamically includes the instance name for clarity.
  • data.model.expr: This PromQL query calculates the percentage of CPU utilization by subtracting the idle CPU time from 100. We use avg by (instance) to group results per instance.
  • relativeTimeRange: We're looking at the last 10 minutes (600 seconds) to ensure the metric is consistent.
  • evaluateFor: "5m": The alert will only fire if the CPU usage stays above the threshold for 5 minutes straight. This prevents alerts from brief spikes.
  • labels: We've added severity: warning and team: infra for routing and categorization (see the notification routing sketch right after this list).
  • annotations: The description includes the actual current value ({{ $values.A.Value | printf "%.2f" }}%) and a link to a runbook. The {{ $values.A.Value }} is a Grafana template variable that injects the result of query A.
  • folderUid: Demonstrates using environment variables or Grafana's templating system to set the folder.
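
Those labels only earn their keep if something routes on them. Grafana's notification policies can be provisioned from YAML too; the sketch below (the contact points default-email and infra-slack are placeholders you'd define separately, and the exact schema can vary by Grafana version) shows how alerts carrying team: infra could be steered to a dedicated receiver:

# provisioning/alerting/notification-policies.yaml (sketch; receiver names are placeholders)
apiVersion: 1
policies:
  - orgId: 1
    receiver: "default-email"        # fallback contact point for everything else
    routes:
      - receiver: "infra-slack"      # contact point for the infra team
        object_matchers:
          - ["team", "=", "infra"]   # matches the team label we set on the rule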

Example 2: High Error Rate Alert (Application Metrics)

This example focuses on monitoring application errors. Let's assume you're exporting custom metrics like http_requests_total with a status_code label to Prometheus.

- uid: "app-error-rate-high-{{ $labels.job }}"
  title: "High HTTP Error Rate on {{ $labels.job }}"
  condition: "B > 0.05"
  data:
    - refId: "A"
      queryType: "range"
      relativeTimeRange: { from: 300, to: 0 } # Last 5 minutes
      datasourceUid: "prometheus"
      model:
        # Rate of HTTP 5xx errors over the last 5 minutes
        expr: "sum(rate(http_requests_total{status_code=~'5..'}[5m])) by (job)"
        hide: false
        refId: "A"
    - refId: "B"
      queryType: "range"
      relativeTimeRange: { from: 300, to: 0 } # Last 5 minutes
      datasourceUid: "prometheus"
      model:
        # Rate of all HTTP requests over the last 5 minutes
        expr: "sum(rate(http_requests_total[5m])) by (job)"
        hide: false
        refId: "B"
  noDataState: "NoData"
  execErrState: "Error"
  evaluateFor: "2m"
  evaluateEvery: "30s"
  labels:
    severity: "critical"
    service: "web-app"
  annotations:
    summary: "High HTTP error rate detected on job {{ $labels.job }}"
    description: "Job {{ $labels.job }} is experiencing an error rate above 5% for the last 2 minutes. Error rate: {{ ($values.B.Value > 0 and ($values.A.Value / $values.B.Value) * 100 || 0) | printf "%.2f" }}%."
    runbook_url: "https://your-wiki.com/runbooks/high-error-rate"
  folderUid: "${FOLDER_UID_APPS}"
  isEnabled: true

Explanation:

  • condition: Here, the condition is B > 0.05. This means the alert fires if the result of query B (total requests) is greater than 0.05. Wait, that's not right! The condition should actually compare the error rate derived from A and B. Let's correct this.

Correction: The condition should evaluate the ratio of errors to total requests. A better approach uses two queries and compares their results. Let's refine the condition and queries for a true error rate.

Example 2 (Revised): High Error Rate Alert (Application Metrics)

This revised example calculates the error rate (server errors, HTTP 5xx) as a percentage of total requests.

- uid: "app-error-rate-high-{{ $labels.job }}"
  title: "High HTTP Error Rate on {{ $labels.job }}"
  condition: "B > 5"
  data:
    - refId: "A"
      queryType: "range"
      relativeTimeRange: { from: 300, to: 0 } # Last 5 minutes
      datasourceUid: "prometheus"
      model:
        # Calculate the rate of HTTP 5xx errors over total requests
        # expr: "(sum(rate(http_requests_total{status_code=~'5..'}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)) * 100"
        # Let's use two separate queries plus a math expression for clarity in the 'condition' part
        expr: "sum(rate(http_requests_total{status_code=~'5..'}[5m])) by (job)"
        hide: false
        refId: "A"
        legendFormat: "Errors {{ job }}"
    - refId: "B"
      queryType: "range"
      relativeTimeRange: { from: 300, to: 0 } # Last 5 minutes
      datasourceUid: "prometheus"
      model:
        # Calculate the rate of total requests
        expr: "sum(rate(http_requests_total[5m])) by (job)"
        hide: false
        refId: "B"
        legendFormat: "Total Requests {{ job }}"
    - refId: "C"
      queryType: "math"
      datasourceUid: "- ") # Math datasource is not needed for expression, but required for refId
      model:
        # Calculate the error rate percentage: (Errors / Total Requests) * 100
        expression: "($A / $B) * 100"
        hide: false
        refId: "C"
  noDataState: "NoData"
  execErrState: "Error"
  evaluateFor: "2m"
  evaluateEvery: "30s"
  labels:
    severity: "critical"
    service: "web-app"
  annotations:
    summary: "High HTTP error rate detected on job {{ $labels.job }}"
    description: "Job {{ $labels.job }} is experiencing an error rate above 5% for the last 2 minutes. Current error rate: {{ $values.C.Value | printf "%.2f" }}%."
    runbook_url: "https://your-wiki.com/runbooks/high-error-rate"
  folderUid: "${FOLDER_UID_APPS}"
  isEnabled: true

Explanation (Revised Example 2):

  • condition: Now the condition is C > 5. This refers to the result of the new query C, which calculates the error rate percentage.
  • data: We now have three queries:
    • A: Counts the rate of 5xx errors per job.
    • B: Counts the rate of all requests per job.
    • C: A math query type that calculates ($A / $B) * 100. This gives us the percentage of requests that are errors.
  • evaluateFor: "2m": The alert fires if the error rate exceeds 5% for 2 consecutive minutes.
  • annotations.description: Dynamically shows the calculated error rate using {{ $values.C.Value }}.

This revised example correctly implements the logic for monitoring error rates and provides a much more actionable alert.
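
If you'd rather keep the rule to a single query, the commented-out expression in the data block above can do the division directly in PromQL. Here's a sketch of just the parts that change (everything else stays as in the revised example, and the annotation would then reference {{ $values.A.Value }} instead of C):

  condition: "A > 5"
  data:
    - refId: "A"
      queryType: "range"
      relativeTimeRange: { from: 300, to: 0 } # Last 5 minutes
      datasourceUid: "prometheus"
      model:
        # Error percentage computed in a single PromQL expression
        expr: "(sum(rate(http_requests_total{status_code=~'5..'}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)) * 100"
        hide: false
        refId: "A"
        legendFormat: "Error % {{ job }}"

The trade-off is that you lose the separate error and total series, which the three-query version exposes for use in annotations.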

Example 3: Service Down Alert (Using up metric)

Many monitoring systems (like Prometheus with node_exporter or blackbox_exporter) expose an up metric, which is 1 if the service is up and 0 if it's down.

- uid: "service-down-{{ $labels.instance }}"
  title: "Service Down: {{ $labels.instance }}"
  condition: "A == 0"
  data:
    - refId: "A"
      queryType: "range"
      relativeTimeRange: { from: 120, to: 0 } # Last 2 minutes
      datasourceUid: "prometheus"
      model:
        # Check if the 'up' metric is 0 for the instance
        expr: "up{job='my-app-service'}"
        hide: false
        refId: "A"
        legendFormat: "{{ instance }}"
  noDataState: "Alerting"
  execErrState: "Alerting"
  evaluateFor: "1m"
  evaluateEvery: "15s"
  labels:
    severity: "critical"
    service: "my-app-service"
  annotations:
    summary: "Service {{ $labels.instance }} is down."
    description: "The service {{ $labels.instance }} (job: my-app-service) appears to be down. The 'up' metric is 0."
    runbook_url: "https://your-wiki.com/runbooks/service-down"
  folderUid: "${FOLDER_UID_SERVICES}"
  isEnabled: true

Explanation:

  • condition: A == 0. This is straightforward: if the up metric from query A is 0, the alert fires.
  • data.model.expr: We query the up metric specifically for the job my-app-service. Make sure to adjust job='my-app-service' to match your actual job name in Prometheus. (An external-probe variant using blackbox_exporter is sketched right after this list.)
  • noDataState: "Alerting": If Prometheus doesn't return any data for this query (which could happen if the entire job is gone), we want to treat it as an alert.
  • execErrState: "Alerting": Similarly, if there's an error fetching the metric, we assume the service is problematic and trigger an alert.
  • evaluateFor: "1m": We want to be alerted quickly if a service goes down, so 1m is a reasonable time to wait before firing.
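
The up metric only tells you whether Prometheus managed to scrape the target. If you also probe services from the outside with blackbox_exporter, the same rule shape works against its probe_success metric; here's a sketch of just the query and condition (the job name blackbox-http is a placeholder, and the remaining fields match Example 3):

  condition: "A == 0"
  data:
    - refId: "A"
      queryType: "range"
      relativeTimeRange: { from: 120, to: 0 } # Last 2 minutes
      datasourceUid: "prometheus"
      model:
        # probe_success is 1 when the blackbox probe succeeded and 0 when it failed
        expr: "probe_success{job='blackbox-http'}"
        hide: false
        refId: "A"
        legendFormat: "{{ instance }}"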

Example 4: No Data Alert

Sometimes, the lack of data can be as critical as bad data. This could mean a data pipeline has stopped, or a sensor has failed.

- uid: "no-data-pipeline-A"
  title: "No Data Received from Pipeline A"
  condition: "A"
  data:
    - refId: "A"
      queryType: "range"
      relativeTimeRange: { from: 600, to: 0 } # Last 10 minutes
      datasourceUid: "influxdb"
      model:
        # Query to get the latest timestamp or count of records
        # Example for InfluxDB: Query for records in the last 10 minutes
        # Adjust this query based on your data source and metric type
        query: "SELECT count(value) FROM my_pipeline_metric WHERE time > now() - 10m"
        hide: false
        refId: "A"
  noDataState: "Alerting"
  execErrState: "Error"
  evaluateFor: "15m"
  evaluateEvery: "5m"
  labels:
    severity: "critical"
    pipeline: "A"
  annotations:
    summary: "No data received from Pipeline A for 15 minutes."
    description: "Pipeline A has not sent any data in the last 15 minutes. Last check was at {{ $labels.time }}"
    runbook_url: "https://your-wiki.com/runbooks/pipeline-no-data"
  folderUid: "${FOLDER_UID_PIPELINES}"
  isEnabled: true

Explanation:

  • condition: Just A. There's no threshold here; what makes this rule fire is query A coming back empty, which is handled by the noDataState setting below rather than by the condition itself.
  • data.model.query: This is a placeholder for your data source query. The key is that this query should return a value if data is flowing. If it returns nothing, noDataState takes over. (A Prometheus-flavoured alternative using absent() is sketched right after this list.)
  • noDataState: "Alerting": This is the crucial setting here. If query A returns no rows or values, Grafana will consider the alert to be Firing.
  • evaluateFor: "15m": We're giving it a grace period of 15 minutes before alerting, assuming temporary gaps might be acceptable.
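
If the pipeline metric happens to live in Prometheus instead of InfluxDB, you can flip the logic around with PromQL's absent() function, which returns 1 only when the metric has no samples at all (my_pipeline_metric is a placeholder name). The alert then fires on an actual value rather than leaning entirely on noDataState; only the query and condition change:

  condition: "A == 1"
  data:
    - refId: "A"
      queryType: "range"
      relativeTimeRange: { from: 600, to: 0 } # Last 10 minutes
      datasourceUid: "prometheus"
      model:
        # absent() returns 1 when my_pipeline_metric has no recent samples, and nothing otherwise
        expr: "absent(my_pipeline_metric)"
        hide: false
        refId: "A"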

Important Notes on datasourceUid and folderUid:

  • datasourceUid: You must replace the placeholder values used in these examples (like "prometheus", "influxdb", or "my-prometheus-datasource-uid") with the UID of a data source that actually exists in your Grafana instance; otherwise the rule can't run its queries. You can grab the UID from the data source's settings page or its URL in the Grafana UI, or pin it explicitly via data source provisioning as sketched earlier.