v1.11.0
joeyorlando authored Oct 10, 2024
2 parents 1550bcd + 04ab676 commit b06f937
Showing 22 changed files with 1,016 additions and 483 deletions.
169 changes: 79 additions & 90 deletions docs/sources/configure/escalation-chains-and-routes/index.md
@@ -36,118 +36,107 @@ refs:

# Escalation Chains and Routes

Often alerts from monitoring systems need to be sent to different escalation chains and messaging channels, based on their severity, or other alert content.
In Grafana OnCall, configuring proper alert routing and escalation ensures that alerts are directed to the right teams and handled promptly.

Alerts often need to be sent to different teams or channels depending on their severity or specific alert details.
Set up routes and escalation chains to customize and automate escalation according to each team's workflows.

## Routes

Routes are used to determine which escalation chain should be used for a specific alert
group. A route's _[Routing Templates]_
are evaluated for each alert and **the first matching route** is used to determine the
escalation chain and chatops channels.
Routes determine which escalation chain should be triggered for a specific alert group based on the details of the alert.
A route uses [Routing Templates](ref:routing-templates) to determine the escalation chain and notification channels.

> **Example:**
>
>
> * trigger escalation chain called `Database Critical` for alerts with `{{ payload.severity == "critical" and payload.service == "database" }}` in the payload
> * create a different route for alerts with the payload `{{ "synthetic-monitoring-dev-" in payload.namespace }}` and select an escalation chain called `Security`.
When an alert is received, its details are evaluated against the route's routing template, and **the first matching route** determines how the alert will be handled.

### Manage routes
**Example:**

1. Open Integration page
1. Click **Add route** button to create a new route
1. Click **Edit** button to edit `Routing Template`. The routing template must evaluate to `True` for it to apply
1. Select channels in **Publish to Chatops** section
> **Note:** If the **Publish to Chatops** section doesn't exist, connect Chatops integrations first.
> For more information, refer to [Notify people].
1. Select **Escalation Chain** from the list
1. If **Escalation Chain** does not exist, click **Add new escalation chain** button to create a new one, it will open in a new tab.
1. Once created, **Reload list**, and select the new escalation chain
1. Click **Arrow Up** and **Arrow Down** on the right to change the order of routes
1. Click **Three dots** and **Delete Route** to delete the route
- Trigger the `Database Critical` escalation chain for alerts with `{{ payload.severity == "critical" and payload.service == "database" }}`
- Use a different route for alerts with the payload `{{ "synthetic-monitoring-dev-" in payload.namespace }}`, selecting the `Security` escalation chain.
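
The first-match rule can be sketched in plain Python. This is a minimal illustration under stated assumptions, not OnCall's implementation: routes are modeled as hypothetical `(predicate, chain_name)` pairs, ordinary Python functions stand in for the Jinja2 routing templates, and the name `pick_escalation_chain` is invented for this example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model: a route is a (predicate, escalation chain name) pair.
# In OnCall the predicate would be a Jinja2 routing template.
Route = Tuple[Callable[[dict], bool], str]


def pick_escalation_chain(routes: List[Route], payload: dict) -> Optional[str]:
    """Return the chain of the first route whose predicate is true, else None."""
    for matches, chain_name in routes:
        if matches(payload):
            return chain_name  # stop at the first match; later routes are ignored
    return None


# Routes are listed in priority order, mirroring the two examples above.
ROUTES: List[Route] = [
    (
        lambda p: p.get("severity") == "critical" and p.get("service") == "database",
        "Database Critical",
    ),
    (
        lambda p: "synthetic-monitoring-dev-" in p.get("namespace", ""),
        "Security",
    ),
]

chain = pick_escalation_chain(ROUTES, {"severity": "critical", "service": "database"})
```

Because evaluation stops at the first match, moving the `Security` route above `Database Critical` would change the result for an alert that satisfies both conditions.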

### Routing based on labels
### Create and manage routes

> **Note:** Labels are currently available only in cloud.
To create or manage a route:

In addition, there is a `labels` variable available to your routing templates, which contains all of the labels assigned
to the Alert Group, as a `dict`. This allows you to route based on labels (or a mix of labels and/or payload based data):
1. Navigate to the **Integrations** page.
1. Click **Add route** to create a new route, or **Edit** to modify an existing one.
1. In the **Routing Template** section, define conditions that will determine which alerts this route applies to.
The template must evaluate to `True` for the route to be selected.
1. Select the appropriate escalation chain from the **Escalation Chain** dropdown.
If an escalation chain doesn’t exist, click **Add new escalation chain**, which will open a new tab for chain creation.
After creating the chain, return to the routes page and click **Reload list** to update the available options.
1. In the **Publish to ChatOps** section, select the relevant communication channels for this route (Slack, Teams, etc.).
Ensure ChatOps integrations are configured before using this feature.
1. Arrange the routes by clicking the up/down arrows to prioritize the routes as needed. The order determines which route is evaluated first.
1. To delete a route, click the three dots on the route and select **Delete Route**.

> **Example:**
>
> * `{{ labels.foo == "bar" or "hello" in labels.keys() or payload.severity == "critical" }}`
### Label-based routing

## Escalation Chains
{{< admonition type="note" >}}
This feature is available exclusively on Grafana Cloud.
{{< /admonition >}}

Once an alert group is created and assigned to the route with escalation chain, the
escalation chain will be executed. Until user performs an action, which stops the escalation
chain (e.g. acknowledge, resolve, silence etc), the escalation chain will continue to
execute.
You can use the `labels` variable in your routing templates to route alerts based on alert group labels.
This provides additional flexibility in routing alerts based on both labels and payload data.

Users can create escalation chains to configure different types of escalation workflows.
For example, you can create a chain that will notify on-call users with high priority, and
another chain that will only send a message into a Slack channel.
**Example:**

Escalation chains determine Who and When to notify. How to notify is set by the user, based on their own preferences.
`{{ labels.foo == "bar" or "hello" in labels.keys() or payload.severity == "critical" }}`
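
To make the logic of this template easier to follow, here is a rough Python equivalent. It assumes `labels` and `payload` are plain dictionaries, and the function name `route_matches` is hypothetical. The Jinja2 template remains the actual mechanism; `dict.get` is used here only to approximate how a missing key fails the comparison instead of raising an error.

```python
def route_matches(labels: dict, payload: dict) -> bool:
    """Approximate the example routing template with plain dict lookups."""
    return (
        labels.get("foo") == "bar"                # label "foo" has the value "bar"
        or "hello" in labels.keys()               # a label named "hello" exists
        or payload.get("severity") == "critical"  # payload-based condition
    )


matched_by_label = route_matches({"foo": "bar"}, {"severity": "low"})
```

Any one of the three conditions is enough for the route to match, so labels and payload data can be mixed freely in a single template.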

### Types of escalation steps
## Escalation Chains

* `Wait` - wait for a specified amount of time before proceeding to the next step. If you
need a larger time interval, use multiple wait steps in a row.
* `Notify users` - send a notification to a user or a group of users.
* `Notify users from on-call schedule` - send a notification to a user or a group of users
from an on-call schedule.
* `Notify all users from a team` - send a notification to all users in a team.
* `Resolve incident automatically` - resolve the alert group right now with status
`Resolved automatically`.
* `Escalate to all Slack channel members` - send a notification to the users in the Slack channel. These users will be notified
via the method configured in their user profile.
* `Notify Slack User Group` - send a notification to each member of a Slack user group. These users will be notified
via the method configured in their user profile.
* `Trigger outgoing webhook` - trigger an [outgoing webhook].
* `Notify users one by one (round robin)` - notify users sequentially, cycling through users for **different alert groups**.
Example: if users A, B, and C are in the list, the first alert group notifies A, the second alert group notifies B, and
the third alert group notifies C. Note: users are sorted alphabetically by their username.
To notify multiple users **within the same alert group** until someone acknowledges, instead use `Notify users` policies with
`Wait` policies between them in the escalation chain.
* `Continue escalation if current time is in range` - continue escalation only if the current
time is in the specified range. It will wait for the specified time to continue escalation.
Useful when you want to get escalation only during working hours.
* `Continue escalation if >X alerts per Y minutes (beta)` - continue escalation only if it
passes some threshold.
* `Repeat escalation from beginning (5 times max)` - loop the escalation chain.

> **Note:** Both "**Escalate to all Slack channel members**" and "**Notify Slack User Group**" select the OnCall registered users
who match members of the Slack channel or Slack User Group and whose OnCall profiles are linked to their Slack accounts (that is, users
should have linked their Slack and OnCall users). In both cases, the users satisfying these criteria are
notified following their respective notification policies. However, to avoid **spamming** the Slack channel/thread,
users **won't be notified** in the alert group Slack **thread** (this is how the feature is currently implemented).
Instead, they are notified using the **other** options defined in
their respective policies.
Escalation chains define the series of actions taken when an alert is triggered.
The chain continues until a user intervenes by acknowledging, resolving, or silencing the alert.

### Notification types
You can configure different escalation chains for different workflows.
For example, one chain might notify on-call users immediately, while another sends a low-priority message to a Slack channel.

Each escalation step that notifies a user does so by triggering that user's personal notification steps. These are configured on the Grafana
OnCall users page (by clicking "View my profile").
The notification steps are executed for each user in the escalation step.
Users can configure two types of personal notification chains:
### Create and manage escalation chains

* **Default Notifications**
1. Navigate to the **Escalation Chains** page.
1. Click **New escalation chain** to create a new chain.
1. Enter a unique name and assign the chain to a team.
1. Click **Add escalation step** to define the steps for this chain (e.g., notifying users, waiting, escalating).
1. To edit an existing chain, click **Edit**. To remove a chain, click **Delete**.

* **Important Notifications**
{{< admonition type="note" >}}

In the escalation step, user can select which type of notification to use.
For more information, refer to [Notify people].
- The name must be unique across the organization.
Alert groups inherit the team from the integration, not the escalation chain.
- Linked integrations and routes are shown in the right panel.
Changes to the escalation chain impact all associated integrations and routes.
{{< /admonition >}}

### Types of escalation steps

- `Wait`: Pause for a specified time before moving to the next step. You can add multiple wait steps for longer intervals.
- `Notify users`: Notify individual users or groups.
- `Notify users from on-call schedule`: Send notifications to users from a defined on-call schedule.
- `Notify all team members`: Notify all users in a team.
- `Resolve incident automatically`: Immediately resolve the alert group with the status `Resolved automatically`.
- `Notify Slack channel members`: Notify users in a Slack channel based on their OnCall profile preferences.
- `Notify Slack user group`: Notify all members of a Slack user group.
- `Trigger outgoing webhook`: Activate an [outgoing webhook](ref:outgoing-webhooks).
- `Round robin notifications`: Notify users sequentially, with each user receiving different alert groups.
- `Time-based escalation`: Continue escalation only if the current time falls within a specific range (e.g., during working hours).
- `Threshold-based escalation`: Escalate only if a certain number of alerts occur within a specific time frame.
- `Repeat escalation`: Loop the escalation chain up to five times.
- `Declare incident (non-default routes)`: **Available only in Grafana Cloud**. Declares an incident with a specified severity.
Limited to one incident per route at a time.
Additional alerts are grouped into the active incident, and up to five are listed as incident context.
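
The round-robin step is the least obvious of these, so here is a minimal sketch of how users could be matched to successive alert groups. It assumes users are ordered alphabetically by username and that each new alert group goes to the next user in that order. The function name `round_robin_assignments` is hypothetical, and how OnCall tracks its position between alert groups is not shown here.

```python
from itertools import cycle
from typing import Dict, List


def round_robin_assignments(usernames: List[str], alert_groups: List[str]) -> Dict[str, str]:
    """Map each alert group to one user, cycling through users in sorted order."""
    ordered = sorted(usernames)  # assumption: alphabetical by username
    # zip stops when alert_groups is exhausted; cycle repeats the user list.
    return dict(zip(alert_groups, cycle(ordered)))


assignments = round_robin_assignments(["C", "A", "B"], ["ag1", "ag2", "ag3", "ag4"])
```

Each alert group notifies only one user in this model. To notify several users within the same alert group until someone acknowledges, chain `Notify users` steps with `Wait` steps between them instead.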

{{< admonition type="note" >}}
The **Notify Slack channel members** and **Notify Slack user group** steps are designed to notify OnCall-registered users via their configured notification rules.
To avoid spamming a Slack channel with alert group notifications, notifications are not sent in the alert group Slack thread.
{{< /admonition >}}

### Notification types

### Manage Escalation Chains
When an escalation step notifies a user, it follows their personal notification settings, which are configured in their user profile.

1. Open **Escalation Chains** page
2. Click **New escalation chain** button to create a new escalation chain
Each user can have two sets of notification rules:

3. Enter a name and assign it to a team
> **Note:** Name must be unique across organization
> **Note:** Alert Groups inherit the team from the Integration, not the Escalation Chain
4. Click **Add escalation step** button to add a new step
5. Click **Delete** to delete the Escalation Chain, and **Edit** to edit the name or the team.
- **Default Notifications**: For standard alerts.
- **Important Notifications**: For high-priority alerts.

> **Important:** Linked Integrations and Routes are displayed in the right panel. Any change in the Escalation Chain will
affect all linked Integrations and Routes.
Each escalation step allows you to select which set of notification rules to use.
For more information about user notification rules, refer to the [Notifications](ref:notify-people) section.
5 changes: 3 additions & 2 deletions docs/sources/oncall-api-reference/escalation_policies.md
@@ -42,7 +42,7 @@ The above command returns JSON structured in the following way:
| ---------------------------------- | :--------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `escalation_chain_id` | Yes | Each escalation policy is assigned to a specific escalation chain. |
| `position` | Optional | Escalation policies execute one after another starting from `position=0`. `Position=-1` will put the escalation policy to the end of the list. A new escalation policy created with a position of an existing escalation policy will move the old one (and all following) down in the list. |
| `type` | Yes | One of: `wait`, `notify_persons`, `notify_person_next_each_time`, `notify_on_call_from_schedule`, `notify_user_group`, `trigger_webhook`, `resolve`, `notify_whole_channel`, `notify_if_time_from_to`. |
| `type` | Yes | One of: `wait`, `notify_persons`, `notify_person_next_each_time`, `notify_on_call_from_schedule`, `notify_user_group`, `trigger_webhook`, `resolve`, `notify_whole_channel`, `notify_if_time_from_to`, `declare_incident`. |
| `important` | Optional | Default is `false`. Will assign "important" to personal notification rules if `true`. This can be used to distinguish alerts on which you want to be notified immediately by phone. Applicable for types `notify_persons`, `notify_team_members`, `notify_on_call_from_schedule`, and `notify_user_group`. |
| `duration` | If type = `wait` | The duration, in seconds, when type `wait` is chosen. Valid values are: `60`, `300`, `900`, `1800`, `3600`. |
| `action_to_trigger` | If type = `trigger_webhook` | ID of a webhook. |
@@ -52,7 +52,8 @@ The above command returns JSON structured in the following way:
| `notify_on_call_from_schedule` | If type = `notify_on_call_from_schedule` | ID of a Schedule. |
| `notify_if_time_from` | If type = `notify_if_time_from_to` | UTC time represents the beginning of the time period, for example `09:00:00Z`. |
| `notify_if_time_to` | If type = `notify_if_time_from_to` | UTC time represents the end of the time period, for example `18:00:00Z`. |
| `team_to_notify` | If type = `notify_team_members` | ID of a team. |
| `team_to_notify` | If type = `notify_team_members` | ID of a team. |
| `severity` | If type = `declare_incident` | Severity of the incident. |
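
A small client-side helper can catch invalid request bodies before they are sent. The sketch below encodes only the rules visible in the table rows shown in this diff (rows hidden by the collapsed hunk are not covered). `build_escalation_policy` and `REQUIRED_FIELDS` are hypothetical names, the chain ID and severity value are placeholders, and the API itself remains the authority on validation.

```python
# Conditional fields per policy type, taken from the visible table rows only.
REQUIRED_FIELDS = {
    "wait": ("duration",),
    "trigger_webhook": ("action_to_trigger",),
    "notify_on_call_from_schedule": ("notify_on_call_from_schedule",),
    "notify_if_time_from_to": ("notify_if_time_from", "notify_if_time_to"),
    "notify_team_members": ("team_to_notify",),
    "declare_incident": ("severity",),
}
VALID_WAIT_DURATIONS = {60, 300, 900, 1800, 3600}  # seconds


def build_escalation_policy(escalation_chain_id: str, policy_type: str, **fields) -> dict:
    """Assemble a request body and check the type-specific fields."""
    payload = {"escalation_chain_id": escalation_chain_id, "type": policy_type, **fields}
    missing = [name for name in REQUIRED_FIELDS.get(policy_type, ()) if name not in payload]
    if missing:
        raise ValueError(f"type {policy_type!r} requires: {', '.join(missing)}")
    if policy_type == "wait" and payload["duration"] not in VALID_WAIT_DURATIONS:
        raise ValueError(f"duration must be one of {sorted(VALID_WAIT_DURATIONS)}")
    return payload


# "CHAIN_ID" and "critical" are placeholder values for illustration.
incident_step = build_escalation_policy("CHAIN_ID", "declare_incident", severity="critical")
wait_step = build_escalation_policy("CHAIN_ID", "wait", duration=300, position=0)
```

Keeping the type-to-field mapping in one dictionary means a new step type such as `declare_incident` needs only one added entry.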

**HTTP request**

@@ -1,7 +1,7 @@
 import re

 from apps.alerts.incident_appearance.templaters.alert_templater import AlertTemplater
-from common.utils import convert_md_to_html, escape_html, url_re, urlize_with_respect_to_a
+from common.utils import convert_md_to_html, escape_html, url_re, urlize_with_respect_to_a, validate_url


 class AlertWebTemplater(AlertTemplater):
@@ -26,7 +26,7 @@ def _postformat(self, templated_alert):
                 message = message.replace(substitution, original_link)
             templated_alert.message = urlize_with_respect_to_a(message)
         if templated_alert.image_url:
-            templated_alert.image_url = escape_html(templated_alert.image_url)
+            templated_alert.image_url = validate_url(templated_alert.image_url)

         return templated_alert

2 changes: 2 additions & 0 deletions engine/apps/alerts/models/alert_receive_channel.py
@@ -422,6 +422,8 @@ def emojized_verbal_name(self):

     @property
     def new_incidents_web_link(self):
+        from apps.alerts.models import AlertGroup
+
         return UIURLBuilder(self.organization).alert_groups(
             f"?integration={self.public_primary_key}&status={AlertGroup.NEW}",
         )
4 changes: 2 additions & 2 deletions engine/apps/alerts/tasks/notify_user.py
@@ -400,7 +400,7 @@ def perform_notification(log_record_pk, use_default_notification_policy_fallback
             UserNotificationPolicyLogRecord(
                 author=user,
                 type=UserNotificationPolicyLogRecord.TYPE_PERSONAL_NOTIFICATION_FAILED,
-                notification_policy=notification_policy,
+                notification_policy=notification_policy if not use_default_notification_policy_fallback else None,
                 reason="Expected data is missing",
                 alert_group=alert_group,
                 notification_step=notification_policy.step if notification_policy else None,
@@ -424,7 +424,7 @@ def perform_notification(log_record_pk, use_default_notification_policy_fallback
             UserNotificationPolicyLogRecord(
                 author=user,
                 type=UserNotificationPolicyLogRecord.TYPE_PERSONAL_NOTIFICATION_FAILED,
-                notification_policy=notification_policy,
+                notification_policy=notification_policy if not use_default_notification_policy_fallback else None,
                 reason="Skipped notification because alert group is resolved",
                 alert_group=alert_group,
                 notification_step=notification_policy.step if notification_policy else None,