You can use two built-in alerting solutions for monitoring the health of Astronomer:
- Deployment-level alerts, which notify you when the health of an Airflow Deployment is low or if any of Airflow’s underlying components are underperforming, including the Airflow scheduler.
- Platform-level alerts, which notify you when a component of your Software installation is unhealthy, such as Elasticsearch, Astronomer’s Houston API, or your Docker Registry.
Anatomy of an alert
Platform and Deployment alerts are defined in YAML and use PromQL queries for alerting conditions. Each alert YAML object contains the following key-value pairs:
- `expr`: The logic that determines when the alert will fire, written in PromQL.
- `for`: The length of time that the `expr` logic has to be true for the alert to fire. This can be defined in minutes or hours (e.g. `5m` or `2h`).
- `labels.tier`: The level of your platform that the alert should operate at. Deployment alerts have a tier of `airflow`, while platform alerts have a tier of `platform`.
- `labels.severity`: The severity of the alert. Can be `info`, `warning`, `high`, or `critical`.
- `annotations.summary`: The text for the alert that's sent by Alertmanager.
- `annotations.description`: A human-readable description of what the alert does.
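Putting these fields together, a single alert definition might look like the following sketch. The alert name and PromQL metric here are hypothetical placeholders, not built-in Astronomer values:

```yaml
# Hypothetical alert illustrating the fields described above.
# The alert name and metric are placeholders, not real Astronomer alerts.
- alert: ExampleSchedulerHeartbeat
  expr: rate(example_scheduler_heartbeat_total[1m]) <= 0
  for: 5m                # expr must hold true for 5 minutes before firing
  labels:
    tier: airflow        # Deployment-level alert
    severity: critical
  annotations:
    summary: "The Airflow scheduler has stopped emitting heartbeats"
    description: "Fires when no scheduler heartbeat is recorded for 5 minutes."
```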
Subscribe to alerts
Astronomer uses Prometheus Alertmanager to manage alerts. This includes silencing, inhibiting, aggregating, and sending out notifications using methods such as email, on-call notification systems, and chat platforms. You can configure Alertmanager to send built-in Astronomer alerts to email, HipChat, PagerDuty, Pushover, Slack, OpsGenie, and more by defining alert receivers in the Alertmanager Helm chart and modifying the Alertmanager `email-config` parameter.
Create alert receivers
Alertmanager uses receivers to integrate with different messaging platforms. To begin sending notifications for alerts, you first need to define receivers in YAML using the Alertmanager Helm chart.
This Helm chart contains groups for each possible alert type based on `labels.tier` and `labels.severity`. Each receiver must be defined within at least one alert type in order to receive notifications.
For example, adding the following receiver to `receivers.platformCritical` would cause platform alerts with critical severity to appear in a specified Slack channel:
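A Slack receiver for this group could be sketched as follows, based on the standard Alertmanager receiver schema. The receiver name, webhook URL, and channel are placeholders; consult the Alertmanager Helm chart for the exact structure used by your installation:

```yaml
receivers:
  platformCritical:
    - name: platform-critical-slack       # placeholder receiver name
      slack_configs:
        - api_url: https://hooks.slack.com/services/<your-webhook-id>
          channel: "#platform-alerts"     # placeholder channel
          text: "{{ .CommonAnnotations.summary }}"
```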
To route alerts to a receiver outside of these default groups, add an entry to the `customRoutes` list with the appropriate `match_re` and `receiver` configuration values.
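For example, a `customRoutes` entry might look like the following sketch. The receiver name and label values are hypothetical:

```yaml
customRoutes:
  - receiver: my-custom-receiver   # placeholder; must match a defined receiver name
    match_re:
      tier: platform               # regexes matched against the alert's labels
      severity: critical
```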
If you route to a `platform`, `platformCritical`, or `airflow` receiver defined in the prior section, you do not need a `customRoute`. Alerts are automatically routed to these receivers by the `tier` label.
For more information on building and configuring receivers, refer to Prometheus documentation.
Push alert receivers to Astronomer
To add a new receiver to Astronomer, add your receiver configuration to your `values.yaml` file and push the changes to your installation as described in Apply a config change. The receivers you add must be specified in the same order and format as they appear in the Alertmanager Helm chart. Once you push the alerts to Astronomer, they are automatically added to the Alertmanager ConfigMap.
Create custom alerts
In addition to subscribing to Astronomer's built-in alerts, you can also create custom alerts and push them to Astronomer. Platform and Deployment alerts are defined in YAML and pushed to Astronomer with the Prometheus Helm chart. For example, you can define an alert that fires if more than 2 Airflow schedulers across the platform are not heartbeating for more than 5 minutes. To push a custom alert, add it to the `AdditionalAlerts` section of your `values.yaml` file and push the file with Helm as described in Apply a config change.
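As a sketch, the scheduler alert described above could be defined as follows. The metric name is a placeholder, and the exact key casing of the alerts section may vary by chart version, so check your Prometheus configuration and `values.yaml` for the values your installation exposes:

```yaml
additionalAlerts:
  platform: |
    # Placeholder alert name and metric; adapt to your Prometheus metrics.
    - alert: SchedulersNotHeartbeating
      expr: count(rate(example_scheduler_heartbeat_total[1m]) <= 0) > 2
      for: 5m
      labels:
        tier: platform
        severity: critical
      annotations:
        summary: "More than 2 Airflow schedulers are not heartbeating"
        description: "Fires when more than 2 schedulers platform-wide stop heartbeating for over 5 minutes."
```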
After you’ve pushed the alert to Astronomer, make sure that you’ve configured a receiver to subscribe to the alert. For more information, read Subscribe to Alerts.
Reference: Deployment alerts
The following table lists some of the most common Deployment alerts that you might receive from Astronomer. For a complete list of built-in Airflow alerts, see the Prometheus configmap.

| Alert | Description | Follow-Up |
|---|---|---|
| `AirflowDeploymentUnhealthy` | Your Airflow Deployment is unhealthy or not completely available. | Contact Astronomer support. |
| `AirflowEphemeralStorageLimit` | Your Airflow Deployment has been using more than 5 GiB of its ephemeral storage for over 10 minutes. | Make sure to continually remove unused temporary data in your Airflow tasks. |
| `AirflowPodQuota` | Your Airflow Deployment has been using over 95% of its pod quota for over 10 minutes. | Either increase your Deployment's Extra Capacity in the Software UI or update your DAGs to use fewer resources. If you have not already done so, upgrade to Airflow 2.0 for improved resource management. |
| `AirflowSchedulerUnhealthy` | The Airflow scheduler has not emitted a heartbeat for over 1 minute. | Contact Astronomer support. |
| `AirflowTasksPendingIncreasing` | Your Airflow Deployment created tasks faster than it cleared them for over 30 minutes. | Ensure that your tasks are running and completing correctly. If your tasks are running as expected, raise concurrency and parallelism in Airflow, then consider increasing your Deployment's resources to handle the increased load. |
| `ContainerMemoryNearTheLimitInDeployment` | A container in your Airflow Deployment has been using over 95% of its memory quota for over 60 minutes. | Either increase your Deployment's allocated resources in the Software UI or update your DAGs to use less memory. If you have not already done so, upgrade to Airflow 2.0 for improved resource management. |