APC includes two built-in alerting systems for monitoring the health of your platform and Deployments:
- Deployment-level alerts: Notify you when an Airflow Deployment is unhealthy or components are underperforming.
- Platform-level alerts: Notify you when APC platform components are unhealthy (Elasticsearch, Houston API, Registry, Commander).
Alerts fire based on metrics collected by Prometheus. When alert conditions are met, Prometheus Alertmanager sends notifications to your configured channels.
Alertmanager is enabled by default as part of the APC monitoring stack (`tags.monitoring: true`). To disable it individually, set `global.alertmanagerEnabled: false` in your `values.yaml`. See Apply platform configuration for details.
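A minimal sketch of that toggle in `values.yaml`, using the two settings named above:
```yaml
# Keep the rest of the monitoring stack (Prometheus, Grafana) enabled...
tags:
  monitoring: true
# ...but disable Alertmanager individually.
global:
  alertmanagerEnabled: false
```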
Alert architecture
Anatomy of an alert
Alerts are defined in YAML using PromQL queries:
```yaml
- alert: ManyUnhealthySchedulers
  expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 5
  for: 5m
  labels:
    tier: platform
    severity: critical
  annotations:
    summary: "{{ $value }} airflow schedulers are not heartbeating"
    description: "More than 5 Airflow schedulers have not emitted a heartbeat for over 5 minutes."
```
| Field | Description |
|---|---|
| `expr` | PromQL expression that determines when the alert fires |
| `for` | Duration the condition must hold before the alert fires (for example, `5m`, `1h`) |
| `labels.tier` | Alert level: `airflow` (Deployment) or `platform` |
| `labels.severity` | Severity: `info`, `warning`, `high`, or `critical` |
| `annotations.summary` | Short alert message text |
| `annotations.description` | Human-readable description of the condition |
Subscribe to alerts
Alertmanager uses receivers to integrate with notification platforms. Define receivers in your `values.yaml`:
Email alerts
```yaml
alertmanager:
  receivers:
    platform:
      email_configs:
        - smarthost: smtp.example.com:587
          from: alerts@example.com
          to: ops-team@example.com
          auth_username: alerts@example.com
          auth_password: ${SMTP_PASSWORD}
          send_resolved: true
```
Slack alerts
```yaml
alertmanager:
  receivers:
    platformCritical:
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz
          channel: '#platform-alerts'
          title: '{{ .CommonAnnotations.summary }}'
          text: |-
            {{ range .Alerts }}{{ .Annotations.description }}
            {{ end }}
```
PagerDuty alerts
```yaml
alertmanager:
  receivers:
    platformCritical:
      pagerduty_configs:
        - service_key: ${PAGERDUTY_SERVICE_KEY}
          severity: '{{ .CommonLabels.severity }}'
          description: '{{ .CommonAnnotations.summary }}'
```
OpsGenie alerts
```yaml
alertmanager:
  receivers:
    platformCritical:
      opsgenie_configs:
        - api_key: ${OPSGENIE_API_KEY}
          message: '{{ .CommonAnnotations.summary }}'
          priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
```
Default receiver groups
APC includes default receiver groups based on tier and severity:
| Receiver | Tier | Severity |
|---|---|---|
| `platform` | platform | all |
| `platformCritical` | platform | critical |
| `airflow` | airflow | all |
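Because routing to these groups is automatic, pairing them covers the common split between routine notifications and pages. A sketch reusing the placeholder settings from the examples above:
```yaml
alertmanager:
  receivers:
    # Routine platform-tier alerts: notify the ops mailing list.
    platform:
      email_configs:
        - smarthost: smtp.example.com:587
          from: alerts@example.com
          to: ops-team@example.com
    # Critical platform-tier alerts: page on-call through PagerDuty.
    platformCritical:
      pagerduty_configs:
        - service_key: ${PAGERDUTY_SERVICE_KEY}
```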
Custom routes
If you define a `platform`, `platformCritical`, or `airflow` receiver, you don't need a custom route to reach it; alerts are automatically routed based on the `tier` label. Use `customRoutes` only for non-default routing (for example, high-severity Deployment alerts):
```yaml
alertmanager:
  customRoutes:
    - receiver: deployment-high-receiver
      match_re:
        tier: airflow
        severity: high
    - receiver: deployment-warning-receiver
      match_re:
        tier: airflow
        severity: warning
```
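The `deployment-high-receiver` and `deployment-warning-receiver` names above must resolve to receivers you define yourself under `alertmanager.customReceiver`, covered in the next section. A sketch using Slack as a placeholder service (the webhook URL and channel names are assumptions):
```yaml
alertmanager:
  customReceiver:
    - name: deployment-high-receiver
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz  # placeholder webhook
          channel: '#deployment-high'
    - name: deployment-warning-receiver
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz  # placeholder webhook
          channel: '#deployment-warnings'
```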
Custom receivers
Use `alertmanager.customReceiver` to define receivers for notification services not covered by the built-in receiver keys. Custom receivers work alongside `customRoutes` to route alerts to those services:
```yaml
alertmanager:
  customReceiver:
    - name: sns-receiver
      sns_configs:
        - api_url: <SNS_ENDPOINT>
          topic_arn: <SNS_TOPIC_ARN>
          subject: '[Alert: {{ .GroupLabels.alertname }}]'
          sigv4:
            region: <AWS_REGION>
            role_arn: <SNS_ROLE_ARN>
  customRoutes:
    - receiver: sns-receiver
      match_re:
        tier: platform
        severity: critical
```
Apply configuration
Push receiver configuration to your installation:
```bash
helm upgrade astronomer astronomer/astronomer \
  -f values.yaml \
  --namespace astronomer
```
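After the upgrade, you can sanity-check that Alertmanager reloaded the new configuration. A sketch, assuming the default `astronomer` namespace; the `app=alertmanager` label selector is an assumption, so adjust it to your chart's labels:
```bash
# Find the Alertmanager pod (the exact name varies by release)
kubectl get pods -n astronomer | grep alertmanager

# Tail its logs and look for configuration load errors
# (app=alertmanager is an assumed label; adjust as needed)
kubectl logs -n astronomer -l app=alertmanager --tail=50
```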
Create custom alerts
Add custom alerts by setting `prometheus.additionalAlerts` in your `values.yaml`:
Platform alert example
Alert when multiple schedulers are unhealthy:
```yaml
prometheus:
  additionalAlerts:
    platform: |
      - alert: MultipleSchedulersUnhealthy
        expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 2
        for: 5m
        labels:
          tier: platform
          severity: critical
        annotations:
          summary: "{{ $value }} schedulers are not heartbeating"
          description: "More than 2 Airflow schedulers are unhealthy for over 5 minutes."
```
Deployment alert example
Alert on high task failure rate:
```yaml
prometheus:
  additionalAlerts:
    airflow: |
      - alert: HighTaskFailureRate
        expr: |
          (
            sum(increase(airflow_ti_failures{}[1h])) by (deployment)
            /
            sum(increase(airflow_ti_successes{}[1h]) + increase(airflow_ti_failures{}[1h])) by (deployment)
          ) > 0.1
        for: 15m
        labels:
          tier: airflow
          severity: warning
        annotations:
          summary: "High task failure rate in {{ $labels.deployment }}"
          description: "Task failure rate exceeds 10% for the past 15 minutes."
```
Built-in deployment alerts
For a complete list of built-in alerts, see the Prometheus alerts configmap.
| Alert | Description | Action |
|---|---|---|
| AirflowDeploymentUnhealthy | Deployment is unhealthy or unavailable for 15+ minutes | Check pod status and review logs |
| AirflowPodQuota | Deployment is using more than 95% of its pod quota for 10+ minutes | Increase Extra Capacity or optimize Dags |
| AirflowSchedulerUnhealthy | Scheduler has not heartbeated for 6+ minutes | Check scheduler logs; restart if needed |
| AirflowTasksPendingIncreasing | Tasks are queuing faster than they clear for 30+ minutes | Increase concurrency or worker resources |
Built-in platform alerts
| Alert | Description | Action |
|---|---|---|
| CriticalComponentPodCrashLooping | A core platform component pod (Houston, Commander, Grafana, Prometheus, Registry) has been restarting repeatedly for 15+ minutes | Check pod logs in the APC namespace and investigate the crash cause |
| CriticalComponentPodNotReady | A pod in the APC platform namespace has been in a non-ready state for 15+ minutes | Check pod events and logs in the APC namespace |
| TargetDown | More than 10% of Prometheus scrape targets for a job are unreachable for 10+ minutes | Check the failing service's pods and endpoints |
| ElasticSeachUnassignedShards | Elasticsearch cluster has unassigned shards for 10+ minutes | Check Elasticsearch cluster health and logs |
| ElasticDiskHighWatermarkReached | Elasticsearch node disk usage exceeds 90% for 5+ minutes | Increase Elasticsearch storage or clean up old indices |
| ElasticDiskFloodWatermarkReached | Elasticsearch node disk usage exceeds 95% for 5+ minutes; Elasticsearch enforces a read-only index block at this threshold | Immediately increase storage or delete old indices |
| IngessCertificateExpiration | A TLS certificate for a platform hostname expires in less than one week | Renew the TLS certificate |
The ElasticSeachUnassignedShards and IngessCertificateExpiration alert names contain typos in their current implementation. Use the exact names shown when creating silences or custom routes.
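For instance, a custom route that forwards both misspelled alerts to a dedicated receiver has to spell the names verbatim. A sketch; `es-oncall-receiver` is a placeholder you would define under `customReceiver`, and matching on the `alertname` label is an assumption that your routes can match any label:
```yaml
alertmanager:
  customRoutes:
    - receiver: es-oncall-receiver   # placeholder; define under customReceiver
      match_re:
        # Match the names exactly as implemented, including the misspellings.
        alertname: ElasticSeachUnassignedShards|IngessCertificateExpiration
```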
View active alerts
Alertmanager UI
Access Alertmanager to view active alerts:
https://alertmanager.<base-domain>
Prometheus UI
Query alerts in Prometheus:
https://prometheus.<base-domain>/alerts
CLI
```bash
# View firing alerts
kubectl exec -n astronomer prometheus-0 -- \
  wget -qO- localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
```
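You can also query Alertmanager's own API for the alerts it is currently handling. A sketch; the `alertmanager-0` pod name and port 9093 are assumptions based on Alertmanager defaults:
```bash
# List the alerts Alertmanager currently holds
# (pod name and port are assumptions; adjust to your installation)
kubectl exec -n astronomer alertmanager-0 -- \
  wget -qO- localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
```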
Silence alerts
Temporarily silence alerts during maintenance:
Via Alertmanager UI
- Go to https://alertmanager.<base-domain>
- Click Silences > New Silence
- Add matchers (for example, alertname=AirflowSchedulerUnhealthy)
- Set a duration and comment
- Click Create
Via API
```bash
curl -X POST https://alertmanager.<base-domain>/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [{"name": "alertname", "value": "AirflowSchedulerUnhealthy", "isRegex": false}],
    "startsAt": "2026-02-05T00:00:00Z",
    "endsAt": "2026-02-05T06:00:00Z",
    "createdBy": "admin",
    "comment": "Maintenance window"
  }'
```
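To review or remove silences through the same API (these are standard Alertmanager v2 endpoints; note the singular `silence` in the delete path):
```bash
# List current silences and their IDs
curl -s https://alertmanager.<base-domain>/api/v2/silences | jq '.[] | {id, comment, status}'

# Expire a silence early using an ID from the list above
curl -X DELETE https://alertmanager.<base-domain>/api/v2/silence/<SILENCE_ID>
```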
Best practices
- Start with built-in alerts before creating custom ones
- Set appropriate thresholds to avoid alert fatigue
- Use severity levels: reserve critical for pages
- Include runbook links in alert descriptions (see the sketch after this list)
- Test alerts in non-production environments first
- Document escalation paths for each severity level
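One common way to carry runbook links is an extra annotation on the alert rule; `runbook_url` is a widely used convention rather than an APC-specific key, and the URL here is a placeholder:
```yaml
# Fragment of an alert rule showing a runbook link annotation.
annotations:
  summary: "Scheduler is not heartbeating in {{ $labels.deployment }}"
  description: "Follow the runbook for triage steps."
  runbook_url: https://wiki.example.com/runbooks/scheduler-unhealthy  # placeholder URL
```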