APC includes two built-in alerting systems for monitoring the health of your platform and Deployments:
- Deployment-level alerts: Notify you when an Airflow Deployment is unhealthy or components are underperforming.
- Platform-level alerts: Notify you when APC platform components are unhealthy (Elasticsearch, Houston API, Registry, Commander).
Alerts fire based on metrics collected by Prometheus. When alert conditions are met, Prometheus Alertmanager sends notifications to your configured channels.
Alertmanager is enabled by default as part of the APC monitoring stack (`tags.monitoring: true`). To disable it individually, set `global.alertmanagerEnabled: false` in your `values.yaml`. See Apply platform configuration for details.
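A minimal sketch of that toggle in `values.yaml`, using the two settings named above:
```yaml
# Keep the rest of the monitoring stack (Prometheus, Grafana) enabled...
tags:
  monitoring: true
# ...but disable Alertmanager individually.
global:
  alertmanagerEnabled: false
```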
Alert architecture
Anatomy of an alert
Alerts are defined in YAML using PromQL queries:
```yaml
- alert: ManyUnhealthySchedulers
  expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 5
  for: 5m
  labels:
    tier: platform
    severity: critical
  annotations:
    summary: "{{ $value }} airflow schedulers are not heartbeating"
    description: "More than 5 Airflow schedulers have not emitted a heartbeat for over 5 minutes."
```
| Field | Description |
|---|---|
| `expr` | PromQL expression that determines when the alert fires |
| `for` | Duration the condition must hold before the alert fires (for example, `5m`, `1h`) |
| `labels.tier` | Alert level: `airflow` (Deployment) or `platform` |
| `labels.severity` | Severity: `info`, `warning`, `high`, or `critical` |
| `annotations.summary` | Short alert message text |
| `annotations.description` | Human-readable description of the condition |
Subscribe to alerts
Alertmanager uses receivers to integrate with notification platforms. Define receivers in your `values.yaml`:
Email alerts
```yaml
alertmanager:
  receivers:
    platform:
      email_configs:
        - smarthost: smtp.example.com:587
          from: alerts@example.com
          to: ops-team@example.com
          auth_username: alerts@example.com
          auth_password: ${SMTP_PASSWORD}
          send_resolved: true
```
Slack alerts
```yaml
alertmanager:
  receivers:
    platformCritical:
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz
          channel: '#platform-alerts'
          title: '{{ .CommonAnnotations.summary }}'
          text: |-
            {{ range .Alerts }}{{ .Annotations.description }}
            {{ end }}
```
PagerDuty alerts
```yaml
alertmanager:
  receivers:
    platformCritical:
      pagerduty_configs:
        - service_key: ${PAGERDUTY_SERVICE_KEY}
          severity: '{{ .CommonLabels.severity }}'
          description: '{{ .CommonAnnotations.summary }}'
```
OpsGenie alerts
```yaml
alertmanager:
  receivers:
    platformCritical:
      opsgenie_configs:
        - api_key: ${OPSGENIE_API_KEY}
          message: '{{ .CommonAnnotations.summary }}'
          priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
```
Default receiver groups
APC includes default receiver groups based on tier and severity:
| Receiver | Tier | Severity |
|---|---|---|
| `platform` | platform | all |
| `platformCritical` | platform | critical |
| `airflow` | airflow | all |
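Because routing to these groups is automatic, pairing them covers the common split between routine notifications and pages. A sketch reusing the placeholder settings from the examples above:
```yaml
alertmanager:
  receivers:
    # Routine platform-tier alerts: notify the ops mailing list.
    platform:
      email_configs:
        - smarthost: smtp.example.com:587
          from: alerts@example.com
          to: ops-team@example.com
    # Critical platform-tier alerts: page on-call through PagerDuty.
    platformCritical:
      pagerduty_configs:
        - service_key: ${PAGERDUTY_SERVICE_KEY}
```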
Custom routes
If you define a `platform`, `platformCritical`, or `airflow` receiver, you don't need a custom route to reach it; alerts are automatically routed based on the `tier` label. Use `customRoutes` only for non-default routing (for example, high-severity Deployment alerts):
```yaml
alertmanager:
  customRoutes:
    - receiver: deployment-high-receiver
      match_re:
        tier: airflow
        severity: high
    - receiver: deployment-warning-receiver
      match_re:
        tier: airflow
        severity: warning
```
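The `deployment-high-receiver` and `deployment-warning-receiver` names above must resolve to receivers you define yourself under `alertmanager.customReceiver`, covered in the next section. A sketch using Slack as a placeholder service (the webhook URL and channel names are assumptions):
```yaml
alertmanager:
  customReceiver:
    - name: deployment-high-receiver
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz  # placeholder webhook
          channel: '#deployment-high'
    - name: deployment-warning-receiver
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz  # placeholder webhook
          channel: '#deployment-warnings'
```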
Custom receivers
Use `alertmanager.customReceiver` to define receivers for notification services not covered by the built-in receiver keys. Custom receivers work alongside `customRoutes` to route alerts to those services:
```yaml
alertmanager:
  customReceiver:
    - name: sns-receiver
      sns_configs:
        - api_url: <SNS_ENDPOINT>
          topic_arn: <SNS_TOPIC_ARN>
          subject: '[Alert: {{ .GroupLabels.alertname }}]'
          sigv4:
            region: <AWS_REGION>
            role_arn: <SNS_ROLE_ARN>
  customRoutes:
    - receiver: sns-receiver
      match_re:
        tier: platform
        severity: critical
```
Apply configuration
Push receiver configuration to your installation:
```bash
helm upgrade astronomer astronomer/astronomer \
  -f values.yaml \
  --namespace astronomer
```
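After the upgrade, you can sanity-check that Alertmanager reloaded the new configuration. A sketch, assuming the default `astronomer` namespace; the `app=alertmanager` label selector is an assumption, so adjust it to your chart's labels:
```bash
# Find the Alertmanager pod (the exact name varies by release)
kubectl get pods -n astronomer | grep alertmanager

# Tail its logs and look for configuration load errors
# (app=alertmanager is an assumed label; adjust as needed)
kubectl logs -n astronomer -l app=alertmanager --tail=50
```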
Create custom alerts
Add custom alerts by setting `prometheus.additionalAlerts` in your `values.yaml`:
Platform alert example
Alert when multiple schedulers are unhealthy:
```yaml
prometheus:
  additionalAlerts:
    platform: |
      - alert: MultipleSchedulersUnhealthy
        expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 2
        for: 5m
        labels:
          tier: platform
          severity: critical
        annotations:
          summary: "{{ $value }} schedulers are not heartbeating"
          description: "More than 2 Airflow schedulers are unhealthy for over 5 minutes."
```
Deployment alert example
Alert on high task failure rate:
```yaml
prometheus:
  additionalAlerts:
    airflow: |
      - alert: HighTaskFailureRate
        expr: |
          (
            sum(increase(airflow_ti_failures{}[1h])) by (deployment)
            /
            sum(increase(airflow_ti_successes{}[1h]) + increase(airflow_ti_failures{}[1h])) by (deployment)
          ) > 0.1
        for: 15m
        labels:
          tier: airflow
          severity: warning
        annotations:
          summary: "High task failure rate in {{ $labels.deployment }}"
          description: "Task failure rate exceeds 10% for the past 15 minutes."
```
Built-in deployment alerts
For a complete list of built-in alerts, see the Prometheus alerts configmap.
| Alert | Description | Action |
|---|---|---|
| AirflowDeploymentUnhealthy | Deployment is unhealthy or unavailable for 15+ minutes | Check pod status and review logs |
| AirflowPodQuota | Deployment is using more than 95% of its pod quota for 10+ minutes | Increase Extra Capacity or optimize Dags |
| AirflowSchedulerUnhealthy | Scheduler has not heartbeated for 6+ minutes | Check scheduler logs; restart if needed |
| AirflowTasksPendingIncreasing | Tasks are queuing faster than they clear for 30+ minutes | Increase concurrency or worker resources |
Built-in platform alerts
| Alert | Description | Action |
|---|---|---|
| CriticalComponentPodCrashLooping | A core platform component pod (Houston, Commander, Grafana, Prometheus, Registry) has been restarting repeatedly for 15+ minutes | Check pod logs in the APC namespace and investigate the crash cause |
| CriticalComponentPodNotReady | A pod in the APC platform namespace has been in a non-ready state for 15+ minutes | Check pod events and logs in the APC namespace |
| TargetDown | More than 10% of Prometheus scrape targets for a job are unreachable for 10+ minutes | Check the failing service's pods and endpoints |
| ElasticSeachUnassignedShards | Elasticsearch cluster has unassigned shards for 10+ minutes | Check Elasticsearch cluster health and logs |
| ElasticDiskHighWatermarkReached | Elasticsearch node disk usage exceeds 90% for 5+ minutes | Increase Elasticsearch storage or clean up old indices |
| ElasticDiskFloodWatermarkReached | Elasticsearch node disk usage exceeds 95% for 5+ minutes; Elasticsearch enforces a read-only index block at this threshold | Immediately increase storage or delete old indices |
| IngessCertificateExpiration | A TLS certificate for a platform hostname expires in less than one week | Renew the TLS certificate |
The ElasticSeachUnassignedShards and IngessCertificateExpiration alert names contain typos in their current implementation. Use the exact names shown when creating silences or custom routes.
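For instance, a custom route that forwards both misspelled alerts to a dedicated receiver has to spell the names verbatim. A sketch; `es-oncall-receiver` is a placeholder you would define under `customReceiver`, and matching on the `alertname` label is an assumption that your routes can match any label:
```yaml
alertmanager:
  customRoutes:
    - receiver: es-oncall-receiver   # placeholder; define under customReceiver
      match_re:
        # Match the names exactly as implemented, including the misspellings.
        alertname: ElasticSeachUnassignedShards|IngessCertificateExpiration
```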
View active alerts
Alertmanager UI
Access Alertmanager to view active alerts:
https://alertmanager.<base-domain>
Prometheus UI
Query alerts in Prometheus:
https://prometheus.<base-domain>/alerts
CLI
```bash
# View firing alerts
kubectl exec -n astronomer prometheus-0 -- \
  wget -qO- localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
```
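You can also query Alertmanager's own API for the alerts it is currently handling. A sketch; the `alertmanager-0` pod name and port 9093 are assumptions based on Alertmanager defaults:
```bash
# List the alerts Alertmanager currently holds
# (pod name and port are assumptions; adjust to your installation)
kubectl exec -n astronomer alertmanager-0 -- \
  wget -qO- localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
```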
Silence alerts
Temporarily silence alerts during maintenance:
Via Alertmanager UI
- Go to https://alertmanager.<base-domain>
- Click Silences > New Silence
- Add matchers (for example, alertname=AirflowSchedulerUnhealthy)
- Set a duration and comment
- Click Create
Via API
```bash
curl -X POST https://alertmanager.<base-domain>/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [{"name": "alertname", "value": "AirflowSchedulerUnhealthy", "isRegex": false}],
    "startsAt": "2026-02-05T00:00:00Z",
    "endsAt": "2026-02-05T06:00:00Z",
    "createdBy": "admin",
    "comment": "Maintenance window"
  }'
```
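To review or remove silences through the same API (these are standard Alertmanager v2 endpoints; note the singular `silence` in the delete path):
```bash
# List current silences and their IDs
curl -s https://alertmanager.<base-domain>/api/v2/silences | jq '.[] | {id, comment, status}'

# Expire a silence early using an ID from the list above
curl -X DELETE https://alertmanager.<base-domain>/api/v2/silence/<SILENCE_ID>
```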
Best practices
- Start with built-in alerts before creating custom ones
- Set appropriate thresholds to avoid alert fatigue
- Use severity levels: reserve critical for pages
- Include runbook links in alert descriptions (see the sketch after this list)
- Test alerts in non-production environments first
- Document escalation paths for each severity level
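One common way to carry runbook links is an extra annotation on the alert rule; `runbook_url` is a widely used convention rather than an APC-specific key, and the URL here is a placeholder:
```yaml
# Fragment of an alert rule showing a runbook link annotation.
annotations:
  summary: "Scheduler is not heartbeating in {{ $labels.deployment }}"
  description: "Follow the runbook for triage steps."
  runbook_url: https://wiki.example.com/runbooks/scheduler-unhealthy  # placeholder URL
```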