All applications are vulnerable to service-interrupting events: a network outage, a team member accidentally deleting a namespace, a critical bug introduced by your latest application push, or even a natural disaster. All of these are rare but undesirable events that teams running enterprise-grade software need to protect against. At Astronomer, we encourage all customers to maintain a robust, targeted, and well-tested disaster recovery (DR) plan. The doc below provides guidelines for how to:
- Backup the Astronomer Platform
- Restore the Astronomer Platform in case of an incident
Why Velero
Unlike other tools that directly access the Kubernetes etcd database to perform backups and restores, Velero uses the Kubernetes API to capture the state of cluster resources and to restore them when necessary. This API-driven approach has a number of key benefits:
- Backups can capture subsets of the cluster's resources, filtering by namespace, resource type, and/or label selector, providing a high degree of flexibility around what's backed up and restored.
- Users of managed Kubernetes offerings often do not have access to the underlying etcd database, so direct backups/restores of it are not possible.
- Resources exposed through aggregated API servers can easily be backed up and restored even if they’re stored in a separate etcd database.
Backup
To recover the Astronomer platform in the case of an incident, back up the following resources in order of priority:
- The Kubernetes cluster state and Astronomer Postgres database.
- ElasticSearch, Prometheus, and Alertmanager persistent volume claims (PVCs).
Kubernetes cluster backup
With Velero, you can back up or restore all objects in your cluster, or you can filter objects by type, namespace, and/or label. There are two types of backups:
- On-demand
- Scheduled

Each backup:
- Uploads a tarball of copied Kubernetes objects into cloud object storage.
- Calls the cloud provider API to make disk snapshots of persistent volumes, if specified.
Prerequisites
The following instructions assume you have:
- Velero installed in your cluster
- The Velero CLI
- `kubectl` access to your cluster
On-demand backup
If you need to create a backup on demand, run `velero backup create` in the Velero CLI. To skip snapshotting persistent volumes, add the `--snapshot-volumes=false` flag.
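A minimal on-demand backup might look like the following sketch. The backup name and the namespace filter are placeholders you should replace for your installation; `--snapshot-volumes=false` is only needed when you want to skip persistent volume snapshots:

```shell
# Create an on-demand backup. The name is a placeholder; the
# --include-namespaces filter is optional — omit it to back up
# every object in the cluster.
velero backup create astronomer-backup-2024-01-15 \
  --snapshot-volumes=false

# Check the status and contents of the backup:
velero backup describe astronomer-backup-2024-01-15
```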
Scheduled backup
Production environments should have scheduled backups enabled. The frequency of this backup depends on your needs and constraints. We recommend that you start with at least daily backups and adjust the frequency from there as needed. To schedule a backup for a specific time, run `velero schedule create` with a cron expression.
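As a sketch, a daily scheduled backup could be created like this. The schedule name, cron expression, and retention period are placeholders to adjust for your environment:

```shell
# Create a scheduled backup that runs daily at 01:00 UTC and keeps
# each backup for 30 days (720 hours).
velero schedule create astronomer-daily \
  --schedule "0 1 * * *" \
  --ttl 720h

# List configured schedules to confirm:
velero schedule get
```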
Database backup
You can use one of the following methods to back up the Astronomer database:
- Enable automatic backups with your cloud provider (preferred)
- Use a traditional backup tool such as pg_dump for PostgreSQL
Enable automatic backups with your cloud provider
The easiest and most reliable way to ensure the database is backed up is to enable automatic backups with your cloud provider. This creates daily backups of your Astronomer PostgreSQL database. Refer to the following cloud provider documentation for creating Postgres database backups:
- AWS: Database backup and restore in AWS
- GCP: Create automatic backups in GCP
- Azure: Azure Postgres backup and restore
Traditional backup tool (pg_dump)
To run pg_dump successfully, someone with read access to the Astronomer database needs to collect the following (stored as a Kubernetes Secret):
- Database connection string
- Username
- Password
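As an illustrative sketch, assuming the connection details live in the platform's `astronomer-bootstrap` Secret under a `connection` key (verify the secret and key names in your installation), a dump could be taken like this:

```shell
# Extract the connection string from the Kubernetes Secret.
# Secret name and key are assumptions — adjust for your installation.
DB_CONN=$(kubectl get secret astronomer-bootstrap \
  --namespace astronomer \
  -o jsonpath='{.data.connection}' | base64 --decode)

# Dump the database to a compressed custom-format archive,
# which pg_restore can later restore selectively.
pg_dump "$DB_CONN" --format=custom --file=astronomer-backup.dump
```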
Restore
In the case of an incident, you're always free to restore either:
- A single Airflow Deployment
- The whole platform (all Airflow Deployments)
Single Deployment
The steps below are valid for the Astronomer Platform on Helm 3 (Astronomer v0.14+).
Non-deleted Airflow Deployment
To restore a previous version of a deployment that has not been deleted in the Astronomer Software UI (or CLI/API) and that has been backed up with Velero, follow the steps below.
- Identify the Velero backup you intend to use by listing the available backups.
- Identify the Kubernetes namespace in question, which corresponds to your Airflow Deployment's release name with your platform's namespace (typically `astronomer`) prepended to the front. For example, the namespace for an Airflow Deployment with the release name `weightless-meteor-5042` would be `astronomer-weightless-meteor-5042`.
- Run a Velero restore scoped to that backup and namespace.
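The steps above can be sketched as follows; the backup name is a placeholder, and the namespace reuses the example release name:

```shell
# 1. List available backups and pick the one to restore from.
velero backup get

# 2. Restore only the namespace belonging to the Airflow Deployment.
#    The namespace follows the <platform-namespace>-<release-name> pattern.
velero restore create --from-backup <BACKUP-NAME> \
  --include-namespaces astronomer-weightless-meteor-5042

# 3. Watch the restore's progress:
velero restore get
```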
Deleted Airflow Deployment
To restore a single Airflow Deployment that was deleted in the Astronomer Software UI (or CLI/API), perform the previous steps to restore its Velero namespace. Once that is complete, the Astronomer database needs to be updated to mark that release as not deleted. Follow the steps below.
- Grab your database connection string (stored as a Kubernetes Secret).
- To connect to the database, launch a container into your cluster with the Postgres client, then connect using that connection string.
- Update the record for the deployment you wish to restore so that it is no longer marked as deleted.
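The steps above might look like the following sketch. The `astronomer-bootstrap` secret name and `connection` key follow this platform's conventions, and the table and column names in the final SQL statement (`"Deployment"`, `"releaseName"`, `"deletedAt"`) are assumptions about the Houston database schema — verify them against your Astronomer version before running the update:

```shell
# 1. Grab the database connection string from the bootstrap secret.
DB_CONN=$(kubectl get secret astronomer-bootstrap \
  --namespace astronomer \
  -o jsonpath='{.data.connection}' | base64 --decode)

# 2. Launch a throwaway pod with the Postgres client and connect.
#    $DB_CONN is expanded locally before being passed to the pod.
kubectl run psql-client --rm -it --restart=Never \
  --image=postgres -- psql "$DB_CONN"

# 3. Inside psql, mark the release as not deleted.
#    Table/column names are assumptions — check your schema first.
#    UPDATE "Deployment"
#    SET "deletedAt" = NULL
#    WHERE "releaseName" = 'weightless-meteor-5042';
```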
Whole platform
In case your team ever needs to migrate to new infrastructure, or your existing infrastructure is no longer accessible and you need to restore the Astronomer Platform in its entirety, including all Airflow Deployments within it, follow the steps below.
- Create a new Kubernetes cluster without Astronomer installed.
- Install Velero into the new cluster, ensuring that it can reach the previous backups in their storage location (e.g., an S3 bucket or GCS bucket).
- Set the Velero backup storage location to `readonly` to prevent accidentally overwriting any backups.
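Velero's documented way to make a backup storage location read-only is to patch the `BackupStorageLocation` resource; the location name `default` and the `velero` namespace are the usual defaults, so adjust them if yours differ:

```shell
kubectl patch backupstoragelocation default \
  --namespace velero \
  --type merge \
  --patch '{"spec":{"accessMode":"ReadOnly"}}'
```

After the restore completes, revert `accessMode` to `ReadWrite` so scheduled backups can resume writing to the location.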
- Restore database snapshots to a new Postgres database, or create a new database and restore from `pg_dump` backups.
- Perform a full Velero cluster restore.
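A full-cluster restore from an existing backup might look like this, with the backup name as a placeholder:

```shell
# Restore every resource captured in the backup into the new cluster.
velero restore create --from-backup <BACKUP-NAME>

# Monitor the restore until it completes:
velero restore get
```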
- If the database endpoint has changed (e.g., it has a new hostname), provide it to the platform:
  - AWS: Update the `astronomer-bootstrap` secret to contain the new connection string, then restart the pods in the `astronomer` namespace so they pick up the change. The `pgbouncer-config` secret in each release namespace also needs the new endpoint in its connection string.
  - GCP: Update the `pg-sqlproxy-gcloud-sqlproxy` deployment to put the new database instance name in the `instances` argument passed to the container.