
Configure automated cleanup jobs to maintain database health by removing old data. Astro Private Cloud (APC) includes several cleanup jobs that run as CronJobs on configurable schedules to manage storage growth and query performance.

Cleanup jobs summary

| Job | Default schedule | Default retention | Purpose |
|---|---|---|---|
| cleanupDeployments | Daily @ 00:00 | 14 days | Removes soft-deleted deployments |
| cleanupDeployRevisions | Daily @ 23:11 | 90 days | Archives deploy history |
| cleanupTaskUsageData | Daily @ 23:40 | 90 days | Purges task metrics |
| cleanupClusterAudits | Daily @ 23:49 | 90 days | Removes cluster audit logs |
| cleanupAirflowDb | Daily @ 05:23 | 365 days | Cleans Airflow metadata (disabled by default) |

cleanupDeployments

Permanently removes deployments that have been soft-deleted after the retention period.

What gets cleaned

  • Deployment database records marked with deletedAt
  • Associated Docker registry images
  • Deployment metadata database

Configuration

houston:
  cleanupDeployments:
    enabled: true
    schedule: "0 0 * * *"      # Midnight daily
    olderThan: 14              # Days since deletion
    dryRun: false              # Set true to preview

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:
kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-deployments --older-than=14 --dry-run=false
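Before running the real deletion, it can help to do a dry-run pass first. The sketch below only assembles and prints the commands for review rather than executing them; the `astronomer` namespace and release name are placeholders for your installation:

```shell
# Sketch: build the cleanup command for a preview (dry-run) pass and then a
# real pass. Nothing is executed here -- the commands are only printed, so
# you can review the dry-run output before deleting anything.
build_cleanup_cmd() {
  ns="$1"; release="$2"; days="$3"; dry="$4"
  echo "kubectl -n ${ns} exec -it deploy/${release}-houston --" \
       "yarn cleanup-deployments --older-than=${days} --dry-run=${dry}"
}

build_cleanup_cmd astronomer astronomer 14 true    # preview what would be removed
build_cleanup_cmd astronomer astronomer 14 false   # perform the deletion
```

Paste the printed command into a shell with cluster access once the dry-run output looks right.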

cleanupDeployRevisions

Removes old deployment revision records to reduce database size.

What gets cleaned

  • deployRevision records older than retention period
  • Historical deployment configuration snapshots

Configuration

houston:
  cleanupDeployRevisions:
    enabled: true
    schedule: "11 23 * * *"    # 23:11 daily
    olderThan: 90              # Days to retain

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:
kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-deploy-revisions --older-than=90

Per-deployment cleanup

Run this command from a machine with access to the underlying Kubernetes cluster to clean revisions for a specific deployment:
kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-deploy-revisions --older-than=90 --deploymentUuid=<uuid>

cleanupTaskUsageData

Purges task usage metrics and audit logs.

What gets cleaned

  • TaskUsage records (daily aggregated metrics)
  • TaskUsageAuditLog records (raw task data)

Configuration

houston:
  cleanupTaskUsageData:
    enabled: true
    schedule: "40 23 * * *"    # 23:40 daily
    olderThan: 90              # Minimum 90 days
    dryRun: false

Minimum retention is 90 days and can’t be reduced.

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:
kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-task-usage-data --older-than=90 --dry-run=false

GraphQL trigger

query {
  cleanupTaskUsageDataJob(olderThan: 90)
}
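If you prefer HTTP over `kubectl exec`, the same query can be sent to the Houston GraphQL API. The endpoint path and bearer-token auth below are assumptions; check how your installation exposes Houston before using this:

```shell
# Sketch: trigger the cleanup via Houston's GraphQL API. The URL and auth
# header are assumptions for illustration, so the curl call is left commented.
PAYLOAD='{"query":"query { cleanupTaskUsageDataJob(olderThan: 90) }"}'
echo "$PAYLOAD"   # inspect the request body before sending

# curl -s -X POST "https://houston.<base-domain>/v1" \
#   -H "Authorization: Bearer ${HOUSTON_TOKEN}" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```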

cleanupClusterAudits

Removes cluster audit log entries.

What gets cleaned

  • ClusterAudit records tracking cluster configuration changes
  • Historical cluster state snapshots

Configuration

houston:
  cleanupClusterAudits:
    enabled: true
    schedule: "49 23 * * *"    # 23:49 daily
    olderThan: 90              # Days to retain

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:
kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-cluster-audit --older-than=90

Filter by cluster

Run this command from a machine with access to the underlying Kubernetes cluster to clean audits for specific clusters:
kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-cluster-audit --older-than=90 --cluster-ids=<id1>,<id2>

cleanupAirflowDb

Cleans Airflow metadata from individual Deployment databases.
This job is disabled by default due to potential impact on running Deployments.

What gets cleaned

Default tables:
  • callback_request - Task callback requests
  • celery_taskmeta, celery_tasksetmeta - Celery metadata
  • dag - Dag definitions
  • dag_run - Dag execution history
  • dataset_event - Dataset events
  • import_error - Import errors
  • job - Job records
  • log - Task execution logs
  • session - Session data
  • sla_miss - SLA violations
  • task_fail - Task failures
  • task_instance - Task execution records
  • task_reschedule - Reschedule events
  • trigger - Trigger records
  • xcom - Cross-communication data

Configuration

houston:
  cleanupAirflowDb:
    enabled: false             # Must explicitly enable
    schedule: "23 5 * * *"     # 05:23 daily
    olderThan: 365             # Days to retain
    outputPath: "/tmp"         # Archive location
    dropArchives: true         # Delete after archiving
    dryRun: false
    provider: local            # Storage: local/aws/azure/gcp
    bucketName: "/tmp"         # Cloud bucket or local path
    tables: ""                 # Specific tables (empty = all)

Cloud storage export

Export archived data to cloud storage:
houston:
  cleanupAirflowDb:
    enabled: true
    provider: aws              # aws, azure, or gcp
    bucketName: "my-archive-bucket"
    providerEnvSecretName: "aws-credentials-secret"

Specific tables only

Clean only specific tables:
houston:
  cleanupAirflowDb:
    enabled: true
    tables: "log,task_instance,xcom"

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:
kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-airflow-db-data \
  --older-than=365 \
  --provider=local \
  --bucket-name=/tmp \
  --tables="log,task_instance"

Schedule reference

Default schedules are staggered to avoid simultaneous execution:
| Time | Job |
|---|---|
| 00:00 | cleanupDeployments |
| 05:23 | cleanupAirflowDb |
| 23:11 | cleanupDeployRevisions |
| 23:40 | cleanupTaskUsageData |
| 23:49 | cleanupClusterAudits |
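Each default schedule is a five-field cron expression whose first two fields are minute and hour. This small loop prints every default as a wall-clock time to confirm the stagger:

```shell
# Decode minute/hour from each default cron schedule into HH:MM.
set -f   # keep the "*" fields literal (no glob expansion)
for sched in "0 0 * * *" "23 5 * * *" "11 23 * * *" "40 23 * * *" "49 23 * * *"; do
  set -- $sched                    # split into the five cron fields
  printf '%02d:%02d\n' "$2" "$1"   # hour:minute
done
```

The printed times match the Time column above.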

Common configuration options

All cleanup jobs share these options:
houston:
  cleanup<JobName>:
    enabled: true/false        # Enable/disable the job
    schedule: "cron-expression"  # When to run
    olderThan: <days>          # Retention period
    dryRun: false              # Preview without deleting
    readinessProbe: {}         # Optional health probes
    livenessProbe: {}

Kubernetes CronJob behavior

All cleanup CronJobs use:
  • Concurrency policy: Forbid (prevents overlapping runs)
  • Backoff limit: 1 retry on failure
  • Restart policy: Never
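In Kubernetes terms, these settings correspond to a CronJob spec along the following lines. This is a hedged sketch of the shape, not the exact manifest the chart renders; the name, image, and args are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: astronomer-houston-cleanup-deployments   # illustrative name
spec:
  schedule: "0 0 * * *"
  concurrencyPolicy: Forbid      # prevents overlapping runs
  jobTemplate:
    spec:
      backoffLimit: 1            # one retry on failure
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: houston     # illustrative; the chart sets the real image
              args: ["yarn", "cleanup-deployments"]
```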

Monitor cleanup jobs

Check job status

# List all cleanup CronJobs
kubectl get cronjobs -n astronomer | grep cleanup

# View recent job runs
kubectl get jobs -n astronomer | grep cleanup

# Check job logs
kubectl logs job/<job-name> -n astronomer

# Trigger an individual job manually from its CronJob
kubectl create job --from=cronjob/<cronjob-name> <cronjob-name>-manual -n astronomer

Verify data cleanup

-- Check remaining records by date
SELECT DATE(created_at), COUNT(*)
FROM deploy_revision
GROUP BY DATE(created_at)
ORDER BY DATE(created_at) DESC;

Troubleshooting

Job not running

  1. Check CronJob exists:
    kubectl get cronjob houston-cleanup-deployments -n astronomer
    
  2. Check job is enabled in Helm values
  3. Verify schedule syntax is valid cron expression
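For step 3, a quick sanity check is to count the fields in a custom schedule: Kubernetes cron expressions have exactly five (minute, hour, day-of-month, month, day-of-week). A minimal sketch that checks field count only, not field ranges:

```shell
# Report whether a schedule string has the five fields cron expects.
check_schedule() {
  set -f                 # keep "*" literal (no glob expansion)
  set -- $1              # split the schedule into fields
  if [ "$#" -eq 5 ]; then
    echo "ok: 5 fields"
  else
    echo "invalid: $# fields"
  fi
}

check_schedule "30 2 * * 0"   # ok
check_schedule "30 2 * *"     # invalid: one field short
```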

Job failing

  1. Check job logs:
    kubectl logs job/houston-cleanup-deployments-<timestamp> -n astronomer
    
  2. Database connectivity: Ensure Houston can reach the database
  3. Permissions: Verify service account has required database permissions

Data not being cleaned

  1. Check retention period: Data younger than olderThan won’t be deleted
  2. Verify timestamps: Check createdAt/deletedAt values in database
  3. Run with dry-run: Preview what would be deleted

Best practices

  1. Monitor database size before and after cleanup jobs
  2. Start with dry-run when adjusting retention periods
  3. Stagger schedules if adding custom cleanup jobs
  4. Archive before delete for cleanupAirflowDb in production
  5. Set alerts for failed cleanup jobs
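For practice 5, if your monitoring stack includes Prometheus with kube-state-metrics, an alert rule along these lines can catch failed cleanup runs. The `kube_job_status_failed` metric is standard kube-state-metrics; the namespace, label matcher, and threshold are assumptions to adapt to your installation:

```yaml
groups:
  - name: cleanup-jobs
    rules:
      - alert: CleanupJobFailed
        # The job_name matcher assumes cleanup jobs keep "cleanup" in their
        # names, as in the CronJobs listed earlier.
        expr: kube_job_status_failed{namespace="astronomer", job_name=~".*cleanup.*"} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cleanup job {{ $labels.job_name }} failed"
```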