Your Infrastructure Drifted. Terraform Doesn't Know Yet.
Infrastructure drift is the gap between what your Terraform state file believes exists and what is actually running in your cloud account, and that gap grows silently. A support engineer widens a security group to debug a production issue and forgets to revert it. An autoscaling group modifies instance tags. A teammate applies a hotfix directly through the console under pressure. None of these changes show up in git, and none trigger a pipeline. And that isn't even the worst part: the next time someone runs terraform apply, Terraform either destroys the undeclared resource or overwrites the manual change, depending on what was touched.
This is not hypothetical. I once traced a production outage to a NAT Gateway that Terraform recreated during a routine apply, exactly as the plan said it would. Six weeks earlier, however, a developer had manually added a secondary association to route traffic for a new microservice. No ticket. No PR. The state file had no record of it. When Terraform reconciled, the association was lost and traffic from three services dropped immediately.
Why Drift Happens
The root cause is almost never malice. It is speed, access, and incomplete processes.
- Console cowboys: Production access without guardrails invites manual changes, and engineers under pressure take the path of least resistance.
- Incomplete applies: A partially failed terraform apply leaves state and reality misaligned. Terraform records what it managed to create before the failure, which may not match what actually exists.
- Auto-remediation and autoscaling side effects: AWS Auto Scaling, RDS automated backups, and Lambda concurrency scaling all mutate infrastructure outside Terraform's control. If you are managing these resources with Terraform without ignore_changes blocks, the next plan will treat the differences as changes to revert.
- External tooling: Security scanners, config management agents, and cloud provider features (like AWS Security Hub auto-remediation) can all modify resources that Terraform owns.
How to Detect It Before It Finds You
Running terraform plan on demand is not enough. Plans are point-in-time snapshots, and by the time you run one, drift may have been accumulating for weeks.
Scheduled plan jobs. Add a CI job that runs terraform plan -detailed-exitcode on a cron schedule across every workspace. Exit code 2 means changes are pending. Alert on it. This catches detectable drift early without requiring a human to remember to look.
# In your CI pipeline (GitHub Actions, GitLab CI, etc.)
# Capture the exit code explicitly: CI shells often run with set -e,
# which would abort the step before a bare $? check ever runs.
# 0 = no changes, 1 = error, 2 = changes pending
exit_code=0
terraform plan -detailed-exitcode -out=tfplan || exit_code=$?
if [ "$exit_code" -eq 2 ]; then
  echo "Drift detected in workspace: $TF_WORKSPACE"
  # Send alert to Slack, PagerDuty, etc.
elif [ "$exit_code" -ne 0 ]; then
  echo "terraform plan failed" && exit "$exit_code"
fi
Snyk IaC (formerly driftctl). This tool compares your Terraform state against actual cloud resources and flags anything that exists in AWS but is not tracked by state. It also surfaces resources that exist in state but have been modified outside Terraform. Run it in your pipeline or as a standalone scan:
snyk iac describe --from="tfstate+s3://your-state-bucket/terraform.tfstate"
AWS Config drift detection. For teams on AWS, Config Rules can flag configuration changes against expected baselines, and CloudFormation Drift Detection does the same for any stacks you still run through CloudFormation. Neither replaces Terraform-level diffing, but they add an independent signal that is useful for compliance.
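As a concrete example, a managed Config rule is only a few lines of Terraform. This is a minimal sketch, assuming a Config configuration recorder is already enabled in the account; it flags security groups that allow unrestricted inbound SSH regardless of what the state file says:
# AWS managed rule: non-compliant when a security group allows
# unrestricted inbound SSH. Evaluates independently of Terraform state.
resource "aws_config_config_rule" "restricted_ssh" {
  name = "restricted-ssh"

  source {
    owner             = "AWS"
    source_identifier = "INCOMING_SSH_DISABLED"
  }
}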
Preventing Drift at the Source
Detection is reactive. Prevention is the goal.
- Remove human write access to production. Use Service Control Policies (SCPs) or Azure Policy to deny console-based changes to production environments for all but break-glass IAM roles (see the SCP sketch after this list). Engineers should interact with production through pipelines only. This is uncomfortable at first, and worth the friction.
- State locking and remote backends. Use S3 with DynamoDB locking or Terraform Cloud (a backend sketch follows the list). This prevents concurrent applies that leave the state file partially written.
- Enforce ignore_changes explicitly. For resources like Auto Scaling groups and ECS services where AWS manages some attributes, document the ignored fields in code with a comment (there is a sketch of this below too). Undocumented ignore_changes blocks are how drift gets institutionalised accidentally.
- Break-glass process with mandatory cleanup. If engineers need emergency console access, require a PR within 24 hours that codifies the change in Terraform. Track this with an on-call runbook entry. The emergency fix is acceptable; leaving it uncodified is not.
- Immutable infrastructure where possible. The fewer long-lived mutable resources you have, the smaller the surface area for drift. EC2 instances that get replaced rather than patched cannot drift.
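To make the SCP idea concrete, here is a rough sketch, assuming AWS Organizations with policies managed from Terraform. The policy name, role ARN patterns, action list, and OU ID are all placeholders; note that the pipeline's own role must be excluded from the deny, or your applies will fail as well:
# Deny write actions in production accounts for every principal except
# the CI role and a break-glass role (both patterns are hypothetical).
resource "aws_organizations_policy" "deny_prod_manual_writes" {
  name = "deny-prod-manual-writes"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyManualWrites"
      Effect   = "Deny"
      Action   = ["ec2:*", "rds:*", "elasticloadbalancing:*"] # scope to the services you manage
      Resource = "*"
      Condition = {
        StringNotLike = {
          "aws:PrincipalArn" = [
            "arn:aws:iam::*:role/terraform-ci-*", # pipeline role pattern
            "arn:aws:iam::*:role/break-glass-*"   # emergency access role pattern
          ]
        }
      }
    }]
  })
}

resource "aws_organizations_policy_attachment" "prod" {
  policy_id = aws_organizations_policy.deny_prod_manual_writes.id
  target_id = "ou-xxxx-xxxxxxxx" # your production OU
}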
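For state locking, a minimal backend sketch; the bucket, key, region, and table names are placeholders, and the DynamoDB table needs a string partition key named LockID:
terraform {
  backend "s3" {
    bucket         = "your-org-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks" # acquires a lock per state file on every plan and apply
  }
}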
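And for documented ignore_changes, a sketch using a hypothetical ECS service whose desired_count is owned by Application Auto Scaling; the cluster and task definition are assumed to be defined elsewhere in the module:
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2 # initial value only; autoscaling owns it after creation

  lifecycle {
    # Deliberately ignored: Application Auto Scaling adjusts desired_count
    # at runtime, and without this every plan would try to reset it.
    ignore_changes = [desired_count]
  }
}
The comment next to the ignored field is the point: the next engineer can tell the drift is intentional rather than forgotten.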
What This Looks Like at Scale
After implementing scheduled plan jobs, Snyk IaC scans, and SCP-based console restrictions for a client managing six AWS accounts:
- Drift incidents dropped from roughly 3 per month to 0 over a 60-day window.
- Mean time to detect configuration anomalies fell from days (discovered reactively) to under 4 hours (caught by scheduled jobs).
- The audit team was able to use the drift detection output directly as evidence of configuration control for their SOC 2 Type II assessment, reducing manual evidence collection by approximately 8 hours per cycle.
None of this requires expensive tooling. A scheduled CI job and disciplined access control get you most of the way there.
If your Terraform workspaces have not had a plan run against them in the past 48 hours, you are flying blind. At the very least, I highly recommend you start there.
If you want a structured review of your IaC posture and access model, book a strategy call and we can work through it together.