The New CTO's AWS Infrastructure Audit: What to Check in Week One
Every new CTO inherits infrastructure they did not build. The first 30 days are for understanding, not changing. This guide gives you a structured three-phase audit framework — inventory, architecture quality, risk prioritisation — and a checklist you can run in week one.
The day you start as a new CTO, VP Engineering, or Head of Platform, you inherit infrastructure you did not design, built by people who may or may not still be at the company, documented at some level between “thoroughly” and “in the original engineer’s head.” Your job in the first 30 days is not to fix it. It is to understand it accurately enough to know what actually needs fixing — and in what order.
That distinction matters. CTOs who arrive and immediately start changing things before they understand the system create two problems: they make changes based on incomplete information, and they signal to the engineering team that their existing work is wrong before anyone has established trust. Both outcomes are avoidable with a structured audit first.
This guide provides that structure. It is organised into three phases — inventory, architecture quality, and risk prioritisation — with specific signals to look for at each stage. The framework is grounded in the AWS Well-Architected Framework because it provides a vendor-recognised standard your board and any future auditors or investors will understand. It ends with a checklist you can run in your first week.
The four questions every CTO should answer about inherited AWS infrastructure
Before the audit phases, get clear on what you are trying to find out. A useful framing is four questions, each with a distinct time horizon and audience:
What do we have?
A complete inventory — accounts, regions, services, whether infrastructure is defined as code. Audience: you, your engineering leads.
Is it well-configured?
An assessment of architecture quality against a known standard (the Well-Architected Framework). Audience: you, your security lead, eventually the board.
What is the priority order of risks?
A ranked findings list with severity and blast radius — what breaks first if this is exploited or fails. Audience: you, your team, sprint planning.
What does it cost, and is that number consistent with what we use?
A cost trend and attribution check — are we paying for what we use. Audience: you, CFO, board.
Why the first 30 days matter more than the next 90
Three forces make the first 30 days the best window for this kind of audit.
Political capital. You have more permission to ask naive-sounding questions in week one than you will in month six. “Walk me through how this is deployed” is a natural question for a new CTO. The same question six months in signals that you have not been paying attention. Use the window.
Baseline establishment. The value of an audit is not just the findings — it is the reference point it creates. “When I arrived, GuardDuty was not enabled in three accounts and we had 14 IAM users with active static access keys. We are down to two users pending OIDC migration.” That is a narrative your board can follow. Without the baseline, you cannot show progress.
Credibility sequencing. Engineering teams watch new leadership carefully. Arriving with questions before opinions — and publishing a structured assessment before proposing changes — signals that you are making evidence-based decisions. That credibility compounds over the next 90 days when you do start making changes.
Phase 1 — What do we have? (inventory)
Time required: 2–4 hours for a typical startup environment
Terraform state and module structure
The first question is: what does Terraform manage? Everything in state is visible, reviewable, and auditable. Everything not in state was created outside Terraform, whether by hand in the console or by another tool, and is invisible to any Terraform-based audit tooling. The gap between the two is the first meaningful risk signal you will find.
Ask for access to the Terraform repository and remote state location. Then run:
```bash
# Run from your Terraform root directory

# Count resource instances in state across all modules (data sources included)
terraform state list | wc -l

# List resource types in state by frequency, with module prefixes stripped.
# Reveals what Terraform manages (and, by omission, what it doesn't)
terraform state list | sed -E 's/^(module\.[^.]+\.)+//; s/^data\.//; s/\..*//' | sort | uniq -c | sort -rn

# Show which workspaces exist (prod, staging, dev, etc.)
terraform workspace list

# Look for local state files; with Terragrunt or multiple root modules, run from each
find . -name "*.tfstate" 2>/dev/null | head -20
```
Run in each Terraform root module directory. The state list output reveals scope; the resource type frequency reveals patterns.
What to look for: Is there one monolithic root module or per-service modules? Are there multiple workspaces (prod, staging, dev) or separate root modules per environment? Is state stored in S3 with DynamoDB locking — or locally, which means the last person to run terraform apply owns the truth?
A monolithic module that manages prod, staging, and shared networking in one state file is a blast-radius risk: a wrong terraform destroy or a corrupted state file affects everything. It is also an indicator of infrastructure that grew organically without architectural review — which means the same pattern likely extends to the AWS resources themselves.
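To answer the state-storage question without opening every file, one option is to search the repository for backend declarations. The following is a minimal sketch: the function name `list_backends` is hypothetical, and it only sees backends declared literally in `.tf` files, so a backend generated by Terragrunt or supplied through `-backend-config` will not show up. A root module that produces no match is probably using local state.

```bash
# Print every backend declaration found in .tf files under a directory,
# one "path:match" line per declaration.
# Hypothetical helper for illustration; not part of Terraform itself.
list_backends() {
  grep -rEo --include='*.tf' 'backend[[:space:]]+"[a-z0-9_]+"' "${1:-.}"
}

# Usage: list_backends path/to/terraform/repo
```

Directories that contain `.tf` files but never appear in the output are the ones to ask about first.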
AWS account structure and Organizations
Single-account AWS environments are common at early-stage companies and represent a structural risk: every service, every developer, and every CI pipeline shares the same account boundary. If one IAM credential is compromised, the blast radius is the entire account. AWS Organizations — with separate accounts for production, staging, security tooling, and billing — is the well-architected pattern for multi-environment isolation.
```bash
# List all accounts in the AWS Organization (requires Organizations access)
aws organizations list-accounts --query 'Accounts[*].{Name:Name,Id:Id,Status:Status,Email:Email}' --output table

# Find the organization root, then list the SCPs attached to it
# (repeat list-policies-for-target with each OU ID)
aws organizations list-roots --query 'Roots[*].{Id:Id,Name:Name}' --output table
aws organizations list-policies-for-target --target-id <ROOT_ID> --filter SERVICE_CONTROL_POLICY

# List trails, including shadow trails replicated from other regions
aws cloudtrail describe-trails --include-shadow-trails --query 'trailList[*].{Name:Name,Region:HomeRegion,MultiRegion:IsMultiRegionTrail,Bucket:S3BucketName}' --output table

# Confirm each trail is actually logging (describe-trails does not report this)
aws cloudtrail get-trail-status --name <TRAIL_NAME_OR_ARN> --query 'IsLogging'
```
Run from the management account. Listing accounts requires the organizations:ListAccounts permission.
Confirm that CloudTrail is enabled as a multi-region trail in every account and that logs go to a centralised S3 bucket in a dedicated logging account, not the same account as the workloads. Logs stored in the same account as the workload can be deleted by any principal in that account with delete permission on the bucket.
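The two conditions above (multi-region, centralised bucket) can be checked mechanically once the trail list is exported as tab-separated text. This is a sketch under assumptions: `flag_trails` is a hypothetical helper, the expected bucket name is something you supply, and it compares bucket names only. Confirming that the bucket actually lives in the logging account remains a manual step.

```bash
# Read "Name<TAB>IsMultiRegionTrail<TAB>S3BucketName" lines on stdin and
# report trails that are single-region or write to an unexpected bucket.
# $1 = name of the central logging bucket (an assumption you provide).
flag_trails() {
  awk -F'\t' -v want="$1" '
    $2 != "True" { print $1 ": not a multi-region trail" }
    $3 != want   { print $1 ": logs go to " $3 ", expected " want }
  '
}

# Usage (requires AWS credentials):
#   aws cloudtrail describe-trails \
#     --query 'trailList[].[Name,IsMultiRegionTrail,S3BucketName]' \
#     --output text | flag_trails <CENTRAL_LOG_BUCKET>
```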
Existing security tooling
Before you decide what security tooling to add, you need to know what is already running. This avoids duplicating spend and tells you what signals are already available if something goes wrong today.
```bash
# GuardDuty: check whether a detector exists in this region
aws guardduty list-detectors

# If a detector ID is returned, check its status
aws guardduty get-detector --detector-id <DETECTOR_ID> --query '{Status:Status,UpdatedAt:UpdatedAt,FindingPublishingFrequency:FindingPublishingFrequency}'

# Security Hub: check whether it is configured
aws securityhub describe-hub 2>/dev/null && echo "Security Hub: enabled" || echo "Security Hub: not configured in this account/region"

# AWS Config: check whether the configuration recorder is recording
aws configservice describe-configuration-recorder-status --query 'ConfigurationRecordersStatus[*].{Name:name,Recording:recording,LastStatus:lastStatus}'
```
Run in each account and each region that has workloads. GuardDuty and Config must be enabled per-region.
The minimum acceptable baseline: CloudTrail enabled everywhere, GuardDuty enabled in every region with workloads, AWS Config recording. Security Hub centralises findings from GuardDuty and Config and is low-cost to enable. If none of these are running, you have no runtime signal for active credential abuse or anomalous API activity, which means any breach that occurred before you arrived may be unknown.
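Because GuardDuty is regional, the per-region check is easier to run as a loop. The sketch below separates collection from filtering so the filter can be reused on saved output. Both function names are hypothetical, and it assumes the AWS CLI prints `None` in text mode when no detector exists in a region.

```bash
# Emit "<region><TAB><detector id or None>" for every region enabled
# in the account. Requires AWS credentials; nothing here runs on its own.
collect_detectors() {
  for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
    id=$(aws guardduty list-detectors --region "$region" --query 'DetectorIds[0]' --output text)
    printf '%s\t%s\n' "$region" "$id"
  done
}

# Keep only the regions where no detector was returned.
regions_without_detector() {
  awk -F'\t' '$2 == "None" || $2 == "" { print $1 }'
}

# Usage: collect_detectors | regions_without_detector
```

Compare the resulting list against the regions that actually host workloads; a missing detector in an unused region is a lower-priority finding than one in production.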
Phase 2 — Is it well-configured? (architecture review)
Time required: 1–3 days depending on infrastructure complexity
Phase 1 tells you what exists. Phase 2 tells you whether what exists is configured correctly for the workload it serves. The standard for “correctly” here is the AWS Well-Architected Framework — specifically the four pillars most relevant to an operational audit: Security, Reliability, Cost Optimization, and Operational Excellence.
For Terraform-managed infrastructure, a linter (Checkov, tfsec, or Trivy) gives you a baseline misconfiguration count quickly. Run one against the Terraform repository and note the finding count and the proportion of High findings. This is your linter baseline: useful, but incomplete. Linters check individual resource configurations; they do not evaluate blast radius, cross-service dependencies, or whether the architecture is appropriate for a production workload. For that, you need an architectural review against the Well-Architected pillars, covered in detail in the Terraform Architecture Review guide.
Security: IAM, network, encryption
IAM is the highest-priority area in the Security pillar because it determines the blast radius of every other security failure. The three highest-signal checks in an inherited environment:
1. IAM users with static access keys: count them. Each is a long-lived credential that does not expire. The question is not whether they exist; the question is whether each one has a justification and a rotation policy.
2. Roles with wildcard actions or Resource: "*": search for Action: "*" and Resource: "*" in aws_iam_role_policy and aws_iam_user_policy resources. These represent maximum-blast-radius permissions that linters may not fully catch.
3. EC2 instances without IMDSv2 enforcement: any EC2 instance without http_tokens = "required" in its metadata_options block is an SSRF risk. This is a well-known pattern; see the IAM breach pattern analysis for the incident context.
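The wildcard search in the second check can be run as a single recursive grep over the Terraform repository. This is a rough sketch rather than a complete policy parser: `find_wildcard_policies` is a hypothetical name, and the pattern only catches a bare `"*"` written on the same line as the key, in either `jsonencode` style (`Action = "*"`), raw JSON style (`"Action": "*"`), or `aws_iam_policy_document` style (`actions = ["*"]`). Wildcards inside multi-line lists, or partial wildcards such as `"s3:*"`, need a separate pass.

```bash
# List file:line:text for every full-wildcard Action or Resource
# found in .tf files under the given directory (default: current).
find_wildcard_policies() {
  grep -rnE --include='*.tf' \
    '"?(Action|Resource|actions|resources)"?[[:space:]]*(=|:)[[:space:]]*\[?[[:space:]]*"\*"' \
    "${1:-.}"
}

# Usage: find_wildcard_policies path/to/terraform/repo
```

Each hit is a prompt for a question, not an automatic finding: some `Resource = "*"` statements are unavoidable for actions that do not support resource-level permissions.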
For S3: check that every bucket has server-side encryption configured and public access blocked at the bucket level. The Checkov rule CKV_AWS_19 covers encryption, but a Well-Architected review of the Security pillar typically goes further and asks whether buckets containing sensitive data use a customer-managed KMS key. For network: security groups with 0.0.0.0/0 ingress on ports 22 or 3389 are an immediate finding regardless of context.
Reliability: multi-AZ, backup, blast radius
The Reliability pillar asks: what happens when a component fails? The minimum standard for a production environment is that no single component failure should result in complete service unavailability. In Terraform terms, the highest-signal reliability checks are:
1. RDS instances: multi_az = true for any database serving production traffic. A single-AZ RDS instance has a maintenance window that causes downtime and no automatic failover if the underlying hardware fails.
2. RDS backup_retention_period: should be at least 7 days for production. A value of 0 means automated backups are disabled.
3. Auto Scaling Groups: min_size ≥ 2 for production-serving ASGs, across at least two availability zones. A single EC2 instance serving prod traffic has no failover.
4. ALB: subnets list should include at least two AZs. An ALB configured in one AZ fails if that AZ has an AWS-side event.
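The two RDS checks can also be run against the live account, which catches databases that are not in Terraform state. A minimal sketch, assuming tab-separated output from the AWS CLI where booleans print as `True`/`False`; `flag_rds` is a hypothetical helper, and the 7-day threshold is the one stated above. Aurora cluster members may report Multi-AZ differently from standalone instances, so treat results for those with care.

```bash
# Read "DBInstanceIdentifier<TAB>MultiAZ<TAB>BackupRetentionPeriod" lines
# on stdin and report instances that fail either reliability check.
flag_rds() {
  awk -F'\t' '
    $2 != "True" { print $1 ": single-AZ (multi_az is not true)" }
    $3 + 0 < 7   { print $1 ": backup retention " $3 + 0 " days (below 7)" }
  '
}

# Usage (requires AWS credentials, run per region):
#   aws rds describe-db-instances \
#     --query 'DBInstances[].[DBInstanceIdentifier,MultiAZ,BackupRetentionPeriod]' \
#     --output text | flag_rds
```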
Cost: untagged resources, oversized instances, idle services
An inherited AWS environment often has cost visibility problems before it has cost efficiency problems. If resources are not tagged with environment, service, and team, you cannot attribute spend — which means you cannot have a productive conversation about what to cut.
Open the AWS Cost Explorer and look at the 90-day cost trend. Three questions:
1. Is the trend flat, growing proportionally with usage, or growing faster than usage? Faster-than-usage growth typically means infrastructure is being added without corresponding review.
2. What are the top five services by cost? EC2, RDS, and data transfer are common large lines, but NAT Gateway often appears unexpectedly high when S3 and DynamoDB traffic is routed through it; VPC endpoints are cheaper for most of that traffic.
3. How much spend is untagged? Open the Tag Editor and look for EC2 instances, RDS instances, and S3 buckets (aws_instance, aws_db_instance, and aws_s3_bucket in Terraform) without Name and Environment tags. Untagged spend cannot be attributed and will make future cost discussions impossible.
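The untagged-spend question can be turned into a single percentage if Cost Explorer is grouped by a tag key. The sketch below is illustrative: `untagged_share` is a hypothetical helper, it assumes the tag (here `Environment`) has been activated as a cost allocation tag, and it assumes Cost Explorer reports untagged spend under the key `Environment$` with an empty value after the dollar sign.

```bash
# Read "<TagKey>$<TagValue><TAB><amount>" lines on stdin and print the
# percentage of total spend that carries no value for the tag.
# $1 = tag key (defaults to Environment).
untagged_share() {
  awk -F'\t' -v key="${1:-Environment}" '
    { total += $2 }
    $1 == (key "$") { untagged += $2 }
    END {
      if (total > 0) printf "%.1f\n", 100 * untagged / total
      else print "0.0"
    }
  '
}

# Usage (requires AWS credentials; fill in the dates):
#   aws ce get-cost-and-usage \
#     --time-period Start=<START_DATE>,End=<END_DATE> \
#     --granularity MONTHLY --metrics UnblendedCost \
#     --group-by Type=TAG,Key=Environment \
#     --query 'ResultsByTime[].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
#     --output text | untagged_share Environment
```

Recording this number at the start gives you a cost-attribution baseline to report against later.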
Operations: observability, alerting, runbooks
The Operational Excellence pillar asks: can the team operate this system at midnight when something goes wrong? The signals are:
1. CloudWatch log groups: do they exist for every Lambda, ECS task, and application? Are retention periods set, or are logs retained indefinitely (creating growing cost and compliance exposure)?
2. Alarms: are there CloudWatch alarms on error rates, latency, CPU, and custom business metrics? Who receives them: a PagerDuty integration, an SNS topic, an email address that a former employee owned?
3. Runbooks: ask the on-call engineer "what do you do when the API latency alarm fires?" If the answer is "we ssh in and look around," that is a process finding, not an infrastructure finding.
4. Deployment process: how does code go from a merged PR to production? Is it automated (CI/CD), semi-automated (manual approval gate), or manual? Manual deployments to production are a reliability and audit risk.
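The log-retention check in the first item is quick to automate. A minimal sketch, assuming the AWS CLI prints `None` in text mode for log groups that have no retention period set; `log_groups_without_retention` is a hypothetical name.

```bash
# Read "logGroupName<TAB>retentionInDays" lines on stdin and print the
# names of log groups that keep logs indefinitely.
log_groups_without_retention() {
  awk -F'\t' '$2 == "None" || $2 == "" { print $1 }'
}

# Usage (requires AWS credentials, run per region):
#   aws logs describe-log-groups \
#     --query 'logGroups[].[logGroupName,retentionInDays]' \
#     --output text | log_groups_without_retention
```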
Phase 3 — What’s the risk? (prioritised findings)
Time required: 2–4 hours
Phase 2 gives you a list of findings. Phase 3 orders them by priority so your engineering team can act on them without asking “what should we fix first?” every sprint.
Prioritisation in an AWS infrastructure context has two axes: severity (what is the potential impact if this is exploited or fails) and likelihood (how likely is it that this specific finding leads to an incident in the next 90 days). High severity and high likelihood get fixed in the next sprint. Low severity and low likelihood go on the backlog.
A useful heuristic for the severity axis: think in terms of blast radius. An IAM role that, if compromised, gives an attacker access to one Lambda function is less severe than a role that gives access to all S3 buckets in the account. Both might pass a linter check. The architectural review tells you the blast radius; the linter does not.
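One simple way to turn the two axes into an ordering is to score each finding and sort. The sketch below is only one possible scheme: the 1 to 3 scale and the multiplication are assumptions made for illustration, not something the Well-Architected Framework prescribes, and `rank_findings` is a hypothetical helper.

```bash
# Read "finding<TAB>severity<TAB>likelihood" lines on stdin (both scored
# 1-3, an assumed scale) and print "score<TAB>finding", highest first.
rank_findings() {
  awk -F'\t' '{ printf "%d\t%s\n", $2 * $3, $1 }' | sort -rn
}

# Usage:
#   printf 'wildcard IAM role\t3\t3\nlog retention unset\t1\t2\n' | rank_findings
```

The scores are a conversation aid for sprint planning; a finding with a very large blast radius may still deserve to jump the queue regardless of its number.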
What to put in your board report
Your first board report on inherited AWS infrastructure should be readable in five minutes by a non-technical board member. Keep the technical detail in an appendix. The body should contain four sections:
1. What we have
Headline inventory: number of AWS accounts, regions in use, approximate managed resource count, monthly cost as of the audit date, and whether infrastructure is primarily defined as code. This establishes the scope for anyone reading the report later.
2. What the risk picture looks like
Three to five prioritised findings with severity (High / Medium / Low) and business impact described in plain language. "Four IAM roles have wildcard action permissions — if any of the services using these roles is compromised, the attacker gains access to all S3 buckets in the production account" is a business statement. "CKV_AWS_40 fail" is not. Boards fund remediation based on the business statement.
3. What the plan is
A 90-day roadmap with milestones corresponding to risk reduction, not just technical tasks. "Eliminate all static IAM access keys and replace with OIDC federation by end of Q3" is a milestone. "Sprint 8: refactor CI" is a task. The board tracks milestones.
4. What we need
Any decisions or approvals the board needs to unblock the plan: headcount to execute the remediation work, budget for tooling, a decision to adopt a multi-account structure that requires the CFO to approve a new billing organisation. Be specific about what you are asking for and by when.
The week-one audit checklist
Use this as a reference alongside your conversations with the team. Not every item will apply to every environment — the goal is structured coverage, not mechanical completion.
Phase 1: Inventory
Terraform state
AWS account structure
Security tooling
Phase 2: Architecture quality
Security
Reliability
Cost
Operations
Phase 3: Risk prioritisation guidance
For the full six-step Terraform architecture review process — including how to evaluate each Well-Architected pillar in depth and produce a stakeholder-ready report — see the Terraform Architecture Review: A Complete Guide.
Frequently asked questions
What should a new CTO check in their first week on AWS?
The highest-signal activities in the first week are: understand the Terraform state (what is managed as code and what is not), map the AWS account structure (single account or Organizations, what boundaries exist), verify that core security controls are enabled (CloudTrail, GuardDuty, AWS Config), understand the IAM patterns (how developers and services access AWS), and check the monthly cost trend for the past 90 days. These five areas give you an accurate picture of infrastructure maturity before you form any opinions about what to change.
How long does an AWS infrastructure audit take?
A surface-level inventory (account structure, Terraform state, cost trend, and basic security control status) can be completed in two to four hours for a typical startup environment. A full architecture review against the AWS Well-Architected Framework typically takes one to three days depending on infrastructure complexity and whether Terraform is the authoritative source of truth. Automated tooling can compress the architecture review phase by processing Terraform state and flagging patterns that need human review, letting you focus your time on the judgement calls rather than the scanning.
What is the AWS Well-Architected Framework and why does it matter for a CTO audit?
The AWS Well-Architected Framework (WAF) is Amazon's structured set of best practices for cloud architecture, covering six pillars: Security, Reliability, Cost Optimization, Operational Excellence, Performance Efficiency, and Sustainability. For a new CTO audit, it matters because it provides a vendor-recognised, widely understood standard. Using the WAF as your audit framework means your findings can be communicated clearly to your board, your engineering team, and any external reviewers — investors, auditors, compliance assessors — because the standard is public and well understood.
What should go in a board report about inherited AWS infrastructure?
A first board report on inherited AWS infrastructure should contain four sections: (1) What we have — headline inventory numbers: account count, regions, monthly cost, whether infrastructure is defined as code; (2) What the risk picture looks like — three to five prioritised findings with severity and business impact in plain language, not just technical descriptions; (3) What the plan is — a 90-day roadmap with milestones corresponding to risk reduction; and (4) What decisions are needed from the board — budget, headcount, vendor relationship changes. Keep technical detail in an appendix; the report body should be readable by a non-technical board member in five minutes.
How do I audit Terraform in an inherited AWS environment?
Start by running `terraform state list` in each root module directory to understand what Terraform manages. Resources not in state are managed manually and represent invisible risk. Then look at the module structure, state storage location, and workspace configuration. Run a static analysis tool (Checkov, tfsec, or Trivy) to get a baseline misconfiguration count. Then conduct an architectural review — either manually against the AWS Well-Architected Framework or using a purpose-built tool — to evaluate patterns that linters do not catch: blast radius, cross-service dependencies, and whether the architecture is appropriate for a production workload.
What is the difference between a Terraform security scan and a Well-Architected review?
A security scan (Checkov, tfsec, Trivy) checks Terraform configuration against deterministic rules: is IMDSv2 enabled, is encryption configured, are public access blocks set. It answers 'is this configuration valid?' quickly and consistently. A Well-Architected review evaluates whether the overall architecture is sound for the workload: is the IAM design appropriate for the blast radius if one service is compromised, does the RDS configuration match the reliability requirements of a production database, is the observability stack sufficient to detect an incident before customers do. Both are necessary and complementary; neither replaces the other.