Commercial uptime monitoring is a solved problem — Datadog, Pingdom, UptimeRobot, Better Stack. They all work. They also all cost $20–50/month by the time you add enough endpoints and a public status page, which is absurd for a personal setup where I just want to know if my portfolio site, a few dev-tool registries, and the cloud services I depend on are actually reachable. This project is the AWS-native alternative: one CloudWatch Synthetics canary, seven endpoints, a Terraform stack that deploys end-to-end, and a public status page at status.markandrewmarquez.com — all for about the price of a cup of coffee per month.
01 // The Challenge
The brief I set for myself: monitor the services I actually depend on, alert me when they break, and publish uptime history somewhere public — all without the monthly SaaS bill and without writing a single thing I couldn't rebuild from code. The hard constraint was budget: $1–2/month, total. That constraint drove almost every architectural decision that followed.
Two things had to be true for this to work. First, every resource needed to be defined in Terraform so the whole stack could be destroyed and re-applied without ceremony. Second, the monitoring architecture had to collapse cost aggressively — commercial tools charge per monitor, and AWS Synthetics charges per canary run, so the obvious first instinct of "one canary per endpoint" would blow the budget before the first alarm ever fired.
02 // Architecture & Stack
The entire stack lives in a single AWS region (us-east-1, because CloudFront requires its ACM certificate there anyway) and deploys with terraform apply:
- Canary — CloudWatch Synthetics, runtime syn-nodejs-puppeteer-13.1 (Node.js 22, Puppeteer 24). One canary, seven executeHttpStep() calls per run.
- Scheduler — rate(30 minutes). Tight enough to catch real outages, loose enough to stay inside the free tier's edges.
- Alerts — CloudWatch Alarms → SNS topic → email. One alarm per endpoint, bound to its step's SuccessPercent metric.
- Status page generator — Python Lambda queries CloudWatch via GetMetricData, writes status.json to S3 on a schedule.
- Status page frontend — static HTML/CSS/JS on S3, fronted by CloudFront with an ACM cert, at status.markandrewmarquez.com.
- IaC — Terraform 1.5+, AWS provider 5.x, archive provider 2.x for zipping the canary script.
- DNS — GoDaddy CNAME pointing at the CloudFront distribution (the one manual step in the whole build).
03 // Architecture Overview
┌──────────────────────────────┐
│ CloudWatch Synthetics Canary │
│     (1 canary · 30 min)      │
└──────────────┬───────────────┘
               │ executeHttpStep × 7
      ┌────────┴─────┬────────────┐
      ▼              ▼            ▼
portfolio-site github-status ms-graph-api …
      │              │            │
      └──────────────┼────────────┘
                     ▼
     SuccessPercent metric (per step)
                     │
          ┌──────────┴──────────────┐
          ▼                         ▼
  CloudWatch Alarms      Lambda (5 min schedule)
          │                         │
          ▼                         ▼
      SNS topic             status.json on S3
          │                         │
          ▼                         ▼
     Email inbox             CloudFront + ACM
                                    │
                                    ▼
                       status.markandrewmarquez.com
04 // The Single-Canary Trick
This is the piece of the architecture I'm most happy with. Synthetics bills per canary run, not per step. A canary that runs every 30 minutes costs roughly the same whether it checks one endpoint or ten. The catch is that a naive implementation — a single fetch() loop with a pass/fail return — gives you one metric for the whole canary, so you can't tell which endpoint broke.
The fix is executeHttpStep(). Each call emits its own SuccessPercent metric in CloudWatch, dimensioned by CanaryName and StepName. So one canary, seven steps, seven independent metrics, seven independent alarms — for the price of one canary run. The script iterates over the monitors list and calls executeHttpStep once per endpoint:
// Rendered from index.js.tftpl: Terraform's jsonencode() inlines the endpoint list.
const synthetics = require('Synthetics');

const monitors = ${jsonencode(monitors)};

exports.handler = async () => {
  for (const monitor of monitors) {
    const url = new URL(monitor.url);
    const requestOptions = {
      hostname: url.hostname,
      path: url.pathname + url.search,
      port: url.port || (url.protocol === 'https:' ? 443 : 80),
      protocol: url.protocol,
      method: 'GET',
      headers: { 'User-Agent': 'Amazon-CloudWatch-Synthetics' },
    };
    const stepConfig = {
      includeRequestHeaders: false,
      includeResponseHeaders: false,
      includeRequestBody: false,
      includeResponseBody: monitor.type === 'api', // body needed for API validation
      continueOnStepFailure: true,                 // keep checking later endpoints
    };
    // Each step emits its own SuccessPercent metric, dimensioned by StepName.
    await synthetics.executeHttpStep(
      monitor.name,
      requestOptions,
      validateResponse(monitor), // per-endpoint validation callback (sketched below)
      stepConfig
    );
  }
};
Two subtleties I only found after breaking things:
- continueOnStepFailure: true is non-negotiable. Without it, the first failing endpoint aborts the whole canary run and every endpoint after it gets no metric — which means the remaining alarms would silently flip to "insufficient data" whenever the first endpoint was actually down. You want the canary to keep going through the list even when something is broken.
- includeResponseBody has to be true for API checks. Otherwise response.body is empty inside the validation callback (sketched just below) and you can't do JSON validation on things like the GitHub Status or Azure DevOps health endpoints. For website-type checks it stays false to avoid hauling HTML back through CloudWatch.
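The validateResponse(monitor) helper the handler calls isn't shown above. Here's a minimal sketch of what it can look like, assuming the callback receives the raw Node.js HTTP response the way AWS's executeHttpStep samples do; the per-endpoint rules are collapsed into a generic body check, so this is an illustration rather than the repo's exact code:

// Sketch: build the per-step validation callback for executeHttpStep.
function validateResponse(monitor) {
  return async (res) =>
    new Promise((resolve, reject) => {
      // Any non-2xx status fails the step for both check types.
      if (res.statusCode < 200 || res.statusCode > 299) {
        reject(new Error(monitor.name + ': HTTP ' + res.statusCode));
        return;
      }
      if (monitor.type !== 'api') {
        resolve(); // website check: a 2xx is all we need
        return;
      }
      // API check: the body only arrives when includeResponseBody is true.
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        // Simplified stand-in for the real per-endpoint rules
        // (status.indicator for GitHub, the microsoft.graph namespace
        // for Graph metadata, a status field for Azure DevOps).
        if (body.length === 0) {
          reject(new Error(monitor.name + ': empty response body'));
        } else {
          resolve();
        }
      });
    });
}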
05 // Endpoint Selection
Seven endpoints, chosen to reflect the services I actually touch in a normal week — not a synthetic "demo" list:
- portfolio-site — markandrewmarquez.com. Website check — simple HTTP 200.
- github-status — www.githubstatus.com/api/v2/status.json. API check — validates status.indicator exists in the JSON.
- ms-graph-api — graph.microsoft.com/v1.0/$metadata. API check — validates the XML contains the microsoft.graph namespace. The $metadata endpoint is unauthenticated and returns the service schema, so it's a clean liveness signal for the Graph API without needing a tenant token.
- azure-devops-status — status.dev.azure.com/_apis/status/health. API check — validates the JSON has a status field. Azure's main status page (azure.status.microsoft) is client-side rendered with no public JSON, so Azure DevOps is the closest public health API I could wire up.
- docker-hub — hub.docker.com. Website check. If this goes down, half my CI does too.
- ollama-registry — registry.ollama.ai. Website check. Feeds the local Ollama install on my AI assistant PC.
- m365-portal — www.office.com. Website check. The tenant-level M365 Service Health API requires ServiceHealth.Read.All, so the front-door HTTP check is the right proxy for "is M365 reachable from the public internet".
The endpoint list is a Terraform variable. Adding or removing a monitored service is a one-line change and a terraform apply — the script template re-renders, the zip re-packages, and for_each creates or destroys the matching CloudWatch alarm automatically.
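For concreteness, a sketch of what that variable can look like. The monitors name and the name/url/type fields are the ones the canary script and alarm block actually reference; the default values here are illustrative, not the repo's exact file:

variable "monitors" {
  description = "Endpoints to check: one canary step and one alarm per entry."
  type = list(object({
    name = string # becomes the StepName dimension and the alarm key
    url  = string # full URL, scheme included
    type = string # "website" (status-code check) or "api" (body validation)
  }))
  default = [
    { name = "portfolio-site", url = "https://markandrewmarquez.com", type = "website" },
    { name = "github-status", url = "https://www.githubstatus.com/api/v2/status.json", type = "api" },
    # ...the remaining five entries follow the same shape
  ]
}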
06 // Infrastructure as Code
The Terraform project is seven files, split by responsibility so each one does exactly one thing:
cloudwatch-monitor/
├── main.tf # Provider config (aws + archive)
├── variables.tf # Endpoint list, interval, email
├── canary.tf # S3 artifacts bucket, IAM role, canary
├── alarms.tf # SNS topic + per-endpoint alarms
├── status-page.tf # S3, CloudFront, Lambda, ACM cert
├── outputs.tf # Console URLs, status page URL
└── canary-script/
└── index.js.tftpl # Node.js canary template
The alarms file is the piece that benefits most from for_each — one block generates every per-endpoint alarm from the shared monitors variable:
resource "aws_cloudwatch_metric_alarm" "endpoint_alarms" {
for_each = { for m in var.monitors : m.name => m }
alarm_name = "$${var.project_name}-$${each.key}-failed"
comparison_operator = "LessThanThreshold"
threshold = 100
evaluation_periods = 2
datapoints_to_alarm = 2
metric_name = "SuccessPercent"
namespace = "CloudWatchSynthetics"
statistic = "Average"
period = var.canary_interval_minutes * 60
dimensions = {
CanaryName = aws_synthetics_canary.monitor.name
StepName = each.key
}
treat_missing_data = "breaching"
alarm_actions = [aws_sns_topic.alarm_notifications.arn]
ok_actions = [aws_sns_topic.alarm_notifications.arn]
}
The alarm fires only after two consecutive evaluation periods below 100% success — one transient network blip at 30-minute intervals isn't enough to wake me up, but an hour of sustained failure is. Missing data is treated as breaching, so if the canary itself stops running entirely, that counts as failure too. ok_actions is wired to the same SNS topic, so I also get a "back to normal" email when things recover.
07 // The Status Page Pipeline
The public status page is deliberately boring infrastructure: no server, no database, no framework. Just a JSON file on S3 and a static HTML page that fetches it.
CloudWatch metrics (SuccessPercent per step)
                     │
                     ▼
       Lambda (Python · every 5 min)
     GetMetricData → build status.json
                     │
                     ▼
  S3 bucket (status.markandrewmarquez.com)
                     │
                     ▼
     CloudFront + ACM cert (us-east-1)
                     │
                     ▼
 Browser ← fetch('/status.json') every 30s
The Lambda function is about 60 lines of Python. It reads the endpoint list from an environment variable, calls GetMetricData for each step's SuccessPercent over the last 24 hours, and writes a small JSON document with each endpoint's current status and recent uptime percentage. The static frontend polls that file every 30 seconds and re-renders — green dots for healthy, red for failing, a last-updated timestamp so it's obvious when the data is stale.
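A sketch of the heart of that Lambda, under a few assumptions of mine: the MONITORS, CANARY_NAME, and STATUS_BUCKET environment variable names and the exact status.json shape are illustrative, not necessarily the repo's, but the GetMetricData query shape against the per-step SuccessPercent metric is the real mechanism:

import json
import os
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
s3 = boto3.client("s3")

def handler(event, context):
    # Assumed env vars: MONITORS is a JSON list of step names,
    # CANARY_NAME and STATUS_BUCKET identify the canary and the site bucket.
    step_names = json.loads(os.environ["MONITORS"])
    canary = os.environ["CANARY_NAME"]
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=24)

    # One query per step; SuccessPercent is dimensioned by CanaryName + StepName.
    queries = [
        {
            "Id": f"q{i}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "CloudWatchSynthetics",
                    "MetricName": "SuccessPercent",
                    "Dimensions": [
                        {"Name": "CanaryName", "Value": canary},
                        {"Name": "StepName", "Value": name},
                    ],
                },
                "Period": 1800,  # one datapoint per 30-minute run
                "Stat": "Average",
            },
        }
        for i, name in enumerate(step_names)
    ]
    data = cloudwatch.get_metric_data(
        MetricDataQueries=queries, StartTime=start, EndTime=end
    )
    results = {r["Id"]: r for r in data["MetricDataResults"]}

    endpoints = []
    for i, name in enumerate(step_names):
        values = results[f"q{i}"]["Values"]  # newest first by default
        endpoints.append({
            "name": name,
            "up": bool(values) and values[0] == 100.0,
            "uptime_24h": round(sum(values) / len(values), 2) if values else None,
        })

    s3.put_object(
        Bucket=os.environ["STATUS_BUCKET"],
        Key="status.json",
        Body=json.dumps({"updated": end.isoformat(), "endpoints": endpoints}),
        ContentType="application/json",
        CacheControl="max-age=60",  # keep CloudFront from serving stale status
    )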
The DNS handoff is the only piece Terraform doesn't manage: GoDaddy hosts the zone for markandrewmarquez.com, so adding the CNAME for the status subdomain is a manual two-minute step in the GoDaddy console after terraform apply finishes. Terraform outputs the CloudFront distribution domain, so the value to paste into GoDaddy comes straight from the apply output instead of a hunt through the console.
08 // Results
terraform apply stands up the entire stack from zero, including CloudFront and ACM, and the monthly bill lands at roughly $1.73, comfortably inside the $1–2 target.
09 // What I Took From It
- Cost-aware architecture pays off. The single-canary-with-steps pattern is the whole reason the bill sits at $1.73 instead of $12. Reading the Synthetics pricing page carefully before writing a line of code was worth more than any IaC discipline afterwards.
- Terraform rewards boringness. Splitting the stack across files by responsibility — canary, alarms, status page — and driving per-endpoint resources with for_each meant that adding the seventh endpoint was genuinely a one-line change. No surgery, no copy-paste.
- Public status pages don't need a framework. Lambda → JSON on S3 → static HTML is a pipeline that will keep working for years without maintenance. Compare that to anything with a runtime, a database, or a dependency tree.
- The right monitored endpoints matter more than the monitor. Picking the services I actually depend on — GitHub, Docker Hub, Ollama registry, M365 — makes the status page useful to me. Picking whatever is easiest to check would have made it decoration.
- The small gotchas are where the time goes. continueOnStepFailure, includeResponseBody, S3 lifecycle rules needing an explicit filter {} block, ACM having to live in us-east-1 — none of these are hard, but every one of them is an hour of debugging the first time you hit it.
The biggest lesson is the same one that keeps showing up across every build: constraints make architecture better. A $2/month budget eliminated every lazy option and forced the single-canary pattern, which turned out to be the cleanest design anyway.
10 // Try It
Open-source at github.com/marky224/cloudwatch-monitor. Live status page at status.markandrewmarquez.com. Full setup: an AWS account with CLI credentials configured, Terraform 1.5+, an email address to receive alerts, and a domain if you want the public status page. Clone, copy terraform.tfvars.example to terraform.tfvars, set your email, then terraform init && terraform apply. The endpoint list lives in variables.tf — swap in the services you actually care about and re-apply.
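In shell form, the whole bring-up looks roughly like this (the exact tfvars variable for the alert address is a guess on my part; check the example file):

git clone https://github.com/marky224/cloudwatch-monitor.git
cd cloudwatch-monitor
cp terraform.tfvars.example terraform.tfvars
# set the alert email in terraform.tfvars before applying
terraform init && terraform apply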