CI/CD & DevOps
"If it hurts, do it more often." — Jez Humble (paraphrasing Martin Fowler) [^humble2010]
CI vs CD vs CD — Clarifying the Terms
| Acronym | Full Term | Purpose | Trigger |
|---|---|---|---|
| CI | Continuous Integration | Merge early, test often | Every push/PR |
| CD | Continuous Delivery | Ready to deploy anytime | Every CI pass |
| CD | Continuous Deployment | Auto-deploy to prod | Every CD pass |
graph LR
A[Code Push] --> B[CI: Build, Lint, Test]
B --> C[CD: Deploy to Staging]
C --> D[CD: Deploy to Production]
D --> E[Monitor & Alert]
Pipeline Stages in Detail
1. CI — Continuous Integration
# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]
jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install ruff mypy
      - run: ruff check .
      - run: mypy src/
  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-typecheck
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install pytest pytest-cov
      - run: pytest --cov=src --cov-fail-under=85
  mutation-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - run: pip install mutmut
      - run: mutmut run --paths-to-mutate=src/
  contract-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: pact-foundation/pact-cli@v1
      - run: pact-verifier ...
# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]
jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install tools
        run: |
          wget -q https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.6/clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04.tar.xz
          tar -xf clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04.tar.xz
          # `export PATH` does not persist into later steps; GITHUB_PATH does.
          echo "$PWD/clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04/bin" >> "$GITHUB_PATH"
      # globstar makes `**` recurse; without it bash treats `**` like `*`.
      - run: |
          shopt -s globstar
          clang-tidy src/**/*.cpp -- -Iinclude -std=c++20
      - run: |
          shopt -s globstar
          clang-format --dry-run --Werror src/**/*.cpp include/**/*.hpp
  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-typecheck
    steps:
      - uses: actions/checkout@v4
      - name: Install Catch2
        run: |
          git clone https://github.com/catchorg/Catch2.git
          # Installing into the default system prefix needs root.
          cd Catch2 && cmake -Bbuild -DCMAKE_BUILD_TYPE=Release && sudo cmake --build build --target install
      - run: cmake -Bbuild -DCMAKE_BUILD_TYPE=Debug && cmake --build build
      - run: ./build/tests --reporter=junit > test-results.xml
# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]
jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '21'
      - name: Cache Maven
        uses: actions/cache@v4
        with:
          path: ~/.m2/repository
          key: maven-${{ hashFiles('**/pom.xml') }}
      - run: mvn checkstyle:check spotbugs:check
      - run: mvn test jacoco:report
      - uses: codecov/codecov-action@v4
        with:
          files: target/site/jacoco/jacoco.xml
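# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'
      # Restore first; the build below uses --no-restore.
      - run: dotnet restore
      - run: dotnet format --verify-no-changes
      - run: dotnet build --no-restore --configuration Release
      # `dotnet test` has no --threshold option. The coverage gate below assumes
      # the test projects reference the coverlet.msbuild package.
      - run: dotnet test --no-restore --configuration Release /p:CollectCoverage=true /p:CoverletOutputFormat=cobertura /p:Threshold=85
      - uses: codecov/codecov-action@v4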
| CI Gate | Purpose | Fail Fast? |
|---|---|---|
| Lint/Format | Consistent style, catch syntax errors | Yes (<1 min) |
| Type Check | Catch type errors early | Yes (<2 min) |
| Unit Tests | Verify logic correctness | Yes (<3 min) |
| Mutation Tests | Test suite quality | Optional (slow) |
| Contract Tests | Prevent breaking API changes | Yes |
| Security Scan | SAST/SCA/Secrets | Yes |
| Build Artifact | Publish Docker image | On main only |
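The fail-fast ordering in the table can be sketched as a small runner that executes gates from cheapest to slowest and stops at the first failure. This is only an illustration of the idea, not how GitHub Actions schedules jobs; `Gate`, `run_gates`, and `shell_gate` are hypothetical names introduced here.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Gate:
    """One CI gate: a name plus a check that returns True when it passes."""
    name: str
    check: Callable[[], bool]


def run_gates(gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the first failing gate's name, or None."""
    for gate in gates:
        if not gate.check():
            # Fail fast: slower gates later in the list never start.
            return gate.name
    return None


def shell_gate(name: str, command: List[str]) -> Gate:
    """Build a gate that passes when the command exits with status 0."""
    return Gate(name, lambda: subprocess.run(command).returncode == 0)
```

A local pre-push check could, for instance, call `run_gates([shell_gate("lint", ["ruff", "check", "."]), shell_gate("tests", ["pytest"])])`, assuming both tools are installed.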
2. CD (Delivery) — Staging Deployment
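# .github/workflows/cd-staging.yml
name: CD Staging
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]
    branches: [main]
jobs:
  deploy-staging:
    # `needs` can only reference jobs in this workflow; the workflow_run
    # trigger plus the condition below is what gates on a green CI run.
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    environment: staging
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build & Push Image
        run: |
          docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .
          docker push ghcr.io/${{ github.repository }}:${{ github.sha }}
      - name: Deploy to Staging
        run: |
          kubectl set image deployment/myapp \
            myapp=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n staging
          # Wait for the rollout itself rather than a fixed delay.
          kubectl rollout status deployment/myapp -n staging --timeout=120s
      - name: Smoke Tests
        run: curl -f https://staging.myapp.com/health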
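A single health request can fail while new pods are still warming up, so a smoke test that polls with a bounded number of retries tends to be less flaky. The sketch below is one way to do that with only the standard library; `http_ok` and `wait_until_healthy` are hypothetical helper names, and the attempt count and delay are illustrative defaults.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True when a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

In a pipeline step this could be driven as `wait_until_healthy(lambda: http_ok("https://staging.myapp.com/health"))`, exiting non-zero when it returns `False`.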
3. CD (Deployment) — Production Strategies
| Strategy | Description | Risk | Rollback Time |
|---|---|---|---|
| Blue/Green | Two identical envs, switch traffic at the load balancer or DNS | Low | Seconds |
| Canary | Route % traffic to new version | Very Low | Seconds |
| Rolling | Replace pods incrementally | Low | Minutes |
| Feature Flags | Toggle at runtime | Lowest | Instant |
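The feature-flag row relies on a runtime decision per request or user. A common way to implement a percentage rollout is to hash the flag name and user ID into a stable bucket, so the same user always gets the same answer and raising the percentage only ever adds users. A minimal sketch, with `bucket` and `is_enabled` as hypothetical names and SHA-256 as an arbitrary choice of stable hash:

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for roughly rollout_percent of users, deterministically."""
    return bucket(flag, user_id) < rollout_percent
```

Including the flag name in the hash keeps rollouts of different flags independent, so the same early users are not always the first to see every change.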
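# Canary deployment example (job fragment; belongs under `jobs:` next to deploy-staging)
deploy-canary:
  needs: deploy-staging
  runs-on: ubuntu-latest
  environment: production
  steps:
    - uses: actions/checkout@v4
    - name: Deploy Canary (~10%)
      run: |
        # kubectl has no percentage-based --replicas flag. This sketch assumes a
        # separate `myapp-canary` Deployment behind the same Service, sized at
        # about 1 canary pod per 9 stable pods.
        kubectl set image deployment/myapp-canary \
          myapp=ghcr.io/${{ github.repository }}:${{ github.sha }} \
          -n production
        kubectl scale deployment/myapp-canary --replicas=1 -n production
        kubectl rollout status deployment/myapp-canary -n production --timeout=5m
    - name: Monitor Metrics (15 min)
      run: |
        # Check error rate, latency, business metrics.
        # The script is expected to exit non-zero when the canary is unhealthy.
        ./scripts/check-canary-health.sh
    - name: Promote
      run: |
        kubectl set image deployment/myapp \
          myapp=ghcr.io/${{ github.repository }}:${{ github.sha }} \
          -n production
        kubectl rollout status deployment/myapp -n production --timeout=10m
    - name: Rollback Canary
      # Runs only when an earlier step failed, e.g. the health check.
      if: failure()
      run: kubectl scale deployment/myapp-canary --replicas=0 -n production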
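The promote-or-rollback decision comes down to comparing the canary's metrics with the stable baseline over the monitoring window. The sketch below shows one plausible rule set; the `Snapshot` type, the thresholds, and the choice of error rate plus p95 latency are all assumptions for illustration, not the contents of `check-canary-health.sh`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Aggregated metrics for one version over the monitoring window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_is_healthy(
    baseline: Snapshot,
    canary: Snapshot,
    min_requests: int = 500,
    max_error_rate_delta: float = 0.01,
    max_latency_ratio: float = 1.2,
) -> bool:
    """Return True only when the canary has enough traffic and is not worse."""
    if canary.requests < min_requests:
        return False  # too little traffic to judge, so do not promote
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return False
    if baseline.p95_latency_ms > 0 and (
        canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio
    ):
        return False
    return True
```

Treating "not enough data" as unhealthy is a deliberately conservative choice: a canary that received no traffic should not be promoted by default.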
The Four Key Metrics (DORA)
From Accelerate [^forsgren2018] — Predictors of software delivery performance
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | On-demand | Weekly | Monthly | <Monthly |
| Lead Time for Changes | <1 hour | <1 day | <1 week | >1 month |
| Mean Time to Restore | <1 hour | <1 day | <1 week | >1 month |
| Change Failure Rate | 0–15% | 15–30% | 30–45% | >45% |
Key finding: These four metrics correlate with organizational performance (profitability, productivity, customer satisfaction).
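As a rough illustration of how the four metrics can be derived from deployment records, the sketch below computes them from a list of deployments. The `Deployment` record and its fields are assumptions introduced for this example; real data would typically come from the CI/CD system and the incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four key metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }
```

The median is used for lead time so that a single long-lived branch does not dominate the figure; whether to report a mean or a median is a design choice left to the team.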
Pipeline as Code — Best Practices
| Principle | Practice |
|---|---|
| Version controlled | Pipeline in repo, not UI |
| Reproducible | Same pipeline runs locally & CI |
| Declarative | What, not how (YAML > scripts) |
| Fast feedback | Fail fast, parallelize, cache |
| Secure by default | Least privilege, signed artifacts |
Caching Strategy
# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Python pip cache
      - uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: pip-${{ runner.os }}-${{ hashFiles('requirements.txt', 'pyproject.toml') }}
          restore-keys: pip-${{ runner.os }}-
      # Mypy cache
      - uses: actions/cache@v4
        with:
          path: .mypy_cache
          key: mypy-${{ runner.os }}-${{ hashFiles('**/*.py') }}
          restore-keys: mypy-${{ runner.os }}-
      - run: pip install -r requirements.txt
      - run: mypy src/
# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # CMake/ccache
      - uses: actions/cache@v4
        with:
          path: |
            ~/.ccache
            ~/.cmake
          key: cmake-${{ runner.os }}-${{ hashFiles('CMakeLists.txt', 'CMakePresets.json') }}
          restore-keys: cmake-${{ runner.os }}-
      # Conan package cache
      - uses: actions/cache@v4
        with:
          path: ~/.conan/data
          key: conan-${{ runner.os }}-${{ hashFiles('conanfile.txt') }}
          restore-keys: conan-${{ runner.os }}-
      - name: Build with ccache
        run: |
          # The compiler launcher variables are enough; also exporting
          # CC="ccache gcc" would wrap every compile in ccache twice.
          cmake -Bbuild -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
          cmake --build build
# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '21'
      - name: Cache Maven
        uses: actions/cache@v4
        with:
          path: ~/.m2/repository
          key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
          restore-keys: maven-${{ runner.os }}-
      - name: Cache Gradle
        uses: actions/cache@v4
        with:
          path: |
            ~/.gradle/caches
            ~/.gradle/wrapper
          key: gradle-${{ runner.os }}-${{ hashFiles('**/*.gradle*') }}
          restore-keys: gradle-${{ runner.os }}-
      - run: mvn -B verify
# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'
      - name: Cache NuGet
        uses: actions/cache@v4
        with:
          path: ~/.nuget/packages
          key: nuget-${{ runner.os }}-${{ hashFiles('**/*.csproj') }}
          restore-keys: nuget-${{ runner.os }}-
      - name: Cache dotnet CLI
        uses: actions/cache@v4
        with:
          path: ~/.dotnet
          key: dotnetcli-${{ runner.os }}
      - run: dotnet restore
      - run: dotnet build --no-restore --configuration Release
      # `dotnet test` has no --threshold option. The coverage gate below assumes
      # the test projects reference the coverlet.msbuild package.
      - run: dotnet test --no-restore --configuration Release /p:CollectCoverage=true /p:CoverletOutputFormat=cobertura /p:Threshold=85
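The `key` and `restore-keys` pairs above follow one pattern: an exact key derived from the content of dependency manifests, plus a prefix fallback that reuses the newest older cache. The sketch below mimics that lookup logic in plain Python to make the behavior concrete. It is a simplified model, not the actual `actions/cache` implementation; `cache_key` and `lookup` are hypothetical names, and `stored_keys` is assumed to be ordered oldest to newest.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Optional


def cache_key(prefix: str, files: Iterable[Path]) -> str:
    """Derive a key that changes whenever any manifest file's content changes."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"


def lookup(key: str, restore_prefixes: List[str], stored_keys: List[str]) -> Optional[str]:
    """Return the cache entry to restore: an exact hit, else the newest prefix match."""
    if key in stored_keys:
        return key
    for prefix in restore_prefixes:
        matches = [k for k in stored_keys if k.startswith(prefix)]
        if matches:
            return matches[-1]  # newest entry, given oldest-to-newest ordering
    return None
```

A prefix hit restores a slightly stale cache, which is usually still much faster than starting cold because only the changed dependencies need to be fetched.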
Security Gates
# Security gates in CI
jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # SAST - Semgrep
      - uses: returntocorp/semgrep-action@v1
        with:
          config: p/security-audit
      # SCA - Dependabot / OWASP Dependency Check
      - uses: dependency-check/Dependency-Check_Action@main
        with:
          project: 'myapp'
          format: 'JSON'
          out: 'reports'
      # Secret scanning
      - uses: trufflesecurity/trufflehog@main
        with:
          path: ./
          base: ${{ github.event.pull_request.base.sha }}
          head: ${{ github.event.pull_request.head.sha }}
      # Container scanning
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'
          format: 'sarif'
          output: 'trivy-results.sarif'
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'
# Security gates in CI (C++)
jobs:
  security:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write
    steps:
      - uses: actions/checkout@v4
      # SAST - CodeQL (`languages` and `queries` are inputs of init, not analyze)
      - uses: github/codeql-action/init@v3
        with:
          languages: cpp
          queries: security-extended
      - uses: github/codeql-action/autobuild@v3
      - uses: github/codeql-action/analyze@v3
      # SCA - Dependency check
      - uses: dependency-check/Dependency-Check_Action@main
        with:
          project: 'myapp'
          format: 'HTML'
          out: 'reports'
      # Container scan
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'
# Security gates in CI (Java)
jobs:
  security:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write
    steps:
      - uses: actions/checkout@v4
      # SAST - CodeQL (`languages` is an input of init, not analyze)
      - uses: github/codeql-action/init@v3
        with:
          languages: java
      - uses: github/codeql-action/autobuild@v3
      - uses: github/codeql-action/analyze@v3
      # SCA - OWASP Dependency Check
      - uses: dependency-check/Dependency-Check_Action@main
        with:
          project: 'myapp'
          format: 'XML'
      # Secret scanning
      - uses: trufflesecurity/trufflehog@main
        with:
          path: ./
      # Container scan
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'
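# Security gates in CI (C#)
jobs:
  security:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write
    steps:
      - uses: actions/checkout@v4
      # SAST - CodeQL (`languages` is an input of init, not analyze)
      - uses: github/codeql-action/init@v3
        with:
          languages: csharp
      - uses: github/codeql-action/autobuild@v3
      - uses: github/codeql-action/analyze@v3
      # SCA - NuGet audit. This command lists known-vulnerable packages; on its
      # own it may not fail the job, so gate on its output if a hard stop is needed.
      - run: dotnet restore
      - run: dotnet list package --vulnerable --include-transitive
      # Secret scanning
      - uses: trufflesecurity/trufflehog@main
        with:
          path: ./
      # Container scan
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'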
Infrastructure as Code
# main.tf - AWS EKS cluster with Terraform
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.0"
cluster_name = "myapp-${var.environment}"
cluster_version = "1.28"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
eks_managed_node_group_defaults = {
ami_type = "AL2_x86_64"
instance_types = ["t3.medium"]
}
eks_managed_node_groups = {
general = {
min_size = 3
max_size = 10
desired_size = 5
instance_types = ["t3.medium"]
}
spot = {
min_size = 0
max_size = 20
desired_size = 5
instance_types = ["t3.medium", "t3a.medium"]
capacity_type = "SPOT"
}
}
}
# CMakeLists.txt - C++ build infrastructure
cmake_minimum_required(VERSION 3.25)
# Toolchain for vcpkg. It must be set before project() to take effect
# (or be passed as -DCMAKE_TOOLCHAIN_FILE=... on the command line).
set(CMAKE_TOOLCHAIN_FILE ${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake)
project(MyApp VERSION 1.0.0 LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)
find_package(fmt REQUIRED)
find_package(spdlog REQUIRED)
# Package and target names are case-sensitive: Boost, not boost.
find_package(Boost REQUIRED COMPONENTS system thread)
add_executable(main src/main.cpp)
target_link_libraries(main PRIVATE fmt::fmt spdlog::spdlog Boost::system Boost::thread)
# CI configuration
enable_testing()
add_subdirectory(tests)
// build.gradle.kts - Java/Gradle infrastructure
plugins {
    id("java")
    id("application")
    id("org.springframework.boot") version "3.2.0"
    id("io.spring.dependency-management") version "1.1.4"
}
java {
    toolchain.languageVersion.set(JavaLanguageVersion.of(21))
}
repositories {
    mavenCentral()
    maven { url = uri("https://repo.spring.io/milestone") }
}
dependencies {
    implementation(platform("org.springframework.boot:spring-boot-dependencies:3.2.0"))
    implementation("org.springframework.boot:spring-boot-starter-web")
    implementation("org.springframework.boot:spring-boot-starter-data-jpa")
    implementation("org.postgresql:postgresql")
    testImplementation("org.springframework.boot:spring-boot-starter-test")
}
// Kotlin DSL uses a type parameter here; withType(Test) is Groovy syntax.
tasks.withType<Test> {
    useJUnitPlatform()
    testLogging {
        events("passed", "failed", "skipped")
    }
}
# azure-pipelines.yml - C#/Azure DevOps
trigger:
  - main
pool:
  vmImage: 'ubuntu-latest'
variables:
  buildConfiguration: 'Release'
steps:
  - task: UseDotNet@2
    inputs:
      version: '8.0.x'
      performMultiLevelLookup: true
  - task: DotNetCoreCLI@2
    displayName: 'Restore'
    inputs:
      command: 'restore'
      projects: '**/*.csproj'
  - task: DotNetCoreCLI@2
    displayName: 'Build'
    inputs:
      command: 'build'
      arguments: '--configuration $(buildConfiguration) --no-restore'
      projects: '**/*.csproj'
  - task: DotNetCoreCLI@2
    displayName: 'Test'
    inputs:
      command: 'test'
      projects: '**/*Tests.csproj'
      # `dotnet test` has no --threshold option; this assumes the test projects
      # reference coverlet.msbuild, which enforces the threshold.
      arguments: '--configuration $(buildConfiguration) /p:CollectCoverage=true /p:CoverletOutputFormat=cobertura /p:Threshold=85'
  - task: PublishCodeCoverageResults@1
    inputs:
      codeCoverageTool: 'Cobertura'
      summaryFileLocation: '**/coverage.cobertura.xml'
Observability Pipeline
| Layer | Tools | Purpose |
|---|---|---|
| Logs | Loki, ELK, Datadog | Debugging, audit |
| Metrics | Prometheus, Grafana, Datadog | Alerting, SLOs |
| Traces | Jaeger, Zipkin, Tempo | Distributed debugging |
| Alerts | Alertmanager, PagerDuty | On-call routing |
# Prometheus metrics in Python (assumes a FastAPI app, as the middleware decorator suggests)
import time

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = FastAPI()

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency', ['method', 'endpoint'])
ACTIVE_CONNECTIONS = Gauge('active_connections', 'Active connections')


@app.middleware("http")
async def metrics_middleware(request, call_next):
    # perf_counter is monotonic, so clock adjustments cannot skew the latency.
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_COUNT.labels(request.method, request.url.path, response.status_code).inc()
    REQUEST_LATENCY.labels(request.method, request.url.path).observe(time.perf_counter() - start)
    return response


# Expose metrics on a separate port from the application
start_http_server(8000)
// Prometheus metrics in C++ (prometheus-cpp)
#include <chrono>
#include <functional>
#include <memory>
#include <string>
#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/gauge.h>
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

// Serves /metrics; call exposer.RegisterCollectable(registry) once at startup.
prometheus::Exposer exposer{"localhost:9090"};
auto registry = std::make_shared<prometheus::Registry>();
// Build*() registers a metric family; Add() returns the series for one label set.
auto& requestCount = prometheus::BuildCounter().Name("http_requests_total").Help("Total HTTP requests").Register(*registry);
auto& requestLatency = prometheus::BuildHistogram().Name("http_request_duration_seconds").Help("Request latency").Register(*registry);
auto& activeConnections = prometheus::BuildGauge().Name("active_connections").Help("Active connections").Register(*registry).Add({});
// Latency buckets in seconds (illustrative values).
const prometheus::Histogram::BucketBoundaries kLatencyBuckets{0.005, 0.025, 0.1, 0.5, 1.0, 5.0};

// Request and Response stand in for the host web framework's types.
class MetricsMiddleware {
public:
    void operator()(const Request& req, Response& res, std::function<void()> next) {
        auto start = std::chrono::steady_clock::now();
        next();
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        requestCount.Add({{"method", req.method}, {"path", req.path}, {"status", std::to_string(res.status)}}).Increment();
        requestLatency.Add({{"method", req.method}, {"path", req.path}}, kLatencyBuckets).Observe(elapsed.count());
    }
};
// Micrometer + Prometheus in Java
import io.micrometer.core.instrument.*;
import io.micrometer.prometheus.*;
MeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
Counter requestCount = Counter.builder("http_requests_total")
.description("Total HTTP requests")
.register(registry);
Timer requestLatency = Timer.builder("http_request_duration_seconds")
.description("Request latency")
.register(registry);
Gauge.builder("active_connections", connectionPool, ConnectionPool::getActiveCount)
.description("Active connections")
.register(registry);
// Spring Boot actuator exposes /actuator/prometheus automatically
// Prometheus metrics in C# (prometheus-net)
using System.Diagnostics;
using Prometheus;

var requestCount = Metrics.CreateCounter("http_requests_total", "Total HTTP requests", new[] { "method", "endpoint", "status" });
var requestLatency = Metrics.CreateHistogram("http_request_duration_seconds", "Request latency", new[] { "method", "endpoint" });
var activeConnections = Metrics.CreateGauge("active_connections", "Active connections");

app.Use(async (context, next) =>
{
    // Stopwatch is monotonic; DateTime.UtcNow can jump when the clock is adjusted.
    var stopwatch = Stopwatch.StartNew();
    await next.Invoke();
    // WithLabels takes strings, so convert the PathString explicitly.
    var path = context.Request.Path.Value ?? string.Empty;
    requestCount.WithLabels(context.Request.Method, path, context.Response.StatusCode.ToString()).Inc();
    requestLatency.WithLabels(context.Request.Method, path).Observe(stopwatch.Elapsed.TotalSeconds);
});

// Expose /metrics endpoint
app.MapMetrics();
GitOps & ArgoCD
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/myapp-config
    targetRevision: HEAD
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
Disaster Recovery & Backup
| RPO | RTO | Strategy |
|---|---|---|
| 0 | 0 | Multi-region active-active |
| <1 min | <5 min | Synchronous replication + auto-failover |
| <1 hour | <4 hours | Async replication + warm standby |
| <24 hours | <24 hours | Daily backups + documented restore |
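Reading the table as a lookup, a team can start from its required RPO and RTO and pick the least demanding strategy that still meets both. The sketch below encodes that reading; treating each row's bound as inclusive and ordering rows from least to most demanding are assumptions made for this example, and `Tier` and `pick_strategy` are hypothetical names.

```python
from datetime import timedelta
from typing import NamedTuple


class Tier(NamedTuple):
    rpo: timedelta   # worst-case data loss the strategy is meant to achieve
    rto: timedelta   # worst-case downtime the strategy is meant to achieve
    strategy: str


# Ordered from least to most demanding (an assumption about relative cost).
TIERS = [
    Tier(timedelta(hours=24), timedelta(hours=24), "Daily backups + documented restore"),
    Tier(timedelta(hours=1), timedelta(hours=4), "Async replication + warm standby"),
    Tier(timedelta(minutes=1), timedelta(minutes=5), "Synchronous replication + auto-failover"),
    Tier(timedelta(0), timedelta(0), "Multi-region active-active"),
]


def pick_strategy(required_rpo: timedelta, required_rto: timedelta) -> str:
    """Return the first (least demanding) strategy meeting both requirements."""
    for tier in TIERS:
        if tier.rpo <= required_rpo and tier.rto <= required_rto:
            return tier.strategy
    raise ValueError("requirements must be non-negative durations")
```

For example, a service that can tolerate two hours of data loss and eight hours of downtime lands on the warm-standby row rather than paying for synchronous replication.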
Resources
Books
- Continuous Delivery — Jez Humble & David Farley
- Accelerate — Nicole Forsgren, Jez Humble, Gene Kim
- The DevOps Handbook — Gene Kim, Jez Humble, Patrick Debois, John Willis
- Team Topologies — Matthew Skelton & Manuel Pais
- Building Secure and Reliable Systems — Heather Adkins et al. (Google SRE)
Talks & Papers
- "Continuous Delivery" by Jez Humble — GOTO Conference
- "The Four Key Metrics" by Nicole Forsgren — DevOps Enterprise Summit
- "GitOps: What, Why, How" by Alexis Richardson — CNCF
| Category | Tools |
|---|---|
| CI/CD | GitHub Actions, GitLab CI, CircleCI, Buildkite, Tekton |
| GitOps | ArgoCD, Flux, Fleet |
| Infrastructure | Terraform, Pulumi, Crossplane |
| Container | Docker, Podman, Buildah, Kaniko |
| Orchestration | Kubernetes, Nomad, ECS |
| Observability | Prometheus, Grafana, Loki, Tempo, Jaeger |
| Security | Trivy, Syft, Cosign, Kyverno |
Summary: CI/CD Maturity Model
| Level | CI | CD | Monitoring | Culture |
|---|---|---|---|---|
| 1. Initial | Manual builds | Manual deploy | Reactive | Silos |
| 2. Managed | Automated CI | Staging auto-deploy | Basic metrics | Dev/QA/Ops separate |
| 3. Defined | CI + CD pipeline | Automated to prod | DORA metrics tracked | DevOps collaboration |
| 4. Optimized | Full automation | Progressive deploy | Predictive analytics | Full ownership |
| 5. Optimizing | Self-healing | AI-driven | Chaos engineering | Continuous learning |
Start where you are. Improve continuously. Measure everything.