
CI/CD & DevOps

"If it hurts, do it more often." — Jez Humble (paraphrasing Martin Fowler) [^humble2010]

CI vs CD vs CD — Clarifying the Terms

| Acronym | Full Term | Purpose | Trigger |
|---------|-----------|---------|---------|
| CI | Continuous Integration | Merge early, test often | Every push/PR |
| CD | Continuous Delivery | Ready to deploy anytime | Every CI pass |
| CD | Continuous Deployment | Auto-deploy to prod | Every CD pass |

graph LR
    A[Code Push] --> B[CI: Build, Lint, Test]
    B --> C[CD: Deploy to Staging]
    C --> D[CD: Deploy to Production]
    D --> E[Monitor & Alert]

Pipeline Stages in Detail

1. CI — Continuous Integration

# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]

jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install ruff mypy
      - run: ruff check .
      - run: mypy src/

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-typecheck
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install pytest pytest-cov
      - run: pytest --cov=src --cov-fail-under=85

  mutation-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - run: pip install mutmut
      - run: mutmut run --paths-to-mutate=src/

  contract-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: pact-foundation/pact-cli@v1
      - run: pact-verifier ...
# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]

jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install tools
        run: |
          wget -q https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.6/clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04.tar.xz
          tar -xf clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04.tar.xz
          # `export PATH=...` does not persist across steps; append to GITHUB_PATH instead
          echo "$PWD/clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04/bin" >> "$GITHUB_PATH"
      - run: |
          shopt -s globstar  # make ** match nested directories in bash
          clang-tidy src/**/*.cpp -- -Iinclude -std=c++20
      - run: |
          shopt -s globstar
          clang-format --dry-run --Werror src/**/*.cpp include/**/*.hpp

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-typecheck
    steps:
      - uses: actions/checkout@v4
      - name: Install Catch2
        run: |
          git clone https://github.com/catchorg/Catch2.git
          cd Catch2 && cmake -Bbuild -DCMAKE_BUILD_TYPE=Release && cmake --build build --target install
      - run: cmake -Bbuild -DCMAKE_BUILD_TYPE=Debug && cmake --build build
      - run: ./build/tests --reporter=junit > test-results.xml
# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '21'
      - name: Cache Maven
        uses: actions/cache@v4
        with:
          path: ~/.m2/repository
          key: maven-${{ hashFiles('**/pom.xml') }}
      - run: mvn checkstyle:check spotbugs:check
      - run: mvn test jacoco:report
      - uses: codecov/codecov-action@v4
        with:
          files: target/site/jacoco/jacoco.xml
# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'
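      - run: dotnet restore
      - run: dotnet format --verify-no-changes
      - run: dotnet build --no-restore --configuration Release
      # `dotnet test` has no --threshold option; this coverage gate assumes the
      # test projects reference the coverlet.msbuild package
      - run: dotnet test --configuration Release /p:CollectCoverage=true /p:CoverletOutputFormat=cobertura /p:Threshold=85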
      - uses: codecov/codecov-action@v4

| CI Gate | Purpose | Fail Fast? |
|---------|---------|------------|
| Lint/Format | Consistent style, catch syntax errors | Yes (<1 min) |
| Type Check | Catch type errors early | Yes (<2 min) |
| Unit Tests | Verify logic correctness | Yes (<3 min) |
| Mutation Tests | Test suite quality | Optional (slow) |
| Contract Tests | Prevent breaking API changes | Yes |
| Security Scan | SAST/SCA/Secrets | Yes |
| Build Artifact | Publish Docker image | On main only |

2. CD (Delivery) — Staging Deployment

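# .github/workflows/cd-staging.yml
name: CD Staging
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]
    branches: [main]

permissions:
  contents: read
  packages: write  # needed to push to ghcr.io

jobs:
  deploy-staging:
    # workflow_run fires on any completion, so gate on success here;
    # `needs` cannot reference jobs that live in another workflow
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GHCR
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
      - name: Build & Push Image
        run: |
          docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .
          docker push ghcr.io/${{ github.repository }}:${{ github.sha }}
      - name: Deploy to Staging
        # Assumes cluster credentials (kubeconfig) are already configured on the runner
        run: |
          kubectl set image deployment/myapp \
            myapp=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n staging
          # Wait for the rollout to finish instead of sleeping for a fixed time
          kubectl rollout status deployment/myapp -n staging --timeout=120s
      - name: Smoke Tests
        run: curl -f https://staging.myapp.com/health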
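
A smoke test tends to be more reliable when it polls the health endpoint until a retry budget runs out rather than checking once. The following is a minimal Python sketch of that idea; the function names `http_ok` and `wait_until_healthy`, the retry count, and the delay are illustrative assumptions, and the URL in the usage note is the staging endpoint from the workflow above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and 4xx/5xx responses all count as unhealthy.
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 10,       # assumed retry budget
    delay_s: float = 3.0,     # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, pausing between failed attempts."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A CI step could call `wait_until_healthy(lambda: http_ok("https://staging.myapp.com/health"))` and exit non-zero when it returns `False`.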

3. CD (Deployment) — Production Strategies

| Strategy | Description | Risk | Rollback Time |
|----------|-------------|------|---------------|
| Blue/Green | Two identical envs, switch DNS | Low | Seconds |
| Canary | Route % traffic to new version | Very Low | Seconds |
| Rolling | Replace pods incrementally | Low | Minutes |
| Feature Flags | Toggle at runtime | Lowest | Instant |

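# Canary deployment example (fragment of a workflow's jobs: section)
# `kubectl set image` has no --replicas flag and replica counts cannot be percentages,
# so this sketch assumes a separate `myapp-canary` Deployment behind the same Service:
# 1 canary replica next to 9 stable replicas receives roughly 10% of traffic.
deploy-canary:
  needs: deploy-staging
  runs-on: ubuntu-latest
  environment: production
  steps:
    - uses: actions/checkout@v4  # provides ./scripts/check-canary-health.sh
    - name: Deploy Canary (~10%)
      run: |
        kubectl set image deployment/myapp-canary \
          myapp=ghcr.io/${{ github.repository }}:${{ github.sha }} \
          -n production
        kubectl scale deployment/myapp-canary --replicas=1 -n production
    - name: Monitor Metrics (15 min)
      id: canary
      run: |
        # Check error rate, latency, business metrics;
        # assumes the script exits non-zero when the canary is unhealthy
        if ./scripts/check-canary-health.sh; then
          echo "healthy=true" >> "$GITHUB_OUTPUT"
        else
          echo "healthy=false" >> "$GITHUB_OUTPUT"
        fi
    - name: Promote or Rollback
      run: |
        if [ "${{ steps.canary.outputs.healthy }}" = "true" ]; then
          kubectl set image deployment/myapp \
            myapp=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n production
          kubectl rollout status deployment/myapp -n production
        fi
        # Promoted or rolled back, the canary is no longer needed
        kubectl scale deployment/myapp-canary --replicas=0 -n production
        # Fail the job when the canary was rolled back
        [ "${{ steps.canary.outputs.healthy }}" = "true" ]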
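
The health script invoked in the Monitor Metrics step is not shown in the source. One way such a check could decide between promotion and rollback is to compare the canary's error rate and latency against the stable baseline. The sketch below is a hypothetical Python version of that decision; the `Snapshot` type, the `canary_is_healthy` function, and every threshold value are illustrative assumptions rather than part of any tool named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Aggregated metrics for one deployment over the observation window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        # Treat a window with no traffic as 0% errors to avoid dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def canary_is_healthy(
    baseline: Snapshot,
    canary: Snapshot,
    min_requests: int = 100,             # assumed: fewer requests is too small a sample
    max_error_rate_delta: float = 0.01,  # assumed: at most 1 percentage point worse
    max_latency_ratio: float = 1.2,      # assumed: p95 at most 20% slower
) -> bool:
    """Return True only if the canary saw enough traffic and stayed within limits."""
    if canary.requests < min_requests:
        return False
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return False
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return False
    return True
```

Refusing to promote when the sample is too small is a deliberate choice here: a canary that received almost no traffic has not demonstrated anything.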

The Four Key Metrics (DORA)

From Accelerate [^forsgren2018] — Predictors of software delivery performance

| Metric | Elite | High | Medium | Low |
|--------|-------|------|--------|-----|
| Deployment Frequency | On-demand | Weekly | Monthly | <Monthly |
| Lead Time for Changes | <1 hour | <1 day | <1 week | >1 month |
| Mean Time to Restore | <1 hour | <1 day | <1 week | >1 month |
| Change Failure Rate | 0–15% | 15–30% | 30–45% | >45% |

Key finding: These four metrics correlate with organizational performance (profitability, productivity, customer satisfaction).
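
The four metrics can be derived from a plain log of deployments. The sketch below shows one possible calculation in Python; the `Deployment` record and its field names are hypothetical, and time to restore is measured from the deployment time as a stand-in for the moment the failure began, which is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four key metrics over an observation window of `window_days`."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
        ),
    }
```

The median is used for lead time so that a single long-lived branch does not dominate the figure; a mean would be an equally valid choice under different assumptions.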


Pipeline as Code — Best Practices

| Principle | Practice |
|-----------|----------|
| Version controlled | Pipeline in repo, not UI |
| Reproducible | Same pipeline runs locally & CI |
| Declarative | What, not how (YAML > scripts) |
| Fast feedback | Fail fast, parallelize, cache |
| Secure by default | Least privilege, signed artifacts |
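
The "reproducible" and "fast feedback" rows can be combined in a small local runner that executes the same gates as CI, cheapest first, and stops at the first failure. This is a minimal sketch under assumptions: the `GATES` list reuses the lint, type-check, and test commands from the Python CI example, and `run_gates` is a hypothetical helper, not part of any CI product.

```python
import subprocess
import sys
from typing import Sequence

# Assumed gate commands, cheapest first; replace with the project's own tools.
GATES: list[tuple[str, list[str]]] = [
    ("lint", ["ruff", "check", "."]),
    ("typecheck", ["mypy", "src/"]),
    ("unit-tests", ["pytest", "--cov=src", "--cov-fail-under=85"]),
]


def run_gates(gates: Sequence[tuple[str, Sequence[str]]]) -> list[str]:
    """Run gates in order and stop at the first failure.

    Returns the names of the gates that passed before stopping.
    """
    passed: list[str] = []
    for name, command in gates:
        result = subprocess.run(list(command))
        if result.returncode != 0:
            print(f"gate '{name}' failed with exit code {result.returncode}", file=sys.stderr)
            break
        passed.append(name)
    return passed
```

Calling `run_gates(GATES)` from a pre-push hook and from the CI job keeps the two environments aligned; the pipeline passes when the returned list has the same length as `GATES`.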

Caching Strategy

# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Python pip cache
      - uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: pip-${{ runner.os }}-${{ hashFiles('requirements.txt', 'pyproject.toml') }}
          restore-keys: pip-${{ runner.os }}-

      # Mypy cache
      - uses: actions/cache@v4
        with:
          path: .mypy_cache
          key: mypy-${{ runner.os }}-${{ hashFiles('**/*.py') }}
          restore-keys: mypy-${{ runner.os }}-

      - run: pip install -r requirements.txt
      - run: mypy src/
# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # CMake/ccache
      - uses: actions/cache@v4
        with:
          path: |
            ~/.ccache
            ~/.cmake
          key: cmake-${{ runner.os }}-${{ hashFiles('CMakeLists.txt', 'CMakePresets.json') }}
          restore-keys: cmake-${{ runner.os }}-

      - uses: actions/cache@v4
        with:
          path: ~/.conan/data
          key: conan-${{ runner.os }}-${{ hashFiles('conanfile.txt') }}
          restore-keys: conan-${{ runner.os }}-

      - name: Build with ccache
        run: |
          # Assumes ccache is installed on the runner; pin its directory to the cached path
          export CCACHE_DIR="$HOME/.ccache"
          # The launcher variables are enough; also wrapping CC/CXX would invoke ccache twice
          cmake -Bbuild -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
          cmake --build build
# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '21'

      - name: Cache Maven
        uses: actions/cache@v4
        with:
          path: ~/.m2/repository
          key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
          restore-keys: maven-${{ runner.os }}-

      - name: Cache Gradle
        uses: actions/cache@v4
        with:
          path: |
            ~/.gradle/caches
            ~/.gradle/wrapper
          key: gradle-${{ runner.os }}-${{ hashFiles('**/*.gradle*') }}
          restore-keys: gradle-${{ runner.os }}-

      - run: mvn -B verify
# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Cache NuGet
        uses: actions/cache@v4
        with:
          path: ~/.nuget/packages
          key: nuget-${{ runner.os }}-${{ hashFiles('**/*.csproj') }}
          restore-keys: nuget-${{ runner.os }}-

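      - name: Cache dotnet CLI
        uses: actions/cache@v4
        with:
          path: ~/.dotnet
          key: dotnetcli-${{ runner.os }}

      - run: dotnet restore
      - run: dotnet build --no-restore --configuration Release
      # `dotnet test` has no --threshold option; this coverage gate assumes the
      # test projects reference the coverlet.msbuild package
      - run: dotnet test --configuration Release /p:CollectCoverage=true /p:CoverletOutputFormat=cobertura /p:Threshold=85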
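
All of the caching examples above follow one pattern: the key embeds a hash of the dependency manifests, and `restore-keys` provides a prefix fallback so a slightly stale cache is still reused. The Python sketch below is a conceptual model of that lookup, not GitHub's implementation; `content_key` and `lookup` are hypothetical names.

```python
import hashlib
from typing import Optional


def content_key(prefix: str, files: dict[str, bytes]) -> str:
    """Build a key that changes whenever any tracked file's content changes."""
    digest = hashlib.sha256()
    for path in sorted(files):  # sort so the key does not depend on dict order
        digest.update(path.encode())
        digest.update(hashlib.sha256(files[path]).digest())
    return f"{prefix}{digest.hexdigest()}"


def lookup(saved_keys: list[str], key: str, restore_prefixes: list[str]) -> Optional[str]:
    """Return the cache key to restore: an exact hit first, then the newest prefix match.

    `saved_keys` is assumed to be ordered from oldest to most recently saved.
    """
    if key in saved_keys:
        return key
    for prefix in restore_prefixes:
        matches = [k for k in saved_keys if k.startswith(prefix)]
        if matches:
            return matches[-1]
    return None
```

With this model, editing `requirements.txt` produces a new key (a miss on the exact match), but the `pip-<os>-` prefix still restores the previous cache, so only the changed packages need to be downloaded.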

Security Gates

# Security gates in CI
jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # SAST - Semgrep
      - uses: returntocorp/semgrep-action@v1
        with:
          config: p/security-audit
      # SCA - Dependabot / OWASP Dependency Check
      - uses: dependency-check/Dependency-Check_Action@main
        with:
          project: 'myapp'
          path: '.'  # directory to scan
          format: 'JSON'
          out: 'reports'
      # Secret scanning
      - uses: trufflesecurity/trufflehog@main
        with:
          path: ./
          base: ${{ github.event.pull_request.base.sha }}
          head: ${{ github.event.pull_request.head.sha }}
      # Container scanning
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'
          format: 'sarif'
          output: 'trivy-results.sarif'
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'
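# Security gates in CI (C++)
jobs:
  security:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # lets CodeQL upload its results
    steps:
      - uses: actions/checkout@v4
      # SAST - CodeQL (languages and queries are inputs of init, not analyze)
      - uses: github/codeql-action/init@v3
        with:
          languages: cpp
          queries: security-extended
      - uses: github/codeql-action/autobuild@v3
      - uses: github/codeql-action/analyze@v3
      # SCA - Dependency check
      - uses: dependency-check/Dependency-Check_Action@main
        with:
          project: 'myapp'
          path: '.'
          format: 'HTML'
          out: 'reports'
      # Container scan
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'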
# Security gates in CI (Java)
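jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # SAST - CodeQL (languages is an input of init, not analyze)
      - uses: github/codeql-action/init@v3
        with:
          languages: java
      - uses: github/codeql-action/autobuild@v3
      - uses: github/codeql-action/analyze@v3
      # SCA - OWASP Dependency Check
      - uses: dependency-check/Dependency-Check_Action@main
        with:
          project: 'myapp'
          path: '.'
          format: 'XML'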
      # Secret scanning
      - uses: trufflesecurity/trufflehog@main
        with:
          path: ./
      # Container scan
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'
# Security gates in CI (C#)
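jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # SAST - CodeQL (languages is an input of init, not analyze)
      - uses: github/codeql-action/init@v3
        with:
          languages: csharp
      - uses: github/codeql-action/autobuild@v3
      - uses: github/codeql-action/analyze@v3
      # SCA - NuGet vulnerability audit; this lists known-vulnerable packages
      # but may not fail the build on its own
      - run: dotnet restore
      - run: dotnet list package --vulnerable --include-transitive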
      # Secret scanning
      - uses: trufflesecurity/trufflehog@main
        with:
          path: ./
      # Container scan
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'

Infrastructure as Code

# main.tf - AWS EKS cluster with Terraform
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "myapp-${var.environment}"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_group_defaults = {
    ami_type       = "AL2_x86_64"
    instance_types = ["t3.medium"]
  }

  eks_managed_node_groups = {
    general = {
      min_size     = 3
      max_size     = 10
      desired_size = 5
      instance_types = ["t3.medium"]
    }

    spot = {
      min_size     = 0
      max_size     = 20
      desired_size = 5
      instance_types = ["t3.medium", "t3a.medium"]
      capacity_type  = "SPOT"
    }
  }
}
# CMakeLists.txt - C++ build infrastructure
cmake_minimum_required(VERSION 3.25)

# Toolchain for vcpkg: must be set before project(), and VCPKG_ROOT is an environment variable
if(DEFINED ENV{VCPKG_ROOT} AND NOT DEFINED CMAKE_TOOLCHAIN_FILE)
  set(CMAKE_TOOLCHAIN_FILE "$ENV{VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" CACHE STRING "")
endif()

project(MyApp VERSION 1.0.0 LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

find_package(fmt REQUIRED)
find_package(spdlog REQUIRED)
find_package(Boost REQUIRED COMPONENTS system thread)  # package name is case-sensitive

add_executable(main src/main.cpp)
target_link_libraries(main PRIVATE fmt::fmt spdlog::spdlog Boost::system Boost::thread)

# CI configuration
enable_testing()
add_subdirectory(tests)
// build.gradle.kts - Java/Gradle infrastructure
plugins {
    id("java")
    id("application")
    id("org.springframework.boot") version "3.2.0"
    id("io.spring.dependency-management") version "1.1.4"
}

java {
    toolchain.languageVersion.set(JavaLanguageVersion.of(21))
}

repositories {
    mavenCentral()
    maven { url = uri("https://repo.spring.io/milestone") }
}

dependencies {
    implementation(platform("org.springframework.boot:spring-boot-dependencies:3.2.0"))
    implementation("org.springframework.boot:spring-boot-starter-web")
    implementation("org.springframework.boot:spring-boot-starter-data-jpa")
    implementation("org.postgresql:postgresql")
    testImplementation("org.springframework.boot:spring-boot-starter-test")
}

// The Kotlin DSL takes the task type as a type parameter, not a class argument
tasks.withType<Test> {
    useJUnitPlatform()
    testLogging {
        events("passed", "failed", "skipped")
    }
}
# azure-pipelines.yml - C#/Azure DevOps
trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

variables:
  buildConfiguration: 'Release'

steps:
- task: UseDotNet@2
  inputs:
    version: '8.0.x'
    performMultiLevelLookup: true

- task: DotNetCoreCLI@2
  displayName: 'Restore'
  inputs:
    command: 'restore'
    projects: '**/*.csproj'

- task: DotNetCoreCLI@2
  displayName: 'Build'
  inputs:
    command: 'build'
    arguments: '--configuration $(buildConfiguration) --no-restore'
    projects: '**/*.csproj'

- task: DotNetCoreCLI@2
  displayName: 'Test'
  inputs:
    command: 'test'
    projects: '**/*Tests.csproj'
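    # `dotnet test` has no --threshold option; assumes the test projects reference coverlet.msbuild
    arguments: '--configuration $(buildConfiguration) /p:CollectCoverage=true /p:CoverletOutputFormat=cobertura /p:Threshold=85'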

- task: PublishCodeCoverageResults@1
  inputs:
    codeCoverageTool: 'Cobertura'
    summaryFileLocation: '**/coverage.cobertura.xml'

Observability Pipeline

| Layer | Tools | Purpose |
|-------|-------|---------|
| Logs | Loki, ELK, Datadog | Debugging, audit |
| Metrics | Prometheus, Grafana, Datadog | Alerting, SLOs |
| Traces | Jaeger, Zipkin, Tempo | Distributed debugging |
| Alerts | Alertmanager, PagerDuty | On-call routing |

# Prometheus metrics in Python (assumes a FastAPI app, which the middleware signature suggests)
import time

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = FastAPI()

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency', ['method', 'endpoint'])
ACTIVE_CONNECTIONS = Gauge('active_connections', 'Active connections')

@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.perf_counter()  # monotonic clock, unaffected by wall-clock adjustments
    response = await call_next(request)
    # Raw paths with IDs create one series per URL; prefer a route template where available
    REQUEST_COUNT.labels(request.method, request.url.path, response.status_code).inc()
    REQUEST_LATENCY.labels(request.method, request.url.path).observe(time.perf_counter() - start)
    return response

# Expose metrics on a separate port
start_http_server(8000)
// Prometheus metrics in C++ (prometheus-cpp)
#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/gauge.h>
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

#include <chrono>
#include <functional>
#include <memory>
#include <string>

using prometheus::Exposer;
using prometheus::Histogram;

// Serve /metrics on 8080; 9090 is the Prometheus server's own default port.
// Call exposer.RegisterCollectable(registry) once at startup so the registry is scraped.
Exposer exposer{"127.0.0.1:8080"};
auto registry = std::make_shared<prometheus::Registry>();

// Register() returns a metric family; Add() creates one time series per label set
auto& requestCount = prometheus::BuildCounter().Name("http_requests_total").Help("Total HTTP requests").Register(*registry);
auto& requestLatency = prometheus::BuildHistogram().Name("http_request_duration_seconds").Help("Request latency").Register(*registry);
auto& activeConnections = prometheus::BuildGauge().Name("active_connections").Help("Active connections").Register(*registry);

// Request and Response are placeholders for the web framework's own types
class MetricsMiddleware {
public:
    void operator()(const Request& req, Response& res, std::function<void()> next) {
        auto start = std::chrono::steady_clock::now();
        next();
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        requestCount.Add({{"method", req.method}, {"path", req.path}, {"status", std::to_string(res.status)}}).Increment();
        // Bucket boundaries (in seconds) are illustrative
        requestLatency.Add({{"method", req.method}, {"path", req.path}},
                           Histogram::BucketList{0.005, 0.05, 0.5, 5.0}).Observe(elapsed.count());
    }
};
// Micrometer + Prometheus in Java
import io.micrometer.core.instrument.*;
import io.micrometer.prometheus.*;

MeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

Counter requestCount = Counter.builder("http_requests_total")
    .description("Total HTTP requests")
    .register(registry);

Timer requestLatency = Timer.builder("http_request_duration_seconds")
    .description("Request latency")
    .register(registry);

Gauge.builder("active_connections", connectionPool, ConnectionPool::getActiveCount)
    .description("Active connections")
    .register(registry);

// Spring Boot actuator exposes /actuator/prometheus automatically
// Prometheus metrics in C#
using Prometheus;

var requestCount = Metrics.CreateCounter("http_requests_total", "Total HTTP requests", new[] { "method", "endpoint", "status" });
var requestLatency = Metrics.CreateHistogram("http_request_duration_seconds", "Request latency", new[] { "method", "endpoint" });
var activeConnections = Metrics.CreateGauge("active_connections", "Active connections");

app.Use(async (context, next) =>
{
    var start = DateTime.UtcNow;
    await next.Invoke();
    var elapsed = DateTime.UtcNow - start;

    requestCount.WithLabels(context.Request.Method, context.Request.Path, context.Response.StatusCode.ToString()).Inc();
    requestLatency.WithLabels(context.Request.Method, context.Request.Path).Observe(elapsed.TotalSeconds);
});

// Expose /metrics endpoint
app.MapMetrics();

GitOps & ArgoCD

# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/myapp-config
    targetRevision: HEAD
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true

Disaster Recovery & Backup

| RPO | RTO | Strategy |
|-----|-----|----------|
| 0 | 0 | Multi-region active-active |
| <1 min | <5 min | Synchronous replication + auto-failover |
| <1 hour | <4 hours | Async replication + warm standby |
| <24 hours | <24 hours | Daily backups + documented restore |
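
The table reads naturally as a lookup: given the RPO and RTO a service must meet, pick the least demanding strategy whose bounds still satisfy both. The Python sketch below encodes the four rows that way; treating the rows as ordered from least to most demanding to operate is an assumption, and `pick_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# (RPO bound, RTO bound, strategy), assumed ordered from least to most demanding to operate.
TIERS = [
    (timedelta(hours=24), timedelta(hours=24), "Daily backups + documented restore"),
    (timedelta(hours=1), timedelta(hours=4), "Async replication + warm standby"),
    (timedelta(minutes=1), timedelta(minutes=5), "Synchronous replication + auto-failover"),
    (timedelta(0), timedelta(0), "Multi-region active-active"),
]


def pick_strategy(required_rpo: timedelta, required_rto: timedelta) -> str:
    """Return the first tier whose RPO and RTO bounds both fit within the requirement."""
    for rpo_bound, rto_bound, strategy in TIERS:
        if rpo_bound <= required_rpo and rto_bound <= required_rto:
            return strategy
    raise ValueError("RPO and RTO must be non-negative")
```

For example, a service that can lose up to two hours of data and must be back within four hours maps to async replication with a warm standby, since daily backups cannot meet the two-hour RPO.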

Resources

Books

  • Continuous Delivery — Jez Humble & David Farley
  • Accelerate — Nicole Forsgren, Jez Humble, Gene Kim
  • The DevOps Handbook — Gene Kim, Jez Humble, Patrick Debois, John Willis
  • Team Topologies — Matthew Skelton & Manuel Pais
  • Building Secure and Reliable Systems — Heather Adkins et al. (Google SRE)

Talks & Papers

  • "Continuous Delivery" by Jez Humble — GOTO Conference
  • "The Four Key Metrics" by Nicole Forsgren — DevOps Enterprise Summit
  • "GitOps: What, Why, How" by Alexis Richardson — CNCF

Tools

| Category | Tools |
|----------|-------|
| CI/CD | GitHub Actions, GitLab CI, CircleCI, Buildkite, Tekton |
| GitOps | ArgoCD, Flux, Fleet |
| Infrastructure | Terraform, Pulumi, Crossplane |
| Container | Docker, Podman, Buildah, Kaniko |
| Orchestration | Kubernetes, Nomad, ECS |
| Observability | Prometheus, Grafana, Loki, Tempo, Jaeger |
| Security | Trivy, Syft, Cosign, Kyverno |

Summary: CI/CD Maturity Model

| Level | CI | CD | Monitoring | Culture |
|-------|----|----|------------|---------|
| 1. Initial | Manual builds | Manual deploy | Reactive | Silos |
| 2. Managed | Automated CI | Staging auto-deploy | Basic metrics | Dev/QA/Ops separate |
| 3. Defined | CI + CD pipeline | Automated to prod | DORA metrics tracked | DevOps collaboration |
| 4. Quantitatively Managed | Full automation | Progressive deploy | Predictive analytics | Full ownership |
| 5. Optimizing | Self-healing | AI-driven | Chaos engineering | Continuous learning |

Start where you are. Improve continuously. Measure everything.