CI/CD & DevOps

"If it hurts, do it more often." — Jez Humble (paraphrasing Martin Fowler) [^humble2010]

CI vs CD vs CD — Clarifying the Terms

Acronym	Full Term	Purpose	Trigger
CI	Continuous Integration	Merge early, test often	Every push/PR
CD	Continuous Delivery	Ready to deploy anytime	Every CI pass
CD	Continuous Deployment	Auto-deploy to prod	Every CD pass

graph LR
    A[Code Push] --> B[CI: Build, Lint, Test]
    B --> C[CD: Deploy to Staging]
    C --> D[CD: Deploy to Production]
    D --> E[Monitor & Alert]

Pipeline Stages in Detail

1. CI — Continuous Integration

# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]

jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install ruff mypy
      - run: ruff check .
      - run: mypy src/

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-typecheck
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install pytest pytest-cov
      - run: pytest --cov=src --cov-fail-under=85

  mutation-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - run: pip install mutmut
      - run: mutmut run --paths-to-mutate=src/

  contract-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: pact-foundation/pact-cli@v1
      - run: pact-verifier ...

# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]

jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install tools
        run: |
          wget -q https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.6/clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04.tar.xz
          tar -xf clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04.tar.xz
          # `export PATH` does not survive into later steps; appending to GITHUB_PATH does
          echo "$PWD/clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04/bin" >> "$GITHUB_PATH"
      - name: Lint and format check
        run: |
          shopt -s globstar  # bash expands ** recursively only with globstar enabled
          clang-tidy src/**/*.cpp -- -Iinclude -std=c++20
          clang-format --dry-run --Werror src/**/*.cpp include/**/*.hpp

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-typecheck
    steps:
      - uses: actions/checkout@v4
      - name: Install Catch2
        run: |
          git clone https://github.com/catchorg/Catch2.git
          # Installing under /usr/local needs root on hosted runners
          cd Catch2 && cmake -Bbuild -DCMAKE_BUILD_TYPE=Release && cmake --build build && sudo cmake --install build
      - run: cmake -Bbuild -DCMAKE_BUILD_TYPE=Debug && cmake --build build
      - run: ./build/tests --reporter=junit > test-results.xml

# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '21'
      - name: Cache Maven
        uses: actions/cache@v4
        with:
          path: ~/.m2/repository
          key: maven-${{ hashFiles('**/pom.xml') }}
      - run: mvn checkstyle:check spotbugs:check
      - run: mvn test jacoco:report
      - uses: codecov/codecov-action@v4
        with:
          files: target/site/jacoco/jacoco.xml

# .github/workflows/ci.yml (example)
name: CI
on: [push, pull_request]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'
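      - run: dotnet restore
      - run: dotnet format --verify-no-changes
      - run: dotnet build --no-restore --configuration Release
      # `dotnet test` has no --threshold flag. This assumes the test project
      # references coverlet.msbuild, which enforces the coverage threshold.
      - run: dotnet test --configuration Release /p:CollectCoverage=true /p:CoverletOutputFormat=cobertura /p:Threshold=85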
      - uses: codecov/codecov-action@v4

CI Gate	Purpose	Fail Fast?
Lint/Format	Consistent style, catch syntax errors	Yes (<1 min)
Type Check	Catch type errors early	Yes (<2 min)
Unit Tests	Verify logic correctness	Yes (<3 min)
Mutation Tests	Test suite quality	Optional (slow)
Contract Tests	Prevent breaking API changes	Yes
Security Scan	SAST/SCA/Secrets	Yes
Build Artefact	Publish Docker/image	On main only
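
The fail-fast ordering in the table can be sketched as a small runner that executes gates cheapest-first and stops at the first blocking failure. This is only an illustrative model, not how any CI provider schedules jobs; the `Gate` class, `run_gates` function, and the gate names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Gate:
    name: str
    check: Callable[[], bool]  # returns True when the gate passes
    fail_fast: bool = True     # False marks an optional, non-blocking gate


def run_gates(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first failing fail-fast gate.

    Returns (pipeline_ok, names_of_gates_that_ran).
    """
    ran: List[str] = []
    for gate in gates:
        ran.append(gate.name)
        if not gate.check() and gate.fail_fast:
            return False, ran  # later, slower gates never start
    return True, ran
```

Calling `run_gates([Gate("lint", lint_ok), Gate("mutation", mutation_ok, fail_fast=False), Gate("unit", unit_ok)])` would let a failing mutation run be reported without blocking the unit tests, which matches the "Optional (slow)" row above.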

2. CD (Delivery) — Staging Deployment

The staging pipeline is YAML consumed by the CI provider (GitHub Actions here), not source code in the application's language. It is therefore identical whether the application is written in Python, C++, Java, or C#, and a single definition serves all four:

# .github/workflows/cd-staging.yml
name: CD Staging
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]
    branches: [main]

jobs:
  deploy-staging:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
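      - name: Build & Push Image
        run: |
          # Pushing to GHCR needs a login; assumes the job grants `packages: write`
          echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
          docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .
          docker push ghcr.io/${{ github.repository }}:${{ github.sha }}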
      - name: Deploy to Staging
        run: |
          kubectl set image deployment/myapp \
            myapp=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n staging
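      - name: Smoke Tests
        run: |
          # Block until the rollout finishes instead of sleeping a fixed 30 s
          kubectl rollout status deployment/myapp -n staging --timeout=120s
          curl -f https://staging.myapp.com/health || exit 1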
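
A smoke test against a freshly deployed service is usually more reliable when it polls the health endpoint rather than checking once. The sketch below shows one way to do that with the standard library; `http_status` and `wait_until_healthy` are hypothetical helper names, and the attempt count and delay are arbitrary defaults, not values from the pipeline above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, ...


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay_s: float = 3.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no sleep after the final attempt
    return False
```

The `probe` and `sleep` parameters exist so the polling logic can be exercised without a network; in a pipeline step the defaults would be used and the script would exit non-zero when the function returns `False`.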

3. CD (Deployment) — Production Strategies

Strategy	Description	Risk	Rollback Time
Blue/Green	Two identical envs, switch DNS	Low	Seconds
Canary	Route % traffic to new version	Very Low	Seconds
Rolling	Replace pods incrementally	Low	Minutes
Feature Flags	Toggle at runtime	Lowest	Instant
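
The feature-flag row relies on a runtime toggle that can be widened gradually. A minimal sketch of a percentage rollout, assuming users are identified by a stable string ID, hashes the flag name and user ID into one of 100 buckets. The function name `flag_enabled` and its parameters are hypothetical and not taken from any flag service.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically decide whether a user falls inside the rollout.

    The same (flag, user_id) pair always maps to the same bucket, so a user
    who sees the feature at 10% still sees it at 50%.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < rollout_percent
```

Setting `rollout_percent` to 0 acts as the instant rollback the table mentions, since no bucket is below zero.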

As with staging, the canary step is YAML consumed by the CI provider and is the same regardless of the application's implementation language:

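# .github/workflows/cd-canary.yml (job fragment; nest under `jobs:`)
deploy-canary:
  needs: deploy-staging  # resolves only if deploy-staging is a job in the same workflow
  runs-on: ubuntu-latest
  environment: production
  steps:
    - uses: actions/checkout@v4
    - name: Deploy Canary (~10%)
      run: |
        # kubectl takes absolute replica counts, not percentages. This assumes a
        # separate myapp-canary Deployment (hypothetical name) behind the same
        # Service: 1 canary replica next to 9 stable ones gets roughly 10% of traffic.
        kubectl set image deployment/myapp-canary \
          myapp=ghcr.io/${{ github.repository }}:${{ github.sha }} \
          -n production
        kubectl scale deployment/myapp-canary --replicas=1 -n production
    - name: Monitor Metrics (15 min)
      id: canary
      run: |
        # Check error rate, latency, business metrics.
        # Assumes the script exits non-zero when the canary is unhealthy.
        if ./scripts/check-canary-health.sh; then
          echo "healthy=true" >> "$GITHUB_OUTPUT"
        else
          echo "healthy=false" >> "$GITHUB_OUTPUT"
        fi
    - name: Promote or Rollback
      run: |
        if [ "${{ steps.canary.outputs.healthy }}" = "true" ]; then
          # Promote: roll the stable Deployment to the new image
          kubectl set image deployment/myapp \
            myapp=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n production
          kubectl rollout status deployment/myapp -n production
        fi
        # Promoted or not, the canary is finished; scaling it to zero is the rollback
        kubectl scale deployment/myapp-canary --replicas=0 -n production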
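
The health check that decides between promotion and rollback typically compares the canary against the stable baseline over the same time window. The sketch below is one possible decision rule, not the contents of `check-canary-health.sh`; the `Window` type, the thresholds, and the minimum-traffic rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Aggregated traffic for one version over the observation window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_healthy(
    baseline: Window,
    canary: Window,
    min_requests: int = 100,
    max_error_delta: float = 0.01,
    max_latency_ratio: float = 1.2,
) -> bool:
    """Return True only if the canary is no worse than baseline within tolerances."""
    if canary.requests < min_requests:
        return False  # too little traffic to judge, so do not promote
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return False  # error rate regressed by more than the allowed delta
    if baseline.p95_latency_ms > 0 and (
        canary.p95_latency_ms / baseline.p95_latency_ms > max_latency_ratio
    ):
        return False  # p95 latency regressed by more than the allowed ratio
    return True
```

Comparing against the live baseline rather than fixed thresholds keeps the check meaningful when overall traffic or latency shifts during the day.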

The Four Key Metrics (DORA)

From Accelerate [^forsgren2018] — Predictors of software delivery performance

Metric	Elite	High	Medium	Low
Deployment Frequency	On-demand	Weekly	Monthly	<Monthly
Lead Time for Changes	<1 hour	<1 day	<1 week	>1 month
Mean Time to Restore	<1 hour	<1 day	<1 week	>1 month
Change Failure Rate	0–15%	15–30%	30–45%	>45%

Key finding: These four metrics correlate with organizational performance (profitability, productivity, customer satisfaction).
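
The four metrics can be computed from a plain log of deployments. The sketch below assumes each record carries a commit time, a deploy time, a failure flag, and an optional restore time; the `Deployment` record and `dora_metrics` function are hypothetical, and real tooling would derive these fields from version control and incident data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # did this change cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four key metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }
```

The median is used for lead time so that a single slow change does not dominate the figure; that choice is an assumption of this sketch.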

Pipeline as Code — Best Practices

Principle	Practice
Version controlled	Pipeline in repo, not UI
Reproducible	Same pipeline runs locally & CI
Declarative	What, not how (YAML > scripts)
Fast feedback	Fail fast, parallelise, cache
Secure by default	Least privilege, signed artefacts

Caching Strategy

# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Python pip cache
      - uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: pip-${{ runner.os }}-${{ hashFiles('requirements.txt', 'pyproject.toml') }}
          restore-keys: pip-${{ runner.os }}-

      # Mypy cache
      - uses: actions/cache@v4
        with:
          path: .mypy_cache
          key: mypy-${{ runner.os }}-${{ hashFiles('**/*.py') }}
          restore-keys: mypy-${{ runner.os }}-

      - run: pip install -r requirements.txt
      - run: mypy src/

# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # CMake/ccache
      - uses: actions/cache@v4
        with:
          path: |
            ~/.ccache
            ~/.cmake
          key: cmake-${{ runner.os }}-${{ hashFiles('CMakeLists.txt', 'CMakePresets.json') }}
          restore-keys: cmake-${{ runner.os }}-

      - uses: actions/cache@v4
        with:
          path: ~/.conan/data
          key: conan-${{ runner.os }}-${{ hashFiles('conanfile.txt') }}
          restore-keys: conan-${{ runner.os }}-

      - name: Build with ccache
        run: |
          # The launcher variables are sufficient; also exporting CC="ccache gcc" is redundant
          cmake -Bbuild -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
          cmake --build build

# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '21'

      - name: Cache Maven
        uses: actions/cache@v4
        with:
          path: ~/.m2/repository
          key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
          restore-keys: maven-${{ runner.os }}-

      - name: Cache Gradle
        uses: actions/cache@v4
        with:
          path: |
            ~/.gradle/caches
            ~/.gradle/wrapper
          key: gradle-${{ runner.os }}-${{ hashFiles('**/*.gradle*') }}
          restore-keys: gradle-${{ runner.os }}-

      - run: mvn -B verify

# .github/workflows/ci.yml - caching example
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Cache NuGet
        uses: actions/cache@v4
        with:
          path: ~/.nuget/packages
          key: nuget-${{ runner.os }}-${{ hashFiles('**/*.csproj') }}
          restore-keys: nuget-${{ runner.os }}-

      - name: Cache dotnet CLI
        uses: actions/cache@v4
        with:
          path: ~/.dotnet
          key: dotnetcli-${{ runner.os }}

      - run: dotnet restore
      - run: dotnet build --no-restore --configuration Release
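      # `dotnet test` has no --threshold flag; assumes coverlet.msbuild is referenced by the test project
      - run: dotnet test --configuration Release /p:CollectCoverage=true /p:CoverletOutputFormat=cobertura /p:Threshold=85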
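
All four caching examples follow the same pattern: an exact key built from a hash of the dependency manifests, plus a prefix fallback through `restore-keys`. The sketch below is a simplified model of that lookup, not GitHub's implementation; `cache_key` and `restore` are hypothetical names, and the in-memory dict stands in for the cache service.

```python
import hashlib
from typing import Dict, Iterable, List, Optional


def cache_key(prefix: str, os_name: str, file_contents: Iterable[bytes]) -> str:
    """Build a key that changes whenever any manifest file changes."""
    digest = hashlib.sha256()
    for content in file_contents:
        digest.update(hashlib.sha256(content).digest())
    return f"{prefix}-{os_name}-{digest.hexdigest()}"


def restore(store: Dict[str, str], key: str, restore_keys: List[str]) -> Optional[str]:
    """Try the exact key first, then each restore-key prefix in order."""
    if key in store:
        return store[key]  # exact hit: dependencies unchanged
    for prefix in restore_keys:
        matches = [k for k in store if k.startswith(prefix)]
        if matches:
            return store[matches[-1]]  # most recently stored partial match
    return None  # cold start: everything is downloaded again
```

A partial match restores a slightly stale cache, which is usually still much faster than a cold start because only the changed dependencies need to be fetched.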

Security Gates

# Security gates in CI
jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # SAST - Semgrep
      - uses: returntocorp/semgrep-action@v1
        with:
          config: p/security-audit
      # SCA - Dependabot / OWASP Dependency Check
      - uses: dependency-check/Dependency-Check_Action@main
        with:
          project: 'myapp'
          format: 'JSON'
          out: 'reports'
      # Secret scanning
      - uses: trufflesecurity/trufflehog@main
        with:
          path: ./
          base: ${{ github.event.pull_request.base.sha }}
          head: ${{ github.event.pull_request.head.sha }}
      # Container scanning
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'
          format: 'sarif'
          output: 'trivy-results.sarif'
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'

# Security gates in CI (C++)
jobs:
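  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # SAST - CodeQL (init, build, analyze; `languages` and `queries` are inputs of init)
      - uses: github/codeql-action/init@v3
        with:
          languages: cpp
          queries: security-extended
      - uses: github/codeql-action/autobuild@v3
      - uses: github/codeql-action/analyze@v3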
      # SCA - Dependency check
      - uses: dependency-check/Dependency-Check_Action@main
        with:
          project: 'myapp'
          format: 'HTML'
          out: 'reports'
      # Container scan
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'

# Security gates in CI (Java)
jobs:
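  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # SAST - CodeQL (init, build, analyze; `languages` is an input of init)
      - uses: github/codeql-action/init@v3
        with:
          languages: java
      - uses: github/codeql-action/autobuild@v3
      - uses: github/codeql-action/analyze@v3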
      # SCA - OWASP Dependency Check
      - uses: dependency-check/Dependency-Check_Action@main
        with:
          project: 'myapp'
          format: 'XML'
      # Secret scanning
      - uses: trufflesecurity/trufflehog@main
        with:
          path: ./
      # Container scan
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'

# Security gates in CI (C#)
jobs:
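  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # SAST - CodeQL (init, build, analyze; `languages` is an input of init)
      - uses: github/codeql-action/init@v3
        with:
          languages: csharp
      - uses: github/codeql-action/autobuild@v3
      - uses: github/codeql-action/analyze@v3
      # SCA - NuGet vulnerability audit (`dotnet nuget-audit` is not a dotnet command)
      - run: |
          dotnet restore
          dotnet list package --vulnerable --include-transitive 2>&1 | tee audit.log
          # The command exits 0 even when it reports issues, so fail on its report text
          ! grep -q "has the following vulnerable packages" audit.log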
      # Secret scanning
      - uses: trufflesecurity/trufflehog@main
        with:
          path: ./
      # Container scan
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}:${{ github.sha }}'
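
Uploading SARIF makes findings visible, but it does not fail the build by itself. A minimal sketch of a gate over a parsed SARIF document is shown below; it assumes the report has already been loaded with `json.load`, treats a missing `level` as `warning` (a simplification of the SARIF rules), and uses the hypothetical function name `blocking_findings`.

```python
from typing import Iterable, List


def blocking_findings(sarif: dict, blocking_levels: Iterable[str] = ("error",)) -> List[str]:
    """Return the rule IDs of all results whose level should block the pipeline."""
    levels = set(blocking_levels)
    findings: List[str] = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            level = result.get("level", "warning")  # simplified default
            if level in levels:
                findings.append(result.get("ruleId", "<unknown>"))
    return findings
```

A pipeline step could load `trivy-results.sarif`, call this function, and exit non-zero when the returned list is not empty.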

Infrastructure as Code

# main.tf - AWS EKS cluster with Terraform
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "myapp-${var.environment}"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_group_defaults = {
    ami_type       = "AL2_x86_64"
    instance_types = ["t3.medium"]
  }

  eks_managed_node_groups = {
    general = {
      min_size     = 3
      max_size     = 10
      desired_size = 5
      instance_types = ["t3.medium"]
    }

    spot = {
      min_size     = 0
      max_size     = 20
      desired_size = 5
      instance_types = ["t3.medium", "t3a.medium"]
      capacity_type  = "SPOT"
    }
  }
}

# CMakeLists.txt - C++ build infrastructure
cmake_minimum_required(VERSION 3.25)

# Toolchain for vcpkg: must be set before project() to take effect,
# and VCPKG_ROOT is read from the environment
set(CMAKE_TOOLCHAIN_FILE "$ENV{VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake"
    CACHE STRING "vcpkg toolchain file")

project(MyApp VERSION 1.0.0 LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

find_package(fmt REQUIRED)
find_package(spdlog REQUIRED)
# Package and target names are case-sensitive: Boost, not boost
find_package(Boost REQUIRED COMPONENTS system thread)

add_executable(main src/main.cpp)
target_link_libraries(main PRIVATE fmt::fmt spdlog::spdlog Boost::system Boost::thread)

# CI configuration
enable_testing()
add_subdirectory(tests)

// build.gradle.kts - Java/Gradle infrastructure
plugins {
    id("java")
    id("application")
    id("org.springframework.boot") version "3.2.0"
    id("io.spring.dependency-management") version "1.1.4"
}

java {
    toolchain.languageVersion.set(JavaLanguageVersion.of(21))
}

repositories {
    mavenCentral()
    maven { url = uri("https://repo.spring.io/milestone") }
}

dependencies {
    implementation(platform("org.springframework.boot:spring-boot-dependencies:3.2.0"))
    implementation("org.springframework.boot:spring-boot-starter-web")
    implementation("org.springframework.boot:spring-boot-starter-data-jpa")
    implementation("org.postgresql:postgresql")
    testImplementation("org.springframework.boot:spring-boot-starter-test")
}

// Kotlin DSL uses a type parameter here; withType(Test) is Groovy syntax
tasks.withType<Test> {
    useJUnitPlatform()
    testLogging {
        events("passed", "failed", "skipped")
    }
}

# azure-pipelines.yml - C#/Azure DevOps
trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

variables:
  buildConfiguration: 'Release'

steps:
- task: UseDotNet@2
  inputs:
    version: '8.0.x'
    performMultiLevelLookup: true

- task: DotNetCoreCLI@2
  displayName: 'Restore'
  inputs:
    command: 'restore'
    projects: '**/*.csproj'

- task: DotNetCoreCLI@2
  displayName: 'Build'
  inputs:
    command: 'build'
    arguments: '--configuration $(buildConfiguration) --no-restore'
    projects: '**/*.csproj'

- task: DotNetCoreCLI@2
  displayName: 'Test'
  inputs:
    command: 'test'
    projects: '**/*Tests.csproj'
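    # `dotnet test` has no --threshold option; enforcing an 85% floor would need
    # coverlet.msbuild (/p:CollectCoverage=true /p:Threshold=85) in the test project
    arguments: '--configuration $(buildConfiguration) --no-build --collect:"XPlat Code Coverage"'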

- task: PublishCodeCoverageResults@1
  inputs:
    codeCoverageTool: 'Cobertura'
    summaryFileLocation: '**/coverage.cobertura.xml'

Observability Pipeline

Layer	Tools	Purpose
Logs	Loki, ELK, Datadog	Debugging, audit
Metrics	Prometheus, Grafana, Datadog	Alerting, SLOs
Traces	Jaeger, Zipkin, Tempo	Distributed debugging
Alerts	Alertmanager, PagerDuty	On-call routing

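# Prometheus metrics in Python (the middleware decorator assumes a FastAPI app)
import time

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = FastAPI()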

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency', ['method', 'endpoint'])
ACTIVE_CONNECTIONS = Gauge('active_connections', 'Active connections')

@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.time()
    response = await call_next(request)
    REQUEST_COUNT.labels(request.method, request.url.path, response.status_code).inc()
    REQUEST_LATENCY.labels(request.method, request.url.path).observe(time.time() - start)
    return response

# Expose metrics
start_http_server(8000)

// Prometheus metrics in C++ (prometheus-cpp)
#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/gauge.h>
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

#include <chrono>
#include <functional>
#include <memory>
#include <string>

prometheus::Exposer exposer{"localhost:9090"};
auto registry = std::make_shared<prometheus::Registry>();

// Each Build*() call registers a metric family; labelled series are created with Add()
auto& requestCount = prometheus::BuildCounter().Name("http_requests_total").Help("Total HTTP requests").Register(*registry);
auto& requestLatency = prometheus::BuildHistogram().Name("http_request_duration_seconds").Help("Request latency").Register(*registry);
auto& activeConnections = prometheus::BuildGauge().Name("active_connections").Help("Active connections").Register(*registry);

// Request and Response stand in for the web framework's own types
class MetricsMiddleware {
public:
    void operator()(const Request& req, Response& res, std::function<void()> next) {
        auto start = std::chrono::steady_clock::now();
        next();
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        requestCount.Add({{"method", req.method}, {"endpoint", req.path}, {"status", std::to_string(res.status)}}).Increment();
        // Bucket boundaries (in seconds) are illustrative
        requestLatency.Add({{"method", req.method}, {"endpoint", req.path}},
                           prometheus::Histogram::BucketBoundaries{0.005, 0.05, 0.5, 5.0})
            .Observe(elapsed.count());
    }
};

// At startup, make the registry scrapeable: exposer.RegisterCollectable(registry);

// Micrometer + Prometheus in Java
import io.micrometer.core.instrument.*;
import io.micrometer.prometheus.*;

MeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

Counter requestCount = Counter.builder("http_requests_total")
    .description("Total HTTP requests")
    .register(registry);

Timer requestLatency = Timer.builder("http_request_duration_seconds")
    .description("Request latency")
    .register(registry);

Gauge.builder("active_connections", connectionPool, ConnectionPool::getActiveCount)
    .description("Active connections")
    .register(registry);

// Spring Boot actuator exposes /actuator/prometheus automatically

// Prometheus metrics in C#
using Prometheus;

var requestCount = Metrics.CreateCounter("http_requests_total", "Total HTTP requests", new[] { "method", "endpoint", "status" });
var requestLatency = Metrics.CreateHistogram("http_request_duration_seconds", "Request latency", new[] { "method", "endpoint" });
var activeConnections = Metrics.CreateGauge("active_connections", "Active connections");

app.Use(async (context, next) =>
{
    var start = DateTime.UtcNow;
    await next.Invoke();
    var elapsed = DateTime.UtcNow - start;

    requestCount.WithLabels(context.Request.Method, context.Request.Path, context.Response.StatusCode.ToString()).Inc();
    requestLatency.WithLabels(context.Request.Method, context.Request.Path).Observe(elapsed.TotalSeconds);
});

// Expose /metrics endpoint
app.MapMetrics();

# Prometheus metrics in Ruby (prometheus-client gem)
require "prometheus/client"

prometheus = Prometheus::Client.registry

REQUEST_COUNT = prometheus.counter(
  :http_requests_total, docstring: "Total HTTP requests",
  labels: %i[method endpoint status]
)
REQUEST_LATENCY = prometheus.histogram(
  :http_request_duration_seconds, docstring: "Request latency",
  labels: %i[method endpoint]
)
ACTIVE_CONNECTIONS = prometheus.gauge(
  :active_connections, docstring: "Active connections"
)

# Rack middleware
class MetricsMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    labels = { method: env["REQUEST_METHOD"], endpoint: env["PATH_INFO"] }
    REQUEST_COUNT.increment(labels: labels.merge(status: status))
    REQUEST_LATENCY.observe(
      Process.clock_gettime(Process::CLOCK_MONOTONIC) - start, labels: labels
    )
    [status, headers, body]
  end
end

# Expose metrics: mount Prometheus::Middleware::Exporter in config.ru

GitOps & ArgoCD

# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/myapp-config
    targetRevision: HEAD
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
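
The automated sync policy above boils down to a reconcile step: compare what Git declares with what the cluster runs, then create, update, or prune. The sketch below models only that comparison, with `prune` as a switch; it is not Argo CD's algorithm, `plan_sync` is a hypothetical name, and `selfHeal` (re-syncing when the live state drifts) is not modelled.

```python
from typing import Dict, List, Tuple

Manifest = Dict[str, dict]  # resource name -> declared spec


def plan_sync(desired: Manifest, live: Manifest, prune: bool = True) -> List[Tuple[str, str]]:
    """Return the (action, resource) steps needed to make live match desired."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    if prune:
        # Resources that exist in the cluster but are no longer declared in Git
        for name in sorted(live):
            if name not in desired:
                actions.append(("delete", name))
    return actions
```

With `prune` disabled, resources removed from Git stay in the cluster until someone deletes them by hand, which is the behaviour `prune: true` is meant to avoid.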

Disaster Recovery & Backup

RPO	RTO	Strategy
0	0	Multi-region active-active
<1 min	<5 min	Synchronous replication + auto-failover
<1 hour	<4 hours	Async replication + warm standby
<24 hours	<24 hours	Daily backups + documented restore
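
Reading the table as a lookup, the usual question is which tier is the least demanding one that still meets a service's recovery targets. The sketch below encodes the four rows and picks that tier; `TIERS` mirrors the table, while `cheapest_strategy` and the idea that later rows are cheaper are assumptions of this example.

```python
from datetime import timedelta
from typing import Optional

# (RPO bound, RTO bound, strategy), ordered from most to least demanding
TIERS = [
    (timedelta(0), timedelta(0), "Multi-region active-active"),
    (timedelta(minutes=1), timedelta(minutes=5), "Synchronous replication + auto-failover"),
    (timedelta(hours=1), timedelta(hours=4), "Async replication + warm standby"),
    (timedelta(hours=24), timedelta(hours=24), "Daily backups + documented restore"),
]


def cheapest_strategy(rpo_target: timedelta, rto_target: timedelta) -> Optional[str]:
    """Pick the least demanding tier whose RPO and RTO bounds both fit the targets."""
    for rpo, rto, name in reversed(TIERS):
        if rpo <= rpo_target and rto <= rto_target:
            return name
    return None  # only reachable with negative targets
```

A service that can lose 30 minutes of data but must be back within 2 hours lands on synchronous replication, because the warm-standby tier's 1-hour RPO bound exceeds the 30-minute target.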

Resources

Books

Continuous Delivery — Jez Humble & David Farley
Accelerate — Nicole Forsgren, Jez Humble, Gene Kim
The DevOps Handbook — Gene Kim, Jez Humble, Patrick Debois, John Willis
Team Topologies — Matthew Skelton & Manuel Pais
Building Secure and Reliable Systems — Heather Adkins et al. (Google SRE)

Talks & Papers

"Continuous Delivery" by Jez Humble — GOTO Conference
"The Four Key Metrics" by Nicole Forsgren — DevOps Enterprise Summit
"GitOps: What, Why, How" by Alexis Richardson — CNCF

Tools

Category	Tools
CI/CD	GitHub Actions, GitLab CI, CircleCI, Buildkite, Tekton
GitOps	ArgoCD, Flux, Fleet
Infrastructure	Terraform, Pulumi, Crossplane
Container	Docker, Podman, Buildah, Kaniko
Orchestration	Kubernetes, Nomad, ECS
Observability	Prometheus, Grafana, Loki, Tempo, Jaeger
Security	Trivy, Syft, Cosign, Kyverno

Summary: CI/CD Maturity Model

Level	CI	CD	Monitoring	Culture
1. Initial	Manual builds	Manual deploy	Reactive	Silos
2. Managed	Automated CI	Staging auto-deploy	Basic metrics	Dev/QA/Ops separate
3. Defined	CI + CD pipeline	Automated to prod	DORA metrics tracked	DevOps collaboration
4. Quantitatively Managed	Full automation	Progressive deploy	Predictive analytics	Full ownership
5. Optimising	Self-healing	AI-driven	Chaos engineering	Continuous learning

Start where you are. Improve continuously. Measure everything.