Optimizing Incremental Build Pipelines for Large-Scale Gatsby Micro-Frontends with Tekton and PVCs

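Our team inherited a project of more than fifty Gatsby sites, all managed in a single pnpm monorepo. These sites act as micro-frontends that together make up a large content portal. The initial CI process was crude: any merge to the main branch triggered a Jenkins job that pulled the entire repository, ran pnpm install, and then ran gatsby build for each micro-frontend in sequence. A full build took close to forty minutes, which is a disaster for a team that wants to iterate quickly. The problem was clear: nearly all of that time was spent on sites that had not changed and on repeated dependency installation.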

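The initial idea was to build a smart, incremental build system. It had to meet several core requirements:

Change detection: build only the micro-frontends whose code has changed since the last successful deployment.
Smart caching: the node_modules directory and each Gatsby site's .cache and public directories must be cached effectively, to speed up dependency installation and Gatsby's incremental builds.
Parallel execution: when several micro-frontends change at once, their builds should run in parallel to shorten total time.
Cloud native: the whole flow must run on Kubernetes to take advantage of its elasticity and scalability.

For tooling, Tekton was the first choice. As a Kubernetes-native CI/CD framework, its Task, Pipeline, and Workspace concepts map closely onto our requirements. A Workspace can be backed by a PersistentVolumeClaim (PVC), which gives a direct and reliable way to carry caches across Task runs. Gatsby itself supports build caching well: as long as the .cache and public directories are persisted, builds get significantly faster.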

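Architecture Design and Pipeline Planning

Our goal is a Tekton Pipeline that automates the whole incremental build and deployment flow.

The core logic of the pipeline breaks down into the following key steps (Tasks):

clone-repo: clones the code repository.
detect-changes: the brain of the pipeline. This task compares the current HEAD with the commit SHA of the last successful build, works out which micro-frontends (that is, which packages/* directories) have changed, and outputs the result.
restore-cache: before dependencies are installed and anything is built, restores node_modules and each micro-frontend's previous cache from the PVC-backed Workspace.
pnpm-install: installs all dependencies. With a warm cache this step is fast.
build-and-deploy: a task that runs in parallel. Based on the output of detect-changes, one task instance is started for each micro-frontend that needs building. Each instance runs gatsby build and uploads the build output to object storage.
update-cache: after a successful build, writes the latest node_modules and the cache directories of the built sites back to the PVC for the next run.

To visualize the flow, the structure of this Pipeline can be drawn with Mermaid.js.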

graph TD
    A[Start] --> B(clone-repo);
    B --> C(detect-changes);
    C --> D(restore-cache);
    D --> E(pnpm-install);
    E --> F{Changed Micro-frontends?};
    F -- Yes --> G(Matrix: Parallel build-and-deploy);
    F -- No --> I(End);
    G --> H(update-cache);
    H --> I;
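
The restore-cache and update-cache steps listed above could be sketched as per-package tar archives kept in a cache directory on the PVC. This is a minimal sketch under stated assumptions, not the project's actual implementation: the function names restore_cache and save_cache, the cache root argument, and the archive layout are all hypothetical.

```shell
#!/bin/sh
# Sketch: per-package Gatsby cache archives stored under a cache root on the PVC.
# The function names and the archive layout are assumptions made for illustration.

# restore_cache <cache_root> <package_dir>
# Unpacks .cache and public into the package directory when an archive exists.
restore_cache() {
  archive="$1/$(basename "$2").tar.gz"
  [ -f "$archive" ] || return 0   # cold start: nothing to restore
  tar -xzf "$archive" -C "$2"
}

# save_cache <cache_root> <package_dir>
# Writes to a temporary file and renames it afterwards, so an interrupted or
# failed run does not leave a truncated archive in place of the old one.
save_cache() {
  mkdir -p "$1"
  archive="$1/$(basename "$2").tar.gz"
  tmp="$archive.tmp.$$"
  if tar -czf "$tmp" -C "$2" .cache public; then
    mv "$tmp" "$archive"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Example usage inside a step (paths are hypothetical):
#   restore_cache /workspace/shared/cache packages/site-a
#   save_cache    /workspace/shared/cache packages/site-a
```

Renaming within the same filesystem replaces the archive in one step, which is why the temporary file lives next to its final location.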

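Core Task Implementations

In a real project, robust configuration matters. Below are the YAML definitions of the core Tasks, with the details and comments a production setup needs.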

Task 1: `detect-changes`

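This is the most important custom task in the pipeline. Its logic: take the commit hash of the last successful run (which can be stored in a ConfigMap or a dedicated git tag), run git diff against the current commit, and output a JSON array containing the names of all micro-frontend packages that changed.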

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: detect-changes
spec:
  description: >-
    Detects which Gatsby micro-frontend packages have changed since the
    last successful build commit.
  workspaces:
    - name: source
      description: The workspace containing the cloned git repository.
  params:
    - name: last-success-commit
      type: string
      description: "The commit SHA of the last successful pipeline run."
    - name: base-branch
      type: string
      description: "The base branch to compare against, e.g., main."
      default: "main"
  results:
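    - name: changed-packages
      # Declared as an array so a Pipeline matrix can fan out over it.
      # Depending on the Tekton version, array results under v1beta1 may need
      # a feature flag to be enabled.
      type: array
      description: "A JSON array of package names that have changed."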
    - name: has-changes
      description: "A string 'true' or 'false' indicating if there are any changes."
  steps:
    - name: detect
      image: alpine/git:v2.36.1
      workingDir: $(workspaces.source.path)
      script: |
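        #!/bin/sh
        set -e
        # alpine/git does not ship jq, which this script relies on. Installing it at
        # runtime assumes network access and a root user; an image that already
        # bundles git and jq is the sturdier choice.
        apk add --no-cache jq >/dev/null
        # The clone may have been written by a different UID, which recent git
        # versions reject as "dubious ownership" unless the path is marked safe.
        git config --global --add safe.directory "$(workspaces.source.path)"
        echo "Comparing HEAD with last successful commit: $(params.last-success-commit)"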

        # In a real scenario, you'd fetch the base branch to ensure the commit is available
        git fetch origin $(params.base-branch)

        # Check if last-success-commit is empty or doesn't exist, if so, build all
        if [ -z "$(params.last-success-commit)" ] || ! git cat-file -e $(params.last-success-commit)^{commit}; then
          echo "Last successful commit not found or invalid. Building all packages."
          # List all directories under packages/ as potential micro-frontends
          # -c keeps the JSON on one line, which helps with Tekton's result size limit
          PACKAGES=$(find packages -mindepth 1 -maxdepth 1 -type d -exec basename {} \; | jq -R . | jq -sc .)
        else
          # Find files changed between the last successful commit and HEAD
          CHANGED_FILES=$(git diff --name-only $(params.last-success-commit)..HEAD)
          
          # A common pitfall is not handling changes in root-level shared files (e.g., pnpm-lock.yaml, tsconfig.json).
          # If root files change, we must rebuild everything.
          ROOT_CHANGES=$(echo "${CHANGED_FILES}" | grep -v '^packages/' || true)

          if [ -n "${ROOT_CHANGES}" ]; then
            echo "Root level files changed. Rebuilding all packages."
            echo "${ROOT_CHANGES}"
            PACKAGES=$(find packages -mindepth 1 -maxdepth 1 -type d -exec basename {} \; | jq -R . | jq -sc .)
          else
            # Filter for changes within the 'packages' directory and extract the package name
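            # sed -n with the p flag prints only lines that matched, so a file sitting
            # directly in packages/ (e.g. packages/README.md) is not mistaken for a package.
            PACKAGES=$(echo "${CHANGED_FILES}" | sed -n -E 's|^packages/([^/]+)/.*|\1|p' | sort -u | jq -R . | jq -sc .)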
          fi
        fi
        
        echo "Changed packages: ${PACKAGES}"
        
        # Output the results for Tekton
        if [ "${PACKAGES}" = "[]" ]; then
          echo -n "false" | tee $(results.has-changes.path)
        else
          echo -n "true" | tee $(results.has-changes.path)
        fi
        echo -n "${PACKAGES}" | tee $(results.changed-packages.path)

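The script guards against a common mistake: checking only for changes under packages/. In a real project, when the root pnpm-lock.yaml, tsconfig.json, or a shared utility library changes, every micro-frontend that depends on it must be rebuilt. The implementation here simplifies this to a full rebuild whenever anything outside packages/ changes; a finer-grained approach requires building a complete dependency graph.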
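
A middle ground between "any root change rebuilds everything" and a full dependency graph is an explicit list of root paths that count as global. The sketch below illustrates that idea only; the function name classify_changes, the GLOBAL_PATTERNS list, and the shared/ directory are hypothetical and would need to match the real repository layout.

```shell
#!/bin/sh
# Sketch: classify a list of changed paths (one per line on stdin).
# GLOBAL_PATTERNS is a hypothetical list of paths whose change affects every site.
GLOBAL_PATTERNS='^(pnpm-lock\.yaml|pnpm-workspace\.yaml|package\.json|tsconfig\.json)$|^shared/'

# Prints "all" when a global file changed. Otherwise prints the names of the
# changed packages, one per line, and prints nothing if no package changed.
classify_changes() {
  changed=$(cat)
  if printf '%s\n' "$changed" | grep -Eq "$GLOBAL_PATTERNS"; then
    echo "all"
    return 0
  fi
  printf '%s\n' "$changed" | sed -n -E 's|^packages/([^/]+)/.*|\1|p' | sort -u
}

# Example usage (the commit range is whatever the pipeline passes in):
#   git diff --name-only "$LAST_SUCCESS_COMMIT"..HEAD | classify_changes
```

With this rule a change to README.md alone triggers no build, while a lockfile change still rebuilds every site.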

Task 2: `build-and-deploy`

This task is meant to be run through Tekton's Matrix feature, which creates one parallel execution instance per changed micro-frontend based on the output of detect-changes.

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: gatsby-build-and-deploy
spec:
  description: Builds a specific Gatsby site and deploys it to a storage bucket.
  workspaces:
    - name: source
      description: The workspace with source code and node_modules.
  params:
    - name: package-name
      type: string
      description: The name of the package to build (e.g., 'site-a').
    - name: gcs-bucket-name
      type: string
      description: The name of the GCS bucket to deploy to.
  steps:
    - name: build
      image: node:18-alpine
      workingDir: $(workspaces.source.path)
      script: |
        #!/bin/sh
        set -e
        echo "--- Building package: $(params.package-name) ---"
        
        # Navigate to the specific package directory
        cd packages/$(params.package-name)

        # Run the Gatsby build command
        # Gatsby automatically uses .cache and public directories for incremental builds
        npm run build
        
        # Error handling: Check if build output exists
        if [ ! -d "public" ] || [ -z "$(ls -A public)" ]; then
            echo "Error: Build failed, 'public' directory is empty or does not exist."
            exit 1
        fi
        echo "--- Build successful for $(params.package-name) ---"

    - name: deploy
      image: google/cloud-sdk:slim
      workingDir: $(workspaces.source.path)/packages/$(params.package-name)
      script: |
        #!/bin/sh
        set -e
        echo "--- Deploying package: $(params.package-name) to gs://$(params.gcs-bucket-name) ---"
        
        # In a production setup, you would use Workload Identity for auth
        # gcloud auth activate-service-account --key-file=/path/to/sa-key.json
        
        # Use rsync to upload. The -d flag deletes files in the destination that are not in the source.
        # This ensures a clean deployment.
        gsutil -m rsync -d -r public/ gs://$(params.gcs-bucket-name)/$(params.package-name)/
        
        echo "--- Deployment successful for $(params.package-name) ---"

The Pipeline That Ties It All Together

Now we combine the Tasks above into a complete Pipeline. The most important parts here are the Workspaces definition and the use of Matrix.

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: gatsby-micro-frontends-ci
spec:
  description: >-
    CI pipeline for a monorepo of Gatsby micro-frontends with
    incremental build and caching.
  workspaces:
    - name: shared-data
      description: |
        This workspace will be backed by a PVC to store the git repo,
        caches, and node_modules across tasks.
  params:
    - name: repo-url
      type: string
    - name: repo-revision
      type: string
    - name: last-success-commit
      type: string
    - name: gcs-bucket-name
      type: string

  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-data
      params:
        - name: url
          value: $(params.repo-url)
        - name: revision
          value: $(params.repo-revision)
        - name: depth
          value: "0" # A full clone is needed for git diff

    - name: find-changes
      runAfter: [fetch-source]
      taskRef:
        name: detect-changes
      workspaces:
        - name: source
          workspace: shared-data
      params:
        - name: last-success-commit
          value: $(params.last-success-commit)

    # Note: Cache restoration/update steps are simplified here for brevity.
    # A robust implementation would involve separate tasks to tar/untar cache directories
    # to and from a specific location within the PVC to avoid conflicts.
    # For this example, we assume direct usage of the workspace.

    - name: install-dependencies
      runAfter: [find-changes]
      taskRef:
        name: pnpm-install # Assuming a predefined pnpm task
      workspaces:
        - name: source
          workspace: shared-data
      when:
        - input: "$(tasks.find-changes.results.has-changes)"
          operator: in
          values: ["true"]

    - name: build-deploy-matrix
      runAfter: [install-dependencies]
      when:
        - input: "$(tasks.find-changes.results.has-changes)"
          operator: in
          values: ["true"]
      taskRef:
        name: gatsby-build-and-deploy
      workspaces:
        - name: source
          workspace: shared-data
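      params:
        - name: gcs-bucket-name
          value: $(params.gcs-bucket-name)
      # Tekton fans out one TaskRun per array element. The matrix parameter name
      # must match the Task parameter it feeds, and a whole-array reference uses
      # the [*] suffix. This requires changed-packages to be an array-typed result.
      matrix:
        params:
          - name: package-name
            value: $(tasks.find-changes.results.changed-packages[*])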

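In this Pipeline definition, workspaces.shared-data is the key. When a PipelineRun is created, we bind a PersistentVolumeClaim to it. The PVC is shared among the fetch-source, find-changes, install-dependencies, and build-deploy-matrix tasks, which is how state (source code, node_modules, caches) is passed from one task to the next.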
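
One way to create such a PipelineRun is the tkn CLI, which can bind an existing PVC to a workspace at start time. The following is a sketch under assumptions: the PVC name gatsby-ci-cache, the repository URL, and the bucket name are placeholders, and start_pipeline is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: start the pipeline with an existing PVC bound to shared-data.
# TKN can be overridden (e.g. TKN=echo) to preview the command without a cluster.
TKN="${TKN:-tkn}"

# start_pipeline <revision> <last-success-commit>
# The PVC name, repository URL, and bucket name below are placeholders.
start_pipeline() {
  "$TKN" pipeline start gatsby-micro-frontends-ci \
    --param repo-url="https://example.com/org/portal.git" \
    --param repo-revision="$1" \
    --param last-success-commit="$2" \
    --param gcs-bucket-name="example-portal-bucket" \
    --workspace name=shared-data,claimName=gatsby-ci-cache \
    --showlog
}

# Example usage:
#   start_pipeline main 4f2a9c1
```

Because the claim is named explicitly rather than created from a template, the same volume, and therefore the same caches, is reused on every run.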

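The matrix draws its values from the changed-packages result of the find-changes task, and this is where Tekton shows its strength. Given an array result, it starts one instance of the gatsby-build-and-deploy task per element, and those instances run in parallel. Depending on the Tekton version, Matrix and array results may have to be enabled through feature flags.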

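Running It and Practical Caching Considerations

To keep this system stable in production, there are a few pitfalls to watch for:

PVC concurrency: a PVC with the ReadWriteOnce (RWO) access mode can, on most cloud providers, only be mounted by Pods on a single node. If your Tekton Task Pods are scheduled to different nodes, the parallel tasks will fail. The fix is either a storage class that supports ReadWriteMany (RWX), such as NFS or GlusterFS, or making sure all parallel tasks land on the same node, which requires that node to have enough resources for them. Tekton's affinity assistant, where enabled, attempts this co-scheduling for TaskRuns that share a PVC-backed workspace.
Cache pollution: if a build task fails and leaves corrupted cache files behind, it can affect later pipeline runs. A robust update-cache task should run only after the build has fully succeeded. You can go further and create a unique cache subdirectory at the start of each run, keyed by branch name or commit hash, to isolate caches from one another.
Managing last-success-commit: how this value is passed in and updated is critical. A common approach is to have a final task (a finally task) write the current commit hash to a Git tag or a dedicated ConfigMap once the pipeline has succeeded. When a PipelineRun starts, last-success-commit is read from that location and passed in as a parameter.

For example, a finally task that updates the commit tag: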

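finally:
  - name: update-success-tag
    # finally tasks run whether or not the pipeline succeeded, so guard on the
    # aggregate status. "Completed" covers runs in which some tasks were skipped.
    when:
      - input: "$(tasks.status)"
        operator: in
        values: ["Succeeded", "Completed"]
    workspaces:
      - name: source
        workspace: shared-data
    taskSpec:
      workspaces:
        - name: source
      params:
        - name: git-tag
          default: "last-prod-success"
      steps:
        - name: tag-commit
          image: alpine/git:v2.36.1
          workingDir: $(workspaces.source.path)
          script: |
            #!/bin/sh
            set -e
            # Configure git user (the email is a placeholder address)
            git config --global user.email "ci-bot@example.com"
            git config --global user.name "CI Bot"
            # Move the tag to the current commit and force-push only that tag
            git tag -f $(params.git-tag) HEAD
            git push -f origin refs/tags/$(params.git-tag)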

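Because finally tasks run regardless of whether the earlier tasks succeeded, the tag update has to be guarded on the pipeline's aggregate status. Only then does the baseline commit advance solely after the whole pipeline, including every parallel build, has succeeded.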
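
On the reading side, whatever triggers the next PipelineRun has to turn that tag back into a commit SHA. A minimal sketch, assuming the tag has already been fetched into the local clone (for instance with git fetch origin --tags); the helper name last_success_commit is hypothetical.

```shell
#!/bin/sh
# last_success_commit <repo_dir> [tag]
# Prints the commit SHA the tag points at, or an empty line when the tag does
# not exist. The default tag name matches the finally task above.
last_success_commit() {
  git -C "$1" rev-parse --verify --quiet \
    "refs/tags/${2:-last-prod-success}^{commit}" || echo ""
}

# Example usage:
#   sha=$(last_success_commit /path/to/clone)
#   # An empty value makes detect-changes fall back to building every package.
```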

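Limitations and Future Iterations

The PVC-based caching scheme works, but it has physical limits. Once the number of micro-frontends grows into the hundreds, or the repository becomes very large, a single PVC can turn into an I/O bottleneck. All parallel build tasks read and write the same disk at once, and performance degrades.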

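One future optimization is a distributed cache. The build-and-deploy task could be changed to download the tar.gz cache archive for its package from S3 or another object store when it starts, and to upload the refreshed cache when the build ends. This needs more involved scripting to manage cache keys (for example, keys based on the package name and a hash of pnpm-lock.yaml), but it decouples compute from storage and scales better.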
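
A cache key of that kind could be derived as follows. This is only a sketch: the gatsby-cache/ prefix, the 16-character truncation, and the function name cache_key are arbitrary choices made for illustration, and it assumes sha256sum is available in the build image.

```shell
#!/bin/sh
# cache_key <package_name> <lockfile>
# Prints an object key of the form
#   gatsby-cache/<package_name>/<first 16 hex chars of the lockfile's SHA-256>.tar.gz
# The prefix and the truncation length are illustrative choices.
cache_key() {
  hash=$(sha256sum "$2" | cut -c1-16)
  printf 'gatsby-cache/%s/%s.tar.gz\n' "$1" "$hash"
}

# Example usage (the bucket name is a placeholder):
#   key=$(cache_key site-a pnpm-lock.yaml)
#   aws s3 cp "s3://example-cache-bucket/$key" /tmp/cache.tar.gz || echo "cache miss"
```

Tying the key to the lockfile means a dependency change produces a new key, so a cache built against old dependencies is never restored.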

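Another direction is finer-grained change detection. The current implementation takes a blunt approach to root-level changes and rebuilds everything. A monorepo tool such as Nx or Turborepo could be introduced instead. Running npx nx affected:apps --base=$(params.last-success-commit) --head=HEAD inside a Tekton task (the exact subcommand varies by Nx version) uses the project dependency graph to identify precisely which applications a change affects, which allows more precise incremental builds and avoids unnecessary repeated work. This points toward combining domain-specific build logic with a general-purpose CI platform.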