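GitHub Actions: Setting Up and Managing Self-hosted Runners

Tags: #DevOps #GitHub Actions #Self-hosted Runner #ARC #Kubernetes

From a single-machine runner to Actions Runner Controller on K8s, covering installation, autoscaling, security, and real-world scenarios

When should you use a self-hosted runner?

GitHub-hosted runners work well, and roughly 80% of use cases should stay on them. Self-hosted runners are usually worth considering in these cases:

Private network access: a DB behind a VPN, internal APIs, a private registry
Special hardware: GPUs (ML training), large memory (big compiles), native ARM (no emulation)
High volume of repetitive jobs: 10,000+ builds a day, where the hosted bill balloons
Long-running jobs: hosted runners cap a job at 6 hours
Local caching: sharing Docker layers / build cache across jobs in a large monorepo
Compliance: some industries require CI to run inside the internal environment

When not to use them:

Public open source projects: fork PRs can run arbitrary commands (detailed later)
A single small team: setup and maintenance cost more than hosted runners
Just wanting it free: standard hosted runners are already free for public repos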

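Hosted vs. Self-hosted comparison

Item	GitHub Hosted	Self-hosted
Setup time	0	Hours to days
Maintenance cost	0	Ongoing (updates, monitoring, scaling)
Monthly cost	Billed per minute	Own hardware / cloud bill
Environment isolation	Fresh VM for every job	Shared host by default; isolation is on you
Network access	Public internet	Can reach private networks
Special hardware	Limited to the SKUs GitHub offers	Anything
Startup time	~15-30 s	Instant (always-on) to ~1-2 min (dynamic)
Concurrency	Limited by account plan	Up to you
Public repo safety	✅ Safe	❌ Do not use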

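Cost estimation example

Assume a large Linux x64 runner:

Hosted: USD $0.064/min, 10,000 min per month → $640
Self-hosted on AWS c5.2xlarge: about $200/month (24/7) + maintenance time

But for a 64-core ultra machine:

Hosted: $0.512/min, 10,000 min per month → $5,120
Self-hosted ARM Graviton c7g.16xlarge: about $1,500/month

→ For large machines, self-hosting has a clear cost advantage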
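The arithmetic above can be sketched as a small calculation. This is only an illustration using the example figures from this section (not current list prices); the function names are made up for the sketch, and maintenance time is left out.

```python
def hosted_monthly_cost(rate_per_minute: float, minutes: int) -> float:
    """Hosted runners bill per minute of job time."""
    return rate_per_minute * minutes


def break_even_minutes(self_hosted_monthly: float, rate_per_minute: float) -> float:
    """Monthly job minutes above which a flat self-hosted bill is cheaper."""
    return self_hosted_monthly / rate_per_minute


# Example figures from the text above (illustrative only).
large = hosted_monthly_cost(0.064, 10_000)
ultra = hosted_monthly_cost(0.512, 10_000)

print(f"large: hosted ${large:.0f}/month vs self-hosted $200/month")
print(f"ultra: hosted ${ultra:.0f}/month vs self-hosted $1500/month")
print(f"large breaks even at {break_even_minutes(200, 0.064):.0f} min/month")
print(f"ultra breaks even at {break_even_minutes(1500, 0.512):.0f} min/month")
```

The break-even figure is the more useful number in practice: below it, the always-on machine costs more than paying per minute.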

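Basic setup

Where to run it

Single machine: for testing
Always-on EC2 / VM: small scale
EC2 Spot Fleet: cost saving
Kubernetes (ARC): recommended for medium to large scale

Installing a single-machine runner

Get a token from the Actions → Runners page of the repo, org, or enterprise, then run the following on the runner machine: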

# Create a directory
mkdir actions-runner && cd actions-runner

# Download the latest release (take the version and URL from the GitHub page)
curl -o actions-runner-linux-x64.tar.gz -L \
  https://github.com/actions/runner/releases/download/v2.319.1/actions-runner-linux-x64-2.319.1.tar.gz

tar xzf actions-runner-linux-x64.tar.gz

# Configure (with --unattended, everything comes from the flags and nothing is prompted)
./config.sh \
  --url https://github.com/my-org/my-repo \
  --token <REGISTRATION_TOKEN> \
  --name my-runner-01 \
  --labels self-hosted,linux,x64,docker \
  --work _work \
  --unattended

# Start it
./run.sh

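Registration scope

Scope	Config flag	Use for
Repo level	`--url https://github.com/OWNER/REPO`	A single repo
Org level	`--url https://github.com/OWNER`	Shared across the organization
Enterprise	`--url https://github.com/enterprises/NAME`	Across organizations

Org level is the usual choice, with runner groups controlling which repos may use the runners.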
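When registration is scripted, the three scopes differ only in the URL passed to --url. A minimal sketch of that mapping follows; the function name and the scope strings are hypothetical, chosen for this example.

```python
def runner_config_url(scope: str, owner: str, repo: str = "") -> str:
    """Build the --url value for config.sh for a given registration scope."""
    base = "https://github.com"
    if scope == "repo":
        if not repo:
            raise ValueError("repo scope needs a repository name")
        return f"{base}/{owner}/{repo}"
    if scope == "org":
        return f"{base}/{owner}"
    if scope == "enterprise":
        # For the enterprise scope, `owner` is the enterprise slug.
        return f"{base}/enterprises/{owner}"
    raise ValueError(f"unknown scope: {scope}")


print(runner_config_url("org", "my-org"))
```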

Running as a systemd service

sudo ./svc.sh install
sudo ./svc.sh start
sudo ./svc.sh status

Using self-hosted runners in a workflow

jobs:
  build:
    runs-on: [self-hosted, linux, docker]   # the runner must have all of these labels
    steps:
      - uses: actions/checkout@v4
      - run: docker build .

Runner groups and labels

Runner groups (Org / Enterprise only)

Runner groups partition runners and control which repos may use each group:

Org Runners
├── Default Group         ← available to all repos
├── Production Group      ← only prod-* repos
│   ├── runner-prod-01
│   └── runner-prod-02
└── GPU Group             ← only the ML team's repos
    ├── gpu-runner-01
    └── gpu-runner-02

To set one up:

Org Settings → Actions → Runner groups → New runner group
Restrict which repos and workflows may use it
Pass --runnergroup "Production Group" when configuring the runner

Labels

Each runner can carry multiple labels:

./config.sh ... --labels self-hosted,linux,x64,gpu,nvidia-a100

In a workflow, runs-on: lists every label the runner must have (AND logic):

runs-on: [self-hosted, gpu, nvidia-a100]   # all must match
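The AND logic amounts to a subset check: a runner is eligible only if it carries every label the job asks for. The sketch below is a simplified model, not GitHub's scheduler, and it assumes labels compare case-insensitively.

```python
def runner_matches(runner_labels: list[str], required_labels: list[str]) -> bool:
    """True when the runner carries every label listed in runs-on."""
    have = {label.lower() for label in runner_labels}
    return all(label.lower() in have for label in required_labels)


gpu_runner = ["self-hosted", "linux", "x64", "gpu", "nvidia-a100"]
cpu_runner = ["self-hosted", "linux", "x64"]
job = ["self-hosted", "gpu", "nvidia-a100"]

print(runner_matches(gpu_runner, job))  # extra labels on the runner are fine
print(runner_matches(cpu_runner, job))  # missing gpu / nvidia-a100
```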

GitHub's default labels:

self-hosted
OS: linux, macos, windows
Architecture: x64, arm, arm64

Custom labels are used to:

Distinguish purpose (build, deploy, gpu)
Distinguish environment (prod-network, vpn-allowed)
Distinguish hardware (high-memory, nvme)

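Ephemeral Runner

By default a runner is persistent:

It stays registered after a job and waits for the next one
State lingers (files in /tmp, installed tools, caches)
Consecutive jobs can contaminate each other

Ephemeral mode

The runner takes one job, then deregisters itself and exits:

./config.sh ... --ephemeral

Pros:

Every job gets a clean environment (provided the VM or container is recreated along with the runner)
Better security (an attacker cannot persist)
No state contamination

Cons:

Every runner has to register and start again
Slower startup (tens of seconds to a minute)
No accumulated build cache

Where ephemeral fits

Public repos: ephemeral is mandatory (see the security section below)
ARC on K8s: ephemeral by design
EC2 Auto Scaling + spot: each instance boots, runs one job, and terminates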
Actions Runner Controller (ARC)

ARC is GitHub's officially supported way to run self-hosted runners on K8s.

ARC architecture

Kubernetes Cluster
├── arc-systems (namespace)
│   ├── controller-manager (Deployment)
│   │     ← manages the runner lifecycle
│   └── Listener (Pod, one per scale set)
│         ← watches the GitHub Actions queue
│
└── arc-runners (namespace)
    └── AutoscalingRunnerSet
          └── Ephemeral Runner Pods
              ├── runner-pod-1
              ├── runner-pod-2
              └── runner-pod-3

ARC wraps each runner in a K8s Pod that is destroyed after its job, and scales the number of pods with the queue size.

Installing ARC

ARC is deployed with Helm:

# Install the controller
helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.9.3

Deploying a runner scale set

values.yaml:

githubConfigUrl: "https://github.com/my-org"
githubConfigSecret:
  github_token: ghp_xxxxxxxx   # PAT; a GitHub App uses different keys (see below)

runnerScaleSetName: "k8s-runners"

minRunners: 1
maxRunners: 10

containerMode:
  type: "dind"    # docker-in-docker

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]   # set by the chart's default template; keep it when overriding
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "4"
            memory: "8Gi"

helm install k8s-runners \
  --namespace arc-runners \
  --create-namespace \
  --values values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.9.3

Using it in a workflow

jobs:
  build:
    runs-on: k8s-runners   # use the scale set name directly
    steps:
      - uses: actions/checkout@v4
      - run: docker build .

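GitHub App vs PAT

ARC has to authenticate to GitHub, in one of two ways:

Method	Pros	Cons
PAT	Simple to set up	Tied to a personal account; broad permissions
GitHub App	Works like a service account; fine-grained permissions; not tied to any individual's account	More setup

A GitHub App is strongly recommended for production: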

githubConfigSecret:
  github_app_id: "12345"
  github_app_installation_id: "67890"
  github_app_private_key: |
    -----BEGIN RSA PRIVATE KEY-----
    ...
    -----END RSA PRIVATE KEY-----

Autoscaling strategies

ARC's built-in autoscaling

An AutoscalingRunnerSet scales automatically with the GitHub Actions pending queue:

minRunners: 0       # can scale to 0 when idle
maxRunners: 50      # at most 50 at peak

How it works:

The listener pod watches the GitHub Actions API
A pending job appears → it tells the controller to add runner pods
The job finishes → the runner pod is destroyed
The queue drains → scale back to minRunners
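The scaling behaviour described above can be approximated by clamping demand between the two bounds. This is a deliberately simplified model for reasoning about minRunners and maxRunners, not ARC's actual algorithm; the function and parameter names are assumptions made for the sketch.

```python
def desired_runners(pending_jobs: int, running_jobs: int,
                    min_runners: int, max_runners: int) -> int:
    """Clamp current demand (pending + running jobs) to [min_runners, max_runners]."""
    demand = pending_jobs + running_jobs
    return max(min_runners, min(max_runners, demand))


# Idle queue with minRunners: 0 scales to zero.
print(desired_runners(0, 0, min_runners=0, max_runners=50))
# A burst of 80 queued jobs is capped at maxRunners.
print(desired_runners(80, 0, min_runners=0, max_runners=50))
```

With a cap in place, the jobs above maxRunners simply wait in the queue, which is what a pending-jobs alert should catch.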

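EC2 Auto Scaling Group

Autoscaling also works without K8s, using an ASG + Spot:

ASG min=0, max=10, desired=0
├── Launch template:
│   - On boot, run `./config.sh --ephemeral && ./run.sh`
│   - User data supplies RUNNER_TOKEN (registration tokens expire after an hour, so fetch a fresh one at boot rather than baking it in)
└── Scaling rules:
    - A CloudWatch alarm on GitHub queue length
    - Or a webhook that triggers a Lambda to change desired

Or use a community solution such as the Philips Labs Terraform module.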
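The webhook-driven rule can be sketched as a pure function that a Lambda handler might call. It assumes the GitHub workflow_job webhook payload (its action and workflow_job.labels fields); the function name and the label filter are hypothetical, and applying the result to the ASG (for example with the AWS SDK) is left out.

```python
def next_desired_capacity(event: dict, current: int, max_size: int,
                          required_label: str = "self-hosted") -> int:
    """Return the new desired capacity after one workflow_job webhook event.

    Only scale-out is modeled: a queued job that targets our runners adds one
    instance, capped at max_size. Scale-in is assumed to happen when each
    ephemeral instance terminates itself after its job.
    """
    if event.get("action") != "queued":
        return current
    labels = event.get("workflow_job", {}).get("labels", [])
    if required_label not in labels:
        return current  # job targets hosted or other runners
    return min(max_size, current + 1)


queued = {"action": "queued", "workflow_job": {"labels": ["self-hosted", "linux"]}}
print(next_desired_capacity(queued, current=0, max_size=10))
```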

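Just-in-time (JIT) registration

A more advanced approach: each time a runner is started, request a one-time JIT config from the GitHub API and hand it to the new runner:

# Request a JIT config from the GitHub API
# (-F sends runner_group_id as an integer; labels[] builds a JSON array)
JIT_CONFIG=$(gh api -X POST /orgs/my-org/actions/runners/generate-jitconfig \
  -f name=runner-$(date +%s) \
  -F runner_group_id=1 \
  -f 'labels[]=self-hosted' -f 'labels[]=linux' -f 'labels[]=ephemeral' \
  --jq '.encoded_jit_config')

# Start the runner with the JIT config; it runs one job and exits
./run.sh --jitconfig "$JIT_CONFIG"

Benefits: no persistent registration token is needed, runner names are unique, and it suits stateless autoscaling.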
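The same request can be made from an autoscaler written in code instead of the gh CLI. The sketch below uses only the standard library and assumes the same organization-level endpoint and request fields as the command above; the function names are hypothetical, and the token is whatever credential your autoscaler already holds.

```python
import json
import urllib.request


def build_jit_request(org: str, name: str, runner_group_id: int,
                      labels: list[str], token: str) -> urllib.request.Request:
    """Prepare the POST request that asks GitHub for a one-time JIT config."""
    url = f"https://api.github.com/orgs/{org}/actions/runners/generate-jitconfig"
    body = {"name": name, "runner_group_id": runner_group_id, "labels": labels}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def fetch_jit_config(request: urllib.request.Request) -> str:
    """Send the request and return the encoded config for run.sh --jitconfig."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["encoded_jit_config"]
```

Keeping request construction separate from the network call makes the payload easy to check without touching the API; only fetch_jit_config talks to GitHub.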

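Security considerations

⚠️ The most important security rule

Never use a persistent self-hosted runner on a public repo

Why:

Anyone can open a PR against your repo
The workflow YAML in the PR gets executed (first-time contributors hit an approval gate, but anyone who has contributed before runs immediately)
An attacker can run arbitrary commands in the workflow (crypto mining, stealing secrets, planting backdoors)
A persistent runner's environment stays compromised: tools are tampered with, secrets leak, and every later job is affected

GitHub's documentation recommends using self-hosted runners only with private repositories.

Mitigations (if it has to be public)

Enforce ephemeral: every job gets a fresh environment
Require approval for first contributions: Settings → Actions → Require approval for first-time contributors
Isolate the network: the runner cannot reach internal resources
Limit secrets: keep sensitive secrets off this runner
Review every workflow change in PRs: use CODEOWNERS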

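Security for private repos

Even with private repos:

Do not give runners broad cloud permissions: scope them down with an IAM role
Segment the network: a runner should not be able to reach every internal service
Isolate with containers: run jobs in docker / k8s containers rather than on bare metal
Keep the runner agent updated: patch CVEs quickly
Centralize logs: be able to trace who ran what

Handling runner tokens

A registration token is valid for one hour, but once config.sh has run, the runner keeps its own long-lived credentials in its directory (files such as .runner and .credentials), and anyone who can read them can impersonate the runner. Best practices:

Use JIT config instead of a persistent registration
Use a GitHub App rather than a PAT
Never commit runner config files to git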

Monitoring and operations

Health checks

A running runner shows up as the processes Runner.Listener and Runner.Worker:

# Is the listener process (the one that talks to GitHub) still up?
pgrep -af Runner.Listener

# Is a worker running a job?
pgrep -af Runner.Worker

Log locations

actions-runner/
├── _diag/          # diagnostic logs
│   ├── Runner_*.log    # listener: startup / connection
│   └── Worker_*.log    # one per job
└── _work/          # per-job workspaces

_diag/Runner_*.log is the key file for troubleshooting.

Prometheus metrics

ARC has built-in metrics export, enabled through the controller values:

# controller values
metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"

Key metrics:

gha_assigned_jobs (jobs assigned)
gha_running_jobs (jobs running)
gha_pending_jobs (jobs waiting)
gha_started_jobs_total (cumulative)
gha_completed_jobs_total{job_result="success|failure"}

Alerting suggestions

Metric	Condition	Meaning
`gha_pending_jobs`	> 5 for 10 minutes	Too few runners, or runners stuck
Runner pod restart rate	High	OOM or runner agent failure
`gha_completed_jobs_total{job_result="failure"}`	Rising abnormally	Environment problem
Runner offline	> 5 minutes	Runner lost its connection to GitHub
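The "> 5 for 10 minutes" condition means every sample in the window must exceed the threshold, not just the latest one. A minimal sketch of that check over scraped samples follows; in practice this would be a Prometheus alert rule with a `for:` duration, and the function name and sample format here are assumptions for illustration.

```python
def sustained_above(samples: list[tuple[float, float]],
                    threshold: float, window_seconds: float) -> bool:
    """samples are (unix_time, value) pairs sorted by time.

    True only if the data spans the whole window and every sample inside
    the window is above the threshold.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    window_start = latest - window_seconds
    covers_window = samples[0][0] <= window_start
    in_window = [value for ts, value in samples if ts >= window_start]
    return covers_window and all(value > threshold for value in in_window)


# One sample per minute for 11 minutes, always 6 pending jobs.
steady = [(minute * 60.0, 6.0) for minute in range(12)]
print(sustained_above(steady, threshold=5, window_seconds=600))
```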

Real-world scenarios

Scenario 1: Reaching a private-network DB from a self-hosted runner

jobs:
  integration-test:
    runs-on: [self-hosted, linux, vpn-allowed]   # runners with this label sit inside the VPN
    steps:
      - uses: actions/checkout@v4
      - run: npm test
        env:
          DB_HOST: internal-db.corp.local   # only visible inside the VPN
          DB_PASSWORD: ${{ secrets.DB_PASSWORD }}

Deployment: place these runners in the VPN subnet, and let only the repos that need the internal DB use this runner group.

Scenario 2: GPU training job

jobs:
  train:
    runs-on: [self-hosted, gpu, nvidia-a100]
    steps:
      - uses: actions/checkout@v4
      - run: |
          nvidia-smi
          python train.py
        env:
          CUDA_VISIBLE_DEVICES: 0

On K8s, request the GPU resource:

template:
  spec:
    containers:
      - name: runner
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: all

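Scenario 3: Docker in Docker with a shared layer cache

# ARC values.yaml
containerMode:
  type: "dind"

template:
  spec:
    containers:
      - name: dind
        image: docker:24-dind
        securityContext:
          privileged: true   # required by dind
        volumeMounts:
          - name: docker-cache
            mountPath: /var/lib/docker
    volumes:
      - name: docker-cache
        persistentVolumeClaim:
          claimName: docker-cache-pvc   # shared across runner pods

Builds then reuse the layer cache, which speeds them up considerably. Docker does not support several daemons writing to the same /var/lib/docker at once, so this is only safe while one runner pod uses the volume at a time; with concurrent runners, a registry-backed BuildKit cache is the safer choice.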

Scenario 4: Saving money with Spot Instances

EC2 Spot + Auto Scaling Group:

# Terraform
resource "aws_autoscaling_group" "runners" {
  min_size         = 0
  max_size         = 50
  desired_capacity = 0

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0  # 100% spot
      spot_allocation_strategy                  = "price-capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.runner.id
      }
      override { instance_type = "c5.4xlarge" }
      override { instance_type = "c5a.4xlarge" }
      override { instance_type = "c5n.4xlarge" }
    }
  }
}

Pair this with ephemeral runners: each EC2 spot instance boots, runs one job, and terminates as soon as it finishes.

Best practices

1. Always use ephemeral (unless there is a strong reason not to)

./config.sh ... --ephemeral

With ARC, runners are ephemeral by default.

2. Restrict workflow scope

Org / Repo Settings → Actions → General:

Fork pull request workflows: require approval
Workflow permissions: read-only by default

3. Consistent label naming

Use one label naming scheme across the org:

self-hosted, linux, x64, [purpose], [environment]

Examples:
  self-hosted, linux, x64, build, default
  self-hosted, linux, x64, deploy, production
  self-hosted, linux, arm64, build, default
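A naming scheme like this is easy to enforce with a small check in provisioning scripts. The sketch below validates the five-position convention shown above; the purpose and environment allow-lists are hypothetical values taken from this article's examples, so adjust them to your own scheme.

```python
OS_LABELS = {"linux", "macos", "windows"}
ARCH_LABELS = {"x64", "arm", "arm64"}
PURPOSES = {"build", "deploy", "gpu"}        # hypothetical allow-list
ENVIRONMENTS = {"default", "production"}     # hypothetical allow-list


def label_problems(labels: list[str]) -> list[str]:
    """Return convention violations; an empty list means the labels are valid."""
    if len(labels) != 5:
        return [f"expected 5 labels, got {len(labels)}"]
    first, os_name, arch, purpose, environment = labels
    problems = []
    if first != "self-hosted":
        problems.append("first label must be self-hosted")
    if os_name not in OS_LABELS:
        problems.append(f"unknown os: {os_name}")
    if arch not in ARCH_LABELS:
        problems.append(f"unknown arch: {arch}")
    if purpose not in PURPOSES:
        problems.append(f"unknown purpose: {purpose}")
    if environment not in ENVIRONMENTS:
        problems.append(f"unknown environment: {environment}")
    return problems


print(label_problems(["self-hosted", "linux", "x64", "build", "default"]))
print(label_problems(["self-hosted", "linux", "amd64", "build", "default"]))
```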

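4. Tier runners and secrets together

Runner Group	Secrets available to workflows
`default`	Low-sensitivity secrets only
`production`	May access prod secrets
`gpu`	No access to any deploy secrets

5. Keep the runner agent up to date

A standalone runner updates itself by default. If that is turned off (--disableupdate) or the runner is baked into an image, set up an update process, because GitHub stops assigning jobs to runners that fall too far behind (about 30 days after a new release):

# Stop
sudo ./svc.sh stop

# Download the new version
curl -o ... && tar xzf ...

# Start
sudo ./svc.sh start

With ARC, upgrade the charts with helm upgrade and roll out a newer runner image.

6. Isolate with K8s containers

Bare-metal self-hosted runners have no isolation between jobs. Running on K8s containers instead gives you:

A separate pod per job
Destroyed after the job
Unneeded Linux capabilities dropped
AppArmor / SELinux profiles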

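FAQ

Q1: The runner shows `Runner is offline` but the process is still there?

Possible causes:

Network: the runner cannot reach GitHub; check https://api.github.com/_ping
Registration gone: GitHub removes a runner that has not connected for 14 days (1 day for ephemeral runners), after which it must be configured again with a fresh registration token (valid for 1 hour)
GitHub-side issue: check status.github.com
Disk full: too much has accumulated under _work

Q2: A job gets stuck midway with no logs

Usually:

The runner restarted and dropped the connection
The K8s pod was OOM-killed: check kubectl get events
A command in the workflow hangs (no timeout set)

Prevention:

jobs:
  test:
    runs-on: self-hosted
    timeout-minutes: 30   # hard timeout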

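Q3: The ARC controller reports `unauthorized`

Usually the GitHub App / PAT lacks permissions:

A classic PAT needs repo (repo-level runners) or admin:org (org-level runners)
A GitHub App needs Administration: Read & Write (repo level) or Self-hosted runners: Read & Write (org level)

Q4: Ephemeral runners start slowly

Startup time includes:

Pod scheduling (the K8s scheduler finding a node)
Image pull (speed up with a local mirror / cache)
config.sh registering with GitHub (~10-30 s)
Picking up a job and starting it

Optimizations:

Pre-pull the image
Reserve node capacity (node selector)
Use JIT config instead of registration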

Q5: `docker build` in a runner pod fails with a permission error

Running docker on K8s needs dind or podman:

containerMode:
  type: "dind"

Or switch to a rootless build tool such as BuildKit / kaniko.

Q6: Can self-hosted and GitHub-hosted runners be mixed?

Yes. Each job in a workflow can target a different runner:

jobs:
  build:
    runs-on: ubuntu-latest   # GitHub Hosted
  deploy:
    runs-on: [self-hosted, prod-network]   # self-hosted
    needs: build

Common pattern: build on hosted, deploy on self-hosted (because it needs private network access).

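Q7: `_work` is not cleaned after jobs and the disk keeps filling up

Persistent runners do not clean up automatically. Handle it by hand:

# Cron entry for periodic cleanup
0 3 * * * find /opt/actions-runner/_work -mtime +7 -delete

Or switch to ephemeral runners and remove the problem at the root.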
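If the cleanup needs a dry-run mode or logging, the cron one-liner can be replaced by a short script. The sketch below removes top-level entries of the work directory that have not been modified for a given number of days; the function names are hypothetical, and it should only run while the runner is idle.

```python
import shutil
import time
from pathlib import Path


def stale_entries(work_dir: Path, max_age_days: float) -> list[Path]:
    """Top-level entries in work_dir whose mtime is older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    return sorted(p for p in work_dir.iterdir() if p.stat().st_mtime < cutoff)


def clean_workspace(work_dir: Path, max_age_days: float,
                    dry_run: bool = True) -> list[Path]:
    """Delete stale entries (or only list them when dry_run is True)."""
    stale = stale_entries(work_dir, max_age_days)
    if not dry_run:
        for path in stale:
            if path.is_dir() and not path.is_symlink():
                shutil.rmtree(path)
            else:
                path.unlink()
    return stale
```

A call such as clean_workspace(Path("/opt/actions-runner/_work"), 7) lists what would be removed; pass dry_run=False to delete.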

Q8: Can several organizations share self-hosted runners?

Only the Enterprise plan can share a runner group across orgs. Free and regular orgs each have to set up their own.

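Summary

Key takeaways

Use GitHub-hosted 80% of the time: less work, safe, cheap
Self-host only for special needs: private networks, GPUs, heavy compute, long jobs
Always ephemeral: avoids environment contamination and security risk
Never put persistent self-hosted runners on public repos: fork PRs can run arbitrary commands
Use ARC on K8s: autoscaling, ephemeral, isolated, well monitored
Runner groups + labels: split by permission and purpose
GitHub App rather than PAT: fine-grained permissions, not tied to a person

Decision tree

Do you need a self-hosted runner?
├─ Public repo? → Use hosted (do not expose self-hosted runners to public repos)
├─ Need private network access? → Self-hosted, inside the VPN
├─ Special hardware (GPU)? → Self-hosted
├─ High volume, cutting cost? → Self-hosted (spot / ARC)
└─ None of the above → Use hosted

How to run it?
├─ < 10 runners, simple → EC2 / VM
├─ Medium to large, K8s already in place → ARC
└─ High volume but no K8s → EC2 ASG + Spot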
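The two trees above can be written down as plain functions, which is handy when the decision is part of an internal platform checklist. This is a direct transcription of the branches in this article, with the branch order as the priority; the function and parameter names are made up for the sketch.

```python
def choose_runner(public_repo: bool, needs_private_network: bool,
                  needs_special_hardware: bool, high_volume: bool) -> str:
    """First tree: hosted or self-hosted."""
    if public_repo:
        return "hosted"  # never expose self-hosted runners to public repos
    if needs_private_network or needs_special_hardware or high_volume:
        return "self-hosted"
    return "hosted"


def choose_platform(runner_count: int, has_kubernetes: bool) -> str:
    """Second tree: where to run self-hosted runners."""
    if runner_count < 10:
        return "EC2 / VM"
    if has_kubernetes:
        return "ARC"
    return "EC2 ASG + Spot"


print(choose_runner(False, True, False, False), choose_platform(40, True))
```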

Quick reference: commands

# Register
./config.sh --url https://github.com/ORG --token TOKEN \
  --name $(hostname) --labels self-hosted,linux,x64 --ephemeral --unattended

# Service
sudo ./svc.sh install && sudo ./svc.sh start

# Check status
sudo ./svc.sh status

# Remove
./config.sh remove --token REMOVE_TOKEN

# Install ARC
helm install arc --namespace arc-systems --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
helm install my-runners --namespace arc-runners --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --values values.yaml

Quick reference: workflow

jobs:
  build:
    runs-on: [self-hosted, linux, x64, my-label]
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - run: ./build.sh

Created: 2026-05-25
