A Deep Dive into Metal 4 Game Development

The One-Sentence Verdict

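The core changes in Metal 4 are unified encoding, resource management at scale, and pipeline precompilation. If your game engine is still built on the Metal 3 encoder model, the return on this upgrade is very high.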

What This Session Covers

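This is the second part of the four-part Metal 4 series, presented by GPU Driver Engineers Jason and Yang. It focuses on three dimensions of game engine optimization: encoding efficiency, resource management, and pipeline loading.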

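On the encoding side, Metal 4 consolidates the most common operations into two encoder classes, render and compute. All compute work (kernel dispatches, blits, acceleration structure builds) can be encoded in a single compute encoder; commands run concurrently by default, and Pass Barriers handle data dependencies. Color attachment mapping lets one render encoder redirect fragment shader outputs to different attachments, so you no longer need a new encoder for each render target layout. The Command Allocator manages encoding-time memory explicitly and can be reused after a reset. Command Buffers support multithreaded encoding, and the Suspend/Resume options merge multiple render encoders into a single GPU pass.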

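On the resource management side, the Argument Table binds root-level resources by index and works with the bindless argument buffer model to scale to thousands of resources. Residency Sets manage GPU visibility in bulk; a few large sets are recommended. For drawable resources, the developer now manages residency explicitly (through the layer's dynamic residency set) as well as synchronization (wait/signal on the queue). Queue Barriers express data dependencies between encoders and filter at stage granularity to maximize concurrency. The Texture View Pool preallocates memory for texture views so that lightweight views can be created at encoding time. Placement Sparse Heaps give developers control over how sparse resources are mapped to memory.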

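On the pipeline loading side, Flexible Render Pipeline State compiles an unspecialized pipeline first and specializes it on demand, reusing the vertex and fragment binary bodies, which substantially reduces compile time. Multithreaded compilation works with GCD and with custom thread pools; the key is to set an appropriate QoS so that compilation does not compete with the render thread. Ahead-of-time compilation uses the Pipeline Data Set Serializer to collect pipeline descriptors, metal-tt to build a GPU binary archive, and a runtime lookup in that archive, which brings pipeline load time close to zero.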

Points Worth Digging Into

How Color Attachment Mapping Affects Rendering Architecture

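Color attachment mapping is one of the Metal 4 features with the largest impact on rendering architecture. In Metal 3, two draw calls that write to different combinations of render targets need different render encoders, because each encoder's attachment layout is fixed at creation. The result is many small render passes, each paying its own load/store cost.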

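Metal 4's solution: one render encoder owns every attachment that might be used, and a color attachment map dynamically routes shader outputs to physical attachments. A map is a prebuilt, reusable object that you bind alongside the pipeline. If the next draw call needs a different output combination, you switch the map rather than the encoder.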

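This is especially valuable for architectures such as forward+ or deferred rendering that switch render targets frequently. A scene that used to need a dozen or so render passes may now need only two or three.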

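Trade-off: you get fewer encoders and less pass overhead, but the total attachment count may grow, because a single encoder must hold every attachment that might be used, and that raises tile memory pressure. On memory-constrained devices, check whether the attachment count exceeds the hardware limits. Profile before restructuring; do not merge every pass blindly.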
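Before merging passes, a back-of-the-envelope check of the superset's per-pixel footprint can flag obvious problems. The sketch below is plain Swift and does not call Metal: `AttachmentFormat`, its cases, and the 32-byte budget are hypothetical stand-ins. The real limit depends on the GPU family, and actual tile storage may add padding, so treat the sum as a lower bound.

```swift
/// Hypothetical subset of attachment formats with their size per pixel.
enum AttachmentFormat {
    case rgba8Unorm, rgb10a2Unorm, rgba16Float, rgba32Float

    var bytesPerPixel: Int {
        switch self {
        case .rgba8Unorm, .rgb10a2Unorm: return 4
        case .rgba16Float: return 8
        case .rgba32Float: return 16
        }
    }
}

/// Sums the per-pixel size of every attachment in the superset.
func bytesPerPixel(of attachments: [AttachmentFormat]) -> Int {
    attachments.reduce(0) { $0 + $1.bytesPerPixel }
}

/// Returns true when the superset fits the assumed per-pixel budget.
func fits(_ attachments: [AttachmentFormat], budgetBytesPerPixel: Int) -> Bool {
    bytesPerPixel(of: attachments) <= budgetBytesPerPixel
}

// Example: albedo, normal, emission, and material attachments in one encoder.
let superset: [AttachmentFormat] = [.rgba8Unorm, .rgba16Float, .rgba16Float, .rgba8Unorm]
print(bytesPerPixel(of: superset))                 // 24
print(fits(superset, budgetBytesPerPixel: 32))     // true (32 is an assumed budget)
```

If the check fails, splitting the work into two encoders with smaller attachment sets is usually a better starting point than shrinking formats.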

Flexible Render Pipeline State: Compile Time versus Runtime Performance

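The core idea of the flexible render pipeline is "compile the body first, specialize the output later". Consider a building game in which a house has three render states: holographic preview (additive blend), under construction (transparent blend), and completed (opaque blend). The vertex shader and fragment body are identical across the three; only the fragment output (blend state, write mask, and so on) differs.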

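The traditional approach compiles three complete pipelines, and most of that work is duplicated. With Metal 4 you compile one unspecialized pipeline (with every color attachment setting marked unspecialized) and then quickly generate specialized pipelines from different configurations. Specialization only regenerates the fragment output portion, so it is very fast.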

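There is a performance cost, though. If the fragment shader writes four color channels but the attachment has only one, the compiler can no longer optimize away the unused channels. The jump from the fragment binary body to the fragment output part also adds a small overhead. For most shaders this is negligible, but for some fragment shaders that run at high frequency the impact may be measurable.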

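A practical strategy: ship everything on the unspecialized-plus-specialize path first, then use Instruments Metal System Trace to find the shaders with a clear performance gap, compile full-state versions of those in the background, and swap them in. You get fast loading and still protect runtime performance on the critical path.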
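The "swap in the full-state version later" step can be sketched as a small thread-safe slot. This is a minimal illustration in plain Swift, not Metal API: `PipelineSlot` and its members are hypothetical names, and the generic `Pipeline` parameter stands in for whatever pipeline state type the engine uses.

```swift
import Foundation

/// Holds the pipeline a material currently draws with. It starts with the
/// quickly specialized variant and can later be promoted to a full-state one.
final class PipelineSlot<Pipeline>: @unchecked Sendable {
    enum Variant { case specialized, fullState }

    private let lock = NSLock()
    private var pipeline: Pipeline
    private var variant: Variant = .specialized

    init(specialized: Pipeline) {
        pipeline = specialized
    }

    /// Read by the render thread each frame.
    var current: (pipeline: Pipeline, variant: Variant) {
        lock.lock(); defer { lock.unlock() }
        return (pipeline, variant)
    }

    /// Called from a background thread once the full-state build finishes.
    func promote(to fullState: Pipeline) {
        lock.lock(); defer { lock.unlock() }
        pipeline = fullState
        variant = .fullState
    }
}

// Strings stand in for real pipeline state objects in this example.
let slot = PipelineSlot(specialized: "house-specialized")
slot.promote(to: "house-full-state")
print(slot.current.pipeline)   // house-full-state
```

Because the render thread only ever reads `current`, it keeps drawing with the specialized pipeline until the promoted one is ready, so there is no frame where the material has nothing to draw with.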

Code Snippets

1. Color Attachment Mapping

Scenario: switch fragment output targets within a single render encoder.

// Build the superset of attachments
let passDescriptor = MTL4RenderPassDescriptor()
passDescriptor.colorAttachments[0].texture = albedoTexture
passDescriptor.colorAttachments[1].texture = normalTexture
passDescriptor.colorAttachments[2].texture = emissionTexture

// Create the attachment maps once, ahead of encoding.
// Assumption: Swift spelling of the Objective-C selector
// setPhysicalIndex:forLogicalIndex: on MTLLogicalToPhysicalColorAttachmentMap.
let mapA = MTLLogicalToPhysicalColorAttachmentMap()
mapA.setPhysicalIndex(0, forLogicalIndex: 0)  // shader output 0 → albedo
mapA.setPhysicalIndex(2, forLogicalIndex: 1)  // shader output 1 → emission
// mapB is built the same way for pipelineB's outputs

// Switch maps while encoding
encoder.setRenderPipelineState(pipelineA)
encoder.setColorAttachmentMap(mapA)
encoder.drawPrimitives(...)
encoder.setRenderPipelineState(pipelineB)
encoder.setColorAttachmentMap(mapB)
encoder.drawPrimitives(...)

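Pitfall: build attachment maps before encoding and reuse them; do not create new ones every frame. The map itself is lightweight, but constructing it has a cost.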

2. Flexible Render Pipeline Specialization

Scenario: compile an unspecialized pipeline first, then specialize it for different blend states.

// 1. Create the unspecialized pipeline (compiler is an MTL4Compiler)
let descriptor = MTL4RenderPipelineDescriptor()
descriptor.vertexFunctionDescriptor = vertexFunc
descriptor.fragmentFunctionDescriptor = fragmentFunc
for i in 0..<colorAttachmentCount {  // number of attachments the shader writes
    descriptor.colorAttachments[i].pixelFormat = .unspecialized
    descriptor.colorAttachments[i].writeMask = .unspecialized
    descriptor.colorAttachments[i].blendingState = .unspecialized
}
let unspecialized = try compiler.makeRenderPipelineState(descriptor: descriptor, compilerTaskOptions: nil)

// 2. Specialize to opaque: fill in the attachment state and reuse the compiled body.
// Assumption: Swift spelling of the Objective-C call
// newRenderPipelineStateBySpecializationWithDescriptor:pipeline:error:.
descriptor.colorAttachments[0].pixelFormat = .bgra8Unorm
descriptor.colorAttachments[0].writeMask = .all
descriptor.colorAttachments[0].blendingState = .disabled
let opaquePipeline = try compiler.makeRenderPipelineStateBySpecialization(descriptor: descriptor, pipeline: unspecialized)

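Pitfall: if the fragment shader of an unspecialized pipeline writes attachment channels that end up unused, the compiler cannot optimize them away. For performance-critical shaders, investigate with Instruments afterwards and consider full-state compilation.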

3. Ahead-of-time Pipeline Compilation and Lookup

Scenario: precompile pipelines at development time and load them from an archive at runtime.

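// Runtime lookup (Swift spellings assumed from the Objective-C calls in the session)
let archive = try device.makeArchive(url: archiveURL)
let pipeline: MTLRenderPipelineState
do {
    pipeline = try archive.makeRenderPipelineState(descriptor: descriptor)
} catch {
    // Fallback: compile on device when the archive lookup misses
    pipeline = try compiler.makeRenderPipelineState(descriptor: descriptor, compilerTaskOptions: nil)
}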

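Pitfall: an archive lookup can miss (a pipeline that does not match, or an incompatible OS or GPU architecture). Always provide a fallback to on-device compilation; without one, a miss fails outright.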

Best Practices

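Use color attachment mapping to merge render passes and cut encoder count and load/store overhead, but profile first to confirm the tile memory pressure is acceptable.
Encode all compute work in the unified compute encoder, express serial dependencies with Pass Barriers, and let everything else run concurrently by default.
Load every pipeline through the unspecialized-plus-specialize path first, then use Metal System Trace after launch to find the shaders that need full-state compilation.
For multithreaded compilation, set the QoS to default so compile threads do not compete with the render thread for CPU.
Prefer fewer residency sets: pack as many resources as possible into each one, attach sets that do not change to the command queue, and attach frequently changing ones to the command buffer.
For ahead-of-time compilation, collect descriptors into a pipeline script, build the archive with metal-tt, and always have a fallback for runtime lookup misses.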
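The QoS recommendation in the list above can be sketched with GCD. This example uses only Foundation and Dispatch: `compileAll`, `CompiledStore`, and the `compile` closure are hypothetical, and the closure stands in for whatever blocking pipeline compilation call the engine makes.

```swift
import Foundation

/// Thread-safe collector for compiled results.
final class CompiledStore: @unchecked Sendable {
    private let lock = NSLock()
    private var storage: [String: String] = [:]

    func insert(_ value: String, for key: String) {
        lock.lock(); defer { lock.unlock() }
        storage[key] = value
    }

    var snapshot: [String: String] {
        lock.lock(); defer { lock.unlock() }
        return storage
    }
}

/// Runs `compile` for every name on a concurrent queue at `.default` QoS and
/// blocks until all tasks finish. `compile` is a stand-in for real compilation.
func compileAll(names: [String],
                compile: @escaping @Sendable (String) -> String) -> [String: String] {
    let queue = DispatchQueue(label: "pipeline.compile", qos: .default, attributes: .concurrent)
    let group = DispatchGroup()
    let store = CompiledStore()

    for name in names {
        queue.async(group: group) {
            store.insert(compile(name), for: name)
        }
    }
    group.wait()
    return store.snapshot
}

let compiled = compileAll(names: ["opaque", "transparent", "hologram"]) { "compiled-" + $0 }
print(compiled.count)   // 3
```

A loading screen would more likely use `group.notify` than a blocking `wait`; the blocking form is used here only to keep the example short.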

What Else Is Worth Noting

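The Command Allocator is not thread-safe; different threads must use different allocators.
Queue Barriers filter at stage granularity (dispatch, fragment, vertex, and so on), and the Metal debugger visualizes where each barrier sits and what it affects.
Placement Sparse Heaps support several page sizes (for example 64 KB); larger pages perform better but cost more in alignment overhead, so choose based on the characteristics of the resource.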
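The alignment overhead mentioned for sparse page sizes is simple arithmetic: a resource occupies whole pages, so the padding grows with the page size. The sketch below works through a hypothetical 100 KiB resource; the helper names are illustrative, and the 16 and 256 KiB rows are included only for comparison with the 64 KiB case.

```swift
/// Bytes of backing memory a resource needs when it must occupy whole pages.
func alignedSize(_ size: Int, pageSize: Int) -> Int {
    precondition(size >= 0 && pageSize > 0)
    return (size + pageSize - 1) / pageSize * pageSize
}

/// Bytes lost to padding at the given page size.
func paddingWaste(_ size: Int, pageSize: Int) -> Int {
    alignedSize(size, pageSize: pageSize) - size
}

let kib = 1024
let resourceSize = 100 * kib
for page in [16 * kib, 64 * kib, 256 * kib] {
    let waste = paddingWaste(resourceSize, pageSize: page) / kib
    print("page \(page / kib) KiB: \(waste) KiB padding")
}
// page 16 KiB: 12 KiB padding
// page 64 KiB: 28 KiB padding
// page 256 KiB: 156 KiB padding
```

For many small resources the padding dominates, which is why the page size is worth choosing per resource class rather than globally.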