Kubernetes distributions

Wordcab targets CNCF-conformant Kubernetes 1.27 or later. The chart has been tested on the distributions below; distribution-specific notes cover ingress, storage, and the GPU operator.

Amazon EKS

  • Node group: g6e.2xlarge (L40S) or p5.48xlarge (H100) for production.
  • GPU operator: the NVIDIA device plugin via EKS add-on, or the full NVIDIA GPU Operator for time-slicing.
  • Storage: gp3 for hot transcripts, S3 for recordings + archive.
  • Ingress: AWS Load Balancer Controller (NLB for media streams, ALB for HTTPS API).
  • Private networking: PrivateLink endpoint supported; enabling it disables the public ingress by default.
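As a sketch of the gp3 hot-transcript storage noted above, a StorageClass like the following could be applied (the class name `wordcab-gp3` and the IOPS value are illustrative, and the aws-ebs-csi-driver add-on must already be installed):

```shell
# Hypothetical gp3 StorageClass for hot transcript volumes on EKS.
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: wordcab-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"   # gp3 provisions IOPS independently of volume size
volumeBindingMode: WaitForFirstConsumer
EOF
```

`WaitForFirstConsumer` keeps the EBS volume in the same AZ as the pod that first mounts it.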

Azure AKS

  • Node pool: Standard_NC40ads_H100_v5 or Standard_NV36ads_A10_v5.
  • GPU operator: NVIDIA device plugin add-on, or GPU Operator via Helm.
  • Storage: Azure Premium SSD v2 for hot data, Blob for recordings.
  • Ingress: Application Gateway Ingress Controller or nginx. Private Endpoint supported.
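A GPU node pool of the kind described above could be added roughly as follows (resource group, cluster, and pool names are placeholders; the taint keeps non-GPU workloads off the pool):

```shell
# Illustrative A10 node pool on AKS; adjust names and counts for your cluster.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NV36ads_A10_v5 \
  --node-taints sku=gpu:NoSchedule
```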

Google GKE

  • Node pool: a3-highgpu-8g (H100) or g2-standard-24 (L4).
  • GPU operator: GKE-managed DaemonSet.
  • Storage: pd-ssd (or Filestore for shared model cache), GCS for recordings.
  • Ingress: GKE Gateway API or nginx. Private Service Connect supported.
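For the L4 shape above, a managed node pool might be created like this (cluster name, region, and pool name are placeholders; `g2-standard-24` carries two L4 GPUs, and GKE installs the driver itself when `gpu-driver-version` is set):

```shell
# Illustrative L4 node pool on GKE with GKE-managed driver installation.
gcloud container node-pools create gpu-pool \
  --cluster=my-gke --region=us-central1 \
  --machine-type=g2-standard-24 \
  --accelerator=type=nvidia-l4,count=2,gpu-driver-version=latest \
  --num-nodes=1
```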

OpenShift 4.14+

  • GPU operator: NVIDIA GPU Operator + Node Feature Discovery from OperatorHub.
  • Storage: ODF, NetApp Trident, or customer-provided StorageClass.
  • Ingress: OpenShift Route (auto-created from Ingress resources) or an external HAProxy.
  • SCC: restricted-v2 by default. Model pods whose containers must run as a specific UID need the anyuid SCC; the chart ships an SCC manifest.
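Where the shipped SCC manifest cannot be used, the anyuid grant could be made directly; the service account and namespace names below are placeholders:

```shell
# Grant anyuid to the model pods' service account (names are illustrative).
# Prefer applying the SCC manifest shipped with the chart where possible.
oc adm policy add-scc-to-user anyuid -z wordcab-model -n wordcab
```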

SUSE RKE2 / air-gapped vanilla

  • GPU operator: NVIDIA GPU Operator installed from the internal registry mirror.
  • Storage: Longhorn or Rancher Local-Path for single-node, customer block storage for multi-node.
  • Ingress: nginx-ingress, with MetalLB for on-prem L4 load balancing.
  • For air-gapped installs, see the airgap install guide.
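An install from an internal mirror could look roughly like this; the registry host, chart path, and namespace are placeholders for your environment (`operator.repository` and `driver.repository` are standard GPU Operator chart values):

```shell
# Sketch: GPU Operator from an internal registry mirror.
helm install gpu-operator oci://registry.internal.example/charts/gpu-operator \
  -n gpu-operator --create-namespace \
  --set operator.repository=registry.internal.example/nvidia \
  --set driver.repository=registry.internal.example/nvidia
```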

Common prerequisites

```bash
# Confirm the cluster can see GPUs
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'

# Confirm a default StorageClass exists
kubectl get storageclass

# Run the Wordcab preflight suite
wordcab deploy preflight
```
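If the StorageClass check shows no default class, one can be marked default with the standard annotation (the class name `gp3` here is a placeholder for whichever class your cluster provides):

```shell
# Mark an existing StorageClass as the cluster default.
kubectl patch storageclass gp3 -p \
  '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
```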

GPU drivers

All node shapes require the NVIDIA GPU Operator or device plugin. On older RHEL / SLES kernels, run the operator with driver.enabled=true so that the correct signed kernel modules are built in-cluster.
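An operator install with in-cluster driver builds enabled could be sketched as follows; `driver.enabled` is a standard GPU Operator value, while the namespace name is illustrative:

```shell
# Sketch: GPU Operator with in-cluster driver builds for older kernels.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=true
```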