GPU Setup

Connect GPU compute providers, configure pod defaults, set budget limits, and optimize for cost.

This guide walks you through connecting GPU compute to your lab. You need GPU access to run experiments that require heavy computation (MCMC chains, model training, large-scale data processing).

Connect RunPod

Get your RunPod API key

  1. Go to runpod.io and sign in
  2. Navigate to Settings > API Keys
  3. Create a new API key with full access
  4. Copy the key

Add the key to Hubify

  1. Go to Lab Settings > Compute
  2. Click Connect RunPod
  3. Paste your API key
  4. Click Verify & Save
Alternatively, configure the provider from the CLI:

hubify pod config --provider runpod --api-key "your-runpod-api-key"

Verify the connection

hubify pod config --test
RunPod connection: OK
Available GPUs: H200, H100, A100, A40
Account balance: $245.00

Set Default GPU

Configure which GPU type is used when experiments do not specify one:

# Set default GPU
hubify pod config --default-gpu h100

# Set default timeout
hubify pod config --default-timeout 4h

# View current config
hubify pod config --show
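The fallback behavior can be sketched as: an experiment's explicit GPU choice wins, otherwise the lab default applies. The function and names below are illustrative, not Hubify internals:

```python
# Sketch of default-GPU resolution: an experiment's explicit choice
# overrides the lab default. Names are illustrative, not Hubify internals.
def resolve_gpu(experiment_gpu, lab_default="h100"):
    """Return the GPU type a pod should use."""
    return experiment_gpu if experiment_gpu else lab_default

print(resolve_gpu(None))    # → h100 (falls back to the lab default)
print(resolve_gpu("h200"))  # → h200 (explicit choice wins)
```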

Budget Controls

Set spending limits to avoid surprises:

# Monthly budget cap (pods queue when reached)
hubify pod budget --monthly 500

# Per-experiment cap
hubify pod budget --per-experiment 50

# Alert threshold (notify at 80% of budget)
hubify pod budget --alert-threshold 0.8

When the monthly budget is reached:

  • New experiments queue instead of launching
  • You receive a notification
  • The orchestrator suggests cost-saving alternatives
  • Running experiments continue until completion
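The behavior above amounts to a simple gate on new launches; running pods are unaffected. A minimal sketch, with hypothetical names (Hubify's internal logic is not shown in this guide):

```python
# Sketch of the monthly-budget gate described above. All names are
# hypothetical; this is not Hubify's internal implementation.
ALERT_THRESHOLD = 0.8  # notify at 80% of budget

def launch_decision(spent, monthly_cap):
    """Decide what happens when a new experiment is submitted."""
    if spent >= monthly_cap:
        return "queue"          # new experiments queue instead of launching
    if spent >= ALERT_THRESHOLD * monthly_cap:
        return "launch+alert"   # launch, but notify that budget is near
    return "launch"

print(launch_decision(250, 500))  # → launch
print(launch_decision(420, 500))  # → launch+alert
print(launch_decision(500, 500))  # → queue
```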

Pod Templates

Create reusable pod configurations for common experiment types:

# Create a template
hubify pod template create \
  --name "mcmc-standard" \
  --gpu h100 \
  --timeout 4h \
  --docker-image "hubify/cosmo:latest" \
  --env "OMP_NUM_THREADS=16" \
  --storage 50GB

# Use a template
hubify experiment run --name "my-chain" --pod-template mcmc-standard

# List templates
hubify pod template list
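Conceptually, a template is a named bundle of defaults that an individual experiment can override field by field. A minimal sketch, assuming a simple last-wins merge (illustrative, not Hubify's internal representation):

```python
# Sketch: a pod template as a dict of defaults, merged with per-experiment
# overrides. Field names mirror the CLI flags above; the merge logic is
# illustrative, not Hubify internals.
MCMC_STANDARD = {
    "gpu": "h100",
    "timeout": "4h",
    "docker_image": "hubify/cosmo:latest",
    "env": {"OMP_NUM_THREADS": "16"},
    "storage": "50GB",
}

def build_pod_config(template, **overrides):
    """Start from the template, let the experiment override fields."""
    return {**template, **overrides}

config = build_pod_config(MCMC_STANDARD, gpu="h200", timeout="8h")
print(config["gpu"])      # → h200 (overridden by the experiment)
print(config["storage"])  # → 50GB (inherited from the template)
```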

GPU Selection Guide

Experiment Type                     Recommended GPU   Why
MCMC chains (< 100K samples)        H100              Good balance of cost and speed
MCMC chains (> 100K samples)        H200              Large memory prevents OOM
Neural network training             H100 or H200      Depends on model size
Anomaly detection (large catalog)   H200              141 GB VRAM for full dataset
Data preprocessing                  CPU               No GPU needed; saves money
Figure generation                   CPU or A40        Lightweight; saves money
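The selection rules above can be codified as a small helper. The thresholds and GPU names come straight from the table; the function itself is illustrative and not part of the Hubify CLI:

```python
# Codify the GPU selection table above. The rules mirror the table rows;
# the helper itself is illustrative, not part of the Hubify CLI.
def recommend_gpu(task, samples=0):
    """Map an experiment type (and MCMC sample count) to a GPU choice."""
    if task == "mcmc":
        return "H200" if samples > 100_000 else "H100"
    if task == "nn-training":
        return "H100 or H200"  # depends on model size
    if task == "anomaly-detection":
        return "H200"          # 141 GB VRAM for the full catalog
    if task == "preprocessing":
        return "CPU"           # no GPU needed; saves money
    if task == "figures":
        return "CPU or A40"    # lightweight; saves money
    raise ValueError(f"unknown task: {task}")

print(recommend_gpu("mcmc", samples=250_000))  # → H200
print(recommend_gpu("preprocessing"))          # → CPU
```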

Persistent Storage

Configure persistent storage for datasets and results:

# View storage usage
hubify pod storage list

# Upload a dataset (available to all pods)
hubify pod storage upload ./planck_likelihood.tar.gz

# Set retention policy
hubify pod storage config --retain-days 90

Persistent storage survives pod teardowns. Pre-stage large datasets here so experiments start instantly instead of waiting for downloads.
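Assuming the retention window is counted from the upload date (an assumption; the CLI output above does not say), the deletion cutoff is simple date arithmetic:

```python
# Sketch of the retention cutoff, assuming --retain-days is counted from
# the upload date. Dates are illustrative.
from datetime import date, timedelta

retain_days = 90
uploaded = date(2026, 4, 1)
expires = uploaded + timedelta(days=retain_days)
print(expires)  # → 2026-06-30
```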

SSH Keys

Add SSH keys for direct pod access:

# Add your SSH key
hubify pod ssh-key add --file ~/.ssh/id_ed25519.pub

# List configured keys
hubify pod ssh-key list

Monitoring

Monitor active pods from Captain View or CLI:

# Real-time pod status
hubify pod list --watch

# GPU utilization
hubify pod metrics pod-abc123

# Cost tracking
hubify pod cost --month current --breakdown
MONTH      TOTAL    H200    H100    A100    CPU
2026-04    $312     $180    $112    $20     $0
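The per-GPU columns should sum to the TOTAL column; a quick sanity check on the sample row above (the parsing is illustrative):

```python
# Sanity-check the sample cost breakdown above: per-GPU columns should
# sum to the TOTAL column. Values come from the sample output row.
row = {"H200": 180, "H100": 112, "A100": 20, "CPU": 0}
total = 312

assert sum(row.values()) == total
print(f"2026-04 total: ${sum(row.values())}")  # → 2026-04 total: $312
```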

Coming Soon: Modal

Modal integration will add serverless GPU functions. Instead of managing pods, you deploy functions that run on demand and are billed per second. Ideal for:

  • Short-lived tasks (< 10 minutes)
  • Bursty workloads
  • Figure generation
  • Small inferences

Modal support is currently in development.
