# RunPod Integration

Connect RunPod GPU pods to Hubify Labs for experiment execution.

RunPod is the primary GPU compute provider for Hubify Labs. This guide covers connecting your RunPod account, configuring pods, and optimizing for cost.

## Connecting RunPod

### Create a RunPod account

Sign up at runpod.io and add billing information.

### Generate an API key

Go to RunPod Settings > API Keys and create a key with full access.

### Add to Hubify

```bash
hubify pod config --provider runpod --api-key "your-runpod-api-key"
```

### Verify

```bash
hubify pod config --test
```

```text
RunPod connection: OK
Available GPUs: H200, H100, A100, A40, RTX 4090
Account balance: $245.00
```

## Available GPU Types

| GPU | VRAM | Best For | Approx. Cost/hr |
|-----|------|----------|-----------------|
| H200 | 141 GB | Large models, full-dataset anomaly detection | $3.89 |
| H100 | 80 GB | MCMC chains, training, most experiments | $2.49 |
| A100 | 80 GB | General GPU compute | $1.64 |
| A40 | 48 GB | Medium workloads, figure generation | $0.79 |
| RTX 4090 | 24 GB | Small models, prototyping | $0.44 |

Pricing varies by availability and region. Spot instances can be up to 80% cheaper.
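When planning a run, the hourly rates above translate directly into rough cost estimates. A minimal sketch — the rates are copied from the table, and the run lengths and discount are illustrative, not quotes:

```python
# Illustrative on-demand rates ($/hr), copied from the table above.
RATES = {"H200": 3.89, "H100": 2.49, "A100": 1.64, "A40": 0.79, "RTX 4090": 0.44}

def estimate_cost(gpu: str, hours: float, spot_discount: float = 0.0) -> float:
    """Estimate pod cost; spot_discount=0.8 models the best-case 'up to 80% cheaper' spot price."""
    return round(RATES[gpu] * hours * (1.0 - spot_discount), 2)

# A 10-hour MCMC chain on an H100:
print(estimate_cost("H100", 10))                      # on-demand: 24.9
print(estimate_cost("H100", 10, spot_discount=0.8))   # best-case spot: 4.98
```

Actual billing depends on real-time spot pricing and region, so treat these as upper/lower bounds rather than invoices.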

## Pod Configuration

### Default Settings

```bash
# Set defaults for all new pods
hubify pod config --default-gpu h100
hubify pod config --default-region us-east
hubify pod config --idle-timeout 15m
```

### Docker Images

Hubify provides pre-built images with common scientific packages:

| Image | Contents |
|-------|----------|
| `hubify/base:latest` | Python 3.11, CUDA 12, PyTorch 2.1 |
| `hubify/cosmo:latest` | Base + Cobaya, GetDist, Astropy, HEALPy |
| `hubify/ml:latest` | Base + Transformers, Accelerate, Datasets |
| `hubify/astro:latest` | Base + Astropy, Photutils, SEP, Source Extractor |

Set the default image:

```bash
hubify pod config --default-image hubify/cosmo:latest
```

### SSH Access

```bash
# Add your SSH key
hubify pod ssh-key add --file ~/.ssh/id_ed25519.pub

# SSH into a running pod
hubify pod ssh pod-abc123
```

## Performance Tips

- Use a `DataLoader` for GPU inference: `num_workers=16`, `pin_memory=True`, `prefetch_factor=4` gives a 32x speedup over serial processing
- Pre-stage large datasets on persistent storage so pods start instantly
- Use spot instances for non-urgent experiments (pass the `--spot` flag)
- Match the GPU to the workload: do not use an H200 for figure generation
```bash
# Run on a spot instance
hubify experiment run --name "overnight-chain" --pod h100 --spot
```
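The `DataLoader` tip above can be sketched as a minimal PyTorch setup. `InferenceDataset` is a hypothetical stand-in for your own data, and the example assumes `torch` is installed (it is in all `hubify/*` images):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class InferenceDataset(Dataset):
    """Placeholder dataset; replace __getitem__ with your real loading logic."""
    def __init__(self, n: int = 256):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224)  # stand-in for a real sample

loader = DataLoader(
    InferenceDataset(),
    batch_size=64,
    num_workers=16,      # parallel CPU-side loading processes
    pin_memory=True,     # page-locked host memory for faster host-to-device copies
    prefetch_factor=4,   # batches each worker keeps queued ahead of the GPU
)

device = "cuda" if torch.cuda.is_available() else "cpu"
batch = next(iter(loader)).to(device, non_blocking=True)  # fetch one batch to illustrate
# model(batch) ...
```

The point of the three parameters is to keep the GPU fed: workers load and pin the next batches on the CPU while the current batch is being processed, so the copy and compute overlap instead of serializing.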

## Cost Management

```bash
# Set monthly budget
hubify pod budget --monthly 500

# Set per-experiment cap
hubify pod budget --per-experiment 50

# View current spend
hubify pod budget --show

# Alert at 80% of budget
hubify pod budget --alert-threshold 0.8
```

## Persistent Storage

Upload datasets to RunPod persistent storage so they survive pod restarts:

```bash
# Upload a dataset
hubify pod storage upload ./planck_likelihood.tar.gz

# Mount in experiments
hubify experiment run --name "my-chain" --storage planck_likelihood.tar.gz
```
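Inside the pod, the mounted archive can be unpacked with the standard library. A minimal sketch — the mount path `/workspace/storage` is an assumption for illustration; check where Hubify actually mounts storage in your pod:

```python
import tarfile
from pathlib import Path

# Hypothetical mount point; verify the real path inside your pod.
MOUNT = Path("/workspace/storage")
ARCHIVE = MOUNT / "planck_likelihood.tar.gz"
DEST = Path("/workspace/data")

def unpack(archive: Path, dest: Path) -> list[str]:
    """Extract a mounted .tar.gz into dest, returning the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
        return tar.getnames()

if ARCHIVE.exists():
    print(unpack(ARCHIVE, DEST))
```

Extract into pod-local scratch space rather than reading directly from the archive on every run; the persistent volume only needs to hold the compressed copy.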

## Troubleshooting

<AccordionGroup>

<Accordion title="Pod stuck in provisioning">

The requested GPU type may be sold out. Try a different GPU or region:

```bash
hubify pod list --available
```

</Accordion>

<Accordion title="Out of memory (OOM)">

Upgrade to a GPU with more VRAM, or reduce the batch size. The H200 (141 GB) handles the largest workloads.

</Accordion>

<Accordion title="Spot instance preempted">

Spot instances can be reclaimed. Use checkpointing for long experiments:

```bash
hubify experiment resume EXP-051 --from-checkpoint latest
```

</Accordion>

</AccordionGroup>
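The checkpoint-resume pattern behind `--from-checkpoint` can be sketched in plain Python. The file name, state layout, and step counts here are illustrative, not Hubify's actual checkpoint format:

```python
import json
from pathlib import Path

CKPT = Path("checkpoint.json")

def run_chain(total_steps: int, every: int = 100) -> dict:
    """Toy long-running loop that checkpoints so a preempted run can resume."""
    state = {"step": 0, "acc": 0.0}
    if CKPT.exists():  # resume after preemption instead of starting over
        state = json.loads(CKPT.read_text())
    while state["step"] < total_steps:
        state["step"] += 1
        state["acc"] += 1.0  # stand-in for real work (e.g. one MCMC step)
        if state["step"] % every == 0:
            CKPT.write_text(json.dumps(state))  # persist durable progress
    return state

print(run_chain(500))
```

On a spot pod, write checkpoints to persistent storage, not pod-local disk; a preempted pod's local filesystem is gone when the replacement starts.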