RunPod Integration
Connect RunPod GPU pods to Hubify Labs for experiment execution.
RunPod is the primary GPU compute provider for Hubify Labs. This guide covers connecting your RunPod account, configuring pods, and optimizing for cost.
## Connecting RunPod

<Steps>
<Step title="Create a RunPod account">
Sign up at runpod.io and add billing information.
</Step>
<Step title="Generate an API key">
Go to RunPod Settings > API Keys and create a key with full access.
</Step>
<Step title="Add to Hubify">
```bash
hubify pod config --provider runpod --api-key "your-runpod-api-key"
```
</Step>
<Step title="Verify">
```bash
hubify pod config --test
```

```
RunPod connection: OK
Available GPUs: H200, H100, A100, A40, RTX 4090
Account balance: $245.00
```
</Step>
</Steps>
## Available GPU Types
| GPU | VRAM | Best For | Approx. Cost/hr |
|---|---|---|---|
| H200 | 141 GB | Large models, full-dataset anomaly detection | $3.89 |
| H100 | 80 GB | MCMC chains, training, most experiments | $2.49 |
| A100 | 80 GB | General GPU compute | $1.64 |
| A40 | 48 GB | Medium workloads, figure generation | $0.79 |
| RTX 4090 | 24 GB | Small models, prototyping | $0.44 |
Pricing varies by availability and region. Spot instances can be up to 80% cheaper.
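To budget an experiment before launching it, you can combine the approximate rates above with an expected runtime. A minimal sketch (the rates are the illustrative ones from the table and will drift with availability; the helper is not part of the `hubify` CLI):

```python
# Approximate on-demand hourly rates from the table above (USD/hr).
# These vary by region and availability; treat them as estimates only.
HOURLY_RATES = {"H200": 3.89, "H100": 2.49, "A100": 1.64, "A40": 0.79, "RTX 4090": 0.44}

def estimate_cost(gpu: str, hours: float, spot_discount: float = 0.0) -> float:
    """Estimated cost in USD for `hours` on the given GPU.

    `spot_discount` is a fraction, e.g. 0.8 for the up-to-80% spot saving.
    """
    rate = HOURLY_RATES[gpu] * (1.0 - spot_discount)
    return round(rate * hours, 2)

# A 12-hour MCMC chain on an on-demand H100:
print(estimate_cost("H100", 12))                      # 29.88
# The same chain on a spot H100 at the maximum 80% discount:
print(estimate_cost("H100", 12, spot_discount=0.8))   # 5.98
```

This is also a quick sanity check against the per-experiment budget cap described under Cost Management.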
## Pod Configuration

### Default Settings

```bash
# Set defaults for all new pods
hubify pod config --default-gpu h100
hubify pod config --default-region us-east
hubify pod config --idle-timeout 15m
```
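The `--idle-timeout` value is a duration string like `15m`. A sketch of how such strings map to seconds (a hypothetical helper for illustration, not part of the `hubify` CLI):

```python
import re

def parse_duration(s: str) -> int:
    """Parse a duration like '90s', '15m', or '2h' into seconds.

    Hypothetical helper illustrating the duration format accepted by
    flags such as --idle-timeout; not Hubify's actual parser.
    """
    m = re.fullmatch(r"(\d+)([smh])", s.strip())
    if not m:
        raise ValueError(f"bad duration: {s!r}")
    value, unit = int(m.group(1)), m.group(2)
    return value * {"s": 1, "m": 60, "h": 3600}[unit]

print(parse_duration("15m"))  # 900
```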
### Docker Images

Hubify provides pre-built images with common scientific packages:
| Image | Contents |
|---|---|
| `hubify/base:latest` | Python 3.11, CUDA 12, PyTorch 2.1 |
| `hubify/cosmo:latest` | Base + Cobaya, GetDist, Astropy, healpy |
| `hubify/ml:latest` | Base + Transformers, Accelerate, Datasets |
| `hubify/astro:latest` | Base + Astropy, Photutils, SEP, Source Extractor |

```bash
hubify pod config --default-image hubify/cosmo:latest
```
### SSH Access

```bash
# Add your SSH key
hubify pod ssh-key add --file ~/.ssh/id_ed25519.pub

# SSH into a running pod
hubify pod ssh pod-abc123
```
## Performance Tips
- Use a `DataLoader` for GPU inference: `num_workers=16, pin_memory=True, prefetch_factor=4` gives a 32x speedup over serial processing
- Pre-stage large datasets on persistent storage so pods start instantly
- Use spot instances for non-urgent experiments (set the `--spot` flag)
- Match the GPU to the workload: do not use an H200 for figure generation
```bash
# Run on a spot instance
hubify experiment run --name "overnight-chain" --pod h100 --spot
```
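The `DataLoader` settings from the first tip look like this in practice. A minimal sketch with a toy dataset (the dataset class and sizes are hypothetical; the actual speedup depends on your I/O and model):

```python
# DataLoader configured as the performance tip recommends for GPU inference.
import torch
from torch.utils.data import DataLoader, Dataset

class SpectraDataset(Dataset):
    """Toy stand-in for a real inference dataset (hypothetical)."""
    def __init__(self, n: int = 256):
        self.data = torch.randn(n, 128)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return self.data[idx]

loader = DataLoader(
    SpectraDataset(),
    batch_size=64,
    num_workers=16,    # parallel worker processes feeding the GPU
    pin_memory=True,   # page-locked host memory for faster host-to-device copies
    prefetch_factor=4, # batches prefetched per worker, hiding I/O latency
)
```

`pin_memory` only pays off when batches are copied to a CUDA device; on a CPU-only pod it is harmless but unnecessary.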
## Cost Management

```bash
# Set monthly budget
hubify pod budget --monthly 500

# Set per-experiment cap
hubify pod budget --per-experiment 50

# View current spend
hubify pod budget --show

# Alert at 80% of budget
hubify pod budget --alert-threshold 0.8
```
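The alert threshold is a simple fraction of the monthly budget. A sketch of the check (hypothetical logic for illustration, not Hubify's internals):

```python
def should_alert(spend: float, budget: float, threshold: float = 0.8) -> bool:
    """Return True once spend reaches `threshold` of the monthly budget."""
    return budget > 0 and spend >= threshold * budget

print(should_alert(400.0, 500.0))  # True: $400 is 80% of a $500 budget
print(should_alert(250.0, 500.0))  # False: only 50% spent
```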
## Persistent Storage

Upload datasets to RunPod persistent storage so they survive pod restarts:

```bash
# Upload a dataset
hubify pod storage upload ./planck_likelihood.tar.gz

# Mount in experiments
hubify experiment run --name "my-chain" --storage planck_likelihood.tar.gz
```
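Before relying on an uploaded dataset, it is worth verifying that the copy on persistent storage matches your local file. A sketch using a SHA-256 digest (a generic helper, not a `hubify` command):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file in 1 MiB chunks and return its SHA-256 hex digest.

    Compare the digest of the local archive against the copy on the pod
    to confirm the upload completed intact; not part of the hubify CLI.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Run the same function on the pod (over SSH) and locally; the two digests should match exactly.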
## Troubleshooting

<AccordionGroup>
<Accordion title="Pod stuck in provisioning">
The requested GPU type may be sold out. Try a different GPU or region:

```bash
hubify pod list --available
```
</Accordion>
<Accordion title="Out of memory (OOM)">
Upgrade to a GPU with more VRAM, or reduce batch size. The H200 (141 GB) handles the largest workloads.
</Accordion>
<Accordion title="Spot instance preempted">
Spot instances can be reclaimed. Use checkpointing for long experiments:

```bash
hubify experiment resume EXP-051 --from-checkpoint latest
```
</Accordion>
</AccordionGroup>
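Resuming after preemption only works if your experiment writes checkpoints it can restart from. A minimal save/resume pattern (the JSON file layout and function names are illustrative, not a Hubify API):

```python
# Checkpoint every N steps so a preempted spot run loses little work.
import json
import os

def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Atomically write a checkpoint: write a temp file, then rename.

    os.replace is atomic on POSIX, so a preemption mid-write never
    leaves a truncated checkpoint behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path: str):
    """Return (step, state), or (0, {}) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]
```

On restart, the experiment loads the latest checkpoint and continues from the recorded step instead of from scratch.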