RunPod Integration
Connect RunPod GPU pods to Hubify Labs for experiment execution.
RunPod is the primary GPU compute provider for Hubify Labs. This guide covers connecting your RunPod account, configuring pods, and optimizing for cost.
## Connecting RunPod

<Steps>
<Step title="Create a RunPod account">
Sign up at runpod.io and add billing information.
</Step>
<Step title="Generate an API key">
Go to RunPod Settings > API Keys and create a key with full access.
</Step>
<Step title="Add to Hubify">
```bash
hubify pod config --provider runpod --api-key "your-runpod-api-key"
```
</Step>
<Step title="Verify">
```bash
hubify pod config --test
```

```
RunPod connection: OK
Available GPUs: H200, H100, A100, A40, RTX 4090
Account balance: $245.00
```
</Step>
</Steps>
## Available GPU Types
| GPU | VRAM | Best For | Approx. Cost/hr |
|---|---|---|---|
| H200 | 141 GB | Large models, full-dataset anomaly detection | $3.89 |
| H100 | 80 GB | MCMC chains, training, most experiments | $2.49 |
| A100 | 80 GB | General GPU compute | $1.64 |
| A40 | 48 GB | Medium workloads, figure generation | $0.79 |
| RTX 4090 | 24 GB | Small models, prototyping | $0.44 |
Pricing varies by availability and region. Spot instances can be up to 80% cheaper.
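To budget an experiment before launching it, you can combine the approximate rates above with an expected runtime. A minimal sketch (the rates are the illustrative ones from the table and will drift with availability; the helper is not part of the `hubify` CLI):

```python
# Approximate on-demand hourly rates from the table above (USD/hr).
# These vary by region and availability; treat them as estimates only.
HOURLY_RATES = {"H200": 3.89, "H100": 2.49, "A100": 1.64, "A40": 0.79, "RTX 4090": 0.44}

def estimate_cost(gpu: str, hours: float, spot_discount: float = 0.0) -> float:
    """Estimated cost in USD for `hours` on the given GPU.

    `spot_discount` is a fraction, e.g. 0.8 for the up-to-80% spot saving.
    """
    rate = HOURLY_RATES[gpu] * (1.0 - spot_discount)
    return round(rate * hours, 2)

# A 12-hour MCMC chain on an on-demand H100:
print(estimate_cost("H100", 12))                      # 29.88
# The same chain on a spot H100 at the maximum 80% discount:
print(estimate_cost("H100", 12, spot_discount=0.8))   # 5.98
```

This is also a quick sanity check against the per-experiment budget cap described under Cost Management.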
## Pod Configuration

### Default Settings

```bash
# Set defaults for all new pods
hubify pod config --default-gpu h100
hubify pod config --default-region us-east
hubify pod config --idle-timeout 15m
```
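The `--idle-timeout` value is a duration string like `15m`. A sketch of how such strings map to seconds (a hypothetical helper for illustration, not part of the `hubify` CLI):

```python
import re

def parse_duration(s: str) -> int:
    """Parse a duration like '90s', '15m', or '2h' into seconds.

    Hypothetical helper illustrating the duration format accepted by
    flags such as --idle-timeout; not Hubify's actual parser.
    """
    m = re.fullmatch(r"(\d+)([smh])", s.strip())
    if not m:
        raise ValueError(f"bad duration: {s!r}")
    value, unit = int(m.group(1)), m.group(2)
    return value * {"s": 1, "m": 60, "h": 3600}[unit]

print(parse_duration("15m"))  # 900
```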
### Docker Images

Hubify provides pre-built images with common scientific packages:
| Image | Contents |
|---|---|
| `hubify/base:latest` | Python 3.11, CUDA 12, PyTorch 2.1 |
| `hubify/cosmo:latest` | Base + Cobaya, GetDist, Astropy, healpy |
| `hubify/ml:latest` | Base + Transformers, Accelerate, Datasets |
| `hubify/astro:latest` | Base + Astropy, Photutils, SEP, Source Extractor |

```bash
hubify pod config --default-image hubify/cosmo:latest
```
### SSH Access

```bash
# Add your SSH key
hubify pod ssh-key add --file ~/.ssh/id_ed25519.pub

# SSH into a running pod
hubify pod ssh pod-abc123
```
## Performance Tips
- Use a `DataLoader` for GPU inference: `num_workers=16, pin_memory=True, prefetch_factor=4` gives a 32x speedup over serial processing
- Pre-stage large datasets on persistent storage so pods start instantly
- Use spot instances for non-urgent experiments (set the `--spot` flag)
- Match the GPU to the workload: do not use an H200 for figure generation
```bash
# Run on a spot instance
hubify experiment run --name "overnight-chain" --pod h100 --spot
```
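The `DataLoader` settings from the first tip look like this in practice. A minimal sketch with a toy dataset (the dataset class and sizes are hypothetical; the actual speedup depends on your I/O and model):

```python
# DataLoader configured as the performance tip recommends for GPU inference.
import torch
from torch.utils.data import DataLoader, Dataset

class SpectraDataset(Dataset):
    """Toy stand-in for a real inference dataset (hypothetical)."""
    def __init__(self, n: int = 256):
        self.data = torch.randn(n, 128)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return self.data[idx]

loader = DataLoader(
    SpectraDataset(),
    batch_size=64,
    num_workers=16,    # parallel worker processes feeding the GPU
    pin_memory=True,   # page-locked host memory for faster host-to-device copies
    prefetch_factor=4, # batches prefetched per worker, hiding I/O latency
)
```

`pin_memory` only pays off when batches are copied to a CUDA device; on a CPU-only pod it is harmless but unnecessary.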
## Cost Management

```bash
# Set monthly budget
hubify pod budget --monthly 500

# Set per-experiment cap
hubify pod budget --per-experiment 50

# View current spend
hubify pod budget --show

# Alert at 80% of budget
hubify pod budget --alert-threshold 0.8
```
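The alert threshold is a simple fraction of the monthly budget. A sketch of the check (hypothetical logic for illustration, not Hubify's internals):

```python
def should_alert(spend: float, budget: float, threshold: float = 0.8) -> bool:
    """Return True once spend reaches `threshold` of the monthly budget."""
    return budget > 0 and spend >= threshold * budget

print(should_alert(400.0, 500.0))  # True: $400 is 80% of a $500 budget
print(should_alert(250.0, 500.0))  # False: only 50% spent
```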
## Persistent Storage

Upload datasets to RunPod persistent storage so they survive pod restarts:

```bash
# Upload a dataset
hubify pod storage upload ./planck_likelihood.tar.gz

# Mount in experiments
hubify experiment run --name "my-chain" --storage planck_likelihood.tar.gz
```
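Before relying on an uploaded dataset, it is worth verifying that the copy on persistent storage matches your local file. A sketch using a SHA-256 digest (a generic helper, not a `hubify` command):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file in 1 MiB chunks and return its SHA-256 hex digest.

    Compare the digest of the local archive against the copy on the pod
    to confirm the upload completed intact; not part of the hubify CLI.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Run the same function on the pod (over SSH) and locally; the two digests should match exactly.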
## Troubleshooting

<AccordionGroup>
<Accordion title="Pod stuck in provisioning">
The requested GPU type may be sold out. Try a different GPU or region:

```bash
hubify pod list --available
```
</Accordion>
<Accordion title="Out of memory (OOM)">
Upgrade to a GPU with more VRAM, or reduce batch size. The H200 (141 GB) handles the largest workloads.
</Accordion>
<Accordion title="Spot instance preempted">
Spot instances can be reclaimed. Use checkpointing for long experiments:

```bash
hubify experiment resume EXP-051 --from-checkpoint latest
```
</Accordion>
</AccordionGroup>
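Resuming after preemption only works if your experiment writes checkpoints it can restart from. A minimal save/resume pattern (the JSON file layout and function names are illustrative, not a Hubify API):

```python
# Checkpoint every N steps so a preempted spot run loses little work.
import json
import os

def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Atomically write a checkpoint: write a temp file, then rename.

    os.replace is atomic on POSIX, so a preemption mid-write never
    leaves a truncated checkpoint behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path: str):
    """Return (step, state), or (0, {}) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]
```

On restart, the experiment loads the latest checkpoint and continues from the recorded step instead of from scratch.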