GPU Compute
Manage GPU pods for experiments — provision H100/H200, monitor utilization, optimize costs, and SSH into running pods.
Hubify Labs integrates directly with GPU cloud providers to give you on-demand access to high-end compute. Currently powered by RunPod, with Modal serverless functions coming soon.
Pod Management
Provision
Specify GPU type and duration. The system finds the cheapest available pod matching your requirements.
Initialize
Your lab's environment is set up automatically: Python packages, data mounts, SSH keys, and monitoring agents.
Execute
Run experiments. Logs stream in real time. Intermediate results checkpoint to persistent storage.
Monitor
Track GPU utilization, memory, and cost in real time from Captain View or CLI.
Teardown
Pods shut down automatically when experiments complete. Results are synced before teardown.
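The five stages above form a simple forward progression. As an illustrative sketch (not Hubify's actual implementation), the lifecycle can be modeled as a small state machine:

```python
from enum import Enum, auto

class PodState(Enum):
    """Illustrative pod lifecycle states mirroring the stages above."""
    PROVISION = auto()
    INITIALIZE = auto()
    EXECUTE = auto()
    MONITOR = auto()
    TEARDOWN = auto()

# Normal forward progression through the lifecycle
TRANSITIONS = {
    PodState.PROVISION: PodState.INITIALIZE,
    PodState.INITIALIZE: PodState.EXECUTE,
    PodState.EXECUTE: PodState.MONITOR,
    PodState.MONITOR: PodState.TEARDOWN,
}

def run_lifecycle(start=PodState.PROVISION):
    """Walk the states in order, returning the path taken."""
    path = [start]
    while path[-1] in TRANSITIONS:
        path.append(TRANSITIONS[path[-1]])
    return path
```

In the real system, Execute and Monitor run concurrently; they are linearized here only to keep the sketch simple.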
GPU Options
| GPU | VRAM | Use Case | Approx. Cost |
|---|---|---|---|
| NVIDIA H200 | 141 GB | Large MCMC, multi-survey sweeps, foundation models | $4-6/hr |
| NVIDIA H100 | 80 GB | Training, medium MCMC, anomaly detection | $2-4/hr |
| NVIDIA A100 | 80 GB | General GPU compute, inference | $1-2/hr |
| NVIDIA A40 | 48 GB | Light GPU tasks, development | $0.50-1/hr |
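As a rough guide to reading the table, pod selection reduces to "cheapest GPU with enough VRAM." The specs below are copied from the table (using the low end of each price range); the `cheapest_fit` helper is a hypothetical sketch, not part of the hubify CLI:

```python
# VRAM (GB) and low-end hourly rate ($) from the table above
GPUS = {
    "H200": {"vram_gb": 141, "rate_low": 4.0},
    "H100": {"vram_gb": 80,  "rate_low": 2.0},
    "A100": {"vram_gb": 80,  "rate_low": 1.0},
    "A40":  {"vram_gb": 48,  "rate_low": 0.5},
}

def cheapest_fit(vram_needed_gb: float) -> str:
    """Return the cheapest listed GPU whose VRAM meets the requirement."""
    candidates = [
        (spec["rate_low"], name)
        for name, spec in GPUS.items()
        if spec["vram_gb"] >= vram_needed_gb
    ]
    if not candidates:
        raise ValueError("No listed GPU has enough VRAM")
    return min(candidates)[1]
```

For example, a 60 GB model fits on an A100 rather than an H100, because both have 80 GB but the A100 is cheaper per hour.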
Cost Controls
Set a monthly budget cap per lab:
# Set budget cap
hubify pod budget --monthly 500
# View current spend
hubify pod cost --month current
# Get cost forecast
hubify pod cost --forecast
When you approach the budget limit:
- New experiments queue instead of launching
- You receive a notification
- The orchestrator suggests cost-saving alternatives (smaller GPU, CPU-only preprocessing)
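The budget behavior amounts to a gate on projected spend. A minimal sketch, assuming a 90% warning threshold and illustrative action names (the product's actual threshold and internals may differ):

```python
def budget_decision(spent: float, est_cost: float, cap: float,
                    warn_ratio: float = 0.9) -> str:
    """Decide whether a new experiment launches or queues.

    warn_ratio is an assumed threshold for the approaching-limit
    notification, not a documented Hubify constant.
    """
    projected = spent + est_cost
    if projected > cap:
        return "queue"          # would exceed the cap: hold the job
    if projected >= warn_ratio * cap:
        return "launch+notify"  # near the cap: run, but alert the user
    return "launch"
```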
Auto-Optimization
The system picks the cheapest option for each experiment:
Experiment needs ~2 hours on H100 ($2/hr) = $4
Same experiment runs ~45 min on H200 ($5/hr) = $3.75
→ System picks H200 (cheaper overall despite higher hourly rate)
Override with explicit pod selection when needed.
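The comparison above reduces to minimizing rate × runtime. A minimal sketch of that selection, using the example's numbers (the helper names are illustrative):

```python
def total_cost(rate_per_hr: float, hours: float) -> float:
    """Total pod cost is hourly rate times estimated runtime."""
    return rate_per_hr * hours

def pick_cheapest(options: dict) -> str:
    """options maps GPU name -> (hourly rate, estimated hours)."""
    return min(options, key=lambda gpu: total_cost(*options[gpu]))

# Worked example from above: H200 at $5/hr for 45 min beats
# H100 at $2/hr for 2 hours ($3.75 vs $4.00)
options = {"H100": (2.0, 2.0), "H200": (5.0, 0.75)}
```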
Persistent Storage
Each lab gets persistent storage:
- Survives pod teardowns
- Lets you pre-stage large datasets for instant access
- Syncs experiment outputs automatically
- Supports configurable retention policies
# List persistent storage
hubify pod storage list
# Upload data to persistent storage
hubify pod storage upload ./large_dataset.fits
# Download results
hubify pod storage download /results/chain_samples.txt
SSH Access
Every running pod is accessible via SSH:
# Auto-connect to a pod
hubify pod ssh pod-abc123
# Get connection details
hubify pod info pod-abc123
# → SSH: root@205.196.19.52 -p 11452
Idle Detection
Warning: An idle GPU is wasted money. Hubify monitors utilization and takes action when pods sit idle.
When a pod finishes its experiment and nothing is queued:
- Alert sent to you and the orchestrator
- System suggests next experiments that could use the pod
- If auto-schedule is enabled, the next experiment deploys automatically
- If nothing is queued for 15 minutes, the pod is torn down automatically
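The idle policy is essentially a timer check against the 15-minute window stated above. A sketch with illustrative inputs and action names (`auto_schedule` and the returned strings are assumptions, not Hubify API values):

```python
def idle_action(idle_minutes: float, queued: bool, auto_schedule: bool,
                teardown_after: float = 15.0) -> str:
    """Decide what happens to an idle pod (illustrative logic)."""
    if queued and auto_schedule:
        return "deploy_next"   # next queued experiment starts automatically
    if queued:
        return "alert"         # work is waiting but needs manual approval
    if idle_minutes >= teardown_after:
        return "teardown"      # nothing queued for 15 min: reclaim the pod
    return "alert"             # alert + suggest experiments for the pod
```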
DataLoader Best Practices
For production GPU inference, always use optimized DataLoaders:
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=16,           # Parallel data loading
    pin_memory=True,          # Fast GPU transfer
    prefetch_factor=4,        # Batches prefetched per worker
    persistent_workers=True,  # Keep workers alive between epochs
)
Parallel workers, pinned memory, and prefetching together can speed up data loading dramatically compared with serial, single-worker processing (on the order of 32x in favorable cases); tune num_workers to your CPU count and storage bandwidth.
CLI Reference
hubify pod list # List all pods
hubify pod create --gpu h100 # Launch a pod
hubify pod status <id> # Check pod status
hubify pod ssh <id> # SSH into a pod
hubify pod stop <id> # Terminate a pod
hubify pod cost # View cost summary
hubify pod budget # Manage budget