GPU Compute

Manage GPU pods for experiments — provision H100/H200, monitor utilization, optimize costs, and SSH into running pods.

Hubify Labs integrates directly with GPU cloud providers to give you on-demand access to high-end compute. Currently powered by RunPod, with Modal serverless functions coming soon.

Pod Management

Provision

Specify GPU type and duration. The system finds the cheapest available pod matching your requirements.
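
Conceptually, the selection is just "cheapest available offer that matches the request." A minimal sketch, assuming hypothetical PodOffer records and a cheapest_offer helper rather than the actual provider-side scheduler:

from dataclasses import dataclass

@dataclass
class PodOffer:
    gpu_type: str       # e.g. "H100", "H200"
    hourly_usd: float   # on-demand price per hour
    available: bool     # provider currently has capacity

def cheapest_offer(offers, gpu_type):
    """Return the lowest-priced available offer for the requested GPU type."""
    candidates = [o for o in offers if o.available and o.gpu_type == gpu_type]
    return min(candidates, key=lambda o: o.hourly_usd) if candidates else None

offers = [
    PodOffer("H100", 2.49, True),
    PodOffer("H100", 2.19, True),
    PodOffer("H100", 1.99, False),   # cheapest listing, but no capacity right now
]
print(cheapest_offer(offers, "H100"))   # -> the available $2.19/hr offer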

Initialize

Your lab's environment is set up automatically: Python packages, data mounts, SSH keys, and monitoring agents.
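
What that setup amounts to can be pictured as a bootstrap routine run on the fresh pod. A rough sketch; the paths, requirements file, and metrics_agent module are illustrative placeholders, not the actual Hubify bootstrap:

import os
import subprocess

def bootstrap_pod(lab_public_key: str) -> None:
    """Illustrative pod setup: packages, data mount point, SSH key, monitoring agent."""
    # Install the lab's Python dependencies (requirements.txt is illustrative)
    subprocess.run(["pip", "install", "-r", "requirements.txt"], check=True)

    # Create the mount point where the lab's persistent storage is attached (path illustrative)
    os.makedirs("/workspace/persistent", exist_ok=True)

    # Authorize the lab's SSH public key so direct SSH access works
    os.makedirs("/root/.ssh", exist_ok=True)
    with open("/root/.ssh/authorized_keys", "a") as f:
        f.write(lab_public_key + "\n")

    # Start a metrics agent in the background so utilization reaches Captain View
    subprocess.Popen(["python", "-m", "metrics_agent"])   # module name illustrative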

Execute

Run experiments. Logs stream in real time. Intermediate results checkpoint to persistent storage.
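
A minimal checkpointing pattern for such an experiment, assuming the persistent volume is mounted at an illustrative path like /workspace/persistent (the real mount point may differ):

import json
import pathlib

CHECKPOINT_DIR = pathlib.Path("/workspace/persistent/checkpoints")   # illustrative mount point

def run_experiment(n_steps: int) -> dict:
    """Toy experiment loop that streams logs and checkpoints to persistent storage."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    state = {"step": 0, "loss": None}
    for step in range(1, n_steps + 1):
        state = {"step": step, "loss": 1.0 / step}                     # stand-in for real work
        print(f"step {step}  loss {state['loss']:.4f}", flush=True)    # streamed in real time
        if step % 100 == 0:
            # Intermediate results land on the persistent volume, so they survive teardown
            (CHECKPOINT_DIR / f"step_{step}.json").write_text(json.dumps(state))
    return state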

Monitor

Track GPU utilization, memory, and cost in real time from Captain View or CLI.
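
From inside the pod you can also poll the GPU directly with NVIDIA's NVML bindings, independent of Hubify's own monitoring agents. A small sketch (assumes a single-GPU pod, device index 0):

import pynvml   # NVIDIA's NVML bindings (pip install nvidia-ml-py)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)           # first (usually only) GPU on the pod

util = pynvml.nvmlDeviceGetUtilizationRates(handle)     # percent busy over the last sample window
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)            # bytes used / total

print(f"GPU util: {util.gpu}%  "
      f"memory: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")

pynvml.nvmlShutdown()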

Teardown

Pods shut down automatically when experiments complete. Results are synced before teardown.

GPU Options

GPU           VRAM     Use Case                                             Approx. Cost
NVIDIA H200   141 GB   Large MCMC, multi-survey sweeps, foundation models   $4-6/hr
NVIDIA H100   80 GB    Training, medium MCMC, anomaly detection             $2-4/hr
NVIDIA A100   80 GB    General GPU compute, inference                       $1-2/hr
NVIDIA A40    48 GB    Light GPU tasks, development                         $0.50-1/hr

Cost Controls

Set a monthly budget cap per lab:

# Set budget cap
hubify pod budget --monthly 500

# View current spend
hubify pod cost --month current

# Get cost forecast
hubify pod cost --forecast

When you approach the budget limit (the decision logic is sketched after this list):

  1. New experiments queue instead of launching
  2. You receive a notification
  3. The orchestrator suggests cost-saving alternatives (smaller GPU, CPU-only preprocessing)
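
A minimal sketch of that queue-or-launch decision; the 90% warning threshold and function name are assumptions for illustration, not the actual orchestrator logic:

def handle_submission(monthly_spend: float, budget_cap: float, est_cost: float) -> str:
    """Illustrative queue-or-launch decision based on projected spend."""
    if monthly_spend + est_cost <= 0.9 * budget_cap:
        return "launch"                      # comfortably under the cap
    if monthly_spend + est_cost <= budget_cap:
        return "launch-and-notify"           # close to the cap: warn the lab
    return "queue"                           # would exceed the cap: hold the experiment

# With a $500/month cap and $460 already spent, a $60 experiment is queued
print(handle_submission(monthly_spend=460, budget_cap=500, est_cost=60))   # -> queue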

Auto-Optimization

The system picks the cheapest option for each experiment:

Experiment needs ~2 hours on H100 ($2/hr) = $4
Same experiment runs ~45 min on H200 ($5/hr) = $3.75
→ System picks H200 (cheaper overall despite higher hourly rate)
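
The same comparison as a toy calculation, using the rates from the example above:

def job_cost(hourly_rate_usd: float, runtime_hours: float) -> float:
    """Total cost of one experiment: hourly rate times wall-clock runtime."""
    return hourly_rate_usd * runtime_hours

options = {
    "H100": job_cost(hourly_rate_usd=2.0, runtime_hours=2.0),    # 2 h at $2/hr    = $4.00
    "H200": job_cost(hourly_rate_usd=5.0, runtime_hours=0.75),   # 45 min at $5/hr = $3.75
}
best = min(options, key=options.get)
print(best, options[best])   # -> H200 3.75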

Override with explicit pod selection when needed.

Persistent Storage

Each lab gets persistent storage:

  • Survives pod teardowns
  • Lets you pre-stage large datasets for instant access
  • Syncs experiment outputs automatically
  • Supports configurable retention policies

# List persistent storage
hubify pod storage list

# Upload data to persistent storage
hubify pod storage upload ./large_dataset.fits

# Download results
hubify pod storage download /results/chain_samples.txt

SSH Access

Every running pod is accessible via SSH:

# Auto-connect to a pod
hubify pod ssh pod-abc123

# Get connection details
hubify pod info pod-abc123
# → SSH: root@205.196.19.52 -p 11452
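
If you would rather script against a pod than type commands interactively, the same connection details work with any SSH library. A sketch using paramiko (a third-party library, not part of the hubify CLI); the host and port are the example values above, and the key path is illustrative:

import os
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("205.196.19.52", port=11452, username="root",
               key_filename=os.path.expanduser("~/.ssh/id_ed25519"))   # key path illustrative

# Run a quick utilization check on the pod
_, stdout, _ = client.exec_command("nvidia-smi --query-gpu=utilization.gpu --format=csv")
print(stdout.read().decode())
client.close()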

Idle Detection

Warning: An idle GPU is wasted money. Hubify monitors utilization and takes action when pods sit idle.

When a pod finishes its experiment and nothing is queued (the teardown timer is sketched after this list):

  1. Alert sent to you and the orchestrator
  2. System suggests next experiments that could use the pod
  3. If auto-schedule is enabled, the next experiment deploys automatically
  4. If nothing is queued for 15 minutes, the pod tears down
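
A sketch of the teardown timer in that flow; the 15-minute limit matches the policy above, while the 5% utilization threshold, 60-second poll interval, and function names are illustrative assumptions:

import time

IDLE_LIMIT_S = 15 * 60       # teardown threshold from the policy above
POLL_INTERVAL_S = 60         # polling cadence (assumed)
IDLE_UTIL_PCT = 5            # "idle" utilization threshold (assumed)

def watch_pod(gpu_utilization, queue_is_empty, teardown):
    """Idle watchdog sketch: gpu_utilization() -> percent, queue_is_empty() -> bool."""
    idle_since = None
    while True:
        if gpu_utilization() < IDLE_UTIL_PCT and queue_is_empty():
            idle_since = idle_since or time.monotonic()
            if time.monotonic() - idle_since >= IDLE_LIMIT_S:
                teardown()            # sync results, then shut the pod down
                return
        else:
            idle_since = None         # activity resumed; reset the timer
        time.sleep(POLL_INTERVAL_S)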

DataLoader Best Practices

For production GPU inference, always use optimized DataLoaders:

from torch.utils.data import DataLoader

# `dataset` is any map-style torch.utils.data.Dataset holding your samples
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=16,          # parallel data-loading worker processes
    pin_memory=True,         # page-locked host memory for fast GPU transfer
    prefetch_factor=4,       # batches each worker prefetches ahead of time
    persistent_workers=True  # keep workers alive between epochs
)

This pattern can provide roughly a 32x speedup over serial, single-worker processing; the exact gain depends on batch size, I/O bandwidth, and worker count.
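
Because pin_memory=True stages batches in page-locked host memory, the host-to-device copy can overlap with compute when you pass non_blocking=True while moving tensors. A typical inference loop, assuming model is your network and loader is the DataLoader configured above:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()        # `model` is your network, assumed defined elsewhere

with torch.no_grad():
    for batch in loader:               # `loader` is the DataLoader configured above
        # non_blocking pairs with pin_memory: the copy overlaps with GPU compute
        batch = batch.to(device, non_blocking=True)
        outputs = model(batch)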

CLI Reference

hubify pod list              # List all pods
hubify pod create --gpu h100 # Launch a pod
hubify pod status <id>       # Check pod status
hubify pod ssh <id>          # SSH into a pod
hubify pod stop <id>         # Terminate a pod
hubify pod cost              # View cost summary
hubify pod budget            # Manage budget