hubify experiment
Manage the full lifecycle of GPU-powered experiments: creation, execution, monitoring, and result retrieval.
Commands
hubify experiment run
Run a new experiment from a natural-language description, a config file, or explicit options:
# Natural language
hubify experiment run "MCMC chain, 10K samples, Planck+BAO, H100 pod"
# From config file
hubify experiment run --file experiment.yaml
# With explicit options
hubify experiment run \
  --name "planck-bao-chain" \
  --script run_cobaya.py \
  --config planck_bao.yaml \
  --pod h100 \
  --timeout 4h
hubify experiment list
List experiments in the active lab:
hubify experiment list
ID NAME STATUS POD DURATION QC
EXP-054 planck-bao-chain complete h100 2h 14m PASS
EXP-053 act-anomaly-sweep running h200 1h 03m —
EXP-052 sdss-cross-match complete h100 45m PASS
EXP-051 test-convergence failed h100 12m FAIL
| Option | Description |
|---|---|
| `--status <s>` | Filter by status: `queued`, `running`, `complete`, or `failed` |
| `--limit <n>` | Number of results to show (default: 20) |
| `--all` | Show all experiments |
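A quick way to summarize a lab is to tally the STATUS column of the listing. The sketch below assumes the plain-text layout shown in the sample output (a header row, whitespace-separated columns, status in the third column, and names without spaces). `status_summary` is a hypothetical helper, not part of the CLI.

```shell
# Hypothetical helper: count experiments per status from the table output.
# Assumes the third whitespace-separated column is STATUS and that
# experiment names contain no spaces.
status_summary() {
  hubify experiment list --all |
    awk 'NR > 1 && NF >= 3 { count[$3]++ }
         END { for (s in count) print s, count[s] }' |
    sort
}
```

Calling `status_summary` prints one `status count` pair per line, sorted by status name.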
hubify experiment status
Get detailed status of a specific experiment:
hubify experiment status EXP-054
ID: EXP-054
Name: planck-bao-chain
Status: complete
Pod: h100-abc123
Started: 2026-04-14 10:42:01 UTC
Completed: 2026-04-14 12:56:15 UTC
Duration: 2h 14m
QC: PASS (R-hat: 1.03, samples: 10,241)
Outputs: chain_samples.txt, posterior_plot.png, qc_report.json
Cost: $4.28
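Scripts often need to block until an experiment finishes. The following is a minimal sketch that polls `hubify experiment status` and reads the `Status:` line, assuming the output format shown above and the four status values listed for `--status`. The function name `wait_for_experiment` and the 30-second default interval are illustrative choices, not CLI features.

```shell
# Hypothetical helper: poll `hubify experiment status` until the experiment
# leaves the queued/running states. Assumes a "Status: <value>" line.
# Returns 0 on complete, 1 on failed, 2 on an unrecognized status.
wait_for_experiment() {
  exp_id=$1
  interval=${2:-30}   # seconds between polls
  while :; do
    status=$(hubify experiment status "$exp_id" |
      awk '$1 == "Status:" { print $2; exit }')
    case $status in
      complete) return 0 ;;
      failed)   return 1 ;;
      queued|running) sleep "$interval" ;;
      *) echo "unexpected status: '$status'" >&2; return 2 ;;
    esac
  done
}
```

For example, `wait_for_experiment EXP-053 60 && hubify experiment outputs EXP-053 --download ./results/` would download outputs only after a successful run.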
hubify experiment outputs
List or download experiment outputs:
# Download all outputs
hubify experiment outputs EXP-054 --download ./results/
# List outputs without downloading
hubify experiment outputs EXP-054 --list
# Download a specific file
hubify experiment outputs EXP-054 --file posterior_plot.png --download ./
hubify experiment rerun
Rerun a completed or failed experiment:
# Rerun with same config
hubify experiment rerun EXP-051
# Rerun with modified parameters
hubify experiment rerun EXP-051 --override "pod=h200,timeout=8h"
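When overrides come from script variables, it can help to assemble the comma-separated string in one place. This is a small sketch that assumes `--override` takes `key=value` pairs joined by commas, as in the example above. `join_overrides` is a hypothetical helper.

```shell
# Hypothetical helper: join key=value arguments with commas for --override.
# The body runs in a subshell so the IFS change does not leak to the caller.
join_overrides() (
  IFS=,
  printf '%s\n' "$*"
)

# Usage (not run here):
#   hubify experiment rerun EXP-051 --override "$(join_overrides pod=h200 timeout=8h)"
```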
hubify experiment resume
Resume an experiment from the last checkpoint:
hubify experiment resume EXP-051 --from-checkpoint latest
hubify experiment stop
Stop a running experiment:
hubify experiment stop EXP-053
hubify experiment qc
View QC gate results:
hubify experiment qc EXP-054
QC Gate: PASS
Convergence (R-hat): 1.03 (threshold: 1.10) ✓
Minimum samples: 10,241 (threshold: 1,000) ✓
Chain completeness: 100% ✓
NaN/Inf check: Clean ✓
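The QC gate can drive what a script does next. The sketch below assumes the report begins with a `QC Gate: PASS` or `QC Gate: FAIL` line, as in the sample output. `qc_passed` and `download_if_passed` are hypothetical helper names, and `./results/` is only an illustrative default.

```shell
# Hypothetical helper: succeed only when the QC report contains a line
# starting with "QC Gate: PASS".
qc_passed() {
  hubify experiment qc "$1" | grep -q '^QC Gate: PASS'
}

# Hypothetical helper: download outputs only for experiments that passed QC.
download_if_passed() {
  exp_id=$1
  dest=${2:-./results/}
  if qc_passed "$exp_id"; then
    hubify experiment outputs "$exp_id" --download "$dest"
  else
    echo "skipping $exp_id: QC gate did not pass" >&2
    return 1
  fi
}
```

With these defined, `download_if_passed EXP-054 ./results/` fetches the outputs, while the same call for an experiment that failed QC prints a message and returns a non-zero status.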
Experiment Config Format
# experiment.yaml
name: "planck-bao-chain"
description: "Full MCMC chain on Planck+BAO likelihood"
script: run_cobaya.py
config: planck_bao.yaml
pod:
  gpu: h100
  timeout: 4h
  storage: 20GB
outputs:
  - chain_samples.txt
  - posterior_plot.png
qc:
  convergence_threshold: 1.10
  min_samples: 5000
depends_on:
  - EXP-050  # Must complete first
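Before submitting a config, a local pre-flight check can catch a misspelled script or config path. The sketch below makes several assumptions not confirmed by the reference: that `script:` and `config:` are both required, that their values are unquoted single-line paths, and that the paths are relative to the directory holding the YAML file. `check_experiment_files` is a hypothetical helper and is not a YAML parser.

```shell
# Hypothetical pre-flight check: confirm the files named by the `script:`
# and `config:` keys exist. Assumes unquoted single-line values and paths
# relative to the YAML file's directory; takes the first match per key.
check_experiment_files() {
  yaml=$1
  dir=$(dirname "$yaml")
  missing=0
  for key in script config; do
    path=$(awk -v k="$key:" '$1 == k { print $2; exit }' "$yaml")
    if [ -z "$path" ]; then
      echo "no '$key' entry in $yaml" >&2
      missing=1
    elif [ ! -f "$dir/$path" ]; then
      echo "missing $key file: $dir/$path" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical use would be `check_experiment_files experiment.yaml && hubify experiment run --file experiment.yaml`.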
Examples
# Run, then follow logs in real time (EXP-055 is the ID assigned to the new run)
hubify experiment run --file chain.yaml && hubify logs EXP-055 --follow
# List experiments that failed in the last 7 days
hubify experiment list --status failed --since 7d
# Batch rerun all failed experiments
hubify experiment list --status failed --json | jq -r '.[].id' | xargs -I{} hubify experiment rerun {}
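For larger batches, a loop that reports each ID and keeps going after a failed submission can be easier to audit than the `xargs` one-liner. The following is a minimal sketch. `rerun_all` is a hypothetical helper that reads one experiment ID per line from standard input and uses only the `rerun` command documented above.

```shell
# Hypothetical helper: rerun each experiment ID read from stdin (one per
# line), continuing past failures. Returns non-zero if any rerun failed.
rerun_all() {
  failed=0
  while IFS= read -r exp_id; do
    [ -n "$exp_id" ] || continue          # skip blank lines
    # Redirect stdin so the CLI cannot consume the remaining IDs.
    if hubify experiment rerun "$exp_id" </dev/null; then
      echo "resubmitted $exp_id"
    else
      echo "could not resubmit $exp_id" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Usage (not run here):
#   hubify experiment list --status failed --json | jq -r '.[].id' | rerun_all
```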