hubify experiment

Commands for the full lifecycle of GPU-powered experiments: creation, execution, monitoring, and result retrieval.

Commands

hubify experiment run

Run a new experiment. Accepts a natural-language description, a config file, or explicit options:

# Natural language
hubify experiment run "MCMC chain, 10K samples, Planck+BAO, H100 pod"

# From config file
hubify experiment run --file experiment.yaml

# With explicit options
hubify experiment run \
  --name "planck-bao-chain" \
  --script run_cobaya.py \
  --config planck_bao.yaml \
  --pod h100 \
  --timeout 4h

hubify experiment list

List experiments in the active lab:

hubify experiment list
ID        NAME                STATUS     POD     DURATION  QC
EXP-054   planck-bao-chain    complete   h100    2h 14m    PASS
EXP-053   act-anomaly-sweep   running    h200    1h 03m    —
EXP-052   sdss-cross-match    complete   h100    45m       PASS
EXP-051   test-convergence    failed     h100    12m       FAIL

Option         Description
--status <s>   Filter by status: queued, running, complete, failed
--limit <n>    Number of results (default: 20)
--all          Show all experiments
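
When the human-readable table is all that is at hand, the rows can be filtered with awk. The sketch below assumes the column layout shown in the listing above (ID first, STATUS third) stays stable; `ids_with_status` is a hypothetical helper name, not part of the CLI.

```shell
# Hypothetical helper: print the ID of every row whose STATUS column equals $1.
# Assumes whitespace-separated columns with ID in column 1 and STATUS in column 3.
ids_with_status() {
  awk -v want="$1" 'NR > 1 && $3 == want { print $1 }'
}

# A few rows copied from the listing above, used here as stand-in input.
sample_list='ID        NAME                STATUS     POD     DURATION  QC
EXP-054   planck-bao-chain    complete   h100    2h 14m    PASS
EXP-052   sdss-cross-match    complete   h100    45m       PASS
EXP-051   test-convergence    failed     h100    12m       FAIL'

printf '%s\n' "$sample_list" | ids_with_status failed
```

In practice the input would come from `hubify experiment list` instead of the sample variable. Scripts that need a stable format are probably better served by the `--json` output used in the Examples section.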

hubify experiment status

Get detailed status of a specific experiment:

hubify experiment status EXP-054
ID:          EXP-054
Name:        planck-bao-chain
Status:      complete
Pod:         h100-abc123
Started:     2026-04-14 10:42:01 UTC
Completed:   2026-04-14 12:56:15 UTC
Duration:    2h 14m
QC:          PASS (R-hat: 1.03, samples: 10,241)
Outputs:     chain_samples.txt, posterior_plot.png, qc_report.json
Cost:        $4.28
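
The `Key: value` layout above lends itself to simple scripting. The following is a minimal sketch, assuming that layout is stable; `status_field` and `wait_for_experiment` are hypothetical helper names, and the polling interval is an arbitrary choice.

```shell
# Hypothetical helper: print the value of one field from status output on stdin.
status_field() {
  awk -v key="$1" -F': *' '$1 == key { sub(/^[^:]*: */, ""); print; exit }'
}

# Hypothetical helper: poll until the experiment reaches a terminal status.
# $1 = experiment ID, $2 = poll interval in seconds (default 60).
# Returns 0 on complete, 1 on failed, 2 if no status could be read.
wait_for_experiment() {
  while :; do
    state=$(hubify experiment status "$1" | status_field Status)
    case "$state" in
      complete) return 0 ;;
      failed)   return 1 ;;
      "")       return 2 ;;
    esac
    sleep "${2:-60}"
  done
}

# Part of the sample output above, used here as stand-in input.
sample_status='ID:          EXP-054
Status:      complete
Started:     2026-04-14 10:42:01 UTC
Duration:    2h 14m'

printf '%s\n' "$sample_status" | status_field Duration
```

A call such as `wait_for_experiment EXP-053 30` would then block until the run finishes, which can be convenient before downloading outputs.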

hubify experiment outputs

Download or list experiment outputs:

# Download all outputs
hubify experiment outputs EXP-054 --download ./results/

# List outputs without downloading
hubify experiment outputs EXP-054 --list

# Download a specific file
hubify experiment outputs EXP-054 --file posterior_plot.png --download ./

hubify experiment rerun

Rerun a completed or failed experiment:

# Rerun with same config
hubify experiment rerun EXP-051

# Rerun with modified parameters
hubify experiment rerun EXP-051 --override "pod=h200,timeout=8h"

hubify experiment resume

Resume an experiment from the last checkpoint:

hubify experiment resume EXP-051 --from-checkpoint latest

hubify experiment stop

Stop a running experiment:

hubify experiment stop EXP-053

hubify experiment qc

View QC gate results:

hubify experiment qc EXP-054
QC Gate: PASS
  Convergence (R-hat):   1.03 (threshold: 1.10) ✓
  Minimum samples:       10,241 (threshold: 1,000) ✓
  Chain completeness:    100% ✓
  NaN/Inf check:         Clean ✓
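
To make the first two checks concrete, here is a rough sketch of the comparison they imply. It is an illustration, not the actual gate: the source does not say whether the R-hat comparison is strict, so strict `<` is an assumption, and `qc_gate` is a hypothetical name. The chain completeness and NaN/Inf checks are not modeled.

```shell
# Hypothetical sketch of the two numeric checks.
# $1 = R-hat, $2 = R-hat threshold, $3 = sample count, $4 = minimum samples.
# Exits 0 when both checks pass, 1 otherwise.
qc_gate() {
  awk -v rhat="$1" -v rmax="$2" -v n="$3" -v nmin="$4" \
    'BEGIN { exit !((rhat + 0 < rmax + 0) && (n + 0 >= nmin + 0)) }'
}

# Values from the report above (thousands separators removed).
if qc_gate 1.03 1.10 10241 1000; then echo PASS; else echo FAIL; fi
```

awk is used because plain shell arithmetic does not handle fractional values such as 1.03.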

Experiment Config Format

# experiment.yaml
name: "planck-bao-chain"
description: "Full MCMC chain on Planck+BAO likelihood"
script: run_cobaya.py
config: planck_bao.yaml
pod:
  gpu: h100
  timeout: 4h
  storage: 20GB
outputs:
  - chain_samples.txt
  - posterior_plot.png
qc:
  convergence_threshold: 1.10
  min_samples: 5000
depends_on:
  - EXP-050  # Must complete first
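
Before submitting a config, it can help to confirm that the files it names exist locally. The sketch below reads simple top-level scalar keys with awk; `config_value` is a hypothetical helper, not a YAML parser, and it will mishandle nested keys, multi-line values, and `#` characters inside quoted strings.

```shell
# Hypothetical helper: print the value of a top-level scalar key.
# $1 = key name, $2 = path to the YAML file.
config_value() {
  awk -v key="$1" -F': *' '
    $1 == key {
      v = $0
      sub(/^[^:]*: */, "", v)      # drop "key: "
      sub(/[ \t]*#.*$/, "", v)     # drop a trailing comment
      gsub(/^"|"$/, "", v)         # drop surrounding double quotes
      print v
      exit
    }' "$2"
}

# Write a shortened copy of the config above to a temporary file.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
name: "planck-bao-chain"
script: run_cobaya.py
config: planck_bao.yaml
pod:
  gpu: h100
EOF

config_value script "$cfg"
config_value config "$cfg"
rm -f "$cfg"
```

A pre-flight check could then be as small as `[ -f "$(config_value script experiment.yaml)" ]` before calling `hubify experiment run --file experiment.yaml`.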

Examples

# Run, then follow logs in real time (replace EXP-055 with the ID assigned to the new run)
hubify experiment run --file chain.yaml && hubify logs EXP-055 --follow

# List all failed experiments this week
hubify experiment list --status failed --since 7d

# Batch rerun all failed experiments
hubify experiment list --status failed --json | jq -r '.[].id' | xargs -I{} hubify experiment rerun {}