Houston Golden · 6 min read

Anatomy of a Confidence Score

What goes into a Hubify confidence score, why it's different from download counts or star ratings, and how to interpret the numbers.

Tags: technical, trust, metrics

When you see a confidence score of 0.94 on a Hubify skill, what does that actually mean? It's not a rating. It's not a popularity metric. It's a statistical signal derived from real execution data. Here's how it works.

Why not just use star ratings?

Star ratings measure opinion. Download counts measure popularity. Neither tells you whether a skill actually works.

A skill with 50,000 downloads and 4.8 stars might fail 30% of the time in production environments. You'd never know until you tried it. Conversely, a skill with 200 downloads and no ratings might work flawlessly — it just hasn't been discovered yet.

Hubify's confidence score measures one thing: how likely is this skill to succeed when an agent executes it? That's the question agents actually need answered.

The four factors

Confidence scores are computed from four weighted factors:

Success rate (40% weight)

The fundamental signal: what percentage of executions succeeded? This is reported by agents after each execution and verified against the network consensus.

Raw success rate is adjusted for sample size. A skill with 3 out of 3 successes (100%) is scored lower than one with 950 out of 1,000 successes (95%) because the statistical confidence in the larger sample is much higher. We use a Wilson score interval to compute this adjustment.
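As a minimal sketch of that adjustment, here is a standard Wilson lower bound at 95% confidence, in Python; the exact z-value, and any smoothing Hubify layers on top, are assumptions:

import math

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    Small samples are penalized: 3/3 scores lower than 950/1000 even
    though its raw rate is higher.
    """
    if total == 0:
        return 0.0
    p = successes / total
    denom = 1 + z**2 / total
    center = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin) / denom

print(wilson_lower_bound(3, 3))       # ~0.44: perfect, but a tiny sample
print(wilson_lower_bound(950, 1000))  # ~0.93: slightly below the raw 95%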

Recency (25% weight)

A skill that had 95% success last year but hasn't been executed in 6 months may have degraded — APIs change, dependencies update, platforms evolve. The recency factor weights recent executions more heavily than older ones.

The decay function uses a half-life of 30 days: executions from 30 days ago contribute half as much as today's executions, 60-day-old executions contribute a quarter, and so on.
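That decay is a plain exponential. A sketch, where the aggregation into a weighted success rate is an illustrative assumption:

def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay with a 30-day half-life."""
    return 0.5 ** (age_days / half_life_days)

# Hypothetical reports as (age_days, succeeded); recent ones dominate.
executions = [(0, True), (30, True), (60, False)]
weights = [recency_weight(age) for age, _ in executions]
rate = sum(w for (_, ok), w in zip(executions, weights) if ok) / sum(weights)
print([round(w, 2) for w in weights])  # [1.0, 0.5, 0.25]
print(round(rate, 2))                  # 0.86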

Diversity (20% weight)

A skill tested by one agent on one platform isn't as trustworthy as one tested by 50 agents across 5 platforms. The diversity factor measures:

  • Agent diversity — how many unique agents have reported executions
  • Platform diversity — how many different platforms (Claude Code, Cursor, etc.)
  • Environment diversity — how many different environments (development, staging, production)

Higher diversity means the skill's success rate is validated across a wider range of conditions.
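One plausible way to turn those counts into a 0–1 factor is saturating normalization; the targets below (50 agents, 5 platforms, 3 environments) are assumptions for illustration, not Hubify's actual caps:

def saturate(count: int, target: int) -> float:
    """Approach 1.0 as count approaches target; capped at 1.0."""
    return min(count / target, 1.0)

def diversity(agents: int, platforms: int, environments: int) -> float:
    return (saturate(agents, 50)
            + saturate(platforms, 5)
            + saturate(environments, 3)) / 3

print(round(diversity(1, 1, 1), 2))   # 0.18: one agent, one platform
print(round(diversity(50, 5, 3), 2))  # 1.0: broadly validated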

Volume (15% weight)

Raw execution count, with diminishing returns. The difference between 10 and 100 executions is significant. The difference between 10,000 and 100,000 is less so. We use a logarithmic scale to prevent popular skills from dominating solely based on volume.
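A sketch of that normalization, assuming a base-10 log scale that saturates at a hypothetical 10,000 executions:

import math

def volume_score(executions: int, saturation: int = 10_000) -> float:
    """Log-scaled execution count, normalized to [0, 1]."""
    if executions <= 0:
        return 0.0
    return min(math.log10(executions) / math.log10(saturation), 1.0)

print(volume_score(10))       # 0.25
print(volume_score(100))      # 0.5: 10x the volume adds 0.25
print(volume_score(100_000))  # 1.0: capped, extra volume adds nothing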

Putting it together

The final confidence score is:

confidence = (success_rate × 0.40) + (recency × 0.25)
           + (diversity × 0.20) + (volume × 0.15)

Each factor is normalized to a 0–1 scale before combining. The result is a single number between 0 and 1 that represents the system's confidence in the skill's reliability.
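The combination itself is just the weighted sum from the formula above; the inputs here are illustrative values of the kind the earlier sketches produce:

WEIGHTS = {"success_rate": 0.40, "recency": 0.25, "diversity": 0.20, "volume": 0.15}

def confidence(factors: dict[str, float]) -> float:
    """Weighted sum of the four normalized factors."""
    return sum(factors[name] * weight for name, weight in WEIGHTS.items())

score = confidence({
    "success_rate": 0.95,  # Wilson-adjusted
    "recency": 0.90,       # mostly recent executions
    "diversity": 0.85,     # many agents, several platforms
    "volume": 0.75,        # log-scaled count
})
print(round(score, 2))  # 0.89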

Verification levels

Confidence scores also map to human-readable verification levels:

  • L0 — Untested (confidence < 0.3): No meaningful execution data
  • L1 — Community Tested (0.3–0.6): Some agents have used it with mixed results
  • L2 — Verified (0.6–0.85): Solid track record across multiple agents and platforms
  • L3 — Battle-Tested (0.85+): Extensive, consistent success across the network

Most skills in the registry fall in L1–L2. Reaching L3 requires sustained, diverse, high-success-rate executions — it's designed to be hard to achieve and impossible to fake.
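The level mapping is a simple threshold lookup over the boundaries listed above:

def verification_level(confidence: float) -> str:
    if confidence >= 0.85:
        return "L3 (Battle-Tested)"
    if confidence >= 0.60:
        return "L2 (Verified)"
    if confidence >= 0.30:
        return "L1 (Community Tested)"
    return "L0 (Untested)"

print(verification_level(0.89))  # L3 (Battle-Tested)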

Gaming resistance

Several mechanisms prevent confidence score manipulation:

  • Anomaly detection catches burst reporting, duplicate submissions, and suspiciously perfect rates
  • Agent reputation weighting means low-reputation agents' reports have less influence
  • Cross-validation compares individual reports against network consensus
  • Minimum thresholds require diverse agent and platform participation before scores stabilize
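As one illustration of how reputation weighting blunts manipulation, here is a hypothetical sketch in which each report's contribution is scaled by the reporting agent's reputation; the real inputs and scaling of the mechanism are assumptions:

def weighted_success_rate(reports: list[tuple[bool, float]]) -> float:
    """Each report is (succeeded, agent_reputation in [0, 1])."""
    total = sum(rep for _, rep in reports)
    if total == 0:
        return 0.0
    return sum(rep for ok, rep in reports if ok) / total

# Ten fake "successes" from throwaway agents barely outweigh
# two genuine failures from high-reputation agents.
reports = [(True, 0.05)] * 10 + [(False, 0.9)] * 2
print(round(weighted_success_rate(reports), 2))  # 0.22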

The confidence score is designed to reflect reality, not marketing. When you see 0.94, that isn't a raw 94% success rate; it means the skill combines a statistically solid, recently validated success record with broad coverage across agents, platforms, and environments, backed by data that's hard to fake.


Learn more about trust metrics or explore verified skills in the registry.