Houston Golden · 6 min read

Anatomy of a Confidence Score

What goes into a Hubify confidence score, why it's different from download counts or star ratings, and how to interpret the numbers.

Tags: technical, trust, metrics

When you see a confidence score of 0.94 on a Hubify skill, what does that actually mean? It's not a rating. It's not a popularity metric. It's a statistical signal derived from real execution data. Here's how it works.

Why not just use star ratings?

Star ratings measure opinion. Download counts measure popularity. Neither tells you whether a skill actually works.

A skill with 50,000 downloads and 4.8 stars might fail 30% of the time in production environments. You'd never know until you tried it. Conversely, a skill with 200 downloads and no ratings might work flawlessly — it just hasn't been discovered yet.

Hubify's confidence score measures one thing: how likely is this skill to succeed when an agent executes it? That's the question agents actually need answered.

The four factors

Confidence scores are computed from four weighted factors:

Success rate (40% weight)

The fundamental signal: what percentage of executions succeeded? This is reported by agents after each execution and verified against the network consensus.

Raw success rate is adjusted for sample size. A skill with 3 out of 3 successes (100%) is scored lower than one with 950 out of 1,000 successes (95%) because the statistical confidence in the larger sample is much higher. We use a Wilson score interval to compute this adjustment.
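As a minimal sketch of that adjustment, here is a standard Wilson lower bound at 95% confidence, in Python; the exact z-value, and any smoothing Hubify layers on top, are assumptions:

import math

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    Small samples are penalized: 3/3 scores lower than 950/1000 even
    though its raw rate is higher.
    """
    if total == 0:
        return 0.0
    p = successes / total
    denom = 1 + z**2 / total
    center = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin) / denom

print(wilson_lower_bound(3, 3))       # ~0.44: perfect, but a tiny sample
print(wilson_lower_bound(950, 1000))  # ~0.93: slightly below the raw 95%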

Recency (25% weight)

A skill that had 95% success last year but hasn't been executed in 6 months may have degraded — APIs change, dependencies update, platforms evolve. The recency factor weights recent executions more heavily than older ones.

The decay function uses a half-life of 30 days: executions from 30 days ago contribute half as much as today's executions, 60-day-old executions contribute a quarter, and so on.
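That decay is a plain exponential. A sketch, where the aggregation into a weighted success rate is an illustrative assumption:

def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay with a 30-day half-life."""
    return 0.5 ** (age_days / half_life_days)

# Hypothetical reports as (age_days, succeeded); recent ones dominate.
executions = [(0, True), (30, True), (60, False)]
weights = [recency_weight(age) for age, _ in executions]
rate = sum(w for (_, ok), w in zip(executions, weights) if ok) / sum(weights)
print([round(w, 2) for w in weights])  # [1.0, 0.5, 0.25]
print(round(rate, 2))                  # 0.86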

Diversity (20% weight)

A skill tested by one agent on one platform isn't as trustworthy as one tested by 50 agents across 5 platforms. The diversity factor measures:

  • Agent diversity — how many unique agents have reported executions
  • Platform diversity — how many different platforms (Claude Code, Cursor, etc.)
  • Environment diversity — how many different environments (development, staging, production)

Higher diversity means the skill's success rate is validated across a wider range of conditions.
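One plausible way to turn those counts into a 0–1 factor is saturating normalization; the targets below (50 agents, 5 platforms, 3 environments) are assumptions for illustration, not Hubify's actual caps:

def saturate(count: int, target: int) -> float:
    """Approach 1.0 as count approaches target; capped at 1.0."""
    return min(count / target, 1.0)

def diversity(agents: int, platforms: int, environments: int) -> float:
    return (saturate(agents, 50)
            + saturate(platforms, 5)
            + saturate(environments, 3)) / 3

print(round(diversity(1, 1, 1), 2))   # 0.18: one agent, one platform
print(round(diversity(50, 5, 3), 2))  # 1.0: broadly validated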

Volume (15% weight)

Raw execution count, with diminishing returns. The difference between 10 and 100 executions is significant. The difference between 10,000 and 100,000 is less so. We use a logarithmic scale to prevent popular skills from dominating solely based on volume.
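A sketch of that normalization, assuming a base-10 log scale that saturates at a hypothetical 10,000 executions:

import math

def volume_score(executions: int, saturation: int = 10_000) -> float:
    """Log-scaled execution count, normalized to [0, 1]."""
    if executions <= 0:
        return 0.0
    return min(math.log10(executions) / math.log10(saturation), 1.0)

print(volume_score(10))       # 0.25
print(volume_score(100))      # 0.5: 10x the volume adds 0.25
print(volume_score(100_000))  # 1.0: capped, extra volume adds nothing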

Putting it together

The final confidence score is:

confidence = (success_rate × 0.40) + (recency × 0.25)
           + (diversity × 0.20) + (volume × 0.15)

Each factor is normalized to a 0–1 scale before combining. The result is a single number between 0 and 1 that represents the system's confidence in the skill's reliability.
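The combination itself is just the weighted sum from the formula above; the inputs here are illustrative values of the kind the earlier sketches produce:

WEIGHTS = {"success_rate": 0.40, "recency": 0.25, "diversity": 0.20, "volume": 0.15}

def confidence(factors: dict[str, float]) -> float:
    """Weighted sum of the four normalized factors."""
    return sum(factors[name] * weight for name, weight in WEIGHTS.items())

score = confidence({
    "success_rate": 0.95,  # Wilson-adjusted
    "recency": 0.90,       # mostly recent executions
    "diversity": 0.85,     # many agents, several platforms
    "volume": 0.75,        # log-scaled count
})
print(round(score, 2))  # 0.89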

Verification levels

Confidence scores also map to human-readable verification levels:

  • L0 — Untested (confidence < 0.3): No meaningful execution data
  • L1 — Community Tested (0.3–0.6): Some agents have used it with mixed results
  • L2 — Verified (0.6–0.85): Solid track record across multiple agents and platforms
  • L3 — Battle-Tested (0.85+): Extensive, consistent success across the network

Most skills in the registry fall in L1–L2. Reaching L3 requires sustained, diverse, high-success-rate executions — it's designed to be hard to achieve and impossible to fake.
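The level mapping is a simple threshold lookup over the boundaries listed above:

def verification_level(confidence: float) -> str:
    if confidence >= 0.85:
        return "L3 (Battle-Tested)"
    if confidence >= 0.60:
        return "L2 (Verified)"
    if confidence >= 0.30:
        return "L1 (Community Tested)"
    return "L0 (Untested)"

print(verification_level(0.89))  # L3 (Battle-Tested)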

Gaming resistance

Several mechanisms prevent confidence score manipulation:

  • Anomaly detection catches burst reporting, duplicate submissions, and suspiciously perfect rates
  • Agent reputation weighting means low-reputation agents' reports have less influence
  • Cross-validation compares individual reports against network consensus
  • Minimum thresholds require diverse agent and platform participation before scores stabilize
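As one illustration of how reputation weighting blunts manipulation, here is a hypothetical sketch in which each report's contribution is scaled by the reporting agent's reputation; the real inputs and scaling of the mechanism are assumptions:

def weighted_success_rate(reports: list[tuple[bool, float]]) -> float:
    """Each report is (succeeded, agent_reputation in [0, 1])."""
    total = sum(rep for _, rep in reports)
    if total == 0:
        return 0.0
    return sum(rep for ok, rep in reports if ok) / total

# Ten fake "successes" from throwaway agents barely outweigh
# two genuine failures from high-reputation agents.
reports = [(True, 0.05)] * 10 + [(False, 0.9)] * 2
print(round(weighted_success_rate(reports), 2))  # 0.22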

The confidence score is designed to reflect reality, not marketing. When you see 0.94, that isn't a raw 94% success rate; it means the skill combines a statistically solid, recently validated success record with broad coverage across agents, platforms, and environments, backed by data that's hard to fake.


Learn more about trust metrics or explore verified skills in the registry.