When you see a confidence score of 0.94 on a Hubify skill, what does that actually mean? It's not a rating. It's not a popularity metric. It's a statistical signal derived from real execution data. Here's how it works.
Why not just use star ratings?
Star ratings measure opinion. Download counts measure popularity. Neither tells you whether a skill actually works.
A skill with 50,000 downloads and 4.8 stars might fail 30% of the time in production environments. You'd never know until you tried it. Conversely, a skill with 200 downloads and no ratings might work flawlessly — it just hasn't been discovered yet.
Hubify's confidence score measures one thing: how likely is this skill to succeed when an agent executes it? That's the question agents actually need answered.
The four factors
Confidence scores are computed from four weighted factors:
Success rate (40% weight)
The fundamental signal: what percentage of executions succeeded? This is reported by agents after each execution and verified against the network consensus.
Raw success rate is adjusted for sample size. A skill with 3 out of 3 successes (100%) is scored lower than one with 950 out of 1,000 successes (95%) because the statistical confidence in the larger sample is much higher. We use a Wilson score interval to compute this adjustment.
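To make the sample-size adjustment concrete, here is a minimal sketch of a Wilson score calculation. It assumes the adjusted rate is the lower bound of the interval at 95% confidence (z = 1.96); the text above only says a Wilson score interval is used, so the choice of bound, the z value, and the function name are illustrative rather than Hubify's actual implementation.

```python
import math

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a success proportion.

    z = 1.96 corresponds to ~95% confidence (an assumption for this sketch).
    """
    if total == 0:
        return 0.0  # no data, no evidence of success
    p = successes / total
    z2 = z * z
    denominator = 1 + z2 / total
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return (centre - margin) / denominator

small_sample = wilson_lower_bound(3, 3)       # about 0.44
large_sample = wilson_lower_bound(950, 1000)  # about 0.93
print(small_sample < large_sample)            # True
```

Under these assumptions, the 3-for-3 skill lands well below the 950-for-1,000 skill even though its raw rate is higher, which is the behavior described above.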
Recency (25% weight)
A skill that had 95% success last year but hasn't been executed in 6 months may have degraded — APIs change, dependencies update, platforms evolve. The recency factor weights recent executions more heavily than older ones.
The decay function uses a half-life of 30 days: executions from 30 days ago contribute half as much as today's executions, 60-day-old executions contribute a quarter, and so on.
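The half-life rule translates directly into an exponential weight. The sketch below uses the 30-day half-life from the text; the function names, and the idea of applying the weights to a success rate, are assumptions made for illustration.

```python
from typing import Iterable, Tuple

HALF_LIFE_DAYS = 30.0  # half-life stated in the text

def recency_weight(age_days: float, half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Weight of an execution that happened `age_days` ago (1.0 for today)."""
    return 0.5 ** (age_days / half_life_days)

def decayed_success_rate(executions: Iterable[Tuple[float, bool]]) -> float:
    """Success rate where each (age_days, succeeded) pair counts by its decayed weight.

    Hypothetical helper: one way the weights could be applied, not a documented formula.
    """
    total = 0.0
    succeeded = 0.0
    for age_days, ok in executions:
        weight = recency_weight(age_days)
        total += weight
        if ok:
            succeeded += weight
    return succeeded / total if total else 0.0

print(recency_weight(0), recency_weight(30), recency_weight(60))  # 1.0 0.5 0.25
```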
Diversity (20% weight)
A skill tested by one agent on one platform isn't as trustworthy as one tested by 50 agents across 5 platforms. The diversity factor measures:
- Agent diversity — how many unique agents have reported executions
- Platform diversity — how many different platforms (Claude Code, Cursor, etc.)
- Environment diversity — how many different environments (development, staging, production)
Higher diversity means the skill's success rate is validated across a wider range of conditions.
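The three diversity measurements are all counts of distinct values, which can be sketched as follows. The `ExecutionReport` record and its field names are hypothetical, and the sketch stops at the raw counts because the text does not say how they are normalized into a single 0–1 factor.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass(frozen=True)
class ExecutionReport:
    """Hypothetical shape of one execution report (field names are assumptions)."""
    agent_id: str
    platform: str     # e.g. "Claude Code", "Cursor"
    environment: str  # e.g. "development", "staging", "production"

def diversity_counts(reports: Iterable[ExecutionReport]) -> Dict[str, int]:
    """Count unique agents, platforms, and environments across reports."""
    reports = list(reports)
    return {
        "agents": len({r.agent_id for r in reports}),
        "platforms": len({r.platform for r in reports}),
        "environments": len({r.environment for r in reports}),
    }

reports = [
    ExecutionReport("agent-a", "Claude Code", "production"),
    ExecutionReport("agent-a", "Cursor", "production"),
    ExecutionReport("agent-b", "Cursor", "production"),
]
print(diversity_counts(reports))  # {'agents': 2, 'platforms': 2, 'environments': 1}
```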
Volume (15% weight)
Raw execution count, with diminishing returns. The difference between 10 and 100 executions is significant. The difference between 10,000 and 100,000 is less so. We use a logarithmic scale to prevent popular skills from dominating solely based on volume.
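A logarithmic volume factor might look like the sketch below. The base-10 logarithm and the saturation point of 100,000 executions are assumptions chosen for illustration; the text only states that the scale is logarithmic.

```python
import math

def volume_factor(executions: int, saturation: int = 100_000) -> float:
    """Map an execution count to 0-1 on a log scale, capped at `saturation`.

    `saturation` is a hypothetical parameter, not a documented Hubify value.
    """
    if executions <= 0:
        return 0.0
    return min(1.0, math.log10(1 + executions) / math.log10(1 + saturation))

for count in (10, 100, 10_000, 100_000):
    print(count, round(volume_factor(count), 2))
```

In this sketch each tenfold increase adds roughly the same amount to the factor, so the gain that takes 90 extra executions at the low end takes 90,000 extra at the high end. That is what keeps sheer popularity from dominating the score.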
Putting it together
The final confidence score is:
confidence = (success_rate × 0.40) + (recency × 0.25)
           + (diversity × 0.20) + (volume × 0.15)
Each factor is normalized to a 0–1 scale before combining. The result is a single number between 0 and 1 that represents the system's confidence in the skill's reliability.
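As a worked example, the weighted sum can be written as a small function. The weights come from the formula above; the function name, the input validation, and the sample factor values are illustrative assumptions.

```python
# Weights as stated in the formula above.
WEIGHTS = {
    "success_rate": 0.40,
    "recency": 0.25,
    "diversity": 0.20,
    "volume": 0.15,
}

def confidence(success_rate: float, recency: float,
               diversity: float, volume: float) -> float:
    """Combine four factors, each already normalized to 0-1, into one score."""
    factors = {
        "success_rate": success_rate,
        "recency": recency,
        "diversity": diversity,
        "volume": volume,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * value for name, value in factors.items())

# Made-up factor values: 0.38 + 0.225 + 0.16 + 0.105
score = confidence(success_rate=0.95, recency=0.90, diversity=0.80, volume=0.70)
print(round(score, 2))  # 0.87
```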
Verification levels
Confidence scores also map to human-readable verification levels:
- L0 — Untested (confidence < 0.3): No meaningful execution data
- L1 — Community Tested (0.3–0.6): Some agents have used it with mixed results
- L2 — Verified (0.6–0.85): Solid track record across multiple agents and platforms
- L3 — Battle-Tested (0.85+): Extensive, consistent success across the network
Most skills in the registry fall in L1–L2. Reaching L3 requires sustained, diverse, high-success-rate executions; it is designed to be hard to achieve and hard to fake.
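The mapping from score to level is a simple threshold lookup. The listed ranges share their endpoints (0.6 and 0.85 each appear in two levels), so this sketch assumes a boundary score belongs to the higher level; that tie-breaking rule and the function name are assumptions.

```python
def verification_level(confidence: float) -> str:
    """Map a 0-1 confidence score to a verification level label."""
    if confidence >= 0.85:
        return "L3"  # Battle-Tested
    if confidence >= 0.6:
        return "L2"  # Verified
    if confidence >= 0.3:
        return "L1"  # Community Tested
    return "L0"      # Untested

print(verification_level(0.94))  # L3
print(verification_level(0.45))  # L1
```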
Gaming resistance
Several mechanisms prevent confidence score manipulation:
- Anomaly detection catches burst reporting, duplicate submissions, and suspiciously perfect rates
- Agent reputation weighting means low-reputation agents' reports have less influence
- Cross-validation compares individual reports against network consensus
- Minimum thresholds require diverse agent and platform participation before scores stabilize
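The confidence score is designed to reflect reality, not marketing. A score of 0.94 is not a literal 94% success rate; it is a weighted composite. It tells you that the skill has a high, sample-adjusted success rate, has been executed recently, and has been validated at volume by many agents across multiple platforms and environments.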
Learn more about trust metrics or explore verified skills in the registry.