Confidence Score

The Confidence Score is Safua's single numeric signal of engineering readiness in a given school. It's a floating-point value in the closed interval [0.00, 1.00], published per-school on a learner's Proof Profile, and revised every time a mission in that school is reviewed.

This page documents the math, the thresholds, and the semantics.

What it means

A Confidence of 0.93 in the School of AI Engineering reads as: "Across this learner's reviewed missions in this school, the weighted aggregate of five-dimension scores places them in the top ~5% of the population Safua's faculty has calibrated against."

It does not mean "93% accurate." It does not mean "93% of employers would hire them." It is specifically an internal aggregate of mission-review scores, weighted by school-specific priorities, and compared against the calibration distribution the faculty established during rubric design.

Honesty about what this number is prevents hiring teams from treating it as something it isn't.

How it's computed

For a learner L in a school S, the Confidence is computed in three steps.

Step 1 — per-mission score

Each mission m in school S has a reviewer-assigned 5-tuple of dimension scores:

r_m = (r_correctness, r_code_quality, r_problem_solving,
       r_engineering_thinking, r_communication)

Each r_i is an integer in {1, 2, 3, 4, 5}, assigned by the named reviewer. The per-mission aggregate is a weighted sum:

             Σ w_i(S) · r_i
score_m(S) = ──────────────     for i = 1..5
                   25

where w_i(S) is the school-specific weighting for dimension i, and the weights sum to 5.0. For example, AI Safety & Governance uses w_correctness = 1.4, w_engineering_thinking = 1.2, w_communication = 1.2, w_code_quality = 0.6, w_problem_solving = 0.6 — putting correctness and engineering thinking at a premium and softening the signal on code quality relative to, say, Data Engineering. The per-school weighting table is public on each school page.

score_m(S) lives in [0.2, 1.0]: with every dimension at the minimum of 1, the weighted sum is 5.0 (because the weights sum to 5.0), so the score floors at 5/25 = 0.2; with every dimension at 5, the sum is 25.0 and the score tops out at 1.0.
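As a sketch, using the AI Safety & Governance weights quoted above (the function and constant names are illustrative, not Safua's internals):

```python
# Per-mission aggregate from Step 1: weighted sum of the five dimension
# scores, divided by 25 to land in [0.2, 1.0].
DIMENSIONS = ("correctness", "code_quality", "problem_solving",
              "engineering_thinking", "communication")

# AI Safety & Governance weights from the text; they sum to 5.0.
SAFETY_WEIGHTS = {"correctness": 1.4, "engineering_thinking": 1.2,
                  "communication": 1.2, "code_quality": 0.6,
                  "problem_solving": 0.6}

def mission_score(ratings: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted per-mission score in [0.2, 1.0]."""
    assert all(1 <= ratings[d] <= 5 for d in DIMENSIONS)
    assert abs(sum(weights.values()) - 5.0) < 1e-9
    # Weighted sum lies in [5, 25]; dividing by 25 maps it onto [0.2, 1.0].
    return sum(weights[d] * ratings[d] for d in DIMENSIONS) / 25.0

# An all-5 review scores 1.0; an all-1 review scores 0.2.
print(round(mission_score({d: 5 for d in DIMENSIONS}, SAFETY_WEIGHTS), 6))  # 1.0
```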

Step 2 — aggregate across missions

Confidence aggregates over the learner's recent missions in the school, weighted by recency:

          Σ α^(k-1) · score_m_k(S)
C(L, S) = ────────────────────────     for k = 1..n
                Σ α^(k-1)

where m_k is the k-th most recent mission in school S (k=1 the most recent, k=n the oldest counted), and α = 0.85 is a recency discount, so the most recent mission carries full weight and each older mission is discounted by a further factor of 0.85. In practice the most recent 6-8 missions dominate the signal, while a learner's earliest Build missions fade out of Confidence over time.

The discount is deliberate. A submission from twelve months ago, however strong, is less predictive of current engineering capacity than a submission from last week. Confidence is a current readiness signal, not a lifetime average.
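The aggregate can be sketched as follows (the function name is illustrative; `scores` is ordered most-recent-first to match k=1 being the most recent mission):

```python
# Step 2: recency-weighted mean of per-mission scores.
ALPHA = 0.85  # recency discount from the text

def aggregate(scores: list[float], alpha: float = ALPHA) -> float:
    """Recency-weighted mean of per-mission scores, most recent first."""
    # weights[0] = alpha^0 = 1 for the most recent mission (k=1),
    # decaying geometrically for older missions.
    weights = [alpha ** k for k in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# A strong recent mission pulls the aggregate above the plain mean (0.675):
print(round(aggregate([0.9, 0.6, 0.6, 0.6]), 3))  # 0.694
```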

Step 3 — normalise to the school distribution

The raw aggregate C(L, S) is then mapped onto the calibration distribution: the set of C-values Safua's faculty produced when they scored the benchmark missions during rubric design. The published Confidence is this percentile-mapped value, clipped to [0, 1].

This is why 0.93 reads as "top ~5%" rather than "93 out of 100": it's a percentile position against a calibrated reference, not a raw score.
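A minimal sketch of the percentile mapping, assuming the calibration distribution is available as a plain list of faculty-produced C-values (the sample data here is illustrative):

```python
import bisect

# Illustrative calibration sample; the real one is the set of C-values the
# faculty produced on the benchmark missions during rubric design.
CALIBRATION = [0.40, 0.55, 0.61, 0.66, 0.70, 0.74, 0.78, 0.81, 0.85, 0.90]

def percentile_map(raw: float, calibration: list[float]) -> float:
    """Percentile position of `raw` in the calibration sample, clipped to [0, 1]."""
    cal = sorted(calibration)
    rank = bisect.bisect_right(cal, raw)  # number of calibration values <= raw
    return min(1.0, max(0.0, rank / len(cal)))

# A raw aggregate above 8 of the 10 calibration values publishes as 0.8.
print(percentile_map(0.82, CALIBRATION))  # 0.8
```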

The 0.60 flag

Any published Confidence below 0.60 triggers an additional review pass. The flag is not pejorative — it exists because the model's estimation error widens toward the low end of the distribution, and a low-Confidence learner deserves a second human look before that signal becomes part of their public profile.

When the flag fires:

  1. A second faculty member — from the same school, different specialisation — re-reviews the most recent two missions.
  2. If the second reviewer's per-mission scores agree with the original reviewer's within ±0.15, the Confidence publishes as computed.
  3. If they diverge by more than that, the faculty resolve the disagreement in the monthly calibration session and the Confidence is re-computed from the reconciled scores.

In practice the 0.60 flag fires on about 8% of active-learner profiles at any given point in the residency, and the reconciliation outcome shifts the Confidence by an average of ±0.04. The purpose is to prevent a single atypical reviewer from pinning a learner low for reasons that wouldn't hold up to peer challenge.
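The routing above can be sketched as follows; the names and the score representation (per-mission scores in [0.2, 1.0]) are assumptions for illustration, not Safua's internal API:

```python
# Flag-and-reconcile routing for low-Confidence profiles.
FLAG_THRESHOLD = 0.60
AGREEMENT_BAND = 0.15  # per-mission score tolerance between reviewers

def needs_reconciliation(confidence: float,
                         original: list[float],
                         second_pass: list[float]) -> bool:
    """True only if the 0.60 flag fired AND the two reviewers' per-mission
    scores on the re-reviewed missions diverge by more than the band."""
    if confidence >= FLAG_THRESHOLD:
        return False  # no flag: publish as computed
    return any(abs(a - b) > AGREEMENT_BAND
               for a, b in zip(original, second_pass))

# Flag fires, but both re-reviewed missions agree within ±0.15: publish.
print(needs_reconciliation(0.55, [0.58, 0.62], [0.50, 0.70]))  # False
```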

The ±0.02 tolerance band

Confidence is reported with a tolerance band. A profile showing 0.78 should be read as "Confidence in the range roughly [0.76, 0.80]" — the reported number is the point estimate, the band is the noise.

The tolerance comes from two sources:

  1. Inter-rater variance. Even with calibration sessions, two qualified reviewers of the same mission produce a small amount of score disagreement. Safua measures this on an ongoing basis and publishes the per-school inter-rater standard deviation internally.
  2. Mission sampling. A learner's submitted-missions sample is a subset of the population of possible missions. The sample variance contributes to uncertainty on the aggregate.

Combined, these produce the ±0.02 envelope. Employers reading a Proof Profile should treat Confidence at this resolution — 0.78 and 0.82 are not meaningfully different; 0.78 and 0.93 are.
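The text does not specify how the two sources are combined; one conventional way to pool independent error sources, shown here purely as an assumed sketch with made-up inputs, is root-sum-of-squares:

```python
import math

def tolerance_band(inter_rater_sd: float, sampling_sd: float) -> float:
    """ASSUMPTION: independent error sources pooled in quadrature."""
    return math.sqrt(inter_rater_sd ** 2 + sampling_sd ** 2)

# Hypothetical inputs (not Safua's measured values): standard deviations of
# 0.012 and 0.016 would pool to a ±0.02 band.
print(round(tolerance_band(0.012, 0.016), 3))  # 0.02
```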

School-specific thresholds

Graduation from a school's Build phase into Prove requires clearing a school-specific Confidence threshold. The thresholds are published per-school:

| School | Graduation threshold |
|---|---|
| Data Engineering | 0.70 |
| Machine Learning | 0.72 |
| AI Engineering | 0.74 |
| Agentic AI | 0.76 |
| MLOps & Infrastructure | 0.75 |
| AI Safety & Governance | 0.80 |

The differences reflect the downside risk of a weak graduate in each domain. A weak Safety engineer is worse, in consequence, than a weak ETL engineer — so the bar is higher.
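The published thresholds as a lookup, with an illustrative sketch of the gate (the function name is an assumption, not Safua's API):

```python
# Build-to-Prove graduation thresholds, per school, from the table above.
GRADUATION_THRESHOLDS = {
    "Data Engineering": 0.70,
    "Machine Learning": 0.72,
    "AI Engineering": 0.74,
    "Agentic AI": 0.76,
    "MLOps & Infrastructure": 0.75,
    "AI Safety & Governance": 0.80,
}

def clears_build_phase(confidence: float, school: str) -> bool:
    """Whether a published Confidence clears the school's graduation bar."""
    return confidence >= GRADUATION_THRESHOLDS[school]

# The same 0.78 graduates Data Engineering but not Safety & Governance.
print(clears_build_phase(0.78, "AI Safety & Governance"))  # False
```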

What this score doesn't capture

A few things worth stating:

  • Breadth across schools. A 0.93 in one school says nothing about performance in another. A learner's full profile aggregates per school, not across.
  • Soft skills beyond communication. The Communication dimension covers written artifacts. It does not score stakeholder management, team collaboration, or live technical interviewing — those are orthogonal axes Safua doesn't measure directly.
  • Domain novelty. Every mission in a school is curated by the faculty against the school's canonical problem shapes. A mission outside the curated set (a greenfield project the learner proposed off-platform) doesn't feed into Confidence — the score reflects platform work specifically.

We publish these caveats because they prevent over-indexing. Use Confidence as one signal. The Proof Profile — the missions, the reviews, the reviewer names — is the fuller picture.

Versioning

The Confidence formula is versioned. The current version is v1.0. Every published Confidence includes the version under which it was computed, so historical scores remain comparable after formula updates.

When we change the formula — e.g. if future calibration work refines the school-specific weighting table — new submissions score under the new version; historical missions retain their original score. The Proof Profile shows both the current aggregate and, on expansion, the per-version history.

Further reading#