The Five-Dimension Rubric

Every review on Safua grades a submitted mission against five dimensions. The rubric is uniform across schools — the same five dimensions apply whether the mission is a pipeline at DataForge Labs or a compliance audit at Sentient Health. What changes between schools is the weighting: correctness matters more in Safety & Governance than in Agentic AI, code quality matters more in Data Engineering than in red-teaming.

The five dimensions:

  1. Correctness
  2. Code Quality
  3. Problem Solving
  4. Engineering Thinking
  5. Communication

Each is scored on a discrete 1–5 scale per mission, with reviewer commentary per dimension. This page walks through what each dimension measures, with a concrete example of a strong versus weak score drawn from the kind of work that shows up in Build.

Why five, and why these#

A single composite score is too reductive — it collapses a reviewer's judgment into one number and hides the shape of strength and weakness. Two dimensions is too few — in engineering, "correctness" and "style" don't exhaust the axes that matter. A dozen dimensions dilutes the signal and makes reviews unreadable.

Five is the point where the rubric is expressive enough to capture what senior engineers actually notice in a code review, while still small enough to summarise at a glance. We tested this against a thousand hand-annotated reviews before settling.

The specific five are not arbitrary. They map to the questions a hiring engineer asks about a junior candidate:

  • Does it work? → Correctness
  • Could I live with this code long-term? → Code Quality
  • Did they think about this, or copy it? → Problem Solving
  • Would they make good calls on the next thing too? → Engineering Thinking
  • Can they explain it to someone who wasn't in the conversation? → Communication

Correctness#

What it measures. Does the submitted artifact do what the brief asked, across the full input space including the edge cases the reviewer knows to probe? Correctness is the non-negotiable floor — a submission with a great architecture and wrong output doesn't pass.

Strong (5/5). A VisionArc edge-inference mission targets 50ms latency on 2GB RAM. The submission meets both budgets across all benchmark images, handles the three malformed-input cases the reviewer planted in the test harness, and degrades gracefully (returns a partial-confidence result) on the out-of-distribution sample rather than crashing.
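The graceful-degradation pattern in the strong submission can be sketched in a few lines. This is an illustration only, not the mission's actual code; the `OOD_THRESHOLD` value and the `classify` / `InferenceResult` names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

OOD_THRESHOLD = 0.35  # assumed confidence floor, not from the mission brief

@dataclass
class InferenceResult:
    label: Optional[str]
    confidence: float
    degraded: bool  # True when the model could not commit to a full answer

def classify(scores: dict) -> InferenceResult:
    """Return the best label, or a partial-confidence result instead of crashing."""
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score < OOD_THRESHOLD:
        # Out-of-distribution input: degrade gracefully rather than raise.
        return InferenceResult(label=None, confidence=best_score, degraded=True)
    return InferenceResult(label=best_label, confidence=best_score, degraded=False)
```

The point the reviewer is grading: every input path, including the one the model has never seen, produces a well-formed result.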

Weak (2/5). Same mission, but the submission meets the latency budget only on GPU (the brief said edge device), crashes on the malformed-input JPEG, and the out-of-distribution result is a confident false positive. The code is well-organised, but that's not what this dimension measures.

Code Quality#

What it measures. Would a senior engineer sign off on this code in a production PR review without major revisions? This is about naming, structure, test coverage, type discipline, and the boring-but-critical signals of a codebase that's going to survive six months of future contributors.

Strong (5/5). A DataForge pipeline mission. Functions are named for intent (partition_by_source_date, not part1), boundaries between pure transformations and side-effectful I/O are explicit, tests cover both the happy path and the three named failure modes (late-arriving data, schema drift, partial delivery), and the reviewer can read the dbt manifest in the time it takes to finish a coffee.
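The pure-transform / side-effectful-I/O boundary praised above can be sketched like this. The function name `partition_by_source_date` follows the example; the record shape and the `load_partitions` helper are invented for illustration.

```python
from collections import defaultdict

def partition_by_source_date(records: list) -> dict:
    """Pure transformation: same input always yields same output, no I/O."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record["source_date"]].append(record)
    return dict(partitions)

def load_partitions(partitions: dict, sink) -> None:
    """Side-effectful: all I/O isolated here, so the transform stays testable."""
    for date, rows in partitions.items():
        sink.write(date, rows)
```

Because the transform takes plain data and returns plain data, the late-arriving-data and schema-drift tests can exercise it without mocking any I/O.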

Weak (2/5). Same pipeline ships with three functions named process, processV2, and processFinal, a 400-line module that mixes I/O and transformation with no tests, and a TODO: fix this later comment on the reconciliation step. The output is correct. The code is going to rot.

Problem Solving#

What it measures. Did the submitter reach for the right abstraction and the right tool for the shape of the problem, or did they hammer a familiar nail into an unfamiliar wall? Problem solving evaluates the approach, not just the result.

Strong (5/5). A Sentient Health mission asks for a HIPAA audit trail on an existing AI inference pipeline. The submitter realises the audit isn't really about logging — it's about provenance, and they design an append-only event record keyed to the model version and the input hash, so the audit can reconstruct "what model said what about this patient, with which inputs, at what time" months later. A weaker submitter logs every request to a rotating file and calls it done.
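The append-only provenance record described above might look like the following sketch. The `AuditEvent` shape and `record_inference` helper are hypothetical; the point is that each event is keyed to the model version and a deterministic hash of the inputs, so "what model said what, with which inputs, at what time" is reconstructable later.

```python
import hashlib
import json
import time
from dataclasses import dataclass

@dataclass(frozen=True)  # events are immutable once written
class AuditEvent:
    model_version: str
    input_hash: str
    output: str
    timestamp: float

def record_inference(log: list, model_version: str, inputs: dict, output: str) -> AuditEvent:
    # Canonical JSON so identical inputs always hash to the same key.
    digest = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    event = AuditEvent(model_version, digest, output, time.time())
    log.append(event)  # append-only: events are never mutated or deleted
    return event
```

Contrast with the rotating log file: rotation discards exactly the history an auditor needs.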

Weak (2/5). An Agentic AI mission asks for a multi-agent research assistant with conflict resolution. The submission wires three agents to a message bus and hopes for the best — no planning step, no explicit arbitration, no reflection layer. When the reviewer probes with a contradictory query, the agents ping-pong until the token budget is exhausted. The code is clean. The approach is wrong.
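The missing arbitration step can be sketched in miniature. All names and the round budget here are illustrative; the design point is that conflict resolution is explicit and bounded, so a contradictory query terminates instead of ping-ponging until the token budget runs out.

```python
MAX_ROUNDS = 3  # assumed bound; what matters is that the loop terminates

def arbitrate(proposals: dict, revise) -> str:
    """Resolve conflicting agent answers within a bounded number of rounds."""
    for _ in range(MAX_ROUNDS):
        answers = set(proposals.values())
        if len(answers) == 1:          # consensus reached
            return answers.pop()
        proposals = revise(proposals)  # each agent sees the conflict and revises
    # No consensus within budget: deterministic tiebreak instead of looping forever
    return sorted(proposals.values())[0]
```

A production version would arbitrate over structured claims rather than strings, but even this shape gives the reviewer something to probe.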

Engineering Thinking#

What it measures. The hardest dimension to teach and the clearest predictor of senior promotion. Engineering thinking is about second-order consequences: what breaks when this scales 10×, what monitoring catches the regression, what happens to this system at 3am on a Sunday when the on-call engineer has never seen it before.

Strong (5/5). An MLOps mission asks for a model deployment with monitoring. The submission includes not just latency and error-rate dashboards, but a feature-drift alert that fires when the input distribution diverges by a documented Wasserstein threshold, a rollback procedure tested via a scripted canary promotion, and a runbook in Markdown that a new engineer could execute without Slack-pinging the author.
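The drift alert in the strong submission can be sketched as follows. The threshold value is invented (in the real submission it is documented per feature), and the helper implements the empirical 1-D Wasserstein distance directly for equal-sized samples rather than depending on a stats library.

```python
DRIFT_THRESHOLD = 0.1  # illustrative; the real value would live in the runbook

def wasserstein_1d(reference: list, live: list) -> float:
    """Empirical 1-D Wasserstein distance for equal-sized samples:
    mean absolute difference of the sorted values."""
    assert len(reference) == len(live)
    return sum(abs(a - b) for a, b in zip(sorted(reference), sorted(live))) / len(reference)

def drift_alert(reference: list, live: list) -> bool:
    """Fire when the live input distribution diverges past the documented threshold."""
    return wasserstein_1d(reference, live) > DRIFT_THRESHOLD
```

This is the difference between the strong and weak submissions: a dashboard shows drift to whoever happens to look; an alert with a documented threshold makes someone look.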

Weak (2/5). Same mission ships with Prometheus metrics wired up and a dashboard that looks good on the demo screen. No drift alerting, no rollback procedure, no runbook. In production this is a model that will silently decay and nobody will know until a customer complains.

Communication#

What it measures. Can the submitter explain what they built, why, and what they'd change, in a way that a future maintainer — or a stakeholder without their context — can follow? This dimension covers written docs, PR descriptions, architecture diagrams, and the reasoning trail inside the submitted artifact.

Strong (5/5). An AI Safety & Governance mission ships with a four-page write-up: the threat model the submitter used to scope red-teaming, the three attack vectors they found, which mitigations they applied versus which they escalated, and a closing section on what they'd still want to test if they had another week. The reviewer can hand this document to a compliance officer and it reads cleanly.

Weak (2/5). Same mission. Code is mostly correct. The write-up is three bullets that say "tested for prompt injection, added input validation, done." The reviewer can't tell whether the submitter thought hard about the threat model or stumbled into a partial fix.

How dimensions aggregate#

Per-mission scores feed into the Confidence Score. The Confidence aggregate uses school-specific weightings, so the same 5-tuple of dimension scores produces different Confidence values depending on which school the mission belongs to. The Confidence Score page documents the exact formula.
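As an illustration of how school-specific weighting changes the aggregate, consider a weighted mean over the five dimensions. The weights below are invented for the example and are not Safua's published weightings; the exact formula is documented on the Confidence Score page.

```python
DIMENSIONS = ("correctness", "code_quality", "problem_solving",
              "engineering_thinking", "communication")

SCHOOL_WEIGHTS = {  # hypothetical weightings, for illustration only
    "data_engineering":  (0.25, 0.30, 0.15, 0.20, 0.10),
    "safety_governance": (0.35, 0.10, 0.20, 0.15, 0.20),
}

def weighted_score(scores: tuple, school: str) -> float:
    """Aggregate a 5-tuple of 1-5 dimension scores under a school's weighting."""
    weights = SCHOOL_WEIGHTS[school]
    return sum(s * w for s, w in zip(scores, weights))
```

The same 5-tuple, say (5, 2, 4, 3, 4), aggregates higher under a correctness-heavy weighting than under a code-quality-heavy one, which is the behaviour described above.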

How reviewers calibrate#

Named reviewers aren't freelancers grading independently. Each school runs monthly calibration sessions where the faculty score a shared benchmark mission and reconcile disagreements. The calibration minutes are internal (they include candid peer critique of each other's review taste), but the outcome — tighter inter-rater reliability over time — shows up in the Confidence Score noise bands.

What this rubric isn't#

A few things worth stating explicitly.

  • Not a checklist. Reviewers don't grade mechanically. Strong reviewer commentary, even on a 3/5, catches nuances no checklist encodes.
  • Not a proxy for seniority. A strong Confidence in a School doesn't equal "ready to be a senior engineer." It means "ready for production work in this domain at the level Safua's faculty calibrated for." Seniority is about years and breadth.
  • Not static. The rubric gets refined. When we change it, we version the change and the existing profiles carry the version under which they were scored, so historical comparability holds.

Further reading#