The point of bringing AI into RCA is not to replace human judgment — it is to free human judgment from work that does not need it. When the repetitive layer is automated, what's left is the layer that actually moves the system: insight, calibration, and direction.
Only a small fraction of moderation errors ever receive a true root-cause attribution; the rest stay as raw error counts with no diagnostic signal.
Different markets, different QAs, and even the same QA across weeks produce inconsistent codes when case volumes spike.
Downstream calibration, policy feedback, and training are forced to operate on shallow signals — no evidence chains, no business-level attribution.
The result is a persistent gap — not just between the errors that occur and the errors that receive meaningful diagnosis, but between the diagnoses we produce and the structured insight the business actually needs to act on.
AI takes on the high-volume, repeatable attribution layer at machine speed and consistency. Human QA levels up from first-pass labeler to validator and strategist — applying judgment where it matters most.
A step-change on both axes: efficiency lifts as repetitive attribution is automated end-to-end, and quality lifts on two fronts — every error case can now receive a structured root-cause attribution, and scarce human expertise is concentrated on the decisions that actually move the system.
| Lever | What LLM RCA unlocks |
|---|---|
| Coverage | First-pass attribution on every case |
| Consistency | Same code book, same template, 24×7, no drift |
| Sustainability | Rolling error-pattern + gap detection becomes continuous |
| Structure | Reasoning + evidence + confidence — a queryable data asset |
| Dirty-data detection | Catches wrong KOM and drifting standards before they contaminate reporting |
| Business insight | Clustering & trend detection no single QA can do manually |
Broad where the model can already deliver. Focused where clean validation requires it. Test the hypothesis on solid ground before scaling.
Organized along Priority × Frequency × Model Comprehensibility × Signal Accessibility. "Others" available as free-text fallback.
The LLM has no hard language limits.
Where KOM provides a clear ground truth. Fuzzy cases deferred to later iterations.
No market / language restrictions. Maximize volume — let the model learn human RCA standards.
Online RCA codes, current process → baseline attribution.
Curated 1.0 codes + structured reasoning → model attribution.
Adjudicates diff cases; verdict drives accuracy scoring and iteration.
A case enters the pipeline with a tightly controlled signal whitelist — 7 fields in v1.0, plus the structured policy artifacts.
The model receives no free-form context beyond this whitelist — a deliberate constraint that prevents hallucinated references and keeps the reasoning chain auditable.
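The whitelist constraint can be sketched as a plain filter applied before the prompt is built. This is a minimal illustration under stated assumptions: the source says v1.0 uses 7 fields but does not list them, so every field name below except `policy_update_date` is a hypothetical placeholder, and `build_model_input` / `dropped_fields` are illustrative helpers rather than the pipeline's actual API.

```python
# Hypothetical v1.0 whitelist. The deck states there are 7 fields but does not
# name them; only policy_update_date appears in the source. The rest are placeholders.
SIGNAL_WHITELIST = frozenset({
    "policy_uid",
    "moderator_decision",
    "kom_decision",
    "policy_update_date",
    "market",
    "content_type",
    "qa_comment",
})


def build_model_input(raw_case: dict) -> dict:
    """Keep only whitelisted fields; anything else never reaches the model."""
    return {k: v for k, v in raw_case.items() if k in SIGNAL_WHITELIST}


def dropped_fields(raw_case: dict) -> list:
    """List the fields that were filtered out, so the drop itself is auditable."""
    return sorted(k for k in raw_case if k not in SIGNAL_WHITELIST)
```

Logging the dropped fields alongside the case keeps the filter auditable: a reviewer can see what the model was not shown.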
Static lookup of OVERKILL / MISCLASSIFICATION / MISS from policy UID. Non-negotiable.
Packages conflict signal + suspected-code pre-list. Constrains the candidate space.
LLM reads policy + signals. Bilateral judgment of each side's interpretation.
Up to 3 RCA-code candidates with working hypothesis. Generate, don't score.
Independent verification → final code + confidence + cap reason + full chain.
Aggregates failed / low-confidence cases. Drives prompt + data-field iteration.
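The two deterministic constraints in the node list above (the static error-type lookup and the cap of 3 candidates) could look roughly like this. The policy UIDs and the mapping are invented for illustration, and the sketch assumes a single policy UID keys the lookup, as the node description suggests.

```python
from enum import Enum


class ErrorType(Enum):
    OVERKILL = "OVERKILL"
    MISCLASSIFICATION = "MISCLASSIFICATION"
    MISS = "MISS"


# Hypothetical mapping: real policy UIDs are not given in the source.
ERROR_TYPE_BY_POLICY_UID = {
    "policy-001": ErrorType.OVERKILL,
    "policy-002": ErrorType.MISCLASSIFICATION,
    "policy-003": ErrorType.MISS,
}

MAX_CANDIDATES = 3  # "Up to 3 RCA-code candidates"


def lookup_error_type(policy_uid: str) -> ErrorType:
    """Static lookup. An unknown UID raises KeyError instead of guessing."""
    return ERROR_TYPE_BY_POLICY_UID[policy_uid]


def keep_candidates(candidates: list) -> list:
    """Truncate to the candidate limit, preserving order and adding no scores."""
    return candidates[:MAX_CANDIDATES]
```

Failing loudly on an unknown UID is one way to honor "non-negotiable": the error type is never inferred by the model.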
The LLM reads multimodal case material — keyframes, OCR, transcript, ASR — and policy texts. It outputs a structured case_diagnosis that distinguishes "the case is genuinely ambiguous" from "the moderator misread a clear policy."
The model does not jump to "the moderator was wrong." It evaluates both the moderator's stance and KOM's stance against the policy: a bilateral diff. This is what lets it detect cases where both sides are wrong, the policy is ambiguous, or the signals are insufficient.
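One way to picture the bilateral diff is as a small decision table over two independent judgments. The labels, the function name, and the tie-breaking rule for "both stances defensible" are assumptions made for illustration; in the pipeline these judgments come from the LLM, not from booleans.

```python
from enum import Enum
from typing import Optional


class Diagnosis(Enum):  # hypothetical label set
    MODERATOR_WRONG = "moderator_wrong"
    KOM_WRONG = "kom_wrong"
    BOTH_WRONG = "both_wrong"
    POLICY_AMBIGUOUS = "policy_ambiguous"
    SIGNALS_INSUFFICIENT = "signals_insufficient"


def bilateral_diff(moderator_ok: Optional[bool],
                   kom_ok: Optional[bool],
                   policy_clear: bool) -> Diagnosis:
    """Judge each side against the policy separately, then combine.

    None means that side could not be assessed from the available signals.
    """
    if moderator_ok is None or kom_ok is None:
        return Diagnosis.SIGNALS_INSUFFICIENT
    if not policy_clear:
        return Diagnosis.POLICY_AMBIGUOUS
    if not moderator_ok and not kom_ok:
        return Diagnosis.BOTH_WRONG
    if moderator_ok and not kom_ok:
        return Diagnosis.KOM_WRONG
    if kom_ok and not moderator_ok:
        return Diagnosis.MODERATOR_WRONG
    # Assumption: two conflicting stances that both hold up point to ambiguity.
    return Diagnosis.POLICY_AMBIGUOUS
```

Because KOM is evaluated as one of the two sides rather than taken as ground truth, a wrong KOM surfaces as its own outcome.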
Final RCA attribution.
Full structured chain — cites N3 phrases, policy spans, case-signal fields.
Capped when signals are insufficient. E.g. missing policy_update_date → cap 0.6.
Citations to every claim — auditable end-to-end.
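The four output parts above can be sketched as one record plus a capping rule. Only the `policy_update_date` cap of 0.6 comes from the source; the class name, field names, and the choice to apply the lowest matching cap are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Only this entry is stated in the source; further caps would be added here.
CONFIDENCE_CAPS = {"policy_update_date": 0.6}


@dataclass
class RcaResult:  # hypothetical shape of the final artifact
    code: str
    confidence: float
    cap_reason: Optional[str] = None
    reasoning_chain: list = field(default_factory=list)
    citations: list = field(default_factory=list)


def apply_caps(raw_confidence: float, signals: dict):
    """Lower confidence to the tightest cap triggered by a missing signal."""
    confidence, reason = raw_confidence, None
    for field_name, cap in CONFIDENCE_CAPS.items():
        if signals.get(field_name) is None and cap < confidence:
            confidence, reason = cap, f"missing {field_name}"
    return confidence, reason


# Usage: a case without policy_update_date is capped and the reason recorded.
conf, why = apply_caps(0.9, {"policy_uid": "policy-001"})
result = RcaResult(code="policy_understanding", confidence=conf, cap_reason=why)
```

Recording the cap reason next to the capped value keeps the confidence field explainable when the results are queried later.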
An interactive walkthrough of the pipeline — case input, node-by-node reasoning, and the final structured artifact.
→ Open Demo: 390c6625606b.aime-app.bytedance.net/demo.html
+6.5 pp · TL score 2.86 — the LLM is at near-launch quality where ground truth is clean.
−7.6 pp · TL score 1.98 — LLM faithfully tracks KOM even when TL disagrees. Drivers: KOM-leakage Approves, and Youth / Adult Sexualized Behaviors disputes.
TL 1–5 scoring across 456 reviewed cases. The distribution is bimodal: 22.8% fully usable, 50% not usable, and most of the low end traces back to cases where the QA TL disagrees with KOM.
| RCA Type | QA | LLM | GT | Read |
|---|---|---|---|---|
| policy_understanding | 88.5% | 80.1% | 64.0% | Both over-attribute |
| age_judgement | 0.0% | 6.4% | 4.7% | LLM net-new |
| policy_vague | 9.7% | 8.2% | 2.0% | Both over-attribute |
| slip / speed / SOP | 0.0% | 0.0% | 14.3% | Shared blind spot |
The blind spot is a training-data gap, not a model defect — codes that humans never used, the model never learned.
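The "Read" column in the table above follows from the gap between each source and GT, in percentage points. A short sketch of that arithmetic, using only the figures from the table (the function name is illustrative):

```python
# Shares copied from the table above (percent of cases per RCA type).
SHARES = {
    "policy_understanding": {"QA": 88.5, "LLM": 80.1, "GT": 64.0},
    "age_judgement": {"QA": 0.0, "LLM": 6.4, "GT": 4.7},
    "policy_vague": {"QA": 9.7, "LLM": 8.2, "GT": 2.0},
    "slip / speed / SOP": {"QA": 0.0, "LLM": 0.0, "GT": 14.3},
}


def gaps_vs_gt(shares: dict) -> dict:
    """Percentage-point gap to GT: positive = over-attributed, negative = under."""
    return {
        rca: {src: round(row[src] - row["GT"], 1) for src in ("QA", "LLM")}
        for rca, row in shares.items()
    }


for rca, gap in gaps_vs_gt(SHARES).items():
    print(f"{rca}: QA {gap['QA']:+.1f} pp, LLM {gap['LLM']:+.1f} pp")
```

On these figures, `policy_understanding` is over-attributed by 24.5 pp (QA) and 16.1 pp (LLM), while `slip / speed / SOP` is under-attributed by 14.3 pp on both sides, which is the shared blind spot.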
The repetitive layer is what AI automates. The layer that's left — insight, calibration, direction — is the one that actually moves the system.