[VSO-PPS] · 2026

LLM-Driven
RCA Sharing

The point of bringing AI into RCA is not to replace human judgment — it is to free human judgment from work that does not need it. When the repetitive layer is automated, what's left is the layer that actually moves the system: insight, calibration, and direction.

01 / 12 v1.0 Trust & Safety · PPS
01 · Why We Use LLM RCA · 02 / 12

Three structural ceilings
of today's human-only RCA

Coverage ceiling

High-quality RCA is labor-intensive.

Only a small fraction of moderation errors ever receive a true root-cause attribution; the rest stay as raw error counts with no diagnostic signal.

Consistency ceiling

Attribution quality degrades under load.

Different markets, different QAs, and even the same QA across weeks produce inconsistent codes when case volumes spike.

Insight ceiling

Descriptive, not structured.

Downstream calibration, policy feedback, and training are forced to operate on shallow signals — no evidence chains, no business-level attribution.

The result is a persistent gap — not just between the errors that occur and the errors that receive meaningful diagnosis, but between the diagnoses we produce and the structured insight the business actually needs to act on.

01.2 · Value Proposition · 03 / 12

A Co-Pilot model

AI takes on the high-volume, repeatable attribution layer at machine speed and consistency. Human QA levels up from first-pass labeler to validator and strategist — applying judgment where it matters most.

A step-change on both axes: efficiency lifts as repetitive attribution is automated end-to-end, and quality lifts on two fronts — every error case can now receive a structured root-cause attribution, and scarce human expertise is concentrated on the decisions that actually move the system.

Lever                | What LLM RCA unlocks
Coverage             | First-pass attribution on every case
Consistency          | Same code book, same template, 24×7, no drift
Sustainability       | Rolling error-pattern and gap detection becomes continuous
Structure            | Reasoning + evidence + confidence: a queryable data asset
Dirty-data detection | Catches wrong KOM and drifting standards before they contaminate reporting
Business insight     | Clustering and trend detection that no single QA can do manually
02 · v1.0 Design · 04 / 12

Scope — deliberately shaped

Broad where the model can already deliver. Focused where clean validation requires it. Test the hypothesis on solid ground before scaling.

RCA Codes

Curated subset

Organized along Priority × Frequency × Model Comprehensibility × Signal Accessibility. "Others" available as free-text fallback.

Markets

Global

The LLM has no hard language limits.

Case Types

Non-fuzzy: OMA + OS

Where KOM provides a clear ground truth. Fuzzy cases deferred to later iterations.

Training Set

This year's human RCA

No market / language restrictions. Maximize volume — let the model learn human RCA standards.

Validation: parallel test, QA Lead as Ground Truth

Human QA

Online RCA codes, current process → baseline attribution.

AIQA · LLM

Curated 1.0 codes + structured reasoning → model attribution.

QA Lead · GT

Adjudicates the cases where Human QA and the LLM differ; the verdict drives accuracy scoring and iteration.
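The parallel test can be scored in a few lines. The sketch below is illustrative only: the record layout (`qa_code`, `llm_code`, `gt_code`) and the sample rows are assumptions, not the team's schema, and it assumes every scored case carries a QA Lead verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelCase:
    case_id: str
    qa_code: str   # baseline attribution from the current human process
    llm_code: str  # model attribution
    gt_code: str   # QA Lead verdict, treated as ground truth (assumed present)


def score_parallel_test(cases):
    """Return consistency (QA vs. LLM) and each arm's accuracy against GT."""
    n = len(cases)
    if n == 0:
        raise ValueError("no cases to score")
    return {
        "consistency": sum(c.qa_code == c.llm_code for c in cases) / n,
        "qa_accuracy": sum(c.qa_code == c.gt_code for c in cases) / n,
        "llm_accuracy": sum(c.llm_code == c.gt_code for c in cases) / n,
    }


# Made-up rows, for illustration only.
sample = [
    ParallelCase("c1", "policy_understanding", "policy_understanding", "policy_understanding"),
    ParallelCase("c2", "policy_understanding", "age_judgement", "age_judgement"),
    ParallelCase("c3", "policy_vague", "policy_vague", "policy_understanding"),
    ParallelCase("c4", "policy_understanding", "age_judgement", "age_judgement"),
]
scores = score_parallel_test(sample)
```

Keeping consistency and accuracy as separate numbers matters: two arms can agree with each other and still both disagree with the QA Lead.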

03 · How LLM Works · 05 / 12

Inputs — a tight whitelist

A case enters the pipeline with a tightly controlled signal whitelist — 7 fields in v1.0, plus the structured policy artifacts.

The model receives no free-form context beyond this whitelist — a deliberate constraint that prevents hallucinated references and keeps the reasoning chain auditable.

Case signals (7)

video · video_title · transcript · sticker · keyframe · ocr · background_asr_text

Policy artifacts

<MODERATOR_POLICY_TEXT>: what the moderator applied
<CORRECT_POLICY_TEXT>: KOM ground truth
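One way to enforce the whitelist is to filter the case record before any prompt is built. This is a minimal sketch under stated assumptions: the function name `build_model_input`, the dictionary layout, and the `dropped_fields` audit list are hypothetical; only the seven field names and the two policy artifact names come from the slide.

```python
# The seven v1.0 case signals listed on the slide.
ALLOWED_CASE_SIGNALS = frozenset({
    "video", "video_title", "transcript", "sticker",
    "keyframe", "ocr", "background_asr_text",
})
# The two structured policy artifacts.
POLICY_ARTIFACTS = frozenset({"MODERATOR_POLICY_TEXT", "CORRECT_POLICY_TEXT"})


def build_model_input(raw_case: dict, policies: dict) -> dict:
    """Keep only whitelisted signals; refuse to run without both policy texts."""
    missing = POLICY_ARTIFACTS - set(policies)
    if missing:
        raise ValueError(f"missing policy artifacts: {sorted(missing)}")
    signals = {k: v for k, v in raw_case.items() if k in ALLOWED_CASE_SIGNALS}
    # Hypothetical audit aid: record what was filtered out.
    dropped = sorted(set(raw_case) - ALLOWED_CASE_SIGNALS)
    return {
        "signals": signals,
        "policies": {name: policies[name] for name in sorted(POLICY_ARTIFACTS)},
        "dropped_fields": dropped,
    }


payload = build_model_input(
    {"video_title": "demo", "ocr": "SALE", "uploader_note": "free-form text"},
    {"MODERATOR_POLICY_TEXT": "policy A", "CORRECT_POLICY_TEXT": "policy B"},
)
```

Here `uploader_note` is an invented non-whitelisted field; it ends up in `dropped_fields` and never reaches the model input.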
03.2 · Reasoning Pipeline · 06 / 12

Six nodes — five inline, one offline

N1

Conflict Pre-check

Static lookup of OVERKILL / MISCLASSIFICATION / MISS from policy UID. Non-negotiable.

N2

Result Bundling

Packages conflict signal + suspected-code pre-list. Constrains the candidate space.

N3

Case Understanding

LLM reads policy + signals. Bilateral judgment of each side's interpretation.

N4

Hypothesis Generator

Up to 3 RCA-code candidates with working hypothesis. Generate, don't score.

N5

Hypothesis Validator

Independent verification → final code + confidence + cap reason + full chain.

N6 · offline

Evolution Pool

Aggregates failed / low-confidence cases. Drives prompt + data-field iteration.
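The two deterministic nodes (N1 and N2) can be sketched without any model call. Everything concrete below is an assumption made for illustration: the policy UIDs, the `POLICY_ENFORCES` table, the rule that maps a pair of UIDs to a conflict type, and the `SUSPECTED_CODES` pre-list. Only the three conflict labels and the RCA code names appear in the deck.

```python
from enum import Enum


class Conflict(Enum):
    OVERKILL = "OVERKILL"
    MISCLASSIFICATION = "MISCLASSIFICATION"
    MISS = "MISS"


# Hypothetical static table: policy UID -> True if the policy enforces
# (removes or restricts content), False if it approves.
POLICY_ENFORCES = {"APPROVE": False, "P-101": True, "P-202": True}

# Hypothetical pre-list: candidate RCA codes N2 passes on per conflict type.
SUSPECTED_CODES = {
    Conflict.OVERKILL: ("policy_understanding", "policy_vague"),
    Conflict.MISS: ("policy_understanding", "age_judgement"),
    Conflict.MISCLASSIFICATION: ("policy_understanding", "policy_vague", "age_judgement"),
}


def conflict_precheck(moderator_uid: str, correct_uid: str) -> Conflict:
    """N1: a pure table lookup, so the result cannot be argued away later."""
    if moderator_uid == correct_uid:
        raise ValueError("moderator and KOM agree; nothing to attribute")
    moderator_enforced = POLICY_ENFORCES[moderator_uid]
    kom_enforced = POLICY_ENFORCES[correct_uid]
    if moderator_enforced and not kom_enforced:
        return Conflict.OVERKILL
    if kom_enforced and not moderator_enforced:
        return Conflict.MISS
    # Both sides chose a policy of the same kind, but not the same policy.
    return Conflict.MISCLASSIFICATION


def bundle_result(case_id: str, moderator_uid: str, correct_uid: str) -> dict:
    """N2: package the conflict signal with the constrained candidate list."""
    conflict = conflict_precheck(moderator_uid, correct_uid)
    return {
        "case_id": case_id,
        "conflict": conflict.value,
        "suspected_codes": list(SUSPECTED_CODES[conflict]),
    }


bundle = bundle_result("c1", "P-101", "APPROVE")
```

Because N1 is a lookup rather than a judgment, the later LLM nodes only choose among the candidates that N2 hands them.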

(a) Content Understanding

The LLM reads multimodal case material — keyframes, OCR, transcript, ASR — and policy texts. It outputs a structured case_diagnosis that distinguishes "the case is genuinely ambiguous" from "the moderator misread a clear policy."

(b) Bilateral Judgment

The model does not jump to "the moderator was wrong." It evaluates both the moderator's stance and KOM's stance against the policy, as a bilateral diff. This is what lets the pipeline detect cases where both sides are wrong, the policy is ambiguous, or the signals are insufficient.
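The bilateral diff can be pictured as a small decision table over two independent checks. The sketch below is an assumption-laden illustration: the field names and the outcome labels are hypothetical and are not the actual `case_diagnosis` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BilateralJudgment:
    moderator_supported: bool  # moderator's stance is defensible under the policy text
    kom_supported: bool        # KOM's stance is defensible under the policy text
    signals_sufficient: bool = True


def diagnose(j: BilateralJudgment) -> str:
    """Map the two one-sided checks onto a single outcome label."""
    if not j.signals_sufficient:
        return "signals_insufficient"
    if j.moderator_supported and j.kom_supported:
        # Both readings hold up, so the policy itself is the problem.
        return "policy_ambiguous"
    if not j.moderator_supported and not j.kom_supported:
        return "both_wrong"
    return "moderator_error" if j.kom_supported else "kom_error"


outcome = diagnose(BilateralJudgment(moderator_supported=False, kom_supported=True))
```

Judging each side separately is what makes the three non-obvious outcomes reachable; a single "was the moderator wrong?" check could only ever return one of two answers.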

03.4 · Outputs · 07 / 12

Every sentence must be traceable

ai_rca_code

Final RCA attribution.

ai_reasoning

Full structured chain — cites N3 phrases, policy spans, case-signal fields.

confidence + cap_reason

Confidence is capped when signals are insufficient, and cap_reason records why. For example, a missing policy_update_date caps confidence at 0.6.

evidence chain

Citations to every claim — auditable end-to-end.
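Confidence capping can be expressed as a rule table applied after the validator produces its raw score. This is a sketch, not the production logic: the `CAP_RULES` structure and the `apply_caps` function are hypothetical, and only the `policy_update_date` example with its 0.6 cap comes from the slide.

```python
# Field whose absence limits confidence -> maximum confidence allowed.
# Only this one rule is taken from the slide; the table shape is assumed.
CAP_RULES = {"policy_update_date": 0.6}


def apply_caps(raw_confidence: float, available_fields: set):
    """Return (confidence, cap_reason); cap_reason is None when nothing capped."""
    confidence = raw_confidence
    cap_reason = None
    for field, cap in CAP_RULES.items():
        if field not in available_fields and cap < confidence:
            confidence = cap
            cap_reason = f"missing {field}"
    return confidence, cap_reason


capped = apply_caps(0.9, {"ocr", "transcript"})
uncapped = apply_caps(0.9, {"ocr", "policy_update_date"})
```

Returning the reason alongside the number keeps the cap auditable: a reviewer can tell a genuinely uncertain case from one that was limited by missing data.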

"Every reasoning sentence must be traceable to one of: a policy citation, a case-signal citation, an N3 field, or an explicit logical step — otherwise it is rejected."
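The traceability rule lends itself to a mechanical check. The sketch below assumes each reasoning sentence is stored with a source type and a reference; that record layout and the type labels are hypothetical, chosen to mirror the four categories named in the rule.

```python
from dataclasses import dataclass
from typing import Optional

# The four admissible anchors named in the rule (labels are assumptions).
ALLOWED_SOURCES = {"policy", "case_signal", "n3_field", "logical_step"}


@dataclass(frozen=True)
class ReasoningSentence:
    text: str
    source_type: Optional[str] = None  # one of ALLOWED_SOURCES, or None
    source_ref: Optional[str] = None   # e.g. a policy span or a signal field name


def is_traceable(sentence: ReasoningSentence) -> bool:
    if sentence.source_type not in ALLOWED_SOURCES:
        return False
    if sentence.source_type == "logical_step":
        return True  # an explicit inference needs no external reference
    return bool(sentence.source_ref)


def split_chain(chain):
    """Separate a reasoning chain into accepted and rejected sentences."""
    accepted = [s for s in chain if is_traceable(s)]
    rejected = [s for s in chain if not is_traceable(s)]
    return accepted, rejected


chain = [
    ReasoningSentence("The overlay text reads 'SALE'.", "case_signal", "ocr"),
    ReasoningSentence("Therefore the clip is promotional.", "logical_step"),
    ReasoningSentence("Creators in this market often do this."),  # no anchor
]
accepted, rejected = split_chain(chain)
```

The third sentence is rejected because it cites nothing, which is exactly the kind of unsupported generalization the rule is meant to filter out.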
04 · Demo Showcase · 08 / 12

See it in action

An interactive walkthrough of the pipeline — case input, node-by-node reasoning, and the final structured artifact.

→ Open Demo

390c6625606b.aime-app.bytedance.net/demo.html

05 · v1.0 Performance · 09 / 12

LLM beats Human QA on RCA accuracy

Consistency LLM↔QA: 70.8%
Human QA accuracy: 60.9%
LLM accuracy: 67.9% (+7.0 pp vs. QA)
Avg TL score (1–5): 2.43
QA TL agrees with KOM · 50.9%

LLM 69.0% vs QA 62.5%

+6.5 pp · TL score 2.86 — the LLM is at near-launch quality where ground truth is clean.

QA TL disagrees with KOM · 49.1%

LLM 59.4% vs QA 67.0%

−7.6 pp · TL score 1.98 — LLM faithfully tracks KOM even when TL disagrees. Drivers: KOM-leakage Approves, and Youth / Adult Sexualized Behaviors disputes.
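The per-segment gaps and the overall TL score can be rechecked from the figures quoted above. This is a small arithmetic sketch; the dictionary layout is an assumption, and the numbers are copied from the slide.

```python
# Segment figures as quoted on the slide (shares as fractions, accuracy in %).
SEGMENTS = {
    "tl_agrees_with_kom": {"share": 0.509, "llm_acc": 69.0, "qa_acc": 62.5, "tl_score": 2.86},
    "tl_disagrees_with_kom": {"share": 0.491, "llm_acc": 59.4, "qa_acc": 67.0, "tl_score": 1.98},
}


def gap_pp(segment: dict) -> float:
    """LLM accuracy minus QA accuracy, in percentage points."""
    return round(segment["llm_acc"] - segment["qa_acc"], 1)


def weighted_tl_score(segments: dict) -> float:
    """Share-weighted average of the per-segment TL scores."""
    return sum(s["share"] * s["tl_score"] for s in segments.values())


gaps = {name: gap_pp(seg) for name, seg in SEGMENTS.items()}
overall_tl = weighted_tl_score(SEGMENTS)
```

The two gaps come out at +6.5 pp and −7.6 pp, and the share-weighted TL score rounds to 2.43, matching the headline average.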

05.2 · Reasoning Quality · 10 / 12

U-shaped — not mediocre, split

TL 1–5 scoring across 456 reviewed cases. The distribution is bimodal: 22.8% fully usable and 50.0% not usable. Most of the low end traces back to cases where the QA TL disagrees with KOM.

Score             | Cases | Share
5 · Fully usable  | 104   | 22.8%
4 · Minor miss    | 42    | 9.2%
3 · Half-right    | 28    | 6.1%
2 · Partial value | 54    | 11.8%
1 · Not usable    | 228   | 50.0%
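The shares and the average score follow directly from the case counts. The sketch below recomputes them; the counts are from the slide, while the variable names are illustrative.

```python
# TL score -> number of reviewed cases, as listed on the slide.
SCORE_COUNTS = {5: 104, 4: 42, 3: 28, 2: 54, 1: 228}

total_cases = sum(SCORE_COUNTS.values())

# Percentage share of each score bucket, rounded to one decimal place.
shares = {score: round(100 * n / total_cases, 1) for score, n in SCORE_COUNTS.items()}

# Count-weighted mean TL score.
mean_score = sum(score * n for score, n in SCORE_COUNTS.items()) / total_cases

# Share of cases at the two ends of the scale.
low_end = round(100 * (SCORE_COUNTS[1] + SCORE_COUNTS[2]) / total_cases, 1)
high_end = round(100 * (SCORE_COUNTS[4] + SCORE_COUNTS[5]) / total_cases, 1)
```

The counts sum to 456 and the mean rounds to 2.43. Scores 1–2 hold 61.8% of cases and scores 4–5 hold 32.0%, leaving only 6.1% in the middle, which is why the average alone is misleading.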

Differentiated capability vs. Human QA

RCA Type             | QA    | LLM   | GT    | Read
policy_understanding | 88.5% | 80.1% | 64.0% | Both over-attribute
age_judgement        | 0.0%  | 6.4%  | 4.7%  | LLM net-new
policy_vague         | 9.7%  | 8.2%  | 2.0%  | Both over-attribute
slip / speed / SOP   | 0.0%  | 0.0%  | 14.3% | Shared blind spot

The blind spot is a training-data gap, not a model defect — codes that humans never used, the model never learned.
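The "Read" column can be reproduced by subtracting the ground-truth share from each arm's share. This is a minimal sketch with the table values copied in; the tuple layout and the function name are assumptions.

```python
# RCA type -> (QA share, LLM share, GT share), all in percent, from the table.
ATTRIBUTION = {
    "policy_understanding": (88.5, 80.1, 64.0),
    "age_judgement": (0.0, 6.4, 4.7),
    "policy_vague": (9.7, 8.2, 2.0),
    "slip / speed / SOP": (0.0, 0.0, 14.3),
}


def attribution_gaps(table: dict) -> dict:
    """Positive = over-attributed relative to GT; negative = under-attributed."""
    return {
        code: {"qa": round(qa - gt, 1), "llm": round(llm - gt, 1)}
        for code, (qa, llm, gt) in table.items()
    }


gap_by_code = attribution_gaps(ATTRIBUTION)
```

On policy_understanding the gap is +24.5 pp for QA and +16.1 pp for the LLM, so the model over-attributes less than the humans it learned from. On slip / speed / SOP both arms sit at −14.3 pp, the shared blind spot.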

06 · Further Planning · 11 / 12

Where v2.0 goes

Short-term

  • Expand code coverage — absorb Slip / Speed / Technical / SOP. Shrink Others fallback. Tighten vague vs. understanding boundary.
  • Standardize reasoning — shared indicator framework + prompt dictionary. Each code decomposed into its underlying signals.
  • Cluster-level labels — complexity & similarity tags on cases; rolling high-frequency error notebooks on sites & individuals.
  • RCA Report design — four layers: data overview → business attribution → case-feature insights → owner-action.

Mid-term

  • Guidance for fuzzy cases — not "force a code," but analyze the vote distribution: which sub-groups, what reasoning, what split signals.
  • LLM consolidates the RCA framework itself — propose new codes when Others clusters reach mass. Shift from "what kind of error" to "who needs to do what, based on what evidence."
  • Productionize the workflow — case routing logic, QA review interface, operational baseline.
  • Continuous monitoring — drift detection, periodic GT sampling, retraining triggers.
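For the drift-detection item, one simple option is to compare the RCA-code distribution of a recent window against a baseline window. The sketch below uses total variation distance; the choice of metric, the 0.10 threshold, and the function names are all assumptions, since the deck does not specify a method.

```python
from collections import Counter


def code_distribution(codes):
    """Turn a list of RCA codes into a code -> relative-frequency map."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty window")
    return {code: n / total for code, n in counts.items()}


def total_variation(p: dict, q: dict) -> float:
    """Half the L1 distance between two distributions; 0 = same, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def has_drifted(baseline_codes, current_codes, threshold: float = 0.10) -> bool:
    """Flag drift when the distance exceeds a (hypothetical) threshold."""
    distance = total_variation(
        code_distribution(baseline_codes), code_distribution(current_codes)
    )
    return distance > threshold


# Made-up windows, for illustration only.
baseline = ["policy_understanding"] * 8 + ["policy_vague"] * 2
current = ["policy_understanding"] * 5 + ["policy_vague"] * 5
drift_flag = has_drifted(baseline, current)
```

A flag like this could serve as one of the retraining triggers, with periodic GT sampling used to confirm whether the shift reflects real change or model error.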

Roadmap

Phase 0 · Scoping

Apr 17 / May 6 ✅

  • RCA scope & LLM logic
  • Market, language, test case design

Phase 1 · Build

May 9 ✅

  • LLM v1.0 ready

Phase 2 · Validation

May 15–22

  • QA Lead judgement ✅
  • Metrics analysis ✅
  • Badcase review · May 22

Phase 3 · Iteration

May 29 → Mid June

  • LLM v2.0 iteration

Phase 4 · Integration

End of June

  • LLM RCA → production workflow

Closing

Start here.
Build with AI.
Move forward.

The repetitive layer is what AI automates. The layer that's left — insight, calibration, direction — is the one that actually moves the system.

12 / 12 VSO · PPS 🏔️