[VSO-PPS] · 2026

LLM-Driven RCA Sharing

The point of bringing AI into RCA is not to replace human judgment — it is to free human judgment from work that does not need it. When the repetitive layer is automated, what's left is the layer that actually moves the system: insight, calibration, and direction.

01 / 12 v1.0 Trust & Safety · PPS

01 · Why We Use LLM RCA

Three structural ceilings of today's human-only RCA

Coverage ceiling

High-quality RCA is labor-intensive.

Only a small fraction of moderation errors ever receive a true root-cause attribution; the rest stay as raw error counts with no diagnostic signal.

Consistency ceiling

Attribution quality degrades under load.

Different markets, different QAs, and even the same QA across weeks produce inconsistent codes when case volumes spike.

Insight ceiling

Descriptive, not structured.

Downstream calibration, policy feedback, and training are forced to operate on shallow signals — no evidence chains, no business-level attribution.

The result is a persistent gap — not just between the errors that occur and the errors that receive meaningful diagnosis, but between the diagnoses we produce and the structured insight the business actually needs to act on.

01.2 · Value Proposition

A Co-Pilot model

AI takes on the high-volume, repeatable attribution layer at machine speed and consistency. Human QA levels up from first-pass labeler to validator and strategist — applying judgment where it matters most.

A step-change on both axes: efficiency lifts as repetitive attribution is automated end-to-end, and quality lifts on two fronts — every error case can now receive a structured root-cause attribution, and scarce human expertise is concentrated on the decisions that actually move the system.

Lever	What LLM RCA unlocks
Coverage	First-pass attribution on every case
Consistency	Same code book, same template, 24×7, no drift
Sustainability	Rolling error-pattern + gap detection becomes continuous
Structure	Reasoning + evidence + confidence — a queryable data asset
Dirty-data detection	Catches wrong KOM, drifting standards, before contaminating reporting
Business insight	Clustering & trend detection no single QA can do manually

02 · v1.0 Design

Scope — deliberately shaped

Broad where the model can already deliver. Focused where clean validation requires it. Test the hypothesis on solid ground before scaling.

RCA Codes

Curated subset

Organized along Priority × Frequency × Model Comprehensibility × Signal Accessibility. "Others" available as free-text fallback.

Markets

Global

LLM has no hard language limits.

Case Types

Non-fuzzy: OMA + OS

Where KOM provides a clear ground truth. Fuzzy cases deferred to later iterations.

Training Set

This year's human RCA

No market / language restrictions. Maximize volume — let the model learn human RCA standards.

Validation: parallel test, QA Lead as Ground Truth

Human QA

Online RCA codes, current process → baseline attribution.

AIQA · LLM

Curated 1.0 codes + structured reasoning → model attribution.

QA Lead · GT

Adjudicates diff cases; verdict drives accuracy scoring and iteration.
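The scoring behind this parallel test can be sketched as below. This is a minimal illustration, not the team's actual scoring code: the `Case` fields (`qa_code`, `llm_code`, `gt_code`) and the function `score_parallel_test` are hypothetical names, and it assumes a QA Lead verdict is available for every scored case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    qa_code: str   # code from the current human QA process (baseline)
    llm_code: str  # code from the LLM pipeline
    gt_code: str   # QA Lead verdict, treated as ground truth (assumed present)


def score_parallel_test(cases):
    """Return LLM/QA consistency and per-side accuracy as fractions in [0, 1]."""
    n = len(cases)
    if n == 0:
        raise ValueError("no cases to score")
    return {
        "consistency": sum(c.qa_code == c.llm_code for c in cases) / n,
        "qa_accuracy": sum(c.qa_code == c.gt_code for c in cases) / n,
        "llm_accuracy": sum(c.llm_code == c.gt_code for c in cases) / n,
    }


# Illustrative cases only; these are not real validation data.
sample = [
    Case("policy_understanding", "policy_understanding", "policy_understanding"),
    Case("policy_understanding", "age_judgement", "age_judgement"),
    Case("policy_vague", "policy_vague", "policy_understanding"),
    Case("policy_understanding", "policy_vague", "policy_understanding"),
]
scores = score_parallel_test(sample)
```

Consistency is measured between the two attributions, while accuracy is measured against the QA Lead verdict, so the two numbers can move independently.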

03 · How LLM Works

Inputs — a tight whitelist

A case enters the pipeline with a tightly controlled signal whitelist — 7 fields in v1.0, plus the structured policy artifacts.

The model receives no free-form context beyond this whitelist — a deliberate constraint that prevents hallucinated references and keeps the reasoning chain auditable.

Input group	Fields
Case signals (7)	video · video_title · transcript · sticker · keyframe · ocr · background_asr_text
Policy artifacts	<MODERATOR_POLICY_TEXT>: what the moderator applied
Policy artifacts	<CORRECT_POLICY_TEXT>: KOM ground truth
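One way to enforce the whitelist is to filter every case before it reaches the model. The sketch below is an assumption about how that could look: the seven field names and the two policy artifact names come from this slide, while `build_model_input` and the dict-based case representation are hypothetical.

```python
# Field names as listed on the slide; the dict-based case format is an assumption.
CASE_SIGNAL_WHITELIST = (
    "video", "video_title", "transcript", "sticker",
    "keyframe", "ocr", "background_asr_text",
)
POLICY_ARTIFACTS = ("MODERATOR_POLICY_TEXT", "CORRECT_POLICY_TEXT")


def build_model_input(raw_case: dict) -> dict:
    """Keep only whitelisted signals and policy artifacts; drop everything else."""
    allowed = CASE_SIGNAL_WHITELIST + POLICY_ARTIFACTS
    return {key: raw_case[key] for key in allowed if key in raw_case}


raw = {
    "video_title": "example title",
    "ocr": "text found on screen",
    "CORRECT_POLICY_TEXT": "policy text from KOM",
    "moderator_id": "not on the whitelist",  # hypothetical extra field
}
model_input = build_model_input(raw)
```

Filtering by an explicit allow-list, rather than removing known-bad fields, means a newly added upstream field stays out of the prompt until someone deliberately admits it.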

03.2 · Reasoning Pipeline

Six nodes — five inline, one offline

N1

Conflict Pre-check

Static lookup of OVERKILL / MISCLASSIFICATION / MISS from policy UID. Non-negotiable.

N2

Result Bundling

Packages conflict signal + suspected-code pre-list. Constrains the candidate space.

N3

Case Understanding

LLM reads policy + signals. Bilateral judgment of each side's interpretation.

N4

Hypothesis Generator

Up to 3 RCA-code candidates with working hypothesis. Generate, don't score.

N5

Hypothesis Validator

Independent verification → final code + confidence + cap reason + full chain.

N6 · offline

Evolution Pool

Aggregates failed / low-confidence cases. Drives prompt + data-field iteration.
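The two deterministic nodes, N1 and N2, can be sketched without any model call. Everything below is an assumption made for illustration: the rule that derives the conflict type from a pair of policy UIDs, the use of `None` for "no policy applied", and the contents of `SUSPECTED_CODES` are hypothetical, not the production lookup.

```python
from typing import Optional

NO_VIOLATION = None  # assumption: None marks "no policy applied"


def conflict_precheck(moderator_uid: Optional[str], correct_uid: Optional[str]) -> str:
    """N1 sketch: derive the conflict type from the two policy UIDs, no LLM involved."""
    if moderator_uid == correct_uid:
        raise ValueError("no conflict: moderator and KOM agree")
    if correct_uid is NO_VIOLATION:
        return "OVERKILL"           # moderator enforced, KOM says nothing applies
    if moderator_uid is NO_VIOLATION:
        return "MISS"               # moderator approved, KOM says a policy applies
    return "MISCLASSIFICATION"      # both enforced, but under different policies


# Hypothetical pre-list per conflict type; code names reuse those in the results table.
SUSPECTED_CODES = {
    "OVERKILL": ["policy_understanding", "policy_vague"],
    "MISS": ["policy_understanding", "age_judgement"],
    "MISCLASSIFICATION": ["policy_understanding", "policy_vague"],
}


def bundle(moderator_uid: Optional[str], correct_uid: Optional[str]) -> dict:
    """N2 sketch: package the conflict signal with a constrained candidate list."""
    conflict = conflict_precheck(moderator_uid, correct_uid)
    return {
        "conflict_type": conflict,
        "suspected_codes": SUSPECTED_CODES[conflict] + ["Others"],
    }


bundled = bundle("policy_a", None)
```

Because the conflict type is computed before the LLM sees the case, later nodes can treat it as fixed input rather than something to re-derive.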

(a) Content Understanding

The LLM reads multimodal case material — keyframes, OCR, transcript, ASR — and policy texts. It outputs a structured case_diagnosis that distinguishes "the case is genuinely ambiguous" from "the moderator misread a clear policy."

(b) Bilateral Judgment

The model does not jump to "the moderator was wrong." It evaluates both the moderator's stance and KOM's stance against the policy — a bilateral diff. This is what detects both sides wrong, policy ambiguous, or signals insufficient.
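The bilateral diff can be read as a small decision table over two independent judgments. The sketch below assumes N3 emits one boolean per side plus a signal-sufficiency flag; the function name and the outcome labels are hypothetical.

```python
def bilateral_outcome(moderator_supported: bool,
                      kom_supported: bool,
                      signals_sufficient: bool = True) -> str:
    """Map the two per-side judgments to an outcome label (labels are illustrative)."""
    if not signals_sufficient:
        return "signals_insufficient"
    if moderator_supported and kom_supported:
        return "policy_ambiguous"   # the policy text supports both readings
    if not moderator_supported and not kom_supported:
        return "both_sides_wrong"
    if kom_supported:
        return "moderator_error"
    return "kom_error"


outcomes = [
    bilateral_outcome(False, True),
    bilateral_outcome(True, True),
    bilateral_outcome(False, False),
    bilateral_outcome(True, False),
    bilateral_outcome(True, True, signals_sufficient=False),
]
```

Judging each side separately is what makes the last three outcomes reachable; a single "was the moderator wrong?" question could only ever return the first.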

03.4 · Outputs

Every sentence must be traceable

ai_rca_code

Final RCA attribution.

ai_reasoning

Full structured chain — cites N3 phrases, policy spans, case-signal fields.

confidence + cap_reason

Capped when signals are insufficient. E.g. missing policy_update_date → cap 0.6.

evidence chain

Citations to every claim — auditable end-to-end.
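The confidence cap can be sketched as a lookup from missing fields to ceilings. Only the `policy_update_date` cap of 0.6 comes from this slide; the function `apply_caps`, the cap table structure, and the wording of the cap reason are assumptions.

```python
# Only this cap value is taken from the slide; any further entries would be assumptions.
CONFIDENCE_CAPS = {"policy_update_date": 0.6}


def apply_caps(raw_confidence: float, available_fields: set):
    """Lower the confidence to the tightest cap triggered by a missing field."""
    confidence = raw_confidence
    cap_reason = None
    for field, cap in CONFIDENCE_CAPS.items():
        if field not in available_fields and confidence > cap:
            confidence = cap
            cap_reason = f"missing {field}"
    return confidence, cap_reason


capped = apply_caps(0.9, {"ocr", "transcript"})
uncapped = apply_caps(0.9, {"ocr", "policy_update_date"})
already_low = apply_caps(0.4, set())
```

A cap only lowers confidence; a case that was already below the ceiling keeps its value and gets no cap reason.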

"Every reasoning sentence must be traceable to one of: a policy citation, a case-signal citation, an N3 field, or an explicit logical step — otherwise it is rejected."
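The traceability rule can be checked mechanically if each reasoning sentence carries a source tag. The sketch below assumes such a structure exists: `ReasoningSentence`, its `source` and `ref` fields, and the four tag names are hypothetical stand-ins for the four allowed sources named in the rule.

```python
from dataclasses import dataclass
from typing import Optional

# One tag per allowed source in the rule above; the tag names are assumptions.
ALLOWED_SOURCES = {"policy", "case_signal", "n3_field", "logical_step"}


@dataclass
class ReasoningSentence:
    text: str
    source: Optional[str] = None  # which kind of evidence backs the sentence
    ref: str = ""                 # the cited span or field; empty for a logical step


def untraceable(sentences):
    """Return the sentences that cite no allowed source and must be rejected."""
    rejected = []
    for s in sentences:
        needs_ref = s.source != "logical_step"
        if s.source not in ALLOWED_SOURCES or (needs_ref and not s.ref):
            rejected.append(s)
    return rejected


chain = [
    ReasoningSentence("The policy covers this behavior.", "policy", "CORRECT_POLICY_TEXT"),
    ReasoningSentence("On-screen text shows the claim.", "case_signal", "ocr"),
    ReasoningSentence("Therefore the moderator misread the policy.", "logical_step"),
    ReasoningSentence("The moderator was probably rushed."),  # no source: rejected
]
rejected = untraceable(chain)
```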

04 · Demo Showcase

See it in action

An interactive walkthrough of the pipeline — case input, node-by-node reasoning, and the final structured artifact.

→ Open Demo

390c6625606b.aime-app.bytedance.net/demo.html

05 · v1.0 Performance

LLM beats Human QA on RCA accuracy

Metric	Value
Consistency, LLM↔QA	70.8%
Human QA accuracy	60.9%
LLM accuracy	67.9% (+7.0 pp vs. QA)
Avg TL score (1–5)	2.43

QA TL agrees with KOM · 50.9%

LLM 69.0% vs QA 62.5%

+6.5 pp · TL score 2.86 — the LLM is at near-launch quality where ground truth is clean.

QA TL disagrees with KOM · 49.1%

LLM 59.4% vs QA 67.0%

−7.6 pp · TL score 1.98 — LLM faithfully tracks KOM even when TL disagrees. Drivers: KOM-leakage Approves, and Youth / Adult Sexualized Behaviors disputes.
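As a quick consistency check on the two segments, the sketch below recomputes the percentage-point gaps and blends the segment TL scores by segment share. The numbers are copied from this slide; the variable names are illustrative, and the shares are rounded, so the blend is approximate.

```python
segments = [
    {"name": "QA TL agrees with KOM", "share": 0.509,
     "llm_acc": 69.0, "qa_acc": 62.5, "tl_score": 2.86},
    {"name": "QA TL disagrees with KOM", "share": 0.491,
     "llm_acc": 59.4, "qa_acc": 67.0, "tl_score": 1.98},
]

# Percentage-point gap, LLM minus QA, within each segment.
gaps_pp = [round(s["llm_acc"] - s["qa_acc"], 1) for s in segments]

# Share-weighted TL score across both segments.
blended_tl = round(sum(s["share"] * s["tl_score"] for s in segments), 2)
```

The blended TL score comes out at 2.43, matching the overall average reported above.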

05.2 · Reasoning Quality

U-shaped — not mediocre, split

TL 1–5 scoring across 456 reviewed cases. The distribution is bimodal: 22.8% fully usable, 50% not usable — and most of the low end traces back to QA TL disagrees with KOM.

TL score	Cases	Share
5 · Fully usable	104	22.8%
4 · Minor miss	42	9.2%
3 · Half-right	28	6.1%
2 · Partial value	54	11.8%
1 · Not usable	228	50.0%
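The shares and the average TL score follow directly from the case counts. The sketch below recomputes them from the counts reported here; only the variable names are new.

```python
# Case counts per TL score, as reported in the distribution above.
counts = {5: 104, 4: 42, 3: 28, 2: 54, 1: 228}

total = sum(counts.values())
shares = {score: round(100 * n / total, 1) for score, n in counts.items()}
mean_score = round(sum(score * n for score, n in counts.items()) / total, 2)
```

With 456 cases in total the mean is 2.43. Scores 1 and 5 together hold 72.8% of cases, which is why the average alone understates how split the distribution is.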

Differentiated capability vs. Human QA

RCA Type	QA	LLM	GT	Read
policy_understanding	88.5%	80.1%	64.0%	Both over-attribute
age_judgement	0.0%	6.4%	4.7%	LLM net-new
policy_vague	9.7%	8.2%	2.0%	Both over-attribute
slip / speed / SOP	0.0%	0.0%	14.3%	Shared blind spot

The blind spot is a training-data gap, not a model defect — codes that humans never used, the model never learned.
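The "Read" column can be reproduced by subtracting the GT share from each side's share. This sketch assumes the three percentage columns are the share of cases each party assigned to that code; the values are copied from the table and the variable names are illustrative.

```python
# (QA %, LLM %, GT %) per RCA type, copied from the table above.
rows = {
    "policy_understanding": (88.5, 80.1, 64.0),
    "age_judgement": (0.0, 6.4, 4.7),
    "policy_vague": (9.7, 8.2, 2.0),
    "slip / speed / SOP": (0.0, 0.0, 14.3),
}

# Positive gap: the code is used more often than GT uses it (over-attribution).
# Negative gap: the code is used less often than GT uses it (under-attribution).
gaps = {
    code: {"qa_minus_gt": round(qa - gt, 1), "llm_minus_gt": round(llm - gt, 1)}
    for code, (qa, llm, gt) in rows.items()
}
```

On these numbers the LLM over-attributes policy_understanding by 16.1 pp against 24.5 pp for QA, and both sides sit 14.3 pp below GT on slip / speed / SOP.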

06 · Further Planning

Where v2.0 goes

Short-term

Expand code coverage — absorb Slip / Speed / Technical / SOP. Shrink Others fallback. Tighten vague vs. understanding boundary.
Standardize reasoning — shared indicator framework + prompt dictionary. Each code decomposed into its underlying signals.
Cluster-level labels — complexity & similarity tags on cases; rolling high-frequency error notebooks on sites & individuals.
RCA Report design — four layers: data overview → business attribution → case-feature insights → owner-action.

Mid-term

Guidance for fuzzy cases — not "force a code," but analyze the vote distribution: which sub-groups, what reasoning, what split signals.
LLM consolidates the RCA framework itself — propose new codes when Others clusters reach mass. Shift from "what kind of error" to "who needs to do what, based on what evidence."
Productionize the workflow — case routing logic, QA review interface, operational baseline.
Continuous monitoring — drift detection, periodic GT sampling, retraining triggers.
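Drift detection is not specified in this plan. One simple option, shown purely as an assumption, is to compare the RCA-code distribution of a recent window against a baseline using total variation distance. The function names and the 0.15 threshold are hypothetical.

```python
from collections import Counter


def code_distribution(codes):
    """Share of cases per RCA code in one window."""
    if not codes:
        raise ValueError("empty window")
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two distributions, between 0 and 1."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted(baseline_codes, current_codes, threshold: float = 0.15) -> bool:
    """Flag a window whose code mix has moved further than the assumed threshold."""
    distance = total_variation(
        code_distribution(baseline_codes), code_distribution(current_codes)
    )
    return distance > threshold


# Illustrative windows only; these are not real data.
baseline = ["policy_understanding"] * 8 + ["policy_vague"] * 2
current = ["policy_understanding"] * 5 + ["policy_vague"] * 5
alert = drifted(baseline, current)
```

A flagged window would be a natural trigger for the periodic GT sampling listed above, since a shift in code mix can come from real behavior change as well as from model drift.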

Roadmap

Phase 0 · Scoping

Apr 17 / May 6 ✅

RCA scope & LLM logic
Market, language, test case design

Phase 1 · Build

May 9 ✅

LLM v1.0 ready

Phase 2 · Validation

May 15–22

QA Lead judgement ✅
Metrics analysis ✅
Badcase review · May 22

Phase 3 · Iteration

May 29 → Mid June

LLM v2.0 iteration

Phase 4 · Integration

End of June

LLM RCA → production workflow

Closing

Start here.
Build with AI.
Move forward.

The repetitive layer is what AI automates. The layer that's left — insight, calibration, direction — is the one that actually moves the system.

