We use mechanistic interpretability findings to detect framing drift in regulatory, financial, and information environments.
A literary translation paradigm establishes that content representations peak at ~50% of network depth. Subspace activation patching demonstrates that a single direction in 8,192-dimensional space captures the content-specific causal effect. Content directions are pair-specific and nearly orthogonal, with a mean pairwise cosine similarity of +0.041.
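As a rough illustration of the orthogonality statistic above, the following sketch computes a mean pairwise cosine similarity over a set of direction vectors. The function names and the 3-dimensional toy vectors are assumptions made for this example; real content directions would be 8,192-dimensional and extracted from the model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(directions):
    """Average cosine similarity over all unordered pairs of directions."""
    pairs = list(combinations(directions, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy directions (hypothetical, for illustration only): exactly orthogonal.
toy_directions = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

print(mean_pairwise_cosine(toy_directions))  # 0.0
```

A value close to zero, such as the +0.041 reported above, indicates that the directions share very little with one another on average.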
Framing similarity peaks at layer 25 (~31% of depth), while content similarity peaks at layer 40 (~50%). A null baseline using different topics in the same registers produces the opposite pattern, confirming that the signal is topic-dependent framing rather than register similarity. The result is validated across pharmaceutical, financial, and insurance domains.
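The relationship between a peak layer and its fraction of depth can be sketched as follows. The 80-layer depth is an assumption inferred from the stated percentages (25 is ~31% and 40 is 50% of 80), the layer indexing convention is also assumed, and the similarity curves are synthetic placeholders rather than measured values.

```python
def peak_layer(similarity_by_layer):
    """Return (layer_index, fraction_of_depth) where similarity is highest."""
    n_layers = len(similarity_by_layer)
    best = max(range(n_layers), key=lambda i: similarity_by_layer[i])
    return best, best / n_layers


N_LAYERS = 80  # assumed depth, inferred from the percentages in the text

# Synthetic curves (hypothetical): tent shapes peaking at the stated layers.
framing_curve = [-abs(i - 25) for i in range(N_LAYERS)]
content_curve = [-abs(i - 40) for i in range(N_LAYERS)]

print(peak_layer(framing_curve))  # (25, 0.3125)
print(peak_layer(content_curve))  # (40, 0.5)
```

Running the same function on a null-baseline curve (different topics, same registers) would be one way to check whether the ordering of the two peaks reverses.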
Cross-lingual comparison replicates the three-phase trajectory from Paper I. Initial results across French, Greek, Russian, and Japanese show divergence gaps exceeding +0.36 between translation engines that capture semantic intent and those that translate literally.
Compares public language against filed disclosures at the sentence level. Identifies which specific sentences drive framing divergence and classifies the drift type: insertion, reframing, or substitution.
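One way to picture the sentence-level comparison is to score each public sentence by its best match among the filed sentences and surface the weakest matches first. This is a minimal sketch under stated assumptions: sentence vectors are taken as already extracted, the function name `rank_sentences_by_drift` and the 2-dimensional toy vectors are hypothetical, and the rules for labelling a drift as insertion, reframing, or substitution are not specified here, so the sketch stops at ranking.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_sentences_by_drift(public_vecs, filed_vecs):
    """Pair each public sentence with its best filed match.

    Returns (public_index, best_similarity) tuples sorted so that the
    sentences least supported by the filed text come first.
    """
    scored = []
    for i, p in enumerate(public_vecs):
        best = max(cosine(p, f) for f in filed_vecs)
        scored.append((i, best))
    return sorted(scored, key=lambda item: item[1])


# Toy vectors (hypothetical): public sentence 1 has no close filed match.
public_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filed_vecs = [[1.0, 0.0], [1.0, 0.1]]

ranked = rank_sentences_by_drift(public_vecs, filed_vecs)
print([index for index, _ in ranked])  # [1, 2, 0]
```

The sentences at the head of the ranking are the candidates for review; any classification step would operate on those.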
Deployable on air-gapped infrastructure. The open-weight model downloads once; after that, proprietary documents never leave the room.
Pharmaceutical promotional material compliance. Sentence-level drift detection between marketing copy and approved prescribing information. Validated against 20 FDA OPDP enforcement actions with 100% sensitivity and specificity. Live scans produce findings including data vintage mismatches and mechanism-of-action reframing.
Insurance compliance analysis. Detects framing drift between marketing materials and policy language across commercial lines including BOP, professional liability, cyber, and workers’ compensation. Validated on 24 cases with 100% sensitivity and specificity.
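For reference, the sensitivity and specificity figures quoted for the pharmaceutical and insurance validations follow the standard definitions sketched below. The labels and predictions in the example are illustrative placeholders, not the validation data, and the function name is an assumption made for this sketch.

```python
def sensitivity_specificity(labels, predictions):
    """Compute (sensitivity, specificity) from boolean ground truth and flags.

    sensitivity = TP / (TP + FN): share of true violations that were flagged.
    specificity = TN / (TN + FP): share of compliant cases left unflagged.
    A metric is returned as None when its denominator is zero.
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Illustrative placeholder data: two violations, two compliant cases.
labels = [True, True, False, False]
predictions = [True, True, False, False]

print(sensitivity_specificity(labels, predictions))  # (1.0, 1.0)
```

Note that specificity is only defined when the evaluation set includes compliant (negative) cases alongside the violations.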
Information operation detection via two-layer residual stream extraction. Content-layer topic matching paired with framing-layer comparison identifies narrative amplification networks. Zero false positives across mainstream partisan outlets in live monitoring.
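The two-layer idea can be sketched as a two-stage filter: match documents on topic using content-layer vectors, then compare framing-layer vectors only within matched pairs. This is one plausible reading, not a description of the deployed system. The vectors are assumed to be already extracted from the residual stream, and the function name, outlet names, thresholds, and toy values are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def flag_amplification_pairs(docs, content_threshold=0.8, framing_threshold=0.8):
    """Flag document pairs that share both a topic and a framing.

    docs maps a source name to {"content": vector, "framing": vector}.
    Both thresholds are assumed values chosen for illustration.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(docs.items(), 2):
        if cosine(a["content"], b["content"]) < content_threshold:
            continue  # different topics: framing comparison is not meaningful
        if cosine(a["framing"], b["framing"]) >= framing_threshold:
            flagged.append((name_a, name_b))
    return flagged


# Toy vectors (hypothetical): a and b share topic and framing; c shares the
# topic but frames it differently; d covers a different topic.
docs = {
    "outlet_a": {"content": [1.0, 0.0], "framing": [1.0, 0.0]},
    "outlet_b": {"content": [0.9, 0.1], "framing": [0.95, 0.05]},
    "outlet_c": {"content": [1.0, 0.05], "framing": [0.0, 1.0]},
    "outlet_d": {"content": [0.0, 1.0], "framing": [1.0, 0.0]},
}

print(flag_amplification_pairs(docs))  # [('outlet_a', 'outlet_b')]
```

Gating the framing comparison on a topic match is what keeps outlets that merely cover the same story with different framings, such as outlet_c here, out of the flagged set.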