We use mechanistic interpretability findings to detect framing drift in information environments.
A literary translation paradigm establishes that content representations peak at ~50% of network depth. Subspace activation patching demonstrates that a single direction in 8,192-dimensional space captures the content-specific causal effect. Content directions are pair-specific and nearly orthogonal, with a mean pairwise cosine similarity of +0.041.
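The single-direction patching and the near-orthogonality check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation behind these results: the function names `patch_along_direction` and `mean_pairwise_cosine` are hypothetical, and the toy 3-dimensional vectors stand in for 8,192-dimensional residual stream activations.

```python
import math
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def patch_along_direction(target, source, direction):
    """Overwrite target's component along `direction` with source's component.

    Everything orthogonal to `direction` is left untouched, so a change in
    downstream behaviour can be attributed to that one direction.
    (Hypothetical helper, for illustration only.)
    """
    d = unit(direction)
    shift = dot(source, d) - dot(target, d)
    return [t + shift * di for t, di in zip(target, d)]


def mean_pairwise_cosine(directions):
    """Average cosine similarity over all unordered pairs of directions."""
    pairs = list(combinations(directions, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy data (assumed, not from the study).
target = [1.0, 2.0, 3.0]
source = [5.0, -1.0, 0.0]
direction = [2.0, 0.0, 0.0]

patched = patch_along_direction(target, source, direction)
print(patched)  # [5.0, 2.0, 3.0]
```

Only the first coordinate changes in the toy run, because the chosen direction lies along that axis. A mean pairwise cosine near zero, as computed by `mean_pairwise_cosine`, is what "nearly orthogonal" means here.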
Framing similarity peaks at layer 25 (~31% of depth) while content peaks at layer 40 (~50%). A null baseline using different topics in the same registers produces the opposite pattern, confirming that the signal is topic-dependent framing rather than register similarity. The result is validated across pharmaceutical, financial, and insurance domains.
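The layer-wise comparison behind these peak figures can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, the per-layer vectors are assumed to be pooled hidden states obtained elsewhere, and the 80-layer count is inferred from the figures above (layer 40 at ~50% of depth) rather than stated outright.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return num / den


def layerwise_similarity(states_a, states_b):
    """states_x[l] is the pooled hidden-state vector of text x at layer l."""
    return [cosine(a, b) for a, b in zip(states_a, states_b)]


def peak_layer(similarities):
    """Return (layer index, fraction of depth) where similarity is highest."""
    layer = max(range(len(similarities)), key=lambda i: similarities[i])
    return layer, layer / len(similarities)


# Synthetic similarity curve over an assumed 80 layers, peaking at layer 25.
N_LAYERS = 80
toy_curve = [1.0 - abs(i - 25) / N_LAYERS for i in range(N_LAYERS)]

print(peak_layer(toy_curve))  # (25, 0.3125)
```

Running the same two functions on null-baseline pairs (different topics, same registers) gives a second curve whose peak can be compared with the first.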
Cross-lingual comparison replicates the three-phase trajectory from Paper I. Initial results across French, Greek, Russian, and Japanese show divergence gaps exceeding +0.36 between engines that capture semantic intent and those that translate literally.
Pinpoints material divergence from approved disclosures, then classifies it and measures its severity against a cross-domain baseline.
Open-weight models. Deployable on local, air-gapped infrastructure.
Pharmaceutical promotional material compliance. Sentence-level drift detection between marketing copy and approved prescribing information. Validated against FDA OPDP enforcement actions. Live scans produce findings including data vintage mismatches and mechanism-of-action reframing.
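Sentence-level drift detection of this kind can be sketched as a nearest-match search. This is a minimal sketch under assumptions, not the production pipeline: `sentence_drift` is a hypothetical name, the 0.8 threshold is arbitrary, and each sentence is assumed to already have a framing-layer vector extracted from the model.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return num / den


def sentence_drift(marketing, approved, threshold=0.8):
    """Flag marketing sentences whose closest approved sentence is too far away.

    marketing / approved: dicts mapping sentence text -> framing-layer vector.
    Returns (sentence, closest approved sentence, similarity) tuples for every
    sentence below `threshold`, with the lowest similarity first.
    """
    findings = []
    for sentence, vec in marketing.items():
        best_ref, best_sim = max(
            ((ref, cosine(vec, ref_vec)) for ref, ref_vec in approved.items()),
            key=lambda pair: pair[1],
        )
        if best_sim < threshold:
            findings.append((sentence, best_ref, best_sim))
    return sorted(findings, key=lambda f: f[2])


# Placeholder sentences and 2-dimensional toy vectors (assumed data).
approved = {"approved sentence A": [1.0, 0.0], "approved sentence B": [0.0, 1.0]}
marketing = {
    "marketing sentence 1": [0.9, 0.1],  # close to A, passes
    "marketing sentence 2": [1.0, 0.8],  # closest to A, but below threshold
}

for sentence, ref, sim in sentence_drift(marketing, approved):
    print(f"{sentence!r} drifts from {ref!r} (similarity {sim:.2f})")
```

Returning the closest approved sentence alongside each flagged one gives a reviewer the exact pair to compare.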
Insurance compliance analysis. Detects framing drift between marketing materials and policy language across commercial lines including BOP, professional liability, cyber, and workers’ compensation.
Information operation detection via two-layer residual stream extraction. Content-layer topic matching paired with framing-layer comparison identifies narrative amplification networks. Zero false positives across mainstream partisan outlets in live monitoring.
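The two-layer idea can be sketched as a pairwise filter followed by grouping. This is a hedged sketch: it assumes that "amplification" means near-identical framing on a matched topic, which the text above does not spell out. The name `amplification_clusters`, both thresholds, and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return num / den


def amplification_clusters(docs, topic_min=0.9, framing_min=0.9):
    """Group outlets that share both topic and framing.

    docs: dict mapping outlet name -> (content-layer vector, framing-layer vector).
    Two outlets are linked when both similarities clear their thresholds;
    connected groups of linked outlets are returned as sorted name lists.
    """
    links = {name: set() for name in docs}
    for a, b in combinations(docs, 2):
        content_a, framing_a = docs[a]
        content_b, framing_b = docs[b]
        same_topic = cosine(content_a, content_b) >= topic_min
        same_framing = cosine(framing_a, framing_b) >= framing_min
        if same_topic and same_framing:
            links[a].add(b)
            links[b].add(a)

    clusters, seen = [], set()
    for start in docs:
        if start in seen or not links[start]:
            continue  # already grouped, or linked to nobody
        stack, component = [start], set()
        while stack:  # depth-first walk over the link graph
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(links[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Toy data: a and b share topic and framing; c shares only the topic with a;
# d covers a different topic.
docs = {
    "outlet_a": ([1.0, 0.0], [1.0, 0.0]),
    "outlet_b": ([1.0, 0.1], [1.0, 0.05]),
    "outlet_c": ([1.0, 0.0], [0.0, 1.0]),
    "outlet_d": ([0.0, 1.0], [1.0, 0.0]),
}
print(amplification_clusters(docs))  # [['outlet_a', 'outlet_b']]
```

Requiring the topic match first keeps outlets that merely write in a similar voice about different subjects from being linked, which is one plausible way to hold down false positives.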