Offline, private AI for document-intensive workflows.

Open-weight models on hardware you own, deployed through scoped pilots that prove out value on your actual documents before broader rollout.

Most clients begin with a single defined workflow, validate results, and expand from there.

How Engagements Work
Stage One

Workflow Discovery

On-site or remote review of your document workflows and existing systems. We identify candidate applications, test feasibility on representative samples of your documents, and produce a written assessment with recommended scope. The engagement is a fixed-fee deliverable. There is no commitment to proceed beyond it.

Stage Two

Focused Pilot

Deployment of one or more specific workflows on dedicated hardware in your office. We define acceptance criteria together, build the system to those criteria, train your team on it, and operate alongside you during a defined evaluation period. If the system meets the acceptance criteria, you keep it. If it does not, we have learned together what needs to change.

Stage Three

Expanded Deployment

For clients whose pilot succeeded and who want to expand the system to additional workflows, integrate it more deeply with existing systems, or deploy it across multiple sites. This is the stage where larger investments make sense: air-gapped configurations, custom model fine-tuning on your historical documents, and multi-user deployments.

Ongoing

Maintenance and Updates

Open-weight models improve continuously. As new model releases offer meaningful capability improvements for your workflows, we update your deployed system. Ongoing engagement is billed hourly and on demand rather than as a mandatory recurring subscription.

We deploy on hardware purpose-built for sustained local AI workloads, running open-weight models that have closed the gap with frontier providers. The configuration evolves as the open-weight ecosystem matures.

Hardware

NVIDIA DGX Spark

128GB unified memory, GB10 Blackwell architecture, designed for sustained agentic workloads. Runs 120B-parameter mixture-of-experts models continuously. Sits in an office on standard power. No data center required.

Apple Silicon (Mac mini, Mac Studio)

Lower-footprint deployments for single-workflow installations. M4 Pro and M4 Max configurations support 24–128GB unified memory. Fits in any office.

Models

NVIDIA Nemotron 3 Super

120B parameter mixture-of-experts model with 12B active per token, designed by NVIDIA for agentic and reasoning workloads. Primary model for deployments where doctrinal precision and procedural specificity matter — legal, regulatory, and complex compliance work. Runs natively on DGX Spark with the recommended NVIDIA stack.

Qwen 3.6 (35B / 27B)

Mixture-of-experts models for high-throughput document workflows where speed matters alongside quality. The 35B-parameter version, with 3B active per token, delivers production-quality summarization and extraction in a roughly 20GB memory footprint, suitable for deployments from Mac mini through DGX Spark.
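That footprint follows from simple arithmetic: all 35B weights stay resident in memory even though only 3B are active per token, and at a typical 4-to-5-bit quantization level (an assumption for illustration, not a published spec) the weights alone come to roughly 20GB. A minimal sketch of the estimate:

def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    # Weights alone: parameters x bits per weight, converted to gigabytes.
    # KV-cache and runtime overhead add to this in practice.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# All 35B MoE parameters stay resident in memory; only 3B are active per token.
print(estimate_weights_gb(35, 4.5))  # ~19.7 GB, consistent with the ~20GB figure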

GPT-OSS 120B, Llama 3.3 70B, Llama 3.2 Vision, Qwen 2.5 VL, Gemma 3

Available for specific tasks where model architecture, license terms, or training characteristics matter. Cross-architectural validation work informs which model fits which workflow.

Frameworks

Hermes Agent (Nous Research)

Agent framework for sustained, multi-step workflows on local hardware. Supports self-refining skills, integration with local files and applications, and model-agnostic deployment. NVIDIA-recommended pairing with DGX Spark for agentic workloads.
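As a rough illustration of the pattern such frameworks implement (this is generic Python, not the Hermes Agent API; the tool names and the FINAL convention are invented for the example): the model proposes an action, a local tool runs it, and the observation feeds back into the next step until the model returns a final answer.

from typing import Callable

def agent_loop(task: str, llm: Callable[[str], str],
               tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    # Generic agent pattern: the model proposes "tool: argument" actions
    # until it emits a "FINAL:" answer. Illustrative only.
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        reply = llm(transcript)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        name, _, arg = reply.partition(":")
        result = tools.get(name.strip(), lambda a: "unknown tool")(arg.strip())
        transcript += f"\n{reply}\nObservation: {result}"
    return transcript  # step budget exhausted; return the trace for inspection

# Stub model and tool, purely to show the calling convention.
print(agent_loop("List the contract files",
                 llm=lambda t: "FINAL: see observations above",
                 tools={"list_files": lambda path: "contract_a.pdf"}))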

Ollama, vLLM, llama.cpp

Inference runtimes selected per workload. Ollama for development and lower-throughput deployments. vLLM and llama.cpp for production serving where throughput and stability matter.
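By way of illustration, a deployed workflow's client code can be as small as the sketch below, which queries a locally served model through Ollama's HTTP API (default port 11434, /api/generate route). The model name and prompt are placeholders, and a running Ollama instance is assumed; vLLM exposes an OpenAI-compatible endpoint, so the same client pattern carries over to production serving.

import json
import urllib.request

def generate(prompt: str, model: str = "llama3.3") -> str:
    # Ollama serves on localhost:11434 by default; stream=False returns
    # the whole completion in a single JSON object.
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

print(generate("Summarize this clause in one sentence: ..."))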

Trajectory

One hardware payment. Continuously improving models. No subscription.

The open-weight ecosystem now releases competitive models within months of any frontier announcement. Deployed systems load each new model on the same hardware, with no migration cost and no subscription fee. The capability of a deployed system improves over its lifetime as the ecosystem matures. The hardware purchase is the only fixed investment.

Our consulting work is grounded in original mechanistic interpretability research. The same techniques we publish enable the document classification, extraction, and drift detection we deploy for clients.

Paper I — 2026

Causally Functional Content Representations in Transformer Residual Streams

A literary translation paradigm establishes that content representations peak at ~50% of network depth. Subspace activation patching demonstrates that a single direction in the 8,192-dimensional residual space captures the content-specific causal effect. Content directions are pair-specific and near-orthogonal, with a mean pairwise cosine similarity of +0.041.
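Mechanically, single-direction patching swaps only the component of an activation that lies along the content direction, leaving the orthogonal complement untouched. A minimal numpy sketch, with random placeholder vectors standing in for real activations and a learned direction:

import numpy as np

def patch_along_direction(h_dst: np.ndarray, h_src: np.ndarray, d: np.ndarray) -> np.ndarray:
    # Replace the destination activation's component along direction d with
    # the source activation's component; everything orthogonal to d is kept.
    d = d / np.linalg.norm(d)
    return h_dst + np.dot(h_src - h_dst, d) * d

rng = np.random.default_rng(0)
h_src, h_dst = rng.normal(size=8192), rng.normal(size=8192)  # residual-stream width
d = rng.normal(size=8192)  # stand-in for a learned content direction
patched = patch_along_direction(h_dst, h_src, d)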

Paper II — 2026

Depth-Dependent Dissociation of Content and Framing in Transformer Residual Streams

Framing similarity peaks at layer 25 (~31% of depth) while content peaks at layer 40 (~50%). A null baseline using different topics in the same registers produces the opposite pattern, confirming the signal is topic-dependent framing rather than register similarity. Validated across pharmaceutical, financial, and insurance domains.
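The underlying measurement is a per-layer cosine-similarity profile between paired passages' mean residual-stream activations; the dissociation shows up as different peak layers for framing-matched versus content-matched pairs. A sketch of the profile computation, with random placeholder activations standing in for real ones:

import numpy as np

def similarity_profile(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    # acts_*: (num_layers, hidden_dim) mean residual-stream activations per layer.
    # Returns the pair's cosine similarity at every layer.
    num = (acts_a * acts_b).sum(axis=1)
    den = np.linalg.norm(acts_a, axis=1) * np.linalg.norm(acts_b, axis=1)
    return num / den

rng = np.random.default_rng(0)
acts_a = rng.normal(size=(80, 8192))  # placeholder: 80 layers, 8192-dim states
acts_b = rng.normal(size=(80, 8192))
profile = similarity_profile(acts_a, acts_b)
# On real pairs, framing-matched passages peak near 31% of depth and
# content-matched passages near 50%; the peak layer is simply the argmax.
print(int(np.argmax(profile)))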

In Progress

Cross-Architectural Document Classification via Residual Stream Probing

Linear probes on vision-language model residual streams achieve 100% accuracy (Llama 3.2 Vision), 95% (Qwen 2.5 VL), and 84% (Gemma 3) on text-vs-chart classification with leave-one-out validation. Three distinct architectural signatures: early-emergence (Qwen), cross-attention-gated (Llama), gradual-integration (Gemma). Directly applicable to document routing and classification deployments.
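The probing setup is deliberately simple: a linear classifier on frozen activations, scored by leave-one-out cross-validation. A minimal scikit-learn sketch, with random placeholder features standing in for actual residual-stream activations:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Placeholder features; in the study each row is a document's residual-stream
# activation at a chosen layer, and y marks text page (0) vs chart (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4096))
y = rng.integers(0, 2, size=40)

probe = LogisticRegression(max_iter=1000)
loo_accuracy = cross_val_score(probe, X, y, cv=LeaveOneOut()).mean()
print(f"leave-one-out accuracy: {loo_accuracy:.2f}")  # ~chance on random features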

Contact

info@21cq2.com

Research

github.com/21CQ2