Offline, private AI for document-intensive workflows.

Open-weight models on hardware you own, deployed through scoped pilots that work on your actual documents before broader rollout.

On-premises demonstrations in the New York metropolitan area. Air-gapped configurations available for engagements where data classification requires it.

info@21cq2.com

Consulting Engagements

We structure engagements as scoped pilots that prove out value on your actual documents before broader deployment. Most clients begin with a defined workflow, validate results, and expand from there.

How Engagements Work

Stage One

Workflow Discovery

Remote or on-site (on-site available air-gapped). We review your document workflows, run candidate systems against samples you provide, and deliver a written assessment with recommended scope. Fixed fee. No commitment to proceed.

Stage Two

Focused Pilot

Deployment of one or more specific workflows on dedicated hardware in your office. We define acceptance criteria together, build the system to those criteria, train your team on it, and operate alongside you during a defined evaluation period.

Stage Three

Expanded Deployment

For clients who want to integrate further, expand systems to additional workflows, or deploy to multiple sites.

Ongoing

Maintenance and Updates

Open-weight models improve continuously. As new model releases offer meaningful capability improvements for your workflows, we update your deployed system. Ongoing engagement is hourly and demand-driven rather than a mandatory recurring subscription.

Stack

We deploy on hardware purpose-built for sustained local AI workloads, running open-weight models that have closed the gap with frontier providers. The configuration evolves as the open-weight ecosystem matures.

Hardware

NVIDIA DGX Spark

128GB unified memory, GB10 Blackwell architecture, designed for sustained agentic workloads. Runs 120B-parameter mixture-of-experts models continuously. Stackable. Sits in an office on standard power. No data center required.

Apple Silicon (Mac mini, Mac Studio)

Lower-footprint deployments for single-workflow installations. M4 Pro and M4 Max configurations support 24–128GB unified memory. Stackable. Fit in any office.

Models

NVIDIA Nemotron 3 Super

120B parameter mixture-of-experts model with 12B active per token, designed by NVIDIA for agentic and reasoning workloads. Primary model for deployments where doctrinal precision and procedural specificity matter — legal, regulatory, and complex compliance work. Runs natively on DGX Spark with the recommended NVIDIA stack.

Qwen 3.6 (35B / 27B)

Mixture-of-experts models for high-throughput document workflows where speed matters alongside quality. The 35B parameter version with 3B active per token delivers production-quality summarization and extraction at roughly 20GB memory footprint, suitable for Mac mini through DGX Spark deployments.

GPT-OSS 120B, Llama 3.3 70B, Llama 3.2 Vision, Qwen 2.5 VL, Gemma 3

Available for specific tasks where model architecture, license terms, or training characteristics matter. Cross-architectural validation work informs which model fits which workflow.

Frameworks

Hermes Agent (Nous Research)

Agent framework for sustained, multi-step workflows on local hardware. Supports self-refining skills, integration with local files and applications, model-agnostic deployment. NVIDIA-recommended pairing with DGX Spark for agentic workloads.

Ollama, vLLM, llama.cpp

Inference runtimes selected per workload. Ollama for development and lower-throughput deployments. vLLM and llama.cpp for production serving where throughput and stability matter.

Trajectory

One hardware payment. Continuously improving models. No subscription.

The open-weight ecosystem now releases competitive models within months of any frontier announcement. Deployed systems load each new model on the same hardware, with no migration cost and no subscription fee. The capability of a deployed system improves over its lifetime as the ecosystem matures. The hardware purchase is the only fixed investment.

Research Foundation

Our consulting work is grounded in original mechanistic interpretability research. The same techniques we publish enable the document classification, extraction, and drift detection we deploy for clients.

Paper I — 2026

Causally Functional Content Representations in Transformer Residual Streams

A literary translation paradigm establishes that content representations peak at ~50% of network depth. Subspace activation patching demonstrates that a single direction in 8,192-dimensional space captures content-specific causal effect. Content directions are pair-specific and orthogonal, with mean pairwise cosine similarity of +0.041.

GitHub

Paper II — 2026

Depth-Dependent Dissociation of Content and Framing in Transformer Residual Streams

Framing similarity peaks at layer 25 (~31% of depth) while content peaks at layer 40 (~50%). A null baseline using different topics in the same registers produces the opposite pattern, confirming the signal is topic-dependent framing rather than register similarity. Validated across pharmaceutical, financial, and insurance domains.

GitHub

In Progress

Cross-Architectural Document Classification via Residual Stream Probing

Linear probes on vision-language model residual streams achieve 100% accuracy (Llama 3.2 Vision), 95% (Qwen 2.5 VL), and 84% (Gemma 3) on text-vs-chart classification with leave-one-out validation. Three distinct architectural signatures: early-emergence (Qwen), cross-attention-gated (Llama), gradual-integration (Gemma). Directly applicable to document routing and classification deployments.

Contact

info@21cq2.com

Research

github.com/21CQ2