Open-weight models on hardware you own, proven on your actual documents through scoped pilots before broader rollout.
On-premises demonstrations in the New York metropolitan area. Air-gapped configurations available for engagements where data classification requires it.
We structure engagements as scoped pilots that prove out value on your actual documents before broader deployment. Most clients begin with a defined workflow, validate results, and expand from there.
Remote or on-site (air-gapped on-site available). We review your document workflows, run candidate systems against samples you provide, and deliver a written assessment with recommended scope. Fixed fee. No commitment to proceed.
Deployment of one or more specific workflows on dedicated hardware in your office. We define acceptance criteria together, build the system to those criteria, train your team on it, and operate alongside you during a defined evaluation period. If the system meets the acceptance criteria, you keep it. If it does not, we have learned together what needs to change.
For clients whose pilot succeeded and who want to expand the system to additional workflows, integrate it more deeply with existing systems, or deploy to multiple sites. This is where larger investments make sense — including air-gapped configurations, custom model fine-tuning on your historical documents, and multi-user deployments.
Open-weight models improve continuously. As new releases offer meaningful capability improvements for your workflows, we update your deployed system. Ongoing engagement is hourly and demand-driven, not a mandatory recurring subscription.
We deploy on hardware purpose-built for sustained local AI workloads, running open-weight models that have closed the gap with frontier providers. The configuration evolves as the open-weight ecosystem matures.
128GB unified memory, GB10 Blackwell architecture, designed for sustained agentic workloads. Runs 120B-parameter mixture-of-experts models continuously. Sits in an office on standard power. No data center required.
Lower-footprint deployments for single-workflow installations. M4 Pro and M4 Max configurations support 24–128GB unified memory. Fits in any office.
120B-parameter mixture-of-experts model with 12B active per token, designed by NVIDIA for agentic and reasoning workloads. Primary model for deployments where doctrinal precision and procedural specificity matter — legal, regulatory, and complex compliance work. Runs natively on DGX Spark with the recommended NVIDIA stack.
Mixture-of-experts models for high-throughput document workflows where speed matters alongside quality. The 35B-parameter version with 3B active per token delivers production-quality summarization and extraction at a roughly 20GB memory footprint, suitable for deployments from Mac mini through DGX Spark.
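The ~20GB figure follows from simple weight arithmetic. A minimal sketch, assuming 4-bit quantization and a roughly 10% allowance for embeddings and runtime buffers (our assumptions, not a runtime measurement):

```python
# Back-of-envelope memory estimate for a quantized model's weights.
# Assumptions (ours): 4 bits per parameter, ~10% overhead for
# embeddings, KV-cache headroom, and runtime buffers.

def weight_memory_gb(params_billion: float, bits_per_param: float = 4.0,
                     overhead: float = 0.10) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total * (1 + overhead) / 1e9

# A 35B-parameter model at 4-bit lands near the ~20GB footprint above.
# Total parameters govern memory; the 3B active per token governs speed.
estimate = weight_memory_gb(35)
```

The same arithmetic explains why a 120B-parameter model needs DGX Spark-class memory while a 35B model fits comfortably on a Mac mini.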
Available for specific tasks where model architecture, license terms, or training characteristics matter. Cross-architectural validation work informs which model fits which workflow.
Agent framework for sustained, multi-step workflows on local hardware. Supports self-refining skills, integration with local files and applications, model-agnostic deployment. NVIDIA-recommended pairing with DGX Spark for agentic workloads.
Inference runtimes selected per workload. Ollama for development and lower-throughput deployments. vLLM and llama.cpp for production serving where throughput and stability matter.
The open-weight ecosystem now releases competitive models within months of any frontier announcement. Deployed systems load each new model on the same hardware, with no migration cost and no subscription fee. The capability of a deployed system improves over its lifetime as the ecosystem matures. The hardware purchase is the only fixed investment.
Our consulting work is grounded in original mechanistic interpretability research. The same techniques we publish enable the document classification, extraction, and drift detection we deploy for clients.
A literary translation paradigm establishes that content representations peak at ~50% of network depth. Subspace activation patching demonstrates that a single direction in 8,192-dimensional space captures content-specific causal effect. Content directions are pair-specific and orthogonal, with mean pairwise cosine similarity of +0.041.
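The core patching operation is simple to state. A toy numpy sketch of single-direction ("subspace") activation patching, with synthetic vectors and hypothetical shapes rather than real model activations:

```python
import numpy as np

# Toy sketch of subspace activation patching: swap only the component of a
# hidden state that lies along one content direction, leaving the other
# 8,191 dimensions untouched. All tensors here are synthetic.

rng = np.random.default_rng(0)
d_model = 8192

d = rng.standard_normal(d_model)
d /= np.linalg.norm(d)                 # unit content direction

h_base = rng.standard_normal(d_model)  # hidden state from the base run
h_src = rng.standard_normal(d_model)   # hidden state from the source run

# Patch: replace h_base's coefficient along d with h_src's coefficient.
h_patched = h_base + (h_src @ d - h_base @ d) * d

# Along d, the patched state now matches the source run; orthogonal to d,
# it still matches the base run.
```

If behavior changes under this one-dimensional edit, the causal effect is carried by that single direction, which is the claim the patching experiments test.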
Framing similarity peaks at layer 25 (~31% of depth) while content peaks at layer 40 (~50%). A null baseline using different topics in the same registers produces the opposite pattern, confirming the signal is topic-dependent framing rather than register similarity. Validated across pharmaceutical, financial, and insurance domains.
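The depth profiles behind those peaks come from comparing paired representations layer by layer. An illustrative sketch on synthetic activation stacks (shapes and noise levels are ours, not the published measurements):

```python
import numpy as np

# Illustrative sketch: locate the layer where two texts' representations
# are most similar by computing cosine similarity per layer. The data is
# synthetic, built so the shared component peaks in strength at layer 40.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
n_layers, d_model = 80, 256

shared = rng.standard_normal((n_layers, d_model))
weights = np.exp(-((np.arange(n_layers) - 40) ** 2) / 50.0)
acts_a = weights[:, None] * shared + 0.3 * rng.standard_normal((n_layers, d_model))
acts_b = weights[:, None] * shared + 0.3 * rng.standard_normal((n_layers, d_model))

profile = [cosine(acts_a[i], acts_b[i]) for i in range(n_layers)]
peak_layer = int(np.argmax(profile))   # lands near the constructed peak
```

The null baseline works the same way: run the profile on pairs that share register but not topic, and check that the peak moves.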
Linear probes on vision-language model residual streams achieve 100% accuracy (Llama 3.2 Vision), 95% (Qwen 2.5 VL), and 84% (Gemma 3) on text-vs-chart classification with leave-one-out validation. Three distinct architectural signatures: early-emergence (Qwen), cross-attention-gated (Llama), gradual-integration (Gemma). Directly applicable to document routing and classification deployments.
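The probing setup can be sketched on synthetic activations. Everything below (shapes, the ridge-regression probe, the class geometry) is illustrative, not the published pipeline:

```python
import numpy as np

# Toy linear probe with leave-one-out validation: separate "text page"
# from "chart page" residual-stream vectors. Features are synthetic,
# drawn around two class means.

rng = np.random.default_rng(2)
n_per_class, d_model = 20, 64

direction = rng.standard_normal(d_model)
text_acts = rng.standard_normal((n_per_class, d_model)) + direction   # label 1
chart_acts = rng.standard_normal((n_per_class, d_model)) - direction  # label 0
X = np.vstack([text_acts, chart_acts])
y = np.array([1] * n_per_class + [0] * n_per_class)

def fit_probe(X, y, ridge=1e-2):
    # Ridge-regularized least-squares probe mapping activations to +/-1 labels.
    t = 2.0 * y - 1.0
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ t)

correct = 0
for i in range(len(y)):                      # leave-one-out loop
    mask = np.arange(len(y)) != i
    w = fit_probe(X[mask], y[mask])
    correct += int(int(X[i] @ w > 0) == y[i])

loo_accuracy = correct / len(y)
```

In deployment, the same probe idea drives document routing: a cheap linear read-out on one layer's activations decides which pipeline a page enters.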