Harness-1: The Open Source AI Search Agent That Beats GPT-5.4

by marketingagent.io, 2 days ago

A joint research team from the University of Illinois at Urbana-Champaign, UC Berkeley, and the open-source vector database platform Chroma just published Harness-1 — a 20-billion parameter open-source search agent that outperforms GPT-5.4 on information recall benchmarks, according to VentureBeat’s June 8, 2026 reporting. For marketing teams, this is not a research curiosity — it is a signal that high-fidelity AI-powered research and retrieval is decoupling from proprietary API dependencies, and the cost, control, and capability implications are worth acting on right now.

What Happened

On June 8, 2026, researchers from UIUC, UC Berkeley, and Chroma published a paper introducing Harness-1, a 20-billion parameter open-source search agent that sets a new standard in AI-powered information retrieval. The paper — publicly available on arXiv (2606.02373) — introduces what the team calls a “state-externalizing harness” architecture that fundamentally rethinks how search agents manage memory and context during multi-step retrieval sessions.

The research team includes Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, and Jiawei Han. They built Harness-1 on top of OpenAI’s gpt-oss-20B, an open-weight model that OpenAI released in August 2025 as part of its renewed open-source push, and trained it using reinforcement learning within the harness framework.

The core problem the team solved is architectural. Traditional search agents are trained as policies over continuously growing transcripts: the model must simultaneously decide how to search, remember what it has already found, track which evidence is useful, monitor which research constraints remain open, and verify which claims have actually been confirmed — all within a finite context window. As search sessions grow longer and more complex, this combined cognitive load degrades performance. The model gets overwhelmed managing state while also reasoning about relevance, and accuracy drops.

The Harness-1 team’s answer to this problem is strict separation of concerns. State management is externalized entirely — moved out of the model’s context and into the surrounding environment, which they call the harness. The harness maintains six distinct components throughout every search session:

Candidate pool: the set of documents currently under active consideration
Curated evidence set: importance-tagged documents that have passed the agent’s relevance threshold and been retained
Evidence links: compact connections between related evidence items that preserve relational structure across search steps
Verification records: a persistent log of what claims have been checked and confirmed during the session
Compressed observations: deduplicated notes from prior search steps, preventing the agent from re-searching what it has already covered
Budget-aware context rendering: a real-time view of what can be presented to the model within its remaining context window, ensuring coherent operation even in long sessions

With the harness managing all of this environmental state, the 20B-parameter neural policy is freed to focus exclusively on semantic decisions: which documents to select, what to retain versus discard, when a claim needs additional verification, and when the search has found enough to stop. This cognitive division of labor — harness handles memory, policy handles meaning — is what drives the performance leap.
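To make the division of labor concrete, here is a minimal Python sketch of what externalized state could look like. Every class, field, and method name below is hypothetical and chosen for illustration; this is not the data model from the Harness-1 repository, and the character budget is a simple stand-in for a token budget.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    doc_id: str
    text: str
    importance: int  # higher means more important (hypothetical scale)


@dataclass
class HarnessState:
    """Hypothetical container for the six harness components."""
    candidate_pool: dict[str, str] = field(default_factory=dict)  # doc_id -> text
    curated_evidence: dict[str, Evidence] = field(default_factory=dict)
    evidence_links: set[tuple[str, str]] = field(default_factory=set)
    verification_records: dict[str, bool] = field(default_factory=dict)  # claim -> confirmed
    compressed_observations: list[str] = field(default_factory=list)

    def observe(self, note: str) -> None:
        """Keep a note only if an identical one is not already stored."""
        if note not in self.compressed_observations:
            self.compressed_observations.append(note)

    def curate(self, doc_id: str, importance: int) -> None:
        """Move a candidate document into the curated evidence set."""
        text = self.candidate_pool.pop(doc_id)
        self.curated_evidence[doc_id] = Evidence(doc_id, text, importance)

    def render(self, budget_chars: int) -> str:
        """Budget-aware rendering: most important evidence first,
        stopping at the first item that would exceed the budget."""
        lines: list[str] = []
        used = 0
        ranked = sorted(self.curated_evidence.values(),
                        key=lambda e: e.importance, reverse=True)
        for ev in ranked:
            line = f"[{ev.importance}] {ev.doc_id}: {ev.text}"
            if used + len(line) > budget_chars:
                break
            lines.append(line)
            used += len(line)
        return "\n".join(lines)


state = HarnessState()
state.candidate_pool["d1"] = "Pricing page changed"
state.candidate_pool["d2"] = "Old blog post"
state.curate("d1", importance=3)
state.observe("searched pricing")
state.observe("searched pricing")  # duplicate, ignored
print(state.render(budget_chars=1000))
```

The point of the sketch is that none of this bookkeeping lives in the model's context: the policy only sees what `render` returns and decides what to do next.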

The arXiv paper reports results benchmarked across eight distinct retrieval domains, including web search, financial data, patent documents, and multi-hop question answering. Harness-1 achieves an average curated recall of 0.730, an improvement of 11.4 points over the next-strongest open-source search subagent. The paper characterizes the system as “competitive with larger frontier models,” and VentureBeat reported that performance specifically exceeds GPT-5.4 on recall tasks.

The system also shows “strong transfer performance on held-out benchmarks,” meaning Harness-1 generalizes to retrieval domains not seen during training — a critical property for real-world deployment where no team can anticipate every retrieval scenario in advance.

The code is publicly available on GitHub (pat-jj/harness-1), runs locally via vLLM on CUDA-compatible GPUs, and requires Python 3.11+. The paper is released under Creative Commons BY 4.0, which permits commercial use and derivative works with attribution. This is not a gated preview or lab demo; it can be deployed today.

Why This Matters

Harness-1 is not a general-purpose language model. It is a purpose-built retrieval agent trained specifically to search, evaluate, and curate information more effectively than the best proprietary systems available. That narrow focus is what makes it potentially transformative for marketing applications.

Marketing teams are fundamentally in the information business. Competitive positioning requires tracking what rivals are doing across dozens of channels. Audience research demands synthesizing signals from forums, reviews, social platforms, and primary data simultaneously. Content strategy is built on identifying what questions prospects are asking and where authoritative answers don’t yet exist. Campaign compliance in regulated verticals requires continuous cross-referencing with regulatory guidance that updates frequently. Account-based marketing at scale requires deep research on hundreds of target accounts at once. All of these workflows reduce, at their core, to one capability: finding the right information in a large, noisy information environment — and knowing you’ve actually found it.

Until now, the only AI systems with the recall depth and multi-hop reasoning capability to handle these tasks at production quality were proprietary frontier models: GPT-5.4, Claude Opus, and similar systems. Using them meant ongoing per-query API costs, latency at scale, rate limits, and — critically — routing your proprietary research queries and competitive intelligence through third-party servers you do not control.

Harness-1 changes the calculus on four dimensions: cost, latency, data privacy, and recall quality.

Cost: A 20B-parameter model running locally via vLLM on a standard NVIDIA GPU cluster costs a fraction of frontier API rates. For marketing teams running hundreds or thousands of research queries per month — competitive monitoring, content gap analysis, prospect profiling — the savings compound quickly. An agency running 5,000 structured research queries per month at GPT-5.4 API pricing versus the amortized infrastructure cost of local inference faces a cost differential that can reach tens of thousands of dollars annually. At scale, this becomes a structural expense difference, not a marginal optimization.

Latency: Local inference removes the network round trip to a third-party API, although each step still incurs the model’s own inference time. The Harness-1 architecture is explicitly designed for multi-step retrieval: chains of search operations where one finding informs the next query. At 1-2 seconds of API latency per step, a 10-step retrieval chain spends 10-20 seconds on API latency alone and is subject to third-party service availability. Local inference with appropriate hardware compresses that significantly and removes the external dependency from your production pipeline.
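The latency arithmetic is simple enough to check yourself. The sketch below assumes the steps run strictly one after another, which is the case when each query depends on the previous finding; the function name is illustrative.

```python
def chain_latency_seconds(steps: int, per_step_low: float,
                          per_step_high: float) -> tuple[float, float]:
    """Return (best, worst) total latency for a strictly sequential chain."""
    return steps * per_step_low, steps * per_step_high


low, high = chain_latency_seconds(steps=10, per_step_low=1.0, per_step_high=2.0)
print(f"{low:.0f}-{high:.0f} s")  # 10-20 s
```

Swap in your own measured per-step figures for both the API and the local setup to get a like-for-like comparison.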

Data privacy: Agencies and brand marketing teams regularly search for sensitive material — unpublished product positioning, competitive pricing intelligence, proprietary research findings, customer complaint patterns. Routing those queries through a third-party API is a legitimate compliance and confidentiality exposure. Running Harness-1 locally means queries and retrieved documents never leave your infrastructure. For enterprises in healthcare, financial services, or legal verticals, this is not a preference — it is a hard requirement.

Recall quality: The specific capability Harness-1 excels at — curated recall, meaning correctly identifying which documents among many candidates are genuinely most relevant — is precisely what marketing research demands. You do not want a system that returns 50 marginally relevant results. You want the eight most critical ones, correctly tagged and linked to supporting evidence. A 0.730 average curated recall across hard, multi-domain benchmarks, per the arXiv paper, means the system performs at frontier-competitive quality on the retrieval task that most directly determines the quality of marketing intelligence outputs.

The question for practitioners has shifted. It is no longer whether open-source retrieval agents can perform at frontier level on this task — Harness-1 demonstrates they can. The question is now: how quickly do you integrate this capability into your workflows before your competitors do?

The Data

Understanding Harness-1’s performance requires looking at the benchmarks carefully. The arXiv paper evaluates the system using four metrics: curated recall (did the agent find the right documents?), answer-tied recall (did those documents support the correct answer?), trajectory recall (were the right documents found at any point during the search session?), and precision (how much irrelevant noise was included in the curated set?). The primary headline metric is 0.730 average curated recall across eight distinct retrieval domains — consistent performance across a diverse benchmark set, not optimization for a narrow task.
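If the four metrics feel abstract, a toy example helps. The sketch below uses the standard set-based definitions of recall and precision; the paper's exact formulas may differ, and the document IDs are made up for illustration.

```python
def recall(found: set[str], relevant: set[str]) -> float:
    """Share of the relevant documents that were found."""
    return len(found & relevant) / len(relevant) if relevant else 0.0


def precision(found: set[str], relevant: set[str]) -> float:
    """Share of the found documents that are relevant."""
    return len(found & relevant) / len(found) if found else 0.0


relevant = {"d1", "d2", "d3", "d4"}           # gold documents for the query
answer_support = {"d1", "d2"}                 # subset that supports the final answer
trajectory = {"d1", "d2", "d3", "d7", "d8"}   # everything seen during the session
curated = {"d1", "d3", "d7"}                  # what the agent chose to keep

curated_recall = recall(curated, relevant)            # 2 of 4 gold docs kept
answer_tied_recall = recall(curated, answer_support)  # 1 of 2 answer docs kept
trajectory_recall = recall(trajectory, relevant)      # 3 of 4 gold docs seen
curated_precision = precision(curated, relevant)      # 2 of 3 kept docs are gold

print(curated_recall, answer_tied_recall, trajectory_recall,
      round(curated_precision, 3))
```

In this toy session the agent saw document d2 but did not keep it, which is why trajectory recall is higher than curated recall. That gap is exactly what a better curation policy closes.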

The +11.4-point improvement over the next-best open-source search agent at this recall level is not an incremental gain. In retrieval benchmarking terms, an 11-point improvement in recall represents a category-level jump — the difference between a system that finds the right documents roughly 62% of the time versus one that finds them 73% of the time, on hard multi-domain tasks where the “right” documents are non-obvious and multi-hop reasoning is required to determine relevance.
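The 62% figure is derived rather than reported. A quick check, assuming "points" means hundredths of recall:

```python
harness_recall = 0.730
gain_points = 11.4  # percentage points, as reported in the paper

# Implied score of the prior best open-source agent.
baseline = round(harness_recall - gain_points / 100, 3)

# The same gain expressed relative to that baseline.
relative_gain = round(gain_points / 100 / baseline, 3)

print(baseline, relative_gain)
```

So the implied baseline is about 0.616, and the 11.4-point gain works out to roughly an 18.5% relative improvement over it.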

The following table compares Harness-1 against the landscape of retrieval approaches based on the arXiv paper and VentureBeat’s reporting:

System	Parameters	Open Source	Avg. Curated Recall	Architecture	Deployment Model
Harness-1	20B	✅ Yes	0.730	RL + state-externalizing harness	Local (vLLM / CUDA)
Prior best open-source agent	~20B	✅ Yes	~0.616 (est.)	Standard RL on transcripts	Local / API
GPT-5.4	Undisclosed	❌ No	Competitive tier*	RLHF / RLAIF	API only
Standard RAG (BM25 + reranker)	N/A	✅ Yes	~0.45–0.55 (est.)	Retrieval + rerank, no RL	Varies
Naive keyword search	N/A	✅ Yes	~0.25–0.35 (est.)	Lexical matching only	Varies

* GPT-5.4 exact recall scores are not publicly disclosed. “Competitive tier” per VentureBeat. Prior best open-source agent estimate is derived from the stated +11.4-point gap reported in the arXiv paper. Standard RAG and keyword search figures are industry context estimates, not from the Harness-1 paper.

The benchmark domains are directly relevant to marketing practitioners because they map onto real marketing research tasks. Among the eight, web search covers brand monitoring and trend research; financial data covers competitive intelligence on public companies; patents cover product innovation tracking; and multi-hop question answering represents the chained reasoning tasks that make up the hardest parts of deep research workflows, where you need to find X, use X to find Y, and use Y to answer the original question.

The Chroma Infrastructure Layer

The involvement of Chroma as a research collaborator is architecturally significant for production deployment. Chroma is an open-source, Apache 2.0 licensed search infrastructure platform that supports vector, full-text, BM25, SPLADE, regex, and metadata search in a unified system. Its published benchmarks report 20ms p50 query latency for 100,000 vectors at 384 dimensions and 90-100% recall at scale, which makes it a production-grade indexing layer for Harness-1 to operate against.

With 27,000+ GitHub stars, 15 million monthly downloads, and integration into 90,000+ open-source codebases, Chroma is already deployed across a significant share of AI-native applications. It supports up to 5 million records per collection and stores data on object storage at approximately $0.02/GB/month — economics that make large-scale marketing intelligence indexing financially practical. Having Chroma as a research partner means Harness-1 is designed from the outset to plug into real production retrieval infrastructure, not a custom research backend.

Real-World Use Cases

Here are five concrete marketing applications you can begin building against the published Harness-1 codebase today.

Use Case 1: Automated Competitive Intelligence Monitoring

Scenario: A mid-size SaaS marketing team tracks eight competitors across product announcements, pricing changes, messaging updates, and customer review sentiment. Currently, this research consumes 6-8 hours of analyst time per week and produces inconsistent coverage — some competitors get reviewed thoroughly, others get skimmed, and nothing is documented.

Implementation: Deploy Harness-1 locally on a GPU server and configure a weekly retrieval job running structured multi-hop queries per competitor: “What new features has [competitor] announced in the last 7 days?”, “What are customers on G2 and Reddit saying about [competitor]’s onboarding in the last 30 days?”, “Have their enterprise pricing or packaging pages changed?” The harness’s evidence-linking and verification records components are specifically engineered for this kind of multi-step, multi-source session — each search step builds on prior findings without losing context as the session extends. Export curated evidence sets as structured reports with source links intact.
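The weekly job starts with a query plan. Here is a minimal sketch of how the templates above could be expanded per competitor. The function name and the competitor names are placeholders, and how the resulting queries are submitted to Harness-1 is deliberately left out, since that depends on the interface the repository exposes.

```python
from itertools import product

QUERY_TEMPLATES = [
    "What new features has {competitor} announced in the last 7 days?",
    "What are customers on G2 and Reddit saying about {competitor}'s "
    "onboarding in the last 30 days?",
    "Have {competitor}'s enterprise pricing or packaging pages changed?",
]


def build_query_plan(competitors: list[str]) -> list[dict[str, str]]:
    """Expand every template for every competitor into one flat job list."""
    return [
        {"competitor": name, "query": template.format(competitor=name)}
        for name, template in product(competitors, QUERY_TEMPLATES)
    ]


plan = build_query_plan(["Acme", "Globex"])  # placeholder competitor names
for job in plan:
    print(job["competitor"], "|", job["query"])
```

Keeping the templates in one list is what delivers the consistency benefit: every competitor gets exactly the same questions every cycle.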

Expected Outcome: 6-8 hours of weekly research reduced to 20-30 minutes of human review and synthesis. Every competitor gets checked against every source every cycle, eliminating the inconsistency of analyst-driven coverage. The 0.730 curated recall is a benchmark average, not a guarantee for this workload, but if it carries over, roughly 73% of the relevant documents about competitor changes would be surfaced, compared to the unpredictable coverage of manual search. A documented evidence trail shows exactly what was searched, when, and what was found, which also serves as a defensible competitive intelligence record for legal and strategic purposes.

Use Case 2: Content Gap Analysis at Scale

Scenario: A B2B content agency manages content strategy for 20 clients across different verticals. Identifying what questions prospects in each vertical are asking, what competitors have already answered, and where authoritative content gaps exist is currently done manually and inconsistently — done thoroughly for some clients, not at all for others.

Implementation: Build a retrieval pipeline using Harness-1 on top of Chroma. For each client vertical, index competitor content archives, relevant community forums, customer support transcripts, and sales call summaries into dedicated Chroma collections. Chroma’s hybrid search (vector for semantic similarity, BM25 for keyword precision, metadata filtering for recency and source type) lets you query these collections with nuance that simple keyword search cannot match. Run Harness-1 retrieval jobs using queries derived from actual customer question patterns: “What questions about [topic] appear frequently in [vertical] forums that published content doesn’t adequately answer?”, “What does [competitor] cover on [topic] that our client has no content for?”

Expected Outcome: Systematic, evidence-backed content calendars for all 20 clients generated in hours rather than analyst-weeks. Prioritization is driven by actual retrieved evidence of gaps, not editorial intuition. Because Chroma supports up to 5 million records per collection, this architecture scales to large-enterprise content programs without requiring infrastructure changes.

Use Case 3: Prospect Research Automation for ABM

Scenario: An account-based marketing team targeting 500 enterprise accounts needs substantive research on each before campaign launch: recent news, declared strategic priorities, technology stack signals, active buying indicators, and decision-maker context. At 30 minutes of manual research per account, that’s 250 hours of work before the campaign goes live, which is rarely feasible.

Implementation: Deploy Harness-1 as an automated account research policy. For each target account, configure chained retrieval queries: pull recent press releases and news mentions to surface active strategic priorities; cross-reference earnings calls or public filings for technology investment declarations; search job postings for technology stack signals — a company aggressively posting for Kubernetes and Terraform engineers is signaling infrastructure transformation that maps to specific ICP purchase triggers; retrieve available content from LinkedIn profiles of key decision-makers to understand current focus areas. The harness’s evidence-linking component connects findings across sources, enabling synthesis of a coherent account narrative rather than a list of disconnected facts.

Expected Outcome: 500 substantive, sourced account briefs generated overnight rather than over multiple weeks. Sales and marketing alignment improves because every brief follows the same structure with the same level of rigor. Fewer buying signals should be missed: a purpose-built retrieval agent with 0.730 benchmark curated recall checks every account against every source, unlike the ad hoc searching that typically characterizes ABM research. The pipeline can rerun continuously, refreshing accounts as new signals emerge.

Use Case 4: Brand Mention and Sentiment Intelligence Across the Open Web

Scenario: A D2C e-commerce brand needs to monitor brand mentions across the open web — not just the platforms that enterprise social listening tools prioritize, but niche forums, subreddits, Quora threads, product review communities, and specialized vertical communities that either cost significant incremental fees or fall outside the coverage of major tools entirely.

Implementation: Use Chroma’s Cloud Sync capabilities — the platform’s automated data ingestion feature that pulls from web sources — to maintain a rolling indexed collection of brand-relevant communities and review platforms. Run daily Harness-1 retrieval jobs querying for brand names, product names, competitive comparisons, and category-level sentiment across the indexed sources. The agent’s verification records component prevents double-counting the same mentions across sessions, and the deduplication in compressed observations keeps the daily brief clean rather than repetitive.
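The same no-double-counting idea can also be applied in the reporting step that assembles the daily brief. The sketch below is not how the harness implements deduplication; it is a simple exact-match illustration with hypothetical function names, using a fingerprint of the normalized mention text.

```python
import hashlib


def mention_key(text: str) -> str:
    """Fingerprint a mention after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def new_mentions(mentions: list[str], seen: set[str]) -> list[str]:
    """Return mentions not seen before and record their fingerprints in `seen`."""
    fresh = []
    for text in mentions:
        key = mention_key(text)
        if key not in seen:
            seen.add(key)
            fresh.append(text)
    return fresh


seen: set[str] = set()  # persist this between runs in a real pipeline
day1 = new_mentions(["Love the new blender!", "love the new   blender!"], seen)
day2 = new_mentions(["Love the new blender!", "Shipping took three weeks"], seen)
print(day1)
print(day2)
```

This only catches exact repeats after normalization. Reposts with small edits would need fuzzy or embedding-based matching on top.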

Expected Outcome: Brand intelligence that covers the long tail of the web, not just the platforms enterprise tools prioritize. Faster crisis detection — a developing negative thread in a niche community surfaces before it reaches mainstream media coverage. Richer voice-of-customer data for product marketing: real customer language from real communities, available for positioning and messaging development without third-party panel costs.

Use Case 5: Regulatory and Compliance Research for Regulated Verticals

Scenario: A healthcare or financial services marketing agency must verify all campaign claims are compliant with current FDA advertising guidelines, FINRA rules, or FTC disclosure requirements before campaigns go live. Regulations update frequently, reviewers have finite memory of the full regulatory landscape, and manual compliance review is slow enough to delay campaign launches.

Implementation: Index the full corpus of current regulatory documents — FDA guidance letters, FTC enforcement actions and rules, FINRA notices, relevant precedent cases — into a Chroma collection with scheduled automated updates whenever regulatory sources publish new content. Deploy Harness-1 as the compliance research policy against draft campaign copy: “Does this efficacy claim align with current FDA guidance for [drug class]?”, “What FTC disclosure language is required for this type of performance claim?”, “Has there been a recent enforcement action relevant to claims like this one?” The harness’s evidence-linking component traces how each compliance determination connects to specific regulatory source documents, producing citation chains that can be stored as compliance records.

Expected Outcome: Compliance review time reduced significantly with more complete regulatory coverage — the agent retrieves from the full regulatory document library rather than relying on reviewer memory. Fully sourced compliance traces provide the documentation trails that regulated verticals require for internal audit purposes. Earlier flagging of compliance issues during creative development reduces last-minute campaign revisions and launch delays.

The Bigger Picture

The release of Harness-1 arrives at a specific inflection point in the AI infrastructure stack’s maturation. For the past two years, the marketing technology industry has operated on an implicit assumption: retrieval-grade AI capability — the kind that searches, evaluates evidence, and synthesizes findings with frontier-model accuracy — requires frontier proprietary models. That assumption is now under direct challenge.

The open-source AI ecosystem has been assembling the prerequisites for exactly this moment. Chroma, with 15 million monthly downloads and hybrid search covering vector, full-text, and metadata queries, provides production-grade retrieval infrastructure. OpenAI’s gpt-oss-20B provides a capable open-weight base model. Reinforcement learning techniques, once the exclusive province of AI labs with massive compute budgets, have become accessible to academic research teams. And vLLM has made high-throughput local inference practical on commodity GPU hardware.

Harness-1 is the product of those stacked capabilities: a research team assembled existing open components with a novel architectural insight (state externalization) and produced a system that performs at frontier-competitive levels on the specific task of information retrieval. The research provenance matters: UIUC and UC Berkeley are among the most productive AI research institutions in the world. Their output tends to become production infrastructure quickly. Meta’s LLaMA series, which reshaped the open-source LLM landscape starting in 2023, shows a similar pattern from an industry lab: openly published research that became the foundation for production systems across the industry within 12-18 months of initial release.

The commoditization pattern for AI capabilities is consistent and well-documented. RAG was a research technique in 2020 and standard production practice by 2023. Fine-tuning frontier models was a proprietary capability in 2021 and a commodity skill by 2024. Harness-1 demonstrates in 2026 that open-source retrieval agents can achieve frontier-competitive recall, which puts widespread production deployment of specialized retrieval agents squarely on the 2026-2027 horizon, not some distant future.

For marketing technologists, the downstream implication is a structural shift in the build-versus-buy decision for AI-powered research and retrieval workflows. The argument for paying frontier API rates for pure retrieval tasks — “you need frontier-level capability for this” — weakens materially when a 20B open-source model achieves competitive recall scores on the same benchmarks. Teams that recognize this shift early and build the supporting infrastructure now — indexed knowledge bases, local inference pipelines, domain-specific fine-tuning workflows — will have a durable head start when this capability becomes table stakes across the industry.

The CC BY 4.0 license is also strategically meaningful: it permits commercial use and derivative works, provided attribution is given. Third-party providers are likely to build hosted Harness-1 services and domain-specific fine-tuned variants. The commoditization timeline will accelerate because commercial incentives will amplify the research team’s foundational work.

What Smart Marketers Should Do Now

Run a proof of concept on your highest-volume research workflow this week. The code is live at GitHub (pat-jj/harness-1). Python 3.11+ and vLLM are the primary requirements, with CUDA-compatible GPU hardware for inference. Pick the research task your team executes most frequently — competitor monitoring, content research, prospect profiling — and run a structured test comparing Harness-1’s curated recall against your current pipeline, whether that’s a frontier API or a standard RAG setup. The goal is not production deployment at this stage; it is establishing an empirical baseline for your specific use case so you can make a data-informed infrastructure decision rather than a speculative one. The benchmark to compare against: 0.730 average curated recall, per the arXiv paper.
Audit your AI retrieval API spend before evaluating alternatives. The cost displacement case for Harness-1 depends on knowing what you are currently spending on retrieval-class AI tasks. Break down your frontier model API costs by workflow type: what percentage is pure retrieval and information extraction, versus tasks that require deep reasoning, nuanced judgment, or creative generation? Retrieval is the use case category where open-source alternatives have the most compelling cost and performance case today. If retrieval-class tasks represent a large share of your AI API spend (40-60% is plausible for research-heavy marketing teams), the potential savings from a local deployment are large enough to justify a formal evaluation of dedicated infrastructure.
Build Chroma-based indexing infrastructure now, regardless of which search agent you ultimately deploy. Chroma is free, open-source, production-proven at scale, and the natural complement to Harness-1. Even before you have a search agent in production, building structured Chroma collections from your marketing data assets — competitive intelligence archives, customer feedback corpora, content libraries, regulatory document collections — creates the retrieval infrastructure that any search agent can operate against. This is foundational infrastructure investment that compounds over time: every retrieval system you run, current or future, becomes more powerful as the indexed knowledge base grows.
Establish benchmark thresholds for evaluating open-source retrieval agents as they emerge. Harness-1 establishes a concrete reference point: 0.730 average curated recall across eight diverse domains, +11.4 points above the prior open-source best. The BrowseComp+ evaluation benchmark referenced in the Harness-1 repository is publicly available. Set an internal policy: any open-source retrieval agent achieving above 0.70 curated recall across diverse retrieval benchmarks warrants a formal production pilot evaluation. This gives your team a systematic framework for assessing the succession of retrieval agents that will follow Harness-1, rather than evaluating each one ad hoc based on marketing claims.
Build a proprietary data moat strategy while retrieval capability commoditizes. This is the strategic implication most teams will miss. When retrieval capability becomes widely available as open-source infrastructure — which Harness-1 accelerates — the competitive advantage shifts entirely to the quality and exclusivity of the data you index. A Harness-1 deployment running against your proprietary first-party data — years of customer conversation records, win/loss analysis, internal research archives, competitive intelligence — is categorically more valuable than the same model running against public web data. The model becomes table stakes; the data becomes the moat. Use the current window — while open-source retrieval is still in early adoption — to define what proprietary data your organization has or could systematically collect, and build the indexing infrastructure to make it queryable. Organizations doing this now will have a durable advantage two years from now, when capable open-source retrieval agents are standard equipment and model capability itself is fully undifferentiated.
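The benchmark-threshold policy suggested above is easy to encode so that every new agent gets judged the same way. This is a minimal sketch: the function name is hypothetical, the per-domain scores are invented for illustration and are not from any paper, and it uses an unweighted mean across domains.

```python
from statistics import mean

PILOT_THRESHOLD = 0.70  # the internal policy value suggested in the text


def warrants_pilot(recall_by_domain: dict[str, float],
                   threshold: float = PILOT_THRESHOLD) -> bool:
    """True when the unweighted mean curated recall clears the threshold."""
    return mean(recall_by_domain.values()) >= threshold


# Invented scores for a hypothetical candidate agent.
candidate = {"web": 0.78, "finance": 0.71, "patents": 0.66, "multi_hop": 0.69}
print(warrants_pilot(candidate))
```

You may also want a per-domain floor, so that one strong domain cannot mask a weak score in the domain your team depends on most.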

What to Watch Next

The Harness-1 GitHub repository (pat-jj/harness-1) will be the fastest-moving source of updates over the next 90 days. Watch for domain-specific fine-tuning recipes, community-contributed checkpoints optimized for specific retrieval domains — financial data, healthcare, e-commerce — and integration tooling for common marketing data sources. The CC BY 4.0 license actively invites commercial derivative work, which means the community velocity on this repository will likely accelerate quickly once initial adoption hits.

Third-party hosted Harness-1 services will emerge within Q3 2026. The combination of a permissive license, strong benchmark performance, and practical deployment requirements creates clear commercial opportunity for managed inference providers. Watch for Harness-1 API offerings that give teams frontier-competitive retrieval performance without the GPU infrastructure management overhead. For marketing teams that cannot or will not manage local GPU infrastructure, these hosted options will be the practical on-ramp.

Frontier provider responses will be instructive signals about competitive dynamics. When open-source systems demonstrate competitive performance on specific benchmark categories, proprietary providers face pricing pressure on those workloads. Watch for OpenAI, Anthropic, and Google to either reduce API pricing on search-oriented retrieval endpoints or introduce retrieval-specialized model tiers designed to compete with open-source alternatives on cost-per-query economics over H2 2026.

Chroma’s product roadmap is worth monitoring closely as the infrastructure complement to Harness-1. Recent additions — Cloud Sync for automated data ingestion, collection forking with copy-on-write semantics, sparse vector search — suggest a product trajectory toward enterprise marketing intelligence use cases. Watch for deeper integrations with CRM, CDP, and marketing data platforms that would expand the addressable use cases for a Chroma + Harness-1 deployment beyond raw web data retrieval.

Follow-on research from the UIUC/Berkeley team will extend and refine the harness architecture. The paper establishes a strong foundation and notes strong generalization on held-out domains. Expect subsequent work on domain-specific training recipes, larger-scale harness deployments built on gpt-oss-120B, and architectural extensions supporting real-time streaming data ingestion — a critical capability for marketing intelligence applications that require continuous monitoring rather than periodic batch retrieval.

Multi-agent marketing stacks incorporating specialized retrieval agents like Harness-1 represent the near-term architecture to watch at the system level. Rather than routing all marketing AI tasks through a single frontier model, the pattern emerging is specialized agents per function: retrieval (Harness-1), reasoning (larger frontier model), generation (fine-tuned creative model), evaluation (domain-specific judge model). The Harness-1 paper is early evidence that specialized, modular architectures produce measurably better outcomes than one-model-fits-all approaches on specific task categories.
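At the system level, the specialized-agent pattern boils down to a dispatch table. The sketch below uses stub handlers with hypothetical names; in a real stack each one would call a different model or service.

```python
from typing import Callable


def retrieval_agent(task: str) -> str:
    return f"[retrieval] {task}"    # stub for a Harness-1 style agent


def reasoning_model(task: str) -> str:
    return f"[reasoning] {task}"    # stub for a larger frontier model


def creative_model(task: str) -> str:
    return f"[generation] {task}"   # stub for a fine-tuned creative model


def judge_model(task: str) -> str:
    return f"[evaluation] {task}"   # stub for a domain-specific judge


ROUTES: dict[str, Callable[[str], str]] = {
    "retrieval": retrieval_agent,
    "reasoning": reasoning_model,
    "generation": creative_model,
    "evaluation": judge_model,
}


def dispatch(kind: str, task: str) -> str:
    """Send a task to the agent registered for its kind."""
    try:
        handler = ROUTES[kind]
    except KeyError:
        raise ValueError(f"unknown task kind: {kind}") from None
    return handler(task)


print(dispatch("retrieval", "find pricing changes"))
```

The useful property is that swapping one specialist for another is a one-line change to `ROUTES`, with no effect on the rest of the stack.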

Bottom Line

Harness-1 is a 20-billion parameter open-source search agent from UIUC, UC Berkeley, and Chroma that achieves 0.730 average curated recall across eight retrieval benchmarks — 11.4 points above the prior open-source best and competitive with proprietary frontier models including GPT-5.4, per the arXiv paper and VentureBeat’s reporting. Its state-externalizing harness architecture — separating environmental memory management from semantic reasoning — is a genuine architectural innovation that produces measurably better multi-step retrieval performance than prior approaches. For marketing teams, the implications are immediate and practical: competitive intelligence automation, content gap analysis at scale, ABM prospect research, brand monitoring, and compliance research are all deployable use cases today using public code running on local infrastructure. The code is live, the benchmarks are verifiable, and the supporting infrastructure is mature. Teams that run proofs of concept now and build proprietary indexed knowledge bases in parallel will have a meaningful competitive advantage when capable open-source retrieval agents become standard equipment across the industry.