Building the first open benchmark for federal contracting AI: how we validated AI-assisted FAR clause extraction
Federal Contracting / Government Procurement

VETR Proposal (Internal R&D)

  • 0%: Claude Sonnet 4.6 hallucination rate on real FAR text. Zero invented clause numbers across 182 predictions.
  • 32.1%: Claude Haiku 4.5 hallucination rate on real FAR text. Invented a clause number almost one time in three.
  • 13.8% vs. 32.1%: FedProc-180M hallucination rate vs. Haiku. Matched Haiku accuracy at less than half the hallucination rate.
  • ~50x faster: Latency improvement over Claude Sonnet. Per-call latency, running locally on consumer GPU.
  • ~1000x lower: Cost difference per call. Roughly three orders of magnitude lower per-call cost.
  • 4.5 minutes: Training time on consumer GPU. Single RTX 3060, joint training across four tasks.

The Challenge

Federal solicitations are dense with FAR and DFARS clause citations. A typical Combined Synopsis / Solicitation will reference anywhere from two to twenty clauses by number, often in abbreviated forms like 52.219-9 or DFARS 252.225-7042. For SDVOSB, WOSB, 8(a), and HUBZone certified contractors responding to those solicitations, every cited clause is a potential compliance requirement that must be honored in the proposal response.

Large language models are increasingly used to summarize solicitations, draft compliance matrices, and pre-populate proposal sections. The problem is that these models have a documented tendency to invent FAR clause numbers that look syntactically correct but do not exist. A clause like FAR 52.999-99 can surface in a generated compliance matrix despite having no real-world referent. For a contractor staffing a federal proposal under deadline pressure, catching that error downstream is costly. Missing it entirely is far worse.

Before VETR Proposal could responsibly extend its AI-assisted proposal workflow to features that depend on clause citations, we needed three things: a reliable measure of how often frontier models hallucinate FAR clause numbers on real federal text, clarity on whether that failure rate is uniform across vendors or model-specific, and evidence on whether a smaller specialized model could match or beat the frontier on this failure mode at a price point compatible with on-premise deployment.

When we went looking for an existing benchmark to answer those questions, we found that one did not exist. Commercial tools in the space publish no reproducible benchmarks. Academic work on RFP processing is narrow and one-off. GSA's own tooling covers only Section 508 compliance. So we built the benchmark ourselves and made it public.
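To make the citation forms above concrete, here is a minimal sketch of pulling candidate clause numbers out of solicitation text. It is illustrative only and is not the extraction method used by FedProc-Bench or VETR Proposal: the regular expression, the function name `extract_clause_numbers`, and the sample sentence are all assumptions made for this example.

```python
import re

# Assumed pattern: FAR clauses look like 52.219-9, DFARS clauses like 252.225-7042.
# Real solicitations also use other forms (agency supplements, alternates, deviations)
# that this sketch does not try to cover.
CLAUSE_PATTERN = re.compile(r"(?<![\d.])(252|52)\.(\d{3})-(\d{1,4})(?!\d)")


def extract_clause_numbers(text: str) -> list[str]:
    """Return candidate clause numbers in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}
    for match in CLAUSE_PATTERN.finditer(text):
        part, subpart, suffix = match.groups()
        seen.setdefault(f"{part}.{subpart}-{suffix}", None)
    return list(seen)


# Hypothetical sample text; the last citation is the invented clause from the example above.
sample = (
    "Offerors shall comply with FAR 52.219-9 and DFARS 252.225-7042. "
    "See 52.219-9 and FAR 52.999-99."
)
print(extract_clause_numbers(sample))
```

The pattern accepts 52.999-99 just as readily as a real clause. A syntactic check cannot tell an invented clause number from a real one, so any validation has to compare against the regulation text itself.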

The Solution

We built FedProc-Bench, an open multi-task benchmark covering four federal procurement NLP tasks drawn from real source data: notice type classification across the eight SAM.gov notice-type categories, NAICS sector prediction across the twenty top-level sectors, set-aside identification across SBA, SDVOSB, WOSB, EDWOSB, 8(a), HUBZone, and SDB designations, and FAR/DFARS clause extraction as the headline task.

The data comes entirely from public-domain sources: SAM.gov's Opportunities API for real solicitations, the Electronic Code of Federal Regulations API for the full text of Title 48 covering FAR and DFARS, and a small amount of synthetic augmentation to ensure rare set-aside types are represented. Every record carries a source field and a label origin field so provenance is fully auditable.

We then trained a single 150-million-parameter ModernBERT-base model with one shared encoder and four task-specific heads, jointly across all four benchmark tasks. Training took four and a half minutes on a single consumer RTX 3060 GPU.

We evaluated four systems on the held-out test split using identical prompts and canonical scoring: Claude Sonnet 4.6, GPT-4o, Claude Haiku 4.5, and our own FedProc-180M v0. The headline metric for the FAR clause task is hallucination rate: the share of predicted clause numbers that do not appear anywhere in the real FAR plus DFARS corpus.

The benchmark dataset, trained model, and full methodology are all released under Apache 2.0 and Creative Commons-compatible terms so any contractor, prime, academic group, or government agency can use FedProc-Bench as an independent yardstick.
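The hallucination-rate metric defined above reduces to a set-membership check. The following is a minimal sketch, not the benchmark's actual scoring code: the function names, the choice of a micro-averaged F1 over per-record sets, and the three-clause toy corpus are assumptions for illustration.

```python
def hallucination_rate(predictions: list[list[str]], known_clauses: set[str]) -> float:
    """Share of all predicted clause numbers that are absent from the reference corpus."""
    flat = [clause for record in predictions for clause in record]
    if not flat:
        return 0.0
    invented = sum(1 for clause in flat if clause not in known_clauses)
    return invented / len(flat)


def micro_f1(predictions: list[list[str]], gold: list[list[str]]) -> float:
    """Micro-averaged F1 over per-record sets of clause numbers (an assumed scoring choice)."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, gold):
        pred_set, ref_set = set(pred), set(ref)
        tp += len(pred_set & ref_set)
        fp += len(pred_set - ref_set)
        fn += len(ref_set - pred_set)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: the "corpus" here is three clause numbers, not the real FAR plus DFARS index.
known = {"52.219-9", "52.212-4", "252.225-7042"}
gold = [["52.219-9", "52.212-4"], ["252.225-7042"]]
preds = [["52.219-9", "52.999-99"], ["252.225-7042"]]

print(hallucination_rate(preds, known))
print(micro_f1(preds, gold))
```

The two metrics measure different things. A prediction can be wrong without being a hallucination: a real clause number that is not cited in the record lowers precision but does not count as invented.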

On the cleanest slice of the test set — real Federal Acquisition Regulation text, no synthetic data — the results showed a wide spread across systems.

  • Claude Sonnet 4.6 achieved an F1 of 0.984 with a hallucination rate of 0.0% across 182 predictions.
  • GPT-4o achieved an F1 of 0.937 with a hallucination rate of 11.0% across 209 predictions.
  • Claude Haiku 4.5 achieved an F1 of 0.804 with a hallucination rate of 32.1% across 274 predictions.
  • FedProc-180M v0, our trained model, achieved an F1 of 0.800 with a hallucination rate of 13.8% across 159 predictions.
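As a worked check on the figures above, the reported rates can be converted back into approximate counts of invented clause numbers. These counts are not published in this case study; they are inferred by rounding rate times predictions to the nearest whole number, which is an assumption.

```python
# (system, reported hallucination rate in percent, number of predictions), from the list above.
reported = [
    ("Claude Sonnet 4.6", 0.0, 182),
    ("GPT-4o", 11.0, 209),
    ("Claude Haiku 4.5", 32.1, 274),
    ("FedProc-180M v0", 13.8, 159),
]


def implied_invented(rate_pct: float, n_predictions: int) -> int:
    """Nearest whole number of invented clause numbers consistent with a reported rate."""
    return round(rate_pct / 100 * n_predictions)


for name, rate_pct, n in reported:
    count = implied_invented(rate_pct, n)
    print(f"{name}: about {count} of {n} predictions ({100 * count / n:.1f}%)")
```

Under that rounding assumption the figures work out to roughly 0, 23, 88, and 22 invented clause numbers respectively, and each implied count reproduces the reported percentage to one decimal place.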

Three observations that matter most for federal contractors

Frontier model reliability is not uniform

Claude Sonnet 4.6 produced zero invented clause numbers. GPT-4o invented a clause number more than one time in ten. Claude Haiku 4.5 invented a clause number almost one time in three. The choice of model materially determines whether AI-generated compliance content is trustworthy without human review.

A specialized small model is competitive on the failure mode that matters

Our 150-million-parameter model, running locally on a consumer GPU, came within 0.004 F1 of Claude Haiku 4.5 on extraction accuracy (0.800 vs. 0.804) at less than half the hallucination rate (13.8% vs. 32.1%). For workflows where on-premise execution, predictable latency, and low per-call cost are priorities, this represents a real alternative.

The cost gap is dramatic

FedProc-180M runs roughly fifty times faster per call than Claude Sonnet at approximately three orders of magnitude lower per-call cost. For high-volume FAR compliance scanning across an active opportunity pipeline, that difference is operationally significant.

What v0 is and isn't

We are publishing v0, not the final word. The v0 model was trained on 1,129 records due to SAM.gov API quota limits. Version 0.1, with full solicitation description text and approximately ten times the training data, is expected to substantially close the gap with frontier models on the other three tasks. The v0 test set also contains 65 Claude-generated synthetic records that bias frontier Claude-model scores upward on the clause extraction task; the leaderboard's per-source breakdown surfaces this transparently.
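A per-source breakdown of the kind mentioned above can be sketched as a grouped version of the hallucination-rate calculation. The record layout, the field names `source` and `predicted`, and the source labels `ecfr` and `synthetic` are hypothetical; the published dataset's actual schema may differ.

```python
from collections import defaultdict


def hallucination_rate_by_source(
    records: list[dict], known_clauses: set[str]
) -> dict[str, float]:
    """Group predictions by each record's source label and compute the invented share per group."""
    totals: dict[str, int] = defaultdict(int)
    invented: dict[str, int] = defaultdict(int)
    for record in records:
        for clause in record["predicted"]:
            totals[record["source"]] += 1
            if clause not in known_clauses:
                invented[record["source"]] += 1
    # Sources with no predictions never enter `totals`, so there is no division by zero.
    return {source: invented[source] / totals[source] for source in totals}


# Toy records with hypothetical field names and source labels.
known = {"52.219-9", "52.212-4"}
records = [
    {"source": "ecfr", "predicted": ["52.219-9", "52.212-4"]},
    {"source": "ecfr", "predicted": ["52.999-99", "52.219-9"]},
    {"source": "synthetic", "predicted": ["52.212-4"]},
]
print(hallucination_rate_by_source(records, known))
```

Reporting the rate per source keeps the real-text slice separate from the synthetic records, so a reader can discount the slice where Claude-generated data may favor Claude models.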

For federal contractors using VETR Proposal: every AI-assisted feature that touches FAR clause citation now ships with a measured hallucination-rate floor that can be cited to your compliance team, specifying exactly which model, which version, and which expected reliability range applies to which feature.

For federal contractors evaluating any AI tool: ask the vendor for their FAR-clause hallucination rate on real federal text and ask which benchmark they used to measure it. If they do not have a number, that is itself a data point.

Open and reproducible

The benchmark dataset is available at huggingface.co/datasets/raihan-js/fedproc-bench and the model at huggingface.co/raihan-js/fedproc-180m-v0.
