LangExtract benchmarks

Understand how LangExtract behaves on real‑world extraction tasks, how grounding is evaluated, and how to reproduce results locally.

What we measure

LangExtract focuses on two core aspects of information extraction: structured correctness (did we extract the right fields?) and grounding accuracy (do those fields map to the right parts of the source text?).

  • Field‑level precision and recall for entities and attributes.
  • Span‑level alignment metrics for grounding (see the sketch after this list).
  • Latency and cost metrics across providers and models.
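
The first two bullets can be pictured as two checks per extraction: does the (class, text) pair match a gold annotation, and does its character span land on the right part of the source? The sketch below illustrates that idea in Python; the class and function names are assumptions made for exposition, not the benchmark scripts' actual code.

# Illustrative sketch only: the dataclass fields and matching rules below are
# assumptions for exposition, not the benchmark scripts' exact implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    extraction_class: str   # e.g. "medication"
    extraction_text: str    # surface text as extracted
    start: int              # character offset into the source document
    end: int

def field_precision_recall(predicted, gold):
    """Field-level scores: a prediction counts if its (class, text) pair appears in the gold set."""
    pred_keys = {(p.extraction_class, p.extraction_text.lower()) for p in predicted}
    gold_keys = {(g.extraction_class, g.extraction_text.lower()) for g in gold}
    true_positives = len(pred_keys & gold_keys)
    precision = true_positives / len(pred_keys) if pred_keys else 0.0
    recall = true_positives / len(gold_keys) if gold_keys else 0.0
    return precision, recall

def span_overlap_fraction(predicted, gold):
    """Span-level grounding: fraction of the gold span covered by the predicted span."""
    overlap = max(0, min(predicted.end, gold.end) - max(predicted.start, gold.start))
    gold_length = gold.end - gold.start
    return overlap / gold_length if gold_length > 0 else 0.0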

Benchmarks are implemented as reproducible scripts in the official repository.

Domains covered

Current benchmark suites cover several representative domains:

  • Medication extraction from synthetic clinical notes.
  • Radiology report structuring (findings and impressions).
  • Long‑form literature extraction with multi‑pass pipelines.
  • General entity and relationship extraction from mixed text.

For concrete examples of how these benchmarks map to real code, see the Medication, Radiology, and Romeo and Juliet example pages.
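
To give a rough sense of what a benchmarked task looks like in code, the sketch below runs a small medication-style extraction with LangExtract. The prompt, few-shot example, input text, and model_id are illustrative choices, so treat the details as assumptions and refer to the example pages and the repository README for the exact calls each suite uses.

import textwrap
import langextract as lx

# Illustrative medication-style task; the prompt, example, input, and model_id
# are placeholder choices, not a benchmark suite's exact configuration.
prompt = textwrap.dedent("""\
    Extract medication names with their dosage and frequency.
    Use the exact text from the note; do not paraphrase.""")

examples = [
    lx.data.ExampleData(
        text="Patient was given 250 mg amoxicillin twice daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="amoxicillin",
                attributes={"dosage": "250 mg", "frequency": "twice daily"},
            ),
        ],
    ),
]

result = lx.extract(
    text_or_documents="Started metformin 500 mg once daily for type 2 diabetes.",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

# Each returned extraction carries its class, text, and attributes
# (plus source offsets used for grounding).
for extraction in result.extractions:
    print(extraction.extraction_class, extraction.extraction_text, extraction.attributes)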

Running benchmarks locally

To run benchmarks yourself, clone the repository, install test dependencies, and run the test or benchmark suite for your preferred provider:

git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[test]"

pytest tests  # or use tox for the full matrix

Some suites require provider‑specific configuration (for example, OpenAI or Ollama). See provider notes in the repository for details.
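
As a rough illustration of the kind of setup involved, a pre-flight check might look like the sketch below. The environment variable names are assumptions; the provider notes in the repository are the authoritative reference.

import os

# Hypothetical pre-flight check; confirm the exact environment variables and
# model identifiers each suite expects in the repository's provider notes.
required_keys = {
    "gemini": "LANGEXTRACT_API_KEY",  # cloud-hosted Gemini-backed suites
    "openai": "OPENAI_API_KEY",       # OpenAI-backed suites
}
missing = [var for var in required_keys.values() if not os.environ.get(var)]
if missing:
    raise SystemExit(f"Set these variables before running provider suites: {', '.join(missing)}")
# Ollama-backed suites generally need a local Ollama server and a pulled model
# rather than an API key.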