LangExtract Docs – Concepts, API, and workflows

Core concepts

LangExtract is a Python library for extracting structured, grounded information from unstructured text using large language models (LLMs). It wraps LLM calls in higher-level building blocks that focus on:

Prompted extraction: describe what to extract in natural language.
Optional schemas: specify the target structure for more control.
Grounding metadata: track where each field came from.
Provider abstraction: swap between Gemini, OpenAI, and local models.
Visualization: inspect results and their provenance interactively.

LangExtract is built with quality and robustness in mind: it ships with evaluation scripts, test suites, and examples that exercise it on real-world tasks.

Basic extraction API

The primary entry point is the extract function. You provide raw text or documents, a high-level description of what you want, and optional examples or schemas:

import langextract as lx

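# Read the input document; the context manager closes the file afterwards.
with open("discharge_summary.txt", encoding="utf-8") as f:
    report = f.read()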

result = lx.extract(
    text_or_documents=report,
    prompt_description="Extract diagnoses, medications, and follow-up recommendations.",
    examples=[
        # Few-shot examples go here (lx.data.ExampleData objects);
        # some releases may reject an empty list, so include at least one.
    ],
    model_id="gemini-2.5-flash",
)

The object returned from extract contains both the structured data and detailed metadata about model calls and grounding. See Examples for end‑to‑end scripts, and the Getting started page for installation and environment configuration.

Source grounding

A key feature of LangExtract is source grounding: the ability to link every extracted field back to spans in the original input. This makes your extraction pipelines more:

Auditable: reviewers can see exactly where each piece of data came from.
Trustworthy: applications can highlight or justify model outputs to end users.
Composable: later passes can reference grounded spans for follow‑up processing.

The Grounding subpage explains the span data structures, offset formats, and how to overlay highlights in your own UI using the visualization components described on the Visualization page.
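To make the idea of a grounded span concrete, here is a minimal, library-independent sketch. The GroundedField class, its half-open character offsets, and the locate and verify helpers are assumptions made for illustration only; they are not LangExtract's actual data structures, which the Grounding subpage documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GroundedField:
    """Hypothetical grounded field: a value plus its source character span."""
    name: str
    value: str
    start: int  # inclusive character offset into the source text
    end: int    # exclusive character offset into the source text


def locate(source: str, name: str, value: str) -> Optional[GroundedField]:
    """Ground a value by finding its first exact occurrence in the source."""
    start = source.find(value)
    if start == -1:
        return None  # the value cannot be grounded by exact match
    return GroundedField(name, value, start, start + len(value))


def verify(source: str, field: GroundedField) -> bool:
    """Return True if the span lies inside the source and matches the value."""
    if not (0 <= field.start <= field.end <= len(source)):
        return False
    return source[field.start:field.end] == field.value


source = "Patient was prescribed metformin 500 mg twice daily."
med = locate(source, "medication", "metformin")
```

A reviewer or a later pipeline pass could call verify(source, med) to confirm that the stored offsets still point at the claimed text, which is the auditability property described above.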

Schemas & validation

LangExtract lets you express your target structure in several ways, including natural language instructions, JSON-like examples, or explicit schema classes.

Loose extraction – just specify a prompt_description and let the LLM propose a structure.
Guided extraction – provide examples of desired JSON output to steer the model.
Schema-constrained extraction – define a schema class and let LangExtract enforce types and shapes (with provider‑specific support).

Visit the Schemas & validation page to learn how to define fields, nested objects, lists, and enums, and how different providers handle schema enforcement.
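As a rough illustration of what schema-constrained extraction means in practice, the sketch below validates a raw, JSON-like model output against a small schema using only the standard library. The Medication and Route classes and the coerce_medication helper are hypothetical names chosen for this example; LangExtract's own schema classes and enforcement mechanism are described on the Schemas & validation page and may look quite different.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Route(Enum):
    ORAL = "oral"
    INTRAVENOUS = "intravenous"


@dataclass
class Medication:
    """Hypothetical target schema for one extracted medication."""
    name: str
    dose_mg: float
    route: Route


def coerce_medication(raw: dict) -> Medication:
    """Validate a raw model output dict and coerce it into a Medication."""
    expected = {f.name for f in fields(Medication)}
    missing = expected - set(raw)
    extra = set(raw) - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    if not isinstance(raw["name"], str):
        raise ValueError("name must be a string")
    return Medication(
        name=raw["name"],
        dose_mg=float(raw["dose_mg"]),  # raises ValueError on non-numeric text
        route=Route(raw["route"]),      # raises ValueError on an unknown enum value
    )


# Loose model output with a string dose is coerced into typed fields.
med = coerce_medication({"name": "metformin", "dose_mg": "500", "route": "oral"})
```

Rejecting unknown keys and enum values at this boundary keeps malformed model output from flowing silently into downstream code.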

Model providers

LangExtract includes a provider abstraction layer, so your extraction code can stay the same while you switch between:

Gemini models (such as Gemini 2.5 Flash) via a Gemini API key, typically supplied through the LANGEXTRACT_API_KEY environment variable.
Vertex AI for enterprise‑grade deployment using service accounts.
OpenAI models like gpt-4o with optional fenced output modes.
Local models via Ollama, running Gemma and other compatible models on your own hardware.

See the dedicated Providers section for configuration examples, environment variables, and instructions on writing your own custom provider plugin.
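The general shape of a provider abstraction can be sketched as follows. Everything here (the Provider protocol, the EchoProvider stand-in, and the prefix-based registry) is a hypothetical illustration of the pattern, not LangExtract's plugin API; the Providers section is the authority on how real providers are configured and registered.

```python
from typing import Callable, Dict, Protocol


class Provider(Protocol):
    """Hypothetical provider interface: turn a prompt into raw model text."""
    def infer(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in provider used here instead of a real model backend."""
    def __init__(self, model_id: str) -> None:
        self.model_id = model_id

    def infer(self, prompt: str) -> str:
        return f"[{self.model_id}] {prompt}"


# Registry mapping model_id prefixes to provider factories (illustrative only).
_REGISTRY: Dict[str, Callable[[str], Provider]] = {}


def register(prefix: str, factory: Callable[[str], Provider]) -> None:
    _REGISTRY[prefix] = factory


def resolve(model_id: str) -> Provider:
    """Pick the provider whose registered prefix is the longest match."""
    matches = [p for p in _REGISTRY if model_id.startswith(p)]
    if not matches:
        raise KeyError(f"no provider registered for {model_id!r}")
    return _REGISTRY[max(matches, key=len)](model_id)


register("gemini", EchoProvider)
register("gpt", EchoProvider)


def run_extraction(model_id: str, prompt: str) -> str:
    """Calling code stays the same regardless of which provider is resolved."""
    return resolve(model_id).infer(prompt)
```

With this layout, switching backends is a matter of changing the model_id string, while run_extraction and its callers remain untouched.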

Interactive visualization

LangExtract ships with visualization components that let you explore extraction outputs, inspect grounding spans, and compare different runs. These tools are helpful when:

Tuning prompts and schemas for new domains.
Debugging unexpected fields or missing data.
Presenting results to product and domain experts.

Learn more in the Visualization guide, and see how visual tooling is used in the Radiology report example and Medication extraction example.
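If you want to overlay grounding highlights in your own UI rather than use the bundled components, the core operation is wrapping each span in markup while escaping everything else. The sketch below is a standalone approximation using only the standard library; the render_highlights function and its (start, end, label) tuple format are assumptions for illustration and are not part of LangExtract's visualization tooling.

```python
import html
from typing import Iterable, Tuple


def render_highlights(source: str, spans: Iterable[Tuple[int, int, str]]) -> str:
    """Wrap each (start, end, label) span in a <mark> tag, escaping all text.

    Spans use half-open character offsets and are assumed not to overlap.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor or end > len(source) or start > end:
            raise ValueError(f"invalid or overlapping span: {(start, end, label)}")
        # Emit the unhighlighted text before this span, then the marked span.
        parts.append(html.escape(source[cursor:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">'
            f"{html.escape(source[start:end])}</mark>"
        )
        cursor = end
    parts.append(html.escape(source[cursor:]))
    return "".join(parts)


page = render_highlights("Take aspirin & rest.", [(5, 12, "medication")])
```

Escaping both the source text and the labels matters because clinical notes and model outputs can contain characters such as & or < that would otherwise break the rendered page.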
