Extract structured, grounded data with LangExtract

LangExtract is a Python library from Google that turns unstructured text into reliable structured data using Gemini and other LLMs – with precise source grounding, schema-aware extraction, and rich visualization.

pip install langextract · Apache-2.0 · Built for Gemini 2.5, Gemma, OpenAI, and local Ollama

One call to extract everything you need

import langextract as lx

# `report_text` is your input document; `examples` is a list of
# few-shot lx.data.ExampleData demonstrations.
result = lx.extract(
    text_or_documents=report_text,
    prompt_description="Extract diagnoses, meds, and dates",
    examples=examples,
    model_id="gemini-2.5-flash",
)
  • Schema-aware extraction with optional Pydantic-style schemas.
  • Grounded spans back to original text for each field.
  • Flexible providers including Gemini, OpenAI, and Ollama.
Explore full examples →

Why LangExtract?

Designed for production-grade information extraction, LangExtract combines powerful LLMs with deterministic post-processing and transparent provenance.

Precise source grounding

Every extracted field can be traced back to the exact character spans in the input text, enabling audits, highlighting, and robust UI overlays.

Learn about grounding →

Schema-first extraction

Define your target structure in natural language, as a JSON schema, or with Pydantic-like models, and let LangExtract handle validation and coercion.

Schema & validation docs →

Multi-provider, portable

Use Gemini, Vertex AI, OpenAI, or local models via Ollama. A unified API and plugin system let you swap providers without rewriting your extraction logic.

Provider integrations →

Built for real-world domains

LangExtract is already being used in domains like healthcare, finance, legal, and customer support to turn long-form documents into structured records.

  • Clinical notes → diagnoses, medications, and timelines
  • Radiology reports → structured findings and impressions
  • Support tickets → intent, entities, and routing suggestions
  • Legal documents → parties, clauses, and key obligations

Visit the Examples page for detailed walkthroughs, including a full Romeo and Juliet extraction, medication extraction, and radiology report structuring.

Performance and scale

Process long documents efficiently with batching, streaming, and composable extraction passes. Combine summary passes with targeted follow-ups to keep quality high and costs low.

The Benchmarks page summarizes evaluation results and links to the evaluation scripts on GitHub.

Get started in minutes

Install from PyPI, configure your model provider, and start extracting.

1. Install the library

pip install langextract

Or install from source for development and testing. See the Getting started guide for virtual environment and Docker instructions.

2. Configure a provider

Use Gemini (via an API key or Vertex AI), OpenAI, or a local Ollama model. The Providers page walks through configuration for each backend.
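A minimal shell setup, assuming the conventional environment variable names (`LANGEXTRACT_API_KEY` for Gemini, `OPENAI_API_KEY` for OpenAI):

```shell
# Gemini via API key
export LANGEXTRACT_API_KEY="your-api-key"

# Or, for OpenAI models
export OPENAI_API_KEY="your-api-key"

# Local Ollama needs no key, just a running server and a pulled model:
# ollama pull gemma2:2b
```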

3. Design your extraction

Draft a prompt_description, add a few examples, and optionally define a schema. Explore extraction design patterns and the interactive visualization tools.

Explore the docs, examples, and provider ecosystem

Dive into the documentation, adapt real-world examples, or learn how to extend LangExtract via custom providers.