Schemas & Validation

LangExtract uses Pydantic models to define structured schemas for your extracted data. This provides type safety, automatic validation, and clear error messages when extractions don't match your expectations.

Why Use Schemas?

Defining schemas for your extracted data provides several benefits:

  • Type safety: Ensure extracted values match expected types (strings, numbers, dates, etc.)
  • Validation: Automatically validate that required fields are present and values meet constraints
  • IDE support: Get autocomplete and type hints in your editor
  • Documentation: Schemas serve as clear documentation of your data structure
  • Error handling: Get clear, actionable error messages when validation fails

Defining Schemas with Pydantic

LangExtract uses Pydantic models to define schemas. Here's a simple example:

import langextract as lx
from pydantic import BaseModel, Field
from typing import Optional

class Medication(BaseModel):
    name: str = Field(description="Name of the medication")
    dosage: str = Field(description="Dosage amount and unit")
    frequency: Optional[str] = Field(None, description="How often to take")
    route: Optional[str] = Field(None, description="Route of administration")

# clinical_text holds the source document to extract from
result = lx.extract(
    text_or_documents=clinical_text,
    prompt_description="Extract medication information",
    examples=[...],
    schema=Medication
)

Field Types & Constraints

Pydantic supports a wide range of field types and constraints:

  • Basic types: str, int, float, bool, datetime, date
  • Optional fields: Use Optional[T] or T | None for fields that may be missing
  • Lists: List[T] for extracting multiple values
  • Nested models: Define complex nested structures
  • Field constraints: Use Field() for descriptions, defaults, and validation rules
  • Custom validators: Add custom validation logic with Pydantic validators
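The constraints and validators above can be sketched as follows. The LabResult model here is purely illustrative (it is not part of LangExtract) and uses Pydantic v2 APIs:

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class LabResult(BaseModel):
    # Hypothetical model for illustration only.
    test_name: str = Field(description="Name of the lab test")
    value: float = Field(ge=0, description="Measured value; must be non-negative")
    unit: str = Field(pattern=r"^[A-Za-z/%]+$", description="Unit of measure")

    @field_validator("test_name")
    @classmethod
    def normalize_name(cls, v: str) -> str:
        # Custom validation/normalization logic runs on every parse.
        return v.strip().lower()

ok = LabResult(test_name="  Hemoglobin ", value=13.5, unit="g/dL")
print(ok.test_name)  # "hemoglobin"

try:
    LabResult(test_name="WBC", value=-2, unit="10^9/L")
except ValidationError as e:
    # Constraint violations are reported per field.
    print(e.error_count())
```

Constraint violations raise a single ValidationError that aggregates every failing field, which is why error handling (covered below) usually iterates over the error list rather than catching per-field exceptions.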

Nested Schemas

You can define complex nested structures using multiple Pydantic models:

from typing import List
from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address
    phone_numbers: List[str]

result = lx.extract(
    text_or_documents=text,
    prompt_description="Extract person information",
    examples=[...],
    schema=Person
)
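Pydantic validates nested structures recursively: a dict of dicts parses into nested model instances, with type coercion applied at every level. A quick, self-contained illustration using the same models (the sample data is hypothetical):

```python
from typing import List
from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address
    phone_numbers: List[str]

raw = {
    "name": "Ada Lovelace",
    "age": "36",  # string coerced to int
    "address": {"street": "12 St James's Sq", "city": "London",
                "state": "N/A", "zip_code": "SW1Y 4JH"},
    "phone_numbers": ["+44 20 0000 0000"],
}
person = Person.model_validate(raw)
print(type(person.address).__name__)  # Address
print(person.age)  # 36 (an int, not a string)
```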

Validation & Error Handling

When LangExtract extracts data, it automatically validates against your schema:

  • Type checking: Values are coerced to the correct types when possible
  • Required fields: Missing required fields trigger validation errors
  • Constraint validation: Field constraints (min/max, regex patterns, etc.) are enforced
  • Clear errors: Validation errors include field names and specific issues

Handle validation errors gracefully in your code to provide feedback and improve extraction quality.
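One way to do this, sketched with plain Pydantic (the missing-field data here is contrived for the example), is to catch ValidationError and inspect its structured error list:

```python
from pydantic import BaseModel, ValidationError

class Medication(BaseModel):
    name: str
    dosage: str

try:
    Medication.model_validate({"name": "metformin"})  # dosage is missing
except ValidationError as exc:
    for err in exc.errors():
        # Each error names the offending field and the failure type.
        print(err["loc"], err["type"])  # ('dosage',) missing
```

Logging these per-field errors alongside the source text is an easy way to spot which parts of your prompt or schema need refinement.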

Schema Constraints in Prompts

LangExtract can use schema information to guide the LLM's extraction:

  • Field descriptions: Field descriptions from Field() are included in prompts
  • Type hints: Type information helps the model understand expected formats
  • Structure guidance: Nested models guide the model to extract hierarchical data
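To make the first point concrete, the helper below renders each field's description into prompt-style text via Pydantic's JSON schema. It is a hypothetical sketch of the idea only; LangExtract's actual prompt construction may differ:

```python
from typing import Optional
from pydantic import BaseModel, Field

class Medication(BaseModel):
    name: str = Field(description="Name of the medication")
    dosage: str = Field(description="Dosage amount and unit")
    frequency: Optional[str] = Field(None, description="How often to take")

def schema_hints(model: type[BaseModel]) -> str:
    # Hypothetical helper: list each field with its description,
    # as a model's Field() descriptions might appear in a prompt.
    schema = model.model_json_schema()
    lines = []
    for fname, spec in schema["properties"].items():
        lines.append(f"- {fname}: {spec.get('description', '')}")
    return "\n".join(lines)

print(schema_hints(Medication))
# - name: Name of the medication
# - dosage: Dosage amount and unit
# - frequency: How often to take
```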

Note: Schema constraints work best with Gemini models. For OpenAI models, you may need to use fence_output=True and use_schema_constraints=False.

Best Practices

  • Start simple: Begin with basic schemas and add complexity as needed
  • Use descriptions: Always provide Field descriptions to guide extraction
  • Make fields optional: Use Optional for fields that may not always be present
  • Validate early: Check validation errors to improve prompts and schemas
  • Iterate: Refine schemas based on real extraction results