Schemas & Validation

LangExtract uses Pydantic models to define structured schemas for your extracted data. This provides type safety, automatic validation, and clear error messages when extractions don't match your expectations.

Why Use Schemas?

Defining schemas for your extracted data provides several benefits:

  • Type safety: Ensure extracted values match expected types (strings, numbers, dates, etc.)
  • Validation: Automatically validate that required fields are present and values meet constraints
  • IDE support: Get autocomplete and type hints in your editor
  • Documentation: Schemas serve as clear documentation of your data structure
  • Error handling: Get clear, actionable error messages when validation fails

Defining Schemas with Pydantic

LangExtract uses Pydantic models to define schemas. Here's a simple example:

import langextract as lx
from pydantic import BaseModel, Field
from typing import Optional

class Medication(BaseModel):
    name: str = Field(description="Name of the medication")
    dosage: str = Field(description="Dosage amount and unit")
    frequency: Optional[str] = Field(None, description="How often to take")
    route: Optional[str] = Field(None, description="Route of administration")

# clinical_text holds the source document to extract from
result = lx.extract(
    text_or_documents=clinical_text,
    prompt_description="Extract medication information",
    examples=[...],
    schema=Medication
)

Field Types & Constraints

Pydantic supports a wide range of field types and constraints:

  • Basic types: str, int, float, bool, datetime, date
  • Optional fields: Use Optional[T] or T | None for fields that may be missing
  • Lists: List[T] for extracting multiple values
  • Nested models: Define complex nested structures
  • Field constraints: Use Field() for descriptions, defaults, and validation rules
  • Custom validators: Add custom validation logic with Pydantic validators
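The constraints and validators above can be sketched as follows. The LabResult model here is purely illustrative (it is not part of LangExtract) and uses Pydantic v2 APIs:

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class LabResult(BaseModel):
    # Hypothetical model for illustration only.
    test_name: str = Field(description="Name of the lab test")
    value: float = Field(ge=0, description="Measured value; must be non-negative")
    unit: str = Field(pattern=r"^[A-Za-z/%]+$", description="Unit of measure")

    @field_validator("test_name")
    @classmethod
    def normalize_name(cls, v: str) -> str:
        # Custom validation/normalization logic runs on every parse.
        return v.strip().lower()

ok = LabResult(test_name="  Hemoglobin ", value=13.5, unit="g/dL")
print(ok.test_name)  # "hemoglobin"

try:
    LabResult(test_name="WBC", value=-2, unit="10^9/L")
except ValidationError as e:
    # Constraint violations are reported per field.
    print(e.error_count())
```

Constraint violations raise a single ValidationError that aggregates every failing field, which is why error handling (covered below) usually iterates over the error list rather than catching per-field exceptions.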

Nested Schemas

You can define complex nested structures using multiple Pydantic models:

from typing import List
from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address
    phone_numbers: List[str]

result = lx.extract(
    text_or_documents=text,
    prompt_description="Extract person information",
    examples=[...],
    schema=Person
)
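Pydantic validates nested structures recursively: a dict of dicts parses into nested model instances, with type coercion applied at every level. A quick, self-contained illustration using the same models (the sample data is hypothetical):

```python
from typing import List
from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address
    phone_numbers: List[str]

raw = {
    "name": "Ada Lovelace",
    "age": "36",  # string coerced to int
    "address": {"street": "12 St James's Sq", "city": "London",
                "state": "N/A", "zip_code": "SW1Y 4JH"},
    "phone_numbers": ["+44 20 0000 0000"],
}
person = Person.model_validate(raw)
print(type(person.address).__name__)  # Address
print(person.age)  # 36 (an int, not a string)
```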

Validation & Error Handling

When LangExtract extracts data, it automatically validates against your schema:

  • Type checking: Values are coerced to the correct types when possible
  • Required fields: Missing required fields trigger validation errors
  • Constraint validation: Field constraints (min/max, regex patterns, etc.) are enforced
  • Clear errors: Validation errors include field names and specific issues

Handle validation errors gracefully in your code to provide feedback and improve extraction quality.
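One way to do this, sketched with plain Pydantic (the missing-field data here is contrived for the example), is to catch ValidationError and inspect its structured error list:

```python
from pydantic import BaseModel, ValidationError

class Medication(BaseModel):
    name: str
    dosage: str

try:
    Medication.model_validate({"name": "metformin"})  # dosage is missing
except ValidationError as exc:
    for err in exc.errors():
        # Each error names the offending field and the failure type.
        print(err["loc"], err["type"])  # ('dosage',) missing
```

Logging these per-field errors alongside the source text is an easy way to spot which parts of your prompt or schema need refinement.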

Schema Constraints in Prompts

LangExtract can use schema information to guide the LLM's extraction:

  • Field descriptions: Field descriptions from Field() are included in prompts
  • Type hints: Type information helps the model understand expected formats
  • Structure guidance: Nested models guide the model to extract hierarchical data
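To make the first point concrete, the helper below renders each field's description into prompt-style text via Pydantic's JSON schema. It is a hypothetical sketch of the idea only; LangExtract's actual prompt construction may differ:

```python
from typing import Optional
from pydantic import BaseModel, Field

class Medication(BaseModel):
    name: str = Field(description="Name of the medication")
    dosage: str = Field(description="Dosage amount and unit")
    frequency: Optional[str] = Field(None, description="How often to take")

def schema_hints(model: type[BaseModel]) -> str:
    # Hypothetical helper: list each field with its description,
    # as a model's Field() descriptions might appear in a prompt.
    schema = model.model_json_schema()
    lines = []
    for fname, spec in schema["properties"].items():
        lines.append(f"- {fname}: {spec.get('description', '')}")
    return "\n".join(lines)

print(schema_hints(Medication))
# - name: Name of the medication
# - dosage: Dosage amount and unit
# - frequency: How often to take
```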

Note: Schema constraints work best with Gemini models. For OpenAI models, you may need to use fence_output=True and use_schema_constraints=False.

Best Practices

  • Start simple: Begin with basic schemas and add complexity as needed
  • Use descriptions: Always provide Field descriptions to guide extraction
  • Make fields optional: Use Optional for fields that may not always be present
  • Validate early: Check validation errors to improve prompts and schemas
  • Iterate: Refine schemas based on real extraction results