Schemas & Validation
LangExtract uses Pydantic models to define structured schemas for your extracted data. This provides type safety, automatic validation, and clear error messages when extractions don't match your expectations.
Why Use Schemas?
Defining schemas for your extracted data provides several benefits:
- Type safety: Ensure extracted values match expected types (strings, numbers, dates, etc.)
- Validation: Automatically validate that required fields are present and values meet constraints
- IDE support: Get autocomplete and type hints in your editor
- Documentation: Schemas serve as clear documentation of your data structure
- Error handling: Get clear, actionable error messages when validation fails
Defining Schemas with Pydantic
LangExtract uses Pydantic models to define schemas. Here's a simple example:
```python
import langextract as lx
from pydantic import BaseModel, Field
from typing import Optional

class Medication(BaseModel):
    name: str = Field(description="Name of the medication")
    dosage: str = Field(description="Dosage amount and unit")
    frequency: Optional[str] = Field(None, description="How often to take")
    route: Optional[str] = Field(None, description="Route of administration")

result = lx.extract(
    text_or_documents=clinical_text,
    prompt_description="Extract medication information",
    examples=[...],
    schema=Medication,
)
```
Field Types & Constraints
Pydantic supports a wide range of field types and constraints:
- Basic types: str, int, float, bool, datetime, date
- Optional fields: Use Optional[T] or T | None for fields that may be missing
- Lists: List[T] for extracting multiple values
- Nested models: Define complex nested structures
- Field constraints: Use Field() for descriptions, defaults, and validation rules
- Custom validators: Add custom validation logic with Pydantic validators
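The constraints above can be combined in a single model. The sketch below is a hypothetical lab-result schema (the field names and rules are illustrative, not part of LangExtract) showing numeric bounds, length constraints, defaults, and a custom validator using Pydantic v2's field_validator:

```python
from datetime import date
from typing import List, Optional

from pydantic import BaseModel, Field, field_validator


class LabResult(BaseModel):
    # Hypothetical schema for illustration; adapt the fields to your data.
    test_name: str = Field(description="Name of the lab test")
    value: float = Field(ge=0, description="Measured value; must be non-negative")
    unit: str = Field(min_length=1, description="Unit of measurement")
    collected_on: Optional[date] = Field(None, description="Collection date, if stated")
    flags: List[str] = Field(default_factory=list, description="Abnormality flags")

    @field_validator("unit")
    @classmethod
    def normalize_unit(cls, v: str) -> str:
        # Custom validation logic: trim whitespace and lowercase the unit.
        return v.strip().lower()


r = LabResult(test_name="Hemoglobin", value=13.5, unit=" g/dL ")
print(r.unit)  # "g/dl"
```

Constraint violations (for example, a negative value) raise a ValidationError at construction time, before the bad data reaches the rest of your pipeline.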
Nested Schemas
You can define complex nested structures using multiple Pydantic models:
```python
import langextract as lx
from pydantic import BaseModel
from typing import List

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address
    phone_numbers: List[str]

result = lx.extract(
    text_or_documents=text,
    prompt_description="Extract person information",
    examples=[...],
    schema=Person,
)
```
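Once validated, nested fields are ordinary typed attributes: the inner dict becomes an Address instance, and compatible values are coerced to the declared types. A self-contained sketch with invented sample data:

```python
from typing import List

from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Person(BaseModel):
    name: str
    age: int
    address: Address
    phone_numbers: List[str]


# Pydantic coerces the nested dict into an Address instance and the
# numeric string "36" into an int automatically.
data = {
    "name": "Ada Lovelace",
    "age": "36",
    "address": {
        "street": "12 St James's Square",
        "city": "London",
        "state": "N/A",
        "zip_code": "SW1Y 4LB",
    },
    "phone_numbers": ["+44 20 0000 0000"],
}
person = Person(**data)
print(type(person.address).__name__, person.age)  # Address 36
```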
Validation & Error Handling
When LangExtract extracts data, it automatically validates against your schema:
- Type checking: Values are coerced to the correct types when possible
- Required fields: Missing required fields trigger validation errors
- Constraint validation: Field constraints (min/max, regex patterns, etc.) are enforced
- Clear errors: Validation errors include field names and specific issues
Handle validation errors gracefully in your code to provide feedback and improve extraction quality.
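The same Pydantic errors you would see from a failed extraction can be caught and inspected directly. A minimal sketch using only Pydantic (the raw dict stands in for data coming back from the model):

```python
from pydantic import BaseModel, Field, ValidationError


class Medication(BaseModel):
    name: str = Field(description="Name of the medication")
    dosage: str = Field(description="Dosage amount and unit")


raw = {"name": "Lisinopril"}  # dosage is missing

try:
    Medication(**raw)
except ValidationError as exc:
    # Each error names the offending field and the specific problem,
    # which you can log or feed back into prompt refinement.
    for err in exc.errors():
        print(err["loc"], err["msg"])  # e.g. ('dosage',) Field required
```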
Schema Constraints in Prompts
LangExtract can use schema information to guide the LLM's extraction:
- Field descriptions: Field descriptions from Field() are included in prompts
- Type hints: Type information helps the model understand expected formats
- Structure guidance: Nested models guide the model to extract hierarchical data
Note: Schema constraints work best with Gemini models. For OpenAI models, you may need to use fence_output=True and use_schema_constraints=False.
Best Practices
- Start simple: Begin with basic schemas and add complexity as needed
- Use descriptions: Always provide Field descriptions to guide extraction
- Make fields optional: Use Optional for fields that may not always be present
- Validate early: Check validation errors to improve prompts and schemas
- Iterate: Refine schemas based on real extraction results