Full-Text Literature Extraction: Romeo and Juliet

This example demonstrates LangExtract's capability to process complete documents directly from URLs, extracting structured information from full-length literary texts. We'll extract character information, relationships, and key events from Shakespeare's Romeo and Juliet (147,843 characters from Project Gutenberg).

Overview

This example showcases LangExtract's ability to handle:

  • Long documents: Processing complete works (147,843+ characters)
  • URL input: Direct processing from web URLs
  • Parallel processing: Efficient handling of large texts
  • Sequential extraction: Multiple extraction passes for complex information
  • Performance optimization: Techniques for long document processing

Extracting Character Information

Define a schema for character information:

from pydantic import BaseModel, Field
from typing import Optional, List

class Character(BaseModel):
    name: str = Field(description="Character's name")
    role: Optional[str] = Field(None, description="Character's role or title")
    family: Optional[str] = Field(None, description="Family affiliation (e.g., Montague, Capulet)")
    description: Optional[str] = Field(None, description="Brief description of the character")
    key_quotes: List[str] = Field(default_factory=list, description="Notable quotes from the character")

class CharacterList(BaseModel):
    characters: List[Character] = Field(description="List of characters in the play")
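Because these are standard Pydantic models, you can sanity-check the schema independently of any extraction call. The sketch below repeats the `Character` model so it runs on its own; the sample record is illustrative, not extracted output:

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Character(BaseModel):
    name: str = Field(description="Character's name")
    role: Optional[str] = Field(None, description="Character's role or title")
    family: Optional[str] = Field(None, description="Family affiliation (e.g., Montague, Capulet)")
    description: Optional[str] = Field(None, description="Brief description of the character")
    key_quotes: List[str] = Field(default_factory=list, description="Notable quotes from the character")


# Validate a hand-built record against the schema
juliet = Character(
    name="Juliet",
    role="Daughter of Capulet",
    family="Capulet",
    key_quotes=["O Romeo, Romeo, wherefore art thou Romeo?"],
)
print(juliet.name, juliet.family)  # Juliet Capulet
```

Serializing such records (e.g., with `model_dump()`) is also a convenient way to author the few-shot examples that the extraction calls expect.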

Extracting Relationships

Extract relationships between characters:

class Relationship(BaseModel):
    character1: str = Field(description="First character in the relationship")
    character2: str = Field(description="Second character in the relationship")
    relationship_type: str = Field(description="Type of relationship (e.g., 'lovers', 'family', 'enemy')")
    description: Optional[str] = Field(None, description="Description of the relationship")

class RelationshipList(BaseModel):
    relationships: List[Relationship] = Field(description="List of character relationships")

Extracting Key Events

Extract major plot events:

class Event(BaseModel):
    act: Optional[str] = Field(None, description="Act number")
    scene: Optional[str] = Field(None, description="Scene number")
    event_description: str = Field(description="Description of the event")
    characters_involved: List[str] = Field(default_factory=list, description="Characters involved")
    significance: Optional[str] = Field(None, description="Significance to the plot")

class EventList(BaseModel):
    events: List[Event] = Field(description="List of key plot events")

Processing from URL

LangExtract can process documents directly from URLs:

import langextract as lx

# Process Romeo and Juliet from Project Gutenberg
romeo_juliet_url = "https://www.gutenberg.org/files/1112/1112-0.txt"

# Extract character information
characters_result = lx.extract(
    text_or_documents=romeo_juliet_url,
    prompt_description="Extract all major characters from Romeo and Juliet, including their names, roles, family affiliations, and notable quotes.",
    examples=[...],
    schema=CharacterList,
    model_id="gemini-2.0-flash-exp"
)

# Extract relationships
relationships_result = lx.extract(
    text_or_documents=romeo_juliet_url,
    prompt_description="Extract relationships between characters in Romeo and Juliet, including family relationships, romantic relationships, and conflicts.",
    examples=[...],
    schema=RelationshipList,
    model_id="gemini-2.0-flash-exp"
)

# Extract key events
events_result = lx.extract(
    text_or_documents=romeo_juliet_url,
    prompt_description="Extract major plot events from Romeo and Juliet, organized by act and scene, including characters involved and significance.",
    examples=[...],
    schema=EventList,
    model_id="gemini-2.0-flash-exp"
)

Parallel Processing

To extract several information types at once, run the calls concurrently. Each call keeps the same single-prompt signature used above; the thread-pool wrapper below is a plain-Python sketch, not a LangExtract feature:

import concurrent.futures

# Run the three extractions concurrently; each worker issues one lx.extract call
jobs = [
    ("Extract character information", CharacterList),
    ("Extract relationships", RelationshipList),
    ("Extract key events", EventList),
]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    futures = [
        pool.submit(
            lx.extract,
            text_or_documents=romeo_juliet_url,
            prompt_description=prompt,
            examples=[...],
            schema=schema,
            model_id="gemini-2.0-flash-exp",
        )
        for prompt, schema in jobs
    ]
    characters_result, relationships_result, events_result = [
        f.result() for f in futures
    ]

Handling Long Documents

For very long documents, consider:

  • Chunking: Split long documents into manageable chunks
  • Sequential passes: Extract different information types in separate passes
  • Batch processing: Use Vertex AI Batch API for cost-effective processing
  • Incremental extraction: Extract information incrementally and combine results
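The chunking and incremental-merge ideas above can be sketched in plain Python. The overlap size and the name-based deduplication key are illustrative choices, not LangExtract defaults:

```python
def chunk_text(text: str, chunk_size: int = 8000, overlap: int = 500) -> list[str]:
    """Split text into overlapping chunks so an entity that straddles a
    boundary is still seen whole in at least one chunk."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks


def merge_characters(per_chunk_results: list[list[dict]]) -> list[dict]:
    """Combine per-chunk extractions incrementally, deduplicating by name
    and unioning the quotes gathered from different chunks."""
    merged: dict[str, dict] = {}
    for chunk_chars in per_chunk_results:
        for char in chunk_chars:
            existing = merged.setdefault(char["name"], char)
            for quote in char.get("key_quotes", []):
                if quote not in existing.setdefault("key_quotes", []):
                    existing["key_quotes"].append(quote)
    return list(merged.values())


# A 20,000-character text yields 3 overlapping chunks with these settings
print(len(chunk_text("x" * 20000)))  # 3
```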

Performance Optimization

Optimize performance for long documents:

  • Use appropriate models: Choose models based on speed vs. accuracy trade-offs
  • Parallel extraction: Extract multiple information types in parallel
  • Efficient prompts: Design prompts that minimize token usage
  • Batch API: Use Vertex AI Batch API for large-scale processing
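Before running a long-document job, it helps to budget it roughly. The sketch below estimates input tokens from character count; the 4-characters-per-token ratio is a common heuristic for English text, not an exact tokenizer:

```python
def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 characters/token heuristic."""
    return int(num_chars / chars_per_token)


# Romeo and Juliet is 147,843 characters
doc_tokens = estimate_tokens(147_843)
print(doc_tokens)  # 36960

# Three separate extraction passes roughly triple the input cost
print(3 * doc_tokens)  # 110880
```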

Results and Analysis

After extraction, you can:

  • Analyze characters: Explore character information and relationships
  • Visualize relationships: Create relationship graphs from extracted data
  • Timeline events: Organize events chronologically
  • Verify accuracy: Use source grounding to verify extractions
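The relationship-visualization step can start from a simple adjacency list built out of the extracted pairs. The sample relationships below are illustrative stand-ins for `RelationshipList` output:

```python
from collections import defaultdict


def build_graph(relationships: list[dict]) -> dict[str, set[str]]:
    """Build an undirected adjacency list from extracted relationship pairs."""
    graph: dict[str, set[str]] = defaultdict(set)
    for rel in relationships:
        graph[rel["character1"]].add(rel["character2"])
        graph[rel["character2"]].add(rel["character1"])
    return dict(graph)


sample = [
    {"character1": "Romeo", "character2": "Juliet", "relationship_type": "lovers"},
    {"character1": "Romeo", "character2": "Mercutio", "relationship_type": "friends"},
    {"character1": "Juliet", "character2": "Tybalt", "relationship_type": "family"},
]
graph = build_graph(sample)
print(sorted(graph["Romeo"]))   # ['Juliet', 'Mercutio']
print(sorted(graph["Juliet"]))  # ['Romeo', 'Tybalt']
```

From here the adjacency list can be handed to a graph library for layout and plotting.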

Best Practices

  • Start with summaries: Extract high-level information first, then details
  • Use multiple passes: Break complex extractions into multiple focused passes
  • Validate results: Check source grounding for critical extractions
  • Handle ambiguity: Account for ambiguous or conflicting information
  • Monitor costs: Track token usage and costs for long documents
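The result-validation practice can be automated for quote-bearing fields: every extracted quote should appear verbatim (modulo whitespace) in the source text. A minimal check, assuming you hold the raw play text as a string:

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace so line breaks in the source don't break matching."""
    return re.sub(r"\s+", " ", text).strip()


def verify_quotes(source_text: str, quotes: list[str]) -> list[str]:
    """Return the quotes that cannot be found verbatim in the source."""
    haystack = normalize(source_text)
    return [q for q in quotes if normalize(q) not in haystack]


source = "But soft, what light through\nyonder window breaks?"
print(verify_quotes(source, [
    "what light through yonder window breaks",  # grounded in the source
    "a quote the model invented",               # not grounded
]))  # ['a quote the model invented']
```

Any quotes returned by `verify_quotes` are candidates for hallucination and worth re-checking against the source before downstream use.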