Full-Text Literature Extraction: Romeo and Juliet
This example demonstrates LangExtract's capability to process complete documents directly from URLs, extracting structured information from full-length literary texts. We'll extract character information, relationships, and key events from Shakespeare's Romeo and Juliet (147,843 characters from Project Gutenberg).
Overview
This example showcases LangExtract's ability to handle:
- Long documents: Processing complete works (this play alone runs to 147,843 characters)
- URL input: Direct processing from web URLs
- Parallel processing: Efficient handling of large texts
- Sequential extraction: Multiple extraction passes for complex information
- Performance optimization: Techniques for long document processing
Extracting Character Information
Define a schema for character information:
from pydantic import BaseModel, Field
from typing import Optional, List

class Character(BaseModel):
    name: str = Field(description="Character's name")
    role: Optional[str] = Field(None, description="Character's role or title")
    family: Optional[str] = Field(None, description="Family affiliation (e.g., Montague, Capulet)")
    description: Optional[str] = Field(None, description="Brief description of the character")
    key_quotes: List[str] = Field(default_factory=list, description="Notable quotes from the character")

class CharacterList(BaseModel):
    characters: List[Character] = Field(description="List of characters in the play")
Extracting Relationships
Extract relationships between characters:
class Relationship(BaseModel):
    character1: str = Field(description="First character in the relationship")
    character2: str = Field(description="Second character in the relationship")
    relationship_type: str = Field(description="Type of relationship (e.g., 'lovers', 'family', 'enemy')")
    description: Optional[str] = Field(None, description="Description of the relationship")

class RelationshipList(BaseModel):
    relationships: List[Relationship] = Field(description="List of character relationships")
Extracting Key Events
Extract major plot events:
class Event(BaseModel):
    act: Optional[str] = Field(None, description="Act number")
    scene: Optional[str] = Field(None, description="Scene number")
    event_description: str = Field(description="Description of the event")
    characters_involved: List[str] = Field(default_factory=list, description="Characters involved")
    significance: Optional[str] = Field(None, description="Significance to the plot")

class EventList(BaseModel):
    events: List[Event] = Field(description="List of key plot events")
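These are plain Pydantic models, so you can sanity-check them before running any extraction, for example by validating a hand-written record (Pydantic v2 API shown; the sample data is illustrative only):

# Validate a hand-written record against the schema (Pydantic v2).
sample = CharacterList.model_validate({
    "characters": [
        {
            "name": "Juliet",
            "role": "Daughter of Capulet",
            "family": "Capulet",
            "key_quotes": ["O Romeo, Romeo, wherefore art thou Romeo?"],
        }
    ]
})
print(sample.characters[0].family)  # -> Capulet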
Processing from URL
LangExtract can process documents directly from URLs:
import langextract as lx

# Process Romeo and Juliet from Project Gutenberg
romeo_juliet_url = "https://www.gutenberg.org/files/1112/1112-0.txt"

# Extract character information
characters_result = lx.extract(
    text_or_documents=romeo_juliet_url,
    prompt_description="Extract all major characters from Romeo and Juliet, including their names, roles, family affiliations, and notable quotes.",
    examples=[...],
    schema=CharacterList,
    model_id="gemini-2.0-flash-exp",
)

# Extract relationships
relationships_result = lx.extract(
    text_or_documents=romeo_juliet_url,
    prompt_description="Extract relationships between characters in Romeo and Juliet, including family relationships, romantic relationships, and conflicts.",
    examples=[...],
    schema=RelationshipList,
    model_id="gemini-2.0-flash-exp",
)

# Extract key events
events_result = lx.extract(
    text_or_documents=romeo_juliet_url,
    prompt_description="Extract major plot events from Romeo and Juliet, organized by act and scene, including characters involved and significance.",
    examples=[...],
    schema=EventList,
    model_id="gemini-2.0-flash-exp",
)
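Once a call returns, you can work with the structured output directly. The snippet below assumes the call returns the populated schema instance; verify the actual return type against your LangExtract version.

# Assumes lx.extract returns the populated CharacterList instance;
# check the actual return type for your LangExtract version.
for character in characters_result.characters:
    print(f"{character.name} ({character.family or 'no family listed'})")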
Parallel Processing
To run several extractions over the same document at once, launch the lx.extract calls concurrently. Each call is independent and I/O-bound, so a standard-library thread pool works well:

from concurrent.futures import ThreadPoolExecutor

# Each lx.extract call is independent and I/O-bound, so run them in threads.
tasks = [
    ("Extract character information from Romeo and Juliet.", CharacterList),
    ("Extract relationships between characters in Romeo and Juliet.", RelationshipList),
    ("Extract key plot events from Romeo and Juliet.", EventList),
]

with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
    futures = [
        executor.submit(
            lx.extract,
            text_or_documents=romeo_juliet_url,
            prompt_description=prompt,
            examples=[...],
            schema=schema,
            model_id="gemini-2.0-flash-exp",
        )
        for prompt, schema in tasks
    ]
    characters_result, relationships_result, events_result = [f.result() for f in futures]
Handling Long Documents
For very long documents, consider:
- Chunking: Split long documents into manageable chunks (chunking and multi-pass extraction are both sketched after this list)
- Sequential passes: Extract different information types in separate passes
- Batch processing: Use Vertex AI Batch API for cost-effective processing
- Incremental extraction: Extract information incrementally and combine results
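A minimal sketch of chunking combined with multiple passes, assuming the max_char_buffer and extraction_passes parameters from LangExtract's long-document examples (verify both names against your installed version):

# Chunked, multi-pass extraction over the full play.
# max_char_buffer and extraction_passes follow LangExtract's long-document
# examples; confirm both parameter names against your installed version.
events_result = lx.extract(
    text_or_documents=romeo_juliet_url,
    prompt_description="Extract major plot events from Romeo and Juliet.",
    examples=[...],
    schema=EventList,
    model_id="gemini-2.0-flash-exp",
    max_char_buffer=1000,   # smaller chunks keep each model call focused
    extraction_passes=3,    # re-scan the text to improve recall
)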
Performance Optimization
Optimize performance for long documents:
- Use appropriate models: Choose models based on speed vs. accuracy trade-offs
- Parallel extraction: Extract multiple information types in parallel (chunk-level parallelism within a single call is sketched after this list)
- Efficient prompts: Design prompts that minimize token usage
- Batch API: Use Vertex AI Batch API for large-scale processing
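Within a single call, LangExtract's long-document examples also expose a max_workers parameter that fans the chunked requests out in parallel. Treat the name as an assumption and verify it against your installed version:

# Parallelize the chunked requests within one extraction call.
# max_workers follows LangExtract's long-document examples; verify the
# parameter name against your installed version.
characters_result = lx.extract(
    text_or_documents=romeo_juliet_url,
    prompt_description="Extract all major characters from Romeo and Juliet.",
    examples=[...],
    schema=CharacterList,
    model_id="gemini-2.0-flash-exp",
    max_workers=20,  # number of chunk requests in flight at once
)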
Results and Analysis
After extraction, you can:
- Analyze characters: Explore character information and relationships
- Visualize relationships: Create relationship graphs from extracted data (see the sketch after this list)
- Timeline events: Organize events chronologically
- Verify accuracy: Use source grounding to verify extractions
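The extracted relationships map naturally onto a graph. Here is a sketch using networkx (a third-party library chosen for illustration, not a LangExtract dependency), assuming relationships_result holds a populated RelationshipList:

import networkx as nx

# Build an undirected character graph from the extracted relationships.
graph = nx.Graph()
for rel in relationships_result.relationships:
    graph.add_edge(rel.character1, rel.character2, kind=rel.relationship_type)

# For example: which characters are directly connected to Juliet?
if "Juliet" in graph:
    print(sorted(graph.neighbors("Juliet")))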
Best Practices
- Start with summaries: Extract high-level information first, then details
- Use multiple passes: Break complex extractions into multiple focused passes
- Validate results: Check source grounding for critical extractions (a lightweight check is sketched after this list)
- Handle ambiguity: Account for ambiguous or conflicting information
- Monitor costs: Track token usage and costs for long documents
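One lightweight grounding check for the schema used above: download the source text once and confirm that each extracted quote appears in it verbatim. This catches only exact mismatches, so expect false alarms where the model normalizes spelling or punctuation.

import urllib.request

# Fetch the play once so extractions can be checked against the source.
with urllib.request.urlopen(romeo_juliet_url) as response:
    source_text = response.read().decode("utf-8")

# Flag any extracted quote that does not appear verbatim in the play.
for character in characters_result.characters:
    for quote in character.key_quotes:
        if quote not in source_text:
            print(f"Unverified quote for {character.name}: {quote!r}")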