Grounding & Source Tracking

LangExtract provides precise source grounding for every extracted value, showing exactly where in the original text each piece of information came from. This enables provenance tracking, verification, and explainable AI.

What is Grounding?

Grounding refers to the ability to trace extracted information back to its source location in the original text. For every value extracted by LangExtract, you receive:

Character offsets: Start and end positions in the source text
Text spans: The exact substring that was extracted
Source references: Links back to the original document or text segment

Span Data Structures

LangExtract represents grounding information using span objects that contain:

start: Character offset where the span begins (0-indexed)
end: Character offset where the span ends (exclusive)
text: The actual text content of the span
source: Reference to the source document or text segment

Spans are attached to extracted field values automatically, so you can read each value's source location programmatically without any extra configuration.
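As a rough illustration of the fields listed above, the sketch below models a span as a small dataclass. The `Span` class and the `make_span` helper are hypothetical stand-ins written for this example, not LangExtract's own types. It only demonstrates the invariant implied by a 0-indexed start and an exclusive end: slicing the source with the two offsets reproduces the span text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a grounding span (not a LangExtract class)."""
    start: int   # 0-indexed offset where the span begins
    end: int     # offset where the span ends (exclusive)
    text: str    # substring covered by the span
    source: str  # identifier of the source document or segment


def make_span(document: str, needle: str, source: str) -> Span:
    """Locate the first occurrence of `needle` and build a span for it."""
    start = document.index(needle)  # raises ValueError if the text is absent
    end = start + len(needle)
    return Span(start=start, end=end, text=document[start:end], source=source)


document = "Patient was prescribed 500mg of aspirin."
dose = make_span(document, "500mg", source="note-1")

# With an exclusive end offset, slicing the source reproduces the span text.
assert document[dose.start:dose.end] == dose.text
print(dose)
```

Because the end offset is exclusive, `end - start` is also the length of the span text, which makes adjacent spans easy to compare without off-by-one adjustments.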

Why Grounding Matters

Source grounding provides several critical benefits:

Verification: Review extracted values against the original text to ensure accuracy
Debugging: Understand why the model extracted certain values by examining source context
Compliance: Meet requirements for explainable AI and auditability in regulated industries
User trust: Show users exactly where information came from, building confidence in results
Error correction: Identify and fix extraction errors by examining source spans
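The verification and error-correction points can be made concrete with a small check. The sketch below assumes each extracted value arrives with optional start and end offsets; the `Grounded` record and `find_mismatches` function are hypothetical names for this example, not part of LangExtract. It flags values whose offsets are missing or point at different text.

```python
from typing import List, NamedTuple, Optional


class Grounded(NamedTuple):
    """Hypothetical record: an extracted value plus its claimed offsets."""
    field: str
    value: str
    start: Optional[int]
    end: Optional[int]


def find_mismatches(document: str, items: List[Grounded]) -> List[str]:
    """Return the field names whose offsets do not reproduce the value."""
    bad = []
    for item in items:
        if item.start is None or item.end is None:
            bad.append(item.field)  # ungrounded: nothing to check against
        elif document[item.start:item.end] != item.value:
            bad.append(item.field)  # offsets point at different text
    return bad


document = "Patient was prescribed 500mg of aspirin."
items = [
    Grounded("dosage", "500mg", 23, 28),
    Grounded("medication", "aspirin", 32, 39),
    Grounded("route", "oral", None, None),   # value not present in the text
    Grounded("frequency", "daily", 0, 5),    # offsets point elsewhere
]
print(find_mismatches(document, items))
```

A check like this is cheap enough to run on every result, and the flagged fields are natural candidates for manual review.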

Using Grounding in Your Code

When you extract data with LangExtract, grounding information is automatically included. You can access spans through the extraction result object:

import langextract as lx

text = "Patient was prescribed 500mg of aspirin."

result = lx.extract(
    text_or_documents=text,
    prompt_description="Extract medication information",
    examples=[...],
)

# Access extracted values with their source positions
for extraction in result.extractions:
    interval = extraction.char_interval
    if interval is None:
        continue  # extraction could not be aligned to the source text
    print(f"{extraction.extraction_class}: {extraction.extraction_text}")
    print(f"  Source: {text[interval.start_pos:interval.end_pos]}")
    print(f"  Position: {interval.start_pos}-{interval.end_pos}")

Visualization

LangExtract's visualization tools automatically highlight grounded spans in the source text, making it easy to see where each extracted value came from. This is especially useful for:

Reviewing extraction quality
Debugging prompt issues
Presenting results to stakeholders
Training and documentation

Learn more about visualization features and see grounding in action in the radiology report example.
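To show the idea behind span highlighting without depending on the library's visualization tools, here is a minimal plain-text sketch. The `highlight` function is a hypothetical helper written for this example. It assumes spans are given as `(start, end, label)` tuples with exclusive end offsets and that they do not overlap.

```python
from typing import List, Tuple


def highlight(document: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each (start, end, label) span as [label: text] in the document.

    Spans are assumed not to overlap; end offsets are exclusive.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(document[cursor:start])               # untouched text
        pieces.append(f"[{label}: {document[start:end]}]")  # marked span
        cursor = end
    pieces.append(document[cursor:])                        # trailing text
    return "".join(pieces)


document = "Patient was prescribed 500mg of aspirin."
marked = highlight(document, [(32, 39, "medication"), (23, 28, "dosage")])
print(marked)
```

Sorting the spans first means they can be passed in any order, and walking the text with a single cursor keeps every offset relative to the original document rather than to the partially marked-up output.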

Best Practices

Always verify: Use grounding information to verify critical extractions
Check context: Examine surrounding text to understand extraction context
Handle edge cases: Some extractions may span multiple locations or have no direct source, so check that a span is present before reading its offsets
Use for debugging: When extractions are incorrect, examine source spans to improve prompts
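The "check context" and "handle edge cases" practices can be combined in one small helper. The sketch below is an assumption-laden example, not a LangExtract function: `context_window` is a hypothetical name, offsets are taken to be optional integers with an exclusive end, and the `>>` and `<<` markers are an arbitrary choice for marking the span.

```python
from typing import Optional


def context_window(document: str, start: Optional[int], end: Optional[int],
                   width: int = 15) -> Optional[str]:
    """Return the span with up to `width` characters of context on each side.

    Returns None when the extraction has no source offsets.
    """
    if start is None or end is None:
        return None
    left = max(0, start - width)              # clamp at the document start
    right = min(len(document), end + width)   # clamp at the document end
    return f"{document[left:start]}>>{document[start:end]}<<{document[end:right]}"


document = "Patient was prescribed 500mg of aspirin."
print(context_window(document, 23, 28))
print(context_window(document, None, None))
```

Returning `None` for ungrounded values forces the caller to decide how to treat them, rather than silently printing an empty or misleading snippet.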