Grounding & Source Tracking
LangExtract provides precise source grounding for every extracted value, showing exactly where in the original text each piece of information came from. This enables provenance tracking, verification, and explainable AI.
What is Grounding?
Grounding refers to the ability to trace extracted information back to its source location in the original text. For every value extracted by LangExtract, you receive:
- Character offsets: Start and end positions in the source text
- Text spans: The exact substring that was extracted
- Source references: Links back to the original document or text segment
Span Data Structures
LangExtract represents grounding information using span objects that contain:
- start: Character offset where the span begins (0-indexed)
- end: Character offset where the span ends (exclusive)
- text: The actual text content of the span
- source: Reference to the source document or text segment
These spans are automatically attached to every extracted field value, allowing you to programmatically access the source location.
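The four fields listed above can be modeled with a small record type. The sketch below is illustrative only: the `Span` class, its `matches` helper, and the `"note-1"` identifier are hypothetical stand-ins, not LangExtract's own classes. It shows the invariant that the half-open offsets imply, namely that slicing the source text with `start` and `end` should reproduce `text`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a grounding span (not LangExtract's class)."""
    start: int   # 0-indexed offset where the span begins
    end: int     # offset where the span ends (exclusive)
    text: str    # the extracted substring
    source: str  # identifier of the source document or segment

    def matches(self, source_text: str) -> bool:
        """Return True if the offsets are in range and reproduce `text`."""
        in_range = 0 <= self.start <= self.end <= len(source_text)
        return in_range and source_text[self.start:self.end] == self.text


document = "Patient was prescribed 500mg of aspirin."
start = document.index("500mg")
dose = Span(start=start, end=start + len("500mg"), text="500mg", source="note-1")

print(dose.matches(document))  # offsets and text agree
```

Because `end` is exclusive, `end - start` always equals `len(text)`, which makes a check like `matches` a cheap guard against spans that have drifted out of sync with their document.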
Why Grounding Matters
Source grounding provides several critical benefits:
- Verification: Review extracted values against the original text to ensure accuracy
- Debugging: Understand why the model extracted certain values by examining source context
- Compliance: Meet requirements for explainable AI and auditability in regulated industries
- User trust: Show users exactly where information came from, building confidence in results
- Error correction: Identify and fix extraction errors by examining source spans
Using Grounding in Your Code
When you extract data with LangExtract, grounding information is automatically included. You can access spans through the extraction result object:
import langextract as lx

result = lx.extract(
    text_or_documents="Patient was prescribed 500mg of aspirin.",
    prompt_description="Extract medication information",
    examples=[...]
)

# Access extracted values with their spans
for field_name, value in result.items():
    if hasattr(value, 'span'):
        print(f"{field_name}: {value}")
        print(f"  Source: {value.span.text}")
        print(f"  Position: {value.span.start}-{value.span.end}")
Visualization
LangExtract's visualization tools automatically highlight grounded spans in the source text, making it easy to see where each extracted value came from. This is especially useful for:
- Reviewing extraction quality
- Debugging prompt issues
- Presenting results to stakeholders
- Training and documentation
Learn more about visualization features and see grounding in action in the radiology report example.
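To make the idea of span highlighting concrete, here is a minimal plain-text sketch. It is not LangExtract's visualization code: the `highlight` function and the `(start, end, label)` tuple format are assumptions made for illustration. It walks the spans in order and wraps each one in a bracketed label.

```python
def highlight(source_text, spans):
    """Wrap each (start, end, label) span in a [label: text] marker.

    Spans are processed left to right; a span that overlaps an
    already-emitted one is skipped to keep the output well formed.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # overlapping span: skip it in this simple sketch
        pieces.append(source_text[cursor:start])
        pieces.append(f"[{label}: {source_text[start:end]}]")
        cursor = end
    pieces.append(source_text[cursor:])
    return "".join(pieces)


text = "Patient was prescribed 500mg of aspirin."
marked = highlight(text, [(32, 39, "medication"), (23, 28, "dosage")])
print(marked)
# Patient was prescribed [dosage: 500mg] of [medication: aspirin].
```

The same loop could emit HTML tags instead of brackets; the key point is that character offsets alone are enough to reconstruct an annotated view of the source.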
Best Practices
- Always verify: Use grounding information to verify critical extractions
- Check context: Examine surrounding text to understand extraction context
- Handle edge cases: Be aware that some extractions may span multiple locations or have no direct source
- Use for debugging: When extractions are incorrect, examine source spans to improve prompts
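The "check context" and "handle edge cases" practices above can be combined in one small helper. This is a sketch under stated assumptions: `context_window` is a hypothetical function, spans are passed as plain `(start, end)` tuples, and an extraction with no direct source is represented as `None`.

```python
from typing import Optional, Tuple


def context_window(
    source_text: str,
    span: Optional[Tuple[int, int]],
    width: int = 20,
) -> Optional[str]:
    """Return the span plus up to `width` characters on each side.

    Returns None when the extraction has no span, so callers must
    handle ungrounded values explicitly instead of crashing.
    """
    if span is None:
        return None
    start, end = span
    left = max(0, start - width)                 # clamp at document start
    right = min(len(source_text), end + width)   # clamp at document end
    return source_text[left:right]


text = "Patient was prescribed 500mg of aspirin."
print(context_window(text, (23, 28), width=11))
# prescribed 500mg of aspirin
print(context_window(text, None))
# None
```

Reviewing the window rather than the bare span makes it easier to spot cases where the extracted value is correct in isolation but attached to the wrong entity.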