
Google AI Releases LangExtract: An Open-Source Python Library That Extracts Structured Data from Unstructured Text Documents

In today’s data-driven world, valuable insights are often buried in unstructured text such as clinical notes, lengthy legal contracts, or customer feedback threads. Extracting meaningful, traceable information from these documents is both a technical and a practical challenge. Google AI’s new open-source Python library, LangExtract, is designed to address this gap directly, using LLMs like Gemini to provide powerful automated extraction with traceability and transparency at its core.

1. Declarative and traceable extraction

LangExtract lets users define custom extraction tasks using natural-language instructions and a handful of high-quality “few-shot” examples. Developers and analysts can specify exactly which entities, relationships, or facts to extract, and in what structure. Crucially, every extracted piece of information is tied directly to its source text, which enables verification, auditing, and end-to-end traceability.
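To make the idea of source grounding concrete, here is a minimal standard-library sketch. It is not LangExtract's implementation: the GroundedExtraction record and the ground() helper are hypothetical names, and the sketch assumes extractions are quoted verbatim from the source, as the prompt in the sample workflow requests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    """Hypothetical record: an extracted span plus its position in the source."""
    extraction_class: str
    extraction_text: str
    start: Optional[int]  # character offset where the span begins, or None
    end: Optional[int]    # character offset just past the span, or None


def ground(source: str, extraction_class: str, extraction_text: str) -> GroundedExtraction:
    """Locate extraction_text verbatim in source and record its offsets."""
    start = source.find(extraction_text)
    if start == -1:
        # Not found verbatim: leave it ungrounded so a reviewer can flag it.
        return GroundedExtraction(extraction_class, extraction_text, None, None)
    return GroundedExtraction(
        extraction_class, extraction_text, start, start + len(extraction_text)
    )


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
hit = ground(source, "character", "Lady Juliet")
miss = ground(source, "character", "Mercutio")

print(hit.start, hit.end, source[hit.start:hit.end])
print(miss.start)
```

An extraction that cannot be located in the source stays ungrounded, which is the kind of signal an auditor would look for.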

2. Domain versatility

The library works not only in technical demonstrations but also in critical real-world domains, including healthcare (clinical notes, medical reports), finance (summaries, risk documents), law (contracts), research literature, and even the arts (analyzing Shakespeare). Example use cases include automatic extraction of medications, dosages, and administration details from clinical documents, as well as relationships and emotions from plays or literature.

3. Schema enforcement with LLMs

LangExtract is powered by Gemini and compatible with other LLMs. It enforces custom output schemas (such as JSON), so results are not only accurate but also immediately usable in downstream databases, analytics, or AI pipelines. It addresses traditional LLM weaknesses around hallucination and schema drift by grounding outputs in both the user’s instructions and the actual source text.
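The kind of check that schema enforcement and grounding imply can be sketched in plain Python. This is an illustration only, not how LangExtract enforces schemas internally: validate_output(), REQUIRED_FIELDS, and ALLOWED_CLASSES are hypothetical names, and the field names simply mirror those used in the sample workflow.

```python
import json

# Assumed schema for this sketch: field name -> required Python type.
REQUIRED_FIELDS = {"extraction_class": str, "extraction_text": str, "attributes": dict}
ALLOWED_CLASSES = {"character", "emotion", "relationship"}


def validate_output(raw_json: str, source: str):
    """Split model output into accepted items and human-readable rejections."""
    accepted, problems = [], []
    for i, item in enumerate(json.loads(raw_json)):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not a JSON object")
            continue
        bad = [k for k, t in REQUIRED_FIELDS.items() if not isinstance(item.get(k), t)]
        if bad:
            problems.append(f"item {i}: missing or mistyped fields {bad}")
        elif item["extraction_class"] not in ALLOWED_CLASSES:
            problems.append(f"item {i}: unknown class {item['extraction_class']!r}")
        elif item["extraction_text"] not in source:
            # Text that never appears in the source is a likely hallucination.
            problems.append(f"item {i}: text not found in source")
        else:
            accepted.append(item)
    return accepted, problems


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
raw = json.dumps([
    {"extraction_class": "character", "extraction_text": "Lady Juliet",
     "attributes": {"emotional_state": "longing"}},
    {"extraction_class": "character", "extraction_text": "Mercutio", "attributes": {}},
    {"extraction_class": "emotion", "extraction_text": "aching"},
])
accepted, problems = validate_output(raw, source)
print(len(accepted), problems)
```

Rejecting items rather than silently repairing them keeps the downstream data consistent with the declared structure.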

4. Scalability and visualization

  • Handles large volumes: LangExtract efficiently processes long documents by chunking them, parallelizing extraction, and aggregating the results.
  • Interactive visualization: Developers can generate interactive HTML reports that show each extracted entity highlighted in context in the original document, which supports auditing and error analysis.
  • Smooth integration: It works in Google Colab, Jupyter, or as standalone HTML files, providing a rapid feedback loop for developers and researchers.
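The chunk, parallelize, and aggregate pattern can be sketched with the standard library alone. This is a simplified illustration, not LangExtract's internals: a keyword scan stands in for the LLM call, and chunk_text(), extract_from_chunk(), and extract_long() are hypothetical helpers. The sketch assumes the overlap is at least as long as the longest target span, so nothing is lost at a chunk boundary.

```python
from concurrent.futures import ThreadPoolExecutor

KEYWORDS = ("Romeo", "Juliet")  # stand-in targets; a real system would call an LLM


def chunk_text(text, size, overlap):
    """Split text into (start_offset, chunk) pairs using overlapping windows."""
    step = size - overlap  # requires size > overlap
    return [(start, text[start:start + size]) for start in range(0, len(text), step)]


def extract_from_chunk(offset_and_chunk):
    """Find keyword spans in one chunk, reported as document-level offsets."""
    offset, chunk = offset_and_chunk
    found = []
    for word in KEYWORDS:
        pos = chunk.find(word)
        while pos != -1:
            found.append((offset + pos, offset + pos + len(word), word))
            pos = chunk.find(word, pos + 1)
    return found


def extract_long(text, size=40, overlap=10, workers=4):
    """Process chunks in parallel, then merge and de-duplicate the spans."""
    chunks = chunk_text(text, size, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(extract_from_chunk, chunks))
    # Overlapping windows can report the same span twice; a set removes repeats.
    return sorted({span for spans in per_chunk for span in spans})


text = "Romeo loves Juliet. " * 5
spans = extract_long(text)
print(len(spans), spans[:2])
```

Converting chunk-local positions back to document-level offsets during extraction is what keeps the aggregated results traceable to the original text.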

5. Installation and usage

Install easily with pip:

pip install langextract

Example workflow (extracting character information from Shakespeare):

import langextract as lx
import textwrap

# 1. Define your prompt
prompt = textwrap.dedent("""
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.
""")

# 2. Give a high-quality example
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
        ],
    )
]

# 3. Extract from new text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro"
)

# 4. Save and visualize results
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl")
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content)

This will result in structured, source-anchored JSON output, as well as interactive HTML visualizations for easy review and presentation.
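To see why source-anchored output makes review straightforward, the sketch below renders character spans as highlighted HTML using only the standard library. It is not the code behind lx.visualize: the highlight() helper and the (start, end, label) span format are assumptions made for illustration.

```python
import html


def highlight(source, spans):
    """Wrap each (start, end, label) span of source in a <mark> tag."""
    parts, cursor = [], 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already rendered
        parts.append(html.escape(source[cursor:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">{html.escape(source[start:end])}</mark>'
        )
        cursor = end
    parts.append(html.escape(source[cursor:]))
    return "".join(parts)


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
# Compute offsets from the text itself rather than hard-coding them.
spans = [
    (source.index(t), source.index(t) + len(t), "character")
    for t in ("Lady Juliet", "Romeo")
]
page = highlight(source, spans)
print(page)
```

Because each span carries its offsets, a reviewer can check every highlight against the surrounding sentence without re-reading the whole document.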

Specialized and real-world applications

  • Medicine: Extracts medications, dosages, and timing, and links them back to their source sentences. Informed by research on accelerating medical information extraction, LangExtract’s approach is directly applicable to structuring clinical and radiology reports, improving clarity and supporting interoperability.
  • Finance and law: Automatically extracts relevant clauses, terms, or risks from dense legal or financial text, ensuring that every output can be traced back to its context.
  • Research and data mining: Streamlines high-throughput extraction from thousands of scientific papers.

The team even offers a demo called RadExtract for structuring radiology reports, which highlights not only what was extracted but also exactly where that information appears in the original input.

How LangExtract compares

Feature               | Traditional approaches       | LangExtract approach
----------------------|------------------------------|---------------------------------------------------
Schema consistency    | Often manual and error-prone | Enforced via instructions and few-shot examples
Result traceability   | Minimal                      | All output linked to input text
Scaling to long texts | Windowed, lossy              | Chunking + parallel extraction, then aggregation
Visualization         | Custom, usually absent       | Built-in interactive HTML reports
Deployment            | Rigid, model-specific        | Gemini-first; open to other LLMs and local models

In summary

LangExtract offers a new approach to extracting structured, actionable data from text, delivering:

  • Declarative, explainable extraction
  • Traceable results backed by source context
  • Instant visualization for rapid iteration
  • Easy integration into any Python workflow

Check out the GitHub page and the technical blog.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that provides in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million views per month, demonstrating its popularity among its audience.