
    GliNER2: Extracting Structured Information from Text

    By ProfitlyAI | January 6, 2026 | 12 min read


    Before the rise of LLMs, we had SpaCy, which was the de facto NLP library for both newcomers and advanced users. It made it easy to dip your toes into NLP, even if you weren’t a deep learning expert. However, with the rise of ChatGPT and other LLMs, it seems to have been pushed aside.

    While LLMs like Claude or Gemini can do all sorts of NLP tasks automagically, you don’t always want to bring a rocket launcher to a fist fight. GliNER is spearheading the return of smaller, focused models for classic NLP techniques like entity and relation extraction. It’s lightweight enough to run on a CPU, yet powerful enough to have built a thriving community around it.

    Released earlier this year, GliNER2 is a big leap forward. Where the original GliNER focused on entity recognition (spawning various spin-offs like GLiREL for relations and GLiClass for classification), GliNER2 unifies named entity recognition, text classification, relation extraction, and structured data extraction into a single framework.

    The core shift in GliNER2 is its schema-driven approach, which lets you define extraction requirements declaratively and execute multiple tasks in a single inference call. Despite these expanded capabilities, the model remains CPU-efficient, making it an ideal solution for transforming messy, unstructured text into clean data without the overhead of a large language model.

    As a knowledge graph enthusiast at Neo4j, I’ve been particularly drawn to the newly added structured data extraction via the extract_json method. While entity and relation extraction are useful on their own, the ability to define a schema and pull structured JSON directly from text is what really excites me. It’s a natural fit for knowledge graph ingestion, where structured, consistent output is essential.

    Constructing knowledge graphs with GliNER2. Image by author.

    In this blog post, we’ll evaluate GliNER2’s capabilities, specifically the model fastino/gliner2-large-v1, with a focus on how well it can help us build clean, structured knowledge graphs.

    The code is available on GitHub.

    Dataset selection

    We’re not running formal benchmarks here, just a quick vibe check to see what GliNER2 can do. Here’s our test text, pulled from the Ada Lovelace Wikipedia page:

    Augusta Ada King, Countess of Lovelace (10 December 1815 – 27 November 1852), also known as Ada Lovelace, was an English mathematician and writer chiefly known for her work on Charles Babbage’s proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation. Lovelace is often considered the first computer programmer. Lovelace was the only legitimate child of poet Lord Byron and reformer Anne Isabella Milbanke. All her half-siblings, Lord Byron’s other children, were born out of wedlock to other women. Lord Byron separated from his wife a month after Ada was born and left England forever. He died in Greece during the Greek War of Independence, when she was eight. Lady Byron was anxious about her daughter’s upbringing and promoted Lovelace’s interest in mathematics and logic, to prevent her from developing her father’s perceived insanity. Despite this, Lovelace remained interested in her father, naming one son Byron and the other, for her father’s middle name, Gordon. Lovelace was buried next to her father at her own request. Although often ill in childhood, Lovelace pursued her studies assiduously. She married William King in 1835. King was a baron, and was created Viscount Ockham and 1st Earl of Lovelace in 1838. The title Lovelace was chosen because Ada was descended from the extinct Barons Lovelace. The title given to her husband thus made Ada the Countess of Lovelace.

    At 322 tokens, it’s a solid chunk of text to work with. Let’s dive in.
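
    Throughout the post, extractor refers to a loaded GliNER2 model. Here’s a minimal setup sketch; the package name and from_pretrained constructor are assumptions based on the GLiNER2 project’s documented usage, so adjust them to the actual API of the version you install:

    # Assumed package name and constructor; verify against the GLiNER2 documentation
    # pip install gliner2
    from gliner2 import GLiNER2

    # Load the large checkpoint evaluated in this post; it runs on CPU
    extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")

    # `text` holds the Ada Lovelace paragraph shown above
    text = "Augusta Ada King, Countess of Lovelace (10 December 1815 - 27 November 1852), ..."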

    Entity extraction

    Let’s begin with entity extraction. At its core, entity extraction is the process of automatically identifying and categorizing key entities within text, such as people, locations, organizations, or technical concepts. GliNER1 already handled this well, but GliNER2 takes it further by letting you add descriptions to entity types, giving you finer control over what gets extracted.

    entities = extractor.extract_entities(
        text,
        {
            "Person": "Names of people, including nobility titles.",
            "Location": "Countries, cities, or geographic locations.",
            "Invention": "Machines, devices, or technological creations.",
            "Event": "Historical events, wars, or conflicts."
        }
    )

    The results are the following:

    Entity extraction results. Image by author.

    Providing custom descriptions for each entity type helps resolve ambiguity and improves extraction accuracy. This is especially useful for broad categories like Event, where on its own, the model might not know whether to include wars, ceremonies, or personal milestones. Adding "historical events, wars, or conflicts" clarifies the intended scope.

    Relation extraction

    Relation extraction identifies relationships between pairs of entities in text. For example, in the sentence “Steve Jobs founded Apple”, a relation extraction model would identify the relation Founded between the entities Steve Jobs and Apple.

    With GLiNER2, you define only the relation types you want to extract, as you can’t constrain which entity types are allowed as the head or tail of each relation. This simplifies the interface but may require post-processing to filter unwanted pairings.
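
    As a rough illustration, here’s a minimal post-processing sketch that enforces head/tail type constraints after the fact. It assumes the relation output maps each relation type to (head, tail) name pairs and that you already have an entity-name-to-type mapping from entity extraction; the exact output shape may differ:

    # Hypothetical allowed head/tail entity types per relation; anything not listed is kept as-is
    ALLOWED_TYPES = {
        "parent_of": ("Person", "Person"),
        "married_to": ("Person", "Person"),
        "worked_on": ("Person", "Invention"),
        "invented": ("Person", "Invention"),
    }

    def filter_relations(relations, entity_types):
        """Drop relation pairs whose head or tail entity type violates ALLOWED_TYPES."""
        filtered = {}
        for rel_type, pairs in relations.items():
            if rel_type not in ALLOWED_TYPES:
                filtered[rel_type] = pairs  # no constraint defined for this relation type
                continue
            head_type, tail_type = ALLOWED_TYPES[rel_type]
            filtered[rel_type] = [
                (head, tail)
                for head, tail in pairs
                if entity_types.get(head) == head_type and entity_types.get(tail) == tail_type
            ]
        return filtered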

    Here, I added a simple experiment by including both the alias and the same_as relation definitions.

    relations = extractor.extract_relations(
        text,
        {
            "parent_of": "A person is the parent of another person",
            "married_to": "A person is married to another person",
            "worked_on": "A person contributed to or worked on an invention",
            "invented": "A person created or proposed an invention",
            "alias": "Entity is an alias, nickname, title, or alternate reference for another entity",
            "same_as": "Entity is an alias, nickname, title, or alternate reference for another entity"
        }
    )

    The results are the following:

    Relation extraction results. Image by author.

    The extraction correctly identified the key relationships: Lord Byron and Anne Isabella Milbanke as Ada’s parents, her marriage to William King, Babbage as the inventor of the Analytical Engine, and Ada’s work on it. Notably, the model detected Augusta Ada King as an alias of Ada Lovelace, but same_as wasn’t captured despite having an identical description. The choice doesn’t seem random, as the model always populates the alias relation but never same_as. This highlights how sensitive relation extraction is to label naming, not just descriptions.

    Conveniently, GLiNER2 allows combining multiple extraction types in a single call, so you can get entity types alongside relation types in one go. However, the operations are independent: entity extraction doesn’t filter or constrain which entities appear in relation extraction, and vice versa. Think of it as running both extractions in parallel rather than as a pipeline.

    schema = (extractor.create_schema()
        .entities({
            "Person": "Names of people, including nobility titles.",
            "Location": "Countries, cities, or geographic locations.",
            "Invention": "Machines, devices, or technological creations.",
            "Event": "Historical events, wars, or conflicts."
        })
        .relations({
            "parent_of": "A person is the parent of another person",
            "married_to": "A person is married to another person",
            "worked_on": "A person contributed to or worked on an invention",
            "invented": "A person created or proposed an invention",
            "alias": "Entity is an alias, nickname, title, or alternate reference for another entity"
        })
    )

    results = extractor.extract(text, schema)

    The results are the following:

    Combined entity and relation extraction results. Image by author.

    The combined extraction now gives us entity types, which are distinguished by color. However, several nodes appear isolated (Greece, England, Greek War of Independence), since not every extracted entity participates in a detected relationship.
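
    If you prefer a graph without dangling nodes, a small filtering step can drop entities that never appear in a relation before ingestion. A minimal sketch, assuming entities come back as a label-to-names dict and relations as (head, tail) name pairs (again, the exact output shape may differ):

    def drop_isolated_entities(entities, relations):
        """Keep only entities that appear as head or tail of at least one extracted relation."""
        connected = {
            name
            for pairs in relations.values()
            for pair in pairs
            for name in pair
        }
        return {
            label: [name for name in names if name in connected]
            for label, names in entities.items()
        }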

    Structured JSON extraction

    Perhaps the most powerful feature is structured data extraction via extract_json. This mimics the structured output functionality of LLMs like ChatGPT or Gemini but runs entirely on CPU. Unlike entity and relation extraction, this lets you define arbitrary fields and pull them into structured records. The syntax follows a field_name::type::description pattern, where type is str or list.

    results = extractor.extract_json(
        text,
        {
            "person": [
                "name::str",
                "gender::str::male or female",
                "alias::str::brief summary of included information about the person",
                "description::str",
                "birth_date::str",
                "death_date::str",
                "parent_of::str",
                "married_to::str"
            ]
        }
    )

    Here we’re experimenting with some overlap: alias, parent_of, and married_to could just as well be modeled as relations. It’s worth exploring which approach works better for your use case. One interesting addition is the description field, which pushes the boundaries a bit: it’s closer to summary generation than pure extraction.

    The results are the following:

    {
      "individual": [
        {
          "name": "Augusta Ada King",
          "gender": null,
          "alias": "Ada Lovelace",
          "description": "English mathematician and writer",
          "birth_date": "10 December 1815",
          "death_date": "27 November 1852",
          "parent_of": "Ada Lovelace",
          "married_to": "William King"
        },
        {
          "name": "Charles Babbage",
          "gender": null,
          "alias": null,
          "description": null,
          "birth_date": null,
          "death_date": null,
          "parent_of": "Ada Lovelace",
          "married_to": null
        },
        {
          "name": "Lord Byron",
          "gender": null,
          "alias": null,
          "description": "reformer",
          "birth_date": null,
          "death_date": null,
          "parent_of": "Ada Lovelace",
          "married_to": null
        },
        {
          "name": "Anne Isabella Milbanke",
          "gender": null,
          "alias": null,
          "description": "reformer",
          "birth_date": null,
          "death_date": null,
          "parent_of": "Ada Lovelace",
          "married_to": null
        },
        {
          "name": "William King",
          "gender": null,
          "alias": null,
          "description": null,
          "birth_date": null,
          "death_date": null,
          "parent_of": "Ada Lovelace",
          "married_to": null
        }
      ]
    }

    The results reveal some limitations. All gender fields are null: although Ada is explicitly referred to as a daughter, the model doesn’t infer that she’s female. The description field captures only surface-level phrases (“English mathematician and writer”, “reformer”) rather than producing meaningful summaries, which isn’t useful for workflows like Microsoft’s GraphRAG that rely on richer entity descriptions. There are also clear errors: Charles Babbage and William King are incorrectly marked as parent_of Ada, and Lord Byron is labeled a reformer (that’s Anne Isabella). These parent_of errors didn’t come up during relation extraction, so perhaps that’s the better approach here. Overall, the results suggest the model excels at extraction but struggles with reasoning or inference, likely a tradeoff of its compact size.

    Additionally, all attributes are optional, which makes sense and simplifies things. However, you have to be careful, as sometimes the name attribute will be null, making the record invalid. Finally, we could use something like Pydantic to validate results, cast values to appropriate types like floats or dates, and handle invalid records.
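
    As a rough sketch of that validation step (the field names mirror the schema above; the date format and the required name field are assumptions, and nothing here is part of GliNER2 itself):

    from datetime import datetime
    from typing import Optional

    from pydantic import BaseModel, ValidationError, field_validator

    class Person(BaseModel):
        name: str                           # required: records with a null name are rejected
        gender: Optional[str] = None
        alias: Optional[str] = None
        description: Optional[str] = None
        birth_date: Optional[datetime] = None
        death_date: Optional[datetime] = None
        parent_of: Optional[str] = None
        married_to: Optional[str] = None

        @field_validator("birth_date", "death_date", mode="before")
        @classmethod
        def parse_date(cls, value):
            # Cast strings like "10 December 1815" to datetime; pass nulls through
            if isinstance(value, str):
                return datetime.strptime(value, "%d %B %Y")
            return value

    valid_people = []
    for record in results.get("person", []):
        try:
            valid_people.append(Person(**record))
        except ValidationError as error:
            print(f"Skipping invalid record: {error}")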

    Constructing knowledge graphs

    Since GLiNER2 allows multiple extraction types in a single pass, we can combine all of the above techniques to construct a knowledge graph. Rather than running separate pipelines for entity, relation, and structured data extraction, a single schema definition handles all three. This makes it easy to go from raw text to a rich, interconnected representation.

    schema = (extractor.create_schema()
        .entities({
            "Person": "Names of people, including nobility titles.",
            "Location": "Countries, cities, or geographic locations.",
            "Invention": "Machines, devices, or technological creations.",
            "Event": "Historical events, wars, or conflicts."
        })
        .relations({
            "parent_of": "A person is the parent of another person",
            "married_to": "A person is married to another person",
            "worked_on": "A person contributed to or worked on an invention",
            "invented": "A person created or proposed an invention",
        })
        .structure("person")
            .field("name", dtype="str")
            .field("alias", dtype="str")
            .field("description", dtype="str")
            .field("birth_date", dtype="str")
    )

    results = extractor.extract(text, schema)

    How you map these outputs to your graph (nodes, relationships, properties) depends on your data model. In this example, we use the following data model:

    Knowledge graph construction result. Image by author.

    What you can notice is that we include the original text chunk in the graph as well, which allows us to retrieve and reference the source material when querying the graph, enabling more accurate and traceable results. The import Cypher looks like the following:

    import_cypher_query = """
    // Create Chunk node from text
    CREATE (c:Chunk {text: $text})
    
    // Create Person nodes with properties
    WITH c
    CALL (c) {
      UNWIND $data.person AS p
      WITH p
      WHERE p.name IS NOT NULL
      MERGE (n:__Entity__ {name: p.name})
      SET n.description = p.description,
          n.birth_date = p.birth_date
      MERGE (c)-[:MENTIONS]->(n)
      WITH p, n WHERE p.alias IS NOT NULL
      MERGE (m:__Entity__ {name: p.alias})
      MERGE (n)-[:ALIAS_OF]->(m)
    }
    
    // Create entity nodes dynamically with __Entity__ base label + dynamic label
    CALL (c) {
      UNWIND keys($data.entities) AS label
      UNWIND $data.entities[label] AS entityName
      MERGE (n:__Entity__ {name: entityName})
      SET n:$(label)
      MERGE (c)-[:MENTIONS]->(n)
    }
    
    // Create relationships dynamically
    CALL (c) {
      UNWIND keys($data.relation_extraction) AS relType
      UNWIND $data.relation_extraction[relType] AS rel
      MATCH (a:__Entity__ {name: rel[0]})
      MATCH (b:__Entity__ {name: rel[1]})
      MERGE (a)-[:$(toUpper(relType))]->(b)
    }
    RETURN distinct 'import completed' AS result
    """

    The Cypher query takes the GliNER2 output and stores it in Neo4j. We could also include embeddings for the text chunks, entities, and so on.
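
    To close the loop, here’s a minimal sketch of running the import with the official neo4j Python driver. The connection details are placeholders, and results is assumed to be a plain dict whose keys match those referenced in the query (person, entities, relation_extraction); adjust the shape if the actual output differs:

    from neo4j import GraphDatabase

    # Placeholder connection details; adjust to your Neo4j instance
    URI = "bolt://localhost:7687"
    AUTH = ("neo4j", "password")

    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        # Keyword arguments become Cypher query parameters ($text, $data)
        driver.execute_query(
            import_cypher_query,
            text=text,
            data=results,
        )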

    Summary

    GliNER2 is a step in the right direction for structured data extraction. With the rise of LLMs, it’s easy to reach for ChatGPT or Claude whenever you need to pull information from text, but that’s often overkill. Running a multi-billion-parameter model to extract a few entities and relationships feels wasteful when smaller, specialized tools can do the job on a CPU.

    GliNER2 unifies named entity recognition, relation extraction, and structured JSON output into a single framework. It’s well-suited for tasks like knowledge graph construction, where you need consistent, schema-driven extraction rather than open-ended generation.

    That said, the model has its limitations. It works best for direct extraction rather than inference or reasoning, and results can be inconsistent. But the progress from GliNER1 to GliNER2 is encouraging, and hopefully we’ll see continued development in this space. For many use cases, a focused extraction model beats an LLM that’s doing far more than you need.

    The code is available on GitHub.


