By: Petr Koráb*, Martin Feldkircher**,***, Viktoriya Teliha** (*Text Mining Stories, Prague, **Vienna School of International Studies, ***Centre for Applied Macroeconomic Analysis, Australia).
Labeling the groups of words produced by topic models requires domain expertise and may be subjective to the labeler. Especially when the number of topics grows large, it can be convenient to assign human-readable names to topics automatically with an LLM. Simply copying and pasting the results into UIs, such as chatgpt.com, is something of a "black box" and unsystematic. A better option is to add topic labeling to the code with a documented labeler, which gives the engineer more control over the results and ensures reproducibility. This tutorial will explore in detail:
- How to train a topic model with the fresh new Turftopic Python package
- How to label topic model results with GPT-4o mini.
We will train a state-of-the-art FASTopic model by Xiaobao Wu et al. [3], presented at last year's NeurIPS. This model outperforms other competing models, such as BERTopic, in several key metrics (e.g., topic diversity) and has broad applications in business intelligence.
1. Components of the Topic Modelling Pipeline
Labelling is an essential part of the topic modelling pipeline because it bridges the model outputs with real-world decisions. The model assigns a number to each topic, but a business decision relies on a human-readable text label summarizing the typical terms in each topic. The models are typically labelled by (1) labellers with domain expertise, often using a well-defined labelling strategy, (2) LLMs, and (3) commercial tools. The path from raw data to decision-making through a topic model is well explained in Image 1.
Source: adapted and extended from Kardos et al. [2].
The pipeline begins with raw data, which is preprocessed and vectorized for the topic model. The model returns topics named with integers, along with typical terms (words or bigrams). The labeling layer replaces the integer in the topic name with the text label. The model user (product manager, customer care dept., etc.) then works with labelled terms to make data-informed decisions. In the following modeling example, we will follow it step by step.
2. Data
We will use FASTopic to classify customer complaints data into 10 topics. The example use case relies on the synthetically generated Customer Care Email dataset available on Kaggle, licensed under the GPL-3 license. The prefiltered data covers 692 incoming emails to the customer care department and looks like this:

2.1. Data preprocessing
Text data is sequentially preprocessed in six steps. Numbers are removed first, followed by emojis. English stopwords are removed afterward, followed by punctuation. Extra tokens (such as company and person names) are removed in the next step before lemmatization. Read more on text preprocessing for topic models in our previous tutorial.
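The six steps can be sketched as follows. This is a minimal illustration only, not the exact preprocessing used for the dataset: the stopword set, the extra-token list, and the lemma lookup are stand-in assumptions (in practice a library such as NLTK or spaCy would supply the stopword list and lemmatizer).

```python
import re
import string

# Illustrative resources; real ones would come from NLTK/spaCy and a custom list
STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "i", "my"}
EXTRA_TOKENS = {"acme", "john"}                   # e.g. company and person names
LEMMAS = {"orders": "order", "shipped": "ship"}   # stand-in for a real lemmatizer

def preprocess(text: str) -> str:
    text = re.sub(r"\d+", "", text)                                 # 1. remove numbers
    text = re.sub(r"[^\x00-\x7F]+", "", text)                       # 2. remove emojis / non-ASCII
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]              # 3. remove stopwords
    tokens = [t.strip(string.punctuation) for t in tokens]          # 4. remove punctuation
    tokens = [t for t in tokens if t and t not in EXTRA_TOKENS]     # 5. remove extra tokens
    tokens = [LEMMAS.get(t, t) for t in tokens]                     # 6. lemmatize
    return " ".join(tokens)

print(preprocess("John shipped my 2 orders to ACME!"))
```

The order of the steps matters: removing stopwords before stripping punctuation, for instance, keeps contractions intact long enough to match the stopword list.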
First, we read the clean data and tokenize the dataset:
import pandas as pd
# Read data
data = pd.read_csv("data.csv", usecols=['message_clean'])
# Create corpus list
docs = data["message_clean"].tolist()

2.2. Bigram vectorization
Next, we create a bigram tokenizer to process tokens as bigrams during model training. Bigram models provide more relevant information and identify key qualities and problems for business decisions better than single-word models ("delivery" vs. "poor delivery", "stomach" vs. "sensitive stomach", etc.).
from sklearn.feature_extraction.text import CountVectorizer
bigram_vectorizer = CountVectorizer(
    ngram_range=(2, 2),   # only bigrams
    max_features=1000     # top 1000 bigrams by frequency
)
3. Model training
The FASTopic model is currently implemented in two Python packages:
- Fastopic: official package by X. Wu
- Turftopic: a new Python package that brings many useful topic modeling features, including labeling with LLMs [2]
We will use the Turftopic implementation because of the direct link between the model and the Namer that provides LLM labelling.
Let's set up the model and fit it to the data. It is essential to set a random state to ensure training reproducibility.
from turftopic import FASTopic
# Model specification
topic_size = 10
model = FASTopic(n_components=topic_size,       # train for 10 topics
                 vectorizer=bigram_vectorizer,  # generate bigrams in topics
                 random_state=32).fit(docs)     # set random state
# Fit model to corpus
topic_data = model.prepare_topic_data(docs)
Now, let's prepare a dataframe with topic IDs and the top 10 bigrams with the highest probability obtained from the model (code is here).
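The linked code is not reproduced here, but the idea can be sketched: given a topic-term score matrix of the kind the model learns, keep the 10 highest-scoring bigrams per topic and record their rank. The toy matrix, vocabulary, and variable names below are illustrative assumptions, not the model's actual output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(32)
vocab = np.array([f"bigram_{i}" for i in range(50)])   # stand-in bigram vocabulary
topic_term = rng.random((10, 50))                      # stand-in topic-term scores

rows = []
for topic_id, scores in enumerate(topic_term):
    top = np.argsort(scores)[::-1][:10]                # indices of the 10 best bigrams
    for rank, idx in enumerate(top, start=1):
        rows.append({"topic_id": topic_id,
                     "word_rank": rank,
                     "bigram": vocab[idx],
                     "probability": scores[idx]})

topics_long = pd.DataFrame(rows)   # 10 topics x 10 bigrams = 100 rows
print(topics_long.head())
```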

4. Topic labeling
In the next step, we add text labels to the topic IDs with GPT-4o mini. With this code, we label the topics and add a new column, topic_name, to the dataframe.
from turftopic.namers import OpenAITopicNamer
import os

# OpenAI API key to access GPT-4o mini
os.environ["OPENAI_API_KEY"] = ""

# Use the Namer to label the topic model with an LLM
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

# Create a dataframe with labelled topics
topics_df = model.topics_df()
topics_df.columns = ['topic_id', 'topic_name', 'topic_words']

# Split the comma-separated words and explode to one row per word
topics_df['topic_word'] = topics_df['topic_words'].str.split(',')
topics_df = topics_df.explode('topic_word')
topics_df['topic_word'] = topics_df['topic_word'].str.strip()

# Add a rank for each word within a topic
topics_df['word_rank'] = topics_df.groupby('topic_id').cumcount() + 1

# Pivot to wide format
wide = topics_df.pivot(index='word_rank',
                       columns=['topic_id', 'topic_name'],
                       values='topic_word')
Here is the table with labeled topics after additional transformations. It would be interesting to compare the LLM results with those of a company insider who is familiar with the company's processes and customer base. The dataset is synthetic, so let's rely on the GPT-4 labeling.

We can also visualize the labeled topics for a better presentation. The code for the bigram word cloud visualization, generated from the topics produced by the model, is here.
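The word cloud code itself is linked above; the key input any word cloud renderer needs is a bigram-to-weight mapping built from the model's top bigrams. The bigrams and weights below are illustrative stand-ins, and the commented-out renderer call assumes the third-party `wordcloud` package is available.

```python
from collections import Counter

# Illustrative top bigrams for one topic, with stand-in probabilities
topic_bigrams = {
    "delivery delay": 0.031,
    "refund request": 0.027,
    "damaged item": 0.024,
    "wrong size": 0.019,
}

# Scale probabilities to integer frequencies for a word cloud renderer
freqs = Counter({bg: round(w * 1000) for bg, w in topic_bigrams.items()})
print(freqs.most_common(2))

# A renderer call would then be roughly:
# from wordcloud import WordCloud
# cloud = WordCloud(width=800, height=400).generate_from_frequencies(freqs)
```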

Summary
- The new Turftopic Python package links existing topic models with the LLM-based labeler for producing human-readable topic names.
- The main benefits are: 1) independence from the labeler's subjective expertise, 2) the ability to label models with a large number of topics that a human labeler might have difficulty labeling independently, and 3) more control over the code and reproducibility.
- Topic labeling with LLMs has a wide range of applications in diverse areas. Read our latest paper on topic modeling of central bank communication, where GPT-4 labeled the FASTopic model.
- The labels are slightly different for each training run, even with the random state. This is not caused by the Namer, but by the random processes in model training that output bigrams with probabilities in descending order. The differences in probabilities are in tiny decimals, so each training generates a few new words in the top 10, which then affects the LLM labeler.
The data and full code for this tutorial are here.
Petr Korab is a Senior Data Analyst and Founder of Text Mining Stories with over eight years of experience in Business Intelligence and NLP.
Sign up for our blog to get the latest news from the NLP industry!
References
[1] Feldkircher, M., Korab, P., Teliha, V. (2025). "What Do Central Bankers Talk About? Evidence From the BIS Archive," CAMA Working Papers 2025–35, Centre for Applied Macroeconomic Analysis, Crawford School of Public Policy, The Australian National University.
[2] Kardos, M., Enevoldsen, K. C., Kostkan, J., Kristensen-McLachlan, R. D., Rocca, R. (2025). Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers. Journal of Open Source Software, 10(111), 8183, https://doi.org/10.21105/joss.08183.
[3] Wu, X., Nguyen, T., Ce Zhang, D., Yang Wang, W., Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv preprint: 2405.17978.