| Title: | Automated Topic Labeling with Language Models |
|---|---|
| Description: | Leveraging (large) language models for automatic topic labeling. The main function converts a list of top terms into a label for each topic. Hence, it is complementary to any topic modeling package that produces a list of top terms for each topic. While human judgement is indispensable for topic validation (i.e., inspecting top terms and most representative documents), automatic topic labeling can be a valuable tool for researchers in various scenarios. |
| Authors: | Jonas Rieger [aut, cre] (ORCID: <https://orcid.org/0000-0002-0007-4478>), Fritz Peters [aut] (ORCID: <https://orcid.org/0009-0003-8471-4931>), Andreas Fischer [aut] (ORCID: <https://orcid.org/0009-0006-0748-6076>), Tim Lauer [aut] (ORCID: <https://orcid.org/0009-0003-1625-1672>), André Bittermann [aut] (ORCID: <https://orcid.org/0000-0003-2942-9831>) |
| Maintainer: | Jonas Rieger <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.4.0 |
| Built: | 2026-05-27 06:26:51 UTC |
| Source: | https://github.com/petersfritz/topiclabels |
Useful links:
Report bugs at https://github.com/PetersFritz/topiclabels/issues
Constructor for lm_topic_labels objects used in this package.

    as.lm_topic_labels(
      x, terms, prompts, model, params, with_token, time, model_output, labels
    )

    is.lm_topic_labels(obj, verbose = FALSE)

x |
[ |
terms |
[ |
prompts |
[ |
model |
[ |
params |
[ |
with_token |
[ |
time |
[ |
model_output |
[ |
labels |
[ |
obj |
[ |
verbose |
[ |
If you call as.lm_topic_labels on an object x that already has the
structure of a lm_topic_labels object (in particular, a lm_topic_labels
object itself), the additional arguments (terms, prompts, model, ...)
can be used to override the corresponding elements.
[named list] lm_topic_labels object.

    ## Not run:
    token = "" # please insert your hf token here
    topwords_matrix = matrix(c("zidane", "figo", "kroos", "gas", "power", "wind"), ncol = 2)
    obj = label_topics(topwords_matrix, token = token)
    obj$model
    obj_modified = as.lm_topic_labels(obj, model = "It is possible to modify individual entries")
    obj_modified$model
    obj_modified$model = 3.5 # example for an invalid modification
    is.lm_topic_labels(obj_modified, verbose = TRUE)
    obj_manual = as.lm_topic_labels(
      terms = list(c("zidane", "figo", "kroos"), c("gas", "power", "wind")),
      model = "manual labels",
      labels = c("Football Players", "Energy Supply"))
    ## End(Not run)

Performs an automated labeling process of topics from topic models using language models. For this, the top terms and (optionally) a short context description are used.

    label_topics(...)

    ## Default S3 method:
    label_topics(
      terms,
      model = "deepseek-ai/DeepSeek-V3.2-Exp:novita",
      params = list(),
      token = NA_character_,
      context = "",
      sep_terms = "; ",
      max_length_label = 5L,
      prompt_type = c("json", "plain", "json-roles"),
      max_wait = 0L,
      progress = TRUE,
      ...
    )

    ## S3 method for class 'labelTopics'
    label_topics(terms, stm_type = c("prob", "frex", "lift", "score"), ...)

... |
additional arguments |
terms |
[ |
model |
[ |
params |
[ |
token |
[ |
context |
[ |
sep_terms |
[ |
max_length_label |
[ |
prompt_type |
[ |
max_wait |
[ |
progress |
[ |
stm_type |
[ |
The function builds prompts from the top terms and sends them to language
models hosted on Hugging Face. The model output is then post-processed so
that the label for each topic is extracted automatically. If the
automatically extracted labels contain errors, the labels can instead be
extracted with custom functions, or manually, from the original model output
stored in the model_output entry of the lm_topic_labels object.
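
When the automatic extraction fails, the raw responses can be post-processed by hand. The following is a minimal sketch in base R, assuming that `model_output` holds one character string per topic and that the model answered roughly in the JSON form requested by the `json` prompt type. The names `raw_output` and `extract_label` are hypothetical and not part of the package, and the example strings are made up.

```r
# Hypothetical raw responses, standing in for obj$model_output
raw_output <- c(
  'Sure! {"label": "Football Players"}',
  '{ "label" : "Energy Supply" } I hope this helps.',
  'no json here'
)

# Extract the value of the "label" field with a regular expression;
# returns NA for responses that contain no such field
extract_label <- function(x) {
  pattern <- '"label"[[:space:]]*:[[:space:]]*"([^"]*)"'
  matches <- regmatches(x, regexec(pattern, x))
  vapply(
    matches,
    function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

labels <- extract_label(raw_output)
print(labels)
```

Responses that yield `NA` are the ones worth inspecting manually.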
Implemented default parameters for the models google/gemma-2-2b-it and
deepseek-ai/DeepSeek-V3.2-Exp:novita are:
- max_new_tokens: 300
- return_full_text: FALSE
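
One plausible reading of these defaults is that entries passed via `params` take precedence, while unspecified entries fall back to the defaults. The sketch below illustrates such a merge with base R's `modifyList`. Whether the package combines the values in exactly this way is an assumption; `default_params` and `merge_params` are hypothetical names, and `temperature` is only an illustrative extra entry.

```r
# Defaults as listed above
default_params <- list(max_new_tokens = 300, return_full_text = FALSE)

# Assumed merge logic: user-supplied entries override or extend the defaults
merge_params <- function(params = list(), defaults = default_params) {
  modifyList(defaults, params)
}

merged <- merge_params(list(max_new_tokens = 50, temperature = 0.7))
str(merged)
```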
Implemented prompt types are:
- json: the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic
- plain: the language model is asked to return an answer that should only consist of the best label for the topic
- json-roles: the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic; in addition, the model is queried using identifiers for <|user|> input and the beginning of the <|assistant|> output
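
To make the roles of `sep_terms`, `context`, and `max_length_label` more concrete, here is a rough sketch of how a `json`-style prompt could be assembled from a topic's top terms. It assumes that `max_length_label` limits the number of words in the label. The prompt wording and the helper name `build_prompt` are invented for illustration; the prompts the package actually sends may differ.

```r
# Hypothetical prompt builder; the wording is NOT the package's prompt
build_prompt <- function(terms, context = "", sep_terms = "; ",
                         max_length_label = 5L) {
  # Join the top terms with the chosen separator
  term_string <- paste(terms, collapse = sep_terms)
  # Only mention the context if one was supplied
  context_part <- if (nzchar(context)) {
    paste0(" The topics come from the following context: ", context, ".")
  } else {
    ""
  }
  paste0(
    "Find a label of at most ", max_length_label,
    " words for the topic described by these terms: ", term_string, ".",
    context_part,
    " Respond in JSON format with a single field called 'label'."
  )
}

topics <- list(c("zidane", "figo", "kroos"), c("gas", "power", "wind"))
prompts <- vapply(topics, build_prompt, character(1),
                  context = "Elements of the Earth")
print(prompts)
```

One prompt is produced per topic, which mirrors the one-label-per-topic output described above.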
[named list] lm_topic_labels object.

    ## Not run:
    token = "" # please insert your hf token here
    topwords_matrix = matrix(c("zidane", "figo", "kroos", "gas", "power", "wind"), ncol = 2)
    label_topics(topwords_matrix, token = token)
    label_topics(list(c("zidane", "figo", "kroos"), c("gas", "power", "wind")), token = token)
    label_topics(list(c("zidane", "figo", "ronaldo"), c("gas", "power", "wind")), token = token)
    label_topics(list("wind", "greta", "hambach"), token = token)
    label_topics(list("wind", "fire", "air"), token = token)
    label_topics(list("wind", "feuer", "luft"), token = token)
    label_topics(list("wind", "feuer", "luft"), context = "Elements of the Earth", token = token)
    ## End(Not run)