Title: | Automated Topic Labeling with Language Models |
---|---|
Description: | Leveraging (large) language models for automatic topic labeling. The main function converts a list of top terms into a label for each topic. Hence, it is complementary to any topic modeling package that produces a list of top terms for each topic. While human judgement is indispensable for topic validation (i.e., inspecting top terms and most representative documents), automatic topic labeling can be a valuable tool for researchers in various scenarios. |
Authors: | Jonas Rieger [aut, cre] |
Maintainer: | Jonas Rieger <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.0 |
Built: | 2025-02-18 06:28:17 UTC |
Source: | https://github.com/petersfritz/topiclabels |
Maintainer: Jonas Rieger [email protected] (ORCID)
Authors:
Fritz Peters [email protected] (ORCID)
Andreas Fischer [email protected] (ORCID)
Tim Lauer [email protected] (ORCID)
André Bittermann [email protected] (ORCID)
Useful links:
Report bugs at https://github.com/PetersFritz/topiclabels/issues
Constructor for lm_topic_labels objects used in this package.
    as.lm_topic_labels(
      x, terms, prompts, model, params, with_token,
      time, model_output, labels
    )

    is.lm_topic_labels(obj, verbose = FALSE)
x |
[ |
terms |
[ |
prompts |
[ |
model |
[ |
params |
[ |
with_token |
[ |
time |
[ |
model_output |
[ |
labels |
[ |
obj |
[ |
verbose |
[ |
If you call as.lm_topic_labels on an object x that already has the structure of an lm_topic_labels object (in particular, an lm_topic_labels object itself), the additional arguments (terms, prompts, model, ...) may be used to override the corresponding elements.
[named list] lm_topic_labels object.
    ## Not run:
    token = "" # please insert your hf token here
    topwords_matrix = matrix(c("zidane", "figo", "kroos",
                               "gas", "power", "wind"), ncol = 2)
    obj = label_topics(topwords_matrix, token = token)
    obj$model
    obj_modified = as.lm_topic_labels(obj, model = "It is possible to modify individual entries")
    obj_modified$model
    obj_modified$model = 3.5 # example for an invalid modification
    is.lm_topic_labels(obj_modified, verbose = TRUE)

    obj_manual = as.lm_topic_labels(terms = list(c("zidane", "figo", "kroos"),
                                                 c("gas", "power", "wind")),
                                    model = "manual labels",
                                    labels = c("Football Players", "Energy Supply"))
    ## End(Not run)
Performs an automated labeling process of topics from topic models using language models. For this, the top terms and (optionally) a short context description are used.
    label_topics(...)

    ## Default S3 method:
    label_topics(
      terms,
      model = "mistralai/Mixtral-8x7B-Instruct-v0.1",
      params = list(),
      token = NA_character_,
      context = "",
      sep_terms = "; ",
      max_length_label = 5L,
      prompt_type = c("json", "plain", "json-roles"),
      max_wait = 0L,
      progress = TRUE,
      ...
    )

    ## S3 method for class 'labelTopics'
    label_topics(terms, stm_type = c("prob", "frex", "lift", "score"), ...)
... |
additional arguments |
terms |
[ |
model |
[ |
params |
[ |
token |
[ |
context |
[ |
sep_terms |
[ |
max_length_label |
[ |
prompt_type |
[ |
max_wait |
[ |
progress |
[ |
stm_type |
[ |
The function builds prompts from the top terms and sends them to language models hosted on Hugging Face. The output is then post-processed so that a label for each topic is extracted automatically. If the automatically extracted labels contain errors, they can instead be extracted with custom functions or manually from the original model output, which is stored in the model_output entry of the lm_topic_labels object.
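As an illustration of a custom extraction function, the following minimal sketch pulls the 'label' field out of raw model responses with a regular expression. The example strings and the helper name extract_label are hypothetical and not part of the package; the sketch also assumes that the stored raw outputs can be treated as a character vector with one response per topic.

```r
# Hypothetical raw responses; in practice these would come from the
# model_output entry of an lm_topic_labels object (assumed here to be
# coercible to a character vector with one response per topic).
raw_output <- c(
  '{"label": "Football Players"}',
  'Sure! Here is the label: {"label": "Energy Supply"} Hope this helps.',
  'Energy'
)

# Hypothetical helper (not part of the package): extract the value of
# the "label" field; returns NA if no such field is found.
extract_label <- function(x) {
  pattern <- '"label"[[:space:]]*:[[:space:]]*"([^"]*)"'
  hits <- regmatches(x, regexec(pattern, x))
  vapply(
    hits,
    function(m) if (length(m) >= 2L) m[2L] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

labels <- extract_label(raw_output)
print(labels)
```

Responses without a parsable 'label' field yield NA, which makes it easy to spot the topics that still need manual inspection.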
The implemented default parameters for the models HuggingFaceH4/zephyr-7b-beta, tiiuae/falcon-7b-instruct, and mistralai/Mixtral-8x7B-Instruct-v0.1 are:
max_new_tokens: 300
return_full_text: FALSE
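How user-supplied params interact with these defaults is not spelled out here. One plausible reading, shown below as a sketch only, is that entries passed via params override the defaults and all other defaults are kept. The use of utils::modifyList and the temperature entry are assumptions made for illustration, not the package's documented behavior.

```r
# Defaults as documented for the three models named above.
default_params <- list(max_new_tokens = 300, return_full_text = FALSE)

# Hypothetical user input: override one default and add one extra
# generation parameter (temperature is used for illustration only).
user_params <- list(max_new_tokens = 50, temperature = 0.7)

# Assumption: user entries take precedence over defaults.
merged <- utils::modifyList(default_params, user_params)
str(merged)
```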
Implemented prompt types are:
json: the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic
plain: the language model is asked to return an answer that should only consist of the best label for the topic
json-roles: the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic; in addition, the model is queried using identifiers for <|user|> input and the beginning of the <|assistant|> output
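To make the role of terms, context, sep_terms, and max_length_label more concrete, here is a rough sketch of how a json-type prompt could be assembled from a matrix of top terms. The helper build_prompt and the prompt wording are hypothetical; the actual prompt text used by the package is not reproduced here. The sketch further assumes that each matrix column holds the top terms of one topic and that max_length_label counts words.

```r
# Hypothetical helper (not part of the package): build one prompt per topic.
build_prompt <- function(terms, context = "", sep_terms = "; ",
                         max_length_label = 5L) {
  term_string <- paste(terms, collapse = sep_terms)
  # Mention the context only if one was supplied.
  context_part <- if (nzchar(context)) {
    paste0(" The context of the topic is: ", context, ".")
  } else {
    ""
  }
  paste0(
    "Find a label of at most ", max_length_label,
    " words for the topic with these top terms: ", term_string, ".",
    context_part,
    " Respond in JSON format with a single field called 'label'."
  )
}

topwords_matrix <- matrix(c("zidane", "figo", "kroos",
                            "gas", "power", "wind"), ncol = 2)

# Assumption: each column of the matrix holds the top terms of one topic.
terms_list <- lapply(seq_len(ncol(topwords_matrix)),
                     function(j) topwords_matrix[, j])

prompts <- vapply(terms_list, build_prompt, character(1))
print(prompts)
```

A context such as "Elements of the Earth" would be passed through the context argument of the helper and appended to the prompt in the same way.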
[named list] lm_topic_labels object.
    ## Not run:
    token = "" # please insert your hf token here
    topwords_matrix = matrix(c("zidane", "figo", "kroos",
                               "gas", "power", "wind"), ncol = 2)
    label_topics(topwords_matrix, token = token)
    label_topics(list(c("zidane", "figo", "kroos"),
                      c("gas", "power", "wind")),
                 token = token)
    label_topics(list(c("zidane", "figo", "ronaldo"),
                      c("gas", "power", "wind")),
                 token = token)

    label_topics(list("wind", "greta", "hambach"), token = token)
    label_topics(list("wind", "fire", "air"), token = token)
    label_topics(list("wind", "feuer", "luft"), token = token)
    label_topics(list("wind", "feuer", "luft"),
                 context = "Elements of the Earth",
                 token = token)
    ## End(Not run)