Package 'topiclabels' reference manual

Title:	Automated Topic Labeling with Language Models
Description:	Leveraging (large) language models for automatic topic labeling. The main function converts a list of top terms into a label for each topic. Hence, it is complementary to any topic modeling package that produces a list of top terms for each topic. While human judgement is indispensable for topic validation (i.e., inspecting top terms and most representative documents), automatic topic labeling can be a valuable tool for researchers in various scenarios.
Authors:	Jonas Rieger [aut, cre], Fritz Peters [aut], Andreas Fischer [aut], Tim Lauer [aut], André Bittermann [aut]
Maintainer:	Jonas Rieger <[email protected]>
License:	GPL (>= 3)
Version:	0.2.0
Built:	2025-02-18 06:28:17 UTC
Source:	https://github.com/petersfritz/topiclabels

Automated Topic Labeling with Language Models

Description

Leveraging (large) language models for automatic topic labeling. The main function converts a list of top terms into a label for each topic. Hence, it is complementary to any topic modeling package that produces a list of top terms for each topic. While human judgement is indispensable for topic validation (i.e., inspecting top terms and most representative documents), automatic topic labeling can be a valuable tool for researchers in various scenarios.

Labeling function

label_topics

Constructor

as.lm_topic_labels

Author(s)

Maintainer: Jonas Rieger [email protected] (ORCID)

Authors:

Fritz Peters [email protected] (ORCID)
Andreas Fischer [email protected] (ORCID)
Tim Lauer [email protected] (ORCID)
André Bittermann [email protected] (ORCID)

lm_topic_labels object

Description

Constructor for lm_topic_labels objects used in this package.

Usage

as.lm_topic_labels(
  x,
  terms,
  prompts,
  model,
  params,
  with_token,
  time,
  model_output,
  labels
)

is.lm_topic_labels(obj, verbose = FALSE)

Arguments

`x`	[`named list`] `lm_topic_labels` object. Alternatively each element can be passed for individual results. Individually set elements overwrite elements from `x`.
`terms`	[`list(n) of character`] List of `character` vectors, where each vector represents the top terms of a topic. Topics may consist of different numbers of top terms.
`prompts`	[`character(n)`] Optional. Each entry of the `character` vector contains the original prompt that was used to obtain the corresponding entry of `model_output`.
`model`	[`character(1)`] The language model used for labeling the topics.
`params`	[`named list`] Optional. Model parameters passed.
`with_token`	[`logical(1)`] Optional. Was the labeling executed using a Huggingface token?
`time`	[`numeric(1)`] Optional. Time needed for the labeling.
`model_output`	[`character(n)`] Optional. Each entry of the `character` vector contains the original model output obtained using the corresponding prompt from `prompts`.
`labels`	[`character(n)`] The extracted labels from `model_output`.
`obj`	[`R` object] Object to test.
`verbose`	[`logical(1)`] Should test information be given in the console?

Details

If as.lm_topic_labels is called on an object x that already has the structure of an lm_topic_labels object (in particular, an lm_topic_labels object itself), the additional arguments terms, prompts, model, ... can be used to override the corresponding elements.

Value

[named list] lm_topic_labels object.

Examples

## Not run: 
token = "" # please insert your hf token here
topwords_matrix = matrix(c("zidane", "figo", "kroos",
                           "gas", "power", "wind"), ncol = 2)
obj = label_topics(topwords_matrix, token = token)
obj$model
obj_modified = as.lm_topic_labels(obj, model = "It is possible to modify individual entries")
obj_modified$model

obj_modified$model = 3.5 # example for an invalid modification
is.lm_topic_labels(obj_modified, verbose = TRUE)

obj_manual = as.lm_topic_labels(terms = list(c("zidane", "figo", "kroos"),
                                             c("gas", "power", "wind")),
                                model = "manual labels",
                                labels = c("Football Players", "Energy Supply"))

## End(Not run)

Automatically label topics using language models based on top terms

Description

Performs an automated labeling process of topics from topic models using language models. For this, the top terms and (optionally) a short context description are used.

Usage

label_topics(...)

## Default S3 method:
label_topics(
  terms,
  model = "mistralai/Mixtral-8x7B-Instruct-v0.1",
  params = list(),
  token = NA_character_,
  context = "",
  sep_terms = "; ",
  max_length_label = 5L,
  prompt_type = c("json", "plain", "json-roles"),
  max_wait = 0L,
  progress = TRUE,
  ...
)

## S3 method for class 'labelTopics'
label_topics(terms, stm_type = c("prob", "frex", "lift", "score"), ...)

Arguments

`...`	additional arguments
`terms`	[`list (k) of character`] List (each list entry represents one topic) of `character` vectors containing the top terms representing the topics that are to be labeled. If a single `character` vector is passed, this is interpreted as the top terms of a single topic. If a `character` matrix is passed, each column is interpreted as the top terms of a topic. The outputs of the `stm` package (`labelTopics` object; please specify the type of output using the parameter `stm_type`) and the `BTM` package (`list` of `data.frame`s with entries `token` and `probability` each) are also supported.
`model`	[`character(1)`] Optional. The language model to use for labeling the topics. The model must be accessible via the Huggingface API. Default is `mistralai/Mixtral-8x7B-Instruct-v0.1`. Other promising models are `HuggingFaceH4/zephyr-7b-beta` or `tiiuae/falcon-7b-instruct`. To find more models see: https://huggingface.co/models?other=conversational&sort=likes.
`params`	[`named list`] Optional. Model parameters to pass. Default parameters for common models are given in the details section.
`token`	[`character(1)`] Optional. If you want to address the Huggingface API with a Huggingface token, enter it here. The main advantage of this is a higher rate limit.
`context`	[`character(1)`] Optional. Explanatory context for the topics to be labeled. Using a (very) brief explanation of the thematic context may greatly improve the usefulness of automatically generated topic labels.
`sep_terms`	[`character(1)`] How should the top terms of a single topic be separated in the generated prompts? Default is separation via semicolon and space.
`max_length_label`	[`integer(1)`] What is the maximum number of words a label should consist of? Default is five words.
`prompt_type`	[`character(1)`] Which prompt type should be applied? We implemented various prompt types that differ mainly in how the response of the language model is requested. Examples are given in the details section. Default is the request of a JSON output.
`max_wait`	[`integer(1)`] If the rate limit on Huggingface is reached: how long (in minutes) should the system wait before asking the user whether to continue waiting? The default is zero minutes, i.e., the user is asked every time the rate limit is reached.
`progress`	[`logical(1)`] Should a progress bar be shown? Turning it off could lead to significantly faster calculation. Default is `TRUE`.
`stm_type`	[`character(1)`] For stm topics, which type of word weighting should be used? Default is "prob".
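
The input shapes accepted for `terms` can be pictured as all being reduced to one canonical form, a list with one character vector per topic. The following is a minimal base R sketch of that idea. The helper `normalize_terms` is hypothetical and not part of the package, and the way `label_topics` handles its inputs internally may differ.

```r
# Hypothetical helper (not part of topiclabels): bring the supported
# input shapes into one canonical form, a list of character vectors.
normalize_terms <- function(terms) {
  if (is.matrix(terms)) {
    # each column holds the top terms of one topic
    return(lapply(seq_len(ncol(terms)), function(j) terms[, j]))
  }
  if (is.character(terms)) {
    # a single character vector is treated as one topic
    return(list(terms))
  }
  if (is.list(terms) && all(vapply(terms, is.data.frame, logical(1)))) {
    # BTM-style input: one data.frame per topic with a `token` column
    return(lapply(terms, function(df) as.character(df$token)))
  }
  # anything else (e.g. a list of character vectors) is returned unchanged
  terms
}

topwords_matrix <- matrix(c("zidane", "figo", "kroos",
                            "gas", "power", "wind"), ncol = 2)
str(normalize_terms(topwords_matrix))
```

The matrix check comes first because a character matrix also satisfies `is.character`.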

Details

The function builds helpful prompts based on the top terms and sends these prompts to language models on Huggingface. The output is in turn post-processed so that the labels for each topic are extracted automatically. If the automatically extracted labels show any errors, they can alternatively be extracted using custom functions or manually from the original output of the model using the model_output entry of the lm_topic_labels object.
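
A custom extraction function, as mentioned above, could look roughly like the following base R sketch. It assumes that the raw output contains a JSON-like `"label"` field somewhere in the text. The function `extract_label` is hypothetical and does not reproduce the post-processing used by the package.

```r
# Hypothetical custom extractor (not the package's internal logic):
# pull the value of a "label" field out of raw model output with a
# regular expression; return NA where no such field is found.
extract_label <- function(model_output) {
  pattern <- "\"label\"\\s*:\\s*\"([^\"]*)\""
  hits <- regmatches(model_output,
                     regexec(pattern, model_output, perl = TRUE))
  vapply(hits,
         function(h) if (length(h) >= 2L) h[2L] else NA_character_,
         character(1))
}

# Made-up raw outputs for illustration only
raw <- c("{\"label\": \"Football Players\"}",
         "Sure! Here is the JSON: {\"label\":\"Energy Supply\"}",
         "no json here")
extract_label(raw)
```

For an existing lm_topic_labels object `obj`, the same function could be applied to `obj$model_output`.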

Implemented default parameters for the models HuggingFaceH4/zephyr-7b-beta, tiiuae/falcon-7b-instruct, and mistralai/Mixtral-8x7B-Instruct-v0.1 are:

max_new_tokens: 300
return_full_text: FALSE
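
One plausible way to combine these defaults with a user-supplied `params` list is sketched below. It is an assumption that user entries take precedence over the defaults, and the `temperature` entry is only an illustrative extra parameter. The package may combine the lists differently.

```r
# Defaults as listed above; the merge step itself is an assumption
# about how user-supplied parameters could be combined with them.
default_params <- list(max_new_tokens = 300, return_full_text = FALSE)

# Illustrative user input: one overridden default, one extra parameter
user_params <- list(max_new_tokens = 50, temperature = 0.2)

# Entries in user_params replace defaults of the same name,
# new entries are appended
params <- utils::modifyList(default_params, user_params)
str(params)
```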

Implemented prompt types are:

json: the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic
plain: the language model is asked to return an answer that should only consist of the best label for the topic
json-roles: the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic; in addition, the model is queried using identifiers for <|user|> input and the beginning of the <|assistant|> output
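
To make the difference between the three prompt types concrete, the following sketch assembles one prompt per type from a vector of top terms. The function `build_prompt` and all prompt wording are hypothetical. The actual prompts sent by the package are stored in the `prompts` entry of the returned object and will read differently.

```r
# Hypothetical prompt builder; the wording is illustrative and is not
# the text that topiclabels actually sends to the model.
build_prompt <- function(terms,
                         prompt_type = c("json", "plain", "json-roles"),
                         context = "", sep_terms = "; ",
                         max_length_label = 5L) {
  prompt_type <- match.arg(prompt_type)
  # join the top terms of one topic with the chosen separator
  term_string <- paste(terms, collapse = sep_terms)
  task <- sprintf(
    "Find a label of at most %d words for the topic with the top terms: %s.",
    max_length_label, term_string)
  # prepend the optional context description
  if (nzchar(context)) {
    task <- paste0("Context: ", context, ". ", task)
  }
  json_request <- "Answer in JSON format with a single field called 'label'."
  switch(prompt_type,
    plain = paste(task, "Answer with the label only."),
    json = paste(task, json_request),
    "json-roles" = paste0("<|user|>\n", task, " ", json_request,
                          "\n<|assistant|>\n"))
}

cat(build_prompt(c("wind", "feuer", "luft"), "json-roles",
                 context = "Elements of the Earth"))
```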

Value

[named list] lm_topic_labels object.

Examples

## Not run: 
token = "" # please insert your hf token here
topwords_matrix = matrix(c("zidane", "figo", "kroos",
                           "gas", "power", "wind"), ncol = 2)
label_topics(topwords_matrix, token = token)
label_topics(list(c("zidane", "figo", "kroos"),
                  c("gas", "power", "wind")),
             token = token)
label_topics(list(c("zidane", "figo", "ronaldo"),
                  c("gas", "power", "wind")),
             token = token)

label_topics(list("wind", "greta", "hambach"),
             token = token)
label_topics(list("wind", "fire", "air"),
             token = token)
label_topics(list("wind", "feuer", "luft"),
             token = token)
label_topics(list("wind", "feuer", "luft"),
             context = "Elements of the Earth",
             token = token)

## End(Not run)
