Tidy pipelines and structured output

knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>",
  eval = identical(tolower(Sys.getenv("LLMR_RUN_VIGNETTES", "false")), "true")
)

We will show both unstructured and structured pipelines, using open models:

- deepseek-chat (DeepSeek)
- llama-3.1-8b-instant (Groq)
- openai/gpt-oss-20b (Groq)

You will need environment variables DEEPSEEK_API_KEY and GROQ_API_KEY.
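
Before running anything, it can help to confirm that both keys are visible to the R session. The following is a minimal base-R sketch; `needed` and `missing_keys` are names introduced here for illustration and are not part of LLMR.

```r
# Names of the environment variables this vignette relies on
needed <- c("DEEPSEEK_API_KEY", "GROQ_API_KEY")

# Sys.getenv() returns "" for unset variables, so nzchar() marks the ones that are set
is_set <- nzchar(Sys.getenv(needed, unset = ""))
missing_keys <- needed[!is_set]

if (length(missing_keys) > 0) {
  message("Missing environment variables: ", paste(missing_keys, collapse = ", "))
}
```

Keys are usually placed in `~/.Renviron` or set with `Sys.setenv()` at the start of a session.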

library(LLMR)
library(dplyr)

cfg_ds    <- llm_config("deepseek", "deepseek-chat")
cfg_groq1 <- llm_config("groq",     "llama-3.1-8b-instant")
cfg_groq  <- llm_config("groq",     "openai/gpt-oss-20b")

llm_fn: unstructured (DeepSeek)

words <- c("excellent", "awful", "fine")
out <- llm_fn(
  words,
  prompt  = "Classify '{x}' as Positive, Negative, or Neutral.",
  .config = cfg_ds,
  .return = "columns"
)
out

llm_fn: unstructured (Groq)

out_groq <- llm_fn(
  words,
  prompt  = "Classify '{x}' as Positive, Negative, or Neutral.",
  .config = cfg_groq1,
  .return = "columns"
)
out_groq

llm_fn_structured: schema-first (DeepSeek)

schema <- list(
  type = "object",
  properties = list(
    label = list(type = "string", description = "Sentiment label"),
    score = list(type = "number", description = "Confidence 0..1")
  ),
  required = list("label", "score"),
  additionalProperties = FALSE
)

out_s <- llm_fn_structured(
  x = words,
  prompt  = "Classify '{x}' as Positive, Negative, or Neutral with confidence.",
  .config = cfg_ds,
  .schema = schema,
  .fields = c("label", "score")
)
out_s
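
To make the role of `required` and the `type` entries concrete, here is a small base-R sketch that checks one parsed reply against a schema of the same shape. `check_reply()` and the example reply are hypothetical and only illustrate the idea; LLMR's own validation may work differently.

```r
# Same shape as the schema above, repeated so the sketch is self-contained
schema <- list(
  type = "object",
  properties = list(
    label = list(type = "string"),
    score = list(type = "number")
  ),
  required = list("label", "score")
)

# Hypothetical parsed reply for one input (not real model output)
reply <- list(label = "Positive", score = 0.93)

# Illustrative helper (not an LLMR function): report missing required
# fields and whether each present field has the declared type
check_reply <- function(reply, schema) {
  required <- unlist(schema$required)
  present  <- intersect(names(reply), names(schema$properties))
  type_ok  <- vapply(present, function(nm) {
    switch(schema$properties[[nm]]$type,
      string = is.character(reply[[nm]]),
      number = is.numeric(reply[[nm]]),
      TRUE  # types not covered by this sketch are accepted
    )
  }, logical(1))
  list(missing = setdiff(required, names(reply)), type_ok = type_ok)
}

res <- check_reply(reply, schema)
res$missing   # empty when all required fields are present
res$type_ok   # one logical per field
```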

llm_mutate: unstructured (Groq)

df <- tibble::tibble(
  id   = 1:3,
  text = c("Cats are great pets", "The weather is bad", "I like tea")
)

df_u <- df |>
  llm_mutate(
    answer  = "Give a short category for: {text}",
    .config = cfg_groq,
    .return = "columns"
  )

df_u

llm_mutate: shorthand syntax

The shorthand lets you combine output column and prompt in one argument:

df |>
  llm_mutate(
    category = "Give a short category for: {text}",
    .config = cfg_groq
  )
# Equivalent to: llm_mutate(category, prompt = "Give...", .config = cfg_groq)

Or with a named vector of role-tagged messages (here a system message followed by a user message):

df |>
  llm_mutate(
    classified = c(
      system = "You are a text classifier. One word only.",
      user = "Category for: {text}"
    ),
    .config = cfg_ds
  )

llm_mutate with .structured flag

Enable structured output directly in llm_mutate() using .structured = TRUE:

schema <- list(
  type = "object",
  properties = list(
    category = list(type = "string"),
    confidence = list(type = "number")
  ),
  required = list("category", "confidence")
)

# Using .structured = TRUE (equivalent to calling llm_mutate_structured)
df |>
  llm_mutate(
    structured_result = "{text}",
    .config = cfg_ds,
    .structured = TRUE,
    .schema = schema
  )

This is equivalent to calling llm_mutate_structured() and supports all the same shorthand syntax.

Soft structured output with tags

When a strict JSON schema is unnecessary, request simple XML-like tags and let LLMR parse them into columns. In the ordinary one-row-per-call mode below, tags should be flat (not nested); the row-batching mode further down deliberately introduces one level of nesting and is documented there.

cities <- tibble::tibble(city = c("Cairo", "Lima", "Seoul"))

cities |>
  llm_mutate(
    geo = "Where is {city}? Give country and continent in their own tags.",
    .config = cfg_groq1,
    .system_prompt = paste(
      "Use XML tags to specify different parts of the answer, but do not nest tags.",
      "Return <country>...</country> and <continent>...</continent>."
    ),
    .tags = c("country", "continent")
  )

The result includes tags_ok, tags_data, and one column per requested tag. Use llm_parse_tags_col() to parse an existing response column.
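
To illustrate what parsing flat tags involves, the sketch below pulls tag contents out of a reply string with base-R regular expressions. It is not LLMR's implementation; `extract_tag()` and the sample reply are made up for this example.

```r
# Hypothetical model reply using flat, non-nested tags
reply <- "<country>Egypt</country> <continent>Africa</continent>"

# Return the text inside <tag>...</tag>, or NA when the tag is absent.
# (?s) lets "." match newlines; ".*?" stops at the first closing tag.
extract_tag <- function(text, tag) {
  pattern <- sprintf("(?s)<%s>(.*?)</%s>", tag, tag)
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) < 2) NA_character_ else trimws(m[2])
}

tags   <- c("country", "continent")
parsed <- vapply(tags, function(tg) extract_tag(reply, tg), character(1))
parsed
```

A missing tag yields `NA` instead of an error, which is roughly the situation a flag column such as tags_ok is meant to expose.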

Row batching: many rows per call

By default LLMR sends one request per row. With .rows_per_prompt > 1, several rows are packed into a single request: each row’s prompt is wrapped in a numbered tag (<row_1>...</row_1>, <row_2>...</row_2>, …), the block is appended to the message, and the model is asked to answer each item inside a matching numbered tag. LLMR splits the reply back into the original rows. .rows_per_prompt = Inf sends the whole frame in one call.

cities |>
  llm_mutate(
    geo = "Where is {city}? Give country and continent in their own tags.",
    .config = cfg_groq1,
    .tags = c("country", "continent"),
    .rows_per_prompt = 3
  )
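
The packing and splitting described above can be pictured with a few lines of base R. This is a sketch under stated assumptions, not LLMR's code: it assumes consecutive rows share a call and that numbering restarts within each call. `wrap_rows()` and `split_rows()` are hypothetical helpers, and the reply is invented.

```r
# Per-row prompts, as they might look after "{city}" is filled in
prompts <- c("Where is Cairo?", "Where is Lima?", "Where is Seoul?")
rows_per_prompt <- 2

# Assumed grouping: consecutive rows share a call
call_id <- ceiling(seq_along(prompts) / rows_per_prompt)

# Wrap each row of one group in a numbered tag and join into a single block
wrap_rows <- function(p) {
  i <- seq_along(p)
  paste(sprintf("<row_%d>%s</row_%d>", i, p, i), collapse = "\n")
}
blocks <- vapply(split(prompts, call_id), wrap_rows, character(1))

# Split a reply back into per-row answers; NA marks a row the model skipped
split_rows <- function(reply, n) {
  vapply(seq_len(n), function(i) {
    pattern <- sprintf("(?s)<row_%d>(.*?)</row_%d>", i, i)
    m <- regmatches(reply, regexec(pattern, reply, perl = TRUE))[[1]]
    if (length(m) < 2) NA_character_ else trimws(m[2])
  }, character(1))
}

reply   <- "<row_1>Egypt</row_1>\n<row_2>Peru</row_2>"  # hypothetical reply to the first call
answers <- split_rows(reply, 2)
answers
```

With three rows and `rows_per_prompt = 2`, this grouping produces two calls: one carrying two rows and one carrying the remaining row.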

A few points worth keeping in mind:

Preview before you spend, summarize after

llm_preview() renders exactly what llm_fn() / llm_mutate() would send, without any API call and without reading or encoding files. It flags problems up front: missing files, a "file" role combined with .rows_per_prompt > 1, an embedding config with row batching, and so on. The batch plan columns show how rows would be grouped into calls.

df <- data.frame(text = c("good", "bad", "fine"), stringsAsFactors = FALSE)
LLMR::llm_preview(df, prompt = "Sentiment of: {text}", .rows_per_prompt = 2)

After a run, llm_usage() summarizes outcomes and token totals, and llm_failures() lists the rows that failed or were truncated. Both read the diagnostic columns that llm_mutate() and call_llm_par() already produce. llm_usage() reports tokens, not dollars: multiply by your provider’s current per-token prices yourself.

out <- df |>
  llm_mutate(sentiment = "One-word sentiment for: {text}", .config = cfg_groq)

llm_usage(out)       # counts + sent/received/total/reasoning tokens
llm_failures(out)    # which rows failed or were truncated, and why

For a call_llm_par() result you can re-run only the failures with llm_par_resume().
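
Since llm_usage() reports tokens and not dollars, a cost estimate is a short multiplication. In the sketch below the token totals and the prices are placeholders chosen for illustration; substitute the numbers from your own run and your provider's current price list.

```r
# Hypothetical token totals, e.g. copied by hand from a usage summary
tokens_sent     <- 1200
tokens_received <- 300

# Placeholder prices in USD per million tokens; NOT real provider prices
price_in_per_m  <- 0.50
price_out_per_m <- 1.50

# Input and output tokens are typically priced separately
cost_usd <- tokens_sent / 1e6 * price_in_per_m +
  tokens_received / 1e6 * price_out_per_m
cost_usd
```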

llm_mutate_structured: structured with shorthand (Groq)

schema2 <- list(
  type = "object",
  properties = list(
    category  = list(type = "string"),
    rationale = list(type = "string")
  ),
  required = list("category", "rationale"),
  additionalProperties = FALSE
)

# Traditional call
df_s <- df |>
  llm_mutate_structured(
    annot,
    prompt  = "Extract category and a one-sentence rationale for: {text}",
    .config = cfg_groq,
    .schema = schema2
    # Because a schema is present, fields auto-hoist; you can also pass:
    # .fields = c("category", "rationale")
  )

df_s

# Or use shorthand
df |>
  llm_mutate_structured(
    annot = "Extract category and rationale for: {text}",
    .config = cfg_groq,
    .schema = schema2
  )