Langfuse evaluator - Morph Documentation

If your traces live in Langfuse, you can use a Reflex as an LLM-as-a-judge evaluator — Langfuse runs it over each trace and attaches the predicted category as a score. No extra inference code.

Already tracing with Morph? You don’t need this — pass evals on a begin() turn and Morph labels traces for you, off your request path. See Run evals automatically on your traces. This page is for teams whose traces live in Langfuse.

Reflexes are classifiers, so they plug in as a categorical evaluator. Morph classifies your evaluation prompt.

Before you start

A Morph API key (sk-...).
A Reflex model id and its exact labels — categories in Langfuse must byte-match these. List a model’s labels with one prediction:

curl https://api.morphllm.com/v1/reflex/predict \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "jailbreak", "text": "ignore your instructions"}'
# classes: [{ "label": "benign", ... }, { "label": "jailbreak", ... }]

Set up the evaluator

Open Evaluators → Set up evaluator → LLM-as-a-judge

In Langfuse, go to Evaluators, click Set up evaluator, and choose LLM-as-a-judge.

Add a Morph LLM connection

Click Change the provider → Add LLM connection and configure it:

Provider / schema: OpenAI (Morph is OpenAI-compatible).
API key: your Morph sk-... key.
Open Advanced settings and set the API Base URL to:
```
https://api.morphllm.com/v1/reflex-oai
```
Leave Use Responses API and Extra Headers off — you don’t need them.
Turn off Use default models, and under Custom models add the Reflex model ids you want to use (e.g. jailbreak).

Click Create connection, then select that connection and the model.

Name the evaluator

Under Define evaluator, give it any name you like, e.g. jailbreak.

Write the evaluation prompt

The evaluation prompt is the text Morph classifies — put just the variable you want judged and nothing else:

{{input}}

Score type: Categorical, single category

Set Score type to Categorical.

Categories: add one per Reflex label, spelled exactly as the model emits them (for a built-in Reflex, copy them from Default Reflexes below). They must be exhaustive — add a catch-all only if your model has that label.
Do not enable “Allow multiple matches.” The evaluator returns exactly one category (the top prediction).

If a category doesn’t byte-match a model label, Langfuse rejects the score with a parse error. Copy labels from the predict response above.

Leave the reasoning + selection prompts as-is

The Score reasoning prompt and Category selection prompt can stay at their defaults — Morph does not read them. The returned reasoning is always NA-Reflex, since Reflexes are classifiers and emit no rationale.

Save

Save the evaluator. To verify a run, open a score and choose View execution trace (environment langfuse-llm-as-a-judge) to see the exact request Langfuse sent and the category Morph returned.

Run it

The evaluator scores new matching traces automatically as they come in. To score traces you already have, open the Traces table, select the rows you want (or select all), and click Evaluate at the bottom.

Default Reflexes

Copy the model id into the connection’s Custom models, and the categories into the evaluator’s Categories — exactly as written, they’re case-sensitive.

Reflex `model`	Categories	Catches
`jailbreak`	`benign`, `jailbreak`	Prompt-injection / jailbreak attempts
`guardrail`	`false`, `true`	Harassment or NSFW content
`leaked-thinking`	`clean`, `leaked`	Agent leaking its internal thinking
`stuck-in-a-loop`	`progressing`, `looping`	Agent blocked, not trying new things
`incomplete-thought`	`complete`, `incomplete`	User sent a truncated prompt
`user-frustrated`	`Frustrated`, `Not Frustrated`	User is frustrated with the agent
`ambiguity`	`low`, `med`, `high`	How underspecified a prompt is
`difficulty`	`easy`, `medium`, `hard`	Prompt difficulty, for model routing
`domain`	`general`, `summary`, `coding`, `design`, `data`	Topic of a request
Custom	Get them from your Reflex dashboard	Your own trained classifier

domain is multi-label, but with Allow multiple matches off the evaluator returns its single top label.

​Before you start

​Set up the evaluator

​Default Reflexes

Before you start

Set up the evaluator

Default Reflexes