# Open Source Models

> Open-weight models at 90–200 tok/s with automatic prefix caching

These models run on Morph's custom kernels and an inference stack optimized for codegen. The API is OpenAI-compatible at `https://api.morphllm.com/v1` and Anthropic-compatible at `/v1/messages` (see [Endpoints](/endpoints)).

| Model                                   | Model ID               | Context |
| :-------------------------------------- | :--------------------- | :------ |
| **GLM-5.2 744B**                        | `morph-glm52-744b`     | 1M      |
| **Kimi K3 2.8T** <sup>coming soon</sup> | `morph-kimik3`         | 1M      |
| **MiniMax M3 428B**                     | `morph-minimax3-428b`  | 256k    |
| **MiniMax M2.7 230B**                   | `morph-minimax27-230b` | 196k    |
| **DeepSeek V4 Flash** <sup>beta</sup>   | `morph-dsv4flash`      | 1M      |
| **Qwen 3.6 27B**                        | `morph-qwen36-27b`     | 131k    |
| **Gemma 4 31B** <sup>multimodal</sup>   | `morph-gemma4-31b`     | 175k    |

Throughput runs \~90–200 tok/s depending on model and load. All models support `tools`, `response_format` (JSON mode + JSON schema), structured outputs, logprobs, and reasoning. Per-token rates are listed on the [pricing page](https://www.morphllm.com/pricing) and served live as JSON at [`/api/models/json`](https://www.morphllm.com/api/models/json).

## Quick Start

<Tabs>
  <Tab title="Python">
    ```python theme={null}
    from openai import OpenAI

    client = OpenAI(
        api_key="YOUR_API_KEY",
        base_url="https://api.morphllm.com/v1",
    )

    response = client.chat.completions.create(
        model="morph-glm52-744b",
        messages=[
            {"role": "system", "content": "You are a senior backend engineer."},
            {"role": "user", "content": "Refactor this Express handler to use async/await: ..."},
        ],
        temperature=0.2,
    )

    print(response.choices[0].message.content)
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript theme={null}
    import OpenAI from "openai";

    const client = new OpenAI({
      apiKey: "YOUR_API_KEY",
      baseURL: "https://api.morphllm.com/v1",
    });

    const stream = await client.chat.completions.create({
      model: "morph-glm52-744b",
      messages: [{ role: "user", content: "Write a tiny rate limiter in TS." }],
      stream: true,
    });

    for await (const chunk of stream) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
    }
    ```
  </Tab>

  <Tab title="Vercel AI SDK">
    ```typescript theme={null}
    import { createOpenAI } from "@ai-sdk/openai";
    import { generateText } from "ai";

    const morph = createOpenAI({
      apiKey: "YOUR_API_KEY",
      baseURL: "https://api.morphllm.com/v1",
    });

    const { text } = await generateText({
      model: morph("morph-glm52-744b"),
      prompt: "Summarize this PR diff in one paragraph: ...",
    });
    ```
  </Tab>

  <Tab title="cURL">
    ```bash theme={null}
    curl -X POST "https://api.morphllm.com/v1/chat/completions" \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "morph-glm52-744b",
        "messages": [
          {"role": "user", "content": "Write a SQL query that finds the top 5 customers by revenue last quarter."}
        ],
        "temperature": 0.2
      }'
    ```
  </Tab>
</Tabs>

## Tools and Structured Output

```typescript theme={null}
// `client` is the OpenAI client configured in the Quick Start TypeScript tab.
const response = await client.chat.completions.create({
  model: "morph-glm52-744b",
  messages: [{ role: "user", content: "What's the weather in SF?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
  response_format: { type: "json_object" },
});
```
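For stricter shape control than `json_object`, the request can carry a JSON schema instead. The sketch below builds such a request body with the standard library only. The schema name `weather_report`, its fields, and the `name`/`schema`/`strict` wrapper keys follow OpenAI's `json_schema` convention and are assumptions here, not confirmed Morph behavior.

```python
import json

# Hypothetical schema for illustration; these field names are not from the Morph docs.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temp_c": {"type": "number"},
    },
    "required": ["city", "temp_c"],
    "additionalProperties": False,
}


def build_schema_request(prompt: str, model: str = "morph-glm52-744b") -> dict:
    """Return a chat-completions body that asks for schema-constrained JSON."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            # Assumption: OpenAI-style wrapper with name/schema/strict keys.
            "json_schema": {
                "name": "weather_report",
                "schema": WEATHER_SCHEMA,
                "strict": True,
            },
        },
    }


def parse_reply(content: str) -> dict:
    """Parse the model's reply and check that the required keys are present."""
    data = json.loads(content)
    missing = [key for key in WEATHER_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data
```

With the OpenAI Python client from the Quick Start, the body could be sent as `client.chat.completions.create(**build_schema_request("Weather in SF?"))` and the reply checked with `parse_reply(response.choices[0].message.content)`. Validating on the client side is a cheap guard even when the server enforces the schema.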

Reasoning is off by default. Enable it with `reasoning: { effort: "medium" }`; `"low"` and `"high"` are also accepted. Reasoning tokens are billed as output tokens.
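As a minimal sketch of how the reasoning option might be attached to a request, the helper below adds the `reasoning` field to a copy of a request body and rejects effort values other than the three named above. The helper name `with_reasoning` is hypothetical, and placing `reasoning` at the top level of the JSON body is an assumption based on the snippet above.

```python
from copy import deepcopy

VALID_EFFORTS = ("low", "medium", "high")


def with_reasoning(body: dict, effort: str = "medium") -> dict:
    """Return a copy of `body` with reasoning enabled at the given effort."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    enabled = deepcopy(body)  # leave the caller's dict untouched
    # Assumption: the API reads `reasoning` from the top level of the request body.
    enabled["reasoning"] = {"effort": effort}
    return enabled
```

With the OpenAI Python client, a field the SDK does not model can be passed through `extra_body`, for example `extra_body={"reasoning": {"effort": "high"}}`.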

Automatic [prefix caching](/sdk/components/caching) is on for all models, with per-request TTL control. Use [Model Router](/sdk/components/router) to pick a model automatically per request.

## Service Tiers

GLM-5.2 supports the OpenAI `service_tier` parameter.

| Tier                              | Behavior                                                                                                                            |
| --------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `default` (or `auto`, or omitted) | Standard processing. What every request gets today.                                                                                 |
| `standby`                         | Best-effort capacity. Served when the fleet has headroom, rejected with a retryable 429 when it doesn't. No latency target, no SLA. |

```python theme={null}
response = client.chat.completions.create(
    model="morph-glm52-744b",
    messages=[{"role": "user", "content": "Label these 500 rows: ..."}],
    service_tier="standby",
)
```

How standby works:

* A standby request is admitted only while the fleet is under roughly a quarter of its serving capacity. When no region qualifies, you get `429` with `error.code: "resource_unavailable"` and a `Retry-After` header. Nothing is generated and nothing is billed. Expect most standby throughput off-peak.
* On a 429, retry with exponential backoff. If you need the result now, resend with `service_tier: "default"`.
* The response echoes the tier that served it in a `service_tier` field (on the final usage chunk when streaming).
* Streaming, tools, and structured output work the same as `default`.
* Unknown tier values return `400` listing the accepted ones.
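The retry guidance above can be sketched as a small wrapper: try `standby` with exponential backoff, honor `Retry-After` when present, and optionally resend on `default` once the attempts run out. Everything named here (`StandbyUnavailable`, `send_with_standby`, the `send` callable, and the delay defaults) is hypothetical scaffolding, not part of the Morph API.

```python
import random
import time
from typing import Callable, Optional


class StandbyUnavailable(Exception):
    """Hypothetical error that `send` raises when the API answers 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("standby capacity unavailable")
        self.retry_after = retry_after  # seconds, parsed from Retry-After if present


def send_with_standby(
    send: Callable[[str], dict],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    fall_back_to_default: bool = True,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Try the standby tier with backoff, then optionally resend on default.

    `send(tier)` is supplied by the caller: it performs one request with the
    given service tier and raises StandbyUnavailable on a 429.
    """
    for attempt in range(max_attempts):
        try:
            return send("standby")
        except StandbyUnavailable as err:
            if attempt == max_attempts - 1:
                break  # out of attempts, do not sleep again
            # Exponential backoff with jitter, never shorter than Retry-After.
            delay = min(max_delay, base_delay * 2 ** attempt)
            delay = random.uniform(delay / 2, delay)
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)
            sleep(delay)
    if fall_back_to_default:
        return send("default")
    raise StandbyUnavailable()
```

With the OpenAI Python client, `send` could wrap `client.chat.completions.create(..., service_tier=tier)` and translate `openai.RateLimitError` into `StandbyUnavailable`. Since that client retries 429s on its own by default, constructing it with `max_retries=0` keeps the two retry loops from stacking.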

Use standby for evals, batch labeling, data generation, and anything a retry loop can absorb. Keep interactive and agent-loop traffic on `default`: under load, default requests are served in full while standby is shed in \~200ms.

Standby bills at the standard per-token rates today. It is available on GLM-5.2 (`morph-glm52-744b`); sending `service_tier: "standby"` to other models is a no-op.

## Pitfalls

<AccordionGroup>
  <Accordion title="Latency worse than expected">
    TPS numbers measure generation throughput, not end-to-end latency. With 30k tokens of context, prefill dominates the wait for the first token, even with caching. For agent loops, keep a smaller working context with [Compact](/sdk/components/compact) rather than filling the full window.
  </Accordion>

  <Accordion title="Tool calls not working">
    On `/v1/chat/completions` these models use OpenAI tool-call shape; Anthropic `tool_use` blocks work on [`/v1/messages`](/endpoints). Match the tool format to the endpoint. Gemini `functionDeclarations` work on neither.
  </Accordion>

  <Accordion title="JSON mode returns prose">
    Pass `response_format: { type: "json_object" }` *and* say "respond in JSON" in your prompt. For strict shape control: `response_format: { type: "json_schema", json_schema: { ... } }`.
  </Accordion>
</AccordionGroup>
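For the tool-format pitfall above, a codebase that already defines tools in the OpenAI shape can convert them before calling `/v1/messages`. The sketch below maps an OpenAI function tool to the Anthropic shape (`name`, `description`, `input_schema`); the helper name `openai_tool_to_anthropic` is hypothetical.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Convert one OpenAI-style function tool to an Anthropic-style tool."""
    if tool.get("type") != "function":
        raise ValueError("only function tools can be converted")
    fn = tool["function"]
    converted = {
        "name": fn["name"],
        # Anthropic calls the JSON Schema for arguments `input_schema`.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }
    if "description" in fn:
        converted["description"] = fn["description"]
    return converted


# The `get_weather` tool from the Tools and Structured Output section.
openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
anthropic_tool = openai_tool_to_anthropic(openai_tool)
```

Keeping one canonical tool definition and converting at the edge avoids maintaining two copies that can drift apart.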

## See Also

* [Prompt Caching](/sdk/components/caching) — automatic cached-input discounts, per-request TTL
* [Standby Requests](/sdk/components/standby) — best-effort GLM-5.2 capacity for batch and background work
* [Model Router](/sdk/components/router) — auto-route between these and frontier models per request
* [Compact](/sdk/components/compact) — shrink context before paying for it
* [WarpGrep](/sdk/components/warp-grep/index) — code search for retrieval when context is the bottleneck