# Fast General Models

> Open-weight models at 90–200 tok/s with automatic prefix caching

These models run on Morph's custom kernels and inference stack, optimized for code generation. The API is OpenAI-compatible and uses the same API key as Fast Apply and Compact. Base URL: `https://api.morphllm.com/v1`.

| Model                                 | ID                     | Speed       | Context (tokens) | Price in / out (per 1M tokens) | Modalities   |
| ------------------------------------- | ---------------------- | ----------- | ---------------- | ------------------------------ | ------------ |
| **Qwen 3.5 397B**                     | `morph-qwen35-397b`    | \~200 tok/s | 262k    | $0.478 / $3.50  | text + image |
| **MiniMax M2.7**                      | `morph-minimax27-230b` | \~90 tok/s  | 200k    | $0.279 / $1.20  | text         |
| **Qwen 3.6 27B**                      | `morph-qwen36-27b`     | \~100 tok/s | 131k    | $0.498 / $2.40  | text         |
| **DeepSeek V4 Flash** <sup>beta</sup> | `morph-dsv4flash`      | \~150 tok/s | 393k    | $0.30 / $0.40   | text         |

All models support `tools`, `response_format` (JSON mode and JSON-schema structured outputs), logprobs, and reasoning.

Automatic prefix caching is on for all models; no configuration is needed. Use [Model Router](/sdk/components/router) to pick a model automatically for each request.
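
A prefix cache can only reuse work for the leading tokens that are byte-identical across requests, so in general it helps to put stable content (system prompt, tool definitions, shared reference text) first and append per-request content last. The sketch below assumes that general property of prefix caching rather than any documented Morph-specific rule; `buildMessages` and `SYSTEM_PROMPT` are hypothetical names.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical stable prefix. Keep it byte-identical across requests so the
// cached prefix can be reused: no timestamps or request IDs in here.
const SYSTEM_PROMPT = "You are a senior backend engineer. Answer with code first.";

// Stable content first, growing conversation history next, newest turn last.
function buildMessages(history: Message[], userTurn: string): Message[] {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    ...history,
    { role: "user", content: userTurn },
  ];
}

// Two consecutive requests in an agent loop: the second one starts with every
// message of the first, so that shared prefix is eligible for cache hits.
const first = buildMessages([], "Add input validation to the signup handler.");
const second = buildMessages(
  [...first.slice(1), { role: "assistant", content: "..." }],
  "Now add rate limiting too.",
);

console.log(second.map((m) => m.role)); // system, user, assistant, user
```

Anything that changes per request (a timestamp, a request ID, a reordered tool list) placed early in the prompt breaks the shared prefix for everything after it.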

## Quick Start

<Tabs>
  <Tab title="Python">
    ```python theme={null}
    from openai import OpenAI

    client = OpenAI(
        api_key="YOUR_API_KEY",
        base_url="https://api.morphllm.com/v1",
    )

    response = client.chat.completions.create(
        model="morph-qwen35-397b",
        messages=[
            {"role": "system", "content": "You are a senior backend engineer."},
            {"role": "user", "content": "Refactor this Express handler to use async/await: ..."},
        ],
        temperature=0.2,
    )

    print(response.choices[0].message.content)
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript theme={null}
    import OpenAI from "openai";

    const client = new OpenAI({
      apiKey: "YOUR_API_KEY",
      baseURL: "https://api.morphllm.com/v1",
    });

    const stream = await client.chat.completions.create({
      model: "morph-qwen35-397b",
      messages: [{ role: "user", content: "Write a tiny rate limiter in TS." }],
      stream: true,
    });

    for await (const chunk of stream) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
    }
    ```
  </Tab>

  <Tab title="Vercel AI SDK">
    ```typescript theme={null}
    import { createOpenAI } from "@ai-sdk/openai";
    import { generateText } from "ai";

    const morph = createOpenAI({
      apiKey: process.env.MORPH_API_KEY!,
      baseURL: "https://api.morphllm.com/v1",
    });

    const { text } = await generateText({
      // .chat() targets the Chat Completions endpoint (/v1/chat/completions)
      model: morph.chat("morph-qwen36-27b"),
      prompt: "Summarize this PR diff in one paragraph: ...",
    });

    console.log(text);
    ```
  </Tab>

  <Tab title="cURL">
    ```bash theme={null}
    curl -X POST "https://api.morphllm.com/v1/chat/completions" \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "morph-qwen35-397b",
        "messages": [
          {"role": "user", "content": "Write a SQL query that finds the top 5 customers by revenue last quarter."}
        ],
        "temperature": 0.2
      }'
    ```
  </Tab>
</Tabs>

## Tools and Structured Output

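```typescript theme={null}
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MORPH_API_KEY!,
  baseURL: "https://api.morphllm.com/v1",
});

const response = await client.chat.completions.create({
  model: "morph-minimax27-230b",
  // JSON mode needs the prompt itself to ask for JSON (see Pitfalls below)
  messages: [{ role: "user", content: "What's the weather in SF? Respond in JSON." }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
  // Applies to the final text answer; a tool call comes back in message.tool_calls
  response_format: { type: "json_object" },
});
```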
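
When the model decides to call `get_weather`, the reply carries a `tool_calls` array instead of text. In the standard OpenAI flow you run each tool locally and send the results back as `role: "tool"` messages keyed by `tool_call_id`. The sketch below covers that step only; `getWeather` is a hypothetical stub, and the types are a minimal hand-written subset of the response shape, not the SDK's own types.

```typescript
// Minimal hand-written shapes for the parts of an OpenAI-style reply used here.
type ToolCall = {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
};
type ToolMessage = { role: "tool"; tool_call_id: string; content: string };

// Hypothetical local implementation of the `get_weather` tool declared above.
function getWeather(args: { city: string }): { city: string; tempF: number } {
  return { city: args.city, tempF: 61 }; // stubbed value for illustration
}

const handlers: Record<string, (args: any) => unknown> = {
  get_weather: getWeather,
};

// Run each requested tool and wrap its result as a `role: "tool"` message.
// Unknown tool names produce an error payload instead of throwing.
function runToolCalls(toolCalls: ToolCall[]): ToolMessage[] {
  return toolCalls.map((call): ToolMessage => {
    const handler = handlers[call.function.name];
    const result = handler
      ? handler(JSON.parse(call.function.arguments))
      : { error: `unknown tool: ${call.function.name}` };
    return { role: "tool", tool_call_id: call.id, content: JSON.stringify(result) };
  });
}

// Example input shaped like one entry of message.tool_calls.
const toolMessages = runToolCalls([
  { id: "call_1", type: "function", function: { name: "get_weather", arguments: '{"city":"SF"}' } },
]);
console.log(toolMessages);
```

For the follow-up request, append the assistant message (with its `tool_calls`) and then these tool messages to `messages`, and call `client.chat.completions.create` again with the same `tools`.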

Reasoning is off by default. To enable it, pass `reasoning: { effort: "medium" }`; `"low"` and `"high"` are the other accepted levels. Reasoning tokens are billed at the output rate.
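
The `reasoning` object is not among the OpenAI SDK's typed chat-completion parameters, so the TypeScript compiler may reject it in an object literal. One way around that is a plain `fetch` to the chat-completions endpoint. A minimal sketch, assuming Node 18+ (global `fetch`); `withReasoning` and `complete` are hypothetical helper names.

```typescript
type Effort = "low" | "medium" | "high";

// Build a chat-completions body with reasoning enabled, using the
// `reasoning: { effort }` shape described above.
function withReasoning(model: string, prompt: string, effort: Effort = "medium") {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    reasoning: { effort },
  };
}

// Sends any JSON body to the OpenAI-compatible endpoint (not called here).
async function complete(body: object, apiKey: string): Promise<unknown> {
  const res = await fetch("https://api.morphllm.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}: ${await res.text()}`);
  return res.json();
}

const body = withReasoning("morph-dsv4flash", "Why is this regex so slow? ...", "high");
console.log(JSON.stringify(body));
```

To send it, call `await complete(body, process.env.MORPH_API_KEY!)`.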

## Pricing

Billing is per token, with no minimums, at the rates in the table above. Live rates are available from [`/v1/models`](https://api.morphllm.com/v1/models).

* **Images** (Qwen 3.5 397B only) are billed as input tokens at the input rate
* **4xx requests** are not billed; partial generations are billed for the tokens returned
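
The per-request arithmetic is (tokens × price per 1M) / 1,000,000, summed over input and output. The sketch below hard-codes the rates from the table at the top of this page (they may drift; `/v1/models` has live rates), counts image tokens as input and reasoning tokens as output, as described above, and does not model any prefix-cache discount, since this page does not list one. `estimateCostUSD` is a hypothetical helper.

```typescript
// Per-1M-token prices in USD, copied from the table above.
const PRICES: Record<string, { input: number; output: number }> = {
  "morph-qwen35-397b": { input: 0.478, output: 3.5 },
  "morph-minimax27-230b": { input: 0.279, output: 1.2 },
  "morph-qwen36-27b": { input: 0.498, output: 2.4 },
  "morph-dsv4flash": { input: 0.3, output: 0.4 },
};

type Usage = {
  inputTokens: number; // prompt tokens, including image tokens on Qwen 3.5 397B
  outputTokens: number; // completion tokens, including any reasoning tokens
};

function estimateCostUSD(model: string, usage: Usage): number {
  const price = PRICES[model];
  if (!price) throw new Error(`no price for model ${model}`);
  return (usage.inputTokens * price.input + usage.outputTokens * price.output) / 1_000_000;
}

// 30k prompt tokens + 3k output tokens (1k of them reasoning) on Qwen 3.5 397B:
// (30_000 * 0.478 + 3_000 * 3.5) / 1e6 = (14_340 + 10_500) / 1e6 = 0.02484
const cost = estimateCostUSD("morph-qwen35-397b", { inputTokens: 30_000, outputTokens: 3_000 });
console.log(cost.toFixed(5));
```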

## Pitfalls

<AccordionGroup>
  <Accordion title="Latency worse than expected">
    The tok/s figures measure generation throughput, not end-to-end latency. With 30k tokens of context, prefill dominates the time to first token even with caching. For agent loops, use [Compact](/sdk/components/compact) to keep a smaller working context rather than filling the full window.
  </Accordion>

  <Accordion title="Tool calls not working">
    These models use the OpenAI tool-call format, not Anthropic `tool_use` blocks or Gemini `functionDeclarations`. Use the OpenAI SDK or `@ai-sdk/openai` pointed at the Morph base URL.
  </Accordion>

  <Accordion title="JSON mode returns prose">
    Pass `response_format: { type: "json_object" }` *and* say "respond in JSON" in your prompt. For strict shape control: `response_format: { type: "json_schema", json_schema: { ... } }`.
  </Accordion>
</AccordionGroup>
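
For the strict-shape option mentioned in the last pitfall, a `json_schema` response format pairs a JSON Schema with a name. The sketch below builds that fragment in the OpenAI-style `{ name, strict, schema }` shape and adds a small client-side check on the parsed result; the schema, `responseFormat`, `parseReview`, and the `Review` type are hypothetical examples.

```typescript
// Hypothetical schema for a structured PR review.
const reviewSchema = {
  type: "object",
  properties: {
    summary: { type: "string" },
    risk: { type: "string", enum: ["low", "medium", "high"] },
    files: { type: "array", items: { type: "string" } },
  },
  required: ["summary", "risk", "files"],
  additionalProperties: false,
} as const;

// OpenAI-style strict structured-output fragment for the request body.
const responseFormat = {
  type: "json_schema" as const,
  json_schema: { name: "pr_review", strict: true, schema: reviewSchema },
};

type Review = { summary: string; risk: "low" | "medium" | "high"; files: string[] };

// Parse the model's text and check the fields the caller relies on. Even if
// the server enforces the schema, this catches truncated or empty replies.
function parseReview(content: string | null): Review {
  if (!content) throw new Error("empty response");
  const data = JSON.parse(content);
  const ok =
    typeof data.summary === "string" &&
    ["low", "medium", "high"].includes(data.risk) &&
    Array.isArray(data.files);
  if (!ok) throw new Error("response does not match schema");
  return data as Review;
}

console.log(parseReview('{"summary":"Adds retries","risk":"low","files":["client.ts"]}'));
```

Add `response_format: responseFormat` to the `client.chat.completions.create` call, then pass `response.choices[0].message.content` to `parseReview`.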

## See Also

* [Model Router](/sdk/components/router) — auto-route between these and frontier models per request
* [Compact](/sdk/components/compact) — shrink context before paying for it
* [WarpGrep](/sdk/components/warp-grep/index) — code search for retrieval when context is the bottleneck
