These models run on Morph’s custom kernels and inference stack, optimized for codegen. The API is OpenAI-compatible and uses the same key as Fast Apply and Compact. Base URL: https://api.morphllm.com/v1.

Model                     ID                    Speed       Context  In / Out per 1M tokens  Modalities
Qwen 3.5 397B             morph-qwen35-397b     ~200 tok/s  262k     0.478 / 3.50            text + image
MiniMax M2.7              morph-minimax27-230b  ~90 tok/s   200k     0.279 / 1.20            text
Qwen 3.6 27B              morph-qwen36-27b      ~100 tok/s  131k     0.498 / 2.40            text
DeepSeek V4 Flash (beta)  morph-dsv4flash       ~150 tok/s  393k     0.30 / 0.40             text

All models support tools, response_format (JSON mode + JSON schema), structured outputs, logprobs, and reasoning. Automatic prefix caching is on for all models, with no configuration needed. Use Model Router to pick a model automatically per request.

Quick Start

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.morphllm.com/v1",
)

response = client.chat.completions.create(
    model="morph-qwen35-397b",
    messages=[
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": "Refactor this Express handler to use async/await: ..."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)

Tools and Structured Output

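import OpenAI from "openai";

// Same key and base URL as the Quick Start.
const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://api.morphllm.com/v1",
});

const response = await client.chat.completions.create({
  model: "morph-minimax27-230b",
  // JSON mode expects the prompt itself to ask for JSON (see Pitfalls).
  messages: [{ role: "user", content: "What's the weather in SF? Respond in JSON." }],
  tools: [
    {
      // OpenAI-style function tool; parameters is a JSON Schema object.
      type: "function",
      function: {
        name: "get_weather",
        description: "Get weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
  response_format: { type: "json_object" },
});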

Reasoning is off by default. Enable it with reasoning: { effort: "medium" }; "low" and "high" are also accepted. Reasoning tokens bill as output.
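
A minimal sketch of how the reasoning option might be sent from the Python SDK used in the Quick Start. In the OpenAI Python SDK, reasoning is not a named parameter of chat.completions.create, so the example forwards it through extra_body, which the SDK merges into the request JSON. The helper name build_reasoning_request is hypothetical, and placing reasoning at the top level of the request body is an assumption based on the snippet above.

```python
from typing import Any

ALLOWED_EFFORTS = ("low", "medium", "high")  # values listed in the text above


def build_reasoning_request(prompt: str, effort: str = "medium") -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create (hypothetical helper)."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORTS}, got {effort!r}")
    return {
        "model": "morph-qwen35-397b",
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server reads `reasoning` from the top level of the
        # request body; extra_body is how the OpenAI SDK sends extra fields.
        "extra_body": {"reasoning": {"effort": effort}},
    }


request = build_reasoning_request("Why does this query deadlock?", effort="high")
print(request["extra_body"])
```

With the client from the Quick Start, the request would then be sent as client.chat.completions.create(**request).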

Pricing

Pricing is per token, with no minimums. The table above is canonical; live rates are available at /v1/models.
  • Images (Qwen 397B only) bill as text tokens at the input rate
  • 4xx requests are not billed; partial generations bill for tokens returned
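
The billing rules above can be turned into a rough per-request estimate. The sketch below copies the rates from the model table and treats reasoning tokens as output. The helper name estimate_cost is hypothetical, and the result is in whatever units the table's prices use. If the usage data you receive already counts reasoning tokens inside the output count, leave reasoning_tokens at 0 to avoid double counting.

```python
# (input rate, output rate) per 1M tokens, copied from the model table above.
RATES_PER_MILLION = {
    "morph-qwen35-397b": (0.478, 3.50),
    "morph-minimax27-230b": (0.279, 1.20),
    "morph-qwen36-27b": (0.498, 2.40),
    "morph-dsv4flash": (0.30, 0.40),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  reasoning_tokens: int = 0) -> float:
    """Estimate one request's cost in the table's price units (hypothetical helper)."""
    in_rate, out_rate = RATES_PER_MILLION[model]
    # Reasoning tokens bill as output. Image tokens are assumed to be counted
    # in input_tokens already, since images bill at the input rate.
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * in_rate + billed_output * out_rate) / 1_000_000


cost = estimate_cost("morph-minimax27-230b", input_tokens=30_000,
                     output_tokens=1_500, reasoning_tokens=500)
print(f"{cost:.6f}")
```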

Pitfalls

TPS numbers are generation throughput, not end-to-end. With 30k tokens of context, prefill dominates first-token wait even with caching. For agent loops, keep a smaller working context with Compact rather than filling the full window.
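
To make the distinction concrete, a back-of-the-envelope sketch: end-to-end time is roughly the first-token wait plus output tokens divided by generation throughput. The function name and the 1.5 s first-token wait are hypothetical; measure the first-token wait for your own context size rather than relying on this number.

```python
def estimate_wall_seconds(ttft_seconds: float, output_tokens: int,
                          tokens_per_second: float) -> float:
    """Time to first token plus generation time at the listed throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second


# Hypothetical numbers: 1.5 s measured first-token wait, 500 output tokens,
# and the ~200 tok/s listed for morph-qwen35-397b.
total = estimate_wall_seconds(1.5, 500, 200)
print(total)  # 4.0
```

In this example the first-token wait accounts for more than a third of the total, which is why trimming context can matter more than picking the model with the highest tok/s.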
These models use OpenAI tool-call shape, not Anthropic tool_use blocks or Gemini functionDeclarations. Use the OpenAI SDK or @ai-sdk/openai pointed at our base URL.
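
For reference, a sketch of what handling the OpenAI tool-call shape looks like on the client side: each entry in tool_calls carries an id and a function with a name and a JSON-string arguments field, and the result goes back as a role="tool" message. The get_weather implementation, the TOOLS registry, and run_tool_calls are hypothetical helpers, and the message dict is a hand-written stand-in for response.choices[0].message.model_dump().

```python
import json
from typing import Any, Callable


def get_weather(city: str) -> dict[str, Any]:
    """Hypothetical local tool; a real one would call a weather service."""
    return {"city": city, "forecast": "unknown"}


TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn OpenAI-style tool_calls into role="tool" reply messages."""
    replies = []
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        result = TOOLS[fn["name"]](**args)
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return replies


# Hand-written stand-in for an assistant message that requests one tool call.
message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "SF"}'},
    }],
}
print(run_tool_calls(message))
```

The returned messages would be appended to the conversation, after the assistant message itself, before the next chat.completions.create call.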
Pass response_format: { type: "json_object" } and say “respond in JSON” in your prompt. For strict shape control: response_format: { type: "json_schema", json_schema: { ... } }.
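
A sketch of the strict-shape variant. The schema, its name, and the parse_report helper are hypothetical, and the inner json_schema layout (name / schema / strict) is assumed to follow OpenAI's format since the API is described as OpenAI-compatible. The parsing step is a light sanity check on required keys, not full JSON Schema validation.

```python
import json
from typing import Any

# Hypothetical schema; the name and fields are illustrative only.
WEATHER_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temp_c": {"type": "number"},
    },
    "required": ["city", "temp_c"],
    "additionalProperties": False,
}

# Assumption: the inner object follows OpenAI's name / schema / strict layout.
RESPONSE_FORMAT: dict[str, Any] = {
    "type": "json_schema",
    "json_schema": {"name": "weather_report", "schema": WEATHER_SCHEMA, "strict": True},
}


def parse_report(content: str) -> dict[str, Any]:
    """Parse the model's JSON reply and check that required keys are present."""
    data = json.loads(content)
    missing = [key for key in WEATHER_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Hand-written stand-in for response.choices[0].message.content.
print(parse_report('{"city": "SF", "temp_c": 17.5}'))
```

RESPONSE_FORMAT would be passed as response_format=RESPONSE_FORMAT in the chat.completions.create call from the Quick Start.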

See Also

  • Model Router — auto-route between these and frontier models per request
  • Compact — shrink context before paying for it
  • WarpGrep — code search for retrieval when context is the bottleneck