Base URL

http://192.222.50.238:8080
A faster proxy endpoint is in progress at:
http://192.222.50.238:9000

Health Check

Check server status and cache performance.
GET /health
curl http://192.222.50.238:8080/health

Response

{
  "status": "healthy",
  "server_role": "standalone",
  "model": "morph-test",
  "gpu_available": true,
  "cache_enabled": true,
  "cache_stats": {
    "enabled": true,
    "hit_rate": 0.92,
    "num_cached_tokens": 15420
  },
  "uptime_seconds": 3847.2
}
  • status (string): Service status; healthy or degraded
  • server_role (string): Server role; one of standalone, prefiller, or decoder
  • model (string): Name of the model being served
  • gpu_available (boolean): Whether a GPU is available and initialized
  • cache_enabled (boolean): Whether prefix caching is enabled
  • cache_stats (object): Cache performance statistics (present only when caching is enabled)
  • uptime_seconds (float): Server uptime in seconds
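The fields above can be consumed with a few lines of Python. `summarize_health` is a hypothetical helper for illustration, not part of any shipped client:

```python
import json

def summarize_health(body: str) -> str:
    """Condense a /health response body into a one-line summary."""
    health = json.loads(body)
    parts = [f"status={health['status']}", f"model={health['model']}"]
    stats = health.get("cache_stats")
    if health.get("cache_enabled") and stats:
        parts.append(f"cache_hit_rate={stats['hit_rate']:.0%}")
    return ", ".join(parts)

# Sample body matching the response shown above:
sample = ('{"status": "healthy", "model": "morph-test", "cache_enabled": true,'
          ' "cache_stats": {"enabled": true, "hit_rate": 0.92, "num_cached_tokens": 15420}}')
print(summarize_health(sample))  # status=healthy, model=morph-test, cache_hit_rate=92%
```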

Generate Prediction

Generate next action prediction from a prompt.
POST /v1/predict
curl -X POST http://192.222.50.238:8080/v1/predict \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "{\"type\":3,\"data\":{\"source\":2,\"type\":6,\"id\":42,\"x\":385,\"y\":127}}\n{\"type\":3,\"data\":{\"source\":2,\"type\":2,\"id\":42,\"x\":385,\"y\":127,\"pointerType\":0}}\n{\"type\":3,\"data\":{\"source\":2,\"type\":1,\"id\":56}}\n{\"type\":3,\"data\":{\"source\":5,\"text\":\"user@example.com\",\"isChecked\":false,\"id\":56}}",
    "max_tokens": 50,
    "temperature": 0.3
  }'

Request Body

  • prompt (string, required): rrweb event data as newline-delimited JSON; each line must be a valid rrweb event object
  • max_tokens (integer, default: 50): Maximum number of tokens to generate (range: 1-512)
  • temperature (float, default: 0.3): Sampling temperature (range: 0.0-2.0); lower values produce more deterministic output
  • stream (boolean, default: false): Enable streaming responses (not yet implemented)

Response

{
  "text": "{\"type\":3,\"data\":{\"source\":2,\"type\":1,\"id\":67}}\n{\"type\":3,\"data\":{\"source\":5,\"text\":\"password123\",\"isChecked\":false,\"id\":67}}",
  "latency_ms": 287,
  "tokens_generated": 42
}
  • text (string): Generated rrweb event predictions as newline-delimited JSON
  • latency_ms (integer): Request processing latency in milliseconds
  • tokens_generated (integer): Number of tokens generated in the response
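A minimal Python sketch for assembling the request body and splitting the response `text` back into event objects. The helper names (`build_predict_payload`, `parse_predictions`) are illustrative, not part of an official client:

```python
import json

def build_predict_payload(events, max_tokens=50, temperature=0.3):
    """Serialize rrweb event dicts into the newline-delimited prompt format."""
    prompt = "\n".join(json.dumps(e, separators=(",", ":")) for e in events)
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}

def parse_predictions(text):
    """Split the response `text` field back into rrweb event dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

events = [
    {"type": 3, "data": {"source": 2, "type": 2, "id": 42,
                         "x": 385, "y": 127, "pointerType": 0}},
]
payload = build_predict_payload(events)
# POST `payload` as JSON to http://192.222.50.238:8080/v1/predict, then:
predicted = parse_predictions('{"type":3,"data":{"source":2,"type":1,"id":67}}')
```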

Error Responses

All errors return JSON with a standard format:
{
  "detail": "Error message describing what went wrong"
}

Status Codes

  • 200 Success: Request completed successfully
  • 400 Bad Request: Invalid request parameters (e.g., temperature out of range)
  • 500 Internal Server Error: Unexpected server error during prediction
  • 503 Service Unavailable: Model not ready or server not initialized
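A small, hypothetical Python helper showing one way to consume the standard error shape. Treating 500 and 503 as retryable is an assumption based on the descriptions above, not documented server behavior:

```python
import json

def describe_error(status_code: int, body: str) -> str:
    """Extract the standard `detail` field and flag whether a retry may help."""
    detail = json.loads(body).get("detail", "unknown error")
    # Assumption: 503 (model not ready) and 500 (transient failure) are worth
    # retrying with backoff; 400 means the request itself must be fixed.
    retryable = status_code in (500, 503)
    return f"{status_code}: {detail} (retryable={retryable})"

print(describe_error(503, '{"detail": "Model not ready"}'))
# 503: Model not ready (retryable=True)
```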

Performance Tips

Optimize Cache Hits: Send rrweb events in consistent session sequences to maximize prefix cache reuse. Events from the same session with consistent ordering will achieve higher cache hit rates and lower latency.
Typical Latency:
  • Single-node: ~800ms (P50), ~1.5s (P99)
  • Disaggregated: ~250ms (P50), ~450ms (P99) (in progress)
  • Cache hit rate of 90%+ dramatically reduces latency for similar event sequences
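One way to keep session sequences consistent, as the tip above suggests, is to accumulate each session's events in arrival order and always send the full log, so consecutive prompts share the same prefix. `SessionPrompt` is an illustrative sketch; the prefix cache itself lives on the server:

```python
import json

class SessionPrompt:
    """Accumulate one session's rrweb events in arrival order so that each
    request's prompt extends the previous one, maximizing prefix-cache reuse."""

    def __init__(self):
        self._lines = []

    def append(self, event: dict) -> None:
        # Compact serialization keeps the prompt byte-for-byte stable.
        self._lines.append(json.dumps(event, separators=(",", ":")))

    def prompt(self) -> str:
        return "\n".join(self._lines)

session = SessionPrompt()
session.append({"type": 3, "data": {"source": 2, "type": 2, "id": 42, "x": 385, "y": 127}})
first = session.prompt()
session.append({"type": 3, "data": {"source": 5, "text": "hi", "id": 56}})
# The new prompt starts with the old one, so its leading tokens are already cached:
assert session.prompt().startswith(first)
```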

rrweb Event Format

The API expects rrweb events as newline-delimited JSON strings. Common event types:
  • Type 2 (FullSnapshot): Full serialization of the page DOM
  • Type 3 (IncrementalSnapshot): User interactions and DOM changes (clicks, input, scroll, mutations)
    • source: 1 = MouseMove
    • source: 2 = MouseInteraction
    • source: 3 = Scroll
    • source: 5 = Input
  • Type 4 (Meta): Page metadata and viewport info
Example event structure:
{
  "type": 3,
  "data": {
    "source": 2,
    "type": 2,
    "id": 42,
    "x": 385,
    "y": 127,
    "pointerType": 0
  }
}
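Before sending a prompt it can help to sanity-check each line. The validator below is a sketch: the checks reflect the rrweb event shape described above, not a documented server-side schema:

```python
import json

def validate_prompt(prompt: str) -> list:
    """Return a list of problems found in a newline-delimited rrweb prompt."""
    problems = []
    for i, line in enumerate(prompt.splitlines(), start=1):
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: not valid JSON")
            continue
        # Incremental events (type 3) carry a numeric data.source field.
        if event.get("type") == 3 and "source" not in event.get("data", {}):
            problems.append(f"line {i}: incremental event missing data.source")
    return problems

ok = '{"type":3,"data":{"source":2,"type":2,"id":42,"x":385,"y":127,"pointerType":0}}'
assert validate_prompt(ok) == []
assert validate_prompt("not json") == ["line 1: not valid JSON"]
```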