Base URL
Health Check
Check server status and cache performance.
Response
Service status: healthy or degraded
Server role: standalone, prefiller, or decoder
Model name being served
Whether GPU is available and initialized
Whether prefix caching is enabled
Cache performance statistics (if caching enabled)
Server uptime in seconds
Generate Prediction
Generate a next-action prediction from a prompt.
Request Body
rrweb event data as newline-delimited JSON. Each line should be a valid rrweb event object
Maximum number of tokens to generate (range: 1-512)
Sampling temperature (range: 0.0-2.0). Lower values produce more deterministic outputs
Enable streaming response (currently not implemented)
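The constraints above can be checked on the client before a request is sent. The following is a minimal sketch, not the service's own client: the field names `prompt`, `max_tokens`, `temperature`, and `stream` are hypothetical (the reference describes the fields but their names are not shown here), and the default values are arbitrary placeholders.

```python
import json


def build_prediction_request(events, max_tokens=128, temperature=0.7, stream=False):
    """Validate parameters and return a JSON-serializable request body.

    All field names in the returned dict are assumptions for illustration;
    rename them to match the actual API.
    """
    if not 1 <= max_tokens <= 512:
        raise ValueError("max_tokens must be in the range 1-512")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in the range 0.0-2.0")
    if stream:
        # The reference states that streaming is currently not implemented.
        raise ValueError("streaming is not implemented")

    # Newline-delimited JSON: one compact rrweb event object per line.
    prompt = "\n".join(json.dumps(e, separators=(",", ":")) for e in events)
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }
```

Validating locally avoids a round trip for requests that the server would reject as invalid.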
Response
Generated rrweb event predictions as newline-delimited JSON
Request processing latency in milliseconds
Number of tokens generated in the response
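Because the predictions come back as newline-delimited JSON inside the response, a client has to split and decode them line by line. A minimal sketch follows; the keys `predictions`, `latency_ms`, and `tokens_generated` are hypothetical stand-ins for the three response fields described above.

```python
import json


def parse_prediction_response(response):
    """Decode the NDJSON predictions from a response dict.

    The keys used here are assumptions for illustration, not confirmed
    field names. Returns (events, latency_ms, tokens_generated).
    """
    events = []
    for line in response["predictions"].splitlines():
        line = line.strip()
        if line:  # tolerate blank lines between events
            events.append(json.loads(line))
    return events, response["latency_ms"], response["tokens_generated"]
```

A line that is not valid JSON raises `json.JSONDecodeError`, which callers may want to catch if partial output is acceptable.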
Error Responses
All errors return JSON with a standard format.
Status Codes
Request completed successfully
Invalid request parameters (e.g., temperature out of range)
Model not ready or server not initialized
Unexpected server error during prediction
Performance Tips
Typical Latency:
- Single-node: ~800ms (P50), ~1.5s (P99)
- Disaggregated: ~250ms (P50), ~450ms (P99) (in progress)
- Cache hit rate of 90%+ dramatically reduces latency for similar event sequences
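To compare a deployment against the P50 and P99 figures above, the per-request latency values returned by the API can be collected and summarized. The sketch below uses the nearest-rank percentile method, which is one common definition and not necessarily the one used to produce the numbers quoted here; the function name is illustrative.

```python
def percentile(latencies_ms, p):
    """Return the p-th percentile (integer p, 1-100) using nearest rank."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(latencies_ms)
    # Nearest rank: ceil(p * n / 100), computed with integer arithmetic
    # to avoid floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

For example, `percentile(samples, 50)` and `percentile(samples, 99)` give the median and tail latency of a list of measured values.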
rrweb Event Format
The API expects rrweb events as newline-delimited JSON strings. Common event types, using rrweb's own numbering:
- Type 3 (IncrementalSnapshot): User interactions (clicks, input, scroll, etc.) and DOM mutations, distinguished by data.source
  - source: 0 = Mutation
  - source: 1 = MouseMove
  - source: 2 = MouseInteraction
  - source: 5 = Input
- Type 4 (Meta): Page metadata and viewport info
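As an illustration of the format, the sketch below serializes two Type 3 events (a MouseInteraction with source 2 and an Input with source 5) as newline-delimited JSON and then counts events by source. The event payloads are made-up sample data shaped like rrweb events, and the helper names are hypothetical.

```python
import json
from collections import Counter

# Sample events for illustration only; ids, coordinates, and timestamps are invented.
SAMPLE_EVENTS = [
    {"type": 3, "data": {"source": 2, "type": 2, "id": 12, "x": 100, "y": 200},
     "timestamp": 1700000000000},
    {"type": 3, "data": {"source": 5, "text": "hello", "isChecked": False, "id": 15},
     "timestamp": 1700000000500},
]

SOURCE_NAMES = {2: "MouseInteraction", 5: "Input"}


def to_ndjson(events):
    """Serialize events as one compact JSON object per line."""
    return "\n".join(json.dumps(e, separators=(",", ":")) for e in events)


def count_sources(ndjson):
    """Count Type 3 events by their data.source value."""
    counts = Counter()
    for line in ndjson.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == 3:
            source = event.get("data", {}).get("source")
            counts[SOURCE_NAMES.get(source, f"source {source}")] += 1
    return dict(counts)
```

Calling `count_sources(to_ndjson(SAMPLE_EVENTS))` gives a quick sanity check that a payload contains the interaction types expected before it is sent.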