Unified Generation Layer
The Unified Generation Layer is a backend architecture improvement introduced by Luker. It consolidates generation logic that was previously scattered across individual API endpoints into one shared module, so that all supported backends are handled through a single wrapper.
Problem Background
In SillyTavern, different AI backends (OpenAI, Anthropic, Google, Kobold, etc.) each have independent code paths. Each backend endpoint file independently handles request construction, streaming response parsing, error handling, and other logic. This leads to the following issues:
- Duplicated code — Similar streaming processing and error retry logic is duplicated across multiple files
- Inconsistent behavior — Token statistics and error response formats differ across backends
- Hard to extend — Adding a new backend requires implementing the complete request/response processing chain from scratch
- No unified metering — Lacks cross-backend token usage tracking capability
Solution
Luker introduces a Unified Generation Layer endpoint (/api/backends/luker-generation) as the primary path for the frontend to initiate AI generation requests. This endpoint receives generation requests from the frontend, forwards them to the corresponding upstream API based on the current Chat Completion Source, and completes streaming response handling, token metering, generation acknowledgment, and persistence within a unified processing pipeline.
Multi-Backend Unified Wrapping
The Unified Generation Layer supports unified processing for the following backends:
- OpenAI and its compatible APIs
- Anthropic (Claude)
- Google (Gemini / Vertex AI)
- Kobold / TabbyAPI
- Other Chat Completion compatible backends
The frontend sends generation requests to the Unified Generation Layer, which routes each request to the correct upstream service, so the frontend does not call each backend's endpoint directly. The individual backend endpoints (chat-completions.js, kobold.js, etc.) still exist and each integrates the Request Inspector on its own, but the Unified Generation Layer provides a more complete processing pipeline.
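The routing step described above can be sketched as a lookup from the current Chat Completion Source to a backend adapter. This is only an illustration: the source keys, the adapter names, and the fallback to an OpenAI-compatible adapter are assumptions made for the example, not Luker's actual identifiers.

```javascript
// Hypothetical mapping from Chat Completion Source to a backend adapter.
// Keys and adapter names are illustrative assumptions, not Luker's real values.
const ADAPTERS = new Map([
  ["openai", "openaiCompatible"],
  ["claude", "anthropic"],
  ["makersuite", "google"],
  ["vertexai", "google"],
  ["kobold", "kobold"],
]);

// Resolve the adapter for a source. Unknown sources fall back to the
// OpenAI-compatible adapter (an assumption standing in for "other
// Chat Completion compatible backends").
function resolveAdapter(chatCompletionSource) {
  const key = String(chatCompletionSource ?? "").toLowerCase();
  return ADAPTERS.get(key) ?? "openaiCompatible";
}
```

Keeping the mapping in one table means a new backend is added by registering one entry and one adapter, rather than by writing a new request/response chain.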
Shared Token Metering
The Unified Generation Layer automatically reports the outcome of every generation request to the Request Inspector:
- After streaming completes, calls completeInspectionFromStream to record token usage from stream events
- On generation failure, calls failInspection to record error information
- On generation abort, calls abortInspection to record the interruption
Note: startInspection is called by the backend endpoint (e.g. chat-completions.js), not by the generation layer itself.
This ensures that regardless of which backend is used, token consumption is accurately tracked.
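As a rough sketch, the three outcomes can be mapped onto exactly one inspector call each. The method names completeInspectionFromStream, failInspection, and abortInspection come from the text above; their argument lists, the outcome object, and the recording inspector used for demonstration are assumptions.

```javascript
// Map a finished generation onto exactly one Request Inspector call.
// Method names are from the documentation; the arguments are assumed.
function finishInspection(inspector, inspectionId, outcome) {
  switch (outcome.status) {
    case "completed":
      return inspector.completeInspectionFromStream(inspectionId, outcome.events);
    case "failed":
      return inspector.failInspection(inspectionId, outcome.error);
    case "aborted":
      return inspector.abortInspection(inspectionId);
    default:
      throw new Error(`Unknown outcome status: ${outcome.status}`);
  }
}

// Hypothetical stand-in inspector that only records which call was made.
function createRecordingInspector() {
  const calls = [];
  return {
    calls,
    completeInspectionFromStream(id, events) {
      calls.push(["complete", id, events.length]);
    },
    failInspection(id, error) {
      calls.push(["fail", id, error.message]);
    },
    abortInspection(id) {
      calls.push(["abort", id]);
    },
  };
}
```

Routing every outcome through one function makes it harder for a code path to end a request without closing its inspection record.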
Unified Streaming Processing
For SSE streaming responses, the Unified Generation Layer provides shared stream parsing logic:
- Parses SSE event formats from different backends
- Extracts token usage information from streaming events
- Handles stream interruption and recovery in a uniform way
- Works with the WebSocket Proxy to support stream offset recovery
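The parsing and usage-extraction steps listed above might look roughly like the following. This is a minimal sketch, not Luker's parser: it assumes OpenAI-style payloads in which a chunk may carry a usage object and the stream ends with a [DONE] sentinel, and the function names are hypothetical. Other backends would need their own extraction rules.

```javascript
// Incremental SSE parser: feed it text chunks as they arrive and it returns
// the data payloads of every event completed so far. Partial events stay in
// the buffer until the blank line that terminates them shows up.
function createSseParser() {
  let buffer = "";
  return {
    push(chunk) {
      // Normalize CRLF on the joined buffer so a "\r\n" split across two
      // chunks is still handled.
      buffer = (buffer + chunk).replace(/\r\n/g, "\n");
      const payloads = [];
      let boundary;
      while ((boundary = buffer.indexOf("\n\n")) !== -1) {
        const rawEvent = buffer.slice(0, boundary);
        buffer = buffer.slice(boundary + 2);
        const data = rawEvent
          .split("\n")
          .filter((line) => line.startsWith("data:"))
          .map((line) => line.slice(5).replace(/^ /, ""))
          .join("\n");
        if (data) payloads.push(data);
      }
      return payloads;
    },
  };
}

// Return the last usage object seen in the payloads, or null if none.
// Assumes OpenAI-style JSON chunks; non-JSON payloads are skipped.
function extractUsage(payloads) {
  let usage = null;
  for (const payload of payloads) {
    if (payload === "[DONE]") continue;
    try {
      const parsed = JSON.parse(payload);
      if (parsed && typeof parsed === "object" && parsed.usage) {
        usage = parsed.usage;
      }
    } catch {
      // Not JSON (for example a keep-alive payload); ignore it.
    }
  }
  return usage;
}
```

Buffering until the event boundary matters because network chunks do not align with SSE events; a JSON payload can be cut in half between two reads.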
Unified Error Handling
Different backends return errors in various formats. The Unified Generation Layer normalizes them into a consistent error response structure, simplifying the frontend's error handling logic.
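To illustrate the idea, the sketch below folds a few common upstream error shapes into one structure. The normalized shape (error, status, code, message) and the function name are assumptions made for this example; the actual structure Luker returns is not specified here.

```javascript
// Normalize an upstream error body into one consistent shape.
// Handles bodies with a nested `error` object (as OpenAI, Anthropic, and
// Google style APIs return), a string `error` field, or a plain string body.
// The returned shape is a hypothetical example, not Luker's real format.
function normalizeUpstreamError(status, body) {
  const inner = body && typeof body === "object" ? body.error : undefined;
  let message;
  let code;
  if (inner && typeof inner === "object") {
    message = inner.message;
    // Different APIs put the machine-readable reason in different fields.
    code = inner.code ?? inner.status ?? inner.type;
  } else if (typeof inner === "string") {
    message = inner;
  } else if (typeof body === "string") {
    message = body;
  }
  return {
    error: true,
    status,
    code: code != null ? String(code) : "unknown",
    message: message || "Upstream request failed",
  };
}
```

With a single shape, the frontend can branch on status and code without knowing which backend produced the error.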
Architecture Relationship
Frontend Request
↓
Individual Backend Endpoint Files (chat-completions.js, kobold.js, etc.)
↓
Unified Generation Layer
├── Token Metering → Request Inspector
├── Streaming Processing → SSE Parsing
└── Error Handling → Normalized Response
↓
Upstream AI Service
NOTE
The Unified Generation Layer is an internal module, transparent to the frontend. Users don't need to be aware of its existence — just enjoy the consistent experience it provides.
Relationship with Other Modules
- Request Inspector — The Unified Generation Layer is the primary caller of the Request Inspector
- Auth & Quota — Storage quota middleware executes before the Unified Generation Layer, intercepting over-quota requests
- chats.js — After generation completes, triggers the acknowledge-generation flow to associate generation results with chat records