Summary
The OpenAI API spec allows pre-tokenized inputs on two endpoints:
/v1/completions:
prompt may be string | string[] | integer[] | integer[][].
/v1/embeddings:
input may be string | string[] | integer[] | integer[][].
iGW's EPP request-body parser supports neither. Clients that tokenize on
their side (a common pattern for latency-sensitive or offline-tokenized
workloads, and some embedding pipelines) fall into one of two failure
modes when routed through iGW:
- Parse failure. Prompt.UnmarshalJSON tries to decode
  [1, 2, 3, ...] into []string and returns
  json: cannot unmarshal number into Go value of type string;
  CompletionsRequest is never populated.
- Silent input-length corruption. If the body survives as a raw /
  unparsed payload (or for embeddings, which we don't extract at all),
  every downstream consumer that derives input length from
  PromptText() ends up with InputTokenLength = 0: PromptText()
  falls through to the default return "" case, and
  strings.Fields("") has length zero.
The second failure mode is the dangerous one — the request still routes,
but every prefix-scorer / in-flight-tokens / SLO / latency-predictor
signal that depends on input length is silently wrong.
Embeddings is in worse shape than completions today: PromptText() has
no Embeddings case at all, so it returns "" for every
embeddings request regardless of input type — string, string array, or
token IDs.
Where it breaks today
Completions parser
pkg/epp/framework/interface/requesthandling/types.go:137-145
— Prompt.UnmarshalJSON only branches on string / []string:
```go
func (p *Prompt) UnmarshalJSON(data []byte) error {
	if len(data) > 0 && data[0] == '"' {
		return json.Unmarshal(data, &p.Raw)
	}
	if len(data) > 0 && data[0] == '[' {
		return json.Unmarshal(data, &p.Strings)
	}
	return errors.New("prompt: must be a string or an array of strings")
}
```
pkg/epp/framework/interface/requesthandling/types.go:157-162
— Prompt.PlainText() only joins p.Strings, so token-ID prompts have
no representation even if we relaxed the unmarshal.
Embeddings has no PromptText path at all
pkg/epp/framework/interface/requesthandling/types.go:82-108
— the PromptText() switch handles Completions, ChatCompletions,
Responses, and Conversations, but not Embeddings. Every embeddings
request falls through to default: return "". EmbeddingsRequest.Input
is stored as any (types.go:256-261)
with no type discrimination.
Downstream consumers that silently produce wrong results
Any consumer that derives input length from the prompt string reads 0
tokens for a token-ID prompt (or for any embeddings request today):
- Latency predictor — scheduling & training. Uses
  len(strings.Fields(promptText)) as InputTokenLength in both the
  prediction request and the training entry (plugin.go:259,
  training.go:50, training.go:78).
  Impact: training data poisoned with InputTokenLength=0; TTFT/TPOT
  predictions collapse to a single bucket; SLO headroom / admission
  decisions are made on a fictitious input length.
- In-flight load token estimator. Uses
  len(request.Body.PromptText()) as an input-size signal
  (token_estimator.go:60).
- Prefix-cache / approximate-prefix plugins that hash or key off
  PlainText() — token-ID prompts would produce an empty key and
  collide across requests.
Proposal (suggested)
1. Extend Prompt to carry token-ID payloads
```go
type Prompt struct {
	Raw        string
	Strings    []string
	TokenIDs   []int   // single pre-tokenized prompt
	TokenIDSet [][]int // batched pre-tokenized prompts
}
```
UnmarshalJSON sniffs the first non-whitespace byte, and for arrays
delegates to a helper that tries []string, then []int, then
[][]int, returning a structured error only if none match. Keep the
existing string / []string fast path unchanged so we don't regress.
2. Give EmbeddingsRequest.Input the same treatment
Today Input is any. Replace (or augment) with a typed wrapper that
mirrors Prompt:
```go
type EmbeddingsInput struct {
	Raw        string
	Strings    []string
	TokenIDs   []int
	TokenIDSet [][]int
}
```
Same UnmarshalJSON sniffing rules. This also fixes the pre-existing
"embeddings PromptText() always returns empty" bug — see §5.
3. Explicit token-count hint
Add a method that callers should prefer over
len(strings.Fields(...)):
```go
// TokenCountHint returns a best-effort input token count when the
// caller knows it exactly (token-ID inputs), or -1 when the count has
// to be estimated from text.
func (p Prompt) TokenCountHint() int
func (e EmbeddingsInput) TokenCountHint() int
```
- For TokenIDs: len(TokenIDs).
- For TokenIDSet: sum of inner lengths (see open question 1).
- For Raw / Strings: -1 (the caller falls back to its current
  word-count approximation).
Add InferenceRequestBody.InputTokenCountHint() int that returns the
populated variant's hint, else -1. This keeps per-consumer changes
small — consumers don't need to know which request type they're
looking at.
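Under those rules, the hint is a few lines; a sketch assuming the Prompt fields from step 1 (summing TokenIDSet follows option (a) of open question 1 and is only illustrative):

```go
package main

import "fmt"

// Prompt mirrors the proposed struct from step 1 (field names assumed).
type Prompt struct {
	Raw        string
	Strings    []string
	TokenIDs   []int
	TokenIDSet [][]int
}

// TokenCountHint returns an exact count for token-ID inputs and -1 when
// the count has to be estimated from text, per the rules above.
func (p Prompt) TokenCountHint() int {
	switch {
	case p.TokenIDs != nil:
		return len(p.TokenIDs)
	case p.TokenIDSet != nil:
		n := 0
		for _, ids := range p.TokenIDSet {
			n += len(ids) // sum of inner lengths; see open question 1
		}
		return n
	default:
		return -1 // Raw / Strings: caller keeps its word-count fallback
	}
}

func main() {
	fmt.Println(Prompt{TokenIDs: []int{128000, 9906, 1917}}.TokenCountHint()) // 3
	fmt.Println(Prompt{TokenIDSet: [][]int{{1, 2}, {3}}}.TokenCountHint())    // 3
	fmt.Println(Prompt{Raw: "hello world"}.TokenCountHint())                  // -1
}
```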
4. Latency-predictor: prefer the hint
Update newPredictedLatencyContext (plugin.go:251-264) and
buildPredictionRequest / buildTrainingEntry (training.go) to use the
hint when it's set:

```go
inputLen := req.Body.InputTokenCountHint()
if inputLen < 0 {
	inputLen = len(strings.Fields(prompt))
}
```
5. PromptText() contract + embeddings fix
- Add an Embeddings case to PromptText() (types.go:82-108) that joins
  Strings when present — this alone fixes the "embeddings always
  returns empty" bug for the common string-input case.
- Leave PromptText() returning "" for pure token-ID inputs (there is
  no text to return), and document on the method that callers needing
  length must use InputTokenCountHint(), not
  strings.Fields(PromptText()).
- Audit inflightload/token_estimator.go and the approximate-prefix
  plugin, and decide per call whether they need a fallback path or can
  be no-op'd for token-ID requests.
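A sketch of the new Embeddings branch, assuming the EmbeddingsInput wrapper from step 2 (the standalone helper name and the space separator are assumptions; the real change lives inside the PromptText() switch):

```go
package main

import (
	"fmt"
	"strings"
)

// EmbeddingsInput mirrors the proposed wrapper from step 2; only the
// text-bearing fields matter for PromptText().
type EmbeddingsInput struct {
	Raw     string
	Strings []string
}

// embeddingsPromptText sketches the new Embeddings case: return Raw when
// set, join Strings when present, and "" for pure token-ID inputs.
func embeddingsPromptText(in EmbeddingsInput) string {
	if in.Raw != "" {
		return in.Raw
	}
	return strings.Join(in.Strings, " ")
}

func main() {
	fmt.Println(embeddingsPromptText(EmbeddingsInput{Raw: "hello world"}))          // hello world
	fmt.Println(embeddingsPromptText(EmbeddingsInput{Strings: []string{"a", "b"}})) // a b
}
```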
Open questions
- Batched token-ID inputs ([[int]]). /v1/completions treats a
  top-level array-of-arrays as a batch of independent prompts;
  /v1/embeddings treats it as a batch of inputs to embed. iGW
  scheduling is per-request, not per-batch. Options:
  a. Accept it and sum the lengths (simple, but masks
     scheduling-semantics issues, since the model server will produce
     N independent outputs from one request).
  b. Accept it and take the max (conservative for scheduling).
  c. Reject with a clear error in v1 and open a separate issue for
     batched-prompt routing.
  I lean toward (c): silently summing or maxing hides the fact that
  batched prompts are a scheduling-semantics problem, not a
  token-count problem.
- Prefix-cache hashing. What should PlainText() return for a
  token-ID input when an approximate-prefix plugin wants to hash it?
  A stable string form of the token IDs (strings.Join over
  strconv.Itoa) is probably fine, but it means token-ID and
  equivalent-text inputs won't share a cache key. We should decide
  whether that's acceptable. Detokenizing in the gateway is a
  non-starter.
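For reference, the stable string form under discussion is a one-liner; a hedged sketch (tokenIDKey is a hypothetical helper, and the separator is arbitrary as long as it's unambiguous):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tokenIDKey renders token IDs as a stable, unambiguous cache key.
// Equivalent text input would hash differently — that's the open question.
func tokenIDKey(ids []int) string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = strconv.Itoa(id)
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(tokenIDKey([]int{128000, 9906, 1917})) // 128000,9906,1917
}
```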
- Usage-based backfill for training. For the training path only, we
  could alternatively read Usage.PromptTokens from the response body
  (authoritative) instead of deriving from the request. Worth doing as
  a parallel improvement, but it doesn't help the scheduling path,
  which has to decide before the response exists.
Out of scope — chat completions token-ID input (needs design, separate issue)
/v1/chat/completions does not officially support token-ID input
in the OpenAI spec. Content parts are typed text | image_url | input_audio | video_url — no token-IDs block. However, vLLM,
SGLang, and some OpenAI-compatible model servers accept pre-tokenized
chat input via non-standard shapes (top-level prompt_token_ids,
custom content parts, etc.), and clients using those extensions will
hit the same InputTokenLength=0 failure mode today.
Supporting chat token-ID input in iGW is not a parser fix and is
intentionally excluded from this issue. It's a design conversation:
- Which non-standard shape(s) does iGW commit to? vLLM's?
SGLang's? Both? A normalized iGW extension?
- Do token IDs replace or augment
messages? (If you pre-tokenize
user content, can you still have a system message? What about tool
messages?)
- How do chat templates interact? If the client pre-tokenized, they
already applied a template, and iGW shouldn't try to reason about
message roles anymore. That has knock-on effects on any plugin that
inspects message structure.
- How does this compose with multi-modal content blocks (image, audio,
video) that aren't pre-tokenized?
Recommend tracking separately and unblocking the completions +
embeddings path here first.
Repro
Completions (token IDs)
```shell
curl -s $GATEWAY/v1/completions \
  -H 'content-type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": [128000, 9906, 1917],
    "max_tokens": 16
  }'
```
Expected: request routes, latency predictor logs show non-zero
InputTokenLength (3).
Actual: request either fails parsing or routes with
InputTokenLength=0 on every training / prediction log line.
Embeddings (any shape — pre-existing bug)
```shell
curl -s $GATEWAY/v1/embeddings \
  -H 'content-type: application/json' \
  -d '{
    "model": "text-embedding-3-small",
    "input": "hello world"
  }'
```
Expected: latency predictor / load estimator sees a non-zero input
length.
Actual: PromptText() returns "" regardless of input shape,
InputTokenLength=0 across the board.
Acceptance criteria (suggested)
- Prompt.UnmarshalJSON accepts []int and [][]int (the latter may be
  explicitly rejected with a clear error — see open question 1).
- EmbeddingsRequest.Input is typed (or wrapped) and supports string,
  []string, []int, [][]int.
- InferenceRequestBody.PromptText() handles the Embeddings case for
  string inputs (fixes the pre-existing empty-string bug).
- Prompt.TokenCountHint() and EmbeddingsInput.TokenCountHint() return
  exact token counts for token-ID inputs, -1 otherwise.
- InferenceRequestBody.InputTokenCountHint() exposes a unified hint
  across request types.
- The predictedlatency data producer uses the hint when present for
  InputTokenLength in both training entries and prediction requests,
  and falls back to strings.Fields otherwise.
- Tests for Prompt.UnmarshalJSON and EmbeddingsInput.UnmarshalJSON
  cover string, []string, []int, [][]int, and malformed input.
- A test shows newPredictedLatencyContext produces a non-zero
  inputTokenCount for a token-ID prompt and for a string embeddings
  input.
- PromptText() documentation states that it is not a source of truth
  for token count.
Related
- OpenAI /v1/completions: prompt accepts string | string[] | integer[] | integer[][]:
  https://platform.openai.com/docs/api-reference/completions/create#completions-create-prompt
- OpenAI /v1/embeddings: input accepts string | string[] | integer[] | integer[][]:
  https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-input
- The prompt_token_ids pattern (how clients commonly ship
  pre-tokenized input today).