
# Support token-ID prompts ([int] / [[int]]) in /v1/completions request parsing #2835

@kaushikmitr

Description

Summary

The OpenAI API spec allows pre-tokenized inputs on two endpoints:

  • /v1/completions:
    prompt may be string | string[] | integer[] | integer[][].
  • /v1/embeddings:
    input may be string | string[] | integer[] | integer[][].

iGW's EPP request-body parser supports neither. Clients that tokenize on
their side (a common pattern for latency-sensitive or offline-tokenized
workloads, and for some embedding pipelines) fall
into one of two failure modes when routed through iGW:

  1. Parse failure. Prompt.UnmarshalJSON tries to decode
    [1, 2, 3, ...] into []string and returns
    json: cannot unmarshal number into Go value of type string, so
    CompletionsRequest is never populated.
  2. Silent input-length corruption. If the body survives as a raw /
    unparsed payload (or for embeddings, which we don't extract at all),
    every downstream consumer that derives input length from
    PromptText() ends up with InputTokenLength = 0: PromptText()
    falls through to the default return "" case, and
    len(strings.Fields("")) is zero.

The second failure mode is the dangerous one — the request still routes,
but every prefix-scorer, in-flight-token, SLO, and latency-predictor
signal that depends on input length is silently wrong.

Embeddings is in worse shape than completions today: PromptText() has
no Embeddings case at all, so it returns "" for every
embeddings request regardless of input type — string, string array, or
token IDs.

Where it breaks today

Completions parser

pkg/epp/framework/interface/requesthandling/types.go:137-145
Prompt.UnmarshalJSON only branches on string / []string:

func (p *Prompt) UnmarshalJSON(data []byte) error {
    if len(data) > 0 && data[0] == '"' {
        return json.Unmarshal(data, &p.Raw)
    }
    if len(data) > 0 && data[0] == '[' {
        return json.Unmarshal(data, &p.Strings)
    }
    return errors.New("prompt: must be a string or an array of strings")
}

pkg/epp/framework/interface/requesthandling/types.go:157-162
Prompt.PlainText() only joins p.Strings, so token-ID prompts have
no representation even if we relaxed the unmarshal.

Embeddings has no PromptText path at all

pkg/epp/framework/interface/requesthandling/types.go:82-108
— the PromptText() switch handles Completions, ChatCompletions,
Responses, and Conversations, but not Embeddings. Every embeddings
request falls through to default: return "". EmbeddingsRequest.Input
is stored as any (types.go:256-261)
with no type discrimination.

Downstream consumers that silently produce wrong results

Any consumer that derives input length from the prompt string reads 0
tokens for a token-ID prompt (or for any embeddings request today):

  • Latency predictor — scheduling & training. Uses
    len(strings.Fields(promptText)) as InputTokenLength in both the
    prediction request and the training entry.

    Impact: training data poisoned with InputTokenLength=0; TTFT/TPOT
    predictions collapse to a single bucket; SLO headroom / admission
    decisions are made on a fictitious input length.

  • In-flight load token estimator. Uses
    len(request.Body.PromptText()) as an input-size signal
    (token_estimator.go:60).

  • Prefix-cache / approximate-prefix plugins that hash or key off
    PlainText() — token-ID prompts would produce an empty key and
    collide across requests.

Proposal (suggested)

1. Extend Prompt to carry token-ID payloads

type Prompt struct {
    Raw        string
    Strings    []string
    TokenIDs   []int     // single pre-tokenized prompt
    TokenIDSet [][]int   // batched pre-tokenized prompts
}

UnmarshalJSON sniffs the first non-whitespace byte, and for arrays
delegates to a helper that tries []string, then []int, then
[][]int, returning a structured error only if none match. Keep the
existing string / []string fast path unchanged so we don't regress.

2. Give EmbeddingsRequest.Input the same treatment

Today Input is any. Replace (or augment) with a typed wrapper that
mirrors Prompt:

type EmbeddingsInput struct {
    Raw        string
    Strings    []string
    TokenIDs   []int
    TokenIDSet [][]int
}

Same UnmarshalJSON sniffing rules. This also fixes the pre-existing
"embeddings PromptText() always returns empty" bug — see §5.

3. Explicit token-count hint

Add a method that callers should prefer over
len(strings.Fields(...)):

// TokenCountHint returns a best-effort input token count when the
// caller knows it exactly (token-ID inputs), or -1 when the count has
// to be estimated from text.
func (p Prompt) TokenCountHint() int
func (e EmbeddingsInput) TokenCountHint() int
  • For TokenIDs: len(TokenIDs).
  • For TokenIDSet: sum of inner lengths (see open question 1).
  • For Raw / Strings: -1 (caller falls back to its current
    word-count approximation).

Add InferenceRequestBody.InputTokenCountHint() int that returns the
populated variant's hint, else -1. This keeps per-consumer changes
small — consumers don't need to know which request type they're
looking at.

4. Latency-predictor: prefer the hint

Update newPredictedLatencyContext
(plugin.go:251-264)
and buildPredictionRequest / buildTrainingEntry
(training.go)
to prefer the hint when it's set:

inputLen := req.Body.InputTokenCountHint()
if inputLen < 0 {
    inputLen = len(strings.Fields(prompt))
}

5. PromptText() contract + embeddings fix

  • Add an Embeddings case to PromptText()
    (types.go:82-108)
    that joins Strings when present — this alone fixes the "embeddings
    always returns empty" bug for the common string-input case.
  • Leave PromptText() returning "" for pure token-ID inputs (there
    is no text to return), and document on the method that callers
    needing length must use InputTokenCountHint(), not
    strings.Fields(PromptText()).
  • Audit inflightload/token_estimator.go and the approximate-prefix
    plugin, and decide per-call whether they need a fallback path or can
    be no-op'd for token-ID requests.

Open questions

  1. Batched token-ID inputs ([[int]]). /v1/completions treats a
    top-level array-of-arrays as a batch of independent prompts;
    /v1/embeddings treats it as a batch of inputs to embed. iGW
    scheduling is per-request, not per-batch. Options:
    a. Accept it and sum the lengths (simple, but masks
    scheduling-semantics issues since the model server will produce
    N independent outputs from one request).
    b. Accept it and take the max (conservative for scheduling).
    c. Reject with a clear error in v1 and open a separate issue for
    batched-prompt routing.

    I lean toward (c): silently summing or maxing hides the fact that
    batched prompts are a scheduling-semantics problem, not a token-count
    problem.

  2. Prefix-cache hashing. What should PlainText() return for a
    token-ID input when an approximate-prefix plugin wants to hash it?
    A stable string form of the token IDs
    (strings.Join over strconv.Itoa) is probably fine, but it means
    token-ID and equivalent-text inputs won't share a cache key. We
    should decide whether that's acceptable. Detokenizing in the gateway
    is a non-starter.

  3. Usage-based backfill for training. For the training path only,
    we could alternatively read Usage.PromptTokens from the response
    body (authoritative) instead of deriving from the request. Worth
    doing as a parallel improvement, but it doesn't help the scheduling
    path, which has to decide before the response exists.

Out of scope — chat completions token-ID input (needs design, separate issue)

/v1/chat/completions does not officially support token-ID input
in the OpenAI spec. Content parts are typed text | image_url | input_audio | video_url — no token-IDs block. However, vLLM,
SGLang, and some OpenAI-compatible model servers accept pre-tokenized
chat input via non-standard shapes (top-level prompt_token_ids,
custom content parts, etc.), and clients using those extensions will
hit the same InputTokenLength=0 failure mode today.

Supporting chat token-ID input in iGW is not a parser fix and is
intentionally excluded from this issue. It's a design conversation:

  1. Which non-standard shape(s) does iGW commit to? vLLM's?
    SGLang's? Both? A normalized iGW extension?
  2. Do token IDs replace or augment messages? (If you pre-tokenize
    user content, can you still have a system message? What about tool
    messages?)
  3. How do chat templates interact? If the client pre-tokenized, they
    already applied a template, and iGW shouldn't try to reason about
    message roles anymore. That has knock-on effects on any plugin that
    inspects message structure.
  4. How does this compose with multi-modal content blocks (image, audio,
    video) that aren't pre-tokenized?

Recommend tracking separately and unblocking the completions +
embeddings path here first.

Repro

Completions (token IDs)

curl -s $GATEWAY/v1/completions \
  -H 'content-type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": [128000, 9906, 1917],
    "max_tokens": 16
  }'

Expected: request routes, latency predictor logs show non-zero
InputTokenLength (3).

Actual: request either fails parsing or routes with
InputTokenLength=0 on every training / prediction log line.

Embeddings (any shape — pre-existing bug)

curl -s $GATEWAY/v1/embeddings \
  -H 'content-type: application/json' \
  -d '{
    "model": "text-embedding-3-small",
    "input": "hello world"
  }'

Expected: latency predictor / load estimator sees a non-zero input
length.

Actual: PromptText() returns "" regardless of input shape,
InputTokenLength=0 across the board.

Acceptance criteria (suggested)

  • Prompt.UnmarshalJSON accepts []int and [][]int (latter may
    be explicitly rejected with a clear error — see open question 1).
  • EmbeddingsRequest.Input is typed (or wrapped) and supports
    string, []string, []int, [][]int.
  • InferenceRequestBody.PromptText() handles the Embeddings
    case for string inputs (fixes the pre-existing empty-string bug).
  • Prompt.TokenCountHint() and EmbeddingsInput.TokenCountHint()
    return exact token counts for token-ID inputs, -1 otherwise.
  • InferenceRequestBody.InputTokenCountHint() exposes a unified
    hint across request types.
  • predictedlatency dataproducer uses the hint when present for
    both InputTokenLength in training entries and prediction
    requests, and falls back to strings.Fields otherwise.
  • Unit tests for Prompt.UnmarshalJSON and
    EmbeddingsInput.UnmarshalJSON covering: string, []string,
    []int, [][]int, and malformed input.
  • Unit test showing newPredictedLatencyContext produces a
    non-zero inputTokenCount for a token-ID prompt and for a
    string embeddings input.
  • Documentation note on PromptText() that it is not a source of
    truth for token count.

Labels: needs-triage