Summary
The OpenAI API spec allows pre-tokenized inputs on two endpoints:
/v1/completions:
prompt may be string | string[] | integer[] | integer[][].
/v1/embeddings:
input may be string | string[] | integer[] | integer[][].
iGW's EPP request-body parser supports neither. Clients that tokenize on
their side (a common pattern for latency-sensitive or offline-tokenized
workloads, and some embedding pipelines) fall into one of two failure
modes when routed through iGW:
- Parse failure. Prompt.UnmarshalJSON tries to decode
  [1, 2, 3, ...] into []string and returns
  json: cannot unmarshal number into Go value of type string;
  CompletionsRequest is never populated.
- Silent input-length corruption. If the body survives as a raw /
  unparsed payload (or for embeddings, which we don't extract at all),
  every downstream consumer that derives input length from
  PromptText() ends up with InputTokenLength = 0: PromptText()
  falls through to the default return "" case, and
  strings.Fields("") has length zero.
The second failure mode is the dangerous one — the request still routes,
but every prefix-scorer / in-flight-tokens / SLO / latency-predictor
signal that depends on input length is silently wrong.
Embeddings is in worse shape than completions today: PromptText() has
no Embeddings case at all, so it returns "" for every
embeddings request regardless of input type — string, string array, or
token IDs.
Where it breaks today
Completions parser
pkg/epp/framework/interface/requesthandling/types.go:137-145
— Prompt.UnmarshalJSON only branches on string / []string:
```go
func (p *Prompt) UnmarshalJSON(data []byte) error {
	if len(data) > 0 && data[0] == '"' {
		return json.Unmarshal(data, &p.Raw)
	}
	if len(data) > 0 && data[0] == '[' {
		return json.Unmarshal(data, &p.Strings)
	}
	return errors.New("prompt: must be a string or an array of strings")
}
```
pkg/epp/framework/interface/requesthandling/types.go:157-162
— Prompt.PlainText() only joins p.Strings, so token-ID prompts have
no representation even if we relaxed the unmarshal.
Embeddings has no PromptText path at all
pkg/epp/framework/interface/requesthandling/types.go:82-108
— the PromptText() switch handles Completions, ChatCompletions,
Responses, and Conversations, but not Embeddings. Every embeddings
request falls through to default: return "". EmbeddingsRequest.Input
is stored as any (types.go:256-261)
with no type discrimination.
Downstream consumers that silently produce wrong results
Any consumer that derives input length from the prompt string reads 0
tokens for a token-ID prompt (or for any embeddings request today):
- Latency predictor — scheduling & training. Uses
  len(strings.Fields(promptText)) as InputTokenLength in both the
  prediction request and the training entry (plugin.go:259,
  training.go:50, training.go:78).
  Impact: training data poisoned with InputTokenLength=0; TTFT/TPOT
  predictions collapse to a single bucket; SLO headroom / admission
  decisions are made on a fictitious input length.
- In-flight load token estimator. Uses
  len(request.Body.PromptText()) as an input-size signal
  (token_estimator.go:60).
- Prefix-cache / approximate-prefix plugins that hash or key off
  PlainText() — token-ID prompts would produce an empty key and
  collide across requests.
Proposal (suggested)
1. Extend Prompt to carry token-ID payloads
```go
type Prompt struct {
	Raw        string
	Strings    []string
	TokenIDs   []int   // single pre-tokenized prompt
	TokenIDSet [][]int // batched pre-tokenized prompts
}
```
UnmarshalJSON sniffs the first non-whitespace byte, and for arrays
delegates to a helper that tries []string, then []int, then
[][]int, returning a structured error only if none match. Keep the
existing string / []string fast path unchanged so we don't regress.
2. Give EmbeddingsRequest.Input the same treatment
Today Input is any. Replace (or augment) with a typed wrapper that
mirrors Prompt:
```go
type EmbeddingsInput struct {
	Raw        string
	Strings    []string
	TokenIDs   []int
	TokenIDSet [][]int
}
```
Same UnmarshalJSON sniffing rules. This also fixes the pre-existing
"embeddings PromptText() always returns empty" bug — see §5.
3. Explicit token-count hint
Add a method that callers should prefer over
len(strings.Fields(...)):
```go
// TokenCountHint returns a best-effort input token count when the
// caller knows it exactly (token-ID inputs), or -1 when the count has
// to be estimated from text.
func (p Prompt) TokenCountHint() int
func (e EmbeddingsInput) TokenCountHint() int
```
- For TokenIDs: len(TokenIDs).
- For TokenIDSet: sum of inner lengths (see open question 1).
- For Raw / Strings: -1 (the caller falls back to its current
  word-count approximation).
Add InferenceRequestBody.InputTokenCountHint() int that returns the
populated variant's hint, else -1. This keeps per-consumer changes
small — consumers don't need to know which request type they're
looking at.
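Under those rules, the hint is a few lines; a sketch assuming the Prompt fields from step 1 (summing TokenIDSet follows option (a) of open question 1 and is only illustrative):

```go
package main

import "fmt"

// Prompt mirrors the proposed struct from step 1 (field names assumed).
type Prompt struct {
	Raw        string
	Strings    []string
	TokenIDs   []int
	TokenIDSet [][]int
}

// TokenCountHint returns an exact count for token-ID inputs and -1 when
// the count has to be estimated from text, per the rules above.
func (p Prompt) TokenCountHint() int {
	switch {
	case p.TokenIDs != nil:
		return len(p.TokenIDs)
	case p.TokenIDSet != nil:
		n := 0
		for _, ids := range p.TokenIDSet {
			n += len(ids) // sum of inner lengths; see open question 1
		}
		return n
	default:
		return -1 // Raw / Strings: caller keeps its word-count fallback
	}
}

func main() {
	fmt.Println(Prompt{TokenIDs: []int{128000, 9906, 1917}}.TokenCountHint()) // 3
	fmt.Println(Prompt{TokenIDSet: [][]int{{1, 2}, {3}}}.TokenCountHint())    // 3
	fmt.Println(Prompt{Raw: "hello world"}.TokenCountHint())                  // -1
}
```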
4. Latency-predictor: prefer the hint
Update newPredictedLatencyContext (plugin.go:251-264) and
buildPredictionRequest / buildTrainingEntry (training.go) to use the
hint when it's set:

```go
inputLen := req.Body.InputTokenCountHint()
if inputLen < 0 {
	inputLen = len(strings.Fields(prompt))
}
```
5. PromptText() contract + embeddings fix
- Add an Embeddings case to PromptText() (types.go:82-108) that joins
  Strings when present — this alone fixes the "embeddings always
  returns empty" bug for the common string-input case.
- Leave PromptText() returning "" for pure token-ID inputs (there is
  no text to return), and document on the method that callers needing
  length must use InputTokenCountHint(), not
  strings.Fields(PromptText()).
- Audit inflightload/token_estimator.go and the approximate-prefix
  plugin, and decide per call whether they need a fallback path or can
  be no-op'd for token-ID requests.
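A sketch of the new Embeddings branch, assuming the EmbeddingsInput wrapper from step 2 (the standalone helper name and the space separator are assumptions; the real change lives inside the PromptText() switch):

```go
package main

import (
	"fmt"
	"strings"
)

// EmbeddingsInput mirrors the proposed wrapper from step 2; only the
// text-bearing fields matter for PromptText().
type EmbeddingsInput struct {
	Raw     string
	Strings []string
}

// embeddingsPromptText sketches the new Embeddings case: return Raw when
// set, join Strings when present, and "" for pure token-ID inputs.
func embeddingsPromptText(in EmbeddingsInput) string {
	if in.Raw != "" {
		return in.Raw
	}
	return strings.Join(in.Strings, " ")
}

func main() {
	fmt.Println(embeddingsPromptText(EmbeddingsInput{Raw: "hello world"}))          // hello world
	fmt.Println(embeddingsPromptText(EmbeddingsInput{Strings: []string{"a", "b"}})) // a b
}
```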
Open questions
- Batched token-ID inputs ([[int]]). /v1/completions treats a
  top-level array-of-arrays as a batch of independent prompts;
  /v1/embeddings treats it as a batch of inputs to embed. iGW
  scheduling is per-request, not per-batch. Options:
  a. Accept it and sum the lengths (simple, but masks
     scheduling-semantics issues, since the model server will produce
     N independent outputs from one request).
  b. Accept it and take the max (conservative for scheduling).
  c. Reject with a clear error in v1 and open a separate issue for
     batched-prompt routing.
  I lean toward (c): silently summing or maxing hides the fact that
  batched prompts are a scheduling-semantics problem, not a
  token-count problem.
- Prefix-cache hashing. What should PlainText() return for a
  token-ID input when an approximate-prefix plugin wants to hash it?
  A stable string form of the token IDs (strings.Join over
  strconv.Itoa) is probably fine, but it means token-ID and
  equivalent-text inputs won't share a cache key. We should decide
  whether that's acceptable. Detokenizing in the gateway is a
  non-starter.
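For reference, the stable string form under discussion is a one-liner; a hedged sketch (tokenIDKey is a hypothetical helper, and the separator is arbitrary as long as it's unambiguous):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tokenIDKey renders token IDs as a stable, unambiguous cache key.
// Equivalent text input would hash differently — that's the open question.
func tokenIDKey(ids []int) string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = strconv.Itoa(id)
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(tokenIDKey([]int{128000, 9906, 1917})) // 128000,9906,1917
}
```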
- Usage-based backfill for training. For the training path only, we
  could alternatively read Usage.PromptTokens from the response body
  (authoritative) instead of deriving from the request. Worth doing as
  a parallel improvement, but it doesn't help the scheduling path,
  which has to decide before the response exists.
Out of scope — chat completions token-ID input (needs design, separate issue)
/v1/chat/completions does not officially support token-ID input
in the OpenAI spec. Content parts are typed text | image_url | input_audio | video_url — no token-IDs block. However, vLLM,
SGLang, and some OpenAI-compatible model servers accept pre-tokenized
chat input via non-standard shapes (top-level prompt_token_ids,
custom content parts, etc.), and clients using those extensions will
hit the same InputTokenLength=0 failure mode today.
Supporting chat token-ID input in iGW is not a parser fix and is
intentionally excluded from this issue. It's a design conversation:
- Which non-standard shape(s) does iGW commit to? vLLM's?
SGLang's? Both? A normalized iGW extension?
- Do token IDs replace or augment
messages? (If you pre-tokenize
user content, can you still have a system message? What about tool
messages?)
- How do chat templates interact? If the client pre-tokenized, they
already applied a template, and iGW shouldn't try to reason about
message roles anymore. That has knock-on effects on any plugin that
inspects message structure.
- How does this compose with multi-modal content blocks (image, audio,
video) that aren't pre-tokenized?
Recommend tracking separately and unblocking the completions +
embeddings path here first.
Repro
Completions (token IDs)
```shell
curl -s $GATEWAY/v1/completions \
  -H 'content-type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": [128000, 9906, 1917],
    "max_tokens": 16
  }'
```
Expected: request routes, latency predictor logs show non-zero
InputTokenLength (3).
Actual: request either fails parsing or routes with
InputTokenLength=0 on every training / prediction log line.
Embeddings (any shape — pre-existing bug)
```shell
curl -s $GATEWAY/v1/embeddings \
  -H 'content-type: application/json' \
  -d '{
    "model": "text-embedding-3-small",
    "input": "hello world"
  }'
```
Expected: latency predictor / load estimator sees a non-zero input
length.
Actual: PromptText() returns "" regardless of input shape,
InputTokenLength=0 across the board.
Acceptance criteria (suggested)
- Prompt.UnmarshalJSON accepts []int and [][]int (the latter may be
  explicitly rejected with a clear error — see open question 1).
- EmbeddingsRequest.Input is typed (or wrapped) and supports string,
  []string, []int, [][]int.
- InferenceRequestBody.PromptText() handles the Embeddings case for
  string inputs (fixes the pre-existing empty-string bug).
- Prompt.TokenCountHint() and EmbeddingsInput.TokenCountHint() return
  exact token counts for token-ID inputs, -1 otherwise.
- InferenceRequestBody.InputTokenCountHint() exposes a unified hint
  across request types.
- The predictedlatency data producer uses the hint when present for
  InputTokenLength in both training entries and prediction requests,
  and falls back to strings.Fields otherwise.
- Tests for Prompt.UnmarshalJSON and EmbeddingsInput.UnmarshalJSON
  cover string, []string, []int, [][]int, and malformed input.
- A test shows newPredictedLatencyContext produces a non-zero
  inputTokenCount for a token-ID prompt and for a string embeddings
  input.
- PromptText() documentation states that it is not a source of truth
  for token count.
Related
- OpenAI /v1/completions: prompt accepts string | string[] | integer[] | integer[][]:
  https://platform.openai.com/docs/api-reference/completions/create#completions-create-prompt
- OpenAI /v1/embeddings: input accepts string | string[] | integer[] | integer[][]:
  https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-input
- The prompt_token_ids pattern (how clients commonly ship
  pre-tokenized input today).