ContentDateEnrichment: Filter _update_by_query to only unresolved documents by reakaleek · Pull Request #3118 · elastic/docs-builder

reakaleek · 2026-04-15T22:37:08Z

What

Filter _update_by_query in ResolveContentDatesAsync to only process documents that don't already have a valid content_last_updated, instead of re-indexing every document in the index.

Why

The unfiltered _update_by_query on the semantic index triggers ML inference for all 6 semantic_text fields on every document — causing the deploy workflow to hang for 3+ hours (see cancelled run). The lexical index is fast because it has no semantic_text fields.

How

After HashedBulkUpdate, unchanged documents (noop) retain their resolved content_last_updated from the previous run. Only new/changed documents have the field at DateTimeOffset.MinValue (0001-01-01). A must_not { range: gt 1970 } filter restricts _update_by_query to only these unresolved documents — typically a tiny fraction of the index.

Also enhances integration tests to use real HashedBulkUpdate-style scripted upserts with full DocumentationDocument serialization.

Test plan

Discovery test confirms HashedBulkUpdate writes content_last_updated: "0001-01-01T00:00:00+00:00" for default documents
Filter test confirms resolved docs are skipped, unresolved docs are processed
End-to-end test confirms second run only processes 1 changed doc out of 3 (visible in _update_by_query log: Documents 1: 1 updated)
All 7 integration tests pass (existing + 3 new)

Notes

Root cause: semantic_text inference fires during _update_by_query re-indexing, even when text hasn't changed
The filter handles both missing fields and DateTimeOffset.MinValue values
First-ever run still processes all documents (all have default dates), but subsequent runs are fast

🤖 Generated with Claude Code

…uments The _update_by_query in ResolveContentDatesAsync was re-indexing every document in both the lexical and semantic indices. On the semantic index, this triggered ML inference for all 6 semantic_text fields on every document — causing the deploy workflow to hang for 3+ hours. After HashedBulkUpdate, unchanged documents (noop) retain their resolved content_last_updated from the previous run. Only new/changed documents have the field at the default DateTimeOffset.MinValue (0001-01-01). The filter restricts _update_by_query to only these unresolved documents, reducing the typical deploy from hundreds of thousands of documents to just the changed ones. Also enhances integration tests to use real HashedBulkUpdate-style scripted upserts with full DocumentationDocument serialization, and adds tests proving the filter behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-15T22:42:15Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 55792808-fde2-4148-8d44-d749e553fa88

📥 Commits

Reviewing files that changed from the base of the PR and between fe54792 and 7cd628f.

📒 Files selected for processing (1)

tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs

📝 Walkthrough

Walkthrough

ResolveContentDatesAsync now calls Elasticsearch's _update_by_query with an explicit JSON query (PostData.String(UnresolvedContentDatesQuery)) that excludes documents whose content_last_updated is greater than 1970-01-01T00:00:00Z, rather than using PostData.Empty. Integration tests were rewritten to index full serialized DocumentationDocument via a new hash-aware scripted upsert helper, updated assertions for default/pre-epoch content_last_updated behavior, and added tests for stored representation, resolution filtering, and a two-pass end-to-end scenario.

Sequence Diagram(s)

sequenceDiagram
  participant App as Application
  participant ES as Elasticsearch
  participant Pipeline as Ingest Pipeline

  App->>ES: _update_by_query (query: content_last_updated <= "1970-01-01T00:00:00Z", pipeline=content-date-enrich)
  ES->>ES: find matching documents
  ES->>Pipeline: apply ingest pipeline to each matching document
  Pipeline-->>ES: return enriched document
  ES-->>App: _update_by_query response (updated count)

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: filtering _update_by_query to process only unresolved documents with missing or default content_last_updated timestamps.
Description check	✅ Passed	The description comprehensively explains what changed, why (performance issue with ML inference), how (filtering by content_last_updated timestamp), and includes a clear test plan validating the changes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

✨ Simplify code

Create PR with simplified code
Commit simplified code in branch royal-pancake

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs (1)
307-308: Minor: Comment is slightly misleading.

The comment says "no content_last_updated" but IndexDocumentsDirectly now uses full DocumentationDocument serialization which includes content_last_updated at the default 0001-01-01 value. Consider updating to match doc2's comment pattern.
Proposed fix
-		// doc1: no content_last_updated (simulates scripted upsert that omits the field)
+		// doc1: content_last_updated at default DateTimeOffset.MinValue (simulates production scripted upsert)
		await IndexDocumentsDirectly(index, ("url1", "hash_a", "Doc 1"));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`
around lines 307 - 308, Update the test comment to accurately reflect what
IndexDocumentsDirectly now serializes: instead of saying "no
content_last_updated" state that the document is indexed with the default
DocumentationDocument.content_last_updated value (DateTime.MinValue/0001-01-01),
matching the doc2 comment pattern; locate the comment above the call to
IndexDocumentsDirectly in ContentDateEnrichmentTests.cs and amend the text to
mention the default/min value rather than omitting the field.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 307-308: Update the test comment to accurately reflect what
IndexDocumentsDirectly now serializes: instead of saying "no
content_last_updated" state that the document is indexed with the default
DocumentationDocument.content_last_updated value (DateTime.MinValue/0001-01-01),
matching the doc2 comment pattern; locate the comment above the call to
IndexDocumentsDirectly in ContentDateEnrichmentTests.cs and amend the text to
mention the default/min value rather than omitting the field.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 84a64253-e054-48d3-be04-19248f5a77ad

📥 Commits

Reviewing files that changed from the base of the PR and between 08d43d9 and e25ef46.

📒 Files selected for processing (2)

src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cs
tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs

- Extract query JsonObject to static readonly string (avoid re-allocation per call) - Remove debug output.WriteLine from discovery test - Fix double JsonNode.Parse — reuse parsed node for params.doc - Use const for hash field name - Fix misleading comment on doc1 in FilteredResolve test - Remove unnecessary null-forgiving operators Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 576-582: The current assertion only checks
response.ApiCallDetails.HasSuccessfulStatusCode which can be true even if some
bulk items failed; update the test after the PostAsync<StringResponse> call to
assert that the bulk response reports no item errors by parsing or inspecting
the response body and asserting errors == false (e.g., deserialize response.Body
to the appropriate bulk-response model or search the JSON for "errors": false)
in addition to the existing HasSuccessfulStatusCode check so that variable
response (returned by PostAsync<StringResponse>) explicitly verifies no per-item
failures.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 513dae87-38ce-4451-b9f2-12d2601cd2b9

📥 Commits

Reviewing files that changed from the base of the PR and between e25ef46 and 2791de7.

📒 Files selected for processing (2)

src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cs
tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs

🚧 Files skipped from review as they are similar to previous changes (1)

src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cs

The existing check only verified HTTP status code, which can be 200 even when individual bulk items fail. Parse the response body and assert "errors": false. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs (1)
102-103: Avoid hard-coded year checks in resolved-date assertions.

Using Year >= 2026 makes these tests unnecessarily time-dependent. The behavior under test is “resolved date is not default/unresolved,” so compare against DateTimeOffset.UnixEpoch (or a captured test start time) instead.

Also applies to: 214-215, 232-233, 341-342, 345-345
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`
around lines 102 - 103, The assertion currently checks
doc.ContentLastUpdated.Value.Year.Should().BeGreaterThanOrEqualTo(2026) which
hard-codes a year; replace these time-dependent checks (the assertions
referencing doc.ContentLastUpdated.Value.Year in ContentDateEnrichmentTests)
with a comparison against a non-default baseline such as
DateTimeOffset.UnixEpoch (or a captured test start time variable) — e.g., assert
doc.ContentLastUpdated.Should().BeAfter(DateTimeOffset.UnixEpoch) or
BeAfter(testStart) — and update the other occurrences referenced in the comment
(the assertions at the other noted locations) similarly so you verify “not
default/unresolved” rather than a specific year.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 584-586: The test currently treats a missing or non-boolean
"errors" field as false; change the assertion to first verify the bulkResult
JSON exists and contains an "errors" property of the expected boolean shape
(e.g., ensure bulkResult != null and bulkResult["errors"] is present and can be
parsed as bool) and fail the test if the property is missing or malformed, then
assert that that boolean value is false; reference the variables bulkResult and
hasErrors in ContentDateEnrichmentTests to locate and update the validation
logic.

---

Nitpick comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 102-103: The assertion currently checks
doc.ContentLastUpdated.Value.Year.Should().BeGreaterThanOrEqualTo(2026) which
hard-codes a year; replace these time-dependent checks (the assertions
referencing doc.ContentLastUpdated.Value.Year in ContentDateEnrichmentTests)
with a comparison against a non-default baseline such as
DateTimeOffset.UnixEpoch (or a captured test start time variable) — e.g., assert
doc.ContentLastUpdated.Should().BeAfter(DateTimeOffset.UnixEpoch) or
BeAfter(testStart) — and update the other occurrences referenced in the comment
(the assertions at the other noted locations) similarly so you verify “not
default/unresolved” rather than a specific year.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 2731e163-6446-4342-9111-596cd7a4f5f2

📥 Commits

Reviewing files that changed from the base of the PR and between 2791de7 and fe54792.

📒 Files selected for processing (1)

tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs

Assert bulkResult and its "errors" property exist before checking the boolean value, rather than defaulting to false on missing/malformed JSON. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

reakaleek requested a review from a team as a code owner April 15, 2026 22:37

reakaleek requested a review from cotti April 15, 2026 22:37

github-actions bot deployed to assembler-preview April 15, 2026 22:37 View deployment

coderabbitai bot added the fix label Apr 15, 2026

coderabbitai bot reviewed Apr 15, 2026

View reviewed changes

coderabbitai bot approved these changes Apr 15, 2026

View reviewed changes

github-actions bot deployed to assembler-preview April 15, 2026 22:44 View deployment

coderabbitai bot requested changes Apr 15, 2026

View reviewed changes

Comment thread tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs

Tests: Assert bulk response has no item-level errors

fe54792

The existing check only verified HTTP status code, which can be 200 even when individual bulk items fail. Parse the response body and assert "errors": false. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot deployed to assembler-preview April 15, 2026 22:55 View deployment

coderabbitai bot requested changes Apr 15, 2026

View reviewed changes

Comment thread tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs Outdated

Tests: Fail explicitly on missing or malformed bulk errors field

7cd628f

Assert bulkResult and its "errors" property exist before checking the boolean value, rather than defaulting to false on missing/malformed JSON. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot deployed to assembler-preview April 15, 2026 23:05 View deployment

coderabbitai bot approved these changes Apr 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ContentDateEnrichment: Filter _update_by_query to only unresolved documents#3118

ContentDateEnrichment: Filter _update_by_query to only unresolved documents#3118
reakaleek wants to merge 4 commits intomainfrom
royal-pancake

reakaleek commented Apr 15, 2026

Uh oh!

coderabbitai bot commented Apr 15, 2026 •

edited

Loading

Walkthrough

Sequence Diagram(s)

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

reakaleek commented Apr 15, 2026

What

Why

How

Test plan

Notes

Uh oh!

coderabbitai bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Sequence Diagram(s)

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai bot commented Apr 15, 2026 •

edited

Loading