ContentDateEnrichment: Filter _update_by_query to only unresolved documents#3118
ContentDateEnrichment: Filter _update_by_query to only unresolved documents#3118
Conversation
…uments The _update_by_query in ResolveContentDatesAsync was re-indexing every document in both the lexical and semantic indices. On the semantic index, this triggered ML inference for all 6 semantic_text fields on every document — causing the deploy workflow to hang for 3+ hours. After HashedBulkUpdate, unchanged documents (noop) retain their resolved content_last_updated from the previous run. Only new/changed documents have the field at the default DateTimeOffset.MinValue (0001-01-01). The filter restricts _update_by_query to only these unresolved documents, reducing the typical deploy from hundreds of thousands of documents to just the changed ones. Also enhances integration tests to use real HashedBulkUpdate-style scripted upserts with full DocumentationDocument serialization, and adds tests proving the filter behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughResolveContentDatesAsync now calls Elasticsearch's _update_by_query with an explicit JSON query (PostData.String(UnresolvedContentDatesQuery)) that excludes documents whose content_last_updated is greater than 1970-01-01T00:00:00Z, rather than using PostData.Empty. Integration tests were rewritten to index full serialized DocumentationDocument via a new hash-aware scripted upsert helper, updated assertions for default/pre-epoch content_last_updated behavior, and added tests for stored representation, resolution filtering, and a two-pass end-to-end scenario. Sequence Diagram(s)sequenceDiagram
participant App as Application
participant ES as Elasticsearch
participant Pipeline as Ingest Pipeline
App->>ES: _update_by_query (query: content_last_updated <= "1970-01-01T00:00:00Z", pipeline=content-date-enrich)
ES->>ES: find matching documents
ES->>Pipeline: apply ingest pipeline to each matching document
Pipeline-->>ES: return enriched document
ES-->>App: _update_by_query response (updated count)
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs (1)
307-308: Minor: Comment is slightly misleading.The comment says "no content_last_updated" but
IndexDocumentsDirectlynow uses fullDocumentationDocumentserialization which includescontent_last_updatedat the default 0001-01-01 value. Consider updating to match doc2's comment pattern.Proposed fix
- // doc1: no content_last_updated (simulates scripted upsert that omits the field) + // doc1: content_last_updated at default DateTimeOffset.MinValue (simulates production scripted upsert) await IndexDocumentsDirectly(index, ("url1", "hash_a", "Doc 1"));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs` around lines 307 - 308, Update the test comment to accurately reflect what IndexDocumentsDirectly now serializes: instead of saying "no content_last_updated" state that the document is indexed with the default DocumentationDocument.content_last_updated value (DateTime.MinValue/0001-01-01), matching the doc2 comment pattern; locate the comment above the call to IndexDocumentsDirectly in ContentDateEnrichmentTests.cs and amend the text to mention the default/min value rather than omitting the field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 307-308: Update the test comment to accurately reflect what
IndexDocumentsDirectly now serializes: instead of saying "no
content_last_updated" state that the document is indexed with the default
DocumentationDocument.content_last_updated value (DateTime.MinValue/0001-01-01),
matching the doc2 comment pattern; locate the comment above the call to
IndexDocumentsDirectly in ContentDateEnrichmentTests.cs and amend the text to
mention the default/min value rather than omitting the field.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 84a64253-e054-48d3-be04-19248f5a77ad
📒 Files selected for processing (2)
src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cstests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs
- Extract query JsonObject to static readonly string (avoid re-allocation per call) - Remove debug output.WriteLine from discovery test - Fix double JsonNode.Parse — reuse parsed node for params.doc - Use const for hash field name - Fix misleading comment on doc1 in FilteredResolve test - Remove unnecessary null-forgiving operators Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 576-582: The current assertion only checks
response.ApiCallDetails.HasSuccessfulStatusCode which can be true even if some
bulk items failed; update the test after the PostAsync<StringResponse> call to
assert that the bulk response reports no item errors by parsing or inspecting
the response body and asserting errors == false (e.g., deserialize response.Body
to the appropriate bulk-response model or search the JSON for "errors": false)
in addition to the existing HasSuccessfulStatusCode check so that variable
response (returned by PostAsync<StringResponse>) explicitly verifies no per-item
failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 513dae87-38ce-4451-b9f2-12d2601cd2b9
📒 Files selected for processing (2)
src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cstests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cs
The existing check only verified HTTP status code, which can be 200 even when individual bulk items fail. Parse the response body and assert "errors": false. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs (1)
102-103: Avoid hard-coded year checks in resolved-date assertions.Using
Year >= 2026makes these tests unnecessarily time-dependent. The behavior under test is “resolved date is not default/unresolved,” so compare againstDateTimeOffset.UnixEpoch(or a captured test start time) instead.Also applies to: 214-215, 232-233, 341-342, 345-345
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs` around lines 102 - 103, The assertion currently checks doc.ContentLastUpdated.Value.Year.Should().BeGreaterThanOrEqualTo(2026) which hard-codes a year; replace these time-dependent checks (the assertions referencing doc.ContentLastUpdated.Value.Year in ContentDateEnrichmentTests) with a comparison against a non-default baseline such as DateTimeOffset.UnixEpoch (or a captured test start time variable) — e.g., assert doc.ContentLastUpdated.Should().BeAfter(DateTimeOffset.UnixEpoch) or BeAfter(testStart) — and update the other occurrences referenced in the comment (the assertions at the other noted locations) similarly so you verify “not default/unresolved” rather than a specific year.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 584-586: The test currently treats a missing or non-boolean
"errors" field as false; change the assertion to first verify the bulkResult
JSON exists and contains an "errors" property of the expected boolean shape
(e.g., ensure bulkResult != null and bulkResult["errors"] is present and can be
parsed as bool) and fail the test if the property is missing or malformed, then
assert that that boolean value is false; reference the variables bulkResult and
hasErrors in ContentDateEnrichmentTests to locate and update the validation
logic.
---
Nitpick comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 102-103: The assertion currently checks
doc.ContentLastUpdated.Value.Year.Should().BeGreaterThanOrEqualTo(2026) which
hard-codes a year; replace these time-dependent checks (the assertions
referencing doc.ContentLastUpdated.Value.Year in ContentDateEnrichmentTests)
with a comparison against a non-default baseline such as
DateTimeOffset.UnixEpoch (or a captured test start time variable) — e.g., assert
doc.ContentLastUpdated.Should().BeAfter(DateTimeOffset.UnixEpoch) or
BeAfter(testStart) — and update the other occurrences referenced in the comment
(the assertions at the other noted locations) similarly so you verify “not
default/unresolved” rather than a specific year.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 2731e163-6446-4342-9111-596cd7a4f5f2
📒 Files selected for processing (1)
tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs
Assert bulkResult and its "errors" property exist before checking the boolean value, rather than defaulting to false on missing/malformed JSON. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What
Filter
_update_by_queryinResolveContentDatesAsyncto only process documents that don't already have a validcontent_last_updated, instead of re-indexing every document in the index.Why
The unfiltered
_update_by_queryon the semantic index triggers ML inference for all 6semantic_textfields on every document — causing the deploy workflow to hang for 3+ hours (see cancelled run). The lexical index is fast because it has nosemantic_textfields.How
After
HashedBulkUpdate, unchanged documents (noop) retain their resolvedcontent_last_updatedfrom the previous run. Only new/changed documents have the field atDateTimeOffset.MinValue(0001-01-01). Amust_not { range: gt 1970 }filter restricts_update_by_queryto only these unresolved documents — typically a tiny fraction of the index.Also enhances integration tests to use real
HashedBulkUpdate-style scripted upserts with fullDocumentationDocumentserialization.Test plan
HashedBulkUpdatewritescontent_last_updated: "0001-01-01T00:00:00+00:00"for default documents_update_by_querylog:Documents 1: 1 updated)Notes
semantic_textinference fires during_update_by_queryre-indexing, even when text hasn't changedDateTimeOffset.MinValuevalues🤖 Generated with Claude Code