Skip to content

ContentDateEnrichment: Filter _update_by_query to only unresolved documents#3118

Open
reakaleek wants to merge 4 commits intomainfrom
royal-pancake
Open

ContentDateEnrichment: Filter _update_by_query to only unresolved documents#3118
reakaleek wants to merge 4 commits intomainfrom
royal-pancake

Conversation

@reakaleek
Copy link
Copy Markdown
Member

What

Filter _update_by_query in ResolveContentDatesAsync to only process documents that don't already have a valid content_last_updated, instead of re-indexing every document in the index.

Why

The unfiltered _update_by_query on the semantic index triggers ML inference for all 6 semantic_text fields on every document — causing the deploy workflow to hang for 3+ hours (see cancelled run). The lexical index is fast because it has no semantic_text fields.

How

After HashedBulkUpdate, unchanged documents (noop) retain their resolved content_last_updated from the previous run. Only new/changed documents have the field at DateTimeOffset.MinValue (0001-01-01). A must_not { range: gt 1970 } filter restricts _update_by_query to only these unresolved documents — typically a tiny fraction of the index.

Also enhances integration tests to use real HashedBulkUpdate-style scripted upserts with full DocumentationDocument serialization.

Test plan

  • Discovery test confirms HashedBulkUpdate writes content_last_updated: "0001-01-01T00:00:00+00:00" for default documents
  • Filter test confirms resolved docs are skipped, unresolved docs are processed
  • End-to-end test confirms second run only processes 1 changed doc out of 3 (visible in _update_by_query log: Documents 1: 1 updated)
  • All 7 integration tests pass (existing + 3 new)

Notes

  • Root cause: semantic_text inference fires during _update_by_query re-indexing, even when text hasn't changed
  • The filter handles both missing fields and DateTimeOffset.MinValue values
  • First-ever run still processes all documents (all have default dates), but subsequent runs are fast

🤖 Generated with Claude Code

…uments

The _update_by_query in ResolveContentDatesAsync was re-indexing every
document in both the lexical and semantic indices. On the semantic index,
this triggered ML inference for all 6 semantic_text fields on every
document — causing the deploy workflow to hang for 3+ hours.

After HashedBulkUpdate, unchanged documents (noop) retain their resolved
content_last_updated from the previous run. Only new/changed documents
have the field at the default DateTimeOffset.MinValue (0001-01-01). The
filter restricts _update_by_query to only these unresolved documents,
reducing the typical deploy from hundreds of thousands of documents to
just the changed ones.

Also enhances integration tests to use real HashedBulkUpdate-style
scripted upserts with full DocumentationDocument serialization, and adds
tests proving the filter behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 55792808-fde2-4148-8d44-d749e553fa88

📥 Commits

Reviewing files that changed from the base of the PR and between fe54792 and 7cd628f.

📒 Files selected for processing (1)
  • tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs

📝 Walkthrough

Walkthrough

ResolveContentDatesAsync now calls Elasticsearch's _update_by_query with an explicit JSON query (PostData.String(UnresolvedContentDatesQuery)) that excludes documents whose content_last_updated is greater than 1970-01-01T00:00:00Z, rather than using PostData.Empty. Integration tests were rewritten to index full serialized DocumentationDocument via a new hash-aware scripted upsert helper, updated assertions for default/pre-epoch content_last_updated behavior, and added tests for stored representation, resolution filtering, and a two-pass end-to-end scenario.

Sequence Diagram(s)

sequenceDiagram
  participant App as Application
  participant ES as Elasticsearch
  participant Pipeline as Ingest Pipeline

  App->>ES: _update_by_query (query: content_last_updated <= "1970-01-01T00:00:00Z", pipeline=content-date-enrich)
  ES->>ES: find matching documents
  ES->>Pipeline: apply ingest pipeline to each matching document
  Pipeline-->>ES: return enriched document
  ES-->>App: _update_by_query response (updated count)
Loading
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: filtering _update_by_query to process only unresolved documents with missing or default content_last_updated timestamps.
Description check ✅ Passed The description comprehensively explains what changed, why (performance issue with ML inference), how (filtering by content_last_updated timestamp), and includes a clear test plan validating the changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch royal-pancake

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs (1)

307-308: Minor: Comment is slightly misleading.

The comment says "no content_last_updated" but IndexDocumentsDirectly now uses full DocumentationDocument serialization which includes content_last_updated at the default 0001-01-01 value. Consider updating to match doc2's comment pattern.

Proposed fix
-		// doc1: no content_last_updated (simulates scripted upsert that omits the field)
+		// doc1: content_last_updated at default DateTimeOffset.MinValue (simulates production scripted upsert)
		await IndexDocumentsDirectly(index, ("url1", "hash_a", "Doc 1"));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`
around lines 307 - 308, Update the test comment to accurately reflect what
IndexDocumentsDirectly now serializes: instead of saying "no
content_last_updated" state that the document is indexed with the default
DocumentationDocument.content_last_updated value (DateTime.MinValue/0001-01-01),
matching the doc2 comment pattern; locate the comment above the call to
IndexDocumentsDirectly in ContentDateEnrichmentTests.cs and amend the text to
mention the default/min value rather than omitting the field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 307-308: Update the test comment to accurately reflect what
IndexDocumentsDirectly now serializes: instead of saying "no
content_last_updated" state that the document is indexed with the default
DocumentationDocument.content_last_updated value (DateTime.MinValue/0001-01-01),
matching the doc2 comment pattern; locate the comment above the call to
IndexDocumentsDirectly in ContentDateEnrichmentTests.cs and amend the text to
mention the default/min value rather than omitting the field.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 84a64253-e054-48d3-be04-19248f5a77ad

📥 Commits

Reviewing files that changed from the base of the PR and between 08d43d9 and e25ef46.

📒 Files selected for processing (2)
  • src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cs
  • tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs

- Extract query JsonObject to static readonly string (avoid re-allocation per call)
- Remove debug output.WriteLine from discovery test
- Fix double JsonNode.Parse — reuse parsed node for params.doc
- Use const for hash field name
- Fix misleading comment on doc1 in FilteredResolve test
- Remove unnecessary null-forgiving operators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 576-582: The current assertion only checks
response.ApiCallDetails.HasSuccessfulStatusCode which can be true even if some
bulk items failed; update the test after the PostAsync<StringResponse> call to
assert that the bulk response reports no item errors by parsing or inspecting
the response body and asserting errors == false (e.g., deserialize response.Body
to the appropriate bulk-response model or search the JSON for "errors": false)
in addition to the existing HasSuccessfulStatusCode check so that variable
response (returned by PostAsync<StringResponse>) explicitly verifies no per-item
failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 513dae87-38ce-4451-b9f2-12d2601cd2b9

📥 Commits

Reviewing files that changed from the base of the PR and between e25ef46 and 2791de7.

📒 Files selected for processing (2)
  • src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cs
  • tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cs

The existing check only verified HTTP status code, which can be 200 even
when individual bulk items fail. Parse the response body and assert
"errors": false.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs (1)

102-103: Avoid hard-coded year checks in resolved-date assertions.

Using Year >= 2026 makes these tests unnecessarily time-dependent. The behavior under test is “resolved date is not default/unresolved,” so compare against DateTimeOffset.UnixEpoch (or a captured test start time) instead.

Also applies to: 214-215, 232-233, 341-342, 345-345

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`
around lines 102 - 103, The assertion currently checks
doc.ContentLastUpdated.Value.Year.Should().BeGreaterThanOrEqualTo(2026) which
hard-codes a year; replace these time-dependent checks (the assertions
referencing doc.ContentLastUpdated.Value.Year in ContentDateEnrichmentTests)
with a comparison against a non-default baseline such as
DateTimeOffset.UnixEpoch (or a captured test start time variable) — e.g., assert
doc.ContentLastUpdated.Should().BeAfter(DateTimeOffset.UnixEpoch) or
BeAfter(testStart) — and update the other occurrences referenced in the comment
(the assertions at the other noted locations) similarly so you verify “not
default/unresolved” rather than a specific year.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 584-586: The test currently treats a missing or non-boolean
"errors" field as false; change the assertion to first verify the bulkResult
JSON exists and contains an "errors" property of the expected boolean shape
(e.g., ensure bulkResult != null and bulkResult["errors"] is present and can be
parsed as bool) and fail the test if the property is missing or malformed, then
assert that that boolean value is false; reference the variables bulkResult and
hasErrors in ContentDateEnrichmentTests to locate and update the validation
logic.

---

Nitpick comments:
In
`@tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs`:
- Around line 102-103: The assertion currently checks
doc.ContentLastUpdated.Value.Year.Should().BeGreaterThanOrEqualTo(2026) which
hard-codes a year; replace these time-dependent checks (the assertions
referencing doc.ContentLastUpdated.Value.Year in ContentDateEnrichmentTests)
with a comparison against a non-default baseline such as
DateTimeOffset.UnixEpoch (or a captured test start time variable) — e.g., assert
doc.ContentLastUpdated.Should().BeAfter(DateTimeOffset.UnixEpoch) or
BeAfter(testStart) — and update the other occurrences referenced in the comment
(the assertions at the other noted locations) similarly so you verify “not
default/unresolved” rather than a specific year.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 2731e163-6446-4342-9111-596cd7a4f5f2

📥 Commits

Reviewing files that changed from the base of the PR and between 2791de7 and fe54792.

📒 Files selected for processing (1)
  • tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs

Assert bulkResult and its "errors" property exist before checking the
boolean value, rather than defaulting to false on missing/malformed JSON.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant