
Added extended auto-QA features #998

Open
lasztoth wants to merge 1 commit into webrecorder:main from lasztoth:auto-qa-features

Conversation

@lasztoth

Introduction

In this fork, we introduce a few refinements to the default Browsertrix QA workflow:

  1. The ability to set a maximum page count for the QA,
  2. The ability to choose between 3 different QA algorithms.

These algorithms are: linear (first come, first served), regex (perform QA only on URLs that match a given regular expression), and random (QA each page with a configurable probability).
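As a sketch, the three policies can be thought of as filters over the crawl's page list. The helper below is illustrative only: `selectPagesForQA`, `QAOptions`, and the injectable `rng` are hypothetical names for exposition, not the PR's actual code.

```typescript
// Illustrative sketch of the three QA page-selection policies.
// selectPagesForQA and QAOptions are hypothetical names, not the PR's code.

type QAPolicy = "linear" | "regex" | "random";

interface QAOptions {
  qaPolicy: QAPolicy;
  qaMaxUrls: number;        // cap applied by every policy
  qaRegex?: string;         // used when qaPolicy === "regex"
  qaProbability?: number;   // used when qaPolicy === "random", in [0, 1]
  rng?: () => number;       // injectable for testing; defaults to Math.random
}

function selectPagesForQA(urls: string[], opts: QAOptions): string[] {
  const rng = opts.rng ?? Math.random;
  let candidates: string[];
  switch (opts.qaPolicy) {
    case "linear":
      // First come, first served: keep pages in pages.jsonl order.
      candidates = urls;
      break;
    case "regex": {
      // Keep only pages whose URL matches the configured pattern.
      const re = new RegExp(opts.qaRegex ?? ".*");
      candidates = urls.filter((u) => re.test(u));
      break;
    }
    case "random":
      // Keep each page independently with probability qaProbability.
      candidates = urls.filter(() => rng() < (opts.qaProbability ?? 1));
      break;
    default:
      throw new Error(`unknown qaPolicy: ${opts.qaPolicy}`);
  }
  // In every policy, at most qaMaxUrls pages are queued for QA.
  return candidates.slice(0, opts.qaMaxUrls);
}
```

Note that the `qaMaxUrls` cap is applied last in this sketch, so it bounds the output of every policy.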

New CLI arguments

We provide the following new CLI arguments for the crawler:

  • qaMaxUrls: the maximum number of pages to perform QA on,
  • qaPolicy: can be one of linear, regex or random,
  • qaRegex: if qaPolicy is regex, then you can define your regular expression here,
  • qaProbability: if qaPolicy is random, you can define your per-page QA probability here. This is a floating-point number between 0 and 1.
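Combined with the crawler's standard QA workflow, a full run might look like the following. The `qa` entrypoint and `--qaSource` flag come from the existing Browsertrix QA setup; the volume mount, collection path, and chosen flag values are illustrative only.

```shell
# Illustrative QA invocation; the mounted path and collection name are made up.
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler qa \
  --qaSource /crawls/collections/my-crawl/my-crawl.wacz \
  --qaPolicy "random" --qaProbability 0.3 --qaMaxUrls 50
```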

QA Policy: linear

In this QA mode, the first qaMaxUrls pages in the pages.jsonl file(s) will be scanned. Example:

--qaPolicy "linear" --qaMaxUrls 50

QA Policy: regex

In this QA mode, only the pages that match the regular expression in qaRegex will be scanned. Example:

--qaPolicy "regex" --qaRegex='^https:\/\/en\.wikipedia\.org\/wiki\/R.*$' --qaMaxUrls 50

This will match all English Wikipedia articles whose titles start with R.
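The example pattern can be checked against a few sample URLs (the article URLs below are made up for illustration):

```typescript
// Quick check of the example qaRegex pattern from above.
const qaRegex = /^https:\/\/en\.wikipedia\.org\/wiki\/R.*$/;

// English article starting with R: matches.
console.log(qaRegex.test("https://en.wikipedia.org/wiki/Rust_(programming_language)")); // true
// English article not starting with R: no match.
console.log(qaRegex.test("https://en.wikipedia.org/wiki/Python")); // false
// Non-English Wikipedia host: no match.
console.log(qaRegex.test("https://fr.wikipedia.org/wiki/Rust")); // false
```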

QA Policy: random

In this QA mode, pages will be scanned with a probability equal to qaProbability. This is a floating-point number between 0 and 1. Example:

--qaPolicy "random" --qaProbability 0.3 --qaMaxUrls 50

This policy gives a good overall impression of a harvest by scanning a random sample of its pages.
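To make the interaction between qaProbability and qaMaxUrls concrete, the sketch below samples pages with an injectable deterministic RNG; the LCG and `sampleForQA` are illustrative stand-ins, not the PR's code. With 1,000 pages and a probability of 0.3, roughly 300 pages qualify, so the qaMaxUrls cap of 50 is what actually bounds the sample.

```typescript
// Deterministic stand-in for Math.random so the run is reproducible.
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s * 1664525 + 1013904223) >>> 0; // Numerical Recipes LCG constants
    return s / 2 ** 32;
  };
}

// Illustrative random-policy sampler: keep each page with probability
// qaProbability, then apply the qaMaxUrls cap.
function sampleForQA(
  urls: string[],
  qaProbability: number,
  qaMaxUrls: number,
  rng: () => number,
): string[] {
  return urls.filter(() => rng() < qaProbability).slice(0, qaMaxUrls);
}

const pages = Array.from({ length: 1000 }, (_, i) => `https://site.example/page/${i}`);
const picked = sampleForQA(pages, 0.3, 50, lcg(42));
console.log(picked.length); // capped at 50 by qaMaxUrls
```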

Maximum number of pages to scan

In every case mentioned above, the number of pages that will be queued for scanning will be at most qaMaxUrls.

@tw4l tw4l moved this to In Review in Webrecorder Projects Mar 18, 2026
@tw4l tw4l requested review from ikreymer and tw4l March 18, 2026 17:17
@ikreymer
Member

@lasztoth thank you for adding this!
Do you think you could:

  • Add the content here to the QA docs page at https://github.com/webrecorder/browsertrix-crawler/blob/main/docs/docs/user-guide/qa.md, maybe with a section like 'QA Policies and Additional Options'?
  • Add a test or two that uses the new QA options. Happy to help with this, but it would be good to have test coverage in case something breaks it: maybe just testing each policy with the above examples and ensuring the maxUrl limit is hit and the pages are different. Can help with this if you have questions.

Would be great to get this in the next release!

@laszlototh-lux


Thank you so much @ikreymer for reviewing this PR :) I will get started on these as soon as possible.

Kind regards,

