
Added extended auto-QA features #998

Open
lasztoth wants to merge 1 commit into webrecorder:main from lasztoth:auto-qa-features

Conversation

@lasztoth

Introduction

In this fork, we introduce a few refinements to the default Browsertrix QA workflow:

  1. The ability to set a maximum page count for the QA,
  2. The ability to choose between 3 different QA algorithms.

These algorithms are: linear (first come, first served), regex (perform QA only on URLs that match a given regular expression), and random (QA each page with a configurable probability).
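As a sketch, the three policies can be thought of as filters over the crawl's page list. The helper below is illustrative only: `selectPagesForQA`, `QAOptions`, and the injectable `rng` are hypothetical names for exposition, not the PR's actual code.

```typescript
// Illustrative sketch of the three QA page-selection policies.
// selectPagesForQA and QAOptions are hypothetical names, not the PR's code.

type QAPolicy = "linear" | "regex" | "random";

interface QAOptions {
  qaPolicy: QAPolicy;
  qaMaxUrls: number;        // cap applied by every policy
  qaRegex?: string;         // used when qaPolicy === "regex"
  qaProbability?: number;   // used when qaPolicy === "random", in [0, 1]
  rng?: () => number;       // injectable for testing; defaults to Math.random
}

function selectPagesForQA(urls: string[], opts: QAOptions): string[] {
  const rng = opts.rng ?? Math.random;
  let candidates: string[];
  switch (opts.qaPolicy) {
    case "linear":
      // First come, first served: keep pages in pages.jsonl order.
      candidates = urls;
      break;
    case "regex": {
      // Keep only pages whose URL matches the configured pattern.
      const re = new RegExp(opts.qaRegex ?? ".*");
      candidates = urls.filter((u) => re.test(u));
      break;
    }
    case "random":
      // Keep each page independently with probability qaProbability.
      candidates = urls.filter(() => rng() < (opts.qaProbability ?? 1));
      break;
    default:
      throw new Error(`unknown qaPolicy: ${opts.qaPolicy}`);
  }
  // In every policy, at most qaMaxUrls pages are queued for QA.
  return candidates.slice(0, opts.qaMaxUrls);
}
```

Note that the `qaMaxUrls` cap is applied last in this sketch, so it bounds the output of every policy.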

New CLI arguments

We provide the following new CLI arguments for the crawler:

  • qaMaxUrls: the maximum number of pages to perform QA on,
  • qaPolicy: can be one of linear, regex or random,
  • qaRegex: if qaPolicy is regex, then you can define your regular expression here,
  • qaProbability: if qaPolicy is random, you can define your per-page QA probability here. This is a floating-point number between 0 and 1.
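Combined with the crawler's standard QA workflow, a full run might look like the following. The `qa` entrypoint and `--qaSource` flag come from the existing Browsertrix QA setup; the volume mount, collection path, and chosen flag values are illustrative only.

```shell
# Illustrative QA invocation; the mounted path and collection name are made up.
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler qa \
  --qaSource /crawls/collections/my-crawl/my-crawl.wacz \
  --qaPolicy "random" --qaProbability 0.3 --qaMaxUrls 50
```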

QA Policy: linear

In this QA mode, the first qaMaxUrls pages in the pages.jsonl file(s) will be scanned. Example:

--qaPolicy "linear" --qaMaxUrls 50

QA Policy: regex

In this QA mode, only the pages that match the regular expression in qaRegex will be scanned. Example:

--qaPolicy "regex" --qaRegex='^https:\/\/en\.wikipedia\.org\/wiki\/R.*$' --qaMaxUrls 50

This will match all English Wikipedia articles whose titles start with R.
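The example pattern can be checked against a few sample URLs (the article URLs below are made up for illustration):

```typescript
// Quick check of the example qaRegex pattern from above.
const qaRegex = /^https:\/\/en\.wikipedia\.org\/wiki\/R.*$/;

// English article starting with R: matches.
console.log(qaRegex.test("https://en.wikipedia.org/wiki/Rust_(programming_language)")); // true
// English article not starting with R: no match.
console.log(qaRegex.test("https://en.wikipedia.org/wiki/Python")); // false
// Non-English Wikipedia host: no match.
console.log(qaRegex.test("https://fr.wikipedia.org/wiki/Rust")); // false
```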

QA Policy: random

In this QA mode, pages will be scanned with a probability equal to qaProbability. This is a floating-point number between 0 and 1. Example:

--qaPolicy "random" --qaProbability 0.3 --qaMaxUrls 50

This policy gives a good overall impression of a harvest by scanning a random sample of its pages.
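To make the interaction between qaProbability and qaMaxUrls concrete, the sketch below samples pages with an injectable deterministic RNG; the LCG and `sampleForQA` are illustrative stand-ins, not the PR's code. With 1,000 pages and a probability of 0.3, roughly 300 pages qualify, so the qaMaxUrls cap of 50 is what actually bounds the sample.

```typescript
// Deterministic stand-in for Math.random so the run is reproducible.
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s * 1664525 + 1013904223) >>> 0; // Numerical Recipes LCG constants
    return s / 2 ** 32;
  };
}

// Illustrative random-policy sampler: keep each page with probability
// qaProbability, then apply the qaMaxUrls cap.
function sampleForQA(
  urls: string[],
  qaProbability: number,
  qaMaxUrls: number,
  rng: () => number,
): string[] {
  return urls.filter(() => rng() < qaProbability).slice(0, qaMaxUrls);
}

const pages = Array.from({ length: 1000 }, (_, i) => `https://site.example/page/${i}`);
const picked = sampleForQA(pages, 0.3, 50, lcg(42));
console.log(picked.length); // capped at 50 by qaMaxUrls
```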

Maximum number of pages to scan

In every case mentioned above, the number of pages that will be queued for scanning will be at most qaMaxUrls.

@tw4l tw4l moved this to In Review in Webrecorder Projects Mar 18, 2026
@tw4l tw4l requested review from ikreymer and tw4l March 18, 2026 17:17
@ikreymer
Member

@lasztoth thank you for adding this!
Do you think you could:

  • Add the content here to the QA docs page at https://github.com/webrecorder/browsertrix-crawler/blob/main/docs/docs/user-guide/qa.md, maybe with a section like 'QA Policies and Additional Options'?
  • Add a test or two that uses the new QA options. Happy to help with this, but it would be good to have test coverage in case something breaks it: maybe just testing each policy with the above examples and ensuring the maxUrl limit is hit and the pages are different. Can help with this if you have questions.

Would be great to get this in the next release!

@laszlototh-lux


Thank you so much @ikreymer for reviewing this PR :) I will get started on these as soon as possible.

Kind regards,

