Add initial pages + pagesQuery endpoint to /replay.json APIs #2380

ikreymer · 2025-02-11T00:48:52Z

Fixes #2360

Adds seedPages to the /replay.json response for collections, returning up to 25 seed pages.
Adds pagesQueryUrl to /replay.json.
Adds a public pages search endpoint to support public collections.
Adds preloadResources to /replay.json, including a list of WACZ files that should always be loaded.
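
As a rough illustration of the fields listed above, the sketch below assembles the new parts of a /replay.json-style payload. Only the names seedPages, pagesQueryUrl and preloadResources and the 25-page cap come from this PR description; the page fields (`url`, `isSeed`), the example URL, and the helper name `replay_json_additions` are assumptions for illustration and are not the actual Browsertrix models.

```python
from typing import Any, Dict, List

MAX_SEED_PAGES = 25  # cap stated in the PR description


def replay_json_additions(
    pages: List[Dict[str, Any]],
    preload_resources: List[Dict[str, Any]],
    pages_query_url: str,
) -> Dict[str, Any]:
    """Build the new /replay.json fields (hypothetical helper).

    Assumes each page dict carries a boolean "isSeed" flag.
    """
    seed_pages = [page for page in pages if page.get("isSeed")]
    return {
        "seedPages": seed_pages[:MAX_SEED_PAGES],
        "pagesQueryUrl": pages_query_url,
        "preloadResources": preload_resources,
    }
```

A client could render seedPages immediately and fall back to pagesQueryUrl only when the user searches or scrolls past the first batch.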

Draft pending work in wabac.js to ensure this is complete.

tw4l · 2025-02-11T15:58:23Z

Do we also want to return seeds and the additional pages query url in the replay.json for crawls?

tests: add test 'pagesQuery' for both private and public collections

Needs some testing and possible refinement

Includes information for all WACZ files in collection that contain seed pages or have no associated pages in the database.

- rename pages -> seedPages
- rename alwaysLoad -> preloadResources
- rename pagesQuery -> pagesQueryUrl
- optimize loading preloadResources as part of resource lookup
- remove seed page wacz files from preloadResources, can be computed from seedPages list
- tests: add additional tests for preloadResources, seedPages

update tests
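
One possible reading of the preloadResources notes above: once seed-page WACZ files are dropped from the list, what remains are the WACZ files with no page records at all, while the seed-page files can be recovered from the page list itself. The sketch below shows that split under stated assumptions. The `filename` and `isSeed` keys and both function names are hypothetical and are not taken from the Browsertrix code.

```python
from typing import Any, Dict, Iterable, List, Set


def seed_page_files(pages: Iterable[Dict[str, Any]]) -> Set[str]:
    """WACZ files that hold at least one seed page (derivable from the page list)."""
    return {
        page["filename"]
        for page in pages
        if page.get("isSeed") and page.get("filename")
    }


def files_without_pages(
    wacz_files: Iterable[str], pages: Iterable[Dict[str, Any]]
) -> List[str]:
    """WACZ files with no associated page records, in their original order."""
    files_with_pages = {page["filename"] for page in pages if page.get("filename")}
    return [name for name in wacz_files if name not in files_with_pages]
```

Keeping the two computations separate mirrors the note that seed-page files need not be listed explicitly, since a client can derive them from the pages it already receives.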

ikreymer · 2025-02-12T11:11:59Z

> Do we also want to return seeds and the additional pages query url in the replay.json for crawls?

Added pagesQueryUrl and seedPages for crawls.

ikreymer · 2025-02-13T05:18:06Z

One more small change: what if, instead of seedPages, we went with initialPages and it included the first 25 pages?
The default sort order can be {"$sort": {"isSeed": -1, "ts": 1}}, so we still list all seeds first, then all non-seeds, each group sorted by time. This would allow infinite scroll to load additional pages without a query.
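
To make the proposed ordering concrete, here is a minimal pure-Python sketch that mimics the effect of the sort stage quoted above on an in-memory list. The `$sort` document and the 25-page limit come from the comment; the function name `initial_pages` and the sample page fields are assumptions, and pages missing `isSeed` are simply treated as non-seeds here.

```python
from datetime import datetime
from typing import Any, Dict, List

# Sort stage quoted in the comment: seeds first, then ascending timestamp.
DEFAULT_SORT = {"$sort": {"isSeed": -1, "ts": 1}}
INITIAL_PAGES_LIMIT = 25


def initial_pages(
    pages: List[Dict[str, Any]], limit: int = INITIAL_PAGES_LIMIT
) -> List[Dict[str, Any]]:
    """Return the first `limit` pages, seeds first, then by timestamp."""
    # `not isSeed` is False for seeds, and False sorts before True,
    # which matches a descending sort on isSeed.
    ordered = sorted(
        pages, key=lambda page: (not page.get("isSeed", False), page["ts"])
    )
    return ordered[:limit]


if __name__ == "__main__":
    sample = [
        {"url": "b", "isSeed": False, "ts": datetime(2025, 1, 1)},
        {"url": "a", "isSeed": True, "ts": datetime(2025, 1, 3)},
        {"url": "c", "isSeed": True, "ts": datetime(2025, 1, 2)},
    ]
    print([page["url"] for page in initial_pages(sample)])
```

Because the ordering is fixed, a client can request later slices of the same ordering to implement infinite scroll without issuing a search.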

ikreymer requested a review from tw4l February 11, 2025 00:48

tw4l reviewed Feb 11, 2025

backend/btrixcloud/models.py (outdated, resolved)

tw4l reviewed Feb 11, 2025

backend/btrixcloud/models.py (outdated, resolved)

tw4l reviewed Feb 11, 2025

backend/btrixcloud/colls.py (outdated, resolved)

tw4l reviewed Feb 11, 2025

backend/test/test_collections.py (outdated, resolved)

ikreymer force-pushed the replay-json-pages branch from 8271965 to bb9bd32 Compare February 12, 2025 00:12

ikreymer and others added 15 commits February 11, 2025 23:18

add pages to coll replay.json endpoint

e5ee366

create CollOut after

cb1715c

add pagesQuery to replay.json

cbeb6e7

update model

291dd1c

add /public/pages endpoint for public collections

988cc18

tests: add test 'pagesQuery' for both private and public collections

work on tests

b7243f4

test work

d878b53

Add initial URL + title search to collection pages endpoint

ade353d

Needs some testing and possible refinement

Add alwaysLoad to collection replay.json

9c0c3b2

Includes information for all WACZ files in collection that contain seed pages or have no associated pages in the database.

Add pylint comment

964ef09

crawlfileout: rename to itemId, use basename for 'name'

6552034

switch to crawlId instead

477bb84

handle CORS for public pages endpoint

ea776cd

fix typo

b8a8c18

ikreymer force-pushed the replay-json-pages branch from 386fb98 to 1e1cf12 Compare February 12, 2025 07:53

ikreymer added 4 commits February 12, 2025 00:18

fix test?

4a6d655

ensure public pages access to private collection is 404

cb448a1

add seedPages and pagesQueryUrl to crawl /replay.json for consistency

7ab0c51

update tests

fix typo

0e43c95

ikreymer added 2 commits February 12, 2025 12:23

add pages to coll replay.json endpoint

62145f8

create CollOut after

b9c5e3b

ikreymer and others added 18 commits February 12, 2025 12:23

add pagesQuery to replay.json

b072934

update model

10e2721

add /public/pages endpoint for public collections

84bacbd

tests: add test 'pagesQuery' for both private and public collections

work on tests

f1fdc5f

test work

05b714c

Add initial URL + title search to collection pages endpoint

79522b8

Needs some testing and possible refinement

Add alwaysLoad to collection replay.json

a7b11df

Includes information for all WACZ files in collection that contain seed pages or have no associated pages in the database.

Add pylint comment

a80c52f

crawlfileout: rename to itemId, use basename for 'name'

d6800f7

switch to crawlId instead

fcdbc40

handle CORS for public pages endpoint

b965004

fix typo

2690b2a

fix test?

529f260

ensure public pages access to private collection is 404

e394646

add seedPages and pagesQueryUrl to crawl /replay.json for consistency

2f27531

update tests

fix typo

e9d6c9e

Add tests for collection pages search filter

dd664cb

tw4l force-pushed the replay-json-pages branch from 61d0a15 to dd664cb Compare February 12, 2025 17:23

ikreymer added this to the Public Collections milestone Feb 13, 2025

ikreymer marked this pull request as ready for review February 13, 2025 04:21

ikreymer added 2 commits February 12, 2025 22:25

seedPages -> initialPages

bfb7076

add totalPages collection pages: default sort order seeds first, then by timestamp

Merge branch 'replay-json-pages' of github.com:webrecorder/browsertri…

a6d848e

…x-cloud into replay-json-pages

ikreymer merged commit 7b2932c into main Feb 14, 2025
23 checks passed

ikreymer deleted the replay-json-pages branch February 14, 2025 00:53
