[Feature Request] Allow user to control whether URL fragments are ignored when crawling #1454

karanbirsingh · 2021-06-04T00:11:37Z

Is your feature request related to a problem? Please describe.

By default, Apify ignores URL fragments when computing URL uniqueness. This means http://www.example.com#foo and http://www.example.com#bar are considered equal. These URLs are skipped once http://www.example.com is crawled. This makes sense for many websites because URL fragments often link to sections within the same HTML page.
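To make the behavior concrete, here is a minimal sketch of fragment-insensitive de-duplication. It is not Apify's implementation: `uniqueKey` and `dedupe` are hypothetical helpers written for illustration, and the real uniqueness computation likely applies further normalization beyond dropping the fragment.

```typescript
// Sketch only: approximates how ignoring the fragment collapses URLs.
// `uniqueKey` and `dedupe` are hypothetical helpers, not Apify APIs.
function uniqueKey(url: string, keepUrlFragment: boolean = false): string {
    if (keepUrlFragment) {
        return url;
    }
    // Drop everything from the first '#' onward.
    const hashIndex = url.indexOf("#");
    return hashIndex === -1 ? url : url.slice(0, hashIndex);
}

// Keeps the first URL seen for each unique key, in input order.
function dedupe(urls: string[], keepUrlFragment: boolean = false): string[] {
    const seen = new Set<string>();
    const result: string[] = [];
    for (const url of urls) {
        const key = uniqueKey(url, keepUrlFragment);
        if (!seen.has(key)) {
            seen.add(key);
            result.push(url);
        }
    }
    return result;
}

const links: string[] = [
    "http://www.example.com",
    "http://www.example.com#foo",
    "http://www.example.com#bar",
];
```

With the fragment ignored, `dedupe(links)` keeps only the first entry; with `dedupe(links, true)` all three URLs are treated as distinct.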

We found an example of an internal website whose URL fragment links load different HTML pages. Running ai-scan --crawl produced logs showing that many links were discovered, but the crawler exited without crawling them.

Specifying explicit input URLs doesn't work around the problem. We still strip URL fragments from input URLs.

Describe the solution you'd like

Clients of the service and the accessibility-insights-scan package should be able to control whether Apify includes URL fragments in its uniqueness check. We may be able to leverage the keepUrlFragment argument in Apify.

Clients who use URL fragments to link to sections of a page (like we do in https://accessibilityinsights.io/) would not use the option (to avoid scans on duplicate UI).
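One way the opt-in could be threaded through is sketched below. Everything here is an assumption for illustration: `CrawlOptions`, `RequestOptions`, and `toRequestOptions` are hypothetical names, not part of accessibility-insights-scan, and the only detail taken from this issue is that Apify exposes a `keepUrlFragment` argument. The default stays `false` so sites that use fragments for in-page sections keep today's behavior.

```typescript
// Hypothetical shapes; none of these names come from accessibility-insights-scan.
interface CrawlOptions {
    inputUrls: string[];
    // Proposed opt-in. Omitted or false mirrors the current behavior.
    keepUrlFragment?: boolean;
}

// Plain object carrying the per-request flag that would be handed to the crawler.
interface RequestOptions {
    url: string;
    keepUrlFragment: boolean;
}

// Copies the client's choice onto every input URL so fragments are not
// stripped before the uniqueness check runs.
function toRequestOptions(options: CrawlOptions): RequestOptions[] {
    const keepUrlFragment = options.keepUrlFragment === true;
    return options.inputUrls.map((url) => ({ url, keepUrlFragment }));
}
```

A client whose fragments load distinct pages would pass `keepUrlFragment: true`; everyone else would leave the option unset.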

ghost · 2021-06-04T00:12:09Z

This issue has been marked as ready for team triage; we will triage it in our weekly review and update the issue. Thank you for contributing to Accessibility Insights!

ferBonnin · 2021-06-14T20:21:41Z

Reviewed with Maxim. Marking this as ready for work for the CLI, to be leveraged in the GH action only. This is bug sized.

ferBonnin · 2022-08-15T20:14:26Z

adding a note that another user has encountered this issue

karanbirsingh added the feature request label Jun 4, 2021

ghost added the status: new label Jun 4, 2021

ghost assigned karanbirsingh Jun 4, 2021

karanbirsingh added the status: ready for triage label Jun 4, 2021

ghost removed the status: new label Jun 4, 2021

karanbirsingh assigned ferBonnin and unassigned karanbirsingh Jun 4, 2021

ferBonnin added bug status: ready for work and removed feature request status: ready for triage labels Jun 14, 2021
