community[patch]: RecursiveUrlLoader: add base_url option (langchain-ai#19421)

RecursiveUrlLoader currently provides no way to set a `base_url` separate
from the starting `url`, even though the underlying link-extraction
function accepts one.
For example, this makes it unable to crawl
`https://python.langchain.com/docs`: that URL returns a 404 page, while
`https://python.langchain.com/docs/get_started/introduction` has no
child routes to follow.
A separate `base_url` lets you filter links against
`https://python.langchain.com/docs` while starting the crawl at any page
beneath it that contains relevant links to continue crawling from.
For this particular site the docusaurus loader could be used instead,
but the same issue affects many websites.
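The filtering idea behind `base_url` can be sketched with the standard
library alone. The helper below is a simplified stand-in for what
`extract_sub_links` does with `prevent_outside=True`: resolve each link
against the page it was found on, then keep only links under the base URL
(the function name `filter_sub_links` is hypothetical, for illustration):

```python
from urllib.parse import urljoin

def filter_sub_links(links, page_url, base_url):
    """Resolve relative links against page_url and keep only those under
    base_url. A simplified sketch of prevent_outside filtering; the real
    extract_sub_links also handles patterns and excluded prefixes."""
    absolute = (urljoin(page_url, link) for link in links)
    # Note: a plain prefix check would also match e.g. ".../docsomething";
    # this sketch ignores that edge case.
    return [u for u in absolute if u.startswith(base_url)]

links = ["/docs/modules", "/docs/get_started", "https://example.com/x"]
page = "https://python.langchain.com/docs/get_started/introduction"
base = "https://python.langchain.com/docs"
print(filter_sub_links(links, page, base))
# → ['https://python.langchain.com/docs/modules',
#    'https://python.langchain.com/docs/get_started']
```

With the new option, the loader itself would be constructed as
`RecursiveUrlLoader(url=page, base_url=base)`: crawling starts at `page`,
while links are filtered against `base`.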

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
3 people authored and Dave Bechberger committed Mar 29, 2024
1 parent d9acda0 commit 2167ae4
Showing 1 changed file with 6 additions and 2 deletions.
@@ -94,6 +94,8 @@ def __init__(
         headers: Optional[dict] = None,
         check_response_status: bool = False,
         continue_on_failure: bool = True,
+        *,
+        base_url: Optional[str] = None,
     ) -> None:
         """Initialize with URL to crawl and any subdirectories to exclude.
@@ -120,6 +122,7 @@ def __init__(
                 URLs with error responses (400-599).
             continue_on_failure: If True, continue if getting or parsing a link raises
                 an exception. Otherwise, raise the exception.
+            base_url: The base url to check for outside links against.
         """

self.url = url
@@ -146,6 +149,7 @@ def __init__(
         self.headers = headers
         self.check_response_status = check_response_status
         self.continue_on_failure = continue_on_failure
+        self.base_url = base_url if base_url is not None else url

     def _get_child_links_recursive(
         self, url: str, visited: Set[str], *, depth: int = 0
@@ -187,7 +191,7 @@ def _get_child_links_recursive(
             sub_links = extract_sub_links(
                 response.text,
                 url,
-                base_url=self.url,
+                base_url=self.base_url,
                 pattern=self.link_regex,
                 prevent_outside=self.prevent_outside,
                 exclude_prefixes=self.exclude_dirs,
@@ -273,7 +277,7 @@ async def _async_get_child_links_recursive(
             sub_links = extract_sub_links(
                 text,
                 url,
-                base_url=self.url,
+                base_url=self.base_url,
                 pattern=self.link_regex,
                 prevent_outside=self.prevent_outside,
                 exclude_prefixes=self.exclude_dirs,
