Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
community[patch]: RecursiveUrlLoader: add base_url option (langchain-…
…ai#19421) RecursiveUrlLoader does not currently provide an option to set `base_url` other than the `url`, though it uses a function with such an option. For example, this causes it unable to parse the `https://python.langchain.com/docs`, as it returns the 404 page, and `https://python.langchain.com/docs/get_started/introduction` has no child routes to parse. `base_url` allows setting the `https://python.langchain.com/docs` to filter by, while the starting URL is anything inside, that contains relevant links to continue crawling. I understand that for this case, the docusaurus loader could be used, but it's a common issue with many websites. --------- Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com>
- Loading branch information