Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

langchain[minor]: Firecrawl Document Loader #5180

Merged
merged 18 commits into from
Apr 24, 2024

Conversation

nickscamara
Copy link
Contributor

Added the FireCrawl document loader. Firecrawl crawls and convert any website into LLM-ready data. It crawls all accessible subpages and give you clean markdown for each.

Similar to langchain-ai/langchain#20364 but JS/TS :)

Thank you!

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 22, 2024
Copy link

vercel bot commented Apr 22, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchainjs-api-refs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Apr 24, 2024 9:26pm
langchainjs-docs ✅ Ready (Inspect) Visit Preview Apr 24, 2024 9:26pm

@dosubot dosubot bot added the auto:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features label Apr 22, 2024
@@ -0,0 +1,13 @@
import { FireCrawlLoader } from "langchain/document_loaders/web/firecrawl";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! 👋 I've reviewed the code and noticed that the added lines explicitly access an environment variable using process.env. I've flagged this for your review to ensure it aligns with our best practices for handling environment variables. Let me know if you need further assistance with this.

@@ -1314,6 +1315,7 @@
"@gomomento/sdk-web": "^1.51.1",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! I noticed that this PR adds a new hard dependency with the addition of "@mendable/firecrawl-js". I've flagged this for your review as it's important to ensure the impact of this change on the project's dependencies. Keep up the great work!

@@ -0,0 +1,34 @@
/* eslint-disable no-process-env */
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! I noticed that the recent changes in the FireCrawlLoader tests are accessing environment variables directly. I've flagged this for your review to ensure proper handling of sensitive information. Let me know if you need any further assistance with this.

@@ -0,0 +1,96 @@
import FirecrawlApp from "@mendable/firecrawl-js";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! I noticed that the recent code changes explicitly access an environment variable using getEnvironmentVariable. I've flagged this for your review to ensure it aligns with the intended functionality. Keep up the great work!

Copy link
Collaborator

@bracesproul bracesproul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple nits, besides that it looks pretty good!


import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx";

<IntegrationInstallTooltip></IntegrationInstallTooltip>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for an IntegrationInstallTooltip for non LC deps. You only this for langchain package installs


## Setup

You'll need to sign up and retrieve your [FireCrawl API key](https://firecrawl.dev).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
You'll need to sign up and retrieve your [FireCrawl API key](https://firecrawl.dev).
You'll need to sign up and retrieve your [FireCrawl API key](https://firecrawl.dev).

I would add some mention of the 100 free credits. I could easily see a user reading up until the API key requirement part and getting discouraged before they're even able to try it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100%

examples/src/document_loaders/firecrawl.ts Outdated Show resolved Hide resolved
Comment on lines 15 to 16
apiKey?: string;
mode?: "crawl" | "scrape";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add JSDoc here documenting what the default values are for both of these

langchain/src/document_loaders/web/firecrawl.ts Outdated Show resolved Hide resolved
* @returns An array of Documents representing the retrieved data.
* @throws An error if the data could not be loaded.
*/
public async load(): Promise<Document[]> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer DocumentInterface

Suggested change
public async load(): Promise<Document[]> {
public async load(): Promise<DocumentInterface[]> {

nickscamara and others added 3 commits April 22, 2024 17:24
Co-authored-by: Brace Sproul <braceasproul@gmail.com>
Co-authored-by: Brace Sproul <braceasproul@gmail.com>
@nickscamara
Copy link
Contributor Author

@bracesproul Thank you. Just fixed all the suggestions. Let me know if you spot any other ones.

Copy link
Collaborator

@bracesproul bracesproul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for adding this!!

@dosubot dosubot bot added the lgtm PRs that are ready to be merged as-is label Apr 23, 2024
Copy link
Collaborator

@bracesproul bracesproul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, the amount of changed lines in the yarn.lock is concerning. I'm going to push up a PR in a second resolving

Resolved

@bracesproul bracesproul added hold On hold and removed lgtm PRs that are ready to be merged as-is labels Apr 24, 2024
@bracesproul bracesproul added lgtm PRs that are ready to be merged as-is and removed hold On hold labels Apr 24, 2024
@jacoblee93
Copy link
Collaborator

Thank you!

Note that we'll move this to @langchain/community in 0.2 in a few weeks.

@jacoblee93 jacoblee93 merged commit e0da231 into langchain-ai:main Apr 24, 2024
17 checks passed
jeasonnow pushed a commit to jeasonnow/langchainjs that referenced this pull request Apr 25, 2024
* Nick: init

* Update firecrawl.ts

* Nick:

* Nick:

* Update package.json

* Nick: fixes docs

* Update yarn.lock

* Update examples/src/document_loaders/firecrawl.ts

Co-authored-by: Brace Sproul <braceasproul@gmail.com>

* Update langchain/src/document_loaders/web/firecrawl.ts

Co-authored-by: Brace Sproul <braceasproul@gmail.com>

* Nick: fixes

* Update yarn.lock

* Fix yarn.lock

* lint & format

* Update firecrawl.ts

* Add entrypoint

---------

Co-authored-by: Brace Sproul <braceasproul@gmail.com>
Co-authored-by: Jacob Lee <jacoblee93@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features lgtm PRs that are ready to be merged as-is size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants