community: use jq schema for content_key in json_loader #18003

kzk-maeda · 2024-02-23T08:11:17Z

Description

Changed content_key in JSONLoader so that it can be given as a jq expression instead of only a single key.
I opened a similar PR (#11255) before, but it accumulated several conflicts because of the architectural changes that came with the stable version release, so I am re-creating it here to fit the new architecture.

Why

For JSON data like the following, where .data[].attributes.message should become page_content and .data[].id, .data[].attributes.tags, etc. should become metadata, content_key must also be able to traverse the JSON structure.

sample json data

{
  "data": [
    {
      "attributes": {
        "message": "message1",
        "tags": [
          "tag1"
        ]
      },
      "id": "1"
    },
    {
      "attributes": {
        "message": "message2",
        "tags": [
          "tag2"
        ]
      },
      "id": "2"
    }
  ]
}

sample code

from langchain_community.document_loaders import JSONLoader


def metadata_func(record: dict, metadata: dict) -> dict:
    # record is one element selected by jq_schema (here, one item of .data[])
    metadata["source"] = None
    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")

    return metadata

sample_file = "sample1.json"
loader = JSONLoader(
    file_path=sample_file,
    jq_schema=".data[]",
    content_key=".attributes.message",  # content_key is parsed as a jq expression
    is_content_key_jq_parsable=True,  # parameter added by this PR
    metadata_func=metadata_func
)

data = loader.load()
data
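
To make the intended result concrete, here is a minimal pure-Python sketch of what the extraction amounts to for the sample data above. It is not how JSONLoader is implemented (the loader evaluates real jq expressions); resolve_path and extract are hypothetical helpers written only for this illustration, and resolve_path handles nothing beyond simple dotted paths such as .attributes.message.

```python
import json

# The sample JSON from the description, inlined so the sketch is self-contained.
SAMPLE = json.loads("""
{"data": [
  {"attributes": {"message": "message1", "tags": ["tag1"]}, "id": "1"},
  {"attributes": {"message": "message2", "tags": ["tag2"]}, "id": "2"}
]}
""")


def resolve_path(record, path):
    """Hypothetical helper: follow a simple jq-style dotted path like '.a.b'."""
    value = record
    for key in path.strip(".").split("."):
        if key:  # an empty segment (path '.') means "the record itself"
            value = value[key]
    return value


def extract(records, content_key, is_jq_path):
    """Hypothetical helper: pick the content of each record.

    With is_jq_path=False the key is a plain top-level lookup (the old
    behaviour described in the PR); with is_jq_path=True it is treated
    as a path into the record.
    """
    contents = []
    for record in records:
        if is_jq_path:
            contents.append(resolve_path(record, content_key))
        else:
            contents.append(record[content_key])
    return contents


records = SAMPLE["data"]  # plays the role of jq_schema=".data[]"
page_contents = extract(records, ".attributes.message", is_jq_path=True)
top_level_ids = extract(records, "id", is_jq_path=False)
print(page_contents)
print(top_level_ids)
```

A plain top-level key such as "id" can be reached without path traversal, but "message" is nested under "attributes", which is the case this PR addresses.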

Dependencies

none

Twitter handle

kzk_maeda



kzk-maeda added 2 commits February 23, 2024 16:33

add is_content_key_jq_parsable param into json_loader

e221b1f

add new section into json loader doc

c5b3992

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 23, 2024

dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Feb 23, 2024

kzk-maeda mentioned this pull request Feb 23, 2024

use jq schema for content_key in json_loader #11255

Closed

vercel bot deployed to Preview February 23, 2024 08:18 View deployment

baskaryan approved these changes Mar 1, 2024


dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Mar 1, 2024

baskaryan approved these changes Mar 5, 2024


baskaryan merged commit 60c5d96 into langchain-ai:master Mar 5, 2024
63 checks passed
