use jq schema for content_key in json_loader #11255

kzk-maeda · 2023-10-01T04:12:17Z

Description

Changed content_key in JSONLoader so that it accepts a jq schema expression instead of only a single key.

Sorry, I created the same Pull Request in the past, but some time has elapsed and it was closed, so I am recreating it.

Why

For JSON data like the following, where .data[].attributes.message should become page_content and .data[].id, .data[].attributes.tags, etc. should become metadata, content_key must also be able to parse the JSON structure.

sample json data

{
  "data": [
    {
      "attributes": {
        "attributes": {
          "dd": {
            "service": "worker",
            "env": "production",
            "version": "6b2c46a5883a9097aa2cb09907786b5c06ca3bd0"
          },
          "source": "stderr",
          "service": "worker",
          "name": "app.services.predict_services",
          "levelname": "ERROR",
          "container_id": "f9f880b243cc41f1a7d9bae5bf922d60-1262558729",
          "timestamp": 1686679407827.0
        },
        "message": "error while processing user #19084: 'numpy.dtype[bool_]' object is not callable",
        "service": "worker",
        "status": "error",
        "tags": [
          "datadog.submission_auth:private_api_key",
          "env:production"
        ],
        "timestamp": "2023-06-14T03:03:27.827000+09:00"
      },
      "id": "AgAAAYi17UzTUeIczgAAAAAAAAAYAAAAAEFZaTE3VThXQUFEOF9sS2J4Z3psRmdBRAAAACQAAAAAMDE4OGI2MzYtMmZlNC00ZDEwLThjZDMtMzhkZTI0NmUyNWMz",
      "type": "log"
    },
    {
      "attributes": {
        "attributes": {
          "dd": {
            "service": "worker",
            "env": "production",
            "version": "6b2c46a5883a9097aa2cb09907786b5c06ca3bd0"
          },
          "process": 42.0,
          "messages": "error while processing user #19084: 'numpy.dtype[bool_]' object is not callable",
          "levelname": "ERROR",
          "container_id": "f9f880b243cc41f1a7d9bae5bf922d60-1262558729",
          "timestamp": 1686679407831.0
        },
        "message": "{\"messages\": \"error while processing user #19084: 'numpy.dtype[bool_]' object is not callable\"}",
        "service": "worker",
        "status": "error",
        "tags": [
          "datadog.submission_auth:private_api_key",
          "env:production"
        ],
        "timestamp": "2023-06-14T03:03:27.831000+09:00"
      },
      "id": "AgAAAYi17UzXUeIczwAAAAAAAAAYAAAAAEFZaTE3VThXQUFEOF9sS2J4Z3psRmdBRQAAACQAAAAAMDE4OGI2MzYtMmZlNC00ZDEwLThjZDMtMzhkZTI0NmUyNWMz",
      "type": "log"
    }
  ],
  "meta": {
    "elapsed": 26,
    "request_id": "pddv1ChY0SmttaDA0a1REeXZRM01yNkFwYnd3Ii0KHWIANr8mghGpsMIX2cOarI6t4WyTVObXx3wrAuudEgzSbtmduLtPxFVkSo0",
    "status": "done"
  }
}

sample code

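from langchain.document_loaders import JSONLoader


def metadata_func(record: dict, metadata: dict) -> dict:
    # record is one element selected by jq_schema (here, one item of .data[])
    print(record)

    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")

    return metadata

sample_file = "sample.json"
loader = JSONLoader(
    file_path=sample_file,
    jq_schema=".data[]",
    # with this change, content_key is evaluated as a jq expression on each record
    content_key=".attributes.message",
    metadata_func=metadata_func
)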

Dependencies

none

Tag maintainer

@rlancemartin, @eyurtsev

Twitter handle

kzk_maeda

eyurtsev · 2023-10-01T19:35:10Z

libs/langchain/langchain/document_loaders/json_loader.py

@@ -49,7 +50,7 @@ def __init__(

        self.file_path = Path(file_path).resolve()
        self._jq_schema = jq.compile(jq_schema)
-        self._content_key = content_key
+        self._content_key = jq.compile(content_key) if content_key else None


The current implementation was added in May, so there might be existing users relying on the previous functionality.

Would you be able to make this change backwards compatible? We could introduce a new attribute that, if provided, takes precedence.

kzk-maeda · 2023-10-02

Understood, I will think about it a bit.

kzk-maeda · 2023-10-05

@eyurtsev
To ensure backward compatibility, I've added a flag named is_content_key_jq_parsable, which is set to False by default.
When is_content_key_jq_parsable=False, you can specify content_key as a string, as before.
When is_content_key_jq_parsable=True, you can specify a jq schema key for content_key.
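
As a rough illustration of the two modes described above, here is a minimal sketch. It is not the loader's actual code: `extract_content` and `resolve_dotted_path` are hypothetical names, and the path resolver is a simplified stand-in for jq that only understands dot-separated object keys such as `.attributes.message`.

```python
from typing import Any


def resolve_dotted_path(record: Any, path: str) -> Any:
    """Follow a simple jq-style path such as ".attributes.message".

    Assumption: this is a stand-in for jq and handles only
    dot-separated object keys (no arrays, pipes, or filters).
    """
    current = record
    for key in path.strip(".").split("."):
        if key:  # the identity path "." yields no keys
            current = current[key]
    return current


def extract_content(
    record: dict,
    content_key: str,
    is_content_key_jq_parsable: bool = False,
) -> Any:
    """Pick the page content out of one record (hypothetical helper)."""
    if is_content_key_jq_parsable:
        # New opt-in mode: content_key is treated as a path expression.
        return resolve_dotted_path(record, content_key)
    # Default mode: content_key is a plain dictionary key, as before.
    return record[content_key]


record = {"attributes": {"message": "message1", "tags": ["tag1"]}, "id": "1"}

nested = extract_content(
    record, ".attributes.message", is_content_key_jq_parsable=True
)
flat = extract_content(record, "id")
```

Because the flag defaults to False, callers that already pass a plain key keep the old lookup, and only callers that opt in get path-style resolution.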

baskaryan · 2024-02-14T21:17:06Z

Apologies for the slow review! The PR has some merge conflicts; happy to re-review if you'd like to resolve them.

kzk-maeda · 2024-02-23T02:03:06Z

I've confirmed that the code I'd like to change has moved into the community package due to the architectural change associated with the stable version release.
So I'm going to close this PR, rewrite the proposed change against the community package, and open a new PR as soon as possible. Please review the new PR once it is up!

kzk-maeda · 2024-02-23T08:12:40Z

I've created a new PR: #18003
and I'm going to close this one.

### Description

Changed the value specified for `content_key` in JSONLoader from a single key to a value based on jq schema.
I created a [similar PR](#11255) before, but it had several conflicts because of the architectural change associated with the stable version release, so I re-created this PR to fit the new architecture.

### Why

For JSON data like the following, where `.data[].attributes.message` should become page_content and `.data[].id`, `.data[].attributes.tags`, etc. should become metadata, the `content_key` must also be able to parse the JSON structure.

<details>
<summary>sample json data</summary>

```json
{
  "data": [
    {
      "attributes": {
        "message": "message1",
        "tags": [
          "tag1"
        ]
      },
      "id": "1"
    },
    {
      "attributes": {
        "message": "message2",
        "tags": [
          "tag2"
        ]
      },
      "id": "2"
    }
  ]
}
```

</details>

<details>
<summary>sample code</summary>

```python
from langchain_community.document_loaders import JSONLoader


def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["source"] = None
    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")

    return metadata

sample_file = "sample1.json"
loader = JSONLoader(
    file_path=sample_file,
    jq_schema=".data[]",
    content_key=".attributes.message",  ## content_key is parsable into jq schema
    is_content_key_jq_parsable=True,  ## this is added parameter
    metadata_func=metadata_func
)

data = loader.load()
data
```

</details>

### Dependencies

none

### Twitter handle

[kzk_maeda](https://twitter.com/kzk_maeda)

use jq schema for content_key in json_loader

88cb9bc

dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Oct 1, 2023

eyurtsev reviewed Oct 1, 2023

kzk-maeda added 3 commits October 5, 2023 23:42

Add a flag to maintain backward compatibility

b2a2089

delete unnecessary line

93f22ae

change jq dependency

c196a7a

hwchase17 closed this Jan 30, 2024

baskaryan reopened this Jan 30, 2024

kzk-maeda mentioned this pull request Feb 23, 2024

community: use jq schema for content_key in json_loader #18003

Merged

kzk-maeda closed this Feb 23, 2024
