use jq schema for content_key in json_loader #11255

Closed
wants to merge 4 commits into from

kzk-maeda
Contributor

Description

Changed content_key in JSONLoader so that it is interpreted as a jq expression instead of a single key name.

Sorry, I created the same Pull Request in the past, but some time has elapsed and it was closed, so I am recreating it.

Why

For JSON data like the following, where page_content should come from .data[].attributes.message and metadata from .data[].id, .data[].attributes.tags, and so on, content_key must also be able to traverse the JSON structure rather than name a single top-level key.

sample json data
{
  "data": [
    {
      "attributes": {
        "attributes": {
          "dd": {
            "service": "worker",
            "env": "production",
            "version": "6b2c46a5883a9097aa2cb09907786b5c06ca3bd0"
          },
          "source": "stderr",
          "service": "worker",
          "name": "app.services.predict_services",
          "levelname": "ERROR",
          "container_id": "f9f880b243cc41f1a7d9bae5bf922d60-1262558729",
          "timestamp": 1686679407827.0
        },
        "message": "error while processing user #19084: 'numpy.dtype[bool_]' object is not callable",
        "service": "worker",
        "status": "error",
        "tags": [
          "datadog.submission_auth:private_api_key",
          "env:production"
        ],
        "timestamp": "2023-06-14T03:03:27.827000+09:00"
      },
      "id": "AgAAAYi17UzTUeIczgAAAAAAAAAYAAAAAEFZaTE3VThXQUFEOF9sS2J4Z3psRmdBRAAAACQAAAAAMDE4OGI2MzYtMmZlNC00ZDEwLThjZDMtMzhkZTI0NmUyNWMz",
      "type": "log"
    },
    {
      "attributes": {
        "attributes": {
          "dd": {
            "service": "worker",
            "env": "production",
            "version": "6b2c46a5883a9097aa2cb09907786b5c06ca3bd0"
          },
          "process": 42.0,
          "messages": "error while processing user #19084: 'numpy.dtype[bool_]' object is not callable",
          "levelname": "ERROR",
          "container_id": "f9f880b243cc41f1a7d9bae5bf922d60-1262558729",
          "timestamp": 1686679407831.0
        },
        "message": "{\"messages\": \"error while processing user #19084: 'numpy.dtype[bool_]' object is not callable\"}",
        "service": "worker",
        "status": "error",
        "tags": [
          "datadog.submission_auth:private_api_key",
          "env:production"
        ],
        "timestamp": "2023-06-14T03:03:27.831000+09:00"
      },
      "id": "AgAAAYi17UzXUeIczwAAAAAAAAAYAAAAAEFZaTE3VThXQUFEOF9sS2J4Z3psRmdBRQAAACQAAAAAMDE4OGI2MzYtMmZlNC00ZDEwLThjZDMtMzhkZTI0NmUyNWMz",
      "type": "log"
    }
  ],
  "meta": {
    "elapsed": 26,
    "request_id": "pddv1ChY0SmttaDA0a1REeXZRM01yNkFwYnd3Ii0KHWIANr8mghGpsMIX2cOarI6t4WyTVObXx3wrAuudEgzSbtmduLtPxFVkSo0",
    "status": "done"
  }
}
sample code
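from langchain.document_loaders import JSONLoader

def metadata_func(record: dict, metadata: dict) -> dict:
    # record is one element matched by jq_schema (here, one item of .data[])
    print(record)  # debug output

    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")

    return metadata

sample_file = "sample.json"
loader = JSONLoader(
    file_path=sample_file,
    jq_schema=".data[]",
    content_key=".attributes.message",  # jq expression evaluated against each record
    metadata_func=metadata_func
)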
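
To illustrate why a single key name is not enough for this layout, here is a minimal standard-library sketch. It does not use JSONLoader or jq; it works on a trimmed-down stand-in for the sample data and assumes that a plain content_key behaves like a top-level dictionary lookup on each record.

```python
import json

# Trimmed-down stand-in for the sample data (illustrative values only).
raw = """
{"data": [
  {"attributes": {"message": "message1", "tags": ["tag1"]}, "id": "1"},
  {"attributes": {"message": "message2", "tags": ["tag2"]}, "id": "2"}
]}
"""

# The records that jq_schema=".data[]" would iterate over.
records = json.loads(raw)["data"]

# A plain key can only reach top-level fields of each record,
# so looking up "message" directly finds nothing.
flat = [record.get("message") for record in records]

# The message actually lives one level down, under "attributes".
nested = [record["attributes"]["message"] for record in records]

print(flat)    # [None, None]
print(nested)  # ['message1', 'message2']
```

A jq expression such as .attributes.message expresses the second lookup as a string, which is what this PR proposes to accept for content_key.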

Dependencies

none

Tag maintainer

@rlancemartin, @eyurtsev

Twitter handle

kzk_maeda

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Oct 1, 2023
@@ -49,7 +50,7 @@ def __init__(

         self.file_path = Path(file_path).resolve()
         self._jq_schema = jq.compile(jq_schema)
-        self._content_key = content_key
+        self._content_key = jq.compile(content_key) if content_key else None
Collaborator

The current implementation was added in May, so there might be existing users relying on the previous functionality.

Would you be able to make this change backwards compatible? We could introduce a new attribute that, if provided, takes precedence.

Contributor Author

Understood, I will think about it a bit.

Contributor Author

@eyurtsev
To ensure backward compatibility, I've added a flag named is_content_key_jq_parsable, which is set to False by default.
When is_content_key_jq_parsable=False, you can specify content_key as a string, as before.
When is_content_key_jq_parsable=True, you can specify a jq schema key for content_key.
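
The behaviour described above can be sketched roughly as follows. This is not the code from the PR: `get_content` and `resolve_path` are hypothetical helpers, and `resolve_path` is a deliberately simplified stand-in for jq that only follows dotted paths through nested dictionaries.

```python
from typing import Any, Optional


def resolve_path(record: Any, path: str) -> Any:
    """Simplified stand-in for jq: follow a dotted path like ".attributes.message"."""
    value = record
    for part in path.strip(".").split("."):
        if not part:
            continue  # the path was "." (identity), so keep the current value
        if not isinstance(value, dict):
            return None  # cannot descend any further
        value = value.get(part)
    return value


def get_content(
    record: dict,
    content_key: Optional[str],
    is_content_key_jq_parsable: bool = False,
) -> Any:
    """Hypothetical helper showing how the flag selects the lookup strategy."""
    if content_key is None:
        return record  # no key given: use the whole record
    if is_content_key_jq_parsable:
        return resolve_path(record, content_key)  # new, opt-in behaviour
    return record[content_key]  # legacy behaviour: plain top-level key


record = {"id": "1", "message": "top", "attributes": {"message": "message1"}}

legacy = get_content(record, "message")
nested = get_content(record, ".attributes.message", is_content_key_jq_parsable=True)

print(legacy)  # top
print(nested)  # message1
```

Because the flag defaults to False, existing callers that pass a plain key keep the old lookup, which is the backward-compatibility property requested in the review.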

@hwchase17 hwchase17 closed this Jan 30, 2024
@baskaryan baskaryan reopened this Jan 30, 2024
@baskaryan
Collaborator

Apologies for the slow review! The PR has some merge conflicts; happy to re-review if you'd like to resolve them.

@kzk-maeda
Contributor Author

I've confirmed that the code I'd like to change has moved into the community package as part of the architectural change associated with the stable version release.
So I'm going to close this PR, rewrite my proposed change against the community package, and open a new PR as soon as possible. Once that is up, please review the new PR!

@kzk-maeda
Contributor Author

I've opened a new PR, #18003, and I'm going to close this one.

@kzk-maeda kzk-maeda closed this Feb 23, 2024
baskaryan pushed a commit that referenced this pull request Mar 5, 2024
### Description
Allow `content_key` in JSONLoader to be a jq expression instead of only a
single key name.
I opened a [similar PR](#11255) before, but it accumulated several
conflicts because of the architectural change associated with the
stable version release, so I have re-created it here to fit the new
architecture.

### Why
For JSON data like the following, where page_content should come from
`.data[].attributes.message` and metadata from `.data[].id`,
`.data[].attributes.tags`, and so on, the `content_key` must also be
able to traverse the JSON structure.

<details>
<summary>sample json data</summary>

```json
{
  "data": [
    {
      "attributes": {
        "message": "message1",
        "tags": [
          "tag1"
        ]
      },
      "id": "1"
    },
    {
      "attributes": {
        "message": "message2",
        "tags": [
          "tag2"
        ]
      },
      "id": "2"
    }
  ]
}
```

</details>

<details>
<summary>sample code</summary>

```python
from langchain_community.document_loaders import JSONLoader


def metadata_func(record: dict, metadata: dict) -> dict:
    # record is one element matched by jq_schema (here, one item of .data[])
    metadata["source"] = None
    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")

    return metadata


sample_file = "sample1.json"
loader = JSONLoader(
    file_path=sample_file,
    jq_schema=".data[]",
    content_key=".attributes.message",  # content_key is a jq expression
    is_content_key_jq_parsable=True,  # newly added parameter
    metadata_func=metadata_func,
)

data = loader.load()
data
```

</details>
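
To make the effect of `metadata_func` concrete, here is a small sketch that calls it directly on the first record of the sample data, without going through JSONLoader. The seed metadata (a `source` path and a `seq_num`) is an assumption about what the loader passes in; the path shown is a placeholder.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["source"] = None
    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")
    return metadata


# First record of the sample data, as selected by jq_schema=".data[]".
record = {"attributes": {"message": "message1", "tags": ["tag1"]}, "id": "1"}

# Assumption: the loader seeds each record's metadata with a source path
# and a sequence number before calling metadata_func.
seed = {"source": "/path/to/sample1.json", "seq_num": 1}

# Pass a copy so the seed dictionary itself is left unchanged.
result = metadata_func(record, dict(seed))

print(result)
# {'source': None, 'seq_num': 1, 'id': '1', 'tags': ['tag1']}
```

The function overwrites `source`, keeps any other seeded keys, and adds `id` and `tags` taken from the record.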

### Dependencies
none

### Twitter handle
[kzk_maeda](https://twitter.com/kzk_maeda)
thebhulawat pushed a commit to thebhulawat/langchain that referenced this pull request Mar 6, 2024
…hain-ai#18003)

bechbd pushed a commit to bechbd/langchain that referenced this pull request Mar 29, 2024
…hain-ai#18003)

gkorland pushed a commit to FalkorDB/langchain that referenced this pull request Mar 30, 2024
…hain-ai#18003)

Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases
4 participants