Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: children nodes not carrying metadata from source nodes #15254

Merged
merged 5 commits into from
Aug 9, 2024

Conversation

EmanuelCampos
Copy link
Contributor

when running 2 node parsers the metadata is being lost

reproducible example:

from os import path

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
from llama_index.core.schema import Document

current_dir = path.dirname(path.realpath(__file__))

sample_pdf_path = path.join(current_dir, "apple_2023.txt")

with open(sample_pdf_path, "r") as f:
    documents = [Document(text=f.read())]

pipeline = IngestionPipeline(
    documents=documents,
    transformations=[
        MarkdownNodeParser(),
        SentenceSplitter(),
    ],
)

nodes = pipeline.run()

with open("output.txt", "w") as f:
    for node in nodes:
        f.write(node.get_content(metadata_mode="all"))
        f.write("\n---\n")

wip
@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Aug 9, 2024
@EmanuelCampos EmanuelCampos changed the title fix: parent nodes metadata being lost when have multiple transformations fix: children nodes not carrying metadata from source nodes Aug 9, 2024
@@ -89,7 +90,13 @@ def _postprocess_parsed_nodes(

# update metadata
if self.include_metadata:
node.metadata.update(parent_doc.metadata)
# Update parent_doc.metadata with node.metadata, giving preference to node's values
node.metadata = {**parent_doc.metadata, **node.metadata}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we resolve conflicts if there are multiple child nodes all updating the same parent? It seems like it would be good to put it in a separate key and merge when you refer to the nodes. That way you get more control over whether you access only the parent attributes or not.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I think my comment applies to the chunk below. The comment in the code is incorrect- this line is merging parent metadata into the current node's metadata.

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Aug 9, 2024
wip

Verified

This commit was signed with the committer’s verified signature.
philip-peterson Philip Peterson

Verified

This commit was signed with the committer’s verified signature.
philip-peterson Philip Peterson
…lama_index into fix/parent-node-metadata
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Aug 9, 2024
@nerdai nerdai merged commit 5e28220 into run-llama:main Aug 9, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm This PR has been approved by a maintainer size:M This PR changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants