Fix json bytes content type detection #3941

Merged
merged 10 commits into from
Mar 7, 2025

@plutasnyy plutasnyy commented Mar 5, 2025

Fixes the order of content-type detection strategies for byte-encoded JSON.

Before

import io
import json
from unstructured.file_utils.filetype import detect_filetype
json_bytes = json.dumps([{"example": "data"}]).encode("utf-8")
file_buffer = io.BytesIO(json_bytes)
detect_filetype(file=file_buffer, metadata_file_path="filename.pdf")

Before
PDF

Now
JSON


file_buffer = io.BytesIO(json_bytes)
predicted_type = detect_filetype(file=file_buffer, metadata_file_path="filename.pdf")
assert predicted_type == FileType.JSON
Contributor Author

This was previously resolved as FileType.PDF

Contributor

Hmm, why does this file have a .pdf extension? Shouldn't that be changed somewhere earlier in the pipeline?

Contributor Author

The filename extension should only be used as a last resort; the unit test just illustrates this behaviour ;)
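To make the "extension as a last resort" ordering concrete, here is a minimal sketch. It is not the library's implementation: `sniff_content` and `detect_kind` are hypothetical names, the return values are plain strings rather than `FileType` members, and the content check only recognises PDF magic bytes and parseable JSON.

```python
import io
import json
import os
from typing import IO, Optional


def sniff_content(data: bytes) -> Optional[str]:
    """Guess a type from the bytes themselves; return None when unsure."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    try:
        json.loads(data.decode("utf-8"))
        return "json"
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None


def detect_kind(file: IO[bytes], metadata_file_path: Optional[str] = None) -> Optional[str]:
    """Hypothetical detector: inspect content first, fall back to the extension."""
    data = file.read()
    file.seek(0)  # leave the buffer readable for the caller
    kind = sniff_content(data)
    if kind is not None:
        return kind
    if metadata_file_path:
        # Last resort: trust the file name only when the content told us nothing.
        ext = os.path.splitext(metadata_file_path)[1].lstrip(".").lower()
        return ext or None
    return None


buffer = io.BytesIO(json.dumps([{"example": "data"}]).encode("utf-8"))
result = detect_kind(buffer, metadata_file_path="filename.pdf")  # "json", not "pdf"
```

With this ordering the misleading `filename.pdf` only matters when the bytes cannot be identified on their own.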

@plutasnyy plutasnyy marked this pull request as ready for review March 5, 2025 12:24
Comment on lines +971 to +978
ndjson_string = "\n".join(json.dumps(item) for item in data) + "\n"
ndjson_bytes = ndjson_string.encode("utf-8")

file_buffer = io.BytesIO(ndjson_bytes)
predicted_type = detect_filetype(
    file=file_buffer, metadata_file_path="filename.pdf", content_type="application/json"
)
assert predicted_type == FileType.NDJSON
Contributor Author

@cragwolfe is this the expected behaviour? Or should we fully trust the provided content type in such a case and return FileType.JSON?

(If so, the logic for this edge case can simply be hidden inside the 'magic' library strategy.)

Contributor Author

We already have a unit test for exactly this case, added in an earlier PR with a generic 'fix something' title, so I assume the behaviour is intentional:

def test_it_identifies_NDJSON_for_file_with_ndjson_extension_but_JSON_content_type():
    file_path = example_doc_path("simple.ndjson")
    assert detect_filetype(file_path, content_type=FileType.JSON.mime_type) == FileType.NDJSON

So I am moving forward with the current implementation
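As a rough illustration of why a declared `application/json` content type can still resolve to NDJSON, the sketch below refines the hint by looking at the payload. The names `is_ndjson` and `resolve` are hypothetical, the results are plain strings, and treating a single-line payload as JSON is an assumption of this sketch, not a statement about what the library does.

```python
import json
from typing import Optional


def is_ndjson(data: bytes) -> bool:
    """True when several non-empty lines each parse as JSON but the whole payload does not."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        return False  # assumption: a single line is treated as plain JSON here
    try:
        json.loads(text)
        return False  # one (possibly pretty-printed) JSON document
    except json.JSONDecodeError:
        pass
    try:
        for line in lines:
            json.loads(line)
    except json.JSONDecodeError:
        return False
    return True


def resolve(data: bytes, content_type: Optional[str]) -> Optional[str]:
    """Hypothetical: refine a JSON content-type hint using the payload itself."""
    if content_type == "application/json":
        return "ndjson" if is_ndjson(data) else "json"
    return None


items = [{"a": 1}, {"b": 2}]
ndjson_bytes = ("\n".join(json.dumps(item) for item in items) + "\n").encode("utf-8")
json_bytes = json.dumps(items, indent=2).encode("utf-8")
```

Here `resolve(ndjson_bytes, "application/json")` gives `"ndjson"`, while the pretty-printed `json_bytes` stays `"json"` even though it also spans several lines.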

@MaksOpp MaksOpp left a comment

lgtm

# Unstructured JSON serialization format
text = '{"hi": "there"}'
def test_auto_partition_processes_simple_ndjson(tmp_path: pathlib.Path):
text = '{"text": "hello", "type": "NarrativeText"}'
Contributor Author

This is a valid one-element NDJSON document.
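A quick way to see this is to parse the same one-line string both ways. `parse_ndjson` below is a hypothetical helper written for this illustration, not a function from the library.

```python
import json


def parse_ndjson(text: str) -> list:
    """Parse newline-delimited JSON: one JSON value per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


text = '{"text": "hello", "type": "NarrativeText"}'
records = parse_ndjson(text)   # a list holding exactly one record
document = json.loads(text)    # the same payload read as a single JSON document
```

Both calls succeed, so the content alone cannot tell the two formats apart for a single-line payload.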

@plutasnyy plutasnyy self-assigned this Mar 5, 2025
@plutasnyy plutasnyy changed the title from "Fix json stream content type detection" to "Fix json bytes content type detection" Mar 5, 2025
@plutasnyy plutasnyy enabled auto-merge March 7, 2025 08:49
@plutasnyy plutasnyy disabled auto-merge March 7, 2025 09:06
@plutasnyy plutasnyy added this pull request to the merge queue Mar 7, 2025
Merged via the queue into main with commit 74b0647 Mar 7, 2025
43 checks passed
@plutasnyy plutasnyy deleted the fix-json-stream-content-type-detection branch March 7, 2025 11:08