Fix json bytes content type detection #3941

plutasnyy · 2025-03-05T12:13:07Z

Fixes the order of content-type detection strategies for byte-encoded JSON.

Example:

json_bytes = json.dumps([{"example": "data"}]).encode("utf-8")
file_buffer = io.BytesIO(json_bytes)
detect_filetype(file=file_buffer, metadata_file_path="filename.pdf")

Before: PDF
Now: JSON

plutasnyy · 2025-03-05T12:22:06Z

test_unstructured/file_utils/test_filetype.py

+
+    file_buffer = io.BytesIO(json_bytes)
+    predicted_type = detect_filetype(file=file_buffer, metadata_file_path="filename.pdf")
+    assert predicted_type == FileType.JSON


This was previously resolved as FileType.PDF.

Hmm, why does this file have a .pdf extension? Shouldn't that be corrected somewhere earlier in the pipeline?

The filename extension should be used only as a last resort; the unit test just illustrates this behaviour ;)
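To illustrate the "extension as a last resort" ordering, here is a minimal sketch. It is not the library's actual implementation: the function name `sketch_detect`, the string labels, and the tiny extension table are all hypothetical. The only point is that the content is inspected before the filename is consulted.

```python
import io
import json
import os

# Hypothetical extension table, for illustration only.
_EXTENSION_TYPES = {".pdf": "PDF", ".json": "JSON", ".txt": "TXT"}


def _looks_like_json(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 and parse as a JSON object or array."""
    try:
        parsed = json.loads(data.decode("utf-8"))
    except ValueError:  # covers UnicodeDecodeError and json.JSONDecodeError
        return False
    return isinstance(parsed, (dict, list))


def sketch_detect(file: io.BytesIO, metadata_file_path: str) -> str:
    """Inspect the content first; fall back to the filename extension last."""
    data = file.read()
    file.seek(0)  # rewind the buffer to the start for the caller
    if data.startswith(b"%PDF-"):
        return "PDF"
    if _looks_like_json(data):
        return "JSON"
    # Last resort: trust the extension taken from the metadata path.
    ext = os.path.splitext(metadata_file_path)[1].lower()
    return _EXTENSION_TYPES.get(ext, "UNKNOWN")


json_bytes = json.dumps([{"example": "data"}]).encode("utf-8")
print(sketch_detect(io.BytesIO(json_bytes), "filename.pdf"))  # prints JSON
```

With this ordering a misleading `.pdf` name only matters when the bytes themselves give no answer.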

plutasnyy · 2025-03-05T12:27:45Z

test_unstructured/file_utils/test_filetype.py

+    ndjson_string = "\n".join(json.dumps(item) for item in data) + "\n"
+    ndjson_bytes = ndjson_string.encode("utf-8")
+
+    file_buffer = io.BytesIO(ndjson_bytes)
+    predicted_type = detect_filetype(
+        file=file_buffer, metadata_file_path="filename.pdf", content_type="application/json"
+    )
+    assert predicted_type == FileType.NDJSON


@cragwolfe is this the expected behaviour? Or should we fully trust the provided content type in such a case and return FileType.JSON?

(If so, the logic for this edge case can simply be hidden inside the 'magic' library strategy.)

We already have a unit test for exactly this case, which was added in some earlier 'fix something' PR, so I assume the behaviour is intentional:

def test_it_identifies_NDJSON_for_file_with_ndjson_extension_but_JSON_content_type():
    file_path = example_doc_path("simple.ndjson")
    assert detect_filetype(file_path, content_type=FileType.JSON.mime_type) == FileType.NDJSON

So I am moving forward with the current implementation.
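The edge case in this thread can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `refine_json_type` is a hypothetical helper, and the rule it applies is only one plausible heuristic. It reports NDJSON when the payload fails to parse as a single JSON document but every non-empty line parses on its own.

```python
import json


def _parses(text: str) -> bool:
    """Return True if the text is a single valid JSON document."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


def refine_json_type(data: bytes, content_type: str) -> str:
    """Hypothetical helper: refine a declared JSON content type by inspecting the bytes."""
    if content_type != "application/json":
        return "UNKNOWN"
    text = data.decode("utf-8")
    if _parses(text):
        return "JSON"
    # Not one document: check whether each non-empty line is a document.
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and all(_parses(line) for line in lines):
        return "NDJSON"
    return "UNKNOWN"


data = [{"a": 1}, {"b": 2}]
ndjson_bytes = ("\n".join(json.dumps(item) for item in data) + "\n").encode("utf-8")
print(refine_json_type(ndjson_bytes, "application/json"))  # prints NDJSON
```

Note that a single-line payload is ambiguous under this heuristic and comes back as JSON, so a real implementation would need an extra signal (such as the file extension) to settle that case.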

MaksOpp

lgtm

plutasnyy · 2025-03-05T12:56:42Z

test_unstructured/partition/test_auto.py

-    # Unstructured JSON serialization format
-    text = '{"hi": "there"}'
+def test_auto_partition_processes_simple_ndjson(tmp_path: pathlib.Path):
+    text = '{"text": "hello", "type": "NarrativeText"}'


This is a valid one-element NDJSON document.
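For reference, a one-line payload like the one in this test is read as a single record when it is split on newlines. A small sketch, where the helper name `read_ndjson` is made up for illustration:

```python
import json


def read_ndjson(text: str) -> list:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


text = '{"text": "hello", "type": "NarrativeText"}'
records = read_ndjson(text)
print(len(records))  # prints 1
```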

plutasnyy added 2 commits March 5, 2025 12:55

Fix filetype recognition of jsons encoded in bytes

64bb59f

Refactor

fbe2dc3

plutasnyy marked this pull request as ready for review March 5, 2025 12:24

plutasnyy requested review from cragwolfe, rbiseck3 and MaksOpp March 5, 2025 12:24

MaksOpp approved these changes Mar 5, 2025

Fix unit test

448706a

plutasnyy self-assigned this Mar 5, 2025

plutasnyy added 4 commits March 5, 2025 14:37

Compile deps

00f0f0a

Compile deps

18c9873

Fix markdown installation

9630d99

Fix ndjson installation

bf96664

plutasnyy changed the title ~~Fix json stream content type detection~~ Fix json bytes content type detection Mar 5, 2025

plutasnyy added 2 commits March 7, 2025 09:45

Merge remote-tracking branch 'origin/main' into fix-json-stream-content-type-detection

1186d78

Bump version

8bce9cf

plutasnyy enabled auto-merge March 7, 2025 08:49

plutasnyy disabled auto-merge March 7, 2025 09:06

Remove ndjson

11668bb

plutasnyy added this pull request to the merge queue Mar 7, 2025

Merged via the queue into main with commit 74b0647 Mar 7, 2025
43 checks passed

plutasnyy deleted the fix-json-stream-content-type-detection branch March 7, 2025 11:08
