-
Notifications
You must be signed in to change notification settings - Fork 886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix json bytes content type detection #3941
Conversation
|
||
file_buffer = io.BytesIO(json_bytes) | ||
predicted_type = detect_filetype(file=file_buffer, metadata_file_path="filename.pdf") | ||
assert predicted_type == FileType.JSON |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was previously resolved as FileType.PDF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, and what is the reason this file has pdf
extension? Shouldn't it be changed somewhere earlier in the pipeline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filename extension should be used as a last resort, the unit test only illustrates this behaviour ;)
ndjson_string = "\n".join(json.dumps(item) for item in data) + "\n" | ||
ndjson_bytes = ndjson_string.encode("utf-8") | ||
|
||
file_buffer = io.BytesIO(ndjson_bytes) | ||
predicted_type = detect_filetype( | ||
file=file_buffer, metadata_file_path="filename.pdf", content_type="application/json" | ||
) | ||
assert predicted_type == FileType.NDJSON |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cragwolfe is this expected behaviour? or in such a case we should fully trust the provided content type and return FileType.JSON?
(if so the logic for the edge case can be simply hidden inside the 'magic' library strategy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have an exact unit test for that test case, which was added in some 'fix something' git PR, so I assume this was very intentional
def test_it_identifies_NDJSON_for_file_with_ndjson_extension_but_JSON_content_type():
file_path = example_doc_path("simple.ndjson")
assert detect_filetype(file_path, content_type=FileType.JSON.mime_type) == FileType.NDJSON
So I am moving forward with the current implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
# Unstructured JSON serialization format | ||
text = '{"hi": "there"}' | ||
def test_auto_partition_processes_simple_ndjson(tmp_path: pathlib.Path): | ||
text = '{"text": "hello", "type": "NarrativeText"}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is valid one-element ndjson
…nt-type-detection
Fixes order of content type detection strategies for byte-encoded jsons.
Before
Before
PDF
Now
JSON