ROB: Relax flate decoding for too many lookup values #2331

stefan6419846 · 2023-12-07T09:00:00Z

When handling flate objects with a lookup table and the image mode 1, we would previously raise a generic AssertionError if the number of lookup values did not match.

This PR proposes to add a more meaningful error message. Additionally, cases where too many values are specified are now considered a warning only as I could not see any real difference.
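A minimal sketch of the behavior described above, not the actual pypdf code: `normalize_lookup` and its message texts are hypothetical, and `ValueError` stands in for whatever exception type the library really raises.

```python
import warnings


def normalize_lookup(lookup: bytes, expected_count: int) -> bytes:
    """Hypothetical helper illustrating the relaxed length check."""
    if len(lookup) < expected_count:
        # Too few values cannot be repaired, so fail with a descriptive message
        # instead of a bare AssertionError.
        raise ValueError(
            f"Not enough lookup values: expected {expected_count}, got {len(lookup)}."
        )
    if len(lookup) > expected_count:
        # Surplus values are tolerated: warn and drop them.
        warnings.warn(
            f"Too many lookup values: expected {expected_count}, got {len(lookup)}."
        )
        lookup = lookup[:expected_count]
    return lookup
```

With this shape, a table that is one byte too long still decodes, while a table that is too short fails with a message naming both counts.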

I might try to find a document version I can use for public test cases later on to at least cover the case where there are too many values.

codecov · 2023-12-07T09:06:02Z

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (8d3f879) 94.37% compared to head (5d3fc66) 94.32%.
Report is 1 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| pypdf/_xobj_image_helpers.py | 50.00% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2331      +/-   ##
==========================================
- Coverage   94.37%   94.32%   -0.05%     
==========================================
  Files          43       43              
  Lines        7660     7666       +6     
  Branches     1515     1518       +3     
==========================================
+ Hits         7229     7231       +2     
- Misses        267      269       +2     
- Partials      164      166       +2


stefan6419846 · 2023-12-07T10:16:10Z

Example: We have the lookup table b'\x00\x00\x00\xff\xff\xff\n', i.e. with a trailing whitespace character. This is generated by the SkyPDF Pro Driver, for example.
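To make the example concrete, here is a small sketch that inspects the surplus bytes of that table. The expected count of 6 is an assumption (two palette entries with three RGB components each), and `PDF_WHITESPACE` is a name chosen for illustration; the six byte values are the whitespace characters listed in the PDF specification.

```python
# Whitespace bytes per the PDF specification: NUL, HT, LF, FF, CR, SP.
# Building the set from a bytes literal stores the integer byte values.
PDF_WHITESPACE = frozenset(b"\x00\t\n\x0c\r ")

lookup = b"\x00\x00\x00\xff\xff\xff\n"
expected_count = 2 * 3  # assumption: 2 colors with 3 components each

# Everything after the expected values is surplus.
surplus = lookup[expected_count:]

# Iterating over bytes yields integers, which matches the integer set above.
only_whitespace = all(byte in PDF_WHITESPACE for byte in surplus)

print(surplus, only_whitespace)
```

Here the surplus is the single newline byte, so dropping it loses no palette data.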

pypdf/_xobj_image_helpers.py

stefan6419846 · 2023-12-08T07:54:21Z

Example document: out1.pdf

pubpub-zz

Looks good!
Now we just need to improve test coverage😉

stefan6419846 · 2023-12-08T20:26:19Z

In the next few days, I might look into how to emulate the two remaining error cases for which we do not have actual example files yet.

MartinThoma · 2023-12-09T18:13:03Z

@stefan6419846 Please correct me if I'm wrong, but I don't think it's a bugfix-PR.

A bug in pypdf would mean that pypdf is either...

... not acting according to the PDF specification, or
... not acting according to the pypdf documentation (directly or indirectly, if the user could get a reasonable expectation for a specific behavior)

This PR has two components:

ROB: Not raising an exception but just throwing a warning in case of too many values.
STY: Throwing a more specific error message / different types of error message

As robustness improvements (ROB) are typically more important for pypdf users, I'd start the PR with "ROB".

See https://pypdf.readthedocs.io/en/latest/dev/intro.html (I should probably extend those :-) )

MartinThoma · 2023-12-09T18:13:37Z

From my side this looks good to be merged 👍 Just let me know if I can change the PR title

stefan6419846 · 2023-12-09T18:36:28Z

@MartinThoma No worries, I am completely fine with you adjusting the title if required, referring to robustness instead. And adjusting the dev docs sounds like a nice idea as well to make it clearer.

Feel free to merge this PR after updating the title. If I find some time during the next week, I might provide some more test data for the two new edge cases to increase coverage in a separate PR - for now, I just did not stumble upon such files as in theory they violate the PDF standard anyway, while the whitespace stuff seems more spec-like and thus appears in the real world as well.

@pubpub-zz

## What's new

### Bug Fixes (BUG)
- Cope with deflated images with CMYK Black Only (#2322) by @pubpub-zz
- Handle indirect objects as parameters for CCITTFaxDecode (#2307) by @stefan6419846
- check words length in _cmap type1_alternative function (#2310) by @Takher

### Robustness (ROB)
- Relax flate decoding for too many lookup values (#2331) by @stefan6419846
- Let _build_destination skip in case of missing /D key (#2018) by @nickryand

### Documentation (DOC)
- Note in reading form data (#2338) by @MartinThoma
- Pull Request prefixes and size by @MartinThoma
- Add https://github.com/zuypt for #2325 as a contributor by @MartinThoma
- Fix docstring for RunLengthDecode.decode (#2302) by @stefan6419846

### Maintenance (MAINT)
- Enable `disallow_any_generics` and add missing generics (#2278) by @nilehmann

### Testing (TST)
- Centralize file downloads (#2324) by @MartinThoma

### Code Style (STY)
- Fix typo "steam" → "stream" (#2327) by @stefan6419846
- Run black by @MartinThoma
- Make Traceback in bug report template uppercase (#2304) by @stefan6419846

[Full Changelog](3.17.1...3.17.2)

As mentioned in #2331, this will improve the test coverage for the edge cases. Further refactoring was necessary as iterating over bytes will yield integers instead of single bytes and thus the whitespace check has been broken. Additionally, the whitespace check has previously always been performed on the shortened bytes data.
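The pitfall with iterating over bytes can be reproduced in isolation. This is an illustrative sketch only: `WHITESPACE_BYTES` is a hypothetical tuple, not the constant pypdf uses.

```python
# Hypothetical collection of single-byte whitespace values.
WHITESPACE_BYTES = (b" ", b"\n", b"\r", b"\t")

data = b"\n"

# Iterating over bytes yields integers (10 here), and an integer never
# compares equal to a bytes object, so this check always fails.
broken = all(value in WHITESPACE_BYTES for value in data)

# Slicing keeps bytes objects of length one, so the membership test works.
fixed = all(data[i : i + 1] in WHITESPACE_BYTES for i in range(len(data)))

print(broken, fixed)
```

The broken variant rejects valid trailing whitespace, which is why the check needed refactoring.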

BUG: Relax flate decoding for too many lookup values

7621d8e

pubpub-zz reviewed Dec 7, 2023


only accept whitespace characters as trailing ones

676abc8

stefan6419846 added 2 commits December 8, 2023 08:55

fix import sort oder

c34e9fc

add test with trailing newline

73308dc

pubpub-zz reviewed Dec 8, 2023


Merge branch 'main' into relax-flate-lookup

5d3fc66

stefan6419846 mentioned this pull request Dec 9, 2023

ENH: Tolerate PDF with invalid xref pointed objects #2335

Merged

MartinThoma mentioned this pull request Dec 9, 2023

ROB: Replace error by warning for EOD in RunLengthDecode/ASCIIHexDecode #2334

Merged

MartinThoma changed the title ~~BUG: Relax flate decoding for too many lookup values~~ ROB: Relax flate decoding for too many lookup values Dec 10, 2023

MartinThoma merged commit 6dad92a into py-pdf:main Dec 10, 2023
12 of 14 checks passed

stefan6419846 deleted the relax-flate-lookup branch December 10, 2023 11:28

stefan6419846 mentioned this pull request Dec 11, 2023

TST: Increase test coverage for flate handling of image mode 1 #2339

Merged
