Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ROB: Relax flate decoding for too many lookup values #2331

Merged
merged 5 commits into from Dec 10, 2023

Conversation

stefan6419846
Copy link
Collaborator

When handling flate objects with a lookup table and the image mode 1, we would previously raise a generic AssertionError if the number of lookup values did not match.

This PR proposes to add a more meaningful error message. Additionally, cases where too many values are specified are now considered a warning only as I could not see any real difference.

I might try to find a document version I can use for public test cases later on to at least cover the case where there are too many values.

Copy link

codecov bot commented Dec 7, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (8d3f879) 94.37% compared to head (5d3fc66) 94.32%.
Report is 1 commits behind head on main.

Files Patch % Lines
pypdf/_xobj_image_helpers.py 50.00% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2331      +/-   ##
==========================================
- Coverage   94.37%   94.32%   -0.05%     
==========================================
  Files          43       43              
  Lines        7660     7666       +6     
  Branches     1515     1518       +3     
==========================================
+ Hits         7229     7231       +2     
- Misses        267      269       +2     
- Partials      164      166       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stefan6419846
Copy link
Collaborator Author

Example: We have the lookup table b'\x00\x00\x00\xff\xff\xff\n', id est with a trailing whitespace character. This is generated by the SkyPDF Pro Driver for example.

@stefan6419846
Copy link
Collaborator Author

Example document: out1.pdf

Copy link
Collaborator

@pubpub-zz pubpub-zz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!
Now we just need to improve test coverage😉

@stefan6419846
Copy link
Collaborator Author

I might have a look at how to emulate the two remaining error cases where we do not have actual example files for now in the next days.

@MartinThoma
Copy link
Member

@stefan6419846 Please correct me if I'm wrong, but I don't think it's a bugfix-PR.

A bug in pypdf would mean that pypdf is either...

  • ... not acting according to the PDF specification, or
  • ... not acting according to the pypdf documentation (directly or indirectly, if the user could get a reasonable expectation for a specific behavior)

This PR has two components:

  • ROB: Not raising an exception but just throwing a warning in case of too many values.
  • STY: Throwing a more specific error message / different types of error message

As robustness improvements (ROB) are typically more important for pypdf users, I'd start the PR with "ROB".

See https://pypdf.readthedocs.io/en/latest/dev/intro.html (I should probably extend those :-) )

@MartinThoma
Copy link
Member

From my side this looks good to be merged 👍 Just let me know if I can change the PR title

@stefan6419846
Copy link
Collaborator Author

@MartinThoma No worries, I am completely fine with you adjusting the title if required, referrring to robustness instead. And adjusting the dev docs sounds like a nice idea as well to make it more clear.

Feel free to merge this PR after updating the title. If I find some time during the next week, I might provide some more test data for the two new edge cases to increase coverage in a separate PR - for now, I just did not stumble upon such files as in theory they violate the PDF standard anyway, while the whitespace stuff seems more spec-like and thus appears in the real world as well.

@MartinThoma MartinThoma changed the title BUG: Relax flate decoding for too many lookup values ROB: Relax flate decoding for too many lookup values Dec 10, 2023
@MartinThoma MartinThoma merged commit 6dad92a into py-pdf:main Dec 10, 2023
12 of 14 checks passed
@stefan6419846 stefan6419846 deleted the relax-flate-lookup branch December 10, 2023 11:28
MartinThoma added a commit that referenced this pull request Dec 10, 2023
## What's new

### Bug Fixes (BUG)
-  Cope with deflated images with CMYK Black Only (#2322) by @pubpub-zz
-  Handle indirect objects as parameters for CCITTFaxDecode (#2307) by @stefan6419846
-  check words length in _cmap type1_alternative function (#2310) by @Takher

### Robustness (ROB)
-  Relax flate decoding for too many lookup values (#2331) by @stefan6419846
-  Let _build_destination skip in case of missing /D key (#2018) by @nickryand

### Documentation (DOC)
-  Note in reading form data (#2338) by @MartinThoma
-  Pull Request prefixes and size by @MartinThoma
-  Add https://github.com/zuypt for #2325 as a contributor by @MartinThoma
-  Fix docstring for RunLengthDecode.decode (#2302) by @stefan6419846

### Maintenance (MAINT)
-  Enable `disallow_any_generics` and add missing generics (#2278) by @nilehmann

### Testing (TST)
-  Centralize file downloads (#2324) by @MartinThoma

### Code Style (STY)
-  Fix typo "steam" \xe2\x86\x92 "stream" (#2327) by @stefan6419846
-  Run black by @MartinThoma
-  Make Traceback in bug report template uppercase (#2304) by @stefan6419846

[Full Changelog](3.17.1...3.17.2)
MartinThoma pushed a commit that referenced this pull request Dec 18, 2023
As mentioned in #2331, this will improve the test coverage for the edge cases.

Further refactoring was necessary as iterating over bytes will yield integers instead of single bytes and thus the whitespace check has been broken.

Additionally, the whitespace check has previously always been performed on the shortened bytes data.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants