Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Infinite recursion when using PdfWriter(clone_from=reader) #2264

Merged
merged 1 commit into from Oct 29, 2023

Conversation

Alexhuszagh
Copy link
Contributor

@Alexhuszagh Alexhuszagh commented Oct 23, 2023

Use a visited memo to check if the current object in the clone operation has already been visited, and if so, do not add it to the list of objects. This avoids infinite recursion in case there are references to identical objects inside a PDF.

Unfortunately, since the example PDF contains financial data from a client I'm not free to provide the failing example that caused this PR. If you have a suggestion on how to remove that data while keeping the references identical, I'd be more than happy to contribute that. I hope the implementation is logical enough so this isn't needed.

The code that fails is as simple as:

from pypdf import PdfReader, PdfWriter
reader = PdfReader('/path/to/sensitive/file.pdf')
writer = PdfWriter('myfilename.pdf', clone_from=reader)

Use a visited memo to check if the current object in the clone operation
has already been visited, and if so, do not add it to the list of
objects. This avoids infinite recursion in case there are links to
identical objects inside a PDF.
@Alexhuszagh
Copy link
Contributor Author

Alexhuszagh commented Oct 23, 2023

I didn't apply the mypy lint to filters.py but would be more than happy to do so. My only major question would be then do I need to check if the PDFs are identical for the id(pdf) inside (so rather than just the generation and idnum, but also the PDF id)?

@codecov
Copy link

codecov bot commented Oct 23, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (9047079) 94.44% compared to head (e275585) 94.43%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2264      +/-   ##
==========================================
- Coverage   94.44%   94.43%   -0.01%     
==========================================
  Files          43       43              
  Lines        7634     7643       +9     
  Branches     1506     1508       +2     
==========================================
+ Hits         7210     7218       +8     
  Misses        262      262              
- Partials      162      163       +1     
Files Coverage Δ
pypdf/generic/_data_structures.py 91.90% <92.30%> (-0.04%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pubpub-zz
Copy link
Collaborator

Unfortunately, since the example PDF contains financial data from a client I'm not free to provide the failing example that caused this PR. If you have a suggestion on how to remove that data while keeping the references identical, I'd be more than happy to contribute that. I hope the implementation is logical enough so this isn't needed.

@Alexhuszagh
Do you accept to wipe out text with .remove_text() and then share the file in private by mail with @MartinThoma ?

@Alexhuszagh
Copy link
Contributor Author

Alexhuszagh commented Oct 23, 2023

Unfortunately, since the example PDF contains financial data from a client I'm not free to provide the failing example that caused this PR. If you have a suggestion on how to remove that data while keeping the references identical, I'd be more than happy to contribute that. I hope the implementation is logical enough so this isn't needed.

@Alexhuszagh Do you accept to wipe out text with .remove_text() and then share the file in private by mail with @MartinThoma ?

I just got approval to send the clean version over from the client and just sent it over.

@MartinThoma MartinThoma changed the title Fix an issue with infinite recursion. BUG: infinite recursion Oct 28, 2023
@MartinThoma MartinThoma changed the title BUG: infinite recursion BUG: Infinite recursion when using PdfWriter(clone_from=reader) Oct 29, 2023
@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF nf-security Non-functional change: Security labels Oct 29, 2023
@MartinThoma
Copy link
Member

MartinThoma commented Oct 29, 2023

With the PDF you provided I can confirm that there is an infinite loop causing 100% CPU utilization for the executing core. Hence I added the nf-security Non-functional change: Security label

^CTraceback (most recent call last):
  File "/home/moose/Downloads/pypdf-issue.py", line 4, in <module>
    reader = PdfWriter("issuepypdf.pdf", clone_from="pypdf-clean.pdf")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_writer.py", line 207, in __init__
    self.clone_document_from_reader(clone_from)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_writer.py", line 1191, in clone_document_from_reader
    self.clone_reader_document_root(reader)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_writer.py", line 1101, in clone_reader_document_root
    self._root_object = cast(DictionaryObject, reader.trailer[TK.ROOT].clone(self))
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 197, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 297, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_base.py", line 300, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 197, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 297, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_base.py", line 300, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 197, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 285, in _clone
    c._clone(s, pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 269, in _clone
    cur_obj._reference_clone(
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_base.py", line 129, in _reference_clone
    if id(ind.pdf) not in pdf_dest._id_translated:
       ^^^^^^^^^^^
KeyboardInterrupt

Meta-information about the problematic file

It's not completely broken:

$ qpdf --check pypdf-issue-2264.pdf
checking pypdf-issue-2264.pdf
PDF Version: 1.3
File is not encrypted
File is not linearized
No syntax or stream encoding errors found; the file may still contain
errors that qpdf cannot detect

I can also just open it fine in chrome / evince.

@MartinThoma MartinThoma merged commit 9b23ac3 into py-pdf:main Oct 29, 2023
11 of 13 checks passed
@MartinThoma
Copy link
Member

Thank you for your contribution 🙏 I will make a release in a few minutes that contains your fix. Additionally, I will also add a security advisory to update.

What a great first PR! If you want, I'll add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

MartinThoma added a commit that referenced this pull request Oct 29, 2023
## What's new

### Security (SEC)
-  Infinite recursion when using PdfWriter(clone_from=reader) (#2264) by @Alexhuszagh

### New Features (ENH)
-  Add parameter to select images to be removed (#2214) by @pubpub-zz

### Bug Fixes (BUG)
-  Correctly handle image mode 1 with FlateDecode (#2249) by @stefan6419846
-  Error when filling a value with parentheses #2268 (#2269) by @KanorUbu
-  Handle empty root outline (#2239) by @pubpub-zz

### Documentation (DOC)
-  Improve merging docs (#2247) by @stefan6419846

### Developer Experience (DEV)
-  Test Python 3.7 with cryptopgraphy provider as well (#2276) by @stefan6419846
-  Run CI with windows-latest (#2258) by @MartinThoma
-  Use pytest-xdist (#2254) by @MartinThoma
-  Attribute correct authors in the release notes (#2246) by @stefan6419846

### Maintenance (MAINT)
-  Apply pre-commit hooks (#2277) by @MartinThoma
-  Update requirements + mypy fixes (#2275) by @MartinThoma
-  Explicitly provide Any for IO generic argument (#2272) by @nilehmann

### Testing (TST)
-  Fix test_image_without_pillow in windows environment (#2257) by @pubpub-zz

### Code Style (STY)
-  Remove unused import by @MartinThoma

[Full Changelog](3.16.4...3.17.0)
@Alexhuszagh
Copy link
Contributor Author

Thank you for your contribution 🙏 I will make a release in a few minutes that contains your fix. Additionally, I will also add a security advisory to update.

What a great first PR! If you want, I'll add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

That would be wonderful, thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF nf-security Non-functional change: Security
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants