TST: Newlines in text extraction #807

MartinThoma · 2022-04-23T12:48:47Z

No description provided.

A change I would like to highlight is the performance improvement for large PDF files (#808) 🎉 New Features (ENH): - Add papersizes (#800) - Allow setting permission flags when encrypting (#803) - Allow setting form field flags (#802) Bug Fixes (BUG): - TypeError in xmp._converter_date (#813) - Improve spacing for text extraction (#806) - Fix PDFDocEncoding Character Set (#809) Robustness (ROB): - Use null ID when encrypted but no ID given (#812) - Handle recursion error (#804) Documentation (DOC): - CMaps (#811) - The PDF Format + commit prefixes (#810) - Add compression example (#792) Developer Experience (DEV): - Add Benchmark for Performance Testing (#781) Maintenance (MAINT): - Validate PDF magic byte in strict mode (#814) - Make PdfFileMerger.addBookmark() behave life PdfFileWriters\' (#339) - Quadratic runtime while parsing reduced to linear (#808) Testing (TST): - Newlines in text extraction (#807) Full Changelog: 1.27.8...1.27.9

A change I would like to highlight is the performance improvement for large PDF files (py-pdf#808) 🎉 New Features (ENH): - Add papersizes (py-pdf#800) - Allow setting permission flags when encrypting (py-pdf#803) - Allow setting form field flags (py-pdf#802) Bug Fixes (BUG): - TypeError in xmp._converter_date (py-pdf#813) - Improve spacing for text extraction (py-pdf#806) - Fix PDFDocEncoding Character Set (py-pdf#809) Robustness (ROB): - Use null ID when encrypted but no ID given (py-pdf#812) - Handle recursion error (py-pdf#804) Documentation (DOC): - CMaps (py-pdf#811) - The PDF Format + commit prefixes (py-pdf#810) - Add compression example (py-pdf#792) Developer Experience (DEV): - Add Benchmark for Performance Testing (py-pdf#781) Maintenance (MAINT): - Validate PDF magic byte in strict mode (py-pdf#814) - Make PdfFileMerger.addBookmark() behave life PdfFileWriters\' (py-pdf#339) - Quadratic runtime while parsing reduced to linear (py-pdf#808) Testing (TST): - Newlines in text extraction (py-pdf#807) Full Changelog: py-pdf/pypdf@1.27.8...1.27.9

TST: Newlines in text extraction

5cf1ac7

MartinThoma merged commit 9941099 into main Apr 23, 2022

MartinThoma deleted the tst-text-extract branch April 23, 2022 12:50

VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022

TST: Newlines in text extraction (py-pdf#807)

bcf62e0

MartinThoma added the whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. label Jan 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TST: Newlines in text extraction #807

TST: Newlines in text extraction #807

MartinThoma commented Apr 23, 2022

TST: Newlines in text extraction #807

TST: Newlines in text extraction #807

Conversation

MartinThoma commented Apr 23, 2022