Add advanced text run filter for text extraction #545

Merged Jan 13, 2025 (2 commits)

Conversation

jbockler
Contributor

We are currently stuck on v2.7.0, as later versions break text extraction for us. I finally had some time to investigate this issue.

Our use case involves using the reader to validate generated shipping labels in our tests. These PDFs contain a large "MUSTER" (English: "SAMPLE") watermark in the background, which is actually a text element. This watermark appears to confuse PageLayout, which relies on the mean font size to calculate the rows.

As a result:

  • The calculated line size is too large, causing some lines to be overwritten by subsequent ones.
  • The "MUSTER" watermark appears in the extracted text, where it doesn't belong.
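
A rough illustration of the first point, under a simplified model: assume the layout derives its row count from the page height divided by the mean font size, then maps each run's y coordinate onto that grid. The helper names (`mean`, `row_for`), the page height, and the font sizes are made up for illustration; this is not PageLayout's actual code.

```ruby
# Simplified, hypothetical model of a row grid driven by mean font size.
# Not pdf-reader's PageLayout; names and numbers are illustrative.
PAGE_HEIGHT = 800.0

def mean(values)
  values.sum.to_f / values.size
end

# Map a y coordinate to a row index on a grid whose row count
# is derived from the mean font size of all runs on the page.
def row_for(y, font_sizes)
  row_count = (PAGE_HEIGHT / mean(font_sizes)).floor
  row_count - (y * row_count / PAGE_HEIGHT).round
end

body_sizes     = [10] * 9           # nine runs of 10pt label text
with_watermark = body_sizes + [120] # plus one 120pt watermark run

# Two adjacent label lines, 12pt apart.
line_a_y = 700
line_b_y = 688

rows_without = [row_for(line_a_y, body_sizes), row_for(line_b_y, body_sizes)]
rows_with    = [row_for(line_a_y, with_watermark), row_for(line_b_y, with_watermark)]

puts "without watermark: rows #{rows_without.inspect}"
puts "with watermark:    rows #{rows_with.inspect}"
```

In this model the single 120pt run lifts the mean from 10 to 21, the grid shrinks from 80 rows to 38, and the two lines that previously had separate rows now share one, so the later line would overwrite the earlier one.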

[Image: example PDF]

The Fix

I implemented a filter to exclude unwanted text during extraction. The filter already supports complex conditions (or, and) and additional operators, which should minimize the risk of future breaking changes and allow more users to customize their parsing.

With this fix, the text can now be read correctly using the following syntax:

reader = PDF::Reader.new(...)
page = reader.pages[0]
page.text(exclude: {text: {include: "MUSTER"}})
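
To make the idea of nested conditions concrete, here is a minimal, self-contained sketch of how an exclude filter with `and`/`or` could evaluate a conditions hash. It is not the code added in this PR: the `Run` struct, `matches?`, `exclude_runs`, and every operator name other than `include` are assumptions made for illustration.

```ruby
# Hypothetical stand-in for a text run; not pdf-reader's own class.
Run = Struct.new(:text, :font_size, keyword_init: true)

# Illustrative operator table. Only `include` appears in the PR text;
# the other operator names are assumptions.
OPERATORS = {
  include:      ->(actual, expected) { actual.to_s.include?(expected.to_s) },
  equal:        ->(actual, expected) { actual == expected },
  greater_than: ->(actual, expected) { actual > expected }
}.freeze

# True when the run satisfies every entry in the conditions hash.
# `and:` / `or:` keys take an array of nested condition hashes.
def matches?(run, conditions)
  conditions.all? do |key, value|
    case key
    when :and then value.all? { |nested| matches?(run, nested) }
    when :or  then value.any? { |nested| matches?(run, nested) }
    else
      actual = run.public_send(key)
      value.all? { |op, expected| OPERATORS.fetch(op).call(actual, expected) }
    end
  end
end

# Drop every run that matches the conditions.
def exclude_runs(runs, conditions)
  runs.reject { |run| matches?(run, conditions) }
end

runs = [
  Run.new(text: "MUSTER",       font_size: 120),
  Run.new(text: "Jane Doe",     font_size: 10),
  Run.new(text: "12345 Berlin", font_size: 10)
]

simple = exclude_runs(runs, { text: { include: "MUSTER" } })
nested = exclude_runs(runs, { or: [
  { text: { include: "MUSTER" } },
  { font_size: { greater_than: 50 } }
] })

puts simple.map(&:text).inspect
puts nested.map(&:text).inspect
```

Both calls drop only the watermark run here. Filtering before layout, rather than stripping the word from the final string, matters because the oversized run would otherwise still skew the row calculation.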

@yob
Owner

yob commented Jan 13, 2025

Thanks! I'm open to this, it seems like a pragmatic addition to the API.

tbh I'm not thrilled with the layout logic in PageLayout. The mean font size approach works well enough in the common case, but it has many edge cases. I've dreamed of overhauling it, but real life keeps getting in the way. This would be a tangible difference for users now.

Interesting that the sorbet tests fail in CI. It's complaining about missing types for the constant, but I see you have the constant typed in the rbi file. I think I've run into that limitation of using external rbi files before, but now that I've merged #546 I wonder if rebasing this onto main will help?

It's frustrating that sorbet doesn't have better changelogs!

Could we also get a couple of basic specs in spec/integration_spec.rb? With existing PDFs in the test corpus, or one you create specifically. Either is fine - just something that executes a couple of happy paths of the new API and will help prevent regressions.

@jbockler
Contributor Author

Thanks for the feedback and the quick response!

The layout logic seems like a non-trivial problem. I’d be really interested to see what you come up with if you ever get the chance to revisit it. Hopefully this filter helps for some edge cases in the meantime.

> Interesting that the sorbet tests fail in CI...

I can't reproduce the sorbet error locally, but I'm already using the newer version, so I think rebasing should fix this.

> Could we also get a couple of basic specs in spec/integration_spec.rb?

Absolutely! I'll add some integration tests.

Some special PDFs (e.g. watermark as text in the background)
seem to break the text rendering of PageLayout, because it relies
on the mean font size and median glyph width. This can lead to
silently overwriting text that belongs to another row.

To prevent this, we add an option to select which text should be
extracted. With it you can exclude text that is not relevant to
the parsing.
@yob yob merged commit f470197 into yob:main Jan 13, 2025
1 check passed
@yob
Owner

yob commented Jan 13, 2025

Thanks!

I don't plan to release this immediately; I'm working with the rubygems folks to fix publishing gems from Buildkite using OIDC tokens, and this gem is my testbed. If you want to use this immediately, I'd suggest loading it via git 🙏
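
For anyone who wants the filter before a release, a Gemfile entry along these lines is one way to load the gem from git with Bundler. The repository URL is inferred from the yob/pdf-reader references in this thread, and pinning to the merge commit f470197 is a suggestion, not something the maintainer specified.

```ruby
# Gemfile (sketch): load pdf-reader straight from the repository.
# Pinning to this PR's merge commit is an assumption; a branch
# (e.g. branch: "main") would also work but moves over time.
source "https://rubygems.org"

gem "pdf-reader", git: "https://github.com/yob/pdf-reader.git", ref: "f470197"
```

Pinning a `ref` keeps test runs reproducible; once a release containing the change is published, the entry can go back to a plain version constraint.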

@olivier-thatch olivier-thatch mentioned this pull request Feb 3, 2025
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Feb 9, 2025
2.14.0 (2025-01-29)

* Raise minimum supported ruby to 2.1
  (yob/pdf-reader#543)

* Add support for filtering to Page#text
  (yob/pdf-reader#545)

2.14.1 (2025-02-04)

* Fix issue in RBI signatures, introduced in
  v2.14.0 (yob/pdf-reader#550)

2 participants