Add advanced text run filter for text extraction #545

jbockler · 2025-01-12T12:39:47Z

We are currently stuck on v2.7.0, as later versions break text extraction for us. I finally had some time to investigate this issue.

Our use case involves using the reader to validate generated shipping labels in our tests. These PDFs contain a large "MUSTER" (English: "SAMPLE") watermark in the background, which is actually a text element. This watermark appears to confuse PageLayout, which relies on the mean font size to calculate the rows.

As a result:

The calculated line size is too large, causing some lines to be overwritten by subsequent ones.
The "MUSTER" watermark appears in the extracted text, where it doesn't belong.
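
To illustrate why a single oversized text run can distort the row calculation, here is a minimal sketch. It is not PageLayout's actual code: the `Run` struct, the sample positions and font sizes, and the `row_index` helper are assumptions made up for this example. It only shows how a mean-based line height reacts to one outlier.

```ruby
# Hypothetical stand-in for a text run; this is not the library's class.
Run = Struct.new(:text, :y, :font_size)

# Mean font size across all runs, used here as the line height.
def mean_font_size(runs)
  runs.sum(&:font_size) / runs.size.to_f
end

# Assign a run to a row by dividing its y position by the line height.
def row_index(run, line_height)
  (run.y / line_height).floor
end

label = [
  Run.new("Name",   100.0, 10.0),
  Run.new("Street",  88.0, 10.0),
  Run.new("City",    76.0, 10.0),
]
watermark = Run.new("MUSTER", 90.0, 120.0)

clean_height  = mean_font_size(label)               # only the 10pt runs
skewed_height = mean_font_size(label + [watermark]) # pulled up by the 120pt run

clean_rows  = label.map { |run| row_index(run, clean_height) }
skewed_rows = label.map { |run| row_index(run, skewed_height) }

puts "line height without watermark: #{clean_height}, rows: #{clean_rows.inspect}"
puts "line height with watermark:    #{skewed_height}, rows: #{skewed_rows.inspect}"
```

In this toy data the three label lines land on three distinct rows when the watermark is ignored, but collapse onto a single row once the 120pt run is included in the mean, which is the kind of overwriting described above.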

Example PDF as image

The Fix

I implemented a filter to exclude unwanted text during extraction. The filter already supports complex conditions (or, and) and additional operators, which should minimize the risk of future breaking changes and allow more users to customize their parsing.

With this fix, the text can now be read correctly using the following syntax:

reader = PDF::Reader.new(...)
page = reader.pages[0]
page.text(exclude: {text: {include: "MUSTER"}})
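
As a rough illustration of how a nested condition hash like this could be evaluated, here is a self-contained sketch. It does not reproduce the implementation in this PR: the `TextRun` struct, the `OPERATORS` table, and the `matches?` and `exclude_runs` helpers are hypothetical, and of the operators shown only `include` appears in the description above.

```ruby
# Hypothetical stand-in for a text run; this is not the library's class.
TextRun = Struct.new(:text, :font_size)

# Operators assumed for this sketch. Only `include` is taken from the PR text.
OPERATORS = {
  include:      ->(actual, expected) { actual.to_s.include?(expected.to_s) },
  equal:        ->(actual, expected) { actual == expected },
  greater_than: ->(actual, expected) { actual > expected },
}.freeze

# Returns true when `run` satisfies `conditions`.
# `conditions` is either {attribute => {operator => value}} or a nested
# {or: [conditions, ...]} / {and: [conditions, ...]} hash.
def matches?(run, conditions)
  conditions.all? do |key, value|
    case key
    when :or  then value.any? { |nested| matches?(run, nested) }
    when :and then value.all? { |nested| matches?(run, nested) }
    else
      value.all? do |operator, expected|
        OPERATORS.fetch(operator).call(run.public_send(key), expected)
      end
    end
  end
end

# Drops every run that matches the exclusion conditions.
def exclude_runs(runs, conditions)
  runs.reject { |run| matches?(run, conditions) }
end

runs = [
  TextRun.new("MUSTER", 120.0),
  TextRun.new("Max Mustermann", 10.0),
  TextRun.new("Hamburg", 10.0),
]

# Same shape as the `exclude:` hash in the example above.
kept = exclude_runs(runs, { text: { include: "MUSTER" } })

# A combined condition: exclude runs that are very large OR equal to "Hamburg".
combined = exclude_runs(runs, { or: [
  { font_size: { greater_than: 100 } },
  { text: { equal: "Hamburg" } },
] })

puts kept.map(&:text).inspect
puts combined.map(&:text).inspect
```

Note that the `include` check in this sketch is case-sensitive, so "Max Mustermann" is kept while the "MUSTER" watermark is dropped.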

yob · 2025-01-13T00:50:38Z

Thanks! I'm open to this, it seems like a pragmatic addition to the API.

tbh I'm not thrilled with the layout logic in PageLayout. The mean font size approach works well enough in the common case, but it has many edge cases. I've dreamed of overhauling it, but real life keeps getting in the way. This would be a tangible difference for users now.

Interesting that the sorbet tests fail in CI. It's complaining about missing types for the constant, but I see you have the constant typed in the rbi file. I think I've run into that limitation of using external rbi files before, but now that I've merged #546 I wonder if rebasing this onto main will help?

It's frustrating that sorbet doesn't have better changelogs!

Could we also get a couple of basic specs in spec/integration_spec.rb? With existing PDFs in the test corpus, or one you create specifically. Either is fine - just something that executes a couple of happy paths of the new API and will help prevent regressions.

jbockler · 2025-01-13T16:31:33Z

Thanks for the feedback and the quick response!

The layout logic seems like a non-trivial problem. I’d be really interested to see what you come up with if you ever get the chance to revisit it. Hopefully this filter helps for some edge cases in the meantime.

Interesting that the sorbet tests fail in CI...

I can't reproduce the sorbet error locally, but I'm already using the newer version, so I think rebasing should fix this.

Could we also get a couple of basic specs in spec/integration_spec.rb?

Absolutely! I'll add some integration tests.

Some special PDFs (e.g. with a watermark rendered as text in the background) seem to break the text rendering of PageLayout, because it relies on the mean font size and median glyph width. This can silently overwrite text that belongs to another row. To prevent this, we add an option to select which text should be extracted, so you can exclude text that is not relevant for parsing.

yob · 2025-01-13T21:49:27Z

Thanks!

I don't plan to release this immediately; I'm working with the rubygems folks to fix publishing gems from Buildkite using OIDC tokens, and this gem is my testbed. If you want to use this immediately, I'd suggest loading it via git 🙏
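
For anyone following that suggestion, a Gemfile entry along these lines should work with Bundler's `git:` option. This is a sketch rather than an official instruction: the repository URL is inferred from the `yob/pdf-reader` references in this thread, and pinning `ref` to the merge commit shown on this PR is optional (the abbreviated hash may need to be replaced with the full SHA).

```ruby
# Gemfile
source "https://rubygems.org"

# Load pdf-reader straight from the repository until a release includes the filter.
# The `ref` below is the merge commit listed on this PR; drop it to track the default branch.
gem "pdf-reader", git: "https://github.com/yob/pdf-reader.git", ref: "f470197"
```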

2.14.0 (2025-01-29)
* Raise minimum supported ruby to 2.1 (yob/pdf-reader#543)
* Add support for filtering to Page#text (yob/pdf-reader#545)

2.14.1 (2025-02-04)
* Fix issue in RBI signatures, introduced in v2.14.0 (yob/pdf-reader#550)

jbockler mentioned this pull request Jan 12, 2025

Update sorbet for ARM support #546

Merged

jbockler force-pushed the advanced-filter branch from b4a7cbc to d8fb4c2 on January 13, 2025 16:33

add integration tests

f3896db

yob merged commit f470197 into yob:main Jan 13, 2025
1 check passed

olivier-thatch mentioned this pull request Feb 3, 2025

Fix RBI signatures #550

Merged
