Auto-rotate feature fails for certain images #747

Closed
Balearica opened this issue Apr 30, 2023 · 3 comments

@Balearica
Collaborator

Balearica commented Apr 30, 2023

Setting rotateAuto: true generally works as expected, but for certain documents the angle is incorrectly calculated as 0 degrees. You can experiment with different images using the image processing example. An image for which the feature fails is attached below.

Notably, this is not a reason to avoid rotateAuto: the images I've encountered that fail to rotate correctly are not rotated at all, so the output is the same as the input image.
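For reference, here is a minimal sketch of enabling the option. It assumes `worker` is an already-initialized Tesseract.js worker and that `rotateAuto` is passed in the options argument of `worker.recognize`, as in the image processing example. The helper name `recognizeWithAutoRotate` is hypothetical.

```javascript
// Hypothetical helper: runs recognition with auto-rotate enabled.
// `worker` is assumed to be an initialized Tesseract.js worker and
// `image` any input that worker.recognize accepts (path, URL, buffer, ...).
async function recognizeWithAutoRotate(worker, image) {
  // rotateAuto asks the worker to estimate the page angle and un-rotate
  // the image before recognition.
  const { data } = await worker.recognize(image, { rotateAuto: true });
  return data.text;
}
```

Because a failed detection leaves the image unrotated, enabling the option this way should not make results worse than leaving it off.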

@Balearica
Collaborator Author

I added an auto-rotate benchmark to assess this feature more robustly. The benchmark rotates each of our benchmark images by 0.2, 0.1, -0.1, and -0.2 radians (0.1 radians is about 5.7 degrees) and attempts to un-rotate them using the auto-rotate option.
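The shape of such a benchmark can be sketched as follows. This is not the actual benchmark code: `rotateAndEstimate` is a hypothetical callback that rotates an image by a given angle and returns the angle auto-rotate detected, and the pass `tolerance` is an assumed value.

```javascript
// Rotation angles used by the benchmark, in radians.
const ANGLES = [0.2, 0.1, -0.1, -0.2];

// Convert radians to degrees (0.1 radians is roughly 5.7 degrees).
const toDegrees = (radians) => (radians * 180) / Math.PI;

// Hypothetical harness. `rotateAndEstimate(image, angle)` is an assumed
// async callback: it rotates `image` by `angle` radians and resolves to
// the rotation that auto-rotate detected. `tolerance` is an assumed
// threshold for counting a detection as correct.
async function runBenchmark(images, rotateAndEstimate, tolerance = 0.01) {
  const results = [];
  for (const image of images) {
    for (const angle of ANGLES) {
      const detected = await rotateAndEstimate(image, angle);
      const pass = Math.abs(detected - angle) <= tolerance;
      results.push({ image, angle, detected, pass });
    }
  }
  return results;
}

// Tally passes by absolute rotation, keyed as '0.1', '0.2', ...
function summarize(results) {
  const summary = {};
  for (const { angle, pass } of results) {
    const key = Math.abs(angle).toFixed(1);
    summary[key] = summary[key] || { passed: 0, total: 0 };
    summary[key].total += 1;
    if (pass) summary[key].passed += 1;
  }
  return summary;
}
```

Grouping by absolute angle matches how the results are reported in this thread, where +/- 0.1 and +/- 0.2 radians are each counted together.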

At the time of writing (version 4.0.3), auto-rotate worked correctly for 4 of 6 images rotated +/- 0.1 radians and 0 of 6 images rotated +/- 0.2 radians. This indicates that the feature struggles with larger angles. Notably, no rotation was applied to the images that were not correctly rotated, so none of the images got worse by using the rotateAuto option.

Balearica pushed a commit that referenced this issue Apr 30, 2023
@Balearica
Collaborator Author

I updated Tesseract.js and Tesseract.js-core to improve auto-rotate. Previously, auto-rotate worked correctly for 4 of 6 images rotated +/- 0.1 radians and 0 of 6 images rotated +/- 0.2 radians. After the change, it worked correctly for 6 of 6 images rotated +/- 0.1 radians and 3 of 6 images rotated +/- 0.2 radians.

The change was to switch from Tesseract's "reskew angle" to its "gradient". Both are estimates of the rotation of the page, but the gradient is calculated later in the pipeline using more information (it is derived from individual lines of text) and appears to be significantly better.

I do not believe further improvements are possible using only statistics already calculated by Tesseract. In the images where the auto-rotation fails, it looks like the root cause is that Tesseract does not detect text lines to begin with.
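To illustrate why a line-based estimate depends on text lines being detected, here is a rough sketch of deriving a page angle from per-line slopes. This is not Tesseract's implementation: the `{ x0, y0, x1, y1 }` baseline shape, the use of the median, and the function name `estimatePageAngle` are all assumptions made for the example.

```javascript
// Illustrative only, not Tesseract's algorithm. Each entry in `lines` is
// an assumed shape { x0, y0, x1, y1 }: the two endpoints of a detected
// text-line baseline, in pixels.
function estimatePageAngle(lines) {
  // Slope (dy/dx) of every non-vertical baseline, sorted ascending.
  const slopes = lines
    .filter((l) => l.x1 !== l.x0)
    .map((l) => (l.y1 - l.y0) / (l.x1 - l.x0))
    .sort((a, b) => a - b);

  // With no detected lines there is nothing to estimate, so report 0.
  if (slopes.length === 0) return 0;

  // The median keeps a few badly fitted lines from skewing the estimate.
  const mid = Math.floor(slopes.length / 2);
  const median = slopes.length % 2 === 1
    ? slopes[mid]
    : (slopes[mid - 1] + slopes[mid]) / 2;

  // Convert the slope to an angle in radians. The sign follows the
  // coordinate system of the input points.
  return Math.atan(median);
}
```

In this sketch an empty `lines` array yields an angle of 0, which is consistent with the behaviour described in this thread: when no text lines are found, no rotation is applied.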

@Balearica
Collaborator Author

This update has been included in the 4.0.4 release. Closing.
