Wrong Orientation of letter #741

Closed
muhmuhhum opened this issue Dec 14, 2023 · 7 comments
Labels
document-reading Related to reading documents

Comments

@muhmuhhum

I have the following problem: I have a 2D technical drawing where text is written in every direction.
The GetWords() method with NearestNeighbourWordExtractor works fine for me except for this example.
[image]
The image shows part of the PDF: light blue is the word box, dark blue is the letter box, and green is the Location.
My problem is that the letter has the TextOrientation Horizontal, which leads to a wrongly drawn text box for it, and may also be why the 7 and the 9 cannot find each other with nearest neighbour.
I have tried to create a PDF that has the same problem, but I couldn't get it to work.
Because of an NDA I can't share the file, but maybe you could point me in the right direction to find the problem and a solution for it.

@BobLd
Collaborator

BobLd commented Dec 14, 2023

@muhmuhhum thanks for opening the issue. I understand you can't share the PDF document, but would you mind sharing the code you're using to draw the bounding box?

Also, is all the text always drawn as horizontal text? For example, do you also have the issue with the "R 12,5" text?

@muhmuhhum
Author

muhmuhhum commented Dec 14, 2023

@BobLd Thanks for the quick response. It is wrong for some of the other letters in the document too, but for nearly all letters the TextOrientation is correct. I already found that Location.EndLine and Location.StartLine are at the same point for the letters with the wrong TextOrientation.
Here for the 7:
[image]

And the 2 (of the 25):
[image]

Here is a bigger cutout with more words marked:
[image]

For the drawing I have to change some values, because Skia uses the top left as origin, and I have to calculate the new position at 300 dpi:

var wordBox = GetRotatedRect(word.BoundingBox);
canvas.DrawRect(
    (float)(wordBox.blX / 72 * 300),
    (float)((page.Height - wordBox.blY - wordBox.height) / 72 * 300),
    (float)(wordBox.width / 72 * 300),
    (float)(wordBox.height / 72 * 300),
    wordPaint);

And GetRotatedRect:

(double blX, double blY, double width, double height) GetRotatedRect(PdfRectangle boundingBox)
{
    var xPoints = new List<double>
    {
        boundingBox.BottomLeft.X,
        boundingBox.TopLeft.X,
        boundingBox.TopRight.X,
        boundingBox.BottomRight.X
    };
    var yPoints = new List<double>
    {
        boundingBox.BottomLeft.Y,
        boundingBox.TopLeft.Y,
        boundingBox.TopRight.Y,
        boundingBox.BottomRight.Y
    };

    // Axis-aligned box that encloses all four (possibly rotated) corners.
    return (xPoints.Min(), yPoints.Min(), xPoints.Max() - xPoints.Min(), yPoints.Max() - yPoints.Min());
}

@BobLd
Collaborator

BobLd commented Dec 14, 2023

> I already found that Location.EndLine and Location.StartLine are at the same point for the letters with the wrong TextOrientation.

Thanks for that, that's very useful. I'll get back to you on that shortly.

Regarding the rendering with Skia, you do indeed need to invert the Y axis. I think one thing that causes your drawn bounding boxes to always be horizontal is that you use canvas.DrawRect(), which I think always draws axis-aligned rectangles.

Could you use the canvas.DrawPath() method instead? You can use the method below:

using (var rect = new SKPath())
{
	rect.MoveTo((float)transformedPdfBounds.BottomLeft.X, (float)transformedPdfBounds.BottomLeft.Y);
	rect.LineTo((float)transformedPdfBounds.TopLeft.X, (float)transformedPdfBounds.TopLeft.Y);
	rect.LineTo((float)transformedPdfBounds.TopRight.X, (float)transformedPdfBounds.TopRight.Y);
	rect.LineTo((float)transformedPdfBounds.BottomRight.X, (float)transformedPdfBounds.BottomRight.Y);
	rect.Close();
	_canvas.DrawPath(rect, new SKPaint() { Color = SKColors.Black, Style = SKPaintStyle.Stroke });
}

where transformedPdfBounds is your PdfRectangle boundingBox, with top left as origin (ready for Skia).
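
For reference, a minimal sketch of how transformedPdfBounds could be computed. It assumes the page has no rotation and no crop-box offset, and reuses the 72-points-per-inch to 300 dpi scaling from the snippet above. Pt, Quad and PdfToImage are hypothetical stand-ins (not PdfPig types) so the sketch compiles on its own:

```csharp
using System;

// Hypothetical stand-ins for a point and a four-corner rectangle (not PdfPig types).
public readonly record struct Pt(double X, double Y);
public readonly record struct Quad(Pt BottomLeft, Pt TopLeft, Pt TopRight, Pt BottomRight);

public static class PdfToImage
{
    // PDF user space: origin at the bottom left, 72 units per inch.
    // Image space: origin at the top left, `dpi` pixels per inch.
    public static Pt ToImage(Pt p, double pageHeight, double dpi = 300)
    {
        double scale = dpi / 72.0;
        return new Pt(p.X * scale, (pageHeight - p.Y) * scale);
    }

    // Transform every corner on its own so that a rotated box stays rotated.
    public static Quad ToImage(Quad q, double pageHeight, double dpi = 300) =>
        new Quad(
            ToImage(q.BottomLeft, pageHeight, dpi),
            ToImage(q.TopLeft, pageHeight, dpi),
            ToImage(q.TopRight, pageHeight, dpi),
            ToImage(q.BottomRight, pageHeight, dpi));
}
```

The four transformed corners can then be passed to MoveTo/LineTo as in the path above. Because each corner is mapped separately, the box keeps its rotation instead of being flattened to an axis-aligned rectangle.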

@muhmuhhum
Author

Oh, it is intended that the bounding boxes are always horizontal. Sorry, I missed this question in your original answer. That is what GetRotatedRect is for: to get the horizontal box around the word. Sorry if that caused some confusion.

@muhmuhhum
Author

OK, after some research I think I found the problem. The PDF has fonts with widths of 0, which leads to some weird behavior.

@BobLd
Collaborator

BobLd commented Dec 15, 2023

@muhmuhhum sounds good, thanks a lot for that. The code that computes the text orientation is here:
https://github.com/UglyToad/PdfPig/blob/4537ec3f02c9f1f12e17e3a2e03f411c41d027de/src/UglyToad.PdfPig/Content/Letter.cs#L139C1-L162C10

If you want, you can try to fix it, I'll try to have a look on my side.
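
To illustrate why a zero width can end up as Horizontal, here is a simplified sketch of a baseline-based orientation check. It is not a copy of the linked code: the names Orientation and OrientationSketch, the tolerance value, and the mapping of the two vertical cases are assumptions for illustration only.

```csharp
using System;

// Hypothetical enum for this sketch; Rotate90 here means the baseline points downwards.
public enum Orientation { Horizontal, Rotate180, Rotate90, Rotate270, Other }

public static class OrientationSketch
{
    // Classifies the direction from the start to the end of a glyph's baseline.
    public static Orientation FromBaseline(double startX, double startY, double endX, double endY)
    {
        const double eps = 1e-4; // assumed tolerance

        if (Math.Abs(startY - endY) < eps)
        {
            // Same Y: the baseline runs left or right.
            // A collapsed baseline (start == end) also lands here and is reported as Horizontal.
            return startX > endX ? Orientation.Rotate180 : Orientation.Horizontal;
        }

        if (Math.Abs(startX - endX) < eps)
        {
            // Same X: the baseline runs down or up.
            return startY > endY ? Orientation.Rotate90 : Orientation.Rotate270;
        }

        return Orientation.Other;
    }
}
```

With a glyph width of 0 the start and end of the baseline coincide, so a check of this shape cannot see the real direction and falls into the first branch, which matches the Horizontal result reported above.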

In the meantime, you can try using the NearestNeighbourWordExtractor while ignoring the TextOrientation, as follows:

var options = new NearestNeighbourWordExtractor.NearestNeighbourWordExtractorOptions()
{
	GroupByOrientation = false
};

var nnWordExtractor = new NearestNeighbourWordExtractor(options);

Let me know if that helps.

@muhmuhhum
Author

@BobLd Sorry for the late answer. My current workaround is this: when I extract the words, I check for letters where letter.StartBaseLine is the same point as letter.EndBaseLine, replace those two points with the BottomLeft and BottomRight of the glyph box, and set the TextOrientation based on the Rotation of the GlyphRectangle. This may ignore possible extra width for the letters, but I haven't found a better solution. Now I just wonder how programs like Adobe Acrobat can draw this PDF: as far as I understand, a character with a width of 0 should be drawn as such, but it is displayed normally, just like every other character.
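
The workaround described above could be sketched roughly like this. It is only an outline under assumptions: P and ZeroWidthFallback are hypothetical names, the points are plain coordinates rather than PdfPig types, and the tolerance is chosen arbitrarily.

```csharp
using System;

// Hypothetical point type so the sketch is self-contained (not a PdfPig type).
public readonly record struct P(double X, double Y);

public static class ZeroWidthFallback
{
    // True when the baseline has collapsed to a single point (zero advance width).
    public static bool IsDegenerate(P start, P end, double eps = 1e-6) =>
        Math.Abs(start.X - end.X) < eps && Math.Abs(start.Y - end.Y) < eps;

    // Keeps the original baseline when it is usable, otherwise falls back to
    // the bottom edge of the glyph box (bottom left to bottom right).
    public static (P Start, P End) EffectiveBaseline(
        P start, P end, P glyphBottomLeft, P glyphBottomRight) =>
        IsDegenerate(start, end) ? (glyphBottomLeft, glyphBottomRight) : (start, end);

    // Direction of a baseline in degrees, counter-clockwise from the positive X axis.
    public static double AngleDegrees(P start, P end) =>
        Math.Atan2(end.Y - start.Y, end.X - start.X) * 180.0 / Math.PI;
}
```

For a letter whose baseline start and end coincide, EffectiveBaseline returns the bottom edge of the glyph box, and AngleDegrees on that edge gives a direction from which an orientation can be chosen. As noted above, the glyph box may be narrower than the real advance, so any extra width is lost.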
