Blog: UTF-8 encoding problems on Windows #6917

Closed
4 tasks done
avanbremen opened this issue Mar 15, 2024 · 13 comments
Labels: bug (Issue reports a bug), resolved (Issue is resolved, yet unreleased if open)

Comments

@avanbremen

avanbremen commented Mar 15, 2024

Context

No response

Bug description

When adding the built-in blog plugin and writing your first post, spinning up the live preview server aborts with a BuildError (__init__.py, line 73):

Error reading metadata of post 'blog\posts\hello-world.md' in 'docs':
Expected metadata to be defined but found nothing

Including YAML style meta-data in other (not blog) Markdown files will not result in a BuildError.

A minimal reproduction project is included.

Tested with mkdocs-material-9.4.13 and mkdocs-1.5.3.

Related links

Reproduction

material-mkdocs-blog.zip

Steps to reproduce

  1. Add built-in blog plugin
  2. Write your first post
  3. Spin up the live preview server

Browser

No response

Before submitting

@alexvoss
Sponsor Collaborator

I was able to reproduce the error but can also say that this is not generally so. I am just working on a blog tutorial and am writing about adding various things to the header of blog posts, without any problems. The content of the blog post looks fine to me, so I typed out a copy and, hey presto, it worked where the original hello-world.md did not.

The file is a UTF-8 file with BOM, which might be causing the problem?

$ file hello-world.md
hello-world.md: Unicode text, UTF-8 (with BOM) text, with CRLF line terminators

It seems that Python does not strip the BOM when a file is read with the utf-8 encoding? See the answers to this Stack Overflow question.
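To illustrate the point about the BOM, here is a minimal sketch using only the standard library. The byte string is a made-up stand-in for the post, not the actual hello-world.md from the reproduction. It shows that the utf-8 codec keeps the BOM as a leading U+FEFF character, while utf-8-sig drops it.

```python
import codecs

# Hypothetical file contents: a UTF-8 BOM followed by front matter with CRLF
# line endings, similar to what the `file` output above describes.
raw = codecs.BOM_UTF8 + b"---\r\ndate: 2024-03-15\r\n---\r\n\r\n# Hello world\r\n"

# The plain utf-8 codec decodes the BOM bytes into the character U+FEFF.
plain = raw.decode("utf-8")

# The utf-8-sig codec removes a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

print(repr(plain[:4]))
print(repr(stripped[:4]))
print(plain[1:] == stripped)
```

Because the first character is then U+FEFF rather than a dash, anything that expects the text to start with `---` will not match.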

@alexvoss
Sponsor Collaborator

@avanbremen what editor are you using on what OS? Do you have the option to save without the BOM?

@kamilkrzyskow
Collaborator

kamilkrzyskow commented Mar 15, 2024

I haven't run the example, but MkDocs should handle reading UTF-8-BOM files.
EDIT1: Does the blog plugin replace the source reading logic?
EDIT2: It doesn't look like it does replace it.
EDIT3:
Actually, checking the source of the error shows that the plugin does change how the file is read. It does not do so in the on_page_read_source event (https://www.mkdocs.org/dev-guide/plugins/#on_page_read_source); instead, the read is abstracted away in the creation of the Post object:

    # Read contents and metadata immediately
    with open(file.abs_src_path, encoding = "utf-8") as f:
        self.markdown = f.read()

        # Sadly, MkDocs swallows any exceptions that occur during parsing.
        # Since we want to provide the best possible user experience, we
        # need to catch errors early and display them nicely. We decided to
        # drop support for MkDocs' MultiMarkdown syntax, because it is not
        # correctly implemented anyway. When using MultiMarkdown syntax, all
        # date formats are returned as strings and list are not properly
        # supported. Thus, we just use the relevants parts of `get_data`.
        match: Match = YAML_RE.match(self.markdown)
        if not match:
            raise PluginError(
                f"Error reading metadata of post '{path}' in '{docs}':\n"
                f"Expected metadata to be defined but found nothing"
            )

Changing the utf-8 to utf-8-sig is the fix in this instance.
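A small sketch of why the encoding change helps, under stated assumptions: `FRONT_MATTER_RE` is a simplified, hypothetical stand-in for the plugin's `YAML_RE` (the real pattern is not shown in this thread), and `has_front_matter` is an illustrative helper, not part of the plugin. It reads the same front matter from a file with and without a BOM, using both encodings.

```python
import codecs
import re
import tempfile
from pathlib import Path

# Simplified stand-in for the plugin's YAML_RE (an assumption, not the real pattern).
FRONT_MATTER_RE = re.compile(r"^-{3}[ \t]*\n(.*?\n)-{3}[ \t]*\n", re.DOTALL)


def has_front_matter(path: Path, encoding: str) -> bool:
    """Return True if the file starts with front matter when decoded with `encoding`."""
    with open(path, encoding=encoding) as f:
        return FRONT_MATTER_RE.match(f.read()) is not None


body = b"---\ndate: 2024-03-15\n---\n\n# Hello world\n"

with tempfile.TemporaryDirectory() as tmp:
    with_bom = Path(tmp) / "with-bom.md"
    without_bom = Path(tmp) / "without-bom.md"
    with_bom.write_bytes(codecs.BOM_UTF8 + body)
    without_bom.write_bytes(body)

    # Check every combination of file and encoding.
    results = {
        (path.name, enc): has_front_matter(path, enc)
        for path in (with_bom, without_bom)
        for enc in ("utf-8", "utf-8-sig")
    }

for key, found in results.items():
    print(key, found)
```

In this sketch only the combination of a BOM file and the plain utf-8 encoding fails to match, which corresponds to the "Expected metadata to be defined but found nothing" case.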

@alexvoss
Sponsor Collaborator

Unless we declare this an upstream issue and ask the Python community to fix its Unicode support ;o)

@kamilkrzyskow
Collaborator

The current implementation is already the fixed one 😆. Back in Python 2 you had to add a u prefix to Unicode strings, so I doubt they'll make further adjustments. Maybe in Python 4 😏

@avanbremen
Author

@alexvoss @kamilkrzyskow thank you so much for looking into this.

I am using JetBrains Rider on Windows 11. Saving \docs\blog\posts\hello-world.md as UTF-8 with NO BOM does indeed fix the issue, good to know 😎. I will probably change Rider settings so that it creates UTF-8 files without the BOM by default.
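For posts that were already saved with a BOM, re-saving each one by hand is one option. As an alternative, a rough sketch of a one-off cleanup script follows. The helper names `strip_bom` and `strip_boms` are hypothetical, and the script rewrites files in place, so it is best tried on a copy or a clean working tree first.

```python
import codecs
from pathlib import Path


def strip_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from `path` in place. Return True if the file changed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Write back everything after the three BOM bytes; line endings are untouched.
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


def strip_boms(root: Path, pattern: str = "*.md") -> list[Path]:
    """Strip the BOM from every matching file below `root` and return the changed paths."""
    return [path for path in sorted(root.rglob(pattern)) if strip_bom(path)]
```

Calling `strip_boms(Path("docs"))` from the project root would process every Markdown file under docs, assuming that is where the posts live.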

I noticed that in the barebones project I included the \docs\index.md file is also encoded as UTF-8 with BOM. This file does not cause any issue when running mkdocs serve. Does this mean this problem is specifically tied to the blog plugin?

Have a great weekend!

@alexvoss
Sponsor Collaborator

Does this mean this problem is specifically tied to the blog plugin?

Yes, because the blog plugin reads data from Markdown files itself instead of relying on MkDocs - see the code @kamilkrzyskow pointed to above.

The same issue could occur in other places: I found other instances where files are read with the utf-8 encoding rather than utf-8-sig. I am sure that @squidfunk will give us his view on the matter.

Will tag this issue as a bug for now. As far as I can see, changing all instances of reading a file from utf-8 to utf-8-sig should fix this throughout the codebase and there should not be any drawbacks. Reading with utf-8-sig encoding should read files both with and without BOM. Before we can make those changes we should perhaps do some more due diligence.

@alexvoss added the bug label (Issue reports a bug) Mar 15, 2024
@kamilkrzyskow
Collaborator

I am using JetBrains Rider on Windows 11

Isn't Rider for C#? That's also a weird default. I never had an issue with PyCharm, which is also from JetBrains 🤔

Does this mean this problem is specifically tied to the blog plugin?

Yes, MkDocs itself does read the files with utf-8-sig:
https://github.com/mkdocs/mkdocs/blob/1.5.3/mkdocs/structure/pages.py#L201-L203
https://github.com/mkdocs/mkdocs/blob/master/mkdocs/structure/files.py#L455-L458

The blog plugin's implementation is shown in my previous comment. Based on the code comments, it was written to provide better support for different data types in the front matter, as otherwise everything would be a string 🤔

@squidfunk
Owner

Changing the utf-8 to utf-8-sig is the fix in this instance.

I was always asking myself what the -sig is for. Now we know 😅 @kamilkrzyskow @alexvoss would one of you like to craft a PR here and in Insiders, so we can replace all instances? Otherwise, I'll do it in the coming days.

@squidfunk changed the title from "Barebones blog project fails to build, error reading metadata" to "Blog: UTF-8 encoding problems on Windows" Mar 16, 2024
@kamilkrzyskow
Collaborator

I should be able to create the PRs today or tomorrow, @squidfunk. This only affects reads of files that could have been created with another editor, so maybe not every encoding needs to be changed to the -sig variant. It also isn't limited to Windows; it is more of an ancient piece of programming debt from the transition from other encodings to UTF-8 as the main standard.
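One reason to limit the change to reads can be sketched as follows. This is only an illustration with a temporary file, not code from the plugin. Writing with utf-8-sig prepends a BOM to the output, whereas reading with utf-8-sig accepts files with or without one.

```python
import codecs
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "out.md"

    # Writing with utf-8-sig puts a BOM in front of the content.
    path.write_text("hello", encoding="utf-8-sig")
    written_with_sig = path.read_bytes()

    # Writing with plain utf-8 does not.
    path.write_text("hello", encoding="utf-8")
    written_plain = path.read_bytes()

    # Reading the BOM-free file with utf-8-sig returns the text unchanged.
    read_back = path.read_text(encoding="utf-8-sig")

print(written_with_sig.startswith(codecs.BOM_UTF8))
print(written_plain.startswith(codecs.BOM_UTF8))
print(read_back)
```

So switching a write call to utf-8-sig would start emitting the very BOM that caused the problem, while switching a read call is backwards compatible.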

@squidfunk
Owner

Perfect, I'll assign you for now. If you run into any trouble, let us know ☺️

@squidfunk
Owner

Resolved via #6923 (and Insiders via https://github.com/squidfunk/mkdocs-material-insiders/pull/82)

@squidfunk added the resolved label (Issue is resolved, yet unreleased if open) Mar 17, 2024
@squidfunk
Owner

Released as part of 9.5.14.

4 participants