Blog: `UTF-8` encoding problems on Windows #6917

avanbremen · 2024-03-15T15:34:50Z

Context

No response

Bug description

When adding the built-in blog plugin and writing your first post, spinning up the live preview server aborts with a BuildError (`__init__.py`, line 73):

Error reading metadata of post 'blog\posts\hello-world.md' in 'docs':
Expected metadata to be defined but found nothing

Including YAML-style metadata in other (non-blog) Markdown files does not result in a BuildError.

A minimal reproduction project is included.

Tested with mkdocs-material-9.4.13 and mkdocs-1.5.3.

Reproduction

material-mkdocs-blog.zip

Steps to reproduce

Browser

No response

Before submitting

I have read and followed the bug reporting guidelines.
I have attached links to the documentation, and possibly related issues and discussions.
I assure that I have removed all customizations before submitting this bug report.
I have attached a .zip file with a minimal reproduction using the built-in info plugin.


alexvoss · 2024-03-15T16:22:52Z

I was able to reproduce the error but can also say that this is not generally so. I am just working on a blog tutorial and am writing about adding various things to the header of blog posts, without any problems. The content of the blog post looks fine to me, so I typed out a copy and, hey presto, it worked where the original hello-world.md did not.

The file is a UTF-8 file with BOM, which might be causing the problem?

$ file hello-world.md
hello-world.md: Unicode text, UTF-8 (with BOM) text, with CRLF line terminators

It seems that Python does not strip the BOM when a file is read with the utf-8 encoding? See the answers to this Stack Overflow question.
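A minimal sketch of that behaviour, using only the standard library. The sample bytes are made up for illustration and are not taken from the attached reproduction.

```python
import codecs

# Hypothetical file content: a UTF-8 BOM followed by YAML front matter.
raw = codecs.BOM_UTF8 + b"---\ntitle: Hello\n---\n"

as_utf8 = raw.decode("utf-8")
as_utf8_sig = raw.decode("utf-8-sig")

# "utf-8" decodes the BOM into a leading U+FEFF character and keeps it.
print(repr(as_utf8[0]))
# "utf-8-sig" drops the BOM, so the text starts directly with the front matter.
print(repr(as_utf8_sig[:3]))
```

So the file is read without an error either way; the difference is whether an invisible U+FEFF character ends up at the start of the string.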

alexvoss · 2024-03-15T16:24:21Z

@avanbremen what editor are you using on what OS? Do you have the option to save without the BOM?

kamilkrzyskow · 2024-03-15T16:39:18Z

I haven't run the example, but MkDocs should handle reading UTF-8-BOM files.
EDIT1: ~~Does the blog plugin replace the source reading logic?~~
EDIT2: ~~It doesn't look like it does replace it.~~
EDIT3:
Actually, checking the source of the error shows that the plugin does change the logic for reading the file. It doesn't do so in the on_page_read_source event (https://www.mkdocs.org/dev-guide/plugins/#on_page_read_source); instead, the read is abstracted away in the creation of the Post object:

mkdocs-material/src/plugins/blog/structure/__init__.py

Lines 59 to 75 in 2f1b2e9

    # Read contents and metadata immediately
    with open(file.abs_src_path, encoding = "utf-8") as f:
        self.markdown = f.read()
        # Sadly, MkDocs swallows any exceptions that occur during parsing.
        # Since we want to provide the best possible user experience, we
        # need to catch errors early and display them nicely. We decided to
        # drop support for MkDocs' MultiMarkdown syntax, because it is not
        # correctly implemented anyway. When using MultiMarkdown syntax, all
        # date formats are returned as strings and list are not properly
        # supported. Thus, we just use the relevants parts of `get_data`.
        match: Match = YAML_RE.match(self.markdown)
        if not match:
            raise PluginError(
                f"Error reading metadata of post '{path}' in '{docs}':\n"
                f"Expected metadata to be defined but found nothing"
            )

Changing the utf-8 to utf-8-sig is the fix in this instance.
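A rough sketch of why the match fails and why `utf-8-sig` helps. `FRONT_MATTER_RE` is a simplified stand-in for the plugin's `YAML_RE` (an assumption, not the real pattern), and `has_front_matter` is a hypothetical helper written only for this illustration.

```python
import codecs
import os
import re
import tempfile

# Simplified stand-in for the plugin's YAML_RE (assumption, not the real pattern).
FRONT_MATTER_RE = re.compile(r"^-{3}[ \t]*\n(.*?\n)-{3}[ \t]*\n", re.DOTALL)


def has_front_matter(path: str, encoding: str) -> bool:
    """Read the file with the given encoding and test for leading front matter."""
    with open(path, encoding=encoding) as f:
        return FRONT_MATTER_RE.match(f.read()) is not None


with tempfile.TemporaryDirectory() as tmp:
    post = os.path.join(tmp, "hello-world.md")
    # Write the post the way an editor that emits a BOM and CRLF line endings would.
    with open(post, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"---\r\ndate: 2024-03-15\r\n---\r\n\r\n# Hello\r\n")

    # With "utf-8" the text starts with U+FEFF, so the anchored pattern cannot match.
    with_utf8 = has_front_matter(post, "utf-8")
    # With "utf-8-sig" the BOM is dropped and the front matter is found.
    with_utf8_sig = has_front_matter(post, "utf-8-sig")

print(with_utf8, with_utf8_sig)
```

The CRLF line endings are not the culprit here: opening the file in text mode translates them to `\n` before the pattern is applied.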

alexvoss · 2024-03-15T17:05:28Z

Unless we declare this an upstream issue and ask the Python community to fix its Unicode support ;o)

kamilkrzyskow · 2024-03-15T17:36:59Z

The current implementation is fixed 😆, back in Python 2 you had to add a u before Unicode strings, I doubt they'll make further adjustments. Maybe in Python 4 😏

avanbremen · 2024-03-15T19:20:50Z

@alexvoss @kamilkrzyskow thank you so much for looking into this.

I am using JetBrains Rider on Windows 11. Saving \docs\blog\posts\hello-world.md as UTF-8 with NO BOM does indeed fix the issue, good to know 😎. I will probably change Rider settings so that it creates UTF-8 files without the BOM by default.

I noticed that in the barebones project I included, the \docs\index.md file is also encoded as UTF-8 with BOM. This file does not cause any issue when running mkdocs serve. Does this mean this problem is specifically tied to the blog plugin?

Have a great weekend!

alexvoss · 2024-03-15T19:41:06Z

> Does this mean this problem is specifically tied to the blog plugin?

Yes, because the blog plugin reads data from Markdown files itself instead of relying on MkDocs - see the code @kamilkrzyskow pointed to above.

The same issue could occur in other places: I found other instances where files are read with the utf-8 encoding rather than utf-8-sig. I am sure that @squidfunk will share his view on the matter.

Will tag this issue as a bug for now. As far as I can see, changing all instances of reading a file from utf-8 to utf-8-sig should fix this throughout the codebase and there should not be any drawbacks. Reading with utf-8-sig encoding should read files both with and without BOM. Before we can make those changes we should perhaps do some more due diligence.
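A quick check of the claim that `utf-8-sig` reads both variants, as a sketch with made-up sample content:

```python
import codecs

# Hypothetical post content, including a non-ASCII character.
text = "---\ntitle: Héllo\n---\n"
body = text.encode("utf-8")

without_bom = body
with_bom = codecs.BOM_UTF8 + body

# Decoding with "utf-8-sig" yields the same text whether or not a BOM is present.
decoded_plain = without_bom.decode("utf-8-sig")
decoded_bom = with_bom.decode("utf-8-sig")

print(decoded_plain == decoded_bom == text)
```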

kamilkrzyskow · 2024-03-15T19:44:12Z

> I am using JetBrains Rider on Windows 11

Isn't Rider for C#? Also a weird default; I never had an issue with PyCharm, which is also from JetBrains 🤔

> Does this mean this problem is specifically tied to the blog plugin?

Yes, MkDocs itself does read the files with utf-8-sig:
https://github.com/mkdocs/mkdocs/blob/1.5.3/mkdocs/structure/pages.py#L201-L203
https://github.com/mkdocs/mkdocs/blob/master/mkdocs/structure/files.py#L455-L458

The blog plugin's implementation is linked in the previous comment and, based on the code comments, it was written to provide better support for different data types in the front matter, as otherwise everything would be a string 🤔

squidfunk · 2024-03-16T01:11:12Z

> Changing the utf-8 to utf-8-sig is the fix in this instance.

I was always asking myself what the -sig is for. Now we know 😅 @kamilkrzyskow @alexvoss would one of you like to craft a PR here and in Insiders, so we can replace all instances? Otherwise, I'll do it in the coming days.

kamilkrzyskow · 2024-03-16T20:33:01Z

I should be able to create the PRs today or tomorrow @squidfunk. This only affects reads of files that could have been created using another editor, so maybe not every encoding needs to be changed to the -sig variant. This also isn't specific to Windows; it's more of an ancient programming debt from the transition from other encodings to UTF-8 as the main standard.
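One reason to limit the change to reads, sketched with the standard library only: encoding with `utf-8-sig` prepends a BOM, so switching writes over would start emitting one. The sample string is made up for illustration.

```python
import codecs

text = "title: Hello\n"

encoded_plain = text.encode("utf-8")
encoded_sig = text.encode("utf-8-sig")

# On encoding, "utf-8-sig" adds the three BOM bytes in front of the payload.
print(encoded_sig == codecs.BOM_UTF8 + encoded_plain)
```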

squidfunk · 2024-03-17T01:21:17Z

Perfect, I'll assign you for now. If you run into any troubles, let us know ☺️

squidfunk · 2024-03-17T05:05:34Z

Resolved via #6923 (and Insiders via https://github.com/squidfunk/mkdocs-material-insiders/pull/82)

squidfunk · 2024-03-18T00:48:32Z

Released as part of 9.5.14.

alexvoss added the bug Issue reports a bug label Mar 15, 2024

squidfunk changed the title ~~Barebones blog project fails to build, error reading metadata~~ Blog: UTF-8 encoding problems on Windows Mar 16, 2024

squidfunk assigned kamilkrzyskow Mar 17, 2024

kamilkrzyskow mentioned this issue Mar 17, 2024

Fixed UTF-8 with BOM encoding support #6923

Merged

squidfunk added the resolved Issue is resolved, yet unreleased if open label Mar 17, 2024

squidfunk closed this as completed Mar 18, 2024
