RFC: CommonMark compatibility, supporting multiple markdown/content parsers #3018

Closed
slorber opened this issue Jul 1, 2020 · 32 comments
Labels
domain: markdown Related to Markdown parsing or syntax proposal This issue is a proposal, usually non-trivial change

@slorber
Collaborator

slorber commented Jul 1, 2020

💥 Proposal

People using Docusaurus don't always like the MDX parser:

  • If you come from an existing Markdown docs base (like v1), you need to make it compatible with MDX, even if you don't plan to embed any JSX components in the Markdown
  • You might want to keep compatibility with CommonMark, to stay compatible with the existing ecosystem (GitHub md viewer, markdownlint, etc.)
  • It creates more "lock-in", because to leave MDX you have to convert back to CommonMark
  • It can be confusing not to be able to use CommonMark (i.e. HTML tags, not JSX) in .md files, and to learn that even .md files are parsed with MDX

Related discussions:


Solution ?

These libs:

  • is also based on UnifiedJS ecosystem
  • allows to pass custom React elements to replace existing tags

We may be able to build some shared abstraction on top of react-markdown + MDX.

If this works, we could switch from one parser to another with a simple switch/setting, that could be:

  • .md -> common-mark compatible parser
  • .mdx -> MDX
  • global default D2 parser setting
  • parser frontmatter

The idea would be that, if a doc does not embed any html/jsx, we could switch from one parser to the other, and shouldn't notice any change.
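As a rough illustration of how those switches could combine, here is a minimal sketch. Every name in it (`Parser`, `ParserSettings`, `resolveParser`, the `parser` front matter key) is hypothetical, and the precedence order (front matter, then global default, then file extension) is an assumption rather than something decided in this RFC.

```typescript
// Hypothetical names throughout: none of these exist in Docusaurus.
type Parser = 'commonmark' | 'mdx';

interface ParserSettings {
  // Site-wide default parser, used when the file does not choose one itself.
  defaultParser?: Parser;
}

function resolveParser(
  filePath: string,
  frontMatter: {parser?: Parser},
  settings: ParserSettings = {},
): Parser {
  // 1. A per-file front matter entry wins (assumed precedence).
  if (frontMatter.parser) {
    return frontMatter.parser;
  }
  // 2. Then the global default, if the site configured one.
  if (settings.defaultParser) {
    return settings.defaultParser;
  }
  // 3. Otherwise decide by extension: .mdx -> MDX, anything else -> CommonMark.
  return filePath.toLowerCase().endsWith('.mdx') ? 'mdx' : 'commonmark';
}
```

With this ordering, a site could set one global default and still override it for individual files.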

--

Feedback welcome

@slorber slorber added proposal This issue is a proposal, usually non-trivial change status: needs triage This issue has not been triaged by maintainers labels Jul 1, 2020
@slorber slorber changed the title RFC: CommonMark compatibility RFC: CommonMark compatibility, supporting multiple parsers Jul 1, 2020
@slorber slorber changed the title RFC: CommonMark compatibility, supporting multiple parsers RFC: CommonMark compatibility, supporting multiple markdown parsers Jul 1, 2020
@slorber
Collaborator Author

slorber commented Jul 1, 2020

Sidenote: <!--truncate--> marker used for blog summaries will likely not work in MDX 2:

Edit: I think it will still work because it's processed before mdx compilation

@borekb

borekb commented Jul 1, 2020

That sounds excellent!

Thinking about the common abstraction you mentioned, and assuming that this is still the overarching goal:

Beyond that, Docusaurus 2 is a performant static site generator and can be used to create common content-driven websites (e.g. Documentation, Blogs, Product Landing and Marketing Pages, etc) extremely quickly.

I think that Docusaurus could document an interface for plugins / formats / loaders (I don't know how to call them) that could possibly look like this:

  • At the base level, the format should be able to produce an HTML output, i.e., an HTML string. For example, if I have a .txt file, I'd be able to write a "format" that produces <pre>... contents of the txt file ...</pre>. Since this is just a string, Docusaurus wouldn't operate on it in any way, just display it.

  • Smarter formats would return some sort of AST (or JSX or whatever would be suitable). For example, if I wanted to implement .md that turns code blocks to live playgrounds, like React Styleguidist does, I'd be able to do that.
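To make the two levels concrete, here is a minimal sketch of what such an interface might look like. `ContentFormat`, `FormatOutput` and `txtFormat` are hypothetical names invented for illustration; Docusaurus does not define this API.

```typescript
// Hypothetical interface: not an existing Docusaurus API.
interface HtmlOutput {
  kind: 'html';
  html: string; // displayed as-is, Docusaurus would not inspect it
}
interface AstOutput {
  kind: 'ast';
  tree: unknown; // a "smarter" format could return an AST (or JSX) here
}
type FormatOutput = HtmlOutput | AstOutput;

interface ContentFormat {
  extensions: string[];
  load(source: string): FormatOutput;
}

// Escape the characters that would otherwise be read as markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Base-level example: a .txt file rendered inside <pre>.
const txtFormat: ContentFormat = {
  extensions: ['.txt'],
  load: (source) => ({kind: 'html', html: `<pre>${escapeHtml(source)}</pre>`}),
};
```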

Some wilder use cases this would cover (I actually had them in the past):

  • A marketing team uses headless WordPress to maintain the contents of landing pages.
  • Feature comparison / grid is maintained in a Google Sheet.
  • .md files use some sort of Markdown dialect, for example, the site used to be powered by MkDocs and uses Python-Markdown plus a couple of custom extensions.

For this RFC, I think it's more than enough to support CommonMark but since I've now spent some time thinking about how we'd use Docusaurus and what I'd love it to allow me to do, I thought I'd post it here.

Thanks a lot for this RFC and all the work that goes into Docusaurus!

@borekb

borekb commented Jul 15, 2020

@slorber I'd like to create a prototype of CommonMark support but am unfamiliar with Docusaurus codebase so would really appreciate high-level guidelines if you will.

Roughly speaking, if I wanted to parse .md files as CommonMark, which parts of the codebase would I need to touch? I can overwrite the code in a fork for now, i.e., it's not my ambition yet to make this a general solution supporting both MDX and CommonMark. I just want to see what the minimal set of changes is to swap the MDX parser for something like remark.

Any hints appreciated 🙏 .

@slorber
Collaborator Author

slorber commented Jul 15, 2020

Hi,

My first intuition would be to modify "docusaurus-mdx-loader", and provide a loader option to tell it to load the files as md or mdx. In the end, we need a React component anyway, using MDX, but in md mode, we could convert the html elements to JSX elements just before feeding mdx, so that mdx is happy?

Not sure, this would require some experiments to see if this is possible

@borekb

borekb commented Jul 15, 2020

Thanks a lot, I'll give it a go later this week or the next one.

@borekb

borekb commented Aug 9, 2020

plugin-content-docs that supports CommonMark for .md files

We've experimented with plain Markdown support in an internal prototype and wanted to post the key results here.

Summary

It's doable and not that complex: about 200 LoC. There's currently some ugliness. For example, to get the ToC, we convert the AST to React components and then serialize the result back to a string. We haven't found a better solution yet, but I'm sure there is one, e.g. something like hast-util-to-jsx if it were maintained.

How it's done

The rewritten plugin-content-docs-2/index.ts customizes loaders:

  • For .mdx files, use @docusaurus/mdx-loader
  • For .md files, use a custom loader (see below).

In our prototype, we first duplicate ~20 LoC from the base implementation and then customize the loaders. The entire file (certainly with opportunities for further cleanup) looks like this:

import path from 'path';

import admonitions from 'remark-admonitions';
import {STATIC_DIR_NAME} from '@docusaurus/core/lib/constants';
import {
  docuHash,
  aliasedSitePath,
} from '@docusaurus/utils';
import {
  LoadContext,
  Plugin,
  OptionValidationContext,
  ValidationResult,
} from '@docusaurus/types';

import loadEnv from '@docusaurus/plugin-content-docs/lib/env';

import {
  PluginOptions,
  LoadedContent,
  SourceToPermalink,
} from '@docusaurus/plugin-content-docs/lib/types';
import {Configuration} from 'webpack';
import {VERSIONS_JSON_FILE} from '@docusaurus/plugin-content-docs/lib/constants';
import {PluginOptionSchema} from '@docusaurus/plugin-content-docs/lib/pluginOptionSchema';
import {ValidationError} from '@hapi/joi';

import * as originalPluginContentDocs from '@docusaurus/plugin-content-docs';

export default function pluginContentDocs(
  context: LoadContext,
  options: PluginOptions,
): Plugin<LoadedContent | null, typeof PluginOptionSchema> {

  if (options.admonitions) {
    options.remarkPlugins = options.remarkPlugins.concat([
      [admonitions, options.admonitions],
    ]);
  }

  const {siteDir, generatedFilesDir} = context;
  const docsDir = path.resolve(siteDir, options.path);
  const sourceToPermalink: SourceToPermalink = {};

  const dataDir = path.join(
    generatedFilesDir,
    'docusaurus-plugin-content-docs',
    // options.id ?? 'default', // TODO support multi-instance
  );

  // Versioning.
  const env = loadEnv(siteDir, {disableVersioning: options.disableVersioning});
  const {versioning} = env;
  const {
    docsDir: versionedDir,
  } = versioning;

  const result = originalPluginContentDocs.default(context, options);
  result.configureWebpack = function (_config, isServer, utils) {
    const {getBabelLoader, getCacheLoader} = utils;
    const {rehypePlugins, remarkPlugins} = options;
    // Suppress warnings about non-existing of versions file.
    const stats = {
      warningsFilter: [VERSIONS_JSON_FILE],
    };

    return {
      stats,
      devServer: {
        stats,
      },
      resolve: {
        alias: {
          '~docs': dataDir,
        },
      },
      module: {
        rules: [
          {
            test: /(\.mdx)$/,
            include: [docsDir, versionedDir].filter(Boolean),
            use: [
              getCacheLoader(isServer),
              getBabelLoader(isServer),
              {
                loader: require.resolve('@docusaurus/mdx-loader'),
                options: {
                  remarkPlugins,
                  rehypePlugins,
                  staticDir: path.join(siteDir, STATIC_DIR_NAME),
                  metadataPath: (mdxPath: string) => {
                    // Note that metadataPath must be the same/in-sync as
                    // the path from createData for each MDX.
                    const aliasedSource = aliasedSitePath(mdxPath, siteDir);
                    return path.join(
                      dataDir,
                      `${docuHash(aliasedSource)}.json`,
                    );
                  },
                },
              },
              {
                loader: path.resolve(__dirname, './markdown/index.js'),
                options: {
                  siteDir,
                  docsDir,
                  sourceToPermalink,
                  versionedDir,
                },
              },
            ].filter(Boolean),
          },
          {
            test: /(\.md)$/,
            include: [docsDir, versionedDir].filter(Boolean),
            use: [
              getCacheLoader(isServer),
              getBabelLoader(isServer),
              {
                loader: path.resolve(__dirname, './custom-md-loader/index.js'),
                options: {
                  remarkPlugins,
                  rehypePlugins,
                  staticDir: path.join(siteDir, STATIC_DIR_NAME),
                  metadataPath: (mdxPath: string) => {
                    // Note that metadataPath must be the same/in-sync as
                    // the path from createData for each MDX.
                    const aliasedSource = aliasedSitePath(mdxPath, siteDir);
                    return path.join(
                      dataDir,
                      `${docuHash(aliasedSource)}.json`,
                    );
                  },
                },
              },
              {
                loader: path.resolve(__dirname, './markdown/index.js'),
                options: {
                  siteDir,
                  docsDir,
                  sourceToPermalink,
                  versionedDir,
                },
              },
            ].filter(Boolean),
          },
        ],
      },
    } as Configuration;
  }

  return result;
}

export function validateOptions({
  validate,
  options,
}: OptionValidationContext<PluginOptions, ValidationError>): ValidationResult<
  PluginOptions,
  ValidationError
> {
  return originalPluginContentDocs.validateOptions({validate, options});
}

Then there's a custom loader – plugin-content-docs-2/src/custom-md-loader/index.ts. It looks like this in full:

import {loader} from 'webpack';
import {getOptions} from 'loader-utils';
import {readFileSync} from 'fs-extra';
import matter from 'gray-matter';
import stringifyObject from 'stringify-object';
import unified from 'unified';
import parse from 'remark-parse';
import remark2rehype from 'remark-rehype';
import rehype2react from 'rehype-react';
import React from 'react';
import rightToc from '@docusaurus/mdx-loader/src/remark/rightToc';
import slug from 'remark-slug';
import raw from 'rehype-raw';
import emoji from 'remark-emoji';
import admonitions from 'remark-admonitions';
import headings from 'rehype-autolink-headings';
import highlight from '@mapbox/rehype-prism';
import reactElementToJSXString from 'react-element-to-jsx-string';

const mdLoader: loader.Loader = function (fileString) {
  const callback = this.async();

  const {data, content} = matter(fileString);

  const options = getOptions(this) || {};

  let exportStr = `export const frontMatter = ${stringifyObject(data)};`;
  // Read metadata for this MDX and export it.
  if (options.metadataPath && typeof options.metadataPath === 'function') {
    const metadataPath = options.metadataPath(this.resourcePath);
    if (metadataPath) {
      // Add as dependency of this loader result so that we can
      // recompile if metadata is changed.
      this.addDependency(metadataPath);
      const metadata = readFileSync(metadataPath, 'utf8');
      exportStr += `\nexport const metadata = ${metadata};`;
    }
  }

  const processedMd = unified()
    .use(parse, {commonmark: true})
    .use(slug)
    .use(emoji)
    .use(admonitions)
    .use(rightToc)
    .use(remark2rehype, {allowDangerousHtml: true})
    .use(raw)
    .use(headings)
    .use(highlight)
    .use(rehype2react, {createElement: React.createElement, Fragment: React.Fragment})
    .processSync(content);

  const jsxString = reactElementToJSXString((processedMd as any).result);

  // I don't like this at all, but it's a prototype...
  // We need to get 'rightToc' data from the JSX string, so following lines
  // are about getting the info and then replacing it, along with escaping unwanted chars.
  const rightTocString = jsxString
    .match(/(export const rightToc = \[[\s\S.]*\];)/)![1]
    .replace(/(\\n)|(\\t)|(\\)/g, '');

  const escapedJsxString = jsxString
    .replace(/{\`[\S\s.]*?export const rightToc = \[[\s\S.]*\];[\S\s.]*?\`}/, '')
    .replace(/{'[\s\S]*?'}/g, `{' '}`)
    .replace(/`/g, '\`');

  const code = `
  import React from 'react';

  ${rightTocString}
  ${exportStr}

  export default function MDLoader() {
    return (${escapedJsxString});
  }
  `;

  return callback && callback(null, code);
};

export default mdLoader;

If there wasn't the ugly React to string parsing code, it would actually be quite simple.

The downside from the maintenance point of view is that the MD loader is explicit about its unified.js plugins while the MDX loader is a bit more indirect / obscure, so there would be two places to maintain this configuration. But I think this could be refactored to be more aligned, and even in the worst case, it's like 15 lines of code and the default set of plugins probably isn't changing that often.

Overall, it seems feasible to me.

@borekb

borekb commented Aug 9, 2020

An alternative approach would be to convert MD to MDX first and then just let the mdx-loader do its thing. But there probably isn't currently a converter from MD to MDX in the unified ecosystem, though many pieces are in place: unifiedjs/ideas#9.

@slorber
Collaborator Author

slorber commented Aug 11, 2020

Thanks for those details, that looks interesting. If MDX provided a converter, that would be great; it would also be helpful for v1->v2 migrations.

I don't have much time to explore these ideas but we'll come back to it someday.

Note, not sure it's related, but there's a large docs plugin refactor here: #3245

@nilsocket

Is it possible to have something simple that works out of the box?

I need math blocks. I looked at the MDX documentation, and it's too messy and complicated.

Docusaurus seems to assume, or at least to target, only users who are front-end developers and know JSX, React, ...

Or is there a simple way to get math block support?

Thank you.

@slorber
Collaborator Author

slorber commented Apr 16, 2021

@nilsocket I don't think math blocks (latex/katex?) are really related to the markdown parser. But you are right, and we should make this easy. Can you explain your use case in more detail on this new issue I just created? #4625

@lukejgaskell

@slorber is there an official way of handling this? I have a similar situation where I don't want my .md files validated with .mdx.

@slorber
Collaborator Author

slorber commented Aug 19, 2021

@lukejgaskell unfortunately no easy solution can be implemented in userland to solve this properly.
The solution proposed by @borekb is likely the best you can do, and I understand it might be intimidating 😅

MDX is not a "validator" for md files, it converts those files to React components that are loaded as JS modules in the client app through webpack loaders.

To make this compatible with CommonMark, this would require the loader to not use MDX in some cases but use a different Remark parsing logic.

For .md files we even have 2 choices now:

  • convert those files to React components, but use CommonMark compatible processing (solution of @borekb )
  • convert those files to some AST that a small client-side runtime could render (it may be more performant for build time, but will have to poc this).
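To illustrate the second option only, here is a minimal sketch of a runtime that renders a serialized AST. The node shape is loosely modeled on hast and is an assumption, as are the names `AstNode` and `renderToHtml`; a real runtime would more likely produce React elements, and would need to handle attributes.

```typescript
// Hypothetical AST shape (hast-like); not an existing Docusaurus format.
type AstNode =
  | {type: 'text'; value: string}
  | {type: 'element'; tagName: string; children: AstNode[]};

// Escape text so it cannot be interpreted as markup.
function escapeText(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Recursively render the tree to an HTML string.
function renderToHtml(node: AstNode): string {
  if (node.type === 'text') {
    return escapeText(node.value);
  }
  const inner = node.children.map(renderToHtml).join('');
  return `<${node.tagName}>${inner}</${node.tagName}>`;
}
```

The build would then only need to emit JSON per document instead of compiling a JS module for each one, which is where the potential build-time gain would come from.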

Some challenges to consider:

  • The goal is not only to support CommonMark, but also try to reduce build times/improve perfs for sites not needing MDX (or with limited usage)
  • Some non-MDX Docusaurus markdown features (admonitions, code blocks, etc.) should ideally keep working when switching the parser

This is something I want to work on but I don't have time in the short term.

@lukejgaskell

lukejgaskell commented Aug 19, 2021

@slorber That makes sense, thank you for the detailed explanation. If it's helpful, my use case is that I'm importing markdown from different sources to host on a single site. That markdown may or may not follow the same syntax as the current loader.

For example, some of it uses <pre> tags, or other HTML elements, but not always correctly... which makes me have to escape them. To fix my scenario I end up doing a bunch of regex parsing to get those files to align with the loader. Maybe there are other ways to handle these scenarios, but having loader options could be helpful as different sources have different lax practices on their markdown.

Usually it ends up breaking in the build (because of the mdx loader) even though I'd like it to just show a broken file in those scenarios. Anyways, here's the regex I end up doing to solve some of this:

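// Assumption: `replace` comes from the replace-in-file package (the import was not shown in the comment).
const replaceLT = (m, group1) => (!group1 ? m : "&lt;");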
const replaceGT = (m, group1) => (!group1 ? m : "&gt;");
const replaceFileLink = (m) => m.replace("(", "(pathname://");

async function run() {
  await replace({
    files: ["docs/**/*.md"],
    from: [
      /<pre>/g,
      /<\/pre>/g,
      /<!--.*-->/g,
      /\[.*?\]\(.*?\.(json|xlsx|xls|zip|docx|ps1)\)/g, // fix file type links to not be picked up by loader
      /\\`|`(?:\\`|[^`])*`|(<)/gm, //find all less than symbols that are not between backticks
      /\\`|`(?:\\`|[^`])*`|(>)/gm, //find all greater than symbols that are not between backticks
    ],
    to: ["```", "```", "", replaceFileLink, replaceLT, replaceGT],
  });
}

@zepatrik

zepatrik commented Feb 3, 2022

One major problem I am facing right now is that I auto-generate some docs pages from go code. It is theoretically possible to inject some HTML/js because of MDX. Therefore, the generated pages are HTML escaped (replacing < > & ' ").
But then, such escaped characters are not rendered as expected in code samples:
[Screenshot from 2022-02-03 11-31-28]
generated from this source:

We have to admit, this is not easy if you don&#39;t speak jq fluently. What
about opening an issue and telling us what predefined selectors you want to
have? https://github.com/ory/kratos/issues/new/choose

​```
kratos identities delete &lt;id-0 [id-1 ...]&gt; [flags]
​```

In "standard" markdown there is no need to escape any non-trusted input, but in MDX there is. It would be way safer to say: "this is standard markdown from an untrusted source, don't try to run it as JS" instead of partially escaping stuff where I might miss some edge cases.

@Josh-Cena
Collaborator

@zepatrik If you want to do post-processing, don't sanitize code in code blocks. Also, you can use a remark plugin to strip imports/exports very easily. Apart from import/exports, MDX can't execute arbitrary code.
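A remark plugin along those lines could be sketched as follows. The names `MdNode`, `stripEsm` and `remarkStripEsm` are hypothetical, and the node types are an assumption: MDX 1 is assumed to emit `import` and `export` nodes, and MDX 2 `mdxjsEsm` nodes. It only removes import/export statements and does nothing about inline JSX.

```typescript
// Minimal mdast-like node; real trees are produced by remark.
interface MdNode {
  type: string;
  value?: string;
  children?: MdNode[];
}

// Assumed node types for import/export statements (MDX 1 and MDX 2).
const ESM_TYPES = new Set(['import', 'export', 'mdxjsEsm']);

// Return a copy of the tree without import/export nodes, at any depth.
function stripEsm(node: MdNode): MdNode {
  if (!node.children) {
    return node;
  }
  return {
    ...node,
    children: node.children
      .filter((child) => !ESM_TYPES.has(child.type))
      .map(stripEsm),
  };
}

// unified-style plugin: an attacher that returns a transformer.
function remarkStripEsm() {
  return (tree: MdNode): MdNode => stripEsm(tree);
}
```

It would be registered like any other remark plugin, for example through the `remarkPlugins` option of the docs plugin.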

@zepatrik

zepatrik commented Feb 3, 2022

Apart from import/exports, MDX can't execute arbitrary code.

Can you elaborate on that? I can easily run arbitrary javascript on the MDX playground using e.g.

<div onClick={() => fetch("https://google.com/").then(console.log).catch(console.log)}>Click me!</div>

Of course with that, I could e.g. leak stuff from local storage to one of my servers or do all kinds of things.

@Josh-Cena Josh-Cena added the domain: markdown Related to Markdown parsing or syntax label Mar 29, 2022
@timothyerwin

What is the status on this? Does Docusaurus 2 send .md files to another parser? We are getting build errors for .md files that work perfectly fine on GitHub.

@slorber
Collaborator Author

slorber commented Aug 24, 2022

@timothyerwin all the updates are here, it's not necessary to ask.

Docusaurus is based on MDX, and you have to make sure your docs are compatible. This might require editing some of them, particularly HTML tags so that they conform with JSX.

@zhalice2011

zhalice2011 commented May 29, 2023

I have the same problem: the people who write our documents are not proficient in React, and the officially provided automatic migration script cannot convert Markdown to MDX very well.

Is there a way to specify that files with .mdx extension use docusaurus/mdx-loader, while files with .md extension use version 1.0 of the markdown renderer?

Looking forward to your reply.

@slorber
Collaborator Author

slorber commented May 30, 2023

With the upcoming Docusaurus 3, we upgrade to MDX 2 (#8288), and there's a format: 'md' compiler config that permits us to support CommonMark.

Note: the content is parsed as CommonMark, and it's not possible to use JSX inside that content anymore, but you can start using raw html and inline styles like on GitHub (enabled by #8960), but under the hood, the content is still compiled as a React component. Features such as admonitions, code blocks etc keep working.

If you want early access to these features, use a canary version of Docusaurus and follow what's written in this PR to turn on CommonMark: #8288 (for now just having .md extension is enough, but I might change this for v3)

@ntucker
Contributor

ntucker commented Jun 18, 2023

Can we have an option to disable commonmark? This is creating a lot of issues when I just want to use React 18.

@slorber
Collaborator Author

slorber commented Jun 21, 2023

@ntucker I was going to add a global format: 'mdx' option (and probably make it the default in v3), now there's even more reason to do so ;)

Note: you can use format: 'mdx' frontmatter on each file as a temporary workaround

@ntucker
Contributor

ntucker commented Jun 21, 2023

Altering every single file when the last edit time is used in the final site for publish time is not exciting to me. However, I'm very glad to hear about upcoming global control!

@slorber
Collaborator Author

slorber commented Jun 22, 2023

Note: the new CommonMark mode will probably be marked as experimental in v3.0 and opt-in.

The basic rendering works fine, but it is currently missing some Docusaurus features.
Track #9092 to make sure the features you need are supported, and report any missing features you detect.

@slorber
Collaborator Author

slorber commented Jun 23, 2023

As part of #9097, Docusaurus v3 will keep using MDX to parse .md files by default, but allow you to opt-in for explicit usage of CommonMark (for your whole site, for .md files, or on a per-file basis)

Limitations: there are some features not working yet with CommonMark, see #9092

cc @ntucker

@slorber slorber modified the milestones: 3.x+, 3.0 Aug 17, 2023
@nickmccurdy
Contributor

nickmccurdy commented Nov 15, 2023

If you're coming from the blog and want to opt into CommonMark, use markdown: { format: "detect" } in your global config or format: md in Markdown front matter.
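For reference, the global setting would sit in the site config roughly like this. It is a fragment, not a complete config: only `markdown.format` comes from this thread, and the `title` value is a placeholder.

```typescript
// Fragment of a Docusaurus v3 site config (docusaurus.config.js / .ts).
const config = {
  title: 'My Site', // placeholder, not from this thread
  markdown: {
    // 'detect' chooses the format from the file extension:
    // .md files are parsed as CommonMark, .mdx files as MDX.
    format: 'detect',
  },
};
```

A real config file would also export this object. The per-file alternative is a `format: md` entry in a document's front matter.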

10 participants