Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major performance issue when parsing a long list of reference links #996

Closed
RomanHotsiy opened this issue Jan 26, 2024 · 10 comments · Fixed by #1004
Closed

Major performance issue when parsing a long list of reference links #996

RomanHotsiy opened this issue Jan 26, 2024 · 10 comments · Fixed by #1004

Comments

@RomanHotsiy
Copy link

We noticed a major performance problem when parsing a long list of references similar to this benchmark: https://github.com/markdown-it/markdown-it/blob/master/benchmark/samples/block-ref-list.md

In our case we have a list of 1000+ references.

The root cause seems to be this termination logic:

const terminatorRules = state.md.block.ruler.getRules('reference')
const oldParentType = state.parentType
state.parentType = 'reference'
for (; nextLine < endLine && !state.isEmpty(nextLine); nextLine++) {
// this would be a code block normally, but after paragraph
// it's considered a lazy continuation regardless of what's there
if (state.sCount[nextLine] - state.blkIndent > 3) { continue }
// quirk for blockquotes, this line should already be checked by that rule
if (state.sCount[nextLine] < 0) { continue }
// Some tags can terminate paragraph without empty line.
let terminate = false
for (let i = 0, l = terminatorRules.length; i < l; i++) {
if (terminatorRules[i](state, nextLine, endLine, true)) {
terminate = true
break
}
}
if (terminate) { break }
}

Removing this logic doesn't break any tests and improves speed of parsing our long list 30x 🙀

I tried to find some similar problems and found this thread: #54

I believe this table is incorrect but I'm not sure:

// First 2 params - rule name & source. Secondary array - list of rules,
// which can be terminated by this one.
[ 'table', require('./rules_block/table'), [ 'paragraph', 'reference' ] ],
[ 'code', require('./rules_block/code') ],
[ 'fence', require('./rules_block/fence'), [ 'paragraph', 'reference', 'blockquote', 'list' ] ],
[ 'blockquote', require('./rules_block/blockquote'), [ 'paragraph', 'reference', 'blockquote', 'list' ] ],
[ 'hr', require('./rules_block/hr'), [ 'paragraph', 'reference', 'blockquote', 'list' ] ],
[ 'list', require('./rules_block/list'), [ 'paragraph', 'reference', 'blockquote' ] ],
[ 'reference', require('./rules_block/reference') ],
[ 'html_block', require('./rules_block/html_block'), [ 'paragraph', 'reference', 'blockquote' ] ],
[ 'heading', require('./rules_block/heading'), [ 'paragraph', 'reference', 'blockquote' ] ],
[ 'lheading', require('./rules_block/lheading') ],
[ 'paragraph', require('./rules_block/paragraph') ]

From the CommonMark spec I can't see that reference can be terminated by other rules and it's the other way around actually - the reference can terminate some of the rules. Am I correct?

I tried modifying the code above to the variant below and all the tests are passing performance is still fast:

const _rules = [
  // First 2 params - rule name & source. Secondary array - list of rules,
  // which can be terminated by this one.
  ['table',      r_table,      ['paragraph']],
  ['code',       r_code],
  ['fence',      r_fence,      ['paragraph', 'blockquote', 'list']],
  ['blockquote', r_blockquote, ['paragraph', 'blockquote', 'list']],
  ['hr',         r_hr,         ['paragraph', 'blockquote', 'list']],
  ['list',       r_list,       ['paragraph', 'blockquote']],
  ['reference',  r_reference, ['table', 'fence', 'blockquote', 'hr', 'list', 'html_block', 'heading']],
  ['html_block', r_html_block, ['paragraph', 'blockquote']],
  ['heading',    r_heading,    ['paragraph', 'blockquote']],
  ['lheading',   r_lheading],
  ['paragraph',  r_paragraph]
]

Could someone check if my understanding is correct? I would be happy to open a PR.

@RomanHotsiy
Copy link
Author

Now I read a bit more code and I'm not sure I understand the spec correctly 🤔

// Reference can not terminate anything. This check is for safety only.

@RomanHotsiy
Copy link
Author

From the commonmark spec:

image

@RomanHotsiy
Copy link
Author

Hey @puzrin, @rlidwka, sorry for pinging you directly 🙌

Could you help me figure it out. I would be happy to open a PR if you just point me into correct direction.

Thanks in advance!

@rlidwka
Copy link
Member

rlidwka commented Jan 29, 2024

['reference', r_reference, ['table', 'fence', 'blockquote', 'hr', 'list', 'html_block', 'heading']],

It never makes sense to put hr or heading (or table iirc) in that list, because they don't call block parser recursively. Their end is determined by their own markup, and nothing can interrupt those tags.

In any case, if you have a solution that massively improves performance, and all tests still pass, it's worth adding PR.

@RomanHotsiy
Copy link
Author

Their end is determined by their own markup, and nothing can interrupt those tags.

@rlidwka but is it about interrupting those tags or interrupting BY those tags.

I'm confused by the comment in the code:

Secondary array - list of rules,which can be terminated by this one.

@rlidwka
Copy link
Member

rlidwka commented Jan 29, 2024

@rlidwka but is it about interrupting those tags or interrupting BY those tags.

[ 'table', require('./rules_block/table'), [ 'paragraph', 'reference' ] ],

This means that table interrupts a paragraph.

This also means that paragraph can be interrupted by a table.

@RomanHotsiy
Copy link
Author

Thanks!

Created PR here: #998

@rlidwka
Copy link
Member

rlidwka commented Feb 3, 2024

This is not just a performance issue, this is algorithmic complexity issue.

[ref1]: url 'title'
[ref2]: url
[ref3]: url
[ref4]: url
...
[ref10000]: url

Where does first reference end?

Keep in mind, that it could look like this, which is one big reference:

[ref1]: url '
[ref2]: url
[ref3]: url
[ref4]: url
...
[ref10000]: url
'

Currently, reference 1 is parsed from line 1-10000, reference 2 is parsed from line 2-10000, etc., and this entire block of text has to be extracted beforehand (to remove indents, leading > for quotes and such).

Do you see O(n^2) popping up here?

Commonmark implementation probably doesn't have it, because they can strip indents on the fly (I guess), but here it would mean a big rewrite (possibly can't even be done without losing modularity).

Would be interesting to know if you have any ideas regarding that.

@puzrin
Copy link
Member

puzrin commented Feb 4, 2024

Commonmark implementation probably doesn't have it, because they can strip indents on the fly (I guess), but here it would mean a big rewrite (possibly can't even be done without losing modularity).

Would be interesting to know if you have any ideas regarding that.

May be we could restrict max lines count for refs to reasonable number? As far as I remember, we have some hard limits in emphasis. All those hard limits can be exposed to options

@RomanHotsiy
Copy link
Author

This is not just a performance issue, this is algorithmic complexity issue.

@rlidwka yes, I already figured it out.

I started digging into the algorithm and I have some progress already but nothing that I can share yet.
I'll update my PR when I have something.

May be we could restrict max lines count for refs to reasonable number?

This is a great idea. If everyone is happy with that (and I can't come up with anything better) I will adjust my PR to use the limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants