Major performance issue when parsing a long list of reference links #996

RomanHotsiy · 2024-01-26T09:18:24Z

We noticed a major performance problem when parsing a long list of references similar to this benchmark: https://github.com/markdown-it/markdown-it/blob/master/benchmark/samples/block-ref-list.md

In our case we have a list of 1000+ references.

The root cause seems to be this termination logic:

markdown-it/lib/rules_block/reference.mjs

Lines 29 to 51 in d07d585

    
    const terminatorRules = state.md.block.ruler.getRules('reference')

    const oldParentType = state.parentType
    state.parentType = 'reference'

    for (; nextLine < endLine && !state.isEmpty(nextLine); nextLine++) {
      // this would be a code block normally, but after paragraph
      // it's considered a lazy continuation regardless of what's there
      if (state.sCount[nextLine] - state.blkIndent > 3) { continue }

      // quirk for blockquotes, this line should already be checked by that rule
      if (state.sCount[nextLine] < 0) { continue }

      // Some tags can terminate paragraph without empty line.
      let terminate = false
      for (let i = 0, l = terminatorRules.length; i < l; i++) {
        if (terminatorRules[i](state, nextLine, endLine, true)) {
          terminate = true
          break
        }
      }
      if (terminate) { break }
    }

Removing this logic doesn't break any tests and makes parsing our long list 30x faster 🙀

I tried to find some similar problems and found this thread: #54

I believe this table is incorrect but I'm not sure:

markdown-it/lib/parser_block.js

Lines 13 to 25 in d72c68b

    
    // First 2 params - rule name & source. Secondary array - list of rules,
    // which can be terminated by this one.
    [ 'table',      require('./rules_block/table'),      [ 'paragraph', 'reference' ] ],
    [ 'code',       require('./rules_block/code') ],
    [ 'fence',      require('./rules_block/fence'),      [ 'paragraph', 'reference', 'blockquote', 'list' ] ],
    [ 'blockquote', require('./rules_block/blockquote'), [ 'paragraph', 'reference', 'blockquote', 'list' ] ],
    [ 'hr',         require('./rules_block/hr'),         [ 'paragraph', 'reference', 'blockquote', 'list' ] ],
    [ 'list',       require('./rules_block/list'),       [ 'paragraph', 'reference', 'blockquote' ] ],
    [ 'reference',  require('./rules_block/reference') ],
    [ 'html_block', require('./rules_block/html_block'), [ 'paragraph', 'reference', 'blockquote' ] ],
    [ 'heading',    require('./rules_block/heading'),    [ 'paragraph', 'reference', 'blockquote' ] ],
    [ 'lheading',   require('./rules_block/lheading') ],
    [ 'paragraph',  require('./rules_block/paragraph') ]

From the CommonMark spec, I can't see that a reference can be terminated by other rules. It actually seems to be the other way around: a reference can terminate some of the rules. Am I correct?

I tried modifying the code above to the variant below; all the tests pass and performance is still fast:

const _rules = [
  // First 2 params - rule name & source. Secondary array - list of rules,
  // which can be terminated by this one.
  ['table',      r_table,      ['paragraph']],
  ['code',       r_code],
  ['fence',      r_fence,      ['paragraph', 'blockquote', 'list']],
  ['blockquote', r_blockquote, ['paragraph', 'blockquote', 'list']],
  ['hr',         r_hr,         ['paragraph', 'blockquote', 'list']],
  ['list',       r_list,       ['paragraph', 'blockquote']],
  ['reference',  r_reference, ['table', 'fence', 'blockquote', 'hr', 'list', 'html_block', 'heading']],
  ['html_block', r_html_block, ['paragraph', 'blockquote']],
  ['heading',    r_heading,    ['paragraph', 'blockquote']],
  ['lheading',   r_lheading],
  ['paragraph',  r_paragraph]
]

Could someone check if my understanding is correct? I would be happy to open a PR.

RomanHotsiy · 2024-01-26T10:12:40Z

Now that I've read a bit more of the code, I'm not sure I understand the spec correctly 🤔

markdown-it/lib/rules_block/reference.mjs

Line 166 in d07d585

// Reference can not terminate anything. This check is for safety only.

RomanHotsiy · 2024-01-26T13:34:37Z

From the commonmark spec:

RomanHotsiy · 2024-01-29T09:30:53Z

Hey @puzrin, @rlidwka, sorry for pinging you directly 🙌

Could you help me figure this out? I would be happy to open a PR if you could point me in the right direction.

Thanks in advance!

rlidwka · 2024-01-29T11:29:41Z

['reference', r_reference, ['table', 'fence', 'blockquote', 'hr', 'list', 'html_block', 'heading']],

It never makes sense to put hr or heading (or table iirc) in that list, because they don't call the block parser recursively. Their end is determined by their own markup, and nothing can interrupt those tags.

In any case, if you have a solution that massively improves performance, and all tests still pass, it's worth opening a PR.

RomanHotsiy · 2024-01-29T12:09:34Z

> Their end is determined by their own markup, and nothing can interrupt those tags.

@rlidwka but is it about interrupting those tags or interrupting BY those tags?

I'm confused by the comment in the code:

> Secondary array - list of rules, which can be terminated by this one.

rlidwka · 2024-01-29T12:19:15Z

> @rlidwka but is it about interrupting those tags or interrupting BY those tags?

[ 'table', require('./rules_block/table'), [ 'paragraph', 'reference' ] ],

This means that table interrupts a paragraph.

This also means that paragraph can be interrupted by a table.
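
To make the direction of the relationship concrete, here is a minimal sketch that inverts the table quoted earlier in this thread. It is not markdown-it code: `ruleTable` and `terminatorsOf` are hypothetical names, and the rule functions are left out so that only the name and the secondary array remain. It is roughly the question that `getRules('reference')` answers in the snippet from `reference.mjs`.

```javascript
// Hypothetical miniature of the block rule table: [name, alt], where `alt`
// lists the rules that the named rule is allowed to interrupt.
const ruleTable = [
  ['table',      ['paragraph', 'reference']],
  ['code',       []],
  ['fence',      ['paragraph', 'reference', 'blockquote', 'list']],
  ['blockquote', ['paragraph', 'reference', 'blockquote', 'list']],
  ['hr',         ['paragraph', 'reference', 'blockquote', 'list']],
  ['list',       ['paragraph', 'reference', 'blockquote']],
  ['reference',  []],
  ['html_block', ['paragraph', 'reference', 'blockquote']],
  ['heading',    ['paragraph', 'reference', 'blockquote']],
  ['lheading',   []],
  ['paragraph',  []]
]

// Names of all rules that list `chainName` in their secondary array,
// i.e. the rules that may terminate `chainName`.
function terminatorsOf (chainName) {
  return ruleTable
    .filter(([, alt]) => alt.includes(chainName))
    .map(([name]) => name)
}

console.log(terminatorsOf('reference'))
// [ 'table', 'fence', 'blockquote', 'hr', 'list', 'html_block', 'heading' ]
```

Read this way, the secondary array on the `table` row says "table may interrupt paragraph and reference", and the lookup for a given rule collects every row that names it.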

RomanHotsiy · 2024-01-29T12:33:30Z

Thanks!

Created PR here: #998

rlidwka · 2024-02-03T23:07:33Z

This is not just a performance issue, this is an algorithmic complexity issue.

[ref1]: url 'title'
[ref2]: url
[ref3]: url
[ref4]: url
...
[ref10000]: url

Where does first reference end?

Keep in mind, that it could look like this, which is one big reference:

[ref1]: url '
[ref2]: url
[ref3]: url
[ref4]: url
...
[ref10000]: url
'

Currently, reference 1 is parsed from line 1-10000, reference 2 is parsed from line 2-10000, etc., and this entire block of text has to be extracted beforehand (to remove indents, leading > for quotes and such).

Do you see O(n^2) popping up here?

The CommonMark implementation probably doesn't have this problem, because it can strip indents on the fly (I guess), but here that would mean a big rewrite (and possibly can't even be done without losing modularity).

Would be interesting to know if you have any ideas regarding that.
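
The quadratic growth described in this comment can be illustrated with a toy cost model. This is only a sketch under the stated assumption that the definition starting at line i extracts lines i through n before parsing; `linesExtracted` is a hypothetical helper, not part of markdown-it.

```javascript
// Toy cost model, not markdown-it code: total number of lines copied when
// each of n one-line definitions extracts everything from its own line to
// the end of the block.
function linesExtracted (n) {
  let total = 0
  for (let start = 1; start <= n; start++) {
    total += n - start + 1 // lines start..n
  }
  return total
}

for (const n of [100, 1000, 10000]) {
  console.log(n, linesExtracted(n))
}
// 100 5050
// 1000 500500
// 10000 50005000
```

The total is n(n+1)/2, so a 10x longer list costs about 100x more work in this model.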

puzrin · 2024-02-04T02:25:21Z

> Commonmark implementation probably doesn't have it, because they can strip indents on the fly (I guess), but here it would mean a big rewrite (possibly can't even be done without losing modularity).
>
> Would be interesting to know if you have any ideas regarding that.

Maybe we could restrict the maximum line count for refs to a reasonable number? As far as I remember, we have some hard limits in emphasis. All those hard limits could be exposed as options.
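
As a rough sketch of what such a limit would buy, the toy model below counts lines copied when each definition may look ahead at most `maxLines` lines. Both `linesExtractedWithCap` and `maxLines` are hypothetical names for illustration, not existing markdown-it options, and the model assumes each definition otherwise extracts everything from its own line to the end of the block.

```javascript
// Toy cost model, not markdown-it code: each of n one-line definitions
// extracts from its own line to the end of the block, but never more
// than maxLines lines.
function linesExtractedWithCap (n, maxLines) {
  let total = 0
  for (let start = 1; start <= n; start++) {
    total += Math.min(n - start + 1, maxLines)
  }
  return total
}

console.log(linesExtractedWithCap(10000, 10))    // 99955
console.log(linesExtractedWithCap(10000, 10000)) // 50005000 (cap never applies)
```

With a fixed cap the work grows linearly with the number of lines. The trade-off in this model is that a definition spanning more lines than the cap would no longer be recognized as one reference.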

RomanHotsiy · 2024-02-04T02:30:57Z

> This is not just a performance issue, this is an algorithmic complexity issue.

@rlidwka yes, I already figured it out.

I started digging into the algorithm and I have some progress already but nothing that I can share yet.
I'll update my PR when I have something.

> Maybe we could restrict the maximum line count for refs to a reasonable number?

This is a great idea. If everyone is happy with that (and I can't come up with anything better) I will adjust my PR to use the limit.

RomanHotsiy mentioned this issue Jan 29, 2024

Improve performance of reference definition list parsing #998

Closed

rlidwka mentioned this issue Feb 4, 2024

fix quadratic complexity in reference parser #1004

Merged

rlidwka closed this as completed in #1004 Mar 1, 2024
