
switch tokenizer #1812

Conversation

@romainmenke (Contributor) commented Jan 10, 2023

see : #1145

Intro :

This change is not intended to be merged as it exists today.
It is importing the csstools tokenizer package but ideally the tokenizer is "owned" by PostCSS.

The goal is to surface possible issues early.


There are 3 test failures

2 are related to !important.
In the existing tokenizer !important is a single word.
A declaration will have different contents in raws depending on whether the source contains !important or ! /* a comment */ important.

In the new tokenizer this is a delim token followed by an ident token.
Rewriting the algorithm to match the old output is possible, but it wasn't trivial.
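For illustration, a simplified sketch of the difference (token shapes here are illustrative, not the exact PostCSS internals):

```javascript
// Old tokenizer: "!important" comes out as a single word token.
let oldTokens = [['word', '!important']]

// New spec-style tokenizer: a delim token followed by an ident token.
let newTokens = [
  ['delim-token', '!'],
  ['ident-token', 'important']
]

// To reproduce the old raws, the parser must stitch the pieces
// (and any comments between them) back together.
function stitchImportant(tokens) {
  return tokens.map(t => t[1]).join('')
}

console.log(stitchImportant(oldTokens)) // '!important'
console.log(stitchImportant(newTokens)) // '!important'
```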

1 is related to @.
The existing tokenizer has specific handling for a lone @ and throws.
The new tokenizer doesn't throw any error, as a lone @ is just a delim token, which is valid CSS.
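A sketch of how the parser could restore the old error (the helper name and error message are hypothetical, not from this PR):

```javascript
// The new tokenizer happily emits a lone "@" as a delim token, so any
// check for the old error has to move up into the parser.
function checkLoneAt(token) {
  if (token[0] === 'delim-token' && token[1] === '@') {
    throw new Error('At-rule without name')
  }
  return token
}

console.log(checkLoneAt(['ident-token', 'color'])[1]) // 'color'
```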


The existing tokenizer also has a mechanic to "return" tokens.
This is missing from the new tokenizer and requires slow and hacky code to patch back in.

These hacks would not be needed if the tokenizer was written for PostCSS within this repository.
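For reference, the "return" mechanic can be layered on top of a streaming tokenizer with a small back buffer; a sketch (the method names mirror this PR's nextTokenWithBack, the inner tokenizer is a stand-in over a plain array):

```javascript
// Wrap a token source with a back buffer so the parser can push a token
// back and receive it again on the next call.
function withBackBuffer(tokens) {
  let index = 0
  let returned = []
  return {
    nextTokenWithBack() {
      if (returned.length > 0) return returned.pop()
      return index < tokens.length ? tokens[index++] : undefined
    },
    back(token) {
      returned.push(token)
    }
  }
}

let t = withBackBuffer([['ident-token', 'a'], ['ident-token', 'b']])
let first = t.nextTokenWithBack()
t.back(first)
console.log(t.nextTokenWithBack()[1]) // 'a' again
console.log(t.nextTokenWithBack()[1]) // 'b'
```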

@@ -20,6 +21,42 @@ function findLastWithPosition(tokens) {
}
}

function tokenIsWordLike(tokenType) {
Member

You can use a Set or object instead to be a little faster.
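A sketch of that suggestion (the token types listed are assumptions for illustration, not the exact set from the PR):

```javascript
// Set-based lookup instead of a switch; membership test is O(1).
const WORD_LIKE = new Set([
  'ident-token',
  'function-token',
  'url-token',
  'number-token',
  'dimension-token',
  'percentage-token'
])

function tokenIsWordLike(tokenType) {
  return WORD_LIKE.has(tokenType)
}

console.log(tokenIsWordLike('ident-token')) // true
console.log(tokenIsWordLike('whitespace-token')) // false
```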

Contributor Author

I've tested multiple variations and there might be a small advantage, but it is negligible overall.


switch (token[0]) {
case 'space':
case 'whitespace-token':
Member

What do you think of using numbers, like token[0] === Tokens.whitespace, for better performance and memory consumption?

Contributor Author

Performance and memory consumption are the same with strings or numbers.
String literals are interned by V8 and other JS engines, so these comparisons are effectively pointer comparisons.
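One way to sanity-check that claim yourself (a rough micro-benchmark sketch; results vary by machine and engine, so no expected timings are given):

```javascript
// Compare string-tag vs number-tag equality checks.
// Results are indicative only; modern engines intern string literals,
// so both comparisons tend to be word-sized comparisons.
function bench(fn, iterations = 1e7) {
  let start = process.hrtime.bigint()
  let hits = 0
  for (let i = 0; i < iterations; i++) {
    if (fn(i)) hits++
  }
  let ms = Number(process.hrtime.bigint() - start) / 1e6
  return { ms, hits }
}

const STRING_TAG = 'whitespace-token'
const NUMBER_TAG = 1

let stringTags = ['whitespace-token', 'ident-token']
let numberTags = [1, 2]

let s = bench(i => stringTags[i & 1] === STRING_TAG)
let n = bench(i => numberTags[i & 1] === NUMBER_TAG)
console.log('strings:', s.ms.toFixed(1), 'ms; numbers:', n.ms.toFixed(1), 'ms')
```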

lib/parser.js

let returned = [];

this.tokenizer.nextTokenWithBack = () => {
Member

Feel free to change the parser and avoid the nextToken call.

We used it because our current tokenizer is streaming (it has no array of tokens and just passes the next token to the parser). A streaming tokenizer is better for memory consumption, but it won't work anyway for our future plans to put tokens into the AST.

Contributor Author

Currently the tokenizer allocates a large array to store all the code point values for each character from the source string.

This allocation takes a lot of time.

I've tried a variation where the tokenizer is fully streaming and only keeps 4 code points as state.

This is a little bit slower because that approach results in more value changes but it keeps memory consumption a bit lower.

I did not see a notable difference between small and very large CSS sources with either approach.


I haven't yet tried the other direction: completely removing the streaming aspect.

This would allocate a lot more memory, but it would reduce CPU cache misses.
At this time the parser is the driver, so we constantly switch between the functions and logic of the tokenizer and the parser.

Doing everything separately would have benefits.
This is similar to "tiling" in image processing.

These kinds of optimizations are very difficult to do without benchmarking on a wide range of different hardware. CPU cache size, memory and IO latency greatly affect the outcome. One approach might be faster for me, but slower for most users overall.
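The two buffering strategies described above can be sketched as follows (simplified, hypothetical helpers, not the PR's actual code):

```javascript
// Strategy 1: pre-compute every code point up front - one big allocation,
// then purely array reads while tokenizing.
function allCodePoints(css) {
  let codes = new Array(css.length)
  for (let i = 0; i < css.length; i++) codes[i] = css.charCodeAt(i)
  return codes
}

// Strategy 2: keep only a small sliding window (current code point plus
// lookahead) as state, refreshed as the tokenizer advances.
function windowAt(css, pos, size = 4) {
  let win = new Array(size)
  for (let i = 0; i < size; i++) {
    win[i] = pos + i < css.length ? css.charCodeAt(pos + i) : -1 // -1 = EOF
  }
  return win
}

console.log(allCodePoints('ab')) // [97, 98]
console.log(windowAt('ab', 0)) // [97, 98, -1, -1]
```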


Comment on lines +388 to +413
// TODO : this feature is harder to implement with the new tokenizer
// test('ignore unclosed per token request', () => {
// function token(css, opts) {
// let processor = tokenizer(new Input(css), opts)
// let tokens = []
// while (!processor.endOfFile()) {
// tokens.push(processor.nextToken({ ignoreUnclosed: true }))
// }
// return tokens
// }

// let css = "How's it going ("
// let tokens = token(css, {})
// let expected = [
// ['word', 'How', 0, 2],
// ['string', "'s", 3, 4],
// ['space', ' '],
// ['word', 'it', 6, 7],
// ['space', ' '],
// ['word', 'going', 9, 13],
// ['space', ' '],
// ['(', '(', 15]
// ]

// equal(tokens, expected)
// })
Contributor Author

@ai Before I dive into this I wanted to know whether this feature is currently in use.
Outside of this test I couldn't find anything that uses the per-token ignoring of errors.

Member

It was needed for postcss-less #1191

@romainmenke (Contributor Author) commented Jan 18, 2023

Current status :

  • the tokenizer code has been added to this repository
  • some final tweaks to the parser were needed
  • highlighting of CSS syntax errors needed to be updated
  • test coverage is back up to 100% (line coverage)

At this time it would be good to have a review of the changes in the tokenizer tests.
The intention remains that this change to the tokenizer must not be noticeable to anyone using PostCSS. I want to make sure I haven't overlooked anything.

['whitespace-token', ' ', 28, 28, undefined],
['function-token', 'calc(', 29, 33, { value: 'calc' }],
[
'dimension-token',
Member

Soo long. Do we really need -token prefix for all tokens? :)

Contributor Author

No we do not :)

These names are used throughout the CSS specification.
Using them during initial development helps me implement things correctly.
Then I can just read the specification without having to mentally map the wording.

We should change these later to whatever is best for performance and for the public API surface.
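A possible shape for that later rename (the table entries are assumptions for illustration, nothing here is decided in the PR):

```javascript
// Hypothetical mapping from CSS-spec token names to shorter public names.
const SHORT_NAMES = {
  'whitespace-token': 'space',
  'ident-token': 'word'
}

function shortName(specName) {
  // Fall back to stripping the "-token" suffix.
  return SHORT_NAMES[specName] || specName.replace(/-token$/, '')
}

console.log(shortName('whitespace-token')) // 'space'
console.log(shortName('dimension-token')) // 'dimension'
```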

@ai (Member) commented Jan 19, 2023

Can you also test the performance with https://github.com/postcss/benchmark?

It can take the current and dev versions of PostCSS (just put postcss/ next to postcss-benchmark/).

@romainmenke (Contributor Author)

Benchmarks:

preprocessors

PostCSS:           26 ms
PostCSS sync:      26 ms (1.0 times slower)
Next PostCSS:      29 ms (1.1 times slower)
Next PostCSS sync: 29 ms (1.1 times slower)
LibSass sync:      40 ms (1.5 times slower)
Less:              40 ms (1.5 times slower)
LibSass:           41 ms (1.6 times slower)
Dart Sass sync:    53 ms (2.0 times slower)
Dart Sass:         97 ms (3.7 times slower)

parsers

Stylis:       3 ms  (3.2 times faster)
CSSOM:        10 ms (1.0 times faster)
PostCSS:      11 ms
CSSTree:      11 ms (1.0 times slower)
Mensch:       11 ms (1.1 times slower)
Next PostCSS: 13 ms (1.2 times slower)
Rework:       15 ms (1.4 times slower)
Stylecow:     24 ms (2.3 times slower)
PostCSS Full: 32 ms (3.0 times slower)
Gonzales:     51 ms (4.8 times slower)
ParserLib:    53 ms (5.0 times slower)

prefixers

Lightning CSS:    4 ms   (8.3 times faster)
Stylis:           5 ms   (6.3 times faster)
Autoprefixer:     33 ms
Autoprefixer dev: 36 ms  (1.1 times slower)
Stylecow:         183 ms (5.6 times slower)
