switch tokenizer #1812
Conversation
@@ -20,6 +21,42 @@ function findLastWithPosition(tokens) {
  }
}

function tokenIsWordLike(tokenType) {
You can use a Set or object instead to be a little faster.
I've tested multiple variations and there might be a small advantage, but it is negligible overall.
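For illustration, here is a minimal sketch of the two variants being compared in this thread. The token type names are ones visible in this diff; the helper names and the exact membership list are hypothetical, not the PR's actual code.

```javascript
// switch-based membership check, in the style used by the PR
function tokenIsWordLikeSwitch(tokenType) {
  switch (tokenType) {
    case 'ident-token':
    case 'function-token':
    case 'dimension-token':
      return true
    default:
      return false
  }
}

// Set-based alternative suggested in the review
const WORD_LIKE = new Set(['ident-token', 'function-token', 'dimension-token'])
function tokenIsWordLikeSet(tokenType) {
  return WORD_LIKE.has(tokenType)
}
```

Both shapes are O(1) per lookup for small fixed sets, which is consistent with the "negligible difference" finding above.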
switch (token[0]) {
  case 'space':
  case 'whitespace-token':
What do you think of using numbers, like token[0] === Tokens.whitespace, for better performance and memory consumption?
Performance and memory consumption are the same with strings or numbers.
String literals are interned by V8 and other JS engines so these are pointer comparisons.
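For reference, a sketch of what the numeric variant proposed above could look like. The `Tokens` object and its values are hypothetical, not part of this PR; as the reply notes, engines intern string literals, so the string form already compiles down to cheap pointer-sized comparisons.

```javascript
// Hypothetical numeric enum for token types, as suggested in the review.
const Tokens = Object.freeze({
  whitespace: 0,
  comment: 1,
  ident: 2,
  delim: 3
})

// The comparison then uses a small integer instead of a string literal.
function isWhitespace(token) {
  return token[0] === Tokens.whitespace
}
```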
lib/parser.js
Outdated
let returned = [];

this.tokenizer.nextTokenWithBack = () => {
Feel free to change the parser and avoid the nextToken call.
We used it because our current tokenizer is streaming (it has no array of tokens and just passes the next token to the parser). A streaming tokenizer is better for memory consumption, but it will not work anyway for our future plan to put tokens into the AST.
Currently the tokenizer allocates a large array to store the code point values for each character of the source string.
This allocation takes a lot of time.
I've tried a variation where the tokenizer is fully streaming and only keeps 4 code points as state.
This is a little slower because that approach results in more value changes, but it keeps memory consumption a bit lower.
I did not see a notable difference between small and very large CSS sources with either approach.
I haven't yet tried the other direction: completely removing the streaming aspect.
This would allocate a lot more memory, but it would reduce CPU cache misses.
At this time the parser is the driver, so we switch a lot between the tokenizer's and the parser's functions and logic.
Doing everything in separate passes would have benefits; this is similar to "tiling" in image processing.
These kinds of optimizations are very difficult to do without benchmarking on a wide range of hardware. CPU cache size, memory, and IO latency greatly affect the outcome. One approach might be faster for me, but slower for most users overall.
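As a point of reference for the nextTokenWithBack hack discussed in this thread, here is a minimal sketch of a pushback wrapper around a streaming tokenizer. All names here are illustrative; this is not the PR's actual code.

```javascript
// Wrap a streaming tokenizer with a small pushback stack so the
// parser can "return" a token and receive it again on the next call.
function withPushback(tokenizer) {
  let returned = []
  return {
    nextToken() {
      // Serve previously returned tokens before pulling from the stream.
      return returned.length > 0 ? returned.pop() : tokenizer.nextToken()
    },
    back(token) {
      returned.push(token)
    },
    endOfFile() {
      return returned.length === 0 && tokenizer.endOfFile()
    }
  }
}
```

If the tokenizer were written inside this repository, the "return" mechanic could live in the tokenizer itself instead of being patched on from the outside.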
…igent-buffalo-10eb7ca739
…tch-tokenizer--local-source--amiable-kingfisher-f91eb3baba
…--amiable-kingfisher-f91eb3baba port tokenizer to PostCSS repo
// TODO : this feature is harder to implement with the new tokenizer
// test('ignore unclosed per token request', () => {
//   function token(css, opts) {
//     let processor = tokenizer(new Input(css), opts)
//     let tokens = []
//     while (!processor.endOfFile()) {
//       tokens.push(processor.nextToken({ ignoreUnclosed: true }))
//     }
//     return tokens
//   }
//
//   let css = "How's it going ("
//   let tokens = token(css, {})
//   let expected = [
//     ['word', 'How', 0, 2],
//     ['string', "'s", 3, 4],
//     ['space', ' '],
//     ['word', 'it', 6, 7],
//     ['space', ' '],
//     ['word', 'going', 9, 13],
//     ['space', ' '],
//     ['(', '(', 15]
//   ]
//
//   equal(tokens, expected)
// })
@ai Before I dive into this, I wanted to know if this feature is currently in use.
Outside of this test I couldn't find anything that uses the per-token ignoring of errors.
It was needed for postcss-less
#1191
Current status:
At this time it would be good to have a review of the changes in the tokenizer tests.
['whitespace-token', ' ', 28, 28, undefined],
['function-token', 'calc(', 29, 33, { value: 'calc' }],
[
  'dimension-token',
Soo long. Do we really need the -token prefix for all tokens? :)
No we do not :)
These names are used throughout the CSS specification.
Using them during initial development helps me implement things correctly.
Then I can just read the specification without having to mentally map the wording.
We should change this to what is best for performance and as an API surface later.
Can you also test the performance with https://github.com/postcss/benchmark? It can take current and dev versions.
Benchmarks:
- preprocessors
- parsers
- prefixers
See: #1145
Intro:
This change is not intended to be merged as it exists today.
It imports the csstools tokenizer package, but ideally the tokenizer is "owned" by PostCSS.
The goal is to surface possible issues early.

There are 3 test failures.

2 are related to !important. In the existing tokenizer !important is one word. A declaration will have different contents in raws depending on !important vs ! /* a comment */ important. In the new tokenizer this is a delim and an ident token. Rewriting the algorithm to match the old output is possible, but it wasn't trivial.

1 is related to @. The existing tokenizer has specific handling for a lone @ and throws. The new tokenizer doesn't throw any error, as a lone @ is just a delim token, which is valid CSS.

The existing tokenizer also has a mechanism to "return" tokens.
This is missing from the new tokenizer and requires slow and hacky code to patch back in.
These hacks would not be needed if the tokenizer was written for PostCSS within this repository.
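Purely as an illustration of the !important mismatch described above, here is a hypothetical post-processing step that merges the new tokenizer's delim + ident pair back into the old single-word shape. Real tokens also carry positions, and a comment can sit between the two tokens, which is exactly what makes the real rewrite non-trivial.

```javascript
// Hypothetical merge of ['delim-token', '!'] followed by an "important"
// ident back into one old-style word token. Illustrative only; it
// ignores positions and any comments between the two tokens.
function mergeImportant(tokens) {
  let out = []
  for (let i = 0; i < tokens.length; i++) {
    let t = tokens[i]
    let next = tokens[i + 1]
    if (
      t[0] === 'delim-token' && t[1] === '!' &&
      next && next[0] === 'ident-token' &&
      next[1].toLowerCase() === 'important'
    ) {
      out.push(['word', t[1] + next[1]])
      i++ // skip the ident we just merged
    } else {
      out.push(t)
    }
  }
  return out
}
```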