switch tokenizer #1812
Conversation
@@ -20,6 +21,42 @@ function findLastWithPosition(tokens) {
  }
}

function tokenIsWordLike(tokenType) {
You can use a Set or object instead to be a little faster.
I've tested multiple variations and there might be a small advantage, but it is negligible overall.
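For illustration, here is a minimal sketch of the two variants being compared in this thread. The token type names are ones visible in this diff; the helper names and the exact membership list are hypothetical, not the PR's actual code.

```javascript
// switch-based membership check, in the style used by the PR
function tokenIsWordLikeSwitch(tokenType) {
  switch (tokenType) {
    case 'ident-token':
    case 'function-token':
    case 'dimension-token':
      return true
    default:
      return false
  }
}

// Set-based alternative suggested in the review
const WORD_LIKE = new Set(['ident-token', 'function-token', 'dimension-token'])
function tokenIsWordLikeSet(tokenType) {
  return WORD_LIKE.has(tokenType)
}
```

Both shapes are O(1) per lookup for small fixed sets, which is consistent with the "negligible difference" finding above.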
switch (token[0]) {
  case 'space':
  case 'whitespace-token':
What do you think of using numbers, like token[0] === Tokens.whitespace, for better performance and memory consumption?
Performance and memory consumption are the same with strings or numbers.
String literals are interned by V8 and other JS engines so these are pointer comparisons.
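For reference, a sketch of what the numeric variant proposed above could look like. The `Tokens` object and its values are hypothetical, not part of this PR; as the reply notes, engines intern string literals, so the string form already compiles down to cheap pointer-sized comparisons.

```javascript
// Hypothetical numeric enum for token types, as suggested in the review.
const Tokens = Object.freeze({
  whitespace: 0,
  comment: 1,
  ident: 2,
  delim: 3
})

// The comparison then uses a small integer instead of a string literal.
function isWhitespace(token) {
  return token[0] === Tokens.whitespace
}
```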
lib/parser.js
Outdated
let returned = [];

this.tokenizer.nextTokenWithBack = () => {
Feel free to change the parser and avoid the nextToken call.
We used it because our current tokenizer is streaming (it has no array of tokens and just passes the next token to the parser). A streaming tokenizer is better for memory consumption, but it will not work anyway for our future plan to put tokens into the AST.
Currently the tokenizer allocates a large array to store the code point values for each character of the source string.
This allocation takes a lot of time.
I've tried a variation where the tokenizer is fully streaming and only keeps 4 code points as state.
This is a little slower because that approach results in more value changes, but it keeps memory consumption a bit lower.
I did not see a notable difference between small and very large CSS sources with either approach.
I haven't yet tried the other direction: completely removing the streaming aspect.
This would allocate a lot more memory, but it would reduce CPU cache misses.
At this time the parser is the driver, so we switch a lot between the tokenizer's and the parser's functions and logic.
Doing everything in separate passes would have benefits; this is similar to "tiling" in image processing.
These kinds of optimizations are very difficult to do without benchmarking on a wide range of hardware. CPU cache size, memory, and IO latency greatly affect the outcome. One approach might be faster for me, but slower for most users overall.
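As a point of reference for the nextTokenWithBack hack discussed in this thread, here is a minimal sketch of a pushback wrapper around a streaming tokenizer. All names here are illustrative; this is not the PR's actual code.

```javascript
// Wrap a streaming tokenizer with a small pushback stack so the
// parser can "return" a token and receive it again on the next call.
function withPushback(tokenizer) {
  let returned = []
  return {
    nextToken() {
      // Serve previously returned tokens before pulling from the stream.
      return returned.length > 0 ? returned.pop() : tokenizer.nextToken()
    },
    back(token) {
      returned.push(token)
    },
    endOfFile() {
      return returned.length === 0 && tokenizer.endOfFile()
    }
  }
}
```

If the tokenizer were written inside this repository, the "return" mechanic could live in the tokenizer itself instead of being patched on from the outside.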
…igent-buffalo-10eb7ca739
…tch-tokenizer--local-source--amiable-kingfisher-f91eb3baba
…--amiable-kingfisher-f91eb3baba port tokenizer to PostCSS repo
// TODO : this feature is harder to implement with the new tokenizer
// test('ignore unclosed per token request', () => {
//   function token(css, opts) {
//     let processor = tokenizer(new Input(css), opts)
//     let tokens = []
//     while (!processor.endOfFile()) {
//       tokens.push(processor.nextToken({ ignoreUnclosed: true }))
//     }
//     return tokens
//   }
//
//   let css = "How's it going ("
//   let tokens = token(css, {})
//   let expected = [
//     ['word', 'How', 0, 2],
//     ['string', "'s", 3, 4],
//     ['space', ' '],
//     ['word', 'it', 6, 7],
//     ['space', ' '],
//     ['word', 'going', 9, 13],
//     ['space', ' '],
//     ['(', '(', 15]
//   ]
//
//   equal(tokens, expected)
// })
@ai Before I dive into this, I wanted to know if this feature is currently in use.
Outside of this test I couldn't find anything that uses the per-token ignoring of errors.
It was needed for postcss-less
#1191
Current status:
At this time it would be good to have a review of the changes in the tokenizer tests.
['whitespace-token', ' ', 28, 28, undefined],
['function-token', 'calc(', 29, 33, { value: 'calc' }],
[
  'dimension-token',
Soo long. Do we really need the -token prefix for all tokens? :)
No we do not :)
These names are used throughout the CSS specification.
Using them during initial development helps me implement things correctly.
Then I can just read the specification without having to mentally map the wording.
We should change this to what is best for performance and as an API surface later.
Can you also test the performance with https://github.com/postcss/benchmark? It can take current and dev versions.
Benchmarks:
- preprocessors
- parsers
- prefixers
See: #1145
Intro:
This change is not intended to be merged as it exists today.
It imports the csstools tokenizer package, but ideally the tokenizer is "owned" by PostCSS.
The goal is to surface possible issues early.

There are 3 test failures.

2 are related to !important. In the existing tokenizer !important is one word. A declaration will have different contents in raws depending on !important vs ! /* a comment */ important. In the new tokenizer this is a delim and an ident token. Rewriting the algorithm to match the old output is possible, but it wasn't trivial.

1 is related to @. The existing tokenizer has specific handling for a lone @ and throws. The new tokenizer doesn't throw any error, as a lone @ is just a delim token, which is valid CSS.

The existing tokenizer also has a mechanism to "return" tokens.
This is missing from the new tokenizer and requires slow and hacky code to patch back in.
These hacks would not be needed if the tokenizer was written for PostCSS within this repository.
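Purely as an illustration of the !important mismatch described above, here is a hypothetical post-processing step that merges the new tokenizer's delim + ident pair back into the old single-word shape. Real tokens also carry positions, and a comment can sit between the two tokens, which is exactly what makes the real rewrite non-trivial.

```javascript
// Hypothetical merge of ['delim-token', '!'] followed by an "important"
// ident back into one old-style word token. Illustrative only; it
// ignores positions and any comments between the two tokens.
function mergeImportant(tokens) {
  let out = []
  for (let i = 0; i < tokens.length; i++) {
    let t = tokens[i]
    let next = tokens[i + 1]
    if (
      t[0] === 'delim-token' && t[1] === '!' &&
      next && next[0] === 'ident-token' &&
      next[1].toLowerCase() === 'important'
    ) {
      out.push(['word', t[1] + next[1]])
      i++ // skip the ident we just merged
    } else {
      out.push(t)
    }
  }
  return out
}
```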