fix(dict): Remove only corrections if a space could be inserted as well #792

not-my-profile · 2023-08-07T18:34:03Z

The typo dictionary words.csv previously contained a bunch of problematic entries such as:

abouta,about
algorithmi,algorithm
attachen,attach
shouldbe,should

These entries resulted in wrong corrections when the following spaces (indicated by ␣) were accidentally left out:

about␣a
algorithm␣i developed
attach␣en masse
should␣be

Many of these entries were introduced by taking entries from the codespell-dict and removing corrections containing spaces (since typos currently doesn't support them). For example, the codespell dictionary contains:

abouta->about a, about,
shouldbe->should, should be,

This commit updates tests/verify.rs to automatically remove entries in the form of {correction}{common_word},{correction}, where {common_word} is one of the 1000 most frequent English words.
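The removal rule described above can be sketched roughly as follows. This is a minimal illustration and not the code in tests/verify.rs: the function name `is_correction_plus_common_word` and the `common_words` parameter are hypothetical, and the set passed in stands in for the contents of top-1000-most-frequent-words.csv.

```rust
use std::collections::HashSet;

/// Returns true if `typo` is exactly `correction` followed by a common word,
/// i.e. the entry has the shape `{correction}{common_word},{correction}`.
/// Hypothetical helper for illustration only.
fn is_correction_plus_common_word(
    typo: &str,
    correction: &str,
    common_words: &HashSet<&str>,
) -> bool {
    match typo.strip_prefix(correction) {
        // Whatever is left after the correction must itself be a common word.
        Some(rest) => !rest.is_empty() && common_words.contains(rest),
        // The typo does not even start with the correction.
        None => false,
    }
}
```

With a word set containing `a`, `i` and `be`, this flags `abouta,about`, `algorithmi,algorithm` and `shouldbe,should`, since each typo could equally be two words with a missing space.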

The top-1000-most-frequent-words.csv file was generated by running:

curl https://norvig.com/ngrams/count_1w.txt \
  | head -n1024 \
  | awk '{print $1;}' \
  | grep -vE '^([^ia]|al|re)$' \
  > top-1000-most-frequent-words.csv
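For readers less familiar with the shell pipeline, here is a rough Rust equivalent of its three processing steps, applied to the already downloaded text. The function names `keep_word` and `top_words` are made up for this sketch. The `grep -vE '^([^ia]|al|re)$'` step drops every line that is a single character other than `i` or `a`, as well as the exact lines `al` and `re`.

```rust
/// Mirrors `grep -vE '^([^ia]|al|re)$'`: returns true if the word is kept.
fn keep_word(word: &str) -> bool {
    let mut chars = word.chars();
    let is_single_char = chars.next().is_some() && chars.next().is_none();
    // `[^ia]` matches any one character except `i` and `a`.
    let is_dropped_single_char = is_single_char && word != "i" && word != "a";
    !(is_dropped_single_char || word == "al" || word == "re")
}

/// Applies the `head`, `awk` and `grep` steps to the text of count_1w.txt.
/// Unlike awk, this sketch silently skips empty lines.
fn top_words(count_1w: &str) -> Vec<&str> {
    count_1w
        .lines()
        .take(1024) // head -n1024
        .filter_map(|line| line.split_whitespace().next()) // awk '{print $1;}'
        .filter(|word| keep_word(word)) // grep -vE '^([^ia]|al|re)$'
        .collect()
}
```

The single letters `i` and `a` are kept on purpose, since they are real English words that can be glued onto a neighbouring word.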

crates/typos-dict/assets/words.csv

epage · 2023-08-07T19:41:54Z

crates/typos-dict/assets/top-1000-most-frequent-words.csv

@@ -0,0 +1,1000 @@
+the


Let's leave this off for now because we'd need to work out cases like "extrememe"

I worked these cases out by adding the check:

if only_correction.ends_with(suffix) {
    // We still want to correct e.g. "extrememe" to "extreme".
    return true;
}
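To show where such a guard might sit, here is a small self-contained sketch. It assumes that returning `true` means "keep the entry", and the surrounding function `keep_entry` and its signature are invented for illustration; only the `ends_with` check comes from the snippet above.

```rust
/// Decides whether the entry `typo,only_correction` should stay in the
/// dictionary, given one common word `suffix` that may have been glued onto
/// the correction. Hypothetical wrapper around the check quoted above.
fn keep_entry(typo: &str, only_correction: &str, suffix: &str) -> bool {
    // Not of the form {correction}{suffix}: this rule does not apply.
    if typo.strip_prefix(only_correction) != Some(suffix) {
        return true;
    }
    if only_correction.ends_with(suffix) {
        // We still want to correct e.g. "extrememe" to "extreme".
        return true;
    }
    // e.g. "shouldbe" could just as well have meant "should be".
    false
}
```

Here `keep_entry("extrememe", "extreme", "me")` keeps the entry because "extreme" itself ends in "me", so the typo looks like a repeated ending and not like two words. `keep_entry("shouldbe", "should", "be")` drops it.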

I think I'd still prefer not to be constrained by this very mechanical process. It can provide insight, but I don't trust it to be applied automatically.

The current process now seems to make exactly the changes we want to all our 63,200 entries, which does inspire some confidence in me. Besides, the process is very easy to adapt, so I think we can just do so when we find that it filters out something too eagerly.

The current process now seems to make exactly the changes we want to all our 63,200 entries

Except this isn't exactly what I want (see the other thread). Arbitrarily combining words that don't make sense when combined leads to us losing corrections we would have otherwise.

crates/typos-dict/assets/words.csv

not-my-profile · 2023-08-07T23:12:24Z

I have updated the PR to now also detect {common_word}{correction},{correction} and not just {correction}{common_word},{correction}.
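The mirrored prefix case could look roughly like this. It is a sketch under assumptions, not the PR's actual code: the name `is_common_word_plus_correction` is hypothetical, and the `starts_with` guard is modelled on the duplicated-letter entries such as `aand,and` that appear in this PR's diff.

```rust
use std::collections::HashSet;

/// Returns true if the entry should be dropped because `typo` has the shape
/// `{common_word}{correction}` and the leading word is not simply a repeat of
/// how the correction already starts. Hypothetical helper for illustration.
fn is_common_word_plus_correction(
    typo: &str,
    correction: &str,
    common_words: &HashSet<&str>,
) -> bool {
    match typo.strip_suffix(correction) {
        Some(prefix) if common_words.contains(prefix) => {
            // "aand" -> "and": the extra "a" repeats the start of the
            // correction, so it is treated as a typo and not flagged.
            !correction.starts_with(prefix)
        }
        _ => false,
    }
}
```

With a word set containing `a` and `i`, this flags `anumber,number` (which may have been "a number") but leaves `aand,and` and `aapply,apply` alone.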

epage · 2023-08-08T02:03:27Z

crates/typos-dict/assets/words.csv

-aand,and
-aanother,another
-aapply,apply
+aack


The redundant a's are a separate thing, and we should be correcting these.

The challenge with blindly checking concatenated words is it doesn't filter out for when they don't make sense.

In general, I'd feel better if we just looked at what changed due to the spaces and applied it to those. We could then separately decide which of these changes might make sense. As is, I'm seeing a lot that don't, and I don't want to take the time to decide that.

Oh the logic I had did actually already detect these with

if only_correction.starts_with(prefix) { return false; }

I must have just forgotten to reset words.csv and run SNAPSHOTS=overwrite cargo test verify again.

So the many cases have just become:

aequidistant aequivalent afor amuch anumber ascripts asudo imakes isimilar itheir itheirs iwithout

which I think it makes sense not to correct automatically.

iwithout and many of these don't make sense from a "word combining" perspective. Things we can correct in master will become uncorrectable with this change.

The typo dictionary words.csv previously contained a bunch of problematic entries such as:

abouta,about
algorithmi,algorithm
attachen,attach
shouldbe,should
anumber,number

These entries resulted in wrong automatic corrections when the following spaces (indicated by ␣) were accidentally left out:

about␣a
algorithm␣i developed
attach␣en masse
should␣be
a␣number

Many of these entries were introduced by taking entries from the codespell-dict and removing corrections containing spaces (since typos currently doesn't support them). For example, the codespell dictionary contains:

abouta->about a, about,
shouldbe->should, should be,

This commit updates `tests/verify.rs` to automatically remove corrections in the form of `{correction}{common_word},{correction}` or `{common_word}{correction},{correction}`, where `{common_word}` is one of the 1000 most frequent English words (except if `{correction}` also ends/starts in `{common_word}`, since we still want to correct e.g. "extrememe" to "extreme").

The top-1000-most-frequent-words.csv file was generated by running:

curl https://norvig.com/ngrams/count_1w.txt \
  | head -n1024 \
  | awk '{print $1;}' \
  | grep -vE '^([^ia]|al|re)$' \
  > top-1000-most-frequent-words.csv

not-my-profile force-pushed the remove-unsure-corrections branch from 3fd67dc to ec32cf5 on August 7, 2023 18:40

epage reviewed Aug 7, 2023

epage reviewed Aug 7, 2023

not-my-profile marked this pull request as draft August 7, 2023 20:39

not-my-profile force-pushed the remove-unsure-corrections branch from ec32cf5 to e240e50 on August 7, 2023 21:01

not-my-profile marked this pull request as ready for review August 7, 2023 21:03

not-my-profile force-pushed the remove-unsure-corrections branch from e240e50 to ea162f0 on August 7, 2023 22:54

not-my-profile changed the title ~~fix(dict): Remove unsure corrections~~ fix(dict): Remove only corrections if they could contain spaces Aug 7, 2023

not-my-profile force-pushed the remove-unsure-corrections branch from ea162f0 to b80e29d on August 7, 2023 22:58

not-my-profile changed the title ~~fix(dict): Remove only corrections if they could contain spaces~~ fix(dict): Remove only corrections that could contain spaces Aug 7, 2023

not-my-profile force-pushed the remove-unsure-corrections branch from b80e29d to 68cce1a on August 7, 2023 23:00

not-my-profile changed the title ~~fix(dict): Remove only corrections that could contain spaces~~ fix(dict): Remove only corrections if a space could be inserted as well Aug 7, 2023

not-my-profile requested a review from epage August 7, 2023 23:11

epage reviewed Aug 8, 2023

not-my-profile force-pushed the remove-unsure-corrections branch from 68cce1a to 60aad40 on August 8, 2023 04:34

not-my-profile marked this pull request as draft August 11, 2023 11:54
