
English words cannot be searched in Chinese word segmentation mode #5605

Closed · DaGeger opened this issue Nov 9, 2018 · 17 comments

DaGeger commented Nov 9, 2018

Problem:

I can't search for English words in Chinese search mode.


Procedure to reproduce the problem

  1. Set html_search_language = 'zh' to enable Chinese search.
  2. Add the following string to a document: 可以查看 FAQ 模块中 test 部分.
  3. Search for the keyword test.

Error results

When searching for the keyword test, no results are displayed.

Expected results

The string test should appear in the search results.

Suggestions

Why do space characters appear at the end of the word segmentation result?

I looked at the source code and found the following regular expression in sphinx/search/zh.py, which captures the trailing space.

Now test as follows:

In [3]: re.compile(u'(?u)\\w+[\u0000-\u00ff]').findall("可以查看 FAQ 模块中 test 部分")
Out[3]: ['可以查看 ', 'FAQ ', '模块中 ', 'test ']

The results clearly contain the trailing space character; this is the source of the problem.
Can we solve it by adjusting the regular expression?
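
For illustration, here is a minimal sketch of one possible adjustment (an assumption on my part, not necessarily the fix that was eventually merged): turn the trailing Latin-1 character into a lookahead, so it terminates the match without being captured.

In [4]: re.compile(u'(?u)\\w+(?=[\u0000-\u00ff])').findall("可以查看 FAQ 模块中 test 部分")
Out[4]: ['可以查看', 'FAQ', '模块中', 'test']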

Environment info

  • OS: macOS 10.14.1, in a Sphinx-specific virtualenv
  • Python version: 2.7.10
  • Sphinx version: 1.8.1
  • Browser: Chrome
DaGeger changed the title from "English words for word segmentation in Chinese language will contain spaces" to "English words cannot be searched in Chinese word segmentation mode" Nov 9, 2018
tk0miya (Member) commented Nov 10, 2018

> Why do space characters appear at the end of the word segmentation result?

Sorry, I don't know why, because none of the maintainers can read Chinese. Any pull requests are welcome :-)

DaGeger (Author) commented Nov 11, 2018

> > Why do space characters appear at the end of the word segmentation result?
>
> Sorry, I don't know why, because none of the maintainers can read Chinese. Any pull requests are welcome :-)

I ran a test; this problem also occurs with English sentences.

In [1]: import re
In [2]: re.compile(u'\\w+[\u0000-\u00ff]').findall('this is a test string')
Out[2]: ['this ', 'is ', 'a ', 'test ', 'string']
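
Worth noting about this output (my own reading of the pattern, offered as an aside): 'string' lacks the trailing space only because the pattern's final [\u0000-\u00ff] consumed the word's own last letter.

In [3]: m = re.search(u'(\\w+)([\u0000-\u00ff])', 'string')

In [4]: m.group(1), m.group(2)
Out[4]: ('strin', 'g')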

TimKam added a commit to TimKam/sphinx that referenced this issue Nov 11, 2018
generate search index for Latin words correctly if search language is Chinese
TimKam (Member) commented Nov 11, 2018

@DaGeger I created a PR that removes the trailing whitespace from Latin search index terms. However, I do not understand if or why these spaces should remain in Chinese. Can you confirm the search now works as intended in Chinese (check out my branch and install from source)?
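
For context, a minimal sketch of the kind of change described (my illustration, not the PR's exact diff): strip the captured whitespace before the terms reach the search index.

>>> import re
>>> latin1_letters = re.compile(u'(?u)\\w+[\u0000-\u00ff]')
>>> [term.strip() for term in latin1_letters.findall(u'可以查看 FAQ 模块中 test 部分')]
['可以查看', 'FAQ', '模块中', 'test']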

DaGeger (Author) commented Nov 11, 2018

> @DaGeger I created a PR that removes the trailing whitespace from Latin search index terms. However, I do not understand if or why these spaces should remain in Chinese. Can you confirm the search now works as intended in Chinese (check out my branch and install from source)?

Great! I tested your code and it solves the problem.

Reply to "why should these spaces remain in Chinese": when writing documents, two adjacent English words are often embedded in Chinese text, e.g. 查找CAS service配置 ("look up the CAS service configuration"). In this case the spaces must be preserved.

By the way, I don't know why searching for CAS returns no results. Is this intended? How can it be made searchable?

TimKam (Member) commented Nov 11, 2018

Thanks for testing. The problem with CAS is that it is first stemmed and then considered too short to be relevant (only two letters). We fixed a similar issue in the English search a while ago. I can fix it here as well and will ask you for another test once the fix is ready.
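
For illustration, a minimal sketch of that failure mode (a hypothetical helper, not Sphinx's actual code): the Porter stemmer's step 1a strips a final 's', and a minimum-length rule then discards the two-letter result.

>>> def stem_and_filter(word, min_len=3):
...     stemmed = word.lower()
...     if stemmed.endswith('s') and not stemmed.endswith('ss'):
...         stemmed = stemmed[:-1]  # Porter step 1a strips the final 's': 'cas' -> 'ca'
...     return stemmed if len(stemmed) >= min_len else None
...
>>> stem_and_filter('CAS') is None  # 'ca' is too short, so the term never reaches the index
True
>>> stem_and_filter('test')
'test'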

TimKam added a commit to TimKam/sphinx that referenced this issue Nov 11, 2018
DaGeger (Author) commented Nov 12, 2018

> Thanks for testing. The problem with CAS is that it is first stemmed and then considered too short to be relevant (only two letters). We fixed a similar issue in the English search a while ago. I can fix it here as well and will ask you for another test once the fix is ready.

I tested your commit in the same way, but there are still no results when searching for CAS.

TimKam (Member) commented Nov 12, 2018

@DaGeger In my tests, the search index includes cas. Can you run git pull --rebase and pip install -e . in the repo's root folder (make sure my branch is checked out) to ensure you have the latest changes? Also, try deleting the old output folder (typically: _build) manually before running the build. If this doesn't help, can you share the exact doc set you are testing with?

TimKam added several commits to TimKam/sphinx that referenced this issue Nov 12, 2018
generate search index for Latin words correctly if search language is Chinese
DaGeger (Author) commented Nov 13, 2018

Following the steps you described, I tested many times and found that problems remain.
With your code, this sentence can be searched:

模块中 CAS service部分

But this one cannot:

模块中CAS service部分

The difference is whether there is a space to the left of the word CAS.

DaGeger (Author) commented Nov 13, 2018

I tried to track down the problem and reached a new conclusion: I think the cause is still this regular expression.

re.compile(u'(?u)\\w+[\u0000-\u00ff]')

We can test this:

In [1]: import re

In [2]: latin1_letters = re.compile(u'(?u)\\w+[\u0000-\u00ff]')

In [3]: latin1_letters.findall('模块中 CAS service部分')
Out[3]: ['模块中 ', 'CAS ', 'service']

In [4]: latin1_letters.findall('模块中CAS service部分')
Out[4]: ['模块中CAS ', 'service']

As you can see, when there is no space to the left of the word, the keyword cas is not matched.
Is this the expected behavior?
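
For reference, the likely mechanism (my own reading of the pattern): with the Unicode flag, \w matches CJK characters as well as Latin ones, so the greedy \w+ runs straight across the script boundary, and only the trailing [\u0000-\u00ff] (here, the space) ends the match.

In [5]: re.compile(u'(?u)\\w+').findall('模块中CAS service部分')
Out[5]: ['模块中CAS', 'service部分']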

TimKam (Member) commented Nov 15, 2018

Okay; splitting Chinese from "Latin" words is then an additional issue that needs fixing. I talked this over with a Chinese colleague to make sure I understand the implications. Will try to fix it asap.

TimKam added a commit to TimKam/sphinx that referenced this issue Dec 22, 2018
And maintain list of Latin terms that can be accessed to decide whether a word should be stemmed
ghost commented Dec 25, 2018

@TimKam, how about this pattern? Are these the expected results?

>>> import re
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['可以查看', 'FAQ', '模块中', 'Chinesetest', '部分', '模块中', 'CAS', 'service', '部分', '取而代之的是它们通过', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, '模块中CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']

For easy reading:

(?:
    (?:
        (?![\s.,])[\x00-\xFF]
    )+
    |
    (?:
        (?![\x00-\xFF])\w
    )+
)
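
If helpful, the expanded layout compiles directly with re.VERBOSE (behavior identical to the one-line version; the annotations are mine):

>>> pattern = re.compile(u'''
... (?:
...     (?:(?![\s.,])[\x00-\xFF])+    # runs of Latin-1 chars, excluding whitespace, '.' and ','
...     |
...     (?:(?![\x00-\xFF])\w)+        # runs of word chars outside Latin-1 (the CJK text)
... )
... ''', re.VERBOSE)
>>> pattern.findall('模块中CAS service部分')
['模块中', 'CAS', 'service', '部分']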

TimKam (Member) commented Dec 25, 2018

Thanks for the suggestion, @animalize. Some days ago, I updated the PR and adjusted the regexp to the following: [\u0000-\u00ff]\w+\w+[\u0000-\u00ff]. As far as I can assess, this correctly extracts Latin "sub-terms". Can you confirm this, or find a test case for which your regexp is better? :-)

ghost commented Dec 25, 2018

Please assess this test case:

>>> import re
>>> pattern = u'[\u0000-\u00ff]\w+\w+[\u0000-\u00ff]'
>>> re.findall(pattern, '可以abc查看')
[]
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This ', ' test ', 'string.']
>>> 
>>>
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> re.findall(pattern, '可以abc查看')
['可以', 'abc', '查看']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']

ghost commented Dec 25, 2018

If only Latin strings are needed, we can use this pattern:

>>> import re
>>> pattern = u'(?:(?![\s.,])[\x00-\xFF])+'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['FAQ', 'Chinesetest', 'CAS', 'service', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, '模块中CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
>>> re.findall(pattern, '可以abc查看')
['abc']

TimKam added several commits to TimKam/sphinx that referenced this issue Dec 25, 2018
And maintain list of Latin terms that can be accessed to decide whether a word should be stemmed
ghost commented Dec 25, 2018

I have a better idea: we can use the pattern r'[a-zA-Z0-9_]+'.

Please see this code in https://github.com/sphinx-doc/sphinx/blob/1.8/sphinx/search/__init__.py:

    _word_re = re.compile(r'(?u)\w+')
    ...
    def split(self, input):
        # type: (unicode) -> List[unicode]
        """
        This method splits a sentence into words.  Default splitter splits input
        at white spaces, which should be enough for most languages except CJK
        languages.
        """
        return self._word_re.findall(input)

BTW, this part of the docstring is wrong according to the code (the splitter actually extracts \w+ runs rather than splitting at whitespace):

> Default splitter splits input at white spaces


\w means, in Python 3's docs:

> \w
>
> For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
>
> For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

\w means, in Python 2's docs:

> \w
>
> When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

So we can simply use r'[a-zA-Z0-9_]+' for Latin words.
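
A quick demonstration of the difference (assuming Python 3, where str patterns match Unicode by default):

>>> import re
>>> re.findall(r'\w+', '模块中CAS service部分')
['模块中CAS', 'service部分']
>>> re.findall(r'[a-zA-Z0-9_]+', '模块中CAS service部分')
['CAS', 'service']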

TimKam added a commit to TimKam/sphinx that referenced this issue Dec 25, 2018
TimKam (Member) commented Dec 25, 2018

Thanks, seems to work. I will merge this when all tests pass!

TimKam added a commit to TimKam/sphinx that referenced this issue Dec 25, 2018
TimKam added a commit that referenced this issue Dec 25, 2018
generate search index for Latin words correctly if search language is Chinese
TimKam closed this as completed Dec 25, 2018
ghost commented Dec 26, 2018

@TimKam @tk0miya: you can @ me when you need help with htmlhelp/Chinese-related issues.

github-actions (bot) locked this issue as resolved and limited conversation to collaborators Aug 9, 2021