
English words cannot be searched in Chinese word segmentation mode #5605

Closed · DaGeger opened this issue Nov 9, 2018 · 17 comments

DaGeger commented Nov 9, 2018

Problem:

I can't search for English words in Chinese search mode.


Procedure to reproduce the problem

  1. Set html_search_language = 'zh' to enable Chinese search.
  2. Add the following string to a document: 可以查看 FAQ 模块中 test 部分.
  3. Search for the keyword test.

Error results

When searching for the keyword test, no results are displayed.

Expected results

The string test should appear in the search results.

Suggestions

Why do space characters appear at the end of the word segmentation result?

I looked at the source code and found the following regular expression in sphinx/search/zh.py, which captures the trailing space.

Now test as follows:

In [3]: re.compile(u'(?u)\\w+[\u0000-\u00ff]').findall("可以查看 FAQ 模块中 test 部分")
Out[3]: ['可以查看 ', 'FAQ ', '模块中 ', 'test ']

The results clearly contain the trailing space character; this is the source of the problem.
Can we solve it by adjusting the regular expression?
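
For illustration, here is a minimal sketch of one possible adjustment (an assumption on my part, not necessarily the fix that was eventually merged): turn the trailing Latin-1 character into a lookahead, so it terminates the match without being captured.

In [4]: re.compile(u'(?u)\\w+(?=[\u0000-\u00ff])').findall("可以查看 FAQ 模块中 test 部分")
Out[4]: ['可以查看', 'FAQ', '模块中', 'test']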

Environment info

  • OS: macOS 10.14.1, in a Sphinx-specific virtualenv
  • Python version: 2.7.10
  • Sphinx version: 1.8.1
  • Browser: Chrome
DaGeger changed the title from "English words for word segmentation in Chinese language will contain spaces" to "English words cannot be searched in Chinese word segmentation mode" Nov 9, 2018
tk0miya (Member) commented Nov 10, 2018

> Why do space characters appear at the end of the word segmentation result?

Sorry, I don't know why, because none of the maintainers can read Chinese. Any pull requests are welcome :-)

DaGeger (Author) commented Nov 11, 2018

> > Why do space characters appear at the end of the word segmentation result?
>
> Sorry, I don't know why, because none of the maintainers can read Chinese. Any pull requests are welcome :-)

I ran a test; this problem also occurs with English sentences.

In [1]: import re
In [2]: re.compile(u'\\w+[\u0000-\u00ff]').findall('this is a test string')
Out[2]: ['this ', 'is ', 'a ', 'test ', 'string']
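
Worth noting about this output (my own reading of the pattern, offered as an aside): 'string' lacks the trailing space only because the pattern's final [\u0000-\u00ff] consumed the word's own last letter.

In [3]: m = re.search(u'(\\w+)([\u0000-\u00ff])', 'string')

In [4]: m.group(1), m.group(2)
Out[4]: ('strin', 'g')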

TimKam added a commit to TimKam/sphinx that referenced this issue Nov 11, 2018
generate search index for Latin words correctly if search language is Chinese
TimKam (Member) commented Nov 11, 2018

@DaGeger I created a PR that removes the trailing whitespace from Latin search index terms. However, I do not understand if or why these spaces should remain in Chinese. Can you confirm the search now works as intended in Chinese (check out my branch and install from source)?
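
For context, a minimal sketch of the kind of change described (my illustration, not the PR's exact diff): strip the captured whitespace before the terms reach the search index.

>>> import re
>>> latin1_letters = re.compile(u'(?u)\\w+[\u0000-\u00ff]')
>>> [term.strip() for term in latin1_letters.findall(u'可以查看 FAQ 模块中 test 部分')]
['可以查看', 'FAQ', '模块中', 'test']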

DaGeger (Author) commented Nov 11, 2018

> @DaGeger I created a PR that removes the trailing whitespace from Latin search index terms. However, I do not understand if or why these spaces should remain in Chinese. Can you confirm the search now works as intended in Chinese (check out my branch and install from source)?

Great! I tested your code and it solves the problem.

Reply to "why should these spaces remain in Chinese": when writing documents, two adjacent English words are often embedded in Chinese text, e.g. 查找CAS service配置 ("look up the CAS service configuration"). In this case the spaces must be preserved.

By the way, I don't know why searching for CAS returns no results. Is this intended? How can it be made searchable?

TimKam (Member) commented Nov 11, 2018

Thanks for testing. The problem with CAS is that it is first stemmed and then considered too short to be relevant (only two letters). We fixed a similar issue in the English search a while ago. I can fix it here as well and will ask you for another test once the fix is ready.
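
For illustration, a minimal sketch of that failure mode (a hypothetical helper, not Sphinx's actual code): the Porter stemmer's step 1a strips a final 's', and a minimum-length rule then discards the two-letter result.

>>> def stem_and_filter(word, min_len=3):
...     stemmed = word.lower()
...     if stemmed.endswith('s') and not stemmed.endswith('ss'):
...         stemmed = stemmed[:-1]  # Porter step 1a strips the final 's': 'cas' -> 'ca'
...     return stemmed if len(stemmed) >= min_len else None
...
>>> stem_and_filter('CAS') is None  # 'ca' is too short, so the term never reaches the index
True
>>> stem_and_filter('test')
'test'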

TimKam added a commit to TimKam/sphinx that referenced this issue Nov 11, 2018
DaGeger (Author) commented Nov 12, 2018

> Thanks for testing. The problem with CAS is that it is first stemmed and then considered too short to be relevant (only two letters). We fixed a similar issue in the English search a while ago. I can fix it here as well and will ask you for another test once the fix is ready.

I tested your commit in the same way, but there are still no results when searching for CAS.

TimKam (Member) commented Nov 12, 2018

@DaGeger In my tests, the search index includes cas. Can you run git pull --rebase and pip install -e . in the repo's root folder (make sure my branch is checked out) to ensure you have the latest changes? Also, try deleting the old output folder (typically: _build) manually before running the build. If this doesn't help, can you share the exact doc set you are testing with?

TimKam added several commits to TimKam/sphinx that referenced this issue Nov 12, 2018
generate search index for Latin words correctly if search language is Chinese
DaGeger (Author) commented Nov 13, 2018

Following the steps you described, I tested many times and found that problems remain.
With your code, this sentence can be searched:

模块中 CAS service部分

But this one cannot:

模块中CAS service部分

The difference is whether there is a space to the left of the word CAS.

DaGeger (Author) commented Nov 13, 2018

I tried to track down the problem and reached a new conclusion: I think the cause is still this regular expression.

re.compile(u'(?u)\\w+[\u0000-\u00ff]')

We can test this:

In [1]: import re

In [2]: latin1_letters = re.compile(u'(?u)\\w+[\u0000-\u00ff]')

In [3]: latin1_letters.findall('模块中 CAS service部分')
Out[3]: ['模块中 ', 'CAS ', 'service']

In [4]: latin1_letters.findall('模块中CAS service部分')
Out[4]: ['模块中CAS ', 'service']

As you can see, when there is no space to the left of the word, the keyword cas is not matched.
Is this the expected behavior?
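
For reference, the likely mechanism (my own reading of the pattern): with the Unicode flag, \w matches CJK characters as well as Latin ones, so the greedy \w+ runs straight across the script boundary, and only the trailing [\u0000-\u00ff] (here, the space) ends the match.

In [5]: re.compile(u'(?u)\\w+').findall('模块中CAS service部分')
Out[5]: ['模块中CAS', 'service部分']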

TimKam (Member) commented Nov 15, 2018

Okay; splitting Chinese from "Latin" words is then an additional issue that needs fixing. I talked this over with a Chinese colleague to make sure I understand the implications. Will try to fix it asap.

TimKam added a commit to TimKam/sphinx that referenced this issue Dec 22, 2018
And maintain list of Latin terms that can be accessed to decide whether a word should be stemmed
ghost commented Dec 25, 2018

@TimKam, how about this pattern? Are these the expected results?

>>> import re
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['可以查看', 'FAQ', '模块中', 'Chinesetest', '部分', '模块中', 'CAS', 'service', '部分', '取而代之的是它们通过', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, '模块中CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']

For easy reading:

(?:
    (?:
        (?![\s.,])[\x00-\xFF]
    )+
    |
    (?:
        (?![\x00-\xFF])\w
    )+
)
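
If helpful, the expanded layout compiles directly with re.VERBOSE (behavior identical to the one-line version; the annotations are mine):

>>> pattern = re.compile(u'''
... (?:
...     (?:(?![\s.,])[\x00-\xFF])+    # runs of Latin-1 chars, excluding whitespace, '.' and ','
...     |
...     (?:(?![\x00-\xFF])\w)+        # runs of word chars outside Latin-1 (the CJK text)
... )
... ''', re.VERBOSE)
>>> pattern.findall('模块中CAS service部分')
['模块中', 'CAS', 'service', '部分']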

TimKam (Member) commented Dec 25, 2018

Thanks for the suggestion, @animalize. Some days ago, I updated the PR and adjusted the regexp to the following: [\u0000-\u00ff]\w+\w+[\u0000-\u00ff]. As far as I can assess, this correctly extracts Latin "sub-terms". Can you confirm this, or find a test case for which your regexp is better? :-)

ghost commented Dec 25, 2018

Please assess this test case:

>>> import re
>>> pattern = u'[\u0000-\u00ff]\w+\w+[\u0000-\u00ff]'
>>> re.findall(pattern, '可以abc查看')
[]
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This ', ' test ', 'string.']
>>> 
>>>
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> re.findall(pattern, '可以abc查看')
['可以', 'abc', '查看']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']

ghost commented Dec 25, 2018

If only Latin strings are needed, we can use this pattern:

>>> import re
>>> pattern = u'(?:(?![\s.,])[\x00-\xFF])+'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['FAQ', 'Chinesetest', 'CAS', 'service', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, '模块中CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
>>> re.findall(pattern, '可以abc查看')
['abc']

TimKam added several commits to TimKam/sphinx that referenced this issue Dec 25, 2018
And maintain list of Latin terms that can be accessed to decide whether a word should be stemmed
ghost commented Dec 25, 2018

I have a better idea: we can use the pattern r'[a-zA-Z0-9_]+'.

Please see this code in https://github.com/sphinx-doc/sphinx/blob/1.8/sphinx/search/__init__.py:

    _word_re = re.compile(r'(?u)\w+')
    ...
    def split(self, input):
        # type: (unicode) -> List[unicode]
        """
        This method splits a sentence into words.  Default splitter splits input
        at white spaces, which should be enough for most languages except CJK
        languages.
        """
        return self._word_re.findall(input)

BTW, this part of the docstring is wrong according to the code (the splitter actually extracts \w+ runs rather than splitting at whitespace):

> Default splitter splits input at white spaces


\w means, in Python 3's docs:

> \w
>
> For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
>
> For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

\w means, in Python 2's docs:

> \w
>
> When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

So we can simply use r'[a-zA-Z0-9_]+' for Latin words.
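
A quick demonstration of the difference (assuming Python 3, where str patterns match Unicode by default):

>>> import re
>>> re.findall(r'\w+', '模块中CAS service部分')
['模块中CAS', 'service部分']
>>> re.findall(r'[a-zA-Z0-9_]+', '模块中CAS service部分')
['CAS', 'service']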

TimKam added a commit to TimKam/sphinx that referenced this issue Dec 25, 2018
TimKam (Member) commented Dec 25, 2018

Thanks, seems to work. I will merge this when all tests pass!

TimKam added a commit to TimKam/sphinx that referenced this issue Dec 25, 2018
TimKam added a commit that referenced this issue Dec 25, 2018
generate search index for Latin words correctly if search language is Chinese
TimKam closed this as completed Dec 25, 2018
ghost commented Dec 26, 2018

@TimKam @tk0miya: you can @ me when you need help with htmlhelp/Chinese-related issues.

github-actions (bot) locked this issue as resolved and limited conversation to collaborators Aug 9, 2021