English words cannot be searched in Chinese word segmentation mode #5605
Comments
Sorry, I don't know why, because none of the maintainers know Chinese. So any pull requests are welcome :-)
I did a test; this problem also occurs in English sentences.
generate search index for Latin words correctly if search language is Chinese
@DaGeger I created a PR that removes the trailing white spaces for Latin search index terms. However, I do not understand whether or why these spaces should remain in Chinese. Can you confirm the search now works as intended in Chinese (check out my branch and install from source)?
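For context, the idea behind that PR can be sketched roughly like this; the helper below is only an illustration, not the PR's actual code: strip whitespace from each Latin term before it goes into the search index.

def clean_latin_terms(terms):
    # Illustrative only: drop leading/trailing whitespace so a segmenter
    # output like 'test ' is stored in the index as 'test'.
    return [term.strip() for term in terms if term.strip()]

print(clean_latin_terms(['FAQ ', 'Chinesetest ', 'test ']))
# ['FAQ', 'Chinesetest', 'test']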
Great! I tested your code and it solves the problem. Reply: why should these spaces remain in Chinese? By the way, I don't know why search …
Thanks for testing. The problem with …
@DaGeger In my tests, the search index includes …
generate search index for Latin words correctly if search language is Chinese
I tried to track down the problem and reached new conclusions. I think the cause is still this regular expression.
We can run some tests.
Look, when the word is left without surrounding spaces, the keyword …
Okay; splitting Chinese from "Latin" words is then an additional issue that needs fixing. I talked this over with a Chinese colleague to make sure I understand the implications. Will try to fix it asap.
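To see why this matters: a splitter that only cuts on whitespace cannot separate Latin words that are written directly next to Chinese characters, so those words never become standalone index terms. The strings below are taken from the examples later in this thread; the snippet itself is only an illustration.

# Whitespace-only splitting keeps 'service部分' and the ZigBee sentence as
# single tokens, so searching for 'service' or 'ZigBee' finds nothing.
print('模块中 CAS service部分'.split())
# ['模块中', 'CAS', 'service部分']
print('取而代之的是它们通过ZigBee'.split())
# ['取而代之的是它们通过ZigBee']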
And maintain a list of Latin terms that can be accessed to decide whether a word should be stemmed
@TimKam, how about this pattern? Are these the expected results?
>>> import re
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['可以查看', 'FAQ', '模块中', 'Chinesetest', '部分', '模块中', 'CAS', 'service', '部分', '取而代之的是它们通过', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, '模块中CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
For easy reading:
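The same pattern can be spelled out with re.VERBOSE to make the two alternatives visible (the verbose form below only re-spaces the same regexp; it is not copied from the original comment):

import re

pattern = re.compile(r"""
    (?:(?![\s.,])[\x00-\xFF])+   # a run of Latin-1 characters that are not whitespace, '.' or ','
    |                            # or
    (?:(?![\x00-\xFF])\w)+       # a run of word characters outside Latin-1, e.g. CJK
""", re.VERBOSE)

print(pattern.findall('模块中CAS service部分'))
# ['模块中', 'CAS', 'service', '部分']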
Thanks for the suggestion, @animalize. Some days ago I updated the PR and adjusted the regexp to the following: …
Please assess this test case:
>>> import re
>>> pattern = u'[\u0000-\u00ff]\w+\w+[\u0000-\u00ff]'
>>> re.findall(pattern, '可以abc查看')
[]
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This ', ' test ', 'string.']
>>>
>>>
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> re.findall(pattern, '可以abc查看')
['可以', 'abc', '查看']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
(Note: the first pattern only matches runs that both begin and end with a Latin-1 character, so it finds nothing in 可以abc查看 and keeps surrounding spaces and the period in the English sentence.)
If only Latin strings are needed, we can use this pattern:
>>> import re
>>> pattern = u'(?:(?![\s.,])[\x00-\xFF])+'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['FAQ', 'Chinesetest', 'CAS', 'service', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, '模块中CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
>>> re.findall(pattern, '可以abc查看')
['abc']
And maintain a list of Latin terms that can be accessed to decide whether a word should be stemmed
I have a better idea: we can use this pattern. Please see this code in https://github.com/sphinx-doc/sphinx/blob/1.8/sphinx/search/__init__.py:

_word_re = re.compile(r'(?u)\w+')

...

def split(self, input):
    # type: (unicode) -> List[unicode]
    """
    This method splits a sentence into words. Default splitter splits input
    at white spaces, which should be enough for most languages except CJK
    languages.
    """
    return self._word_re.findall(input)

BTW, the docstring is wrong according to the code: the default splitter does not split at white spaces, it collects \w+ matches.
So we can simply use …
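Putting the suggestions in this thread together, a minimal sketch of what the Chinese splitter could do is shown below. This is only an illustration of the idea (jieba segmentation for Chinese plus a Latin-only regexp for English words), with hypothetical names; it is not the code that was eventually merged. sphinx/search/zh.py segments Chinese with jieba when it is installed.

import re

try:
    import jieba  # used by sphinx/search/zh.py for Chinese segmentation when available
    JIEBA = True
except ImportError:
    JIEBA = False

# Latin words only; a bare \w+ would also swallow adjacent CJK characters.
latin_re = re.compile(r'[a-zA-Z0-9_]+')

def split(text):
    # Illustrative splitter: jieba terms for the Chinese parts plus Latin-only
    # matches, so 'test' is indexed without a trailing space even when it is
    # glued to Chinese text.
    chinese = list(jieba.cut_for_search(text)) if JIEBA else []
    latin = latin_re.findall(text)
    return chinese + latin

print(split('可以查看 FAQ 模块中 test 部分'))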
Thanks, seems to work. I will merge this when all tests pass!
generate search index for Latin words correctly if search language is Chinese
Problem
I can't search for English words in Chinese search mode.

Procedure to reproduce the problem
1. Set html_search_language = 'zh' to use Chinese search (see the conf.py sketch after this list).
2. Add 可以查看 FAQ 模块中 test 部分 to a document.
3. Search for test.
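For step 1, the setting goes in the project's conf.py (a minimal illustration of the option named above):

# conf.py -- enable the Chinese search word splitter
html_search_language = 'zh'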
Error results
When searching for the keyword test, no results are displayed.

Expected results
The test string appears in the search results.
Suggestions
Why do space characters appear at the end of the word segmentation result?
I looked at the source code and found the following regular expression in sphinx/search/zh.py, which captures the space:
…
Now test as follows:
…
The result clearly contains the space character; this is the source of the problem.
Can we solve this problem by adjusting the regular expression?
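To make the failure mode concrete, here is a toy illustration (not Sphinx's real index structure): if the splitter stores a term with a trailing space, an exact lookup for the query never matches.

# Toy example only: the real index is built as JSON consumed by JavaScript,
# but the principle is the same -- 'test ' and 'test' are different keys.
index_terms = {'可以查看', 'FAQ', '模块中', 'test ', '部分'}
print('test' in index_terms)                       # False
print('test' in {t.strip() for t in index_terms})  # True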
Environment info