The CSV file delimiter found with csv.Sniffer() is interpreted incorrectly #119123
Comments
At first glance, the line:
can match the pattern in Line 286 in 74072a3, so
Sniffer thinks the delimiter is `,`.
I think we can improve the
Also, it is risky to change the Sniffer because tested and deployed code in the wild may rely on its current best guess.
This issue currently blocks us. We have opted to use clevercsv (https://pypi.org/project/clevercsv) until the bug is fixed.
Hi! I was able to recreate the issue on Python 3.12.3 on a Mac, and it appears that the problem stems from the commas within the lists under the "group" and "subgroup" columns for the entry with id 4. We can see why this happens if we look more closely at the Sniffer class:
    class Sniffer:
        '''
        "Sniffs" the format of a CSV file (i.e. delimiter, quotechar)
        Returns a Dialect object.
        '''
        def __init__(self):
            # in case there is more than one possible delimiter
            self.preferred = [',', '\t', ';', ' ', ':']

        def sniff(self, sample, delimiters=None):
            """
            Returns a dialect (or None) corresponding to the sample
            """
            quotechar, doublequote, delimiter, skipinitialspace = \
                self._guess_quote_and_delimiter(sample, delimiters)
            if not delimiter:
                delimiter, skipinitialspace = self._guess_delimiter(sample,
                                                                    delimiters)
            if not delimiter:
                raise Error("Could not determine delimiter")
            ...

When you run the When we pass in both

        def _guess_quote_and_delimiter(self, data, delimiters):
            """
            Looks for text enclosed between two identical quotes
            (the probable quotechar) which are preceded and followed
            by the same character (the probable delimiter).
            For example:
                             ,'some text',
            The quote with the most wins, same with the delimiter.
            If there is no quotechar the delimiter can't be determined
            this way.
            """
            ...

In the provided CSV file, we can see that the only instances where identical quotes are followed by the same character involve commas. This is evident in the row with id 4, where "group" is If you were to only pass You can also see that by removing the quotation marks in the row with id 4 (changing "group" from
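For anyone who wants to reproduce this without the original file, here is a minimal sketch. The sample is made up: it only mimics the shape described above (`;` between columns, and a row with id 4 whose list items are single-quoted and comma-separated), so treat the data and the variable names as assumptions rather than the reporter's actual script.

```python
import csv

# Hypothetical sample: ';' separates the columns, while the row with id 4
# holds list-like values whose items are single-quoted and comma-separated.
SAMPLE = (
    "id;name;group;subgroup\n"
    "1;alpha;a;x\n"
    "2;beta;b;y\n"
    "3;gamma;c;z\n"
    "4;delta;['a','b','c'];['x','y','z']\n"
)

sniffer = csv.Sniffer()

# No hint: the quote-based step sees ,'b', and ,'y', and settles on ','.
unrestricted = sniffer.sniff(SAMPLE).delimiter

# With a hint, ',' is not an allowed candidate, so the quote-based step
# returns no delimiter and sniff() falls back to _guess_delimiter().
restricted = sniffer.sniff(SAMPLE, delimiters=";").delimiter

print(repr(unrestricted), repr(restricted))
```

The first call should reproduce the reported `,`. Passing `delimiters=";"` is only a workaround for callers who already know the candidate delimiters; it does not change how the quote-based step behaves.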
As @kharvey2 pointed out, the regular expression is responsible for choosing the wrong delimiter here. The current parser has two methods for guessing the delimiter: one based on statistics and one based on regular expressions. The first regular expression matches the first delimiter that appears next to a quote, which breaks support for embedded lists. A simple fix seems to be adding support for lists to the regular expressions. I can work on that.
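To make the regular-expression step concrete, here is a small sketch that runs a single pattern over one row. The pattern is written from my reading of the first pattern tried in `_guess_quote_and_delimiter`, so it may not be character-for-character identical to the revision linked above; the row is a hypothetical stand-in for the entry with id 4.

```python
import re

# Assumption: modelled on the first pattern tried by
# Sniffer._guess_quote_and_delimiter, i.e. delimiter, optional space,
# quoted text, then the same delimiter again:  ,"...",
PATTERN = re.compile(
    r"(?P<delim>[^\w\n\"'])(?P<space> ?)(?P<quote>[\"']).*?(?P=quote)(?P=delim)",
    re.DOTALL | re.MULTILINE,
)

# Hypothetical row shaped like the entry with id 4.
ROW = "4;delta;['a','b','c'];['x','y','z']"

# Collect the captured delimiter together with the text that matched.
hits = [(m.group("delim"), m.group(0)) for m in PATTERN.finditer(ROW)]
print(hits)
```

Only the inner list items are surrounded by the same character on both sides of a quoted string, so every hit votes for the comma. The `;` never gets a vote, because no `;` in the row is followed directly by a quote character.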
Perhaps a good solution to this problem would be to just remove all the quotes and enclosing square brackets. Then you don't have the problematic regular expression:
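The snippet that originally followed is not preserved here, so what follows is only one possible sketch of the idea, under stated assumptions: the sample data is hypothetical, and the quotes and brackets are stripped only from the copy handed to `csv.Sniffer`, while the untouched text is what gets parsed.

```python
import csv
import re

# Hypothetical sample with ';' as the real delimiter.
SAMPLE = (
    "id;name;group;subgroup\n"
    "1;alpha;a;x\n"
    "2;beta;b;y\n"
    "3;gamma;c;z\n"
    "4;delta;['a','b','c'];['x','y','z']\n"
)

# Drop quotes and square brackets from a copy of the sample, so the
# quote-based patterns have nothing to match and the sniffer relies on
# its frequency-based guess instead.
cleaned = re.sub(r"[\[\]'\"]", "", SAMPLE)
dialect = csv.Sniffer().sniff(cleaned)

# Parse the original, unmodified text with the detected delimiter.
rows = list(csv.reader(SAMPLE.splitlines(), delimiter=dialect.delimiter))
print(repr(dialect.delimiter), rows[-1])
```

This is lossy as a detection step: a file that really does use quoted fields would lose that information in the cleaned copy, so it is a workaround for inputs like this one rather than a general fix for the Sniffer.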
Bug report
Bug description:
Python script:
file.csv:
It outputs `,` instead of `;`.
CPython versions tested on:
3.11
Operating systems tested on:
Windows