Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The CSV file delimiter found with csv.Sniffer() is interpreted incorrectly #119123

Open
z-dava opened this issue May 17, 2024 · 6 comments
Open
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@z-dava
Copy link

z-dava commented May 17, 2024

Bug report

Bug description:

Python script:

import csv

with open("file.csv") as f:
    dialect = csv.Sniffer().sniff(f.read(), delimiters=",;")
    print(dialect.delimiter)

file.csv:

id;has_data;has_info;is_sort;a;b;c;d;cost;group;sub_group;merge
1;False;False;False;True;;;;62.25;['345'];['10'];UNKNOWN
2;False;False;False;True;;;;54.00;['235'];['256'];UNKNOWN
3;False;False;False;True;;;;237.00;['567'];['ALL'];UNKNOWN
4;False;False;False;True;;;;46.50;['112', '112'];['765', '766'];UNKNOWN
5;False;False;False;True;;;;237.00;['468'];['ALL'];UNKNOWN
6;False;False;False;True;;;;13.27;['510'];['23'];UNKNOWN

It outputs , instead ;

CPython versions tested on:

3.11

Operating systems tested on:

Windows

@z-dava z-dava added the type-bug An unexpected behavior, bug, or error label May 17, 2024
@aisk aisk added the stdlib Python modules in the Lib dir label May 18, 2024
@aisk
Copy link
Member

aisk commented May 18, 2024

At first glance, the line:

4;False;False;False;True;;;;46.50;['112', '112'];['765', '766'];UNKNOWN

can to match the pattern in

for restr in (r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)', # ,".*?",
, so the Sniffer thinks the delimiter is ,.

I think we can improve the Sniffer, but please note that we can't make it 100% accurate.

@rhettinger
Copy link
Contributor

I think we can improve the Sniffer, but please note that we can't make it 100% accurate.

Also, it is risky to change the Sniffer because tested and deployed code in the wild may rely on its current best guess.

@z-dava
Copy link
Author

z-dava commented May 21, 2024

This issue currently makes us stuck. We have opted to use clevercsv (https://pypi.org/project/clevercsv) until the bug is fixed

@kharvey2
Copy link

Hi! I was able to recreate the issue on Python 3.12.3 on a Mac, and it appears that the problem stems from the commas within the lists under the "group" and "subgroup" columns for the entry with id 4.

We can see why this happens, if we look closer at the Sniffer class,

class Sniffer:
    '''
    "Sniffs" the format of a CSV file (i.e. delimiter, quotechar)
    Returns a Dialect object.
    '''
    def __init__(self):
        # in case there is more than one possible delimiter
        self.preferred = [',', '\t', ';', ' ', ':']


    def sniff(self, sample, delimiters=None):
        """
        Returns a dialect (or None) corresponding to the sample
        """

        quotechar, doublequote, delimiter, skipinitialspace = \
                   self._guess_quote_and_delimiter(sample, delimiters)
        if not delimiter:
            delimiter, skipinitialspace = self._guess_delimiter(sample,
                                                                delimiters)

        if not delimiter:
            raise Error("Could not determine delimiter")
...

When you run the sniff method, it first calls
guess_quote_and_delimiter(...). If this doesn't find a delimiter, it then runs guess_delimiter(). If both attempts fail, it raises the error "Could not determine delimiter."

When we pass in both , and ; into the sniff method and it runs guess_quote_and_delimiter(...), it will return , as the delimiter. This is because guess_quote_and_delimiter(...) is described as follows,

    def _guess_quote_and_delimiter(self, data, delimiters):
        """
        Looks for text enclosed between two identical quotes
        (the probable quotechar) which are preceded and followed
        by the same character (the probable delimiter).
        For example:
                         ,'some text',
        The quote with the most wins, same with the delimiter.
        If there is no quotechar the delimiter can't be determined
        this way.
        """
...

In the provided CSV file, we can see that that the only instances where identical quotes are followed by the same character involve commas. This is evident in the row with id 4, where "group" is ['112', '112'] and "subgroup" is ['765', '766']. Consequently, the sniff method identifies the comma as the delimiter.

If you were to only pass ; as a potential delimiter to the sniff function (csv.Sniffer().sniff(csv_content, delimiters=";")), it would first use guess_quote_and_delimiter(...) to attempt to detect the delimiter. Since there are no instances where their are two identical quotes which are preceded and followed by a ;, it is unable to find a delimiter so it then runs guess_delimiter(). Here, it recognizes ; as the delimiter because it's the character repeated consistently across lines.

You can also see that by removing the quotation marks in the row with id 4 (changing "group" from ['112', '112'] to [112, 112] and "subgroup" from ['765', '766'] to [765, 766]) and keeping the sniff line as dialect = csv.Sniffer().sniff(csv_content, delimiters=",;"), it will detect ; as the delimiter.

@aterrel
Copy link

aterrel commented May 30, 2024

As @kharvey2 pointed out, the regular expression is responsible for choosing the wrong delimiter here.

The current parser has two methods for guessing what the delimiter. One based off statistics and one off some regular expressions. The fiest regilar expression will match the first delimiter that is next to a quote which breaks support for embedded lists.

A simple fix seems to add support for lists to the regular expressions. I can work on that

@aterrel
Copy link

aterrel commented Jun 5, 2024

Perhaps a good solution to this problem would be to just remove all the quotes and enclosing square brackets. Then you don't have the problematic regular expression:

def remove_enclosed_parts(s):
    # Define a regex pattern to match text within quotes and square brackets
    pattern = r'(["\']).*?\1|\[.*?\]'
    
    # Use a loop to repeatedly apply the regex until no more matches are found
    while re.search(pattern, s):
        s = re.sub(pattern, '', s)
    
    return s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants