The CSV file delimiter found with csv.Sniffer() is interpreted incorrectly #119123

z-dava · 2024-05-17T14:33:04Z

Bug report

Bug description:

Python script:

import csv

with open("file.csv") as f:
    dialect = csv.Sniffer().sniff(f.read(), delimiters=",;")
    print(dialect.delimiter)

file.csv:

id;has_data;has_info;is_sort;a;b;c;d;cost;group;sub_group;merge
1;False;False;False;True;;;;62.25;['345'];['10'];UNKNOWN
2;False;False;False;True;;;;54.00;['235'];['256'];UNKNOWN
3;False;False;False;True;;;;237.00;['567'];['ALL'];UNKNOWN
4;False;False;False;True;;;;46.50;['112', '112'];['765', '766'];UNKNOWN
5;False;False;False;True;;;;237.00;['468'];['ALL'];UNKNOWN
6;False;False;False;True;;;;13.27;['510'];['23'];UNKNOWN

It outputs , instead ;

CPython versions tested on:

3.11

Operating systems tested on:

Windows

The text was updated successfully, but these errors were encountered:

aisk · 2024-05-18T14:44:25Z

At first glance, the line:

4;False;False;False;True;;;;46.50;['112', '112'];['765', '766'];UNKNOWN

can to match the pattern in

cpython/Lib/csv.py

Line 286 in 74072a3

    
           for restr in (r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)', # ,".*?",

, so the Sniffer thinks the delimiter is ,.

I think we can improve the Sniffer, but please note that we can't make it 100% accurate.

rhettinger · 2024-05-21T01:25:52Z

I think we can improve the Sniffer, but please note that we can't make it 100% accurate.

Also, it is risky to change the Sniffer because tested and deployed code in the wild may rely on its current best guess.

z-dava · 2024-05-21T10:02:18Z

This issue currently makes us stuck. We have opted to use clevercsv (https://pypi.org/project/clevercsv) until the bug is fixed

kharvey2 · 2024-05-21T14:44:08Z

Hi! I was able to recreate the issue on Python 3.12.3 on a Mac, and it appears that the problem stems from the commas within the lists under the "group" and "subgroup" columns for the entry with id 4.

We can see why this happens, if we look closer at the Sniffer class,

class Sniffer:
    '''
    "Sniffs" the format of a CSV file (i.e. delimiter, quotechar)
    Returns a Dialect object.
    '''
    def __init__(self):
        # in case there is more than one possible delimiter
        self.preferred = [',', '\t', ';', ' ', ':']


    def sniff(self, sample, delimiters=None):
        """
        Returns a dialect (or None) corresponding to the sample
        """

        quotechar, doublequote, delimiter, skipinitialspace = \
                   self._guess_quote_and_delimiter(sample, delimiters)
        if not delimiter:
            delimiter, skipinitialspace = self._guess_delimiter(sample,
                                                                delimiters)

        if not delimiter:
            raise Error("Could not determine delimiter")
...

When you run the sniff method, it first calls
guess_quote_and_delimiter(...). If this doesn't find a delimiter, it then runs guess_delimiter(). If both attempts fail, it raises the error "Could not determine delimiter."

When we pass in both , and ; into the sniff method and it runs guess_quote_and_delimiter(...), it will return , as the delimiter. This is because guess_quote_and_delimiter(...) is described as follows,

    def _guess_quote_and_delimiter(self, data, delimiters):
        """
        Looks for text enclosed between two identical quotes
        (the probable quotechar) which are preceded and followed
        by the same character (the probable delimiter).
        For example:
                         ,'some text',
        The quote with the most wins, same with the delimiter.
        If there is no quotechar the delimiter can't be determined
        this way.
        """
...

In the provided CSV file, we can see that that the only instances where identical quotes are followed by the same character involve commas. This is evident in the row with id 4, where "group" is ['112', '112'] and "subgroup" is ['765', '766']. Consequently, the sniff method identifies the comma as the delimiter.

If you were to only pass ; as a potential delimiter to the sniff function (csv.Sniffer().sniff(csv_content, delimiters=";")), it would first use guess_quote_and_delimiter(...) to attempt to detect the delimiter. Since there are no instances where their are two identical quotes which are preceded and followed by a ;, it is unable to find a delimiter so it then runs guess_delimiter(). Here, it recognizes ; as the delimiter because it's the character repeated consistently across lines.

You can also see that by removing the quotation marks in the row with id 4 (changing "group" from ['112', '112'] to [112, 112] and "subgroup" from ['765', '766'] to [765, 766]) and keeping the sniff line as dialect = csv.Sniffer().sniff(csv_content, delimiters=",;"), it will detect ; as the delimiter.

aterrel · 2024-05-30T10:47:30Z

As @kharvey2 pointed out, the regular expression is responsible for choosing the wrong delimiter here.

The current parser has two methods for guessing what the delimiter. One based off statistics and one off some regular expressions. The fiest regilar expression will match the first delimiter that is next to a quote which breaks support for embedded lists.

A simple fix seems to add support for lists to the regular expressions. I can work on that

aterrel · 2024-06-05T02:35:58Z

Perhaps a good solution to this problem would be to just remove all the quotes and enclosing square brackets. Then you don't have the problematic regular expression:

def remove_enclosed_parts(s):
    # Define a regex pattern to match text within quotes and square brackets
    pattern = r'(["\']).*?\1|\[.*?\]'
    
    # Use a loop to repeatedly apply the regex until no more matches are found
    while re.search(pattern, s):
        s = re.sub(pattern, '', s)
    
    return s

z-dava added the type-bug An unexpected behavior, bug, or error label May 17, 2024

aisk added the stdlib Python modules in the Lib dir label May 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The CSV file delimiter found with csv.Sniffer() is interpreted incorrectly #119123

The CSV file delimiter found with csv.Sniffer() is interpreted incorrectly #119123

z-dava commented May 17, 2024 •

edited

aisk commented May 18, 2024

rhettinger commented May 21, 2024

z-dava commented May 21, 2024

kharvey2 commented May 21, 2024

aterrel commented May 30, 2024

aterrel commented Jun 5, 2024

The CSV file delimiter found with csv.Sniffer() is interpreted incorrectly #119123

The CSV file delimiter found with csv.Sniffer() is interpreted incorrectly #119123

Comments

z-dava commented May 17, 2024 • edited

Bug report

Bug description:

CPython versions tested on:

Operating systems tested on:

aisk commented May 18, 2024

rhettinger commented May 21, 2024

z-dava commented May 21, 2024

kharvey2 commented May 21, 2024

aterrel commented May 30, 2024

aterrel commented Jun 5, 2024

z-dava commented May 17, 2024 •

edited