Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: faild to load gz tsv files into databend #15383

Open
sundy-li opened this issue Apr 30, 2024 · 5 comments
Open

bug: faild to load gz tsv files into databend #15383

sundy-li opened this issue Apr 30, 2024 · 5 comments
Assignees

Comments

@sundy-li
Copy link
Member

Summary

CREATE OR REPLACE TABLE hackernews_1m
(
    `id` BIGINT,
    `deleted` TINYINT,
    `type` String,
    `author` String,
    `timestamp` DateTime,
    `comment` String,
    `dead` TINYINT,
    `parent` BIGINT,
    `poll` BIGINT,
    `children` Array(BIGINT),
    `url` String,
    `score` INT,
    `title` String,
    `parts` Array(INT),
    `descendants` BIGINT
);


wget https://doris-build-1308700295.cos.ap-beijing.myqcloud.com/regression/index/hacknernews_1m.csv.gz

 
❯  curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV) ' -F 'upload=@"./hacknernews_1m.csv.gz"'

{"error":{"code":"400","message":"execute fail: Invalid value '\u001f�\b\b�\u000b�c\u0000\u0003hacknernews_1m.csv\u0000��K��:�%\b�o�\"\u0006p�L|�>��j�U3+;9K\u0002�\u0003[\u000b�GE�f�\u001d�$�\u0004A<S�OO�\u0013�������������' for column 0 (id Int64 NULL): invalid text for number\nat file 'hacknernews_1m.csv.gz', line 0"}}%


# works
❯  curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV) ' -F 'upload=@"./hacknernews_1m.csv"'
{"id":"00d5bae1-a3f8-4ce4-8dc2-c72998456002","state":"SUCCESS","stats":{"rows":2000000,"bytes":642061025},"error":null,"files":["hacknernews_1m.csv"]}%




@youngsofun
Copy link
Member

youngsofun commented Apr 30, 2024

use (type = TSV compression=gzip) like in copy

 curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV  compression=gzip) ' -F 'upload=@"./hacknernews_1m.csv.gz"'

{"id":"ed17ccdc-2ea2-4fac-9ff9-97b0f61fd487","state":"SUCCESS","stats":{"rows":1000000,"bytes":130618406},"error":null,"files":["hacknernews_1m.csv.gz"]}% 

@sundy-li
Copy link
Member Author

There are two issues.

  1. wrong stats stats":{"rows":2000000,"bytes":642061025}
let compression = input_context
                    .get_compression_alg(&filename)
                    .map_err(BadRequest)?;

compression is wrong.

@youngsofun
Copy link
Member

default compression is none not auto

@youngsofun
Copy link
Member

stats is right on my mac.

the bytes you get is diff too, may be you have a diff hacknernews_1m.csv?

(venv) ➜  test git:(stage2) ls -lh hacknernews_1m.csv*
-rw-r--r--  1 yangxiufeng  staff   342M Sep  3  2023 hacknernews_1m.csv
-rw-r--r--  1 yangxiufeng  staff   125M Apr 30 16:42 hacknernews_1m.csv.gz

(venv) ➜  test git:(stage2) curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV  compression=gzip) ' -F 'upload=@"./hacknernews_1m.csv.gz"'

{"id":"c0766b8a-77c5-477c-9a48-01e09a895ad1","state":"SUCCESS","stats":{"rows":1000000,"bytes":130618406},"error":null,"files":["hacknernews_1m.csv.gz"]}%                                                                         
(venv) ➜  test git:(stage2) curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV  compression=none) ' -F 'upload=@"./hacknernews_1m.csv"'

{"id":"08b3d058-13ab-4303-8925-e40ac971a2dd","state":"SUCCESS","stats":{"rows":1000000,"bytes":358551613},"error":null,"files":["hacknernews_1m.csv"]}%                                                                            (venv) ➜  test git:(stage2) curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV  compression=none) ' -F 'upload=@"./hacknernews_1m.csv"'

{"id":"6b5a826b-ec49-47e0-a2ba-75bf372dd36f","state":"SUCCESS","stats":{"rows":1000000,"bytes":358551613},"error":null,"files":["hacknernews_1m.csv"]}%    

@sundy-li
Copy link
Member Author

❯ curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV) ' -F 'upload=@"./hacknernews_1m.csv"'
{"id":"546943ab-f100-4a39-aee8-adca6ca27596","state":"SUCCESS","stats":{"rows":2000000,"bytes":642061026},"error":null,"files":["hacknernews_1m.csv"]}%
❯ wc -l ./hacknernews_1m.csv
1000000 ./hacknernews_1m.csv
❯ ls -lh hacknernews_1m.csv*
-rw-r--r-- 1 sundy sundy 342M Sep  3  2023 hacknernews_1m.csv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants