Add Nested Types to the Appender #150

maiadegraaf · 2024-01-19T09:40:14Z

I'm currently working on adding nested types to the appender. I haven't finished it yet, but I'm opening a draft for better coordination.

The Big Changes:

STRUCTs and LISTs are now both supported, including nested types:
- tripleNestedListInt INT[][][]
- topWrapper STRUCT(Wrapper STRUCT(Base STRUCT(I INT, V VARCHAR)))
- mixList STRUCT(A STRUCT(L VARCHAR[]), B STRUCT(L INT[])[])[]

The values are now added through recursive callback functions, which are initialized in InitializeColumnTypesAndInfos. This reduces the number of type switches to one, and that switch runs only once.
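The callback idea can be sketched in plain Go, without cgo. Everything below is hypothetical and simplified: `typeDesc`, `setFn`, and `buildSetter` are illustrative names, not the PR's actual types, and the setters render values as strings instead of writing into DuckDB vectors. The point is only that the type switch runs once while the setter is built, and nested types reuse the setters of their children.

```go
package main

import (
	"fmt"
	"strings"
)

// typeDesc is a hypothetical, simplified description of a column type.
type typeDesc struct {
	kind   string     // "INT", "VARCHAR", "LIST", or "STRUCT"
	child  *typeDesc  // element type (LIST only)
	names  []string   // field names (STRUCT only)
	fields []typeDesc // field types (STRUCT only)
}

// setFn validates one value and renders it; a stand-in for writing to a vector.
type setFn func(val any) (string, error)

// buildSetter runs the type switch once and returns a closure for the type.
func buildSetter(t typeDesc) (setFn, error) {
	switch t.kind {
	case "INT":
		return func(val any) (string, error) {
			v, ok := val.(int32)
			if !ok {
				return "", fmt.Errorf("expected int32, got %T", val)
			}
			return fmt.Sprint(v), nil
		}, nil
	case "VARCHAR":
		return func(val any) (string, error) {
			v, ok := val.(string)
			if !ok {
				return "", fmt.Errorf("expected string, got %T", val)
			}
			return fmt.Sprintf("%q", v), nil
		}, nil
	case "LIST":
		if t.child == nil {
			return nil, fmt.Errorf("LIST without child type")
		}
		childFn, err := buildSetter(*t.child) // recursion happens at build time
		if err != nil {
			return nil, err
		}
		return func(val any) (string, error) {
			items, ok := val.([]any)
			if !ok {
				return "", fmt.Errorf("expected []any, got %T", val)
			}
			parts := make([]string, len(items))
			for i, item := range items {
				s, err := childFn(item)
				if err != nil {
					return "", err
				}
				parts[i] = s
			}
			return "[" + strings.Join(parts, ", ") + "]", nil
		}, nil
	case "STRUCT":
		if len(t.names) != len(t.fields) {
			return nil, fmt.Errorf("STRUCT needs one name per field")
		}
		fns := make([]setFn, len(t.fields))
		for i, f := range t.fields {
			fn, err := buildSetter(f)
			if err != nil {
				return nil, err
			}
			fns[i] = fn
		}
		return func(val any) (string, error) {
			m, ok := val.(map[string]any)
			if !ok {
				return "", fmt.Errorf("expected map[string]any, got %T", val)
			}
			parts := make([]string, len(fns))
			for i, fn := range fns {
				s, err := fn(m[t.names[i]])
				if err != nil {
					return "", fmt.Errorf("field %s: %w", t.names[i], err)
				}
				parts[i] = t.names[i] + ": " + s
			}
			return "{" + strings.Join(parts, ", ") + "}", nil
		}, nil
	}
	return nil, fmt.Errorf("unsupported type %q", t.kind)
}

func main() {
	// INT[][] built from two nested LIST descriptions.
	intList := typeDesc{kind: "LIST", child: &typeDesc{kind: "INT"}}
	fn, err := buildSetter(typeDesc{kind: "LIST", child: &intList})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(fn([]any{[]any{int32(1), int32(2)}, []any{int32(3)}}))
}
```

Once a setter exists, appending a row only calls closures; no further switching on the column type is needed.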

@taniabogatsch and I are still ironing out some final changes here

# Conflicts: # appender.go # appender_test.go

maiadegraaf · 2024-01-19T09:43:10Z

appender_test.go

@@ -28,7 +27,8 @@ const (
 		float REAL,
 		double DOUBLE,
 		string VARCHAR,
-		bool BOOLEAN
+		bool BOOLEAN,
+-- 	    blob BLOB


There was previously no test for BLOB types. Adding one currently causes a segmentation fault at appender.go:443:

state = C.duckdb_append_data_chunk(*a.appender, chunk)

@marcboeker I've looked into this, and I don't think BLOB types have ever been fully supported: running this test on the current main causes a segmentation fault. To my understanding, a BLOB is at its core a []uint8 in Go, so it overlaps with slices of uint8. If I'm not missing something, that means we have to choose which one to support.

I've opened an issue -> #152

taniabogatsch

Hi @maiadegraaf, great PR! I added quite a lot of comments (again 😅), but I think that now we are almost there!

- Can you double-check that we panic only on internal errors? In all other cases, we should return an error to the user. Let's also cover all user errors with tests.
- Can you double-check that we call Close properly on all test results (db, conn, rows, etc.) to avoid memory leaks (I am missing destructors)? I think you already do so almost everywhere, but just to be sure.
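As an illustration of the first point, here is a minimal sketch of the convention in plain Go. The `rowBuffer` type and its methods are hypothetical and unrelated to the real Appender; they only show user mistakes being reported as errors while a broken internal invariant panics.

```go
package main

import "fmt"

// rowBuffer is a hypothetical stand-in for an appender's row storage.
type rowBuffer struct {
	numCols int
	rows    [][]any
}

// appendRow reports a user mistake (wrong number of values) as an error.
func (b *rowBuffer) appendRow(vals ...any) error {
	if len(vals) != b.numCols {
		return fmt.Errorf("expected %d values, got %d", b.numCols, len(vals))
	}
	b.rows = append(b.rows, vals)
	return nil
}

// checkInvariants panics, because a malformed stored row can only be
// caused by a bug in this package, never by the caller.
func (b *rowBuffer) checkInvariants() {
	for i, r := range b.rows {
		if len(r) != b.numCols {
			panic(fmt.Sprintf("internal error: row %d has %d values", i, len(r)))
		}
	}
}

func main() {
	b := &rowBuffer{numCols: 2}
	fmt.Println(b.appendRow(1, "a")) // accepted
	fmt.Println(b.appendRow(1))      // rejected with an error, no panic
	b.checkInvariants()
}
```

A test for the user-error path then only needs to assert that the returned error is non-nil.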

taniabogatsch · 2024-02-06T23:40:47Z

@marcboeker, what motivates removing significant test coverage to have the tests run faster? Especially since these cover the new nested functionality?

marcboeker · 2024-02-07T07:00:13Z

> @marcboeker, what motivates removing significant test coverage to have the tests run faster? Especially since these cover the new nested functionality?

Please review the appender_test.go file, where nested and base cases have been consolidated and refactored (though not yet complete). The only test removed is the nested large test, as I believe the base nested test is sufficient for testing with e.g. 10 rows instead of 10000, without sacrificing the speed of the test suite. Rapid tests are essential for working with them.

Although benchmarks can be helpful, they are not present for the rest of go-duckdb. Therefore, I have removed them. My aim is to create test cases that cover a significant portion of the code, are comprehensible, and can serve as a form of documentation.

There are still missing test cases for the appender, such as UUID.

taniabogatsch · 2024-02-07T10:11:46Z

Thanks for the explanation.

> The only test removed is the nested large test, as I believe the base nested test is sufficient for testing with e.g. 10 rows instead of 10000

The reason for having more than ten rows is that DuckDB's standard vector size is 2048. I see that the primitive types now cover 3k rows, which tests the creation of multiple chunks. I missed that change. Child vectors can exceed the standard vector size for nested types, so testing this might still be interesting.
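The arithmetic behind this can be sketched as follows. The constant 2048 comes from the comment above; `chunksNeeded` and `childEntries` are hypothetical helpers for illustration and do not reflect how the Appender actually splits data.

```go
package main

import "fmt"

// vectorSize is DuckDB's standard vector size, as stated in the comment above.
const vectorSize = 2048

// chunksNeeded returns how many chunks of vectorSize rows are needed.
func chunksNeeded(rows int) int {
	return (rows + vectorSize - 1) / vectorSize
}

// childEntries returns the total number of entries in a LIST's child
// vector, given the list length stored in each parent row.
func childEntries(listLens []int) int {
	total := 0
	for _, n := range listLens {
		total += n
	}
	return total
}

func main() {
	fmt.Println(chunksNeeded(10))   // fits into a single chunk
	fmt.Println(chunksNeeded(3000)) // spans more than one chunk

	// Ten parent rows, each holding a list of 300 elements: the parent
	// fits into one vector, but the child vector holds 3000 entries.
	lens := make([]int, 10)
	for i := range lens {
		lens[i] = 300
	}
	fmt.Println(childEntries(lens) > vectorSize)
}
```

So a nested test can exceed the vector size in the child vector even with few parent rows, which is a different code path from creating multiple chunks.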

> without sacrificing the speed of the test suite. Rapid tests are essential for working with them.

It is possible to use a regex to run only specific tests, and it is also possible to group slow tests into a common expression to exclude (e.g., _slow_test.go). That way, it is possible to include tests closer to real-usage scenarios while keeping fast tests for rapid development.

This is mostly me trying to understand the project better and better understand the structure for future contributions. Maybe it is worth setting up a section in the contributing guidelines about this?

marcboeker · 2024-02-09T22:36:35Z

I've done some more refactoring and added tests for the UUID type, time.Time and []byte.

Unfortunately, appending a blob does not work at the moment, hence the TestAppenderBlob is skipped.

@maiadegraaf Do you know what type a DUCKDB_TYPE_BLOB is internally? Is it a primitive type or a list (DUCKDB_TYPE_LIST) of uint8 (which I doubt)?

The current implementation triggers a segmentation violation in the following line:

state = C.duckdb_append_data_chunk(*a.appender, chunk)

The logical type we set is: C.duckdb_create_logical_type(C.DUCKDB_TYPE_BLOB) and the value for the vector row index is set via:

setPrimitive[[]byte](colInfo, rowIdx, val.([]byte))

One weird thing is that the appender is a bit flaky under Ubuntu. Locally on macOS and in the CI under macOS/FreeBSD, everything works reliably. It fails at the same operation as the blob appending:

state = C.duckdb_append_data_chunk(*a.appender, chunk)

taniabogatsch · 2024-02-15T17:13:53Z

I had a look at the failing test. The reason for the segmentation fault is that we do not set the validity mask correctly for the child vectors of the STRUCT vector. When appending the data, we assume the values are set and try to append a string with an undefined length and an invalid data pointer.
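A pure-Go model of the fix, under stated assumptions: `vector`, `newVector`, and `setNull` are hypothetical names, the mask is modelled as 64-bit words where a set bit means valid, and STRUCT children are assumed to share the parent's row index. The real code operates on DuckDB vectors through the C API instead.

```go
package main

import "fmt"

// vector is a hypothetical model of a DuckDB vector: a validity bitmask
// (bit set = valid) plus child vectors for STRUCT fields.
type vector struct {
	validity []uint64
	children []*vector
}

// newVector creates a vector whose rows are all explicitly marked valid,
// so nothing depends on how the memory happened to be initialised.
func newVector(rows int, children ...*vector) *vector {
	mask := make([]uint64, (rows+63)/64)
	for i := range mask {
		mask[i] = ^uint64(0)
	}
	return &vector{validity: mask, children: children}
}

// isValid reports whether the given row is marked valid.
func (v *vector) isValid(row int) bool {
	return v.validity[row/64]&(uint64(1)<<(uint(row)%64)) != 0
}

// setNull marks the row invalid in this vector and, recursively, in all
// STRUCT child vectors, so no reader treats a child entry as set.
func (v *vector) setNull(row int) {
	v.validity[row/64] &^= uint64(1) << (uint(row) % 64)
	for _, c := range v.children {
		c.setNull(row)
	}
}

func main() {
	strCol := newVector(4) // e.g. a VARCHAR field
	intCol := newVector(4) // e.g. an INT field
	st := newVector(4, strCol, intCol)

	st.setNull(2)
	fmt.Println(st.isValid(2), strCol.isValid(2), intCol.isValid(2))
	fmt.Println(st.isValid(1), strCol.isValid(1), intCol.isValid(1))
}
```

Without the recursive step, only the parent row would be marked NULL, and a child VARCHAR entry could still be read as a valid string with an undefined length and pointer.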

I guess this worked on macOS but not on Ubuntu because Ubuntu probably default-initialised that bit to one (valid), whereas macOS probably default-initialised it to zero (invalid)... Or some other bit-level shenanigans.

I have a fix ready, and I'm also working on some other changes. I'll open a PR to @maiadegraaf once I'm done.

I'll also get back to you about the BLOBs.

marcboeker · 2024-02-15T19:24:32Z

@taniabogatsch Thanks, sounds great. I'm curious to see how you fixed it.

Validity mask fix and changes towards idiomatic Go

taniabogatsch · 2024-02-16T09:29:11Z

I used the logical type to switch in the SetNull, but I believe that only works for top-level recursion depths. We either have to destroy all child logical types when initializing the types or when closing the Appender. So these types might not be accessible in deeper levels. A cleaner solution here is to revert to the original implementation of this PR, where we return the logical type and end up with exactly one logical type for the top-level chunk initialization. That way, we also don't keep an unused field in the colInfo struct, and we can instead set the DUCKDB_TYPE. I'll push a fix. 🤔

taniabogatsch · 2024-02-16T13:52:47Z

W.r.t. the BLOB type. DuckDB distinguishes between BLOB and UTINYINT[]. In DuckDB, a BLOB has the same representation as a VARCHAR. Both are a more elaborate char pointer (inlined vs. string heap). It is possible to store Go-string and Go-[]byte in C.CString, and pass the resulting C string to duckdb_vector_assign_string_element. I updated the BLOB implementation accordingly, and I also expanded on the test.

To the best of my knowledge, Go does not distinguish between []byte and []uint8. DuckDB distinguishes between BLOB and UTINYINT[]. @maiadegraaf raised this problem here. In the current solution, we opt to support BLOB in the Appender, but not UTINYINT[]. I added the respective tests.
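The Go side of this can be checked directly: `byte` is an alias for `uint8`, so the two slice types are identical. In the sketch below, `duckdbTypeFor` is a hypothetical helper, not part of go-duckdb; it shows why a type switch has to map the single Go type to exactly one DuckDB type.

```go
package main

import (
	"fmt"
	"reflect"
)

// sameSliceType reports whether []byte and []uint8 are the same Go type.
func sameSliceType() bool {
	return reflect.TypeOf([]byte(nil)) == reflect.TypeOf([]uint8(nil))
}

// duckdbTypeFor is a hypothetical mapping from a Go value to a DuckDB
// type name. Adding a separate `case []uint8` would not compile, because
// it duplicates the []byte case.
func duckdbTypeFor(val any) string {
	switch val.(type) {
	case []byte:
		return "BLOB"
	case string:
		return "VARCHAR"
	default:
		return "unsupported"
	}
}

func main() {
	fmt.Println(sameSliceType())
	fmt.Println(duckdbTypeFor([]uint8{1, 2, 3}))
}
```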

cc @marcboeker, what do you think? :)

More Appender changes

marcboeker

Thanks for the improvements. I really like the generics approach. I'm going to merge this and we can fix the small changes later.

marcboeker · 2024-02-16T23:36:27Z

appender.go

+	fn     SetColValue
+
+	// The type of the column.
+	ddbType C.duckdb_type


I would call it duckdbType or dbType, as either is clearer than ddbType, which reads like a typo.

marcboeker · 2024-02-16T23:38:13Z

appender.go

-		return nil, fmt.Errorf("can't create appender")
-	}
+	var appender C.duckdb_appender
+	state := C.duckdb_appender_create(*dbConn.con, cSchema, cTable, &appender)


Maybe we should shorten dbConn simply to conn as the corresponding struct field in the Appender is also called c.

marcboeker · 2024-02-16T23:42:47Z

appender.go

+	return csPtr, slice
+}
+
+func initPrimitive[T any](ddbType C.duckdb_type) (colInfo, C.duckdb_logical_type) {


Awesome use of generics. Makes the code a lot cleaner 🎉
Maybe we should rename ddbType to duckdbType here too.
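To illustrate why generics help here, the following is a simplified, cgo-free sketch. `column`, `newColumn`, `setValue`, and `getValue` are hypothetical names; the real initPrimitive takes a C.duckdb_type and returns a colInfo and a logical type, which this sketch does not reproduce. It only shows one generic function replacing a separate function per primitive type.

```go
package main

import "fmt"

// column is a hypothetical stand-in for per-column state; data holds a
// typed slice such as []int32 or []string.
type column struct {
	data any
}

// newColumn allocates storage for rows values of type T.
func newColumn[T any](rows int) column {
	return column{data: make([]T, rows)}
}

// setValue stores val at rowIdx, failing if the column holds another type
// or the index is out of range.
func setValue[T any](c column, rowIdx int, val T) error {
	slice, ok := c.data.([]T)
	if !ok {
		return fmt.Errorf("column does not store values of type %T", val)
	}
	if rowIdx < 0 || rowIdx >= len(slice) {
		return fmt.Errorf("row index %d out of range", rowIdx)
	}
	slice[rowIdx] = val
	return nil
}

// getValue reads the value at rowIdx; ok is false on a type or range mismatch.
func getValue[T any](c column, rowIdx int) (val T, ok bool) {
	slice, isT := c.data.([]T)
	if !isT || rowIdx < 0 || rowIdx >= len(slice) {
		return val, false
	}
	return slice[rowIdx], true
}

func main() {
	ints := newColumn[int32](3)
	fmt.Println(setValue(ints, 1, int32(42)))
	fmt.Println(getValue[int32](ints, 1))
	fmt.Println(setValue(ints, 0, "not an int"))
}
```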

Avoid writing to parquet if we can (appends only)

~Using upstream driver has some shortcomings from the get go: The interface decides on the schema based on passed types, so having untyped `nil`s don't help. But typed nils are also not handled. The [fork/PR](marcboeker/go-duckdb#150) fixes these issues but I still get 0 rows on the test db (or motherduck)~ The PR is now merged upstream and tagged as 1.6.1.

~I had to use my [fork](https://github.com/disq/go-duckdb/tree/feat/allow-nulls-in-appender-first-row) to get around nils, another way to this would be if the `appender.initColTypes` was exported, then we could just pass on concrete types for init and then use nullable data as normal.~

~Another issue: We create JSON types as `json` but appender doesn't know about it. Sending `string` instead (which is what the json alias points to internally in duckdb, varchar) doesn't work. Might need the newer appender implementation for this to go forward.~

Example config:

```yaml
kind: source
spec:
  name: test
  path: cloudquery/test
  registry: cloudquery
  version: v4.0.8
  # tables: ["*"]
  tables: [ "test_some_table" ]
  destinations: ["duckdb"]
  spec:
    num_clients: 1
    num_rows: 4
    num_sub_rows: 1
    # num_clients: 10
    # num_rows: 10000
    num_sub_rows: 0
    # num_sub_rows: 10
---
kind: destination
spec:
  name: duckdb
  version: v0.0.0
  registry: grpc
  path: localhost:7777
  # write_mode: "overwrite"
  write_mode: "append"
  spec:
    connection_string: tmp_duck.db
    # connection_string: md:my_db
    # Optional parameters
    batch_size: 10000
    # batch_size_bytes: 14194304 # 14 MiB
    # debug: false
```

maiadegraaf added 17 commits January 8, 2024 13:21

Add nested lists and structs

755ca82

start adding parent struct

36b8ab4

Add callback functios

ac8e1ae

Nested in nested works!

a5bf70d

Add more nested tests

baa24b4

BIG tests

8f34212

Change test format

1b6e9af

Final changes

f053e00

Merge remote-tracking branch 'origin/main'

8457b52

rm unused library

271780e

fix nits

7880836

implement preallocation and new types

261d780

rm unused struct

bb78a45

add test for large table with nested types

b5c8803

destroy logical types

9eb6cdc

Merge branch 'main' into appender2.0

bab28b4

# Conflicts: # appender.go # appender_test.go

UUID now works

89a0cfc

maiadegraaf marked this pull request as draft January 19, 2024 09:40

maiadegraaf added 3 commits January 19, 2024 12:59

add naive benchmark

7cfdd61

support validity mask

2929113

move nested null tests to nested

6e3c4b1

maiadegraaf mentioned this pull request Jan 22, 2024

Appending BLOB types causes segmentation fault #152

Closed

taniabogatsch suggested changes Jan 24, 2024

maiadegraaf added 5 commits January 24, 2024 18:36

fix appender nits

6e66864

fix appender_bench_test.go nits

b320a60

fix appender_bench_test.go nits

e498b62

reduce table size

e0739d7

after flush correctly create new chunk

d8396cb

taniabogatsch reviewed Jan 26, 2024

marcboeker added 3 commits February 9, 2024 23:03

Add UUID type and more tests

097ecae

Use TIMESTAMP

2206234

Use timestamp with fixed date/time

b08145c

taniabogatsch mentioned this pull request Feb 14, 2024

Contributing guidelines #162

Merged

taniabogatsch added 5 commits February 15, 2024 15:36

verify NULL breaks test

b032bcd

disable setting the validity mask

3a71d11

second row

3404593

test without varchar

1a965c4

set validity of child vectors

1d32a0e

fixing NULL STRUCTs and other changes

88f856f

Merge pull request #2 from taniabogatsch/appender

dec7b9e

Validity mask fix and changes towards idiomatic Go

taniabogatsch added 5 commits February 16, 2024 11:25

don't keep logical type in colInfo, more VARCHAR fields in nested tests

9df2e0d

more nested NULL tests

9f2b95a

renaming

3fe9bd4

fix BLOB

00ea7cb

test BLOB/uint8[] vs UTINYINT[]

ebd2841

Merge pull request #3 from taniabogatsch/appender

10f8f62

More Appender changes

disq mentioned this pull request Feb 16, 2024

feat: Use Appender interface cloudquery/cloudquery#16668

Merged

marcboeker reviewed Feb 16, 2024

marcboeker merged commit e26a426 into marcboeker:main Feb 16, 2024
3 checks passed

taniabogatsch mentioned this pull request Feb 19, 2024

Remaining Appender review changes and test fixes #167

Merged
