Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: query with Apache Arrow result #134

Merged
merged 17 commits into from
Jan 18, 2024

Conversation

levakin
Copy link
Contributor

@levakin levakin commented Dec 1, 2023

This PR implements Connection method QueryArrowContext to get query result in Apache Arrow format.
Connection type is now exported and can be used in similar way as pgx driver. See example here.

Partly implements #75 , some methods are still missing.

@marcboeker
Copy link
Owner

Thanks @levakin for the PR. Could you please have a look at https://github.com/marcboeker/go-duckdb/actions/runs/7060722450/job/19233036222?pr=134
Also I need some time to review the PR since I'm currently overwhelmed with other tasks.

@levakin
Copy link
Contributor Author

levakin commented Dec 3, 2023

@marcboeker thanks, I fixed the issue. It was happening due to different memory allocation behaviour on linux and macos.
There actually 2 ways to fix it.

  • First option is to check how many rows were retrieved already and stop if there are no more.
  • The second option is to use calloc instead of malloc here cdArr := (*cdata.CArrowArray)(unsafe.Pointer(C.malloc(C.sizeof_struct_ArrowArray))).

Before the fix malloc returned non-zero memory chunk for ArrowArray on second call. The ArrowArray.length is checked and it's not 0 on linux (not even the length of previous chunk, just some large number). Then it tries to import this ArrowArray and obviously fails.

@marcboeker
Copy link
Owner

marcboeker commented Dec 10, 2023

@levakin I would prefer to separate the standard driver from additional features like Arrow support or the DuckDB appender API. The DuckDB Appender implementation has a method called NewAppenderFromConn(driverConn driver.Conn, schema string, table string) (*Appender, error) that is used to initialize a new appender.
Is it possible to do the same for the Apache Arrow interface to keep the default driver free of extensions?

With dbConn, ok := driverConn.(*conn) you are able to obtain a raw DuckDB database connection.

Have you seen this part of the PR, which does exactly what I'm proposing.

@levakin
Copy link
Contributor Author

levakin commented Dec 11, 2023

@marcboeker Yes, I can create such interface. I thought it would be a better idea to follow the same library design as https://github.com/jackc/pgx does. Postgres native interface (without any additional interface initializations like Appender) + stdlib wrapper provided as separate package

@marcboeker
Copy link
Owner

@levakin As we already have the Appender separated from the main driver, I would prefer a separated Arrow implementation as well.

Maybe it would make sense to get in contact with @phillipleblanc as he has already built an Arrow integration. The goal should be to have one implementation shipped with go-duckdb that satisfies all parties.

@phillipleblanc
Copy link

phillipleblanc commented Jan 4, 2024

@catkins mentioned in #75 that it might make sense to have the Arrow interface as a separate module - so people that didn't need the Arrow interface wouldn't need to pull in the Arrow dependencies (since it can be quite heavyweight). That also meshes with @marcboeker's desire to have the Arrow interface separated from the main driver - but I'll defer to Marc on what he wants to do there.

The approach @levakin took with using prepared statements is better than my implementation in https://github.com/spicehq/go-duckdb/blob/master/connection.go#L36 - which has the drawback of the context not being respected and needing to execute the query synchronously.

What I can do is work on a separate module that wraps go-duckdb and provides an Arrow interface using the prepared statement approach that @levakin implemented. @marcboeker can then decide to merge the code into this repo (similar to the Appender pattern), or create a new repo (i.e. go-duckdb-arrow) and import the code there, and tell people that need to Arrow interface to use that module.

EDIT: Actually I'm not sure if the separate module approach will work - both implementations need access to internal functions/state to work properly.

@catkins
Copy link

catkins commented Jan 4, 2024

Actually I'm not sure if the separate module approach will work - both implementations need access to internal functions/state to work properly.

if they're in the same tree, could internal packages be used to share code between them without exposing it to consumers?

@levakin
Copy link
Contributor Author

levakin commented Jan 4, 2024

I'll be back soon from the holidays and implement proposed changes. I don't think a separate repo would be convenient to use, but a separate module is a good idea

@marcboeker
Copy link
Owner

@phillipleblanc I'm fine with including it in the go-duckdb module, but maybe not tightly integrated into the core driver but instead as an add-on like the Appender. That keeps the main driver interface clean but also does not require a second module/dependency. I'd be happy to merge such a variant.

@levakin levakin force-pushed the feat/apache-arrow branch 2 times, most recently from fd3b63f to fd2c76d Compare January 9, 2024 23:28
@levakin
Copy link
Contributor Author

levakin commented Jan 10, 2024

@marcboeker please check again. I've separated most of the new logic in arrow.go. Changelog is in the latest commit message.

Separating Arrow in a different module would at least require exposing Statements methods. If we can tolerate arrow dependency in one module, it would be easier to maintain IMHO

@marcboeker
Copy link
Owner

marcboeker commented Jan 15, 2024

Hey @levakin, sorry for the delay and thank you for the extensive refactoring. I've also refactored the code to keep driver.Conn in the function signature instead of any for the conn. I've moved the C code to the header of the arrow.go to have everything in one place. I also moved the Arrow specific functions from the statement.go to the arrow.go. Please review.

Next, we should validate the Arrow specific code for any memory leaks (which can easily occur in a CGO environment) and check for correctness/optimizations. As I have no experience with Apache Arrow, I need to understand it first.
Additionally, there were some refactorings that were not related to the Arrow feature. I've tried to keep most of them. If anything was lost, I suggest fixing that in a separate PR.

@levakin
Copy link
Contributor Author

levakin commented Jan 15, 2024

@marcboeker Thanks for the review. Keeping C code in arrow.go is ok, since there is not much code in there. Initial idea was to follow the approach from the adbc driver which stores these structures in adbc.h.


Let's discuss the approach with driver.Conn in the function signature instead of any for the conn. Why do you think it's a necessary step to make type assertion conn, ok := driverConn.(driver.Conn), when we can just skip it and pass driverConn from the conn.Raw(func(driverConn any) error { instead?

Also I think users should be encouraged to use conn.Raw which returns any, rather than connector.Connect returning Conn, because in this case sql package maintains the pool of connections. That's why I also changed README and test files.


Are you sure sure we need to provide both Query and QueryContext methods in new interfaces?

// Deprecated: Use QueryContext instead.
func (a *Arrow) Query(query string, args ...any) (array.RecordReader, error) {
	return a.QueryContext(context.Background(), query, args)
}

AFAIK QueryContext method in the standard library was added after context package was developed, and it's the only reason why it was not included in Query method to follow backwards compatibility. So it will be safe to provide Query method accepting context argument since the beginning.

@levakin
Copy link
Contributor Author

levakin commented Jan 15, 2024

@marcboeker Do you think having working examples in the README is a bad idea or should it be addressed in a separate PR?
The same question is for Connector refactoring, which is stateful but cannot be closed.

@marcboeker
Copy link
Owner

Let's discuss the approach with driver.Conn in the function signature instead of any for the conn. Why do you think it's a necessary step to make type assertion conn, ok := driverConn.(driver.Conn), when we can just skip it and pass driverConn from the conn.Raw(func(driverConn any) error { instead?

Changing a function signature, even if it does not break things, is unusual in Go without bumping the major release number. Normally you initialize a new Arrow connection with

connector, err := duckdb.NewConnector("", nil)
if err != nil {
  ...
}
conn, err := connector.Connect(context.Background())
if err != nil {
  ...
}
defer conn.Close()

// Retrieve Arrow from connection.
ar, err := duckdb.NewArrowFromConn(conn)

which works perfectly fine. In most cases you would not need a conn.Raw(func(driverConn any) {}) when not using the raw DB conn beforehand.

Also I think users should be encouraged to use conn.Raw which returns any, rather than connector.Connect returning Conn, because in this case sql package maintains the pool of connections. That's why I also changed README and test files.

The idea behind the Connector was to have a convenient way of running an init func, that for example installs necessary plugins.

Are you sure sure we need to provide both Query and QueryContext methods in new interfaces?

I'm fine with removing it. My idea was to keep the interface in sync with the SQL driver and remove Query() once the SQL driver does it first.

@marcboeker
Copy link
Owner

@marcboeker Do you think having working examples in the README is a bad idea or should it be addressed in a separate PR? The same question is for Connector refactoring, which is stateful but cannot be closed.

What do you mean by "working examples" in the README? Are they broken?

Regarding the Connector: why should it not be possible to close it?

func (c *connector) Close() error {

Could you please elaborate a bit.
Thanks!

Added and defined structures for Arrow schema and Arrow array in a new file "arrow.h". This allows Arrow-compatible communication and data sharing, which is memory-efficient. Implemented Arrow interface in the connection.go file. Also updated the 'duckdb_test.go' file to include tests for the Arrow interface.
Renames the function QueryArrowContext to QueryArrow. This change is to simplify the function name as 'Context' in the name isn't required or adds any meaning.
levakin and others added 3 commits January 17, 2024 14:30
- conn structure was made unexported
- NewConnector was renamed to OpenConnector to be consistent with sql package and now returns ConnectorCloser interface to expose Close method
- appender: NewAppenderFromConn receives connection of type any to be compatible with sql.Conn.Raw method. Tests and docs were adopted as well.
@levakin
Copy link
Contributor Author

levakin commented Jan 17, 2024

@marcboeker ok, then let's keep the same interface. However, changing to the 'any' type won't break backwards compatibility, so a major version won't be needed.


The idea behind the Connector was to have a convenient way of running an init func, that for example installs necessary plugins.

I aggree with you, it's very convenient to have init for every new connection.


I'm fine with removing it. My idea was to keep the interface in sync with the SQL driver and remove Query() once the SQL driver does it first.

I guess sql package won't remove Query until GO 2, which is unlikely. So I suggest we leave only Query(ctx, ...) or QueryContext(ctx, ...). Which option do you prefer?


Regarding the Connector: why should it not be possible to close it?

Well, it's kinda possible, but too hacky. connector type is hidden behind driver.Connector.

func (Driver) OpenConnector(dataSourceName string) (driver.Connector, error) {
	return createConnector(dataSourceName, func(execerContext driver.ExecerContext) error { return nil })
}

And to get access to Close method it's needed to make type assertion.

connector, err := duckdb.NewConnector("", nil)
if err != nil {
  ...
}
connector.(io.Closer).Close() // Hidden behind driver.Connector returned by NewConnector

What do you mean by "working examples" in the README? Are they broken?

I meant, that you can't just copy them to your IDE and expect to be run as is. With the changes I proposed, I had advantages like syntax highlighting, autoformatting, etc. like with normal code.
If you prefer examples like they are now, it's fine by me.

@levakin
Copy link
Contributor Author

levakin commented Jan 17, 2024

Let me give more concrete example regarding connector.Close.
Let's take README example.

connector, err := NewConnector("test.db", nil)
if err != {
  ...
}
conn, err := connector.Connect(context.Background())
if err != {
  ...
}
defer conn.Close()

// Retrieve appender from connection (note that you have to create the table 'test' beforehand).
appender, err := NewAppenderFromConn(conn, "", "test")
if err != {
  ...
}
defer appender.Close()

Here connector is opened and Database is opened internally. The bug here is that connector is never closed. Unless it's passed to db := sql.OpenDB() and then closed with db.Close().

Here's part of sql package code:

func (db *DB) Close() error {
....
if c, ok := db.connector.(io.Closer); ok {
		err1 := c.Close()
		if err1 != nil {
			err = err1
		}
	}

We end up with 2 options, leave it as is and work with database only through standard sql package or expose connector Close method to the outer world. It can be done by returning ConnectorCloser interface. Also in this case we should ensure Connector is fine with multiple close calls.

@marcboeker
Copy link
Owner

I guess sql package won't remove Query until GO 2, which is unlikely. So I suggest we leave only Query(ctx, ...) or QueryContext(ctx, ...). Which option do you prefer?

Okay, then let's remove Query as QueryContext is also used in the regular driver and is the new standard.

Regarding the Connector: Thanks for the detailed explanation. You probably refer to this issue? golang/go#41790

I would suggest to first merge the Arrow feature and then take care of implementing the io.Closer interface in a second PR.

I meant, that you can't just copy them to your IDE and expect to be run as is. With the changes I proposed, I had advantages like syntax highlighting, autoformatting, etc. like with normal code. If you prefer examples like they are now, it's fine by me.

Okay, got it. What do you think of keeping the samples in the README short and concise and adding two new examples as Go files for the Arrow and Appender to examples/? We can then also link from the README to the examples.

@levakin
Copy link
Contributor Author

levakin commented Jan 18, 2024

You are right about the the issue. Here's this check https://go-review.googlesource.com/c/go/+/258360/6/src/database/sql/sql.go
from https://go-review.googlesource.com/c/go/+/258360

@marcboeker Yeah, sure. I can file a PR and let's resolve it separately.

Next, we should validate the Arrow specific code for any memory leaks (which can easily occur in a CGO environment) and check for correctness/optimizations. As I have no experience with Apache Arrow, I need to understand it first.

I would appreciate some help with it :)

@marcboeker
Copy link
Owner

@marcboeker Yeah, sure. I can file a PR and let's resolve it separately.

Great, thanks!

I would appreciate some help with it :)

I've done some simplifications as C.calloc already returns an unsafe.Pointer. Do you see any impacts?

And one test case didn't return anything as there were no rows in the DB. I've added some test data.

If you don't see any issues, I would suggest merging it.

@levakin
Copy link
Contributor Author

levakin commented Jan 18, 2024

I don't see any impact. However type casting C pointers is cumbersome and we can improve later if needed.

Thanks! Let's merge it @marcboeker 👍🏻

@marcboeker
Copy link
Owner

@levakin Thanks for your hard work on this 🙏 It's merged. Will release a new version when #143 is merged.

@marcboeker marcboeker merged commit ab6ee34 into marcboeker:main Jan 18, 2024
3 checks passed
@levakin levakin deleted the feat/apache-arrow branch January 19, 2024 14:44
@levakin levakin restored the feat/apache-arrow branch January 19, 2024 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants