Roadmap

This document describes the high-level goals of the DataFusion and Ballista development community. It is not meant to restrict possibilities, but rather to help newcomers understand the broader context of where the community is headed and to inspire additional contributions.

DataFusion and Ballista are part of the Apache Arrow project and are governed by the Apache Software Foundation governance model. These projects are entirely driven by volunteers, and we welcome contributions for items not on this roadmap. However, before submitting a large PR, we strongly suggest you start a conversation using a GitHub issue or the dev@arrow.apache.org mailing list to make review efficient and avoid surprises.

DataFusion

DataFusion's goal is to become the embedded query engine of choice for new analytic applications, by leveraging the unique features of Rust and Apache Arrow to provide:

Best-in-class single node query performance
A declarative SQL query interface compatible with PostgreSQL
A Dataframe API, similar to those offered by Pandas and Spark
A procedural API for programmatically creating and running execution plans
High-performance, data-race-free, ergonomic extensibility points at every layer

Additional SQL Language Features

Complete the support list on the status page
Timestamp Arithmetic #194
SQL Parser extension point #533
Support for nested structures (fields, lists, structs) #119
Remaining Set Operators (INTERSECT / EXCEPT) #1082
Run all queries from the TPC-H benchmark (see milestone for more details)

Query Optimizer

Additional constant folding / partial evaluation #1070
More sophisticated cost based optimizer for join ordering
Implement advanced query optimization framework (Tokomak) #440
Finer optimizations for group by and aggregate functions
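
To illustrate what the constant folding / partial evaluation item refers to, here is a minimal sketch on a toy expression tree. `ToyExpr`, `fold_binary`, and `fold_constants` are hypothetical names invented for this example; they are not DataFusion's expression types or optimizer API. The idea is that sub-expressions made only of literals are evaluated once at planning time, while anything that touches a column is left in place.

```rust
/// A toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum ToyExpr {
    Literal(i64),
    Column(String),
    Add(Box<ToyExpr>, Box<ToyExpr>),
    Mul(Box<ToyExpr>, Box<ToyExpr>),
}

/// Folds both children, then evaluates the operator if both became literals.
/// On arithmetic overflow the expression is left unevaluated.
fn fold_binary(
    left: ToyExpr,
    right: ToyExpr,
    op: fn(i64, i64) -> Option<i64>,
    rebuild: fn(Box<ToyExpr>, Box<ToyExpr>) -> ToyExpr,
) -> ToyExpr {
    let (left, right) = (fold_constants(left), fold_constants(right));
    if let (ToyExpr::Literal(a), ToyExpr::Literal(b)) = (&left, &right) {
        if let Some(value) = op(*a, *b) {
            return ToyExpr::Literal(value);
        }
    }
    rebuild(Box::new(left), Box::new(right))
}

/// Replaces literal-only sub-expressions with their computed value.
pub fn fold_constants(expr: ToyExpr) -> ToyExpr {
    match expr {
        ToyExpr::Add(l, r) => fold_binary(*l, *r, i64::checked_add, ToyExpr::Add),
        ToyExpr::Mul(l, r) => fold_binary(*l, *r, i64::checked_mul, ToyExpr::Mul),
        other => other,
    }
}
```

Under these assumptions, an expression such as `a + (2 * 3)` is rewritten to `a + 6`, so the multiplication is not repeated for every row.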

Datasources

Better support for reading data from remote filesystems (e.g. S3) without caching it locally #907 #1060
Support partitioned datasources #1139 and simplify the integration of other table formats (Delta, Iceberg...)
Improve the performance of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability...)

Runtime / Infrastructure

Migrate to some sort of arrow2-based implementation (see milestone for more details)
Add DataFusion to h2oai/db-benchmark #147
Improve build time #348

Resource Management

Finer-grained control over, and limits on, runtime memory #587 and CPU usage #54

Python Interface

TBD

DataFusion CLI (`datafusion-cli`)

Note: There are some additional thoughts on a datafusion-cli vision in #1096.

Better abstraction between REPL parsing and queries so that commands are separated and handled correctly
Connect to the Statistics subsystem and have the CLI print out more stats for query debugging, etc.
Improved error handling for interactive use and shell scripting usage
Publish to apt, brew, and possibly the NuGet registry so that people can use it more easily
Adopt a shorter name, like dfcli?

Ballista

Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors in the cluster.

Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they remain language-agnostic so that executors can be built in languages other than Rust.

Ballista Roadmap

Move query scheduler into DataFusion

The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute the entire query at once. Instead, it breaks the query down into a directed acyclic graph (DAG) of stages and executes a configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.
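
The stage-by-stage execution described above can be sketched in a few lines of Rust. This is only an illustration under simplifying assumptions, not Ballista's scheduler: `Stage`, `plan_waves`, and `max_concurrent` are hypothetical names, and stages are grouped into sequential "waves" rather than being started as soon as a slot frees up, which is what a real scheduler would more likely do.

```rust
use std::collections::HashSet;

/// A hypothetical query stage: an id plus the ids of the stages whose output it reads.
#[derive(Debug, Clone)]
pub struct Stage {
    pub id: usize,
    pub depends_on: Vec<usize>,
}

/// Groups stages into waves. Every stage in a wave has all of its inputs
/// completed by earlier waves, and no wave holds more than `max_concurrent` stages.
/// Returns an error if no progress can be made (a cycle or an unknown dependency).
pub fn plan_waves(stages: &[Stage], max_concurrent: usize) -> Result<Vec<Vec<usize>>, String> {
    if max_concurrent == 0 {
        return Err("max_concurrent must be at least 1".to_string());
    }
    let mut done: HashSet<usize> = HashSet::new();
    let mut waves = Vec::new();
    while done.len() < stages.len() {
        // Unfinished stages whose inputs are all finished, capped at the limit.
        let ready: Vec<usize> = stages
            .iter()
            .filter(|s| !done.contains(&s.id))
            .filter(|s| s.depends_on.iter().all(|d| done.contains(d)))
            .map(|s| s.id)
            .take(max_concurrent)
            .collect();
        if ready.is_empty() {
            return Err("no runnable stage: cycle or missing dependency".to_string());
        }
        // A real scheduler would dispatch the tasks of these stages to executors here.
        done.extend(ready.iter().copied());
        waves.push(ready);
    }
    Ok(waves)
}
```

For a plan with two scans (stages 0 and 1), a join that reads both (stage 2), and an aggregate over the join (stage 3), a limit of 2 yields the waves `[0, 1]`, `[2]`, `[3]`. The same function could in principle drive threads in a single process or executors in a cluster, which is the motivation for sharing the scheduler.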

Implement execution-time cost-based optimizations based on statistics

After the execution of a query stage, accurate statistics are available for the resulting data. These statistics could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join, it is desirable to load the smaller side of the join into memory, and in some cases we cannot predict which side will be smaller until execution time.
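
As a rough sketch of the hash join example, the scheduler could compare the statistics of the two completed input stages and pick the smaller one as the in-memory build side. The types and the function below (`StageStats`, `BuildSide`, `choose_build_side`) are assumptions made for illustration, not part of Ballista or DataFusion, and the tie-breaking rule is an arbitrary choice.

```rust
/// Statistics observed after a stage has finished (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StageStats {
    pub num_rows: u64,
    pub total_bytes: u64,
}

/// Which join input is loaded into the in-memory hash table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BuildSide {
    Left,
    Right,
}

/// Picks the input with fewer bytes as the build side, using the row count
/// as a tie-breaker. If both inputs are the same size, the left side is kept.
pub fn choose_build_side(left: &StageStats, right: &StageStats) -> BuildSide {
    if (right.total_bytes, right.num_rows) < (left.total_bytes, left.num_rows) {
        BuildSide::Right
    } else {
        BuildSide::Left
    }
}
```

Because the decision uses measured sizes rather than planning-time estimates, it can only be made once both input stages have completed, which is why this optimization belongs in the scheduler rather than the up-front optimizer.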
