Creating Clickhouse Schema from Parquet

This is a small utility for creating Clickhouse database schema from a Parquet file.

This is useful especially for Parquet files with nested schema where it is more natural to expose them as Nested types in the Clickhouse schema.

Usage

$ cargo run -- --help

Usage: schemagen --parquet-path <PARQUET_PATH> --clickhouse-schema-path <CLICKHOUSE_SCHEMA_PATH> \
--table-name <TABLE_NAME> --primary-key <PRIMARY_KEY>

Options:
      --parquet-path <PARQUET_PATH>
          Absolute path of the Parquet file to read the Parquet schema from
      --clickhouse-schema-path <CLICKHOUSE_SCHEMA_PATH>
          Absolute path where the Clickhouse schema will be written to
      --table-name <TABLE_NAME>
          The Clickhouse table name
      --primary-key <PRIMARY_KEY>
          The Clickhouse table primary key
  -h, --help
          Print help

Example

For example, if we have a Parquet file /tmp/p.parquet with the following data:

+---+----+----------+-------+
|id |b   |c         |d      |
+---+----+----------+-------+
|42 |null|{foo, bar}|[{foo}]|
+---+----+----------+-------+

The schema of the data is (this is shown in the Apache Spark schema format since its much easier to read than the full blow Parquet schema):

root
 |-- id: integer (nullable = true)
 |--  b: string (nullable = true)
 |--  c: struct (nullable = true)
 |     |-- a: string (nullable = true)
 |     |-- b: string (nullable = true)
 |--  d: array (nullable = true)
 |     |-- element: struct (containsNull = true)
 |     |    |-- a: string (nullable = true)

If we want to generate the Clickhouse schema to a file named /tmp/clickhouse_schema.sql for a table named T with primary key id, run as follows:

cargo run -- --parquet-path /tmp/p.parquet --clickhouse-schema-path /tmp/clickhouse_schema.sql \
--table-name T --primary-key id

The Clickhouse schema that's generated will be as follows:

drop table if exists T;
create table T (
    id Int32
    , b Nullable(String)
    , c Tuple(
        a Nullable(String)
        , b Nullable(String)
    )
    , d Nested (
        a Nullable(String)
    )
) engine = MergeTree() primary key (id);

The resulting Clickhouse schema can be used to create the table in a number of ways. The simplest is to use the Clickhouse command line client. For example (note that this will drop the table if it exists as can be seen in the generated code above):

cat /tmp/clickhouse_schema.sql | clickhouse-client --multiquery

The clickhouse-client may need additional parameters such as as username and password, depending on the way Clickhouse is configured.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
resources		resources
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

resources

resources

src

src

.gitignore

.gitignore

Cargo.toml

Cargo.toml

LICENSE

LICENSE

README.md

README.md

rustfmt.toml

rustfmt.toml

Repository files navigation

Creating Clickhouse Schema from Parquet

Usage

Example

About

Releases

Packages

Languages

License

ahmedriza/parquet-to-clickhouse-schema

Folders and files

Latest commit

History

Repository files navigation

Creating Clickhouse Schema from Parquet

Usage

Example

About

Resources

License

Stars

Watchers

Forks

Languages