Kafka Connect: Sink connector with data writers and converters #9466

Merged: 2 commits into apache:main on Feb 3, 2024

Conversation

@bryanck (Contributor) commented Jan 13, 2024

This PR is the next stage in submitting the Iceberg Kafka Connect sink connector, and is a follow-up to #8701. It includes the initial sink connector and configuration, along with the data writers and converters. In the interest of reducing the scope of the PR, it does not include the sink task, commit controller, integration tests, or docs.

For reference, the current sink implementation can be found at https://github.com/tabular-io/iceberg-kafka-connect.

}

public static Object extractFromRecordValue(Object recordValue, String fieldName) {
String[] fields = fieldName.split("\\.");
Contributor

We usually avoid using split like this because it breaks for field names that contain a dot (.).

To avoid this, we would normally index the schema to produce Accessor instances for fields, then look up the correct accessor by the full fieldName that is passed in. Is that something that we can do here?

Contributor Author

The field name here is a config used to look up a value in a Kafka record for certain purposes, like dynamic table routing (which is not part of this PR). One solution could be to escape the dots in the name when setting the config, though I felt that was not a common case, so I left it out in the interest of keeping things simpler.

}

public static TaskWriter<Record> createTableWriter(
Table table, String tableName, IcebergSinkConfig config) {
Contributor

Do we expect tableName to be something other than table.name() or table.toString()?

Contributor Author

It could be different. The tableName parameter is used to look up table-specific configuration parameters and is always namespace + name. table.name() can have the catalog name prepended in some cases.

public static TaskWriter<Record> createTableWriter(
Table table, String tableName, IcebergSinkConfig config) {
Map<String, String> tableProps = Maps.newHashMap(table.properties());
tableProps.putAll(config.writeProps());
Contributor

In other engines, we typically use shorter property names for overrides. For example, the table property for format is write.format.default but the write property to override it in Spark is write-format. That avoids some odd cases, like setting a default for a single write in this case.

Would it also make sense to do this for the KC sink?

Contributor Author

I suppose that is possible, though wouldn't we then need to maintain an additional set of properties? I.e., if a new table property is added, we would need to remember to add it to the KC sink as well.
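For illustration, a shorthand override along the lines the reviewer describes might be layered on top of the existing merge as in this sketch; the short key write-format is borrowed from Spark's write option and is not part of the PR's config:

import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.relocated.com.google.common.collect.Maps;

// IcebergSinkConfig is the connector config class from this PR.
static Map<String, String> writeProperties(Table table, IcebergSinkConfig config) {
  Map<String, String> tableProps = Maps.newHashMap(table.properties());
  tableProps.putAll(config.writeProps());

  // Hypothetical short-form override: "write-format" (mirroring Spark's
  // write-format option) overrides the write.format.default table property.
  String writeFormat = config.writeProps().get("write-format");
  if (writeFormat != null) {
    tableProps.put(TableProperties.DEFAULT_FILE_FORMAT, writeFormat);
  }
  return tableProps;
}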

return value;
}

Preconditions.checkState(value instanceof Struct, "Expected a struct type");
Contributor

Why can structs only contain structs or a value and not maps? And vice versa. Is that a KC guarantee?

Contributor Author

This method is used to extract a primitive value from a nested field, given a field name in dot notation. It is used for some configs like the route field name, e.g. home.address.city. Currently this expects the nested levels to be structs.
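For illustration, the dot-notation extraction described here might look like this sketch over Kafka Connect Structs (not the PR's exact code):

import org.apache.kafka.connect.data.Struct;

// Walks a field path such as "home.address.city" through nested Structs and
// returns the leaf value; intermediate levels are expected to be Structs.
static Object extractFromStruct(Struct struct, String fieldName) {
  Object value = struct;
  for (String part : fieldName.split("\\.")) {
    if (!(value instanceof Struct)) {
      return null; // the config points at a non-struct level
    }
    value = ((Struct) value).get(part); // throws if the schema has no such field
  }
  return value;
}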

OutputFileFactory fileFactory =
OutputFileFactory.builderFor(table, 1, System.currentTimeMillis())
.defaultSpec(table.spec())
.operationId(UUID.randomUUID().toString())
Contributor

Is there a reusable ID that we can supply here instead? In Spark this is the write's UUID, or for streaming it is the write's UUID and the epoch ID. If we have an equivalent in KC it would be nice to use it here. (Not a blocker)

Contributor Author

I can't think of anything we could use here instead that would be an improvement.

import org.apache.iceberg.io.OutputFileFactory;
import org.apache.iceberg.io.PartitionedFanoutWriter;

public class PartitionedAppendWriter extends PartitionedFanoutWriter<Record> {
Contributor

Should this live in data since it is using Iceberg's generic Record class?

Contributor Author

PartitionedFanoutWriter and UnpartitionedWriter are in core, perhaps it should sit alongside those?

Contributor

Yes, that sounds good to me. That should avoid people re-creating this implementation for other purposes.

Contributor Author

I attempted to move this, but there is a dependency on InternalRecordWrapper to extract the partition key, and that class unfortunately lives in data.
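For reference, a fanout append writer along these lines builds on PartitionedFanoutWriter and uses InternalRecordWrapper to derive the partition key; the following is a sketch consistent with the discussion, not necessarily the PR's exact class:

import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionKey;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.InternalRecordWrapper;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.FileAppenderFactory;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.OutputFileFactory;
import org.apache.iceberg.io.PartitionedFanoutWriter;

public class PartitionedAppendWriter extends PartitionedFanoutWriter<Record> {

  private final PartitionKey partitionKey;
  private final InternalRecordWrapper wrapper;

  public PartitionedAppendWriter(
      PartitionSpec spec,
      FileFormat format,
      FileAppenderFactory<Record> appenderFactory,
      OutputFileFactory fileFactory,
      FileIO io,
      long targetFileSize,
      Schema schema) {
    super(spec, format, appenderFactory, fileFactory, io, targetFileSize);
    this.partitionKey = new PartitionKey(spec, schema);
    // InternalRecordWrapper converts generic Record values (e.g. timestamps)
    // into the internal representation expected by partition transforms.
    this.wrapper = new InternalRecordWrapper(schema.asStruct());
  }

  @Override
  protected PartitionKey partition(Record row) {
    partitionKey.partition(wrapper.wrap(row));
    return partitionKey;
  }
}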

private final IcebergSinkConfig config;
private final Map<Integer, Map<String, NestedField>> structNameMap = Maps.newHashMap();

public RecordConverter(Table table, IcebergSinkConfig config) {
Contributor

Looks like the purpose of this is to create a new Iceberg generic Record from Kafka's object model (Struct). Is that needed? In Flink and Spark, we use writers that are adapted to the in-memory object model of those engines to avoid copying a record and then writing. I'm not familiar enough with the KC object model to know whether this is needed.

Contributor Author

The main purpose of the converter is to convert the Kafka Connect record to an Iceberg row, and it is also used to detect schema changes for the purpose of schema evolution.
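For illustration, the core of such a conversion might look like this much-simplified sketch that copies only matching top-level fields; the PR's converter additionally detects schema changes for schema evolution:

import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.types.Types.NestedField;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Struct;

// Copies matching top-level fields from a Kafka Connect Struct into an
// Iceberg generic Record; fields missing from the Struct are left null.
static Record convert(Struct struct, org.apache.iceberg.Schema schema) {
  GenericRecord record = GenericRecord.create(schema);
  for (NestedField field : schema.columns()) {
    Field kcField = struct.schema().field(field.name());
    if (kcField != null) {
      record.setField(field.name(), struct.get(kcField));
    }
  }
  return record;
}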

@@ -30,3 +30,30 @@ project(":iceberg-kafka-connect:iceberg-kafka-connect-events") {
useJUnitPlatform()
}
}

project(":iceberg-kafka-connect:iceberg-kafka-connect") {
Contributor

Why the duplicate name? Could this be :iceberg-kafka-connect:iceberg-sink?

Contributor Author

I was following the convention of the Spark projects to some extent, e.g. :iceberg-spark:iceberg-spark-<version>.

Contributor

Okay, so it sounds like the idea is to make the final artifact name iceberg-kafka-connect. That makes sense.

@rdblue (Contributor) left a comment

@bryanck thanks for submitting this! It looks really good after my first pass. Overall, I think it would be easier to get in with a couple of modifications, since this is a pretty large PR. First, I would like to separate some of the utils and tests out along with the changes that only add preconditions. Getting those easy changes in would help us focus time on validating the bigger changes. Second, I think focusing on the append use case would also help us move faster. That would mean adding half the writer classes and less config. Usually smaller PRs can have much faster review turn-around because there aren't as many updates and everyone loses less context between iterations.

@bryanck (Contributor Author) commented Jan 16, 2024

Sure, thanks, I'll scale down this PR.

@bryanck (Contributor Author) commented Jan 16, 2024

I stripped out the delta writers and record converter, along with related tests and config.

Comment on lines +87 to +97
} catch (Exception e) {
LOG.error(
"Unable to create partition spec {}, table {} will be unpartitioned",
partitionBy,
identifier,
e);
spec = PartitionSpec.unpartitioned();
}
Contributor

Why are we recovering from exceptions here?
Personally, I would prefer if the connector hard-failed so that I know I've done something wrong in my connector configuration and have to fix it, rather than having the connector silently default to unpartitioned. Or am I missing some nuance here?

Contributor Author

My thought was to have the sink be more permissive in this case, i.e. if the sink is fanning out to several different tables, don't error out if one of them can't be partitioned, as that can be difficult to recover from. There is room for improvement in error handling to help with recovery, e.g. adding DLQ support for records that can't be processed.

Contributor

I agree that we don't want to fail if the partitioning doesn't work, for example if you expect events to have an event_ts column but it isn't present in all event types. We should still try to make progress rather than ignoring the events or (worse) failing the operation.

Comment on lines 102 to 111
try {
result.set(catalog.loadTable(identifier));
} catch (NoSuchTableException e) {
result.set(
catalog.createTable(
identifier, schema, partitionSpec, config.autoCreateProps()));
}
Contributor

Hmmm, a little concerned about this. I'm worried that creating and evolving tables inside each task of a connector will result in contention issues, particularly for topics with many partitions. I'm curious whether you've seen any such issues.

Contributor

Bit of a crazy idea that I haven't fully thought through in terms of feasibility, but I'm curious whether you folks ever considered delaying committing schema updates until the moment we commit the corresponding data files (potentially even in the same transaction). That could help reduce any (potential) contention issues, because in effect only one task would be performing the schema updates (the same task that is committing data files).

Contributor Author

I considered that and I agree it would have been a nice solution to reduce the contention. The issue is that you need to have the schema's field IDs assigned when you write the data. You need some type of coordination to prevent different workers from using the same ID for different fields. The catalog is acting as that coordinator currently.

writer.write(row);
}
} catch (Exception e) {
throw new DataException(
Contributor

What is the result of throwing here? Does it cause the sink to stop?

Contributor Author

The sink won't stop immediately, but the task will fail. Depending on the platform (Strimzi, Confluent, etc.), the task will restart and the source topic partitions will be rebalanced across the tasks. If the failure is not recoverable, then after a number of retries the sink will fail.

.collect(toList());

// filter out columns that have already been made optional
List<MakeOptional> makeOptionals =
Contributor

You might consider moving these checks into SchemaUpdate.Consumer. You could have that update itself for the current table schema and then check empty().

Contributor Author

That would work, though I'd prefer to make that change when we introduce the record converter which is calling the consumer, to see what that ends up looking like.

update -> updateSchema.addColumn(update.parentName(), update.name(), update.type()));
updateTypes.forEach(update -> updateSchema.updateColumn(update.name(), update.type()));
makeOptionals.forEach(update -> updateSchema.makeColumnOptional(update.name()));
updateSchema.commit();
Contributor

Minor: this whole method can be retried since it calls refresh at the start.

Contributor Author

The whole method is being retried currently (unless I overlooked something).
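For reference, that retry could be expressed with Iceberg's Tasks utility roughly as below; applySchemaUpdates is a stand-in name for the method under discussion and the retry count is illustrative:

import org.apache.iceberg.exceptions.CommitFailedException;
import org.apache.iceberg.util.Tasks;

// Retry the whole method on commit conflicts; the refresh() at the start of
// the method makes it safe to repeat.
Tasks.range(1)
    .retry(2)
    .onlyRetryOn(CommitFailedException.class)
    .run(notUsed -> applySchemaUpdates(table, updates));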


public static PartitionSpec createPartitionSpec(
org.apache.iceberg.Schema schema, List<String> partitionBy) {
if (partitionBy.isEmpty()) {
Contributor

Handle null here, too?

return field.isOptional();
}

public static PartitionSpec createPartitionSpec(
Contributor

This seems really useful. We may want to move it to core. We currently use engines for this but this is a good simple implementation!
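For illustration, a simple version of such a helper, handling null/empty input (per the comment above), identity partitioning, and the singular time transforms, might look like this sketch rather than the PR's exact code:

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;

static final Pattern TRANSFORM = Pattern.compile("(\\w+)\\((.+)\\)");

// Builds a spec from expressions such as "id" or "hour(ts)"; an unpartitioned
// spec is returned when no partitioning is configured.
static PartitionSpec createPartitionSpec(Schema schema, List<String> partitionBy) {
  if (partitionBy == null || partitionBy.isEmpty()) {
    return PartitionSpec.unpartitioned();
  }
  PartitionSpec.Builder builder = PartitionSpec.builderFor(schema);
  for (String expr : partitionBy) {
    Matcher matcher = TRANSFORM.matcher(expr);
    if (matcher.matches()) {
      String column = matcher.group(2).trim();
      switch (matcher.group(1)) {
        case "year":
          builder.year(column);
          break;
        case "month":
          builder.month(column);
          break;
        case "day":
          builder.day(column);
          break;
        case "hour":
          builder.hour(column);
          break;
        default:
          throw new UnsupportedOperationException("Unsupported transform: " + expr);
      }
    } else {
      builder.identity(expr);
    }
  }
  return builder.build();
}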

}
}

Optional<Type> inferIcebergType(Object value) {
Contributor

Style: Iceberg doesn't generally use Optional and just passes null instead.

Contributor Author

I changed these to do null checks instead.

} else if (value instanceof List) {
List<?> list = (List<?>) value;
if (list.isEmpty()) {
return null;
Contributor

Could this default to a list of strings?

Contributor Author

I felt it was better to skip adding it if we can't infer the element type, then let schema evolution add it when some data comes in.
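For illustration, a simplified sketch of inference in the null-returning style discussed here; the handled types and the recursive list case are assumptions for the example:

import java.util.List;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

// Returns the inferred Iceberg type, or null if the type cannot be determined
// (e.g. an empty list), in which case the field is skipped for now and added
// later by schema evolution.
static Type inferIcebergType(Object value) {
  if (value instanceof String) {
    return Types.StringType.get();
  } else if (value instanceof Boolean) {
    return Types.BooleanType.get();
  } else if (value instanceof Long || value instanceof Integer) {
    return Types.LongType.get();
  } else if (value instanceof Double || value instanceof Float) {
    return Types.DoubleType.get();
  } else if (value instanceof List) {
    List<?> list = (List<?>) value;
    if (list.isEmpty()) {
      return null;
    }
    Type elementType = inferIcebergType(list.get(0));
    // element ID 1 is a placeholder; IDs are reassigned when the column is added
    return elementType == null ? null : Types.ListType.ofOptional(1, elementType);
  }
  return null;
}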

@rdblue (Contributor) left a comment

I left a few comments, but there is nothing that I think is major and blocking. Thanks @bryanck! I'll leave it to you to reply and possibly update but I think we can merge this when you're ready.

@rdblue merged commit f4ba90d into apache:main on Feb 3, 2024

@rdblue (Contributor) commented Feb 3, 2024

Thanks, @bryanck! This is looking great and I'm excited to get the next steps in.

Also thanks to @fqaiser94 for reviewing!

@ajantha-bhat (Member) left a comment

I was out of office and couldn't review this earlier. Happy to see the progress.

I have a few comments that can be handled in a follow-up.

Overall LGTM.

ConfigDef.Type.STRING,
null,
Importance.MEDIUM,
"Coordinator threads to use for table commits, default is (cores * 2)");
Member

I think this definition is wrong.
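The fragment above declares what is documented as a thread count as a STRING with a null default, which is presumably the issue. A hedged sketch of a corrected definition follows; the key name and the surrounding configDef variable are illustrative, not the PR's actual names:

// Declare the coordinator thread count as an INT; a null default leaves the
// "cores * 2" fallback to be computed at runtime.
configDef.define(
    "iceberg.coordinator.threads", // hypothetical key name
    ConfigDef.Type.INT,
    null,
    Importance.MEDIUM,
    "Coordinator threads to use for table commits, default is (cores * 2)");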

notUsed -> {
try {
result.set(
catalog.createTable(
Member

We also have to create the namespace if it doesn't exist.

Many catalogs don't support implicit namespaces and expect the namespace to exist before table creation.
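For illustration, a namespace check could precede the create call for catalogs that implement SupportsNamespaces; a sketch assuming the catalog and identifier variables from the surrounding code:

import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.SupportsNamespaces;
import org.apache.iceberg.exceptions.AlreadyExistsException;

// Ensure the namespace exists before creating the table; many catalogs do not
// create namespaces implicitly.
Namespace namespace = identifier.namespace();
if (catalog instanceof SupportsNamespaces
    && !((SupportsNamespaces) catalog).namespaceExists(namespace)) {
  try {
    ((SupportsNamespaces) catalog).createNamespace(namespace);
  } catch (AlreadyExistsException e) {
    // another task created it concurrently; safe to ignore
  }
}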

String transform = matcher.group(1);
switch (transform) {
case "year":
case "years":
Member

I think the plural forms were a carry-over from the Spark transforms and were not per the spec, so we recently added the singular forms to the same Spark class: https://iceberg.apache.org/spec/#partition-transforms

I don't think we need to support the years, months, days, hours syntax, as it is not per the spec and this connector has nothing to do with Spark.
