<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[The Cloud Native Data Engineer]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://thecloudnativedataengineer.com/</link><image><url>https://thecloudnativedataengineer.com/favicon.png</url><title>The Cloud Native Data Engineer</title><link>https://thecloudnativedataengineer.com/</link></image><generator>Ghost 4.12</generator><lastBuildDate>Sun, 15 Mar 2026 21:54:11 GMT</lastBuildDate><atom:link href="https://thecloudnativedataengineer.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Innovation in data governance: ETL visualization using graph databases]]></title><description><![CDATA[Tackling data warehouse documentation with Neo4j and Oracle Data Integrator. 
]]></description><link>https://thecloudnativedataengineer.com/innovation-in-data-governance-etl-visualization-using-graph-databases/</link><guid isPermaLink="false">624ea1de3bbe3200014f9da2</guid><category><![CDATA[Neo4j]]></category><category><![CDATA[Data Lineage]]></category><category><![CDATA[Data Governance]]></category><category><![CDATA[Graph Database]]></category><dc:creator><![CDATA[Kevin Stobbelaar]]></dc:creator><pubDate>Thu, 07 Apr 2022 08:43:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1516738901171-8eb4fc13bd20?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fHdvcmxkJTIwbWFwJTIwcGluc3xlbnwwfHx8fDE2NDkzMjA1MzY&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1516738901171-8eb4fc13bd20?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fHdvcmxkJTIwbWFwJTIwcGluc3xlbnwwfHx8fDE2NDkzMjA1MzY&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Innovation in data governance: ETL visualization using graph databases"><p><em>Disclaimer: this is a repost of something I did back in 2016</em></p>
<!--kg-card-end: markdown--><p>During a recent project innovation sprint at a customer, we decided to tackle our customer&#x2019;s data warehouse documentation problem. It was hard to get proper insight into the data streams at hand due to various data sources, changing standards and legacy code. Take, for example, a random field: in which reports is it being used, to which source can it be tracked, which transformations have been applied, etc.?<br><br>Since our aim was to thoroughly reshape the infrastructure, we decided to capture this kind of information because it would allow us to better gauge the impact of our modifications. During the innovation sprint, we developed a system that builds this information and makes it queryable.</p><!--kg-card-begin: markdown--><h2 id="oracle-data-integrator-and-neo4j-a-love-story">Oracle Data Integrator and Neo4j: a love story</h2>
<!--kg-card-end: markdown--><p>Oracle Data Integrator (ODI) is a tool that allows you to easily set up ETL flows between various database systems and technologies. Interfaces are probably the most important concept in ODI. They define the data flows between one or more sources and targets.</p><p>All metadata ODI uses to execute the various ETL flows is stored in a database. From the structure of the various data sources to the definition of the interfaces and the connections between various fields: everything is available via this metadata. Following some research into the underlying structure, we were able to extract the necessary information by way of a few simple queries.</p><p>After extracting this info, the next hurdle was to expose it in a form that can be queried. Because our documentation will be used not just by people with a technical background, a clear visualization is necessary. Filtering and targeted searches are also very important, due to the large number of fields and their interrelatedness.</p><p>When thinking of persisting and presenting relations, one almost always ends up at graph databases. These databases use graphs to shape the connections between various concepts. One of the best-known graph databases is Neo4j; its community edition is freely available.</p><p>The results of our queries on the ODI repository were saved as CSV files. Neo4j makes it very easy to read those CSVs and to transform them to a graph structure. With larger datasets, the import tends to take a long time though. In that case, directly loading the data through the Neo4j Java API may be more performant.</p><!--kg-card-begin: markdown--><h2 id="the-basics-schemas-tables-fields-and-mappings">The basics: schemas, tables, fields and mappings</h2>
<!--kg-card-end: markdown--><p>To start, we can map the tables and their corresponding fields for every schema in the data warehouse. These become the nodes on the graph. The relations between this first set of nodes are quite simple: a field belongs to a table, a table belongs to a schema.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_1.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="639" height="442" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_1.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_1.png 639w"><figcaption><em>A simple example. The table &#x2018;CUSTOMER&#x2019; belongs to the schema of customers and has four fields: name, dateOfBirth, address and id.</em></figcaption></figure><p>Based on the existing set of nodes and relations, it&#x2019;s not possible to make connections between the various fields. To solve this, we need an extra type of node: the interface.</p><p>The interface symbolizes the operation in which one or more fields from the source tables are mapped onto a field in the target table. From this, you can immediately extract the various relations. The USED_IN relation is used to show that a table is used as a source in an interface. 
Similarly, the FILLS relation is used to show that a table is used as a target in an interface.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_2.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="639" height="327" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_2.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_2.png 639w"><figcaption><em>CUSTOMER and ADDRESSES are the source tables for the interface POP_ADDRESSES_CUSTOMERS. This interface then fills the target table CUSTOMER_DIMENSION.</em></figcaption></figure><p>By way of the interface node, we can already extract a large amount of information from the graph we&#x2019;ve built. But we can go a step further than that, namely by extending the interface node to the level of the fields so we can shape the relations between the various fields and tables. This results in the mapping node.</p><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_3.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="638" height="309" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_3.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_3.png 638w"></figure><!--kg-card-begin: markdown--><h2 id="unveiling-endless-possibilities">Unveiling endless possibilities</h2>
<!--kg-card-end: markdown--><p>The real power of the graph shows when we want to look into the main structure of the data warehouse or when we want to make targeted impact analyses.</p><!--kg-card-begin: markdown--><h3 id="mapping-the-structure-of-the-metadata">Mapping the structure of the metadata</h3>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_4.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="639" height="297" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_4.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_4.png 639w"></figure><p>The defined concepts and relations allow us to easily map the structure of the data warehouse. The image above clearly shows the structure of the data warehouse for a specific subject area. The data starts from the source on the left and flows via the various ETL processes (created using mappings) to the third layer of the data warehouse on the right.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_5.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="532" height="494"><figcaption><em>The flow of a field on the second layer to a field on the third one.</em></figcaption></figure><p>A question often asked by end users is what the fields included in a report actually mean. If, for example, the report contains a field that denotes the profit of a certain month, then it&#x2019;s interesting to know where this field comes from and how its value is being calculated through the full ETL flow. We can simply deduce this info from the graph we&#x2019;ve put together. The image above shows an example of the flow of a field from the second layer to the third layer of the data warehouse.</p><!--kg-card-begin: markdown--><h2 id="impact-analysis">Impact analysis</h2>
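<p>Both the field-level lineage tracing described above and impact analysis boil down to a reachability query over the graph. Inside Neo4j you would express this as a Cypher path query; as a plain-Python illustration, here is a toy sketch over a hypothetical edge list (all node and mapping names are made up):</p>

```python
from collections import defaultdict, deque

# Toy edge list following the model above: field -> mapping -> field.
# All names are hypothetical, for illustration only.
EDGES = [
    ("L1.T73101.AMOUNT", "MAP_PROFIT"),
    ("MAP_PROFIT", "L2.FACT.PROFIT"),
    ("L2.FACT.PROFIT", "MAP_PROFIT_DIM"),
    ("MAP_PROFIT_DIM", "L3.DIM.PROFIT_MONTH"),
]

def downstream(start, edges):
    """Breadth-first search: everything fed, directly or indirectly, by `start`."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Everything impacted if L1.T73101.AMOUNT were no longer delivered:
impacted = downstream("L1.T73101.AMOUNT", EDGES)
```

<p>Running the same search over the reversed edge list gives the upstream lineage of a report field instead.</p>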
<!--kg-card-end: markdown--><p>It frequently happens that a data model in the source applications is being changed, that a certain data source is no longer included or that the structure of a source table is modified. In these cases, it&#x2019;s convenient for analysts and developers to know what the impact is on the current data warehouse structure.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_6.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="640" height="294" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_6.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_6.png 640w"><figcaption><em>Example of an impact analysis</em></figcaption></figure><p>Using the generated graphs, it&#x2019;s possible to get a clear view of this impact. The above image shows the flow for table 73101. It begins on the first layer of the data warehouse and flows all the way to the third dimensional layer. The graph clearly shows which mappings and tables are used to enable this specific flow.</p><p>The graph allows us to clearly and quickly visualize what the impact would be if we were to stop delivering table 73101 from the source: it&#x2019;s immediately clear which mappings need to be changed and which tables will no longer be loaded completely.</p>]]></content:encoded></item><item><title><![CDATA[Capture MongoDB Change Events with Debezium and Kafka Connect]]></title><description><![CDATA[Still capturing MongoDB change events with the oplog? Try this alternative approach with the help of Kafka Connect and Debezium. 
]]></description><link>https://thecloudnativedataengineer.com/capture-mongodb-change-event-with-debezium-and-kafka-connect/</link><guid isPermaLink="false">62362b5d3bbe3200014f9b45</guid><category><![CDATA[Apache Kafka]]></category><category><![CDATA[Debezium]]></category><category><![CDATA[Kafka Connect]]></category><category><![CDATA[MongoDB]]></category><category><![CDATA[Streamsets Datacollector]]></category><category><![CDATA[Change Data Capture]]></category><dc:creator><![CDATA[Kevin Stobbelaar]]></dc:creator><pubDate>Mon, 21 Mar 2022 18:02:32 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1506355683710-bd071c0a5828?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDI5fHxzdHJlYW18ZW58MHx8fHwxNjQ3NzE3MjM4&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="streaming-mongodb-oplog-records-to-kafka">Streaming MongoDB Oplog Records to Kafka</h2>
<!--kg-card-end: markdown--><img src="https://images.unsplash.com/photo-1506355683710-bd071c0a5828?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDI5fHxzdHJlYW18ZW58MHx8fHwxNjQ3NzE3MjM4&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Capture MongoDB Change Events with Debezium and Kafka Connect"><p>A few years ago I had to come up with a system to stream operational data to a Data Warehouse for (near) real-time analytics. The operational data was living in a MongoDB. Part of the resulting architecture looked like this:</p><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/03/initial_arch-1.jpg" class="kg-image" alt="Capture MongoDB Change Events with Debezium and Kafka Connect" loading="lazy" width="1508" height="482" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/03/initial_arch-1.jpg 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2022/03/initial_arch-1.jpg 1000w, https://thecloudnativedataengineer.com/content/images/2022/03/initial_arch-1.jpg 1508w" sizes="(min-width: 720px) 720px"></figure><p>To have real-time data, we decided to apply Change Data Capture on top of the MongoDB. We stumbled upon <a href="https://streamsets.com/products/dataops-platform/data-collector-engine/">Streamsets Datacollector </a>to help us with this. It comes with an out-of-the-box integration to capture MongoDB change events by processing the MongoDB oplog. The MongoDB oplog is a dedicated collection in the local database that keeps track of every operation taking place on a collection in the MongoDB instance. With the help of Streamsets, it was child&apos;s play to publish the oplog records into a Kafka topic. </p><p>This all works fine, but working with oplog records has one main drawback. 
Let&apos;s take a look at an update operation and how this operation is reflected in an oplog record.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">rs0:PRIMARY&gt; db.product.updateOne( { id: &quot;1234&quot; },  { $set: { &quot;size&quot;: &quot;L&quot;, status: &quot;P&quot; }})
{ &quot;acknowledged&quot; : true, &quot;matchedCount&quot; : 1, &quot;modifiedCount&quot; : 1 }

rs0:PRIMARY&gt; db.product.find()
{ &quot;_id&quot; : ObjectId(&quot;5e05d3573097942bf4e37a60&quot;)
 , &quot;id&quot; : &quot;1234&quot;, &quot;label&quot; : &quot;dummy&quot;, &quot;size&quot; : &quot;L&quot;, &quot;status&quot; : &quot;P&quot; }
 
 -- In the oplog
{ &quot;ts&quot; : Timestamp(1577440106, 1)
, &quot;t&quot; : NumberLong(19)
, &quot;h&quot; : NumberLong(&quot;4817342222804408742&quot;)
, &quot;v&quot; : 2
, &quot;op&quot; : &quot;u&quot;
, &quot;ns&quot; : &quot;company.product&quot;
, &quot;ui&quot; : UUID(&quot;651f7a10-6eea-4b3b-9afe-178a5b7c297e&quot;)
, &quot;o2&quot; : { &quot;_id&quot; : ObjectId(&quot;5e05d3573097942bf4e37a60&quot;) }
, &quot;wall&quot; : ISODate(&quot;2022-03-20T09:48:26.279Z&quot;)
, &quot;o&quot; : { &quot;$v&quot; : 1
        , &quot;$set&quot; : { &quot;size&quot; : &quot;L&quot;, &quot;status&quot; : &quot;P&quot; } } }
</code></pre>
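<p>To make the drawback concrete: reconstructing the full document from such a record means replaying the delta in the &quot;o&quot; field against the last state we captured for that record. A minimal sketch that handles only top-level $set and $unset (the real operator surface is far larger, which is exactly the pain point):</p>

```python
def apply_delta(document, delta):
    """Replay a (top-level) $set/$unset oplog delta onto the last known state."""
    updated = dict(document)
    updated.update(delta.get("$set", {}))  # overwrite or add fields
    for field in delta.get("$unset", {}):
        updated.pop(field, None)           # drop removed fields
    return updated

# Last state we captured for the record, before the update above:
previous = {"_id": "5e05d3573097942bf4e37a60", "id": "1234", "label": "dummy"}
current = apply_delta(previous, {"$set": {"size": "L", "status": "P"}})
```

<p>And this still ignores array operators, positional updates and nested paths.</p>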
<!--kg-card-end: markdown--><p>The actual <em>change </em>is reflected in the &quot;o&quot; field of the oplog record. There&apos;s no indication of the full record before or after this operation. If we want to know the actual state of the full record, we will need to merge this individual operation with the history of changes we captured for the same record. Certainly not unmanageable, but supporting all the different ways of updating a record in MongoDB takes some time. And I&apos;m still not talking about handling updates to nested documents...</p><!--kg-card-begin: markdown--><h2 id="time-to-call-batman-and-robin-or-in-this-case-debezium-and-kafka-connect">Time to call Batman and Robin, or in this case Debezium and Kafka Connect</h2>
<!--kg-card-end: markdown--><p>For another project I did some research into tools that are capable of performing change data capture on top of Oracle databases. When looking for a tool that didn&apos;t use Oracle&apos;s logminer process, I quickly got familiar with the <a href="https://debezium.io/">Debezium</a> project. The description on the top of the site is quite promising:</p><!--kg-card-begin: markdown--><blockquote>
<p>Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.</p>
</blockquote>
<!--kg-card-end: markdown--><p>And even more favorable: they have a MongoDB connector! Taking a deep dive into the connector&apos;s documentation, it turns out that Debezium is not using the MongoDB oplog but another feature called <a href="https://docs.mongodb.com/manual/changeStreams/">Change Streams</a>. MongoDB itself recommends using this feature instead of the oplog. By default, these change streams only capture the actual change, or the resulting delta, of the operation. However, you can configure them to return the full document as it looks after the operation. </p><p>In the remainder of this post I will guide you through the setup of Debezium together with Kafka Connect to capture change events on top of MongoDB. </p><!--kg-card-begin: markdown--><h2 id="hands-on-implementation-of-debezium-and-kafka-connect">Hands-on implementation of Debezium and Kafka Connect</h2>
<!--kg-card-end: markdown--><p>Before we really get started, there are some prerequisites:</p><!--kg-card-begin: markdown--><ul>
<li>a working and accessible Kubernetes cluster</li>
<li>the Strimzi Kafka Operator installed on the K8S cluster</li>
<li>a MongoDB instance</li>
</ul>
<!--kg-card-end: markdown--><p>I&apos;m using the Strimzi Kafka Operator not only because it makes deploying a Kafka cluster easy; it also lets me configure Kafka Connect clusters and Kafka Connect connectors as CRD instances. And surprise, surprise, Kafka Connect happens to be one of the ways to set up Debezium. </p><p>We will need three things to set up a Debezium integration with the help of Kafka Connect: a Docker image for the Kafka Connect workers (1), a Kafka Connect Cluster specification (2), and a Kafka Connect Connector specification (3). The Kafka Connect Connector will hold the configuration of the Debezium integration with MongoDB. </p><!--kg-card-begin: markdown--><h4 id="1-a-dockerfile-for-the-kafka-connect-worker-pods">1. A Dockerfile for the Kafka Connect worker Pods</h4>
<!--kg-card-end: markdown--><p>The worker pods of the Kafka Connect cluster need access to the appropriate libraries to use Debezium and Debezium&apos;s MongoDB connector. For this, we can extend a Dockerfile, provided by Strimzi, with the plugin&apos;s artifacts. Download the plugin from <a href="https://docs.confluent.io/5.5.1/connect/debezium-connect-mongodb/index.html#install-the-connector-manually">the Confluent pages</a> and add it to a new directory &quot;plugins&quot;. The resulting Dockerfile looks like this:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">FROM strimzi/kafka:latest-kafka-2.6.0

COPY ./plugins/ /opt/kafka/plugins/
USER 1001

</code></pre>
<!--kg-card-end: markdown--><p>Build and push the resulting Docker image to the registry of your preference. I&apos;m using my personal registry on Docker Hub. </p><!--kg-card-begin: markdown--><h4 id="2-a-kafka-connect-cluster-specification">2. A Kafka Connect Cluster Specification</h4>
<!--kg-card-end: markdown--><p>Secondly, we need to set up our Kafka Connect cluster. For this we can use the KafkaConnect CRD, imported via the Strimzi Kafka Operator. Most of the configuration is left at its defaults; make sure to point the image reference at the Docker image you pushed in the previous step. </p><!--kg-card-begin: markdown--><pre><code class="language-yaml">apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: kstobbel-connect-cluster
  annotations:
    strimzi.io/use-connector-resources: &quot;true&quot;
spec:
  replicas: 3
  image: kstobbel/kafka-connect-mongodb:1.0.1
  bootstrapServers: kafka-kafka-bootstrap:9092
  config:
    group.id: kstobbel-connect-cluster
    offset.storage.topic: kstobbel-connect-cluster-offsets
    config.storage.topic: kstobbel-connect-cluster-configs
    status.storage.topic: kstobbel-connect-cluster-status
    key.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: true
    value.converter.schemas.enable: true
    config.storage.replication.factor: 3
    offset.storage.replication.factor: 3
    status.storage.replication.factor: 3
  logging:
    type: inline
    loggers:
      log4j.rootLogger: &quot;INFO&quot;

</code></pre>
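<p>Note that with schemas.enable set to true on both converters, every message Kafka Connect produces is wrapped in a schema/payload envelope, so consumers typically unwrap the payload first. A sketch with a toy, heavily abbreviated record:</p>

```python
import json

# Toy schema-enabled Connect message; the schema part is abbreviated.
raw = json.dumps({
    "schema": {"type": "struct", "fields": []},
    "payload": {"op": "c", "after": '{"input": "abc12345"}'},
})

def unwrap(message):
    """Return only the payload of a schema-enabled Connect JSON message."""
    return json.loads(message)["payload"]

payload = unwrap(raw)
```
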
<!--kg-card-end: markdown--><p>After applying this yaml specification to the cluster, three new pods should be starting in the current namespace. </p><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-21-at-17.25.56.png" class="kg-image" alt="Capture MongoDB Change Events with Debezium and Kafka Connect" loading="lazy" width="938" height="126" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/03/Screenshot-2022-03-21-at-17.25.56.png 600w, https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-21-at-17.25.56.png 938w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><h4 id="3-a-kafka-connect-connector-specification">3. A Kafka Connect Connector Specification</h4>
<!--kg-card-end: markdown--><p>Last but not least we need to configure our Debezium MongoDB connector. Again using the Strimzi Kafka operator, all we need to do is configure the Kafka Connector CRD:</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: &quot;cdc-connector&quot;
  namespace: default
  labels:
    strimzi.io/cluster: kstobbel-connect-cluster
spec:
  class: io.debezium.connector.mongodb.MongoDbConnector
  tasksMax: 1
  config:
    mongodb.hosts: mongodb-headless:27017
    mongodb.name: products
    snapshot.mode: never
    collection.include.list: company.product
</code></pre>
<!--kg-card-end: markdown--><p>Credentials are left out on purpose. </p><!--kg-card-begin: markdown--><h4 id="bringing-it-all-together">Bringing it all together</h4>
<!--kg-card-end: markdown--><p>We are ready to deploy our Debezium connector, so apply the manifest! Once deployed, one of the Kafka Connect workers is instructed to set up the MongoDB integration. If we take a closer look at the logs of this worker, we see a lot of errors in the console:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">kstobbel-connect-cluster-connect 2022-03-21 16:38:09,030 
WARN [Producer clientId=connector-producer-cdc-connector-0] 
Error while fetching metadata with correlation id 509 : 
{products.company.product=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient) 
[kafka-producer-network-thread connector-producer-cdc-connector-0]
</code></pre>
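<p>The topic name in that error is not arbitrary: the Debezium MongoDB connector derives it from the connector configuration as the logical name (mongodb.name), the database and the collection, joined with dots. A one-line helper to see which topic to create:</p>

```python
def debezium_topic(logical_name, database, collection):
    """Debezium MongoDB topic naming: logicalName.databaseName.collectionName."""
    return f"{logical_name}.{database}.{collection}"

# With mongodb.name=products and collection.include.list=company.product:
topic = debezium_topic("products", "company", "product")
```
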
<!--kg-card-end: markdown--><p>The worker pod seems to have issues with the Kafka topic. After manual creation of the Kafka topic (in our case &quot;products.company.product&quot;), we should be fine. Shall we insert some dummy data into our MongoDB collection? </p><!--kg-card-begin: markdown--><pre><code class="language-bash">db.product.insertOne({&quot;input&quot;: &quot;abc12345&quot;})
</code></pre>
<!--kg-card-end: markdown--><p>The payload part of the corresponding Kafka message looks like this (I&apos;m leaving out the schema definition of the message):</p><!--kg-card-begin: markdown--><pre><code class="language-json">{
    &quot;after&quot;: &quot;{\&quot;_id\&quot;: {\&quot;$oid\&quot;: \&quot;6238a9bd5a957ffd1073cac6\&quot;},\&quot;input\&quot;: \&quot;abc12345\&quot;}&quot;,
    &quot;patch&quot;: null,
    &quot;filter&quot;: null,
    &quot;updateDescription&quot;: null,
    &quot;source&quot;: {
        &quot;version&quot;: &quot;1.8.1.Final&quot;,
        &quot;connector&quot;: &quot;mongodb&quot;,
        &quot;name&quot;: &quot;products&quot;,
        &quot;ts_ms&quot;: 1647880637000,
        &quot;snapshot&quot;: &quot;false&quot;,
        &quot;db&quot;: &quot;company&quot;,
        &quot;sequence&quot;: null,
        &quot;rs&quot;: &quot;rs0&quot;,
        &quot;collection&quot;: &quot;product&quot;,
        &quot;ord&quot;: 1,
        &quot;h&quot;: null,
        &quot;tord&quot;: null,
        &quot;stxnid&quot;: null,
        &quot;lsid&quot;: null,
        &quot;txnNumber&quot;: null
    },
    &quot;op&quot;: &quot;c&quot;,
    &quot;ts_ms&quot;: 1647880637552,
    &quot;transaction&quot;: null
}
</code></pre>
<!--kg-card-end: markdown--><p>So far so good, the insert operation is captured by the Kafka Connector and looks familiar. Let&apos;s try an update operation.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">db.product.updateMany({&quot;input&quot;: &quot;abc12345&quot;}, {$set: {&quot;input&quot;: &quot;def456&quot;, &quot;output&quot;: &quot;changed&quot;}})
</code></pre>
<!--kg-card-end: markdown--><p>The resulting Kafka message:</p><!--kg-card-begin: markdown--><pre><code class="language-json">{
    &quot;after&quot;: &quot;{\&quot;_id\&quot;: {\&quot;$oid\&quot;: \&quot;6238a9fe5a957ffd1073cac7\&quot;},\&quot;input\&quot;: \&quot;def456\&quot;,\&quot;output\&quot;: \&quot;changed\&quot;}&quot;,
    &quot;patch&quot;: null,
    &quot;filter&quot;: null,
    &quot;updateDescription&quot;: {
        &quot;removedFields&quot;: null,
        &quot;updatedFields&quot;: &quot;{\&quot;input\&quot;: \&quot;def456\&quot;, \&quot;output\&quot;: \&quot;changed\&quot;}&quot;,
        &quot;truncatedArrays&quot;: null
    },
    &quot;source&quot;: {
        &quot;version&quot;: &quot;1.8.1.Final&quot;,
        &quot;connector&quot;: &quot;mongodb&quot;,
        &quot;name&quot;: &quot;products&quot;,
        &quot;ts_ms&quot;: 1647881865000,
        &quot;snapshot&quot;: &quot;false&quot;,
        &quot;db&quot;: &quot;company&quot;,
        &quot;sequence&quot;: null,
        &quot;rs&quot;: &quot;rs0&quot;,
        &quot;collection&quot;: &quot;product&quot;,
        &quot;ord&quot;: 1,
        &quot;h&quot;: null,
        &quot;tord&quot;: null,
        &quot;stxnid&quot;: null,
        &quot;lsid&quot;: null,
        &quot;txnNumber&quot;: null
    },
    &quot;op&quot;: &quot;u&quot;,
    &quot;ts_ms&quot;: 1647881865582,
    &quot;transaction&quot;: null
}
</code></pre>
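<p>Since Debezium serializes the document as a JSON string inside the event, parsing the &quot;after&quot; field is simply a second json.loads. A sketch against a trimmed-down copy of the payload above:</p>

```python
import json

# Trimmed copy of the update event payload shown above.
event_payload = {
    "op": "u",
    "after": '{"_id": {"$oid": "6238a9fe5a957ffd1073cac7"},'
             '"input": "def456","output": "changed"}',
}

# The event itself was already decoded once; decode the document string too.
after_doc = json.loads(event_payload["after"])
```
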
<!--kg-card-end: markdown--><p>Cool! Not only does the message contain the update details, it also includes the state of the record after applying the operation. No more need to do manual merges to construct the full record body. True, we still need to parse the &quot;after&quot; field of the payload, but I assume you can figure this out. </p><!--kg-card-begin: markdown--><h2 id="final-remarks">Final Remarks</h2>
<!--kg-card-end: markdown--><p>To conclude this post I want to highlight a few things:</p><!--kg-card-begin: markdown--><ul>
<li>Using the Strimzi operator brings a declarative approach to defining Kafka Connect integrations (instead of performing manual API calls)</li>
<li>Having the &quot;after&quot; state of the record in the change event simplifies your data pipelines</li>
<li>Replacing a Data Pipeline Orchestrator (like Streamsets Datacollector or Apache NiFi) with something like Kafka Connect might be limiting at first sight in terms of observability (what is happening with the connector). Might be something for one of the next posts!</li>
</ul>
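<p>On that observability point: Kafka Connect does expose connector and task state through its REST API (the per-connector status endpoint). A sketch that flags failed tasks, run here against a canned response of that shape instead of a live cluster:</p>

```python
import json

# Canned response in the shape of Kafka Connect's connector status endpoint
# (GET /connectors/{name}/status), trimmed to the fields used below.
status_response = json.dumps({
    "name": "cdc-connector",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [{"id": 0, "state": "FAILED", "worker_id": "10.0.0.1:8083"}],
})

def failed_task_ids(status_json):
    """Return the ids of tasks that are not in RUNNING state."""
    status = json.loads(status_json)
    return [t["id"] for t in status["tasks"] if t["state"] != "RUNNING"]
```
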
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Scaling Kafka consumers with KEDA]]></title><description><![CDATA[Start scaling your applications based on functional triggers, rather than technical and/or infrastructural triggers. Learn how KEDA, the event-driven scaling platform for K8S, can improve your scaling maturity. ]]></description><link>https://thecloudnativedataengineer.com/scaling-kafka-consumers-with-keda/</link><guid isPermaLink="false">6141c7b63bbe3200014f9999</guid><category><![CDATA[Apache Kafka]]></category><category><![CDATA[Scalability]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[KEDA]]></category><dc:creator><![CDATA[Kevin Stobbelaar]]></dc:creator><pubDate>Sat, 19 Mar 2022 18:41:33 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1506159679421-5a01dbc2f2e0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEzMXx8dHJhZmZpYyUyMGphbXxlbnwwfHx8fDE2MzE3MDA5MDM&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1506159679421-5a01dbc2f2e0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEzMXx8dHJhZmZpYyUyMGphbXxlbnwwfHx8fDE2MzE3MDA5MDM&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Scaling Kafka consumers with KEDA"><p>Guilty. I&apos;m indeed a member of that part of society that might build up frustration because of traffic jams. I often wonder about ways of reducing traffic jams in our little country, Belgium. One seemingly simple solution often comes to mind: <em>just </em>add more traffic lanes and all problems are gone, aren&apos;t they?</p><p>Why do I share this with you, you ask? During the last few years working as a data engineer with technologies such as Apache Kafka and other streaming platforms, the following traffic jam metaphor crossed my path a number of times. 
Think of the cars as Kafka messages, the lanes as Kafka consumers, and the road as a Kafka topic. Adding new lanes<em> to process </em>more cars suddenly becomes a lot more appealing. </p><p>In this post I will highlight the upsides of scaling Kafka consumers based on functional/application triggers, and showcase an implementation of functional triggers with <a href="https://keda.sh/">KEDA, Kubernetes Event-driven Autoscaling. </a></p><!--kg-card-begin: markdown--><h1 id="scaling-kafka-consumers-on-kubernetes">Scaling Kafka consumers on Kubernetes</h1>
<!--kg-card-end: markdown--><p>Let&apos;s start by implementing a dead simple Kafka consumer application in Python that simulates heavy processing of incoming Kafka messages, or something <em>similar</em>: sleeping for 60 seconds. </p><!--kg-card-begin: markdown--><pre><code class="language-python">from kafka import KafkaConsumer
from datetime import datetime
import time

def print_now():
  now = datetime.now()
  current_time = now.strftime(&quot;%H:%M:%S&quot;)
  print(&quot;Time = &quot;, current_time)

consumer = KafkaConsumer(&apos;lazy-input-topic&apos;,group_id=&apos;lazy-consumer-group&apos;, bootstrap_servers=&apos;xyz:9092&apos;)

for msg in consumer:
  print_now()
  print(msg)
  time.sleep(60)
</code></pre>
<!--kg-card-end: markdown--><p>Before deploying this as a Docker container on a Kubernetes cluster, we can run it locally and send some messages to the Kafka cluster. First, create a topic with more than one partition (important for the remainder of this post). Start the Python consumer and watch it start consuming messages. </p><!--kg-card-begin: markdown--><pre><code class="language-bash">./kafka-topics --create --topic lazy-input-topic --partitions 10 --bootstrap-server xyz:9092
</code></pre>
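Why do the partitions matter so much? Within a consumer group, Kafka assigns each partition to at most one group member, so any consumers beyond the partition count simply sit idle. A quick sketch of that constraint in plain Python (real Kafka assignors such as range, round-robin, or sticky differ in detail, but the cap is the same):

```python
def assign_partitions(num_partitions: int, num_consumers: int) -> dict[int, list[int]]:
    """Round-robin style sketch: each partition goes to exactly one consumer."""
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment

# 10 partitions, 4 consumers: everyone gets work.
print(assign_partitions(10, 4))

# 10 partitions, 12 consumers: two consumers are left without a partition.
idle = [c for c, parts in assign_partitions(10, 12).items() if not parts]
print(len(idle))
```

This is why the topic above is created with 10 partitions: it leaves headroom to scale up to 10 parallel consumers later on.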
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2021/09/image.png" class="kg-image" alt="Scaling Kafka consumers with KEDA" loading="lazy" width="1034" height="670" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2021/09/image.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2021/09/image.png 1000w, https://thecloudnativedataengineer.com/content/images/2021/09/image.png 1034w" sizes="(min-width: 720px) 720px"></figure><p>Great. All set to start scaling. </p><!--kg-card-begin: markdown--><h1 id="getting-started-with-keda-event-driven-scaling-for-kubernetes">Getting started with KEDA, event-driven scaling for Kubernetes</h1>
<!--kg-card-end: markdown--><blockquote><strong>KEDA</strong> is a <a href="https://kubernetes.io">Kubernetes</a>-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed.</blockquote><p>Like most cloud native projects, you can easily deploy <a href="https://keda.sh/">KEDA</a> on your own Kubernetes cluster with the help of its <a href="https://helm.sh/">Helm</a> chart. </p><p>The following commands will get you going:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">helm repo add kedacore https://kedacore.github.io/charts
helm repo update

kubectl create namespace keda
helm install keda kedacore/keda --namespace keda
</code></pre>
<!--kg-card-end: markdown--><p>KEDA is implemented following <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">the Kubernetes Operator pattern</a>. This means that upon installing KEDA your Kubernetes cluster gets extended with four Custom Resource Definitions (CRDs): ScaledObjects, ScaledJobs, TriggerAuthentications, and ClusterTriggerAuthentications. </p><p>One more important thing to mention before we dive in is the KEDA <a href="https://keda.sh/docs/2.4/scalers/">scalers</a>. These are the integrations on which KEDA bases its decisions to scale Kubernetes components. Obviously for this post we&apos;re interested in the Kafka integration, but the Azure Blob Storage scaler, for example, also seems worthy of some exploration time in the future!</p><p>Let&apos;s get going. We need to configure a ScaledObject and apply it to our Kubernetes cluster.</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: lazy-consumer
  pollingInterval: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-kafka-bootstrap.default:9092
      consumerGroup: lazy-consumer-group
      topic: lazy-input
      # Optional
      lagThreshold: &quot;50&quot;
      offsetResetPolicy: latest

</code></pre>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><pre><code class="language-bash">kubectl apply -f scaled-object.yaml
</code></pre>
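Before watching it in action, it helps to have some intuition for how the <code>lagThreshold</code> in the ScaledObject drives scaling. KEDA exposes the consumer group lag as a metric, and the resulting replica target is roughly one consumer per <code>lagThreshold</code> messages of total lag, capped by the topic's partition count (and any configured maxReplicaCount). A rough back-of-the-envelope sketch using the values from this post's setup (the exact computation is performed by the HPA from the scaler's metric, so treat this as an approximation):

```python
import math

def target_replicas(total_lag: int, lag_threshold: int = 50,
                    partitions: int = 10, max_replicas: int = 100) -> int:
    """Roughly how the Kafka scaler sizes the deployment: one replica per
    lag_threshold messages of lag, never more than there are partitions."""
    if total_lag <= 0:
        return 0  # with no lag, KEDA can scale the deployment to zero
    return min(math.ceil(total_lag / lag_threshold), partitions, max_replicas)

print(target_replicas(0))     # 0  -> scaled down to zero
print(target_replicas(120))   # 3  -> ceil(120 / 50)
print(target_replicas(5000))  # 10 -> capped at the partition count
```

Note the partition cap: this is the same constraint as in the lane metaphor, since an eleventh consumer on a ten-partition topic would have nothing to read.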
<!--kg-card-end: markdown--><p>One of the first things I noticed is that the existing deployment in which my lazy-consumer pod is running is scaled down to zero replicas by KEDA. This is confirmed by consulting the logs of the KEDA operator pod:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">keda-operator 1.6476931939554265e+09    INFO    scaleexecutor    Successfully set ScaleTarget replicas count to ScaledObject minReplicaCount    {&quot;scaledobject.Name&quot;: &quot;kafka-scaledobject&quot;, &quot;scaledObject.Namespace&quot;: &quot;default&quot;, &quot;scaleTarget.Name&quot;: &quot;lazy-consumer&quot;, &quot;Original Replicas Count&quot;: 1, &quot;New Replicas Count&quot;: 0}
</code></pre>
<!--kg-card-end: markdown--><p>Now we can start publishing data onto our topic. I&apos;m using StreamSets Data Collector for this, since I have an instance running on the same Kubernetes cluster. If you&apos;re following along, just use whatever tool gets the job done most efficiently. </p><p>And boom! Our consumer group is being scaled:</p><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-19-at-13.42.46.png" class="kg-image" alt="Scaling Kafka consumers with KEDA" loading="lazy" width="1192" height="1060" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/03/Screenshot-2022-03-19-at-13.42.46.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2022/03/Screenshot-2022-03-19-at-13.42.46.png 1000w, https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-19-at-13.42.46.png 1192w" sizes="(min-width: 720px) 720px"></figure><p>It takes some time for all consumers to actually start consuming messages (due to consumer group rebalances taking place) but they all start processing eventually. Once the consumer lag is gone, the deployment is scaled down. </p><!--kg-card-begin: markdown--><h1 id="but-wait-a-minute-my-deployment-is-not-scaling-down">But wait a minute, my deployment is not scaling down...</h1>
<!--kg-card-end: markdown--><p>After some time, I started noticing weird things. It seems that the consumer pods are getting stuck in a never-ending loop... </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-19-at-14.11.09.png" class="kg-image" alt="Scaling Kafka consumers with KEDA" loading="lazy" width="1046" height="458" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/03/Screenshot-2022-03-19-at-14.11.09.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2022/03/Screenshot-2022-03-19-at-14.11.09.png 1000w, https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-19-at-14.11.09.png 1046w" sizes="(min-width: 720px) 720px"><figcaption>Message 0 gets processed after message 30...</figcaption></figure><p>A few GitHub and Stack Overflow pages later, I learned that this is caused by the frequent rebalancing events happening on the consumer group, triggered by KEDA continuously adding consumers to and removing them from the group. Each rebalance sends the consumers back to the latest committed offset which, in my case, is the initial offset. </p><p>Time to make a small change to our Python script. I configured the KafkaConsumer to disable auto-committing offsets and added a manual commit after each message is processed. The final script:</p><pre><code class="language-python">from kafka import KafkaConsumer
from datetime import datetime
import time

def print_now():
  now = datetime.now()
  current_time = now.strftime(&quot;%H:%M:%S&quot;)
  print(&quot;Time = &quot;, current_time)

consumer = KafkaConsumer(&apos;lazy-input&apos;, group_id=&apos;lazy-consumer-group&apos;, bootstrap_servers=&apos;xyz:9092&apos;, enable_auto_commit=False)  # the boolean False, not the (truthy) string &apos;False&apos;

for msg in consumer:
  print_now()
  print(msg)
  time.sleep(60)  # simulate the heavy processing
  consumer.commit()  # commit the offset once the message is fully processed
</code></pre><p>Deploying these changes to the cluster and executing the experiment showed the results we expected before: the increased lag on the consumer group forces KEDA to scale up the deployment and once a consumer has done its job KEDA scales down the deployment, eventually to zero replicas. Nice. </p>]]></content:encoded></item><item><title><![CDATA[Deliver continuous data discoverability on top of Kafka with DataHub and Airflow]]></title><description><![CDATA[Metadata everywhere. Automatically generate discoverability for your Kafka topics with DataHub and Apache Airflow. ]]></description><link>https://thecloudnativedataengineer.com/deliver-continuous-data-observability-on-top-of-kafka-with-datahub-and-airflow/</link><guid isPermaLink="false">6133764a3bbe3200014f9719</guid><category><![CDATA[Apache Kafka]]></category><category><![CDATA[Apache Airflow]]></category><category><![CDATA[DataHub]]></category><category><![CDATA[Data Discoverability]]></category><category><![CDATA[Metadata]]></category><dc:creator><![CDATA[Kevin Stobbelaar]]></dc:creator><pubDate>Thu, 09 Sep 2021 14:14:02 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1521587760476-6c12a4b040da?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGxpYnJhcnl8ZW58MHx8fHwxNjMwNzYyNTU1&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1521587760476-6c12a4b040da?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGxpYnJhcnl8ZW58MHx8fHwxNjMwNzYyNTU1&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow"><p>Knowing where your data lives, what it means and what it looks like is becoming more and more important. 
As data moves from system to system, or from <em><a href="https://martinfowler.com/articles/data-mesh-principles.html">data product to data product</a></em>, a clear and intuitive overview of your datasets across your application landscape lets others explore and wonder.</p><p>If you&apos;re like me, a <em>tiny bit </em>all over the place when excited, data discoverability might even help to bring some order to your home experiments with data. Lately I have been playing around with some data sets, and when moving the data around I usually turn to Apache Kafka to help me out. Once I get going, the number of Kafka topics quickly explodes. This led me on the lookout for a tool on top of Kafka that surfaces not only an overview of the available Kafka topics - there are lots of tools out there for that; I, for example, frequently use <a href="https://github.com/obsidiandynamics/kafdrop">Kafdrop</a> - but also the corresponding metadata, e.g. the underlying schema, the definition(s), etc.</p><p>One of the tools that came up was <a href="https://datahubproject.io/">DataHub</a> from the engineering team at LinkedIn. &quot;A Metadata Platform for the Modern Data Stack&quot; is their slogan, and from my point of view this highlights the two critical pieces of the tool: first, it offers a user-friendly UI in which the (meta)data can be explored and discovered; second, it comes with a lot of integrations to modern data repositories and data wrangling tools out of the box. <em>Luckily for this blog post</em>, they do have an integration with Apache Kafka as well.</p><h2 id="deploy-your-own-datahub-instance">Deploy your own DataHub instance</h2><p>Enough talking, on to the doing. The prerequisites: a running Kubernetes cluster and Helm ready to deliver. That&apos;s it. DataHub provides two Helm charts to get you up and running: one for the underlying components (Elasticsearch, Neo4j, MySQL, and the Confluent platform) and one for the DataHub components themselves. 
The following commands should get you started:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">helm repo add datahub https://helm.datahubproject.io/ 
helm install prerequisites datahub/datahub-prerequisites 
helm install datahub datahub/datahub
</code></pre>
<!--kg-card-end: markdown--><p>If all goes well you can use Kubernetes port-forwarding to access your DataHub instance (initial credentials: &quot;datahub:datahub&quot;):</p><!--kg-card-begin: markdown--><pre><code class="language-bash">kubectl port-forward &lt;datahub-frontend pod name&gt; 9002:9002
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2021/09/Screenshot-2021-09-08-at-22.39.46.png" class="kg-image" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow" loading="lazy" width="2000" height="1061" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2021/09/Screenshot-2021-09-08-at-22.39.46.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2021/09/Screenshot-2021-09-08-at-22.39.46.png 1000w, https://thecloudnativedataengineer.com/content/images/size/w1600/2021/09/Screenshot-2021-09-08-at-22.39.46.png 1600w, https://thecloudnativedataengineer.com/content/images/size/w2400/2021/09/Screenshot-2021-09-08-at-22.39.46.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>DataHub welcome page</figcaption></figure><h2 id="integrate-existing-kafka-metadata">Integrate existing Kafka metadata</h2><p>Once we&apos;ve got DataHub up and running, the next step on our data discoverability journey is to set up an integration between our Kafka cluster and DataHub. DataHub does an excellent job of documenting its existing integrations; the Kafka metadata ingestion is described on <a href="https://datahubproject.io/docs/metadata-ingestion/source_docs/kafka">this page</a>.</p><p>In order to make the Kafka metadata ingestion repeatable, I have combined the necessary statements in a Dockerfile.</p><!--kg-card-begin: markdown--><pre><code class="language-dockerfile">FROM python:3.9 

RUN python3 -m pip install --upgrade pip wheel setuptools \
&amp;&amp; python3 -m pip install --upgrade acryl-datahub \
&amp;&amp; pip install &apos;acryl-datahub[kafka,datahub-kafka]&apos;
</code></pre>
<!--kg-card-end: markdown--><p>Next we need to create a recipe YAML file to configure the type of metadata ingestion we want to execute. Based on the DataHub documentation, it is quite easy to end up with this sample to load Kafka topics into DataHub:</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">source:
  type: &quot;kafka&quot; 
  config: 
    connection: 
      bootstrap: &quot;broker:9092&quot; 
      schema_registry_url: &quot;https://xyz:8081&quot; 
sink: 
  type: &quot;datahub-kafka&quot; 
  config: 
    connection: 
      bootstrap: &quot;broker:9092&quot; 
      schema_registry_url: &quot;https://xyz:8081&quot;
</code></pre>
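Since source and sink share the same connection details here, the recipe can also be built programmatically as a plain Python dict, which is the same shape the DataHub Pipeline API accepts. A small hypothetical helper to illustrate the structure (the broker and schema registry addresses are placeholders, just as in the YAML above):

```python
def kafka_recipe(bootstrap: str, schema_registry_url: str) -> dict:
    """Build a kafka -> datahub-kafka ingestion recipe as a plain dict."""
    def connection() -> dict:
        # Each side gets its own copy of the connection block.
        return {"connection": {"bootstrap": bootstrap,
                               "schema_registry_url": schema_registry_url}}

    return {
        "source": {"type": "kafka", "config": connection()},
        "sink": {"type": "datahub-kafka", "config": connection()},
    }

# Placeholder endpoints, matching the YAML recipe above.
recipe = kafka_recipe("broker:9092", "https://xyz:8081")
print(recipe["source"]["type"])  # kafka
```

Keeping the recipe in code like this pays off once the same configuration has to be reused elsewhere, for example in a scheduled workflow.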
<!--kg-card-end: markdown--><p>After building this Dockerfile, we can run the container locally, mounting the recipe file that holds the configuration properties for the ingestion of the Kafka topics into DataHub.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">docker run -it --name kafka-datahub-loader -v &quot;$PWD&quot;/kafka-to-datahub-recipe.yml:/recipe.yml &lt;image tag&gt; datahub ingest -c /recipe.yml
</code></pre>
<!--kg-card-end: markdown--><p>And tada: the existing Kafka topics are available in DataHub!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2021/09/Screenshot-2021-09-09-at-16.11.41.png" class="kg-image" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow" loading="lazy" width="1917" height="909" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2021/09/Screenshot-2021-09-09-at-16.11.41.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2021/09/Screenshot-2021-09-09-at-16.11.41.png 1000w, https://thecloudnativedataengineer.com/content/images/size/w1600/2021/09/Screenshot-2021-09-09-at-16.11.41.png 1600w, https://thecloudnativedataengineer.com/content/images/2021/09/Screenshot-2021-09-09-at-16.11.41.png 1917w" sizes="(min-width: 720px) 720px"><figcaption>Example Kafka Topic Metadata overview in DataHub</figcaption></figure><h1 id="repeat-with-airflow">Repeat with Airflow</h1><p>Once your project gets going and the number of Kafka topics starts to grow, you may want to automatically ingest new Kafka metadata periodically. For this, one might consider <a href="https://airflow.apache.org/">Apache Airflow</a>, a platform to manage workflows. </p><p>I deployed Airflow on my K8S cluster using the official Helm chart, which can be found <a href="https://airflow.apache.org/docs/helm-chart/stable/">here</a>. I did tweak some parts of the configuration in the corresponding values file, but I&apos;m leaving most of that out of this post. One tweak is worth mentioning, though: as you will see in the sample DAG implementation, the Airflow pods need access to <a href="https://pypi.org/project/acryl-datahub/">the DataHub Python package</a>. The values file includes an option to define pip packages that are installed during deployment; use that to install the extra packages.</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">  extraPipPackages: 
    - &quot;acryl-datahub==0.8.11.1&quot;
    - &quot;acryl-datahub[kafka]&quot;
    - &quot;acryl-datahub[datahub-kafka]&quot;
</code></pre>
<!--kg-card-end: markdown--><p>Once Airflow is up and running, all we need to do is implement <a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html">a DAG</a> that will execute the necessary steps to ingest the Kafka metadata into DataHub. Starting from <a href="https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/mysql_sample_dag.py">an example</a> exposed by DataHub, I came up with the following DAG:</p><!--kg-card-begin: markdown--><pre><code class="language-python">from airflow import DAG

try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

from datetime import datetime

from datahub.ingestion.run.pipeline import Pipeline

def datahub_recipe():
    pipeline = Pipeline.create(
        # This configuration is analogous to a recipe configuration.
        {
            &quot;source&quot;: {
                &quot;type&quot;: &quot;kafka&quot;,
                &quot;config&quot;: {
                  &quot;connection&quot;: {
                    &quot;bootstrap&quot;: &quot;broker:9092&quot;,
                    &quot;schema_registry_url&quot;: &quot;http://xyz:8081&quot;,
                  },
                },
            },
            &quot;sink&quot;: {
                &quot;type&quot;: &quot;datahub-kafka&quot;,
                &quot;config&quot;: {
                  &quot;connection&quot;: {
                    &quot;bootstrap&quot;:  &quot;broker:9092&quot;,
                    &quot;schema_registry_url&quot;: &quot;http://xyz:8081&quot;,
                  },
                },
            },
        }
    )

    pipeline.run()
    pipeline.raise_from_status()

dag = DAG(
    &apos;datahub_ingest_using_recipe&apos;,
    description=&apos;Loading Kafka Metadata into DataHub&apos;,
    schedule_interval=&apos;0 12 * * *&apos;,
    start_date=datetime(2017, 3, 20),
    catchup=False,
)
datahub_operator = PythonOperator(task_id=&apos;ingest_using_recipe&apos;, python_callable=datahub_recipe, dag=dag)

datahub_operator
</code></pre>
<!--kg-card-end: markdown--><p>If all went well, after uploading the DAG into Airflow, you should be able to run the DAG to ingest new Kafka metadata into DataHub.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2021/09/Screenshot-2021-09-08-at-22.34.25.png" class="kg-image" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow" loading="lazy" width="2000" height="819" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2021/09/Screenshot-2021-09-08-at-22.34.25.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2021/09/Screenshot-2021-09-08-at-22.34.25.png 1000w, https://thecloudnativedataengineer.com/content/images/size/w1600/2021/09/Screenshot-2021-09-08-at-22.34.25.png 1600w, https://thecloudnativedataengineer.com/content/images/size/w2400/2021/09/Screenshot-2021-09-08-at-22.34.25.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Airflow DAGs overview: the datahub ingest DAG ran successfully</figcaption></figure><p>Alternatively, you could schedule the previously created Docker image as a CronJob on your Kubernetes cluster. </p><!--kg-card-begin: markdown--><h1 id="closing-remarks">Closing remarks</h1>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" class="kg-image" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow" loading="lazy" width="4928" height="3264" srcset="https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@frederickjmedina?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Frederick Medina</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><!--kg-card-begin: markdown--><p>A few open questions/remarks I have after writing this post:</p>
<ul>
<li>if you&apos;re not using a predefined schema for your Kafka topics, there&apos;s not a lot of information to be found in DataHub other than a list of Kafka topics (although you might add extra metadata manually)</li>
<li>still not sure if Airflow is the right tool for configuring the automatic ingestion of Kafka metadata; it kinda felt like yet another tool to do something quite simple in K8S, but Airflow does give you a higher level of workflow management and monitoring...</li>
<li>debugging Airflow was wearying, until I figured out I should first run the Python code locally before uploading it to the Airflow instance running on K8S...</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>