<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[The Cloud Native Data Engineer]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://thecloudnativedataengineer.com/</link><image><url>https://thecloudnativedataengineer.com/favicon.png</url><title>The Cloud Native Data Engineer</title><link>https://thecloudnativedataengineer.com/</link></image><generator>Ghost 4.12</generator><lastBuildDate>Sun, 15 Mar 2026 21:54:11 GMT</lastBuildDate><atom:link href="https://thecloudnativedataengineer.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Innovation in data governance: ETL visualization using graph databases]]></title><description><![CDATA[Tackling data warehouse documentation with Neo4j and Oracle Data Integrator. 
]]></description><link>https://thecloudnativedataengineer.com/innovation-in-data-governance-etl-visualization-using-graph-databases/</link><guid isPermaLink="false">624ea1de3bbe3200014f9da2</guid><category><![CDATA[Neo4j]]></category><category><![CDATA[Data Lineage]]></category><category><![CDATA[Data Governance]]></category><category><![CDATA[Graph Database]]></category><dc:creator><![CDATA[Kevin Stobbelaar]]></dc:creator><pubDate>Thu, 07 Apr 2022 08:43:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1516738901171-8eb4fc13bd20?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fHdvcmxkJTIwbWFwJTIwcGluc3xlbnwwfHx8fDE2NDkzMjA1MzY&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1516738901171-8eb4fc13bd20?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fHdvcmxkJTIwbWFwJTIwcGluc3xlbnwwfHx8fDE2NDkzMjA1MzY&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Innovation in data governance: ETL visualization using graph databases"><p><em>Disclaimer: this is a repost of something I did back in 2016</em></p>
<!--kg-card-end: markdown--><p>During a recent project innovation sprint at a customer, we decided to tackle our customer&#x2019;s data warehouse documentation problem. It was hard to get proper insight into the data streams at hand due to various data sources, changing standards and legacy code. Take, for example, a random field: in which reports is it being used, to which source can it be tracked, which transformations have been applied, etc.?<br><br>Since our aim was to thoroughly reshape the infrastructure, we decided to capture this kind of information because it would allow us to better gauge the impact of our modifications. During the innovation sprint, we developed a system that builds this information and makes it queryable.</p><!--kg-card-begin: markdown--><h2 id="oracle-data-integrator-and-neo4j-a-love-story">Oracle Data Integrator and Neo4j: a love story</h2>
<!--kg-card-end: markdown--><p>Oracle Data Integrator (ODI) is a tool that allows you to easily set up ETL flows between various database systems and technologies. Interfaces are probably the most important concept in ODI. They define the data flows between one or more sources and targets.</p><p>All metadata ODI uses to execute the various ETL flows is stored in a database. From the structure of the various data sources to the definition of the interfaces and the connections between various fields: everything is available via this metadata. Following some research into the underlying structure, we were able to extract the necessary information by way of a few simple queries.</p><p>After extracting this info, the next hurdle was to expose it in a form that can be queried. Because our documentation will be used not just by people with a technical background, a clear visualization is necessary. Filtering and targeted searches are also very important, due to the large number of fields and their interrelatedness.</p><p>When thinking of persisting and presenting relations, one almost always ends up at graph databases. These databases use graphs to shape the connections between various concepts. One of the best-known graph databases is Neo4j; its community edition is freely available.</p><p>The results of our queries on the ODI repository were saved as CSV files. Neo4j makes it very easy to read those CSVs and to transform them to a graph structure. With larger datasets, the import tends to take a long time though. In that case, directly loading the data through the Neo4j Java API may be more performant.</p><!--kg-card-begin: markdown--><h2 id="the-basics-schemas-tables-fields-and-mappings">The basics: schemas, tables, fields and mappings</h2>
<!--kg-card-end: markdown--><p>To start, we can map the tables and their corresponding fields for every schema in the data warehouse. These become the nodes on the graph. The relations between this first set of nodes are quite simple: a field belongs to a table, a table belongs to a schema.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_1.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="639" height="442" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_1.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_1.png 639w"><figcaption><em>A simple example. The table &#x2018;CUSTOMER&#x2019; belongs to the schema of customers and has four fields: name, dateOfBirth, address and id.</em></figcaption></figure><p>Based on the existing set of nodes and relations, it&#x2019;s not possible to make connections between the various fields. To solve this, we need an extra type of node: the interface.</p><p>The interface symbolizes the operation in which one or more fields from the source tables are mapped onto a field in the target table. From this, you can immediately extract the various relations. The USED_IN relation is used to show that a table is used as a source in an interface. 
Similarly, the FILLS relation is used to show that a table is used as a target in an interface.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_2.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="639" height="327" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_2.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_2.png 639w"><figcaption><em>CUSTOMER and ADDRESSES are the source tables for the interface POP_ADDRESSES_CUSTOMERS. This interface then fills the target table CUSTOMER_DIMENSION.</em></figcaption></figure><p>By way of the interface node, we can already extract a large amount of information from the graph we&#x2019;ve built. But we can go a step further than that, namely by extending the interface node to the level of the fields so we can shape the relations between the various fields and tables. This results in the mapping node.</p><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_3.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="638" height="309" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_3.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_3.png 638w"></figure><!--kg-card-begin: markdown--><h2 id="unveiling-endless-possibilities">Unveiling endless possibilities</h2>
<!--kg-card-end: markdown--><p>The real power of the graph shows when we want to look into the main structure of the data warehouse or when we want to make targeted impact analyses.</p><!--kg-card-begin: markdown--><h3 id="mapping-the-structure-of-the-metadata">Mapping the structure of the metadata</h3>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_4.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="639" height="297" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_4.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_4.png 639w"></figure><p>The defined concepts and relations allow us to easily map the structure of the data warehouse. The image above clearly shows the structure of the data warehouse for a specific subject area. The data starts from the source on the left and flows via the various ETL processes (created using mappings) to the third layer of the data warehouse on the right.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_5.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="532" height="494"><figcaption><em>The flow of a field on the second layer to a field on the third one.</em></figcaption></figure><p>A question often asked by end users is what the fields included in a report actually mean. If, for example, the report contains a field that denotes the profit of a certain month, then it&#x2019;s interesting to know where this field comes from and how its value is being calculated through the full ETL flow. We can simply deduce this info from the graph we&#x2019;ve put together. The image above shows an example of the flow of a field from the second layer to the third layer of the data warehouse.</p><!--kg-card-begin: markdown--><h2 id="impact-analysis">Impact analysis</h2>
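<p>Both the field-level lineage tracing described above and impact analysis boil down to a reachability query over the graph. Inside Neo4j you would express this as a Cypher path query; as a plain-Python illustration, here is a toy sketch over a hypothetical edge list (all node and mapping names are made up):</p>

```python
from collections import defaultdict, deque

# Toy edge list following the model above: field -> mapping -> field.
# All names are hypothetical, for illustration only.
EDGES = [
    ("L1.T73101.AMOUNT", "MAP_PROFIT"),
    ("MAP_PROFIT", "L2.FACT.PROFIT"),
    ("L2.FACT.PROFIT", "MAP_PROFIT_DIM"),
    ("MAP_PROFIT_DIM", "L3.DIM.PROFIT_MONTH"),
]

def downstream(start, edges):
    """Breadth-first search: everything fed, directly or indirectly, by `start`."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Everything impacted if L1.T73101.AMOUNT were no longer delivered:
impacted = downstream("L1.T73101.AMOUNT", EDGES)
```

<p>Running the same search over the reversed edge list gives the upstream lineage of a report field instead.</p>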
<!--kg-card-end: markdown--><p>It frequently happens that a data model in the source applications is being changed, that a certain data source is no longer included or that the structure of a source table is modified. In these cases, it&#x2019;s convenient for analysts and developers to know what the impact is on the current data warehouse structure.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_6.png" class="kg-image" alt="Innovation in data governance: ETL visualization using graph databases" loading="lazy" width="640" height="294" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/04/graphdb_6.png 600w, https://thecloudnativedataengineer.com/content/images/2022/04/graphdb_6.png 640w"><figcaption><em>Example of an impact analysis</em></figcaption></figure><p>Using the generated graphs, it&#x2019;s possible to get a clear view of this impact. The above image shows the flow for table 73101. It begins on the first layer of the data warehouse and flows all the way to the third dimensional layer. The graph clearly shows which mappings and tables are used to enable this specific flow.</p><p>The graph allows us to clearly and quickly visualize what the impact would be if we were to stop delivering table 73101 from the source: it&#x2019;s immediately clear which mappings need to be changed and which tables will no longer be loaded completely.</p>]]></content:encoded></item><item><title><![CDATA[Capture MongoDB Change Events with Debezium and Kafka Connect]]></title><description><![CDATA[Still capturing MongoDB change events with the oplog? Try this alternative approach with the help of Kafka Connect and Debezium. 
]]></description><link>https://thecloudnativedataengineer.com/capture-mongodb-change-event-with-debezium-and-kafka-connect/</link><guid isPermaLink="false">62362b5d3bbe3200014f9b45</guid><category><![CDATA[Apache Kafka]]></category><category><![CDATA[Debezium]]></category><category><![CDATA[Kafka Connect]]></category><category><![CDATA[MongoDB]]></category><category><![CDATA[Streamsets Datacollector]]></category><category><![CDATA[Change Data Capture]]></category><dc:creator><![CDATA[Kevin Stobbelaar]]></dc:creator><pubDate>Mon, 21 Mar 2022 18:02:32 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1506355683710-bd071c0a5828?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDI5fHxzdHJlYW18ZW58MHx8fHwxNjQ3NzE3MjM4&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="streaming-mongodb-oplog-records-to-kafka">Streaming MongoDB Oplog Records to Kafka</h2>
<!--kg-card-end: markdown--><img src="https://images.unsplash.com/photo-1506355683710-bd071c0a5828?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDI5fHxzdHJlYW18ZW58MHx8fHwxNjQ3NzE3MjM4&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Capture MongoDB Change Events with Debezium and Kafka Connect"><p>A few years ago I had to come up with a system to stream operational data to a Data Warehouse for (near) real-time analytics. The operational data was living in a MongoDB. Part of the resulting architecture looked like this:</p><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/03/initial_arch-1.jpg" class="kg-image" alt="Capture MongoDB Change Events with Debezium and Kafka Connect" loading="lazy" width="1508" height="482" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/03/initial_arch-1.jpg 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2022/03/initial_arch-1.jpg 1000w, https://thecloudnativedataengineer.com/content/images/2022/03/initial_arch-1.jpg 1508w" sizes="(min-width: 720px) 720px"></figure><p>To have real-time data, we decided to apply Change Data Capture on top of the MongoDB. We stumbled upon <a href="https://streamsets.com/products/dataops-platform/data-collector-engine/">Streamsets Datacollector </a>to help us with this. It comes with an out-of-the-box integration to capture MongoDB change events by processing the MongoDB oplog. The MongoDB oplog is a dedicated collection in the local database that keeps track of every operation taking place on a collection in the MongoDB instance. With the help of Streamsets, it was child&apos;s play to publish the oplog records into a Kafka topic. </p><p>This all works fine, but working with oplog records has one main drawback. 
Let&apos;s take a look at an update operation and how this operation is reflected in an oplog record.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">rs0:PRIMARY&gt; db.product.updateOne( { id: &quot;1234&quot; },  { $set: { &quot;size&quot;: &quot;L&quot;, status: &quot;P&quot; }})
{ &quot;acknowledged&quot; : true, &quot;matchedCount&quot; : 1, &quot;modifiedCount&quot; : 1 }

rs0:PRIMARY&gt; db.product.find()
{ &quot;_id&quot; : ObjectId(&quot;5e05d3573097942bf4e37a60&quot;)
 , &quot;id&quot; : &quot;1234&quot;, &quot;label&quot; : &quot;dummy&quot;, &quot;size&quot; : &quot;L&quot;, &quot;status&quot; : &quot;P&quot; }
 
 -- In the oplog
{ &quot;ts&quot; : Timestamp(1577440106, 1)
, &quot;t&quot; : NumberLong(19)
, &quot;h&quot; : NumberLong(&quot;4817342222804408742&quot;)
, &quot;v&quot; : 2
, &quot;op&quot; : &quot;u&quot;
, &quot;ns&quot; : &quot;company.product&quot;
, &quot;ui&quot; : UUID(&quot;651f7a10-6eea-4b3b-9afe-178a5b7c297e&quot;)
, &quot;o2&quot; : { &quot;_id&quot; : ObjectId(&quot;5e05d3573097942bf4e37a60&quot;) }
, &quot;wall&quot; : ISODate(&quot;2022-03-20T09:48:26.279Z&quot;)
, &quot;o&quot; : { &quot;$v&quot; : 1
        , &quot;$set&quot; : { &quot;size&quot; : &quot;L&quot;, &quot;status&quot; : &quot;P&quot; } } }
</code></pre>
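<p>To make the drawback concrete: reconstructing the full document from such a record means replaying the delta in the &quot;o&quot; field against the last state we captured for that record. A minimal sketch that handles only top-level $set and $unset (the real operator surface is far larger, which is exactly the pain point):</p>

```python
def apply_delta(document, delta):
    """Replay a (top-level) $set/$unset oplog delta onto the last known state."""
    updated = dict(document)
    updated.update(delta.get("$set", {}))  # overwrite or add fields
    for field in delta.get("$unset", {}):
        updated.pop(field, None)           # drop removed fields
    return updated

# Last state we captured for the record, before the update above:
previous = {"_id": "5e05d3573097942bf4e37a60", "id": "1234", "label": "dummy"}
current = apply_delta(previous, {"$set": {"size": "L", "status": "P"}})
```

<p>And this still ignores array operators, positional updates and nested paths.</p>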
<!--kg-card-end: markdown--><p>The actual <em>change </em>is reflected in the &quot;o&quot; field of the oplog record. There&apos;s no indication of the full record before or after this operation. If we want to know the actual state of the full record, we will need to merge this individual operation with the history of changes we captured for the same record. Certainly not unmanageable, but supporting all the different ways of updating a record in MongoDB takes some time. And I&apos;m still not talking about handling updates to nested documents...</p><!--kg-card-begin: markdown--><h2 id="time-to-call-batman-and-robin-or-in-this-case-debezium-and-kafka-connect">Time to call Batman and Robin, or in this case Debezium and Kafka Connect</h2>
<!--kg-card-end: markdown--><p>For another project I did some research into tools that are capable of performing change data capture on top of Oracle databases. When looking for a tool that didn&apos;t use Oracle&apos;s logminer process, I quickly got familiar with the <a href="https://debezium.io/">Debezium</a> project. The description on the top of the site is quite promising:</p><!--kg-card-begin: markdown--><blockquote>
<p>Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.</p>
</blockquote>
<!--kg-card-end: markdown--><p>And even more favorable: they have a MongoDB connector! Taking a deep dive into the connector&apos;s documentation, it turns out that Debezium is not using the MongoDB oplog but another feature called <a href="https://docs.mongodb.com/manual/changeStreams/">Change Streams</a>. MongoDB itself recommends using this feature instead of the oplog. By default, these change streams only capture the actual change, or the resulting delta, of the operation. However, you can configure them to return the full document as it looks after the operation. </p><p>In the remainder of this post I will guide you through the setup of Debezium together with Kafka Connect to capture change events on top of MongoDB. </p><!--kg-card-begin: markdown--><h2 id="hands-on-implementation-of-debezium-and-kafka-connect">Hands-on implementation of Debezium and Kafka Connect</h2>
<!--kg-card-end: markdown--><p>Before we really get started, there are some prerequisites:</p><!--kg-card-begin: markdown--><ul>
<li>a working and accessible Kubernetes cluster</li>
<li>the Strimzi Kafka Operator installed on the K8S cluster</li>
<li>a MongoDB instance</li>
</ul>
<!--kg-card-end: markdown--><p>I&apos;m using the Strimzi Kafka Operator not only because it makes deploying a Kafka cluster easy; it also lets me configure Kafka Connect clusters and Kafka Connect connectors as CRD instances. And surprise, surprise, Kafka Connect happens to be one of the ways to set up Debezium. </p><p>We will need three things to set up a Debezium integration with the help of Kafka Connect: a Docker image for the Kafka Connect workers (1), a Kafka Connect Cluster specification (2), and a Kafka Connect Connector specification (3). The Kafka Connect Connector will hold the configuration of the Debezium integration with MongoDB. </p><!--kg-card-begin: markdown--><h4 id="1-a-dockerfile-for-the-kafka-connect-worker-pods">1. A Dockerfile for the Kafka Connect worker Pods</h4>
<!--kg-card-end: markdown--><p>The worker pods of the Kafka Connect cluster need access to the appropriate libraries to use Debezium and Debezium&apos;s MongoDB connector. For this, we can extend a Dockerfile, provided by Strimzi, with the plugin&apos;s artifacts. Download the plugin from <a href="https://docs.confluent.io/5.5.1/connect/debezium-connect-mongodb/index.html#install-the-connector-manually">the Confluent pages</a> and add it to a new directory &quot;plugins&quot;. The resulting Dockerfile looks like this:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">FROM strimzi/kafka:latest-kafka-2.6.0

COPY ./plugins/ /opt/kafka/plugins/
USER 1001

</code></pre>
<!--kg-card-end: markdown--><p>Build and push the resulting Docker image to the registry of your preference. I&apos;m using my personal registry on Docker Hub. </p><!--kg-card-begin: markdown--><h4 id="2-a-kafka-connect-cluster-specification">2. A Kafka Connect Cluster Specification</h4>
<!--kg-card-end: markdown--><p>Secondly, we need to set up our Kafka Connect cluster. For this we can use the KafkaConnect CRD, imported via the Strimzi Kafka Operator. Most of the configuration is left at its defaults; make sure to point the image reference at the Docker image you pushed in the previous step. </p><!--kg-card-begin: markdown--><pre><code class="language-yaml">apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: kstobbel-connect-cluster
  annotations:
    strimzi.io/use-connector-resources: &quot;true&quot;
spec:
  replicas: 3
  image: kstobbel/kafka-connect-mongodb:1.0.1
  bootstrapServers: kafka-kafka-bootstrap:9092
  config:
    group.id: kstobbel-connect-cluster
    offset.storage.topic: kstobbel-connect-cluster-offsets
    config.storage.topic: kstobbel-connect-cluster-configs
    status.storage.topic: kstobbel-connect-cluster-status
    key.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: true
    value.converter.schemas.enable: true
    config.storage.replication.factor: 3
    offset.storage.replication.factor: 3
    status.storage.replication.factor: 3
  logging:
    type: inline
    loggers:
      log4j.rootLogger: &quot;INFO&quot;

</code></pre>
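<p>Note that with schemas.enable set to true on both converters, every message Kafka Connect produces is wrapped in a schema/payload envelope, so consumers typically unwrap the payload first. A sketch with a toy, heavily abbreviated record:</p>

```python
import json

# Toy schema-enabled Connect message; the schema part is abbreviated.
raw = json.dumps({
    "schema": {"type": "struct", "fields": []},
    "payload": {"op": "c", "after": '{"input": "abc12345"}'},
})

def unwrap(message):
    """Return only the payload of a schema-enabled Connect JSON message."""
    return json.loads(message)["payload"]

payload = unwrap(raw)
```
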
<!--kg-card-end: markdown--><p>After applying this yaml specification to the cluster, three new pods should be starting in the current namespace. </p><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-21-at-17.25.56.png" class="kg-image" alt="Capture MongoDB Change Events with Debezium and Kafka Connect" loading="lazy" width="938" height="126" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/03/Screenshot-2022-03-21-at-17.25.56.png 600w, https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-21-at-17.25.56.png 938w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><h4 id="3-a-kafka-connect-connector-specification">3. A Kafka Connect Connector Specification</h4>
<!--kg-card-end: markdown--><p>Last but not least we need to configure our Debezium MongoDB connector. Again using the Strimzi Kafka operator, all we need to do is configure the Kafka Connector CRD:</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: &quot;cdc-connector&quot;
  namespace: default
  labels:
    strimzi.io/cluster: kstobbel-connect-cluster
spec:
  class: io.debezium.connector.mongodb.MongoDbConnector
  tasksMax: 1
  config:
    mongodb.hosts: mongodb-headless:27017
    mongodb.name: products
    snapshot.mode: never
    collection.include.list: company.product
</code></pre>
<!--kg-card-end: markdown--><p>Credentials are left out on purpose. </p><!--kg-card-begin: markdown--><h4 id="bringing-it-all-together">Bringing it all together</h4>
<!--kg-card-end: markdown--><p>We are ready to deploy our Debezium connector, so apply the manifest! Once deployed, one of the Kafka Connect workers is instructed to set up the MongoDB integration. If we take a closer look at the logs of this worker, we see a lot of errors in the console:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">kstobbel-connect-cluster-connect 2022-03-21 16:38:09,030 
WARN [Producer clientId=connector-producer-cdc-connector-0] 
Error while fetching metadata with correlation id 509 : 
{products.company.product=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient) 
[kafka-producer-network-thread connector-producer-cdc-connector-0]
</code></pre>
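<p>The topic name in that error is not arbitrary: the Debezium MongoDB connector derives it from the connector configuration as the logical name (mongodb.name), the database and the collection, joined with dots. A one-line helper to see which topic to create:</p>

```python
def debezium_topic(logical_name, database, collection):
    """Debezium MongoDB topic naming: logicalName.databaseName.collectionName."""
    return f"{logical_name}.{database}.{collection}"

# With mongodb.name=products and collection.include.list=company.product:
topic = debezium_topic("products", "company", "product")
```
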
<!--kg-card-end: markdown--><p>The worker pod seems to have issues with the Kafka topic. After manual creation of the Kafka topic (in our case &quot;products.company.product&quot;), we should be fine. Shall we insert some dummy data into our MongoDB collection? </p><!--kg-card-begin: markdown--><pre><code class="language-bash">db.product.insertOne({&quot;input&quot;: &quot;abc12345&quot;})
</code></pre>
<!--kg-card-end: markdown--><p>The payload part of the corresponding Kafka message looks like this (I&apos;m leaving out the schema definition of the message):</p><!--kg-card-begin: markdown--><pre><code class="language-json">{
    &quot;after&quot;: &quot;{\&quot;_id\&quot;: {\&quot;$oid\&quot;: \&quot;6238a9bd5a957ffd1073cac6\&quot;},\&quot;input\&quot;: \&quot;abc12345\&quot;}&quot;,
    &quot;patch&quot;: null,
    &quot;filter&quot;: null,
    &quot;updateDescription&quot;: null,
    &quot;source&quot;: {
        &quot;version&quot;: &quot;1.8.1.Final&quot;,
        &quot;connector&quot;: &quot;mongodb&quot;,
        &quot;name&quot;: &quot;products&quot;,
        &quot;ts_ms&quot;: 1647880637000,
        &quot;snapshot&quot;: &quot;false&quot;,
        &quot;db&quot;: &quot;company&quot;,
        &quot;sequence&quot;: null,
        &quot;rs&quot;: &quot;rs0&quot;,
        &quot;collection&quot;: &quot;product&quot;,
        &quot;ord&quot;: 1,
        &quot;h&quot;: null,
        &quot;tord&quot;: null,
        &quot;stxnid&quot;: null,
        &quot;lsid&quot;: null,
        &quot;txnNumber&quot;: null
    },
    &quot;op&quot;: &quot;c&quot;,
    &quot;ts_ms&quot;: 1647880637552,
    &quot;transaction&quot;: null
}
</code></pre>
<!--kg-card-end: markdown--><p>So far so good, the insert operation is captured by the Kafka Connector and looks familiar. Let&apos;s try an update operation.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">db.product.updateMany({&quot;input&quot;: &quot;abc12345&quot;}, {$set: {&quot;input&quot;: &quot;def456&quot;, &quot;output&quot;: &quot;changed&quot;}})
</code></pre>
<!--kg-card-end: markdown--><p>The resulting Kafka message:</p><!--kg-card-begin: markdown--><pre><code class="language-json">{
    &quot;after&quot;: &quot;{\&quot;_id\&quot;: {\&quot;$oid\&quot;: \&quot;6238a9fe5a957ffd1073cac7\&quot;},\&quot;input\&quot;: \&quot;def456\&quot;,\&quot;output\&quot;: \&quot;changed\&quot;}&quot;,
    &quot;patch&quot;: null,
    &quot;filter&quot;: null,
    &quot;updateDescription&quot;: {
        &quot;removedFields&quot;: null,
        &quot;updatedFields&quot;: &quot;{\&quot;input\&quot;: \&quot;def456\&quot;, \&quot;output\&quot;: \&quot;changed\&quot;}&quot;,
        &quot;truncatedArrays&quot;: null
    },
    &quot;source&quot;: {
        &quot;version&quot;: &quot;1.8.1.Final&quot;,
        &quot;connector&quot;: &quot;mongodb&quot;,
        &quot;name&quot;: &quot;products&quot;,
        &quot;ts_ms&quot;: 1647881865000,
        &quot;snapshot&quot;: &quot;false&quot;,
        &quot;db&quot;: &quot;company&quot;,
        &quot;sequence&quot;: null,
        &quot;rs&quot;: &quot;rs0&quot;,
        &quot;collection&quot;: &quot;product&quot;,
        &quot;ord&quot;: 1,
        &quot;h&quot;: null,
        &quot;tord&quot;: null,
        &quot;stxnid&quot;: null,
        &quot;lsid&quot;: null,
        &quot;txnNumber&quot;: null
    },
    &quot;op&quot;: &quot;u&quot;,
    &quot;ts_ms&quot;: 1647881865582,
    &quot;transaction&quot;: null
}
</code></pre>
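<p>Since Debezium serializes the document as a JSON string inside the event, parsing the &quot;after&quot; field is simply a second json.loads. A sketch against a trimmed-down copy of the payload above:</p>

```python
import json

# Trimmed copy of the update event payload shown above.
event_payload = {
    "op": "u",
    "after": '{"_id": {"$oid": "6238a9fe5a957ffd1073cac7"},'
             '"input": "def456","output": "changed"}',
}

# The event itself was already decoded once; decode the document string too.
after_doc = json.loads(event_payload["after"])
```
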
<!--kg-card-end: markdown--><p>Cool! Not only does the message contain the update details, it also includes the state of the record after applying the operation. No more need to do manual merges to construct the full record body. True, we still need to parse the &quot;after&quot; field of the payload, but I assume you can figure this out. </p><!--kg-card-begin: markdown--><h2 id="final-remarks">Final Remarks</h2>
<!--kg-card-end: markdown--><p>To conclude this post I want to highlight a few things:</p><!--kg-card-begin: markdown--><ul>
<li>Using the Strimzi operator brings a declarative approach to defining Kafka Connect integrations (instead of performing manual API calls)</li>
<li>Having the &quot;after&quot; state of the record in the change event simplifies your data pipelines</li>
<li>Replacing a Data Pipeline Orchestrator (like Streamsets Datacollector or Apache NiFi) with something like Kafka Connect might be limiting at first sight in terms of observability (what is happening with the connector). Might be something for one of the next posts!</li>
</ul>
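<p>On that observability point: Kafka Connect does expose connector and task state through its REST API (the per-connector status endpoint). A sketch that flags failed tasks, run here against a canned response of that shape instead of a live cluster:</p>

```python
import json

# Canned response in the shape of Kafka Connect's connector status endpoint
# (GET /connectors/{name}/status), trimmed to the fields used below.
status_response = json.dumps({
    "name": "cdc-connector",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [{"id": 0, "state": "FAILED", "worker_id": "10.0.0.1:8083"}],
})

def failed_task_ids(status_json):
    """Return the ids of tasks that are not in RUNNING state."""
    status = json.loads(status_json)
    return [t["id"] for t in status["tasks"] if t["state"] != "RUNNING"]
```
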
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Scaling Kafka consumers with KEDA]]></title><description><![CDATA[Start scaling your applications based on functional triggers, rather than technical and/or infrastructural triggers. Learn how KEDA, the event-driven scaling platform for K8S, can improve your scaling maturity. ]]></description><link>https://thecloudnativedataengineer.com/scaling-kafka-consumers-with-keda/</link><guid isPermaLink="false">6141c7b63bbe3200014f9999</guid><category><![CDATA[Apache Kafka]]></category><category><![CDATA[Scalability]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[KEDA]]></category><dc:creator><![CDATA[Kevin Stobbelaar]]></dc:creator><pubDate>Sat, 19 Mar 2022 18:41:33 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1506159679421-5a01dbc2f2e0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEzMXx8dHJhZmZpYyUyMGphbXxlbnwwfHx8fDE2MzE3MDA5MDM&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1506159679421-5a01dbc2f2e0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEzMXx8dHJhZmZpYyUyMGphbXxlbnwwfHx8fDE2MzE3MDA5MDM&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Scaling Kafka consumers with KEDA"><p>Guilty. I&apos;m indeed a member of that part of society that might build up frustration because of traffic jams. I often wonder about ways of reducing traffic jams in our little country, Belgium. One seemingly simple solution often comes to mind: <em>just </em>add more traffic lanes and all problems are gone, aren&apos;t they?</p><p>Why do I share this with you, you ask? During the last few years working as a data engineer with technologies such as Apache Kafka and other streaming platforms, the following traffic jam metaphor crossed my path a number of times. 
Think of the cars as Kafka messages, the lanes as Kafka consumers, and the road as a Kafka topic. Adding new lanes<em> to process </em>more cars suddenly becomes a lot more appealing. </p><p>In this post I will highlight the upsides of scaling Kafka consumers based on functional/application triggers, and showcase an implementation of functional triggers with <a href="https://keda.sh/">KEDA, Kubernetes Event-driven Autoscaling. </a></p><!--kg-card-begin: markdown--><h1 id="scaling-kafka-consumers-on-kubernetes">Scaling Kafka consumers on Kubernetes</h1>
<!--kg-card-end: markdown--><p>Let&apos;s start by implementing a dead simple Kafka consumer application in Python that simulates heavy processing of incoming Kafka messages, or something <em>similar</em>: sleeping for 60 seconds. </p><!--kg-card-begin: markdown--><pre><code class="language-python">from kafka import KafkaConsumer
from datetime import datetime
import time

def print_now():
  now = datetime.now()
  current_time = now.strftime(&quot;%H:%M:%S&quot;)
  print(&quot;Time = &quot;, current_time)

consumer = KafkaConsumer(&apos;lazy-input-topic&apos;,group_id=&apos;lazy-consumer-group&apos;, bootstrap_servers=&apos;xyz:9092&apos;)

for msg in consumer:
  print_now()
  print(msg)
  time.sleep(60)
</code></pre>
<!--kg-card-end: markdown--><p>Before deploying this as a Docker container on a Kubernetes cluster, we can run it locally and send some messages to the Kafka cluster. First, create a topic with more than one partition (important for the remainder of this post). Start the Python consumer and watch it start consuming messages. </p><!--kg-card-begin: markdown--><pre><code class="language-bash">./kafka-topics --create --topic lazy-input-topic --partitions 10 --bootstrap-server xyz:9092
</code></pre>
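Why do the partitions matter so much? Within a consumer group, Kafka assigns each partition to at most one group member, so any consumers beyond the partition count simply sit idle. A quick sketch of that constraint in plain Python (real Kafka assignors such as range, round-robin, or sticky differ in detail, but the cap is the same):

```python
def assign_partitions(num_partitions: int, num_consumers: int) -> dict[int, list[int]]:
    """Round-robin style sketch: each partition goes to exactly one consumer."""
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment

# 10 partitions, 4 consumers: everyone gets work.
print(assign_partitions(10, 4))

# 10 partitions, 12 consumers: two consumers are left without a partition.
idle = [c for c, parts in assign_partitions(10, 12).items() if not parts]
print(len(idle))
```

This is why the topic above is created with 10 partitions: it leaves headroom to scale up to 10 parallel consumers later on.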
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2021/09/image.png" class="kg-image" alt="Scaling Kafka consumers with KEDA" loading="lazy" width="1034" height="670" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2021/09/image.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2021/09/image.png 1000w, https://thecloudnativedataengineer.com/content/images/2021/09/image.png 1034w" sizes="(min-width: 720px) 720px"></figure><p>Great. All set to start scaling. </p><!--kg-card-begin: markdown--><h1 id="getting-started-with-keda-event-driven-scaling-for-kubernetes">Getting started with KEDA, event-driven scaling for Kubernetes</h1>
<!--kg-card-end: markdown--><blockquote><strong>KEDA</strong> is a <a href="https://kubernetes.io">Kubernetes</a>-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed.</blockquote><p>Like most cloud native projects, you can easily deploy <a href="https://keda.sh/">KEDA</a> on your own Kubernetes cluster with the help of its <a href="https://helm.sh/">Helm</a> chart. </p><p>The following commands will get you going:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">helm repo add kedacore https://kedacore.github.io/charts
helm repo update

kubectl create namespace keda
helm install keda kedacore/keda --namespace keda
</code></pre>
<!--kg-card-end: markdown--><p>KEDA is implemented following <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">the Kubernetes Operator pattern</a>. This means that upon installing KEDA your Kubernetes cluster gets extended with four Custom Resource Definitions (CRDs): ScaledObjects, ScaledJobs, TriggerAuthentications, and ClusterTriggerAuthentications. </p><p>One more important thing to mention before we dive in is the KEDA <a href="https://keda.sh/docs/2.4/scalers/">scalers</a>. These are the integrations on which KEDA bases its decisions to scale Kubernetes components. Obviously for this post we&apos;re interested in the Kafka integration, but the Azure Blob Storage scaler, for example, also seems worthy of some exploration time in the future!</p><p>Let&apos;s get going. We need to configure a ScaledObject and apply it to our Kubernetes cluster.</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: lazy-consumer
  pollingInterval: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-kafka-bootstrap.default:9092
      consumerGroup: lazy-consumer-group
      topic: lazy-input
      # Optional
      lagThreshold: &quot;50&quot;
      offsetResetPolicy: latest

</code></pre>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><pre><code class="language-bash">kubectl apply -f scaled-object.yaml
</code></pre>
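Before watching it in action, it helps to have some intuition for how the <code>lagThreshold</code> in the ScaledObject drives scaling. KEDA exposes the consumer group lag as a metric, and the resulting replica target is roughly one consumer per <code>lagThreshold</code> messages of total lag, capped by the topic's partition count (and any configured maxReplicaCount). A rough back-of-the-envelope sketch using the values from this post's setup (the exact computation is performed by the HPA from the scaler's metric, so treat this as an approximation):

```python
import math

def target_replicas(total_lag: int, lag_threshold: int = 50,
                    partitions: int = 10, max_replicas: int = 100) -> int:
    """Roughly how the Kafka scaler sizes the deployment: one replica per
    lag_threshold messages of lag, never more than there are partitions."""
    if total_lag <= 0:
        return 0  # with no lag, KEDA can scale the deployment to zero
    return min(math.ceil(total_lag / lag_threshold), partitions, max_replicas)

print(target_replicas(0))     # 0  -> scaled down to zero
print(target_replicas(120))   # 3  -> ceil(120 / 50)
print(target_replicas(5000))  # 10 -> capped at the partition count
```

Note the partition cap: this is the same constraint as in the lane metaphor, since an eleventh consumer on a ten-partition topic would have nothing to read.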
<!--kg-card-end: markdown--><p>One of the first things I noticed is that the existing deployment in which my lazy-consumer pod is running is scaled down to zero replicas by KEDA. This is confirmed by consulting the logs of the KEDA operator pod:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">keda-operator 1.6476931939554265e+09    INFO    scaleexecutor    Successfully set ScaleTarget replicas count to ScaledObject minReplicaCount    {&quot;scaledobject.Name&quot;: &quot;kafka-scaledobject&quot;, &quot;scaledObject.Namespace&quot;: &quot;default&quot;, &quot;scaleTarget.Name&quot;: &quot;lazy-consumer&quot;, &quot;Original Replicas Count&quot;: 1, &quot;New Replicas Count&quot;: 0}
</code></pre>
<!--kg-card-end: markdown--><p>Now we can start publishing data onto our topic. I&apos;m using StreamSets Data Collector for this, since I have an instance running on the same Kubernetes cluster. If you&apos;re following along, just use whatever tool gets the job done most efficiently. </p><p>And boom! Our consumer group is being scaled:</p><figure class="kg-card kg-image-card"><img src="https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-19-at-13.42.46.png" class="kg-image" alt="Scaling Kafka consumers with KEDA" loading="lazy" width="1192" height="1060" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/03/Screenshot-2022-03-19-at-13.42.46.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2022/03/Screenshot-2022-03-19-at-13.42.46.png 1000w, https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-19-at-13.42.46.png 1192w" sizes="(min-width: 720px) 720px"></figure><p>It takes some time for all consumers to actually start consuming messages (due to consumer group rebalances taking place) but they all start processing eventually. Once the consumer lag is gone, the deployment is scaled down. </p><!--kg-card-begin: markdown--><h1 id="but-wait-a-minute-my-deployment-is-not-scaling-down">But wait a minute, my deployment is not scaling down...</h1>
<!--kg-card-end: markdown--><p>After some time, I started noticing weird things. It seems that the consumer pods are getting stuck in a never-ending loop... </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-19-at-14.11.09.png" class="kg-image" alt="Scaling Kafka consumers with KEDA" loading="lazy" width="1046" height="458" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2022/03/Screenshot-2022-03-19-at-14.11.09.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2022/03/Screenshot-2022-03-19-at-14.11.09.png 1000w, https://thecloudnativedataengineer.com/content/images/2022/03/Screenshot-2022-03-19-at-14.11.09.png 1046w" sizes="(min-width: 720px) 720px"><figcaption>Message 0 gets processed after message 30...</figcaption></figure><p>A few GitHub and Stack Overflow pages later, I learned that this is caused by the frequent rebalancing events happening on the consumer group, triggered by KEDA continuously adding consumers to and removing them from the group. Each rebalance sends the consumers back to the latest committed offset which, in my case, is the initial offset. </p><p>Time to make a small change to our Python script. I configured the KafkaConsumer to disable auto-committing offsets and added a manual commit after each message is processed. The final script:</p><pre><code class="language-python">from kafka import KafkaConsumer
from datetime import datetime
import time

def print_now():
  now = datetime.now()
  current_time = now.strftime(&quot;%H:%M:%S&quot;)
  print(&quot;Time = &quot;, current_time)

consumer = KafkaConsumer(&apos;lazy-input&apos;, group_id=&apos;lazy-consumer-group&apos;, bootstrap_servers=&apos;xyz:9092&apos;, enable_auto_commit=False)  # the boolean False, not the (truthy) string &apos;False&apos;

for msg in consumer:
  print_now()
  print(msg)
  time.sleep(60)  # simulate the heavy processing
  consumer.commit()  # commit the offset once the message is fully processed
</code></pre><p>Deploying these changes to the cluster and executing the experiment showed the results we expected before: the increased lag on the consumer group forces KEDA to scale up the deployment and once a consumer has done its job KEDA scales down the deployment, eventually to zero replicas. Nice. </p>]]></content:encoded></item><item><title><![CDATA[Deliver continuous data discoverability on top of Kafka with DataHub and Airflow]]></title><description><![CDATA[Metadata everywhere. Automatically generate discoverability for your Kafka topics with DataHub and Apache Airflow. ]]></description><link>https://thecloudnativedataengineer.com/deliver-continuous-data-observability-on-top-of-kafka-with-datahub-and-airflow/</link><guid isPermaLink="false">6133764a3bbe3200014f9719</guid><category><![CDATA[Apache Kafka]]></category><category><![CDATA[Apache Airflow]]></category><category><![CDATA[DataHub]]></category><category><![CDATA[Data Discoverability]]></category><category><![CDATA[Metadata]]></category><dc:creator><![CDATA[Kevin Stobbelaar]]></dc:creator><pubDate>Thu, 09 Sep 2021 14:14:02 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1521587760476-6c12a4b040da?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGxpYnJhcnl8ZW58MHx8fHwxNjMwNzYyNTU1&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1521587760476-6c12a4b040da?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGxpYnJhcnl8ZW58MHx8fHwxNjMwNzYyNTU1&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow"><p>Knowing where your data lives, what it means and what it looks like is becoming more and more important. 
As data moves from system to system, or from <em><a href="https://martinfowler.com/articles/data-mesh-principles.html">data product to data product</a></em>, a clear and intuitive overview of your datasets across your application landscape lets others explore and wonder.</p><p>If you&apos;re like me, a <em>tiny bit </em>all over the place when excited, data discoverability might even help to bring some order to your home experiments with data. Lately I have been playing around with some data sets, and when moving the data around I usually turn to Apache Kafka to help me out. Once I get going, the number of Kafka topics quickly explodes. This led me on the lookout for a tool on top of Kafka that surfaces not only an overview of the available Kafka topics - there are lots of tools out there for that; I, for example, frequently use <a href="https://github.com/obsidiandynamics/kafdrop">Kafdrop</a> - but also the corresponding metadata, e.g. the underlying schema, the definition(s), etc.</p><p>One of the tools that came up was <a href="https://datahubproject.io/">DataHub</a> from the engineering team at LinkedIn. &quot;A Metadata Platform for the Modern Data Stack&quot; is their slogan, and from my point of view this highlights the two critical pieces of the tool: first, it offers a user-friendly UI in which the (meta)data can be explored and discovered; second, it comes with a lot of integrations to modern data repositories and data wrangling tools out of the box. <em>Luckily for this blog post</em>, they do have an integration with Apache Kafka as well.</p><h2 id="deploy-your-own-datahub-instance">Deploy your own DataHub instance</h2><p>Enough talking, on to the doing. The prerequisites: a running Kubernetes cluster and Helm ready to deliver. That&apos;s it. DataHub provides two Helm charts to get you up and running: one for the underlying components (Elasticsearch, Neo4j, MySQL, and the Confluent platform) and one for the DataHub components themselves. 
The following commands should get you started:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">helm repo add datahub https://helm.datahubproject.io/ 
helm install prerequisites datahub/datahub-prerequisites 
helm install datahub datahub/datahub
</code></pre>
<!--kg-card-end: markdown--><p>If all goes well you can use Kubernetes port-forwarding to access your DataHub instance (initial credentials: &quot;datahub:datahub&quot;):</p><!--kg-card-begin: markdown--><pre><code class="language-bash">kubectl port-forward &lt;datahub-frontend pod name&gt; 9002:9002
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2021/09/Screenshot-2021-09-08-at-22.39.46.png" class="kg-image" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow" loading="lazy" width="2000" height="1061" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2021/09/Screenshot-2021-09-08-at-22.39.46.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2021/09/Screenshot-2021-09-08-at-22.39.46.png 1000w, https://thecloudnativedataengineer.com/content/images/size/w1600/2021/09/Screenshot-2021-09-08-at-22.39.46.png 1600w, https://thecloudnativedataengineer.com/content/images/size/w2400/2021/09/Screenshot-2021-09-08-at-22.39.46.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>DataHub welcome page</figcaption></figure><h2 id="integrate-existing-kafka-metadata">Integrate existing Kafka metadata</h2><p>Once we&apos;ve got DataHub up and running, the next step on our data discoverability journey is to set up an integration between our Kafka cluster and DataHub. DataHub does an excellent job of documenting its existing integrations; the Kafka metadata ingestion is described on <a href="https://datahubproject.io/docs/metadata-ingestion/source_docs/kafka">this page</a>.</p><p>In order to make the Kafka metadata ingestion repeatable, I have combined the necessary statements in a Dockerfile.</p><!--kg-card-begin: markdown--><pre><code class="language-dockerfile">FROM python:3.9 

RUN python3 -m pip install --upgrade pip wheel setuptools \
&amp;&amp; python3 -m pip install --upgrade acryl-datahub \
&amp;&amp; pip install &apos;acryl-datahub[kafka,datahub-kafka]&apos;
</code></pre>
<!--kg-card-end: markdown--><p>Next we need to create a recipe YAML file to configure the type of metadata ingestion we want to execute. Based on the DataHub documentation, it is quite easy to end up with this sample to load Kafka topics into DataHub:</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">source:
  type: &quot;kafka&quot; 
  config: 
    connection: 
      bootstrap: &quot;broker:9092&quot; 
      schema_registry_url: &quot;https://xyz:8081&quot; 
sink: 
  type: &quot;datahub-kafka&quot; 
  config: 
    connection: 
      bootstrap: &quot;broker:9092&quot; 
      schema_registry_url: &quot;https://xyz:8081&quot;
</code></pre>
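Since source and sink share the same connection details here, the recipe can also be built programmatically as a plain Python dict, which is the same shape the DataHub Pipeline API accepts. A small hypothetical helper to illustrate the structure (the broker and schema registry addresses are placeholders, just as in the YAML above):

```python
def kafka_recipe(bootstrap: str, schema_registry_url: str) -> dict:
    """Build a kafka -> datahub-kafka ingestion recipe as a plain dict."""
    def connection() -> dict:
        # Each side gets its own copy of the connection block.
        return {"connection": {"bootstrap": bootstrap,
                               "schema_registry_url": schema_registry_url}}

    return {
        "source": {"type": "kafka", "config": connection()},
        "sink": {"type": "datahub-kafka", "config": connection()},
    }

# Placeholder endpoints, matching the YAML recipe above.
recipe = kafka_recipe("broker:9092", "https://xyz:8081")
print(recipe["source"]["type"])  # kafka
```

Keeping the recipe in code like this pays off once the same configuration has to be reused elsewhere, for example in a scheduled workflow.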
<!--kg-card-end: markdown--><p>After building this Dockerfile, we can run the container locally, mounting the recipe file that holds the configuration properties for the ingestion of the Kafka topics into DataHub.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">docker run -it --name kafka-datahub-loader -v &quot;$PWD&quot;/kafka-to-datahub-recipe.yml:/recipe.yml &lt;image tag&gt; datahub ingest -c /recipe.yml
</code></pre>
<!--kg-card-end: markdown--><p>And tada: the existing Kafka topics are available in DataHub!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2021/09/Screenshot-2021-09-09-at-16.11.41.png" class="kg-image" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow" loading="lazy" width="1917" height="909" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2021/09/Screenshot-2021-09-09-at-16.11.41.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2021/09/Screenshot-2021-09-09-at-16.11.41.png 1000w, https://thecloudnativedataengineer.com/content/images/size/w1600/2021/09/Screenshot-2021-09-09-at-16.11.41.png 1600w, https://thecloudnativedataengineer.com/content/images/2021/09/Screenshot-2021-09-09-at-16.11.41.png 1917w" sizes="(min-width: 720px) 720px"><figcaption>Example Kafka Topic Metadata overview in DataHub</figcaption></figure><h1 id="repeat-with-airflow">Repeat with Airflow</h1><p>Once your project gets going and the number of Kafka topics starts to grow, you may want to automatically ingest new Kafka metadata periodically. For this, one might consider <a href="https://airflow.apache.org/">Apache Airflow</a>, a platform to manage workflows. </p><p>I deployed Airflow on my K8S cluster using the official Helm chart, which can be found <a href="https://airflow.apache.org/docs/helm-chart/stable/">here</a>. I did tweak some parts of the configuration in the corresponding values file, but I&apos;m leaving most of that out of this post. One tweak is worth mentioning, though: as you will see in the sample DAG implementation, the Airflow pods need access to <a href="https://pypi.org/project/acryl-datahub/">the DataHub Python package</a>. The values file includes an option to define pip packages that are installed during deployment; use that to install the extra packages.</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">  extraPipPackages: 
    - &quot;acryl-datahub==0.8.11.1&quot;
    - &quot;acryl-datahub[kafka]&quot;
    - &quot;acryl-datahub[datahub-kafka]&quot;
</code></pre>
<!--kg-card-end: markdown--><p>Once Airflow is up and running, all we need to do is implement <a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html">a DAG</a> that will execute the necessary steps to ingest the Kafka metadata into DataHub. Starting from <a href="https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/mysql_sample_dag.py">an example</a> exposed by DataHub, I came up with the following DAG:</p><!--kg-card-begin: markdown--><pre><code class="language-python">from airflow import DAG

try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

from datetime import datetime

from datahub.ingestion.run.pipeline import Pipeline

def datahub_recipe():
    pipeline = Pipeline.create(
        # This configuration is analogous to a recipe configuration.
        {
            &quot;source&quot;: {
                &quot;type&quot;: &quot;kafka&quot;,
                &quot;config&quot;: {
                  &quot;connection&quot;: {
                    &quot;bootstrap&quot;: &quot;broker:9092&quot;,
                    &quot;schema_registry_url&quot;: &quot;http://xyz:8081&quot;,
                  },
                },
            },
            &quot;sink&quot;: {
                &quot;type&quot;: &quot;datahub-kafka&quot;,
                &quot;config&quot;: {
                  &quot;connection&quot;: {
                    &quot;bootstrap&quot;:  &quot;broker:9092&quot;,
                    &quot;schema_registry_url&quot;: &quot;http://xyz:8081&quot;,
                  },
                },
            },
        }
    )

    pipeline.run()
    pipeline.raise_from_status()

dag = DAG(
    &apos;datahub_ingest_using_recipe&apos;,
    description=&apos;Loading Kafka Metadata into DataHub&apos;,
    schedule_interval=&apos;0 12 * * *&apos;,
    start_date=datetime(2017, 3, 20),
    catchup=False,
)
datahub_operator = PythonOperator(task_id=&apos;ingest_using_recipe&apos;, python_callable=datahub_recipe, dag=dag)

datahub_operator
</code></pre>
<!--kg-card-end: markdown--><p>If all went well, after uploading the DAG into Airflow, you should be able to run the DAG to ingest new Kafka metadata into DataHub.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://thecloudnativedataengineer.com/content/images/2021/09/Screenshot-2021-09-08-at-22.34.25.png" class="kg-image" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow" loading="lazy" width="2000" height="819" srcset="https://thecloudnativedataengineer.com/content/images/size/w600/2021/09/Screenshot-2021-09-08-at-22.34.25.png 600w, https://thecloudnativedataengineer.com/content/images/size/w1000/2021/09/Screenshot-2021-09-08-at-22.34.25.png 1000w, https://thecloudnativedataengineer.com/content/images/size/w1600/2021/09/Screenshot-2021-09-08-at-22.34.25.png 1600w, https://thecloudnativedataengineer.com/content/images/size/w2400/2021/09/Screenshot-2021-09-08-at-22.34.25.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Airflow DAGs overview: the datahub ingest DAG ran successfully</figcaption></figure><p>Alternatively, you could schedule the previously created Docker image as a CronJob on your Kubernetes cluster. </p><!--kg-card-begin: markdown--><h1 id="closing-remarks">Closing remarks</h1>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" class="kg-image" alt="Deliver continuous data discoverability on top of Kafka with DataHub and Airflow" loading="lazy" width="4928" height="3264" srcset="https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1538121915146-1dedb4191b21?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEyfHxpZGVhfGVufDB8fHx8MTYzMTExNjc0MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@frederickjmedina?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Frederick Medina</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><!--kg-card-begin: markdown--><p>A few open questions/remarks I have after writing this post:</p>
<ul>
<li>if you&apos;re not using a predefined schema for your Kafka topics, there&apos;s not a lot of information to be found in DataHub other than a list of Kafka topics (although you might add extra metadata manually)</li>
<li>still not sure if Airflow is the right tool for configuring the automatic ingestion of Kafka metadata; it kinda felt like yet another tool to do something quite simple in K8S, but Airflow does give you a higher level of workflow management and monitoring...</li>
<li>debugging Airflow was wearying, until I figured out I should first run the Python code locally before uploading it to the Airflow instance running on K8S...</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>