.NET Kafka Producers
This multipage post targets the concept of getting data out of a specific system or database and onto the Kafka Event Bus. There are many ways of doing this, and some are easier to implement than others. It's important to remember why Kafka is relevant in today's world: Kafka is a real-time event streaming platform. This is superior to legacy approaches to moving data, like SQL Server Integration Services (SSIS), because it happens in real-time. Not only that, Kafka follows a pub/sub architecture, and it's much easier to add a new subscriber (consumer) than to modify an SSIS package to drop that data in another location.
WARNING: Work In Progress! Code coming soon…
What will you learn?
The goal of this blog post is to provide sample code and ideas for people creating .NET producers that interface with a Kafka cluster. There are several challenges to creating this sample code.
Concurrency
In a modern environment, these producers may run in a Kubernetes cluster and be scaled up or down depending on need. With more than one producer instance running simultaneously, there may be concurrency problems that lead to race conditions where data is duplicated because it's read twice. This applies to the Batch Operations, Outbox, and CDC implementations.
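One common mitigation, assuming the table being polled can carry a status column, is to have each producer instance atomically claim a batch of rows before producing them, so two instances never pick up the same row twice. The sketch below illustrates the idea with SQL Server's UPDLOCK, ROWLOCK, and READPAST hints; the dbo.OutboxMessages table, its columns, and the connection handling are hypothetical placeholders.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient; // assumes the Microsoft.Data.SqlClient NuGet package

public static class PendingEventClaimer
{
    // UPDLOCK + ROWLOCK hold the claimed rows for this instance, while READPAST
    // lets another instance skip rows that are already locked instead of
    // blocking on them or reading them a second time.
    private const string ClaimSql = @"
        UPDATE TOP (@BatchSize) dbo.OutboxMessages WITH (UPDLOCK, ROWLOCK, READPAST)
        SET    Status    = 'InFlight',
               ClaimedBy = @InstanceId,
               ClaimedAt = SYSUTCDATETIME()
        OUTPUT inserted.Id, inserted.Payload
        WHERE  Status = 'Pending';";

    public static async Task<List<(long Id, string Payload)>> ClaimBatchAsync(
        string connectionString, string instanceId, int batchSize)
    {
        var claimed = new List<(long Id, string Payload)>();

        await using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        await using var command = new SqlCommand(ClaimSql, connection);
        command.Parameters.AddWithValue("@BatchSize", batchSize);
        command.Parameters.AddWithValue("@InstanceId", instanceId);

        await using var reader = await command.ExecuteReaderAsync();
        while (await reader.ReadAsync())
        {
            claimed.Add((reader.GetInt64(0), reader.GetString(1)));
        }

        return claimed;
    }
}
```

The same claim-then-produce idea applies whether the rows come from a dedicated outbox table or from the source tables themselves.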
Batch Operations
Maybe Kafka is new to the organization and a less obtrusive approach to pulling data out of a specific system is ideal, or maybe that system can't be modified to produce directly to Kafka or an API. If it's not possible or practical to implement Change Data Capture (CDC) in SQL Server, there is another alternative for getting data out of a database. A batch operation isn't the strongest approach because it doesn't align with real-time data streaming, and if the data is business time-sensitive it may not be appropriate. The one big plus to this design is that it requires the fewest changes to an existing system or infrastructure.
.NET Console App
The general idea is that a .NET Console App runs on a schedule, queries a database, and looks for fresh data or data that has not yet been produced to the Kafka Event Bus. The application then puts that information on the Kafka Event Bus for applications and systems further downstream to consume.
This is a less obtrusive approach because it doesn't necessarily require modifying the data source's code. Ideally the source system's SQL database would include a field designating that the data has been produced to Kafka, while the existing application itself remains unmodified. When even that isn't an option, the app can simply query the data for the current day, the current hour, or whatever criteria fit.
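Below is a minimal sketch of such a console app. It assumes the Confluent.Kafka and Microsoft.Data.SqlClient packages, a hypothetical dbo.Orders table with an IsProducedToKafka flag (the "produced" marker described above), and an orders topic; every name here is a placeholder.

```csharp
using System.Text.Json;
using Confluent.Kafka;
using Microsoft.Data.SqlClient;

var connectionString = "Server=.;Database=Sales;Trusted_Connection=True;TrustServerCertificate=True;";
var producerConfig   = new ProducerConfig { BootstrapServers = "localhost:9092" };

using var producer = new ProducerBuilder<string, string>(producerConfig).Build();

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

// Pull a batch of rows that have not yet been produced to Kafka.
const string selectSql =
    "SELECT TOP (500) OrderId, CustomerId, TotalAmount FROM dbo.Orders WHERE IsProducedToKafka = 0;";
const string markSql =
    "UPDATE dbo.Orders SET IsProducedToKafka = 1 WHERE OrderId = @OrderId;";

var pending = new List<(int OrderId, int CustomerId, decimal TotalAmount)>();
await using (var select = new SqlCommand(selectSql, connection))
await using (var reader = await select.ExecuteReaderAsync())
{
    while (await reader.ReadAsync())
        pending.Add((reader.GetInt32(0), reader.GetInt32(1), reader.GetDecimal(2)));
}

foreach (var (orderId, customerId, totalAmount) in pending)
{
    // Key by OrderId so all events for the same order land on the same partition.
    var payload = JsonSerializer.Serialize(new { orderId, customerId, totalAmount });
    await producer.ProduceAsync("orders",
        new Message<string, string> { Key = orderId.ToString(), Value = payload });

    // Only mark the row after the broker has acknowledged the message.
    await using var mark = new SqlCommand(markSql, connection);
    mark.Parameters.AddWithValue("@OrderId", orderId);
    await mark.ExecuteNonQueryAsync();
}

producer.Flush(TimeSpan.FromSeconds(10));
```

Marking rows only after ProduceAsync succeeds trades a small chance of duplicates (if the app dies between producing and marking) for the guarantee that nothing is silently skipped, which is one more reason to consider the log compaction strategy described below.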
Batch operations are good for an interim solution or a backup solution that can be run on demand to pull information that may be missing or to rebuild systems downstream.
Log Compaction Strategy
Another thing to consider is a log compaction strategy. If this batch operation runs several times a day and may produce duplicate data, a log compaction strategy can reduce the amount of unnecessary data retained on the Kafka Event Bus.
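As a rough sketch, assuming the topic is created programmatically with the Confluent.Kafka admin client and that every message is keyed by the source record's primary key, a compacted topic could be declared like this (the topic name and sizing are placeholders):

```csharp
using Confluent.Kafka;
using Confluent.Kafka.Admin;

var adminConfig = new AdminClientConfig { BootstrapServers = "localhost:9092" };
using var admin = new AdminClientBuilder(adminConfig).Build();

// With cleanup.policy=compact Kafka eventually keeps only the latest message
// per key, so re-producing the same order on every batch run does not grow
// the log indefinitely. This only works if every message carries a stable key
// (for example the OrderId).
await admin.CreateTopicsAsync(new[]
{
    new TopicSpecification
    {
        Name = "orders",
        NumPartitions = 3,
        ReplicationFactor = 1,
        Configs = new Dictionary<string, string> { ["cleanup.policy"] = "compact" }
    }
});
```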
Outbox Pattern
I will also demonstrate the “Outbox Pattern” which can be used to move multiple pieces of data out of a system.
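The core of the pattern is that the business row and its event are written in the same database transaction, and a separate relay (for example the console app above, or the claiming query from the Concurrency section) later publishes the outbox rows to Kafka. A minimal sketch of the write side, using hypothetical table and column names, might look like this:

```csharp
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class OrderWriter
{
    public static async Task SaveOrderWithOutboxAsync(
        string connectionString, int customerId, decimal totalAmount)
    {
        await using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();
        using var transaction = connection.BeginTransaction();

        // 1. Persist the business record.
        await using var insertOrder = new SqlCommand(
            "INSERT INTO dbo.Orders (CustomerId, TotalAmount) " +
            "OUTPUT inserted.OrderId VALUES (@CustomerId, @TotalAmount);",
            connection, transaction);
        insertOrder.Parameters.AddWithValue("@CustomerId", customerId);
        insertOrder.Parameters.AddWithValue("@TotalAmount", totalAmount);
        var result = await insertOrder.ExecuteScalarAsync();
        var orderId = (int)result!;

        // 2. Persist the event in the outbox table inside the SAME transaction,
        //    so the order and its event commit (or roll back) together.
        await using var insertOutbox = new SqlCommand(
            "INSERT INTO dbo.OutboxMessages (Topic, MessageKey, Payload, Status) " +
            "VALUES (@Topic, @Key, @Payload, 'Pending');",
            connection, transaction);
        insertOutbox.Parameters.AddWithValue("@Topic", "orders");
        insertOutbox.Parameters.AddWithValue("@Key", orderId.ToString());
        insertOutbox.Parameters.AddWithValue("@Payload",
            JsonSerializer.Serialize(new { orderId, customerId, totalAmount }));
        await insertOutbox.ExecuteNonQueryAsync();

        transaction.Commit();
    }
}
```

The relay that moves rows from the outbox table to Kafka can then retry freely, because the authoritative record of what happened already sits in the same database as the business data.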
AFTER UPDATE
SqlTableDependency
.NET Application Producer
The ideal solution is to modify the existing application so it produces data directly to the Kafka Event Bus as events happen. For example, imagine you have a Sales App that creates orders. In the checkout processing method, after the record has been persisted to the database, that data can be produced to the Kafka Event Bus immediately. This allows downstream systems to consume the data in near real-time.
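A sketch of what that checkout method could look like follows; the Order record, IOrderRepository, and the sales-orders topic are hypothetical, the Confluent.Kafka client is assumed, and the producer is injected as a singleton because it is designed to be long-lived and thread-safe.

```csharp
using System.Text.Json;
using System.Threading.Tasks;
using Confluent.Kafka;

public sealed record Order(int OrderId, int CustomerId, decimal TotalAmount);

public interface IOrderRepository
{
    Task SaveAsync(Order order);
}

public sealed class CheckoutService
{
    private readonly IOrderRepository _orders;             // hypothetical persistence abstraction
    private readonly IProducer<string, string> _producer;  // registered as a singleton at startup

    public CheckoutService(IOrderRepository orders, IProducer<string, string> producer)
    {
        _orders = orders;
        _producer = producer;
    }

    public async Task ProcessCheckoutAsync(Order order)
    {
        // 1. Persist the order exactly as the application already does today.
        await _orders.SaveAsync(order);

        // 2. Immediately produce the event so downstream systems (CRM, e-mail
        //    service) can consume the sale in near real-time.
        var message = new Message<string, string>
        {
            Key = order.OrderId.ToString(),
            Value = JsonSerializer.Serialize(order)
        };
        await _producer.ProduceAsync("sales-orders", message);
    }
}
```

Note that producing after the save without an outbox means an event can be lost if the process dies between the two steps; the Outbox Pattern above closes that gap when stronger guarantees are needed.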
E-Commerce Sales App
This is ideal if a sales application creates a Sales Order and that data needs to get into another system, like a Customer Relationship Management (CRM) system. That way, if the customer calls the support number right after placing the order, the information is already available to customer service reps in near real-time.
Sample Systems
A perfect example of why this is ideal: when a customer checks out on an e-commerce website, they expect a real-time e-mail confirmation for their order. This can be handled by a microservice that processes all e-mails.
- E-Commerce Website (Customer Sale) -> Event Bus -> CRM
- E-Commerce Website (Customer Sale) -> Event Bus -> Email Service
API Producer (Webhooks)
Another option is to create a .NET Web API application that listens for incoming data from internal systems or external whitelisted IP addresses. For example, if an external system like Dynamics CRM needs to produce data to the Kafka Event Bus a webhook can easily be used to accommodate that. This could happen if a Customer record was updated in Dynamics CRM after a customer called to change their e-mail address.
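A sketch of such an endpoint, written as an ASP.NET Core minimal API, is shown below. The route, topic name, and the customerId field in the webhook body are assumptions, and IP whitelisting is left to the firewall or middleware rather than shown here.

```csharp
using System.Text.Json;
using Confluent.Kafka;

var builder = WebApplication.CreateBuilder(args);

// One long-lived producer shared by all requests.
builder.Services.AddSingleton<IProducer<string, string>>(_ =>
    new ProducerBuilder<string, string>(
        new ProducerConfig { BootstrapServers = "localhost:9092" }).Build());

var app = builder.Build();

// Dynamics CRM (or any other whitelisted external system) posts the changed record here.
app.MapPost("/webhooks/crm/customer-updated",
    async (HttpRequest request, IProducer<string, string> producer) =>
{
    using var bodyReader = new StreamReader(request.Body);
    var payload = await bodyReader.ReadToEndAsync();

    // Use the CRM record id as the message key when the payload carries one,
    // otherwise fall back to a random key. The "customerId" field is an
    // assumption about the webhook body.
    using var json = JsonDocument.Parse(payload);
    var key = json.RootElement.TryGetProperty("customerId", out var id)
        ? id.ToString()
        : Guid.NewGuid().ToString();

    await producer.ProduceAsync("crm-customer-updated",
        new Message<string, string> { Key = key, Value = payload });

    return Results.Accepted();
});

app.Run();
```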
The only downside I can think of for using this internally is that if you had five applications hitting the same webhook and the schema changed, you would have to adjust all five of those applications to keep their data consistent with the new schema. That would not be ideal, which is why this is typically designed per system. I have most often seen it used for external-to-internal integrations where the external system's IP address is whitelisted.
Sample Systems
- Stripe Subscription Billed -> API -> Event Bus
- Dynamics CRM (Customer Updated) -> API -> Event Bus
- Incoming Text Message -> API -> Event Bus
Change Data Capture (CDC) / Kafka Connect
Another possibility for getting data out of a source system's database is to use Change Data Capture (CDC). This is a very useful approach if you want to stream every single event from a database into Kafka, or just the changes on one specific table. CDC can be used to stream inserts, updates, and deletes. SQL Server's CDC is based on the database's own transaction log and uses SQL Server Agent jobs to copy that data into change tables, and there are system stored procedures and functions that support reading that data. It's not very difficult to set up, and pulling the data out can be done by calling the CDC stored procedures, querying the change tables directly, or using a tool like Debezium.
Debezium is a set of source connectors built on Kafka Connect; its SQL Server connector reads the CDC data and pushes it to a Kafka cluster, and setting up a Kafka Connect cluster can be a reliable way to get data into Kafka. Now, I did find one drawback of using a Debezium-based connector: it only supports SQL Server 2016. If that doesn't work, it's worth building a containerized service that queries the CDC stored procedures to pull the data itself.
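A sketch of such a polling service is shown below. It assumes CDC has been enabled on a hypothetical dbo.Orders table (which yields a default capture instance named dbo_Orders) and omits checkpointing of the last processed LSN, which a real service would persist between polls.

```csharp
using System.Text.Json;
using Confluent.Kafka;
using Microsoft.Data.SqlClient;

// Reads every change captured for the dbo_Orders capture instance in the
// current capture window. A real service would store the last LSN it handled
// and use that as @from_lsn on the next poll instead of fn_cdc_get_min_lsn.
const string cdcSql = @"
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT __$operation, OrderId, CustomerId, TotalAmount
    FROM   cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');";

var producerConfig = new ProducerConfig { BootstrapServers = "localhost:9092" };
using var producer = new ProducerBuilder<string, string>(producerConfig).Build();

await using var connection = new SqlConnection(
    "Server=.;Database=Sales;Trusted_Connection=True;TrustServerCertificate=True;");
await connection.OpenAsync();

await using var command = new SqlCommand(cdcSql, connection);
await using var reader = await command.ExecuteReaderAsync();

while (await reader.ReadAsync())
{
    // __$operation: 1 = delete, 2 = insert, 3 = update (before image), 4 = update (after image).
    var operation = reader.GetInt32(0);
    var orderId   = reader.GetInt32(1);

    var payload = JsonSerializer.Serialize(new
    {
        operation,
        orderId,
        customerId  = reader.GetInt32(2),
        totalAmount = reader.GetDecimal(3)
    });

    await producer.ProduceAsync("orders-cdc",
        new Message<string, string> { Key = orderId.ToString(), Value = payload });
}

producer.Flush(TimeSpan.FromSeconds(10));
```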
PluralSight: SQL Server: Change Data Capture
I highly recommend watching the section on "Querying Change Data Using Transact-SQL"; it's a great demonstration of how to use the internal system stored procedures to get data out of CDC in SQL Server.
https://app.pluralsight.com/library/courses/sql-server-change-data-capture/table-of-contents
PluralSight: Kafka Connect Fundamentals
This video series on Kafka Connect fundamentals has a lot of valuable information about setting up a local Kafka Connect standalone or distributed cluster. It also demonstrates how to create a custom connector built with Java.
https://app.pluralsight.com/library/courses/kafka-connect-fundamentals/
Further Reading
- https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc/
- Streaming ETL: SQL Change Data Capture (Sample Source Code)
- Microservices.io: Transactional Outbox
- Jefferson Elias: Concurrency Problems Theory and Experimentation in SQL Server
- Microsoft: Adventure Works Databases
- Adrian Hills: Queue Table Processing in SQL Server