Building Real-Time ETL Pipelines with Apache Kafka


Whether you’re a data engineer, a data scientist, a software developer, or someone else working in software and data, you have very likely implemented an ETL pipeline before. By Stefan Sprenger.

ETL stands for Extract, Transform, and Load. These three steps move data from one datastore to another. First, data are extracted from a data source. Second, data are transformed in preparation for the data sink. Third, data are loaded into the data sink. Examples include moving data from a transactional database system to a data warehouse, or syncing cloud storage with an API.
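To make the three steps concrete, here is a minimal sketch in plain Python. It is not taken from the article: the source rows, the field names, and the `warehouse` list are hypothetical in-memory stand-ins for a real database and data warehouse.

```python
from typing import Iterable, Iterator

# Hypothetical stand-in for rows in a transactional source database.
SOURCE_ROWS = [
    {"id": 1, "email": " Alice@Example.com ", "amount_cents": 1250},
    {"id": 2, "email": "bob@example.com", "amount_cents": 399},
]


def extract(rows: Iterable[dict]) -> Iterator[dict]:
    """Extract: read records from the data source."""
    for row in rows:
        yield dict(row)  # copy so later steps never mutate the source


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Transform: reshape each record into the (assumed) sink schema."""
    for row in rows:
        yield {
            "id": row["id"],
            "email": row["email"].strip().lower(),
            "amount": row["amount_cents"] / 100,
        }


def load(rows: Iterable[dict], sink: list) -> int:
    """Load: write the transformed records into the data sink."""
    count = 0
    for row in rows:
        sink.append(row)
        count += 1
    return count


# Hypothetical stand-in for a data warehouse table.
warehouse: list = []
loaded = load(transform(extract(SOURCE_ROWS)), warehouse)
```

Because each step is a generator, records flow through one at a time rather than as one large batch, which is loosely the shape a streaming pipeline takes as well.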

The article covers:

  • What are real-time ETL pipelines?
  • What are the benefits of real-time ETL pipelines?
  • How to implement real-time ETL with Apache Kafka

The open-source community provides most of the essentials for getting up and running. You can use open-source Kafka Connect connectors, like Debezium, to integrate Kafka with external systems, implement transformations in Kafka Streams, and even implement operations spanning multiple rows, such as joins or aggregations, with Kafka. Good read!
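As a rough illustration of an aggregation over a stream of row changes, the sketch below keeps a running total per customer in plain Python. It is an in-memory simulation under stated assumptions, not the Kafka Streams API: the events are only loosely modeled on change events with `op`, `before`, and `after` fields, and the `customer` and `amount` columns are hypothetical.

```python
from collections import defaultdict


def apply_change(totals, event):
    """Update per-customer order totals from a single change event."""
    before = event.get("before")
    after = event.get("after")
    if before is not None:
        # Update or delete: retract the old row's contribution.
        totals[before["customer"]] -= before["amount"]
    if after is not None:
        # Insert or update: add the new row's contribution.
        totals[after["customer"]] += after["amount"]
    return totals


# Hypothetical change events: two inserts, one update, one delete.
events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "customer": "a", "amount": 10}},
    {"op": "c", "before": None,
     "after": {"id": 2, "customer": "b", "amount": 5}},
    {"op": "u", "before": {"id": 1, "customer": "a", "amount": 10},
     "after": {"id": 1, "customer": "a", "amount": 7}},
    {"op": "d", "before": {"id": 2, "customer": "b", "amount": 5},
     "after": None},
]

totals = defaultdict(int)
for event in events:
    apply_change(totals, event)
```

The aggregate is updated incrementally per event instead of being recomputed from the full table. In a real pipeline the events would arrive from a Kafka topic and the state would live in the stream processor.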


Tags apache database queues messaging big-data