Building a Data Lake on Google Cloud Platform

Over the years, data has been stored in computers in a variety of ways: databases, blob storage, and more. To support effective business analytics, the data created by modern applications must be processed and analyzed. And the volume of data produced is enormous!

Working with petabytes of data requires storing it efficiently and having the right tools to query it; only then can analytics on that data produce meaningful results.

With that in mind, this blog offers a small tutorial on how to create a data lake that reads changes from an application’s database and writes them to the appropriate location in the data lake (see the sketches after the list below). The tools we shall use for this are as follows:

  • Debezium
  • MySQL
  • Apache Kafka
  • Apache Hudi
  • Apache Spark

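To make the flow concrete, here is a minimal sketch of how the change-data-capture side of such a pipeline might be wired up: Debezium runs inside Kafka Connect, and a connector is registered through the Connect REST API. The host names, credentials, and database/table names below (`connect:8083`, `inventory`, `orders`, and so on) are hypothetical placeholders, and some Debezium config keys differ between versions, so treat this as an illustration rather than the article’s exact setup.

```python
import requests

# Hypothetical Kafka Connect endpoint and MySQL details; adjust for your cluster.
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.server.id": "184054",  # unique ID for this replication client
        "topic.prefix": "appdb",         # Debezium 2.x; older versions use database.server.name
        "database.include.list": "inventory",
        "table.include.list": "inventory.orders",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.inventory",
    },
}

# Kafka Connect exposes a REST API; POST /connectors registers a new connector.
resp = requests.post("http://connect:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
```

On the consuming side, a Spark Structured Streaming job could read those change events from Kafka and upsert them into a Hudi table on Cloud Storage. Again, the topic name, record schema, and GCS bucket path are assumptions made for illustration, and the JSON envelope shown assumes Debezium’s converter runs with schemas disabled.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("cdc-to-hudi").getOrCreate()

# Shape of the Debezium JSON envelope (with converter schemas disabled):
# {"before": {...}, "after": {...}, "op": "c|u|d", "ts_ms": ...}
after_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", StringType()),
])
envelope = StructType([
    StructField("after", after_schema),
    StructField("op", StringType()),
    StructField("ts_ms", LongType()),
])

# Read the Debezium change-event topic from Kafka.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "appdb.inventory.orders")
       .option("startingOffsets", "earliest")
       .load())

# Keep the row image after each insert/update; deletes would need separate handling.
changes = (raw.selectExpr("CAST(value AS STRING) AS json")
           .select(from_json(col("json"), envelope).alias("e"))
           .where(col("e.op").isin("c", "u"))
           .select("e.after.*", col("e.ts_ms").alias("ts_ms")))

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "ts_ms",       # latest event wins on conflicts
    "hoodie.datasource.write.partitionpath.field": "customer_id",
    "hoodie.datasource.write.operation": "upsert",
}

# Continuously upsert the change stream into a Hudi table on Cloud Storage.
(changes.writeStream.format("hudi")
 .options(**hudi_options)
 .option("checkpointLocation", "gs://my-bucket/checkpoints/orders")
 .outputMode("append")
 .start("gs://my-bucket/lake/orders")
 .awaitTermination())
```

The precombine field is what lets Hudi resolve multiple changes to the same row key within a batch, which is exactly the situation a CDC stream produces.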
There are several ways in which a data lake can be architected. With a setup like the one described above, the pipeline can be scaled to handle huge data workloads. The article also includes links to further reading and Kubernetes deployment files. Good read!

Tags: cloud, analytics, big-data, data-science, gcp, apache