How to handle data duplication in data-heavy Kubernetes environments


It’s convenient to create a copy of your application, together with a copy of its state, for each team. For example, you might want a separate database copy to test significant schema changes or to develop other disruptive operations such as bulk inserts, deletes, or updates. By Augustinas Stirbis.

Duplicating data takes a lot of time, because you first need to download all the data from the source block storage provider to compute, and then send it back to a storage provider again. The process consumes a lot of network traffic and CPU/RAM. Offloading expensive operations like this to dedicated hardware is a major performance boost: it can reduce the time required to complete the operation by orders of magnitude.

The article then covers:

  • Volume Snapshots to the rescue
  • Solution? Creating a Golden Snapshot externally
  • High-level plan for preparing the Golden Snapshot
  • High-level plan for cloning data for each team
    • Step 1: Identify disk
    • Step 2: Prepare your golden source
    • Step 3: Get your Disk Snapshot ID
    • Step 4: Create a development environment for each team
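To give a feel for Steps 3 and 4 above, here is a minimal sketch (not the article’s exact manifests) of how an existing cloud disk snapshot can be registered in the cluster as a pre-provisioned VolumeSnapshotContent plus a namespaced VolumeSnapshot. All names, the namespace, the CSI driver, and the snapshot handle are example placeholders.

```yaml
# Pre-provisioned snapshot objects; names, namespace, and snapshot ID are placeholders.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: golden-snapshot-content
spec:
  deletionPolicy: Retain                     # keep the cloud snapshot even if this object is deleted
  driver: ebs.csi.aws.com                    # or your cloud provider's CSI driver
  source:
    snapshotHandle: snap-0123456789abcdef0   # the disk snapshot ID from Step 3 (example value)
  volumeSnapshotRef:
    name: golden-snapshot
    namespace: team-a
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: golden-snapshot
  namespace: team-a
spec:
  source:
    volumeSnapshotContentName: golden-snapshot-content
```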

You also get loads of screenshots and config YAML files to go with this article. At the end of the tutorial you have a Golden Snapshot, which is immutable data. Each team gets a copy of this data that its members can modify as they see fit, since a new EBS/persistent disk is created for each team. Good read!
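As a sketch of that last step, a per-team copy can be provisioned by a PersistentVolumeClaim that uses the VolumeSnapshot as its dataSource, so the CSI driver creates a fresh EBS/persistent disk seeded with the golden data. The claim name, namespace, StorageClass, and size below are hypothetical placeholders, not values from the article.

```yaml
# Per-team claim restored from the Golden Snapshot; names and sizes are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
  namespace: team-a
spec:
  storageClassName: ebs-sc                # a CSI-backed StorageClass (assumption)
  dataSource:
    name: golden-snapshot                 # the VolumeSnapshot registered for the Golden Snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi                      # must be at least the size of the snapshotted disk
```

Because the snapshot itself stays immutable, each such claim provisions an independent disk, so teams can run destructive tests without affecting one another.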

[Read More]

Tags data-science devops how-to learning big-data