Mastering AWS Kinesis Data Streams


An article by Anahit Pogosova, who has been working with AWS Kinesis Data Streams for several years, dealing with over 0.5 TB of streaming data per day. Rather than telling you all the reasons why you should use Kinesis Data Streams (plenty has been written on that subject), she talks about the things you should know when working with the service.

One thing that makes Kinesis Data Streams such a powerful tool, in addition to its nearly endless scalability, is that you can attach custom data consumers to a stream to process and handle the data in any way you prefer, in near real time.
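
As a rough idea of what a hand-rolled consumer involves (a minimal sketch assuming Python and boto3, not the article's own code; the stream name is hypothetical), reading a shard boils down to obtaining a shard iterator and polling GetRecords:

```python
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "my-stream"  # hypothetical stream name used for illustration

# Find the stream's shards and start reading the first one from its oldest record.
shard_id = kinesis.list_shards(StreamName=STREAM_NAME)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
)["ShardIterator"]

while iterator:
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        print(record["SequenceNumber"], record["Data"])  # process the payload here
    iterator = response.get("NextShardIterator")
    time.sleep(1)  # stay well under the 5 GetRecords calls per second per shard
```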

Once written to a stream, data is available to read within milliseconds and is safely stored for at least 24 hours, during which you can “replay” it as many times as you want. You can extend that retention to up to 7 days, but you will be charged extra for anything over 24 hours.
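
Extending the retention beyond the default 24 hours is a single API call. A minimal boto3 sketch (the stream name is hypothetical):

```python
import boto3

kinesis = boto3.client("kinesis")

# Extend retention from the default 24 hours to the 7-day maximum mentioned above.
# Anything over 24 hours is billed extra.
kinesis.increase_stream_retention_period(
    StreamName="my-stream",  # hypothetical stream name
    RetentionPeriodHours=168,
)
```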

The article then covers:

  • Shards
  • Shards and Partition Keys
  • Serverless?
  • Writing to a stream
  • AWS SDK
  • Batch operations
  • Failures
  • Partial failures
  • Pricing

The main cause of these kinds of write failures is exceeding the throughput of the stream or of an individual shard. The most common reasons for that, traffic spikes and network latency, can be really tricky to fix, because both can make records arrive at the stream unevenly and cause sudden bursts in throughput.
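
To make the partial-failure point concrete, here is a minimal boto3 sketch (again with a hypothetical stream name and payloads, not the article's own code) that checks FailedRecordCount after PutRecords and retries only the records that were rejected, for example due to throttling:

```python
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "my-stream"  # hypothetical stream name used for illustration

# Hypothetical batch: PutRecords accepts up to 500 records per call.
records = [
    {"Data": f"event-{i}".encode(), "PartitionKey": str(i)} for i in range(100)
]

attempt = 0
while records and attempt < 5:
    response = kinesis.put_records(StreamName=STREAM_NAME, Records=records)
    if response["FailedRecordCount"] == 0:
        break
    # PutRecords is not atomic: keep only the records whose per-record result
    # carries an ErrorCode (e.g. ProvisionedThroughputExceededException) and retry them.
    records = [
        record
        for record, result in zip(records, response["Records"])
        if "ErrorCode" in result
    ]
    attempt += 1
    time.sleep(2 ** attempt * 0.1)  # simple exponential backoff between retries
```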

Plenty of code examples, links to further reading, and charts explaining the concepts. An excellent read!

[Read More]

Tags software-architecture event-driven messaging big-data cio data-science code-refactoring