The article describes the development of a real-time video AI service for a major global service provider, leveraging Google Gemini’s Multimodal Live API and Akka’s SDK. The team successfully built, deployed, and scaled the service to handle thousands of transactions per second, far exceeding customer requirements. Key components included video ingestion, augmentation, and conversational storage, all deployed within a private Akka environment provisioned in a Google VPC in just 2 hours. By Johan Andrén.

A significant challenge was the lack of an efficient JVM client for the Gemini API; the only available option was a blocking, synchronous Python client. The team reverse-engineered the Python client’s behavior and built their own reactive client with Akka Streams and remoting libraries in just one day. The protocol exchanges JSON objects over a WebSocket connection: an initial setup message is sent first, followed by streamed audio, video, and text data, with either audio or text returned in response.
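The setup-then-stream exchange described above can be sketched with plain Java records. This is a minimal illustration, not the article's actual model classes: the message shapes and field names (`setup`, `realtime_input`, `media_chunks`) are assumptions about the wire format, and the JSON is built by hand here purely to keep the sketch self-contained.

```java
import java.util.Base64;

// Hypothetical sketch of the two message shapes in the protocol summary:
// a one-time setup message after the WebSocket opens, then streamed media
// chunks. All field names here are illustrative assumptions.
public class ProtocolSketch {

    // Sent once, selecting the model to talk to.
    record Setup(String model) {
        String toJson() {
            return "{\"setup\":{\"model\":\"" + model + "\"}}";
        }
    }

    // A streamed chunk of audio or video; binary payloads travel
    // base64-encoded inside the JSON message.
    record MediaChunk(String mimeType, byte[] data) {
        String toJson() {
            String b64 = Base64.getEncoder().encodeToString(data);
            return "{\"realtime_input\":{\"media_chunks\":[{\"mime_type\":\""
                    + mimeType + "\",\"data\":\"" + b64 + "\"}]}}";
        }
    }

    public static void main(String[] args) {
        System.out.println(new Setup("models/gemini-example").toJson());
        System.out.println(new MediaChunk("audio/pcm", new byte[] {1, 2, 3}).toJson());
    }
}
```

In the real client these records would be serialized by a JSON library and pushed through the WebSocket as a stream, with responses parsed the same way in reverse.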

The implementation used Akka HTTP’s WebSocket API and modeled the protocol with Java records. Jackson handled JSON serialization and deserialization, with customizations for base64 encoding and field naming. The solution shows how to interact effectively with third-party WebSocket APIs that lack native JVM support, enabling Akka-based services to hold live video, audio, and text interactions with Google Gemini. Nice one!
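One reason field naming needs customization: Java record components are camelCase, while JSON wire protocols like this one typically use snake_case keys. Jackson can do this mapping automatically (e.g. via a snake-case naming strategy); the stdlib-only sketch below just illustrates the transformation itself, and is not the article's code.

```java
// Illustrative helper showing the camelCase -> snake_case mapping that a
// JSON library's naming strategy performs for you. Not production code.
public class FieldNaming {
    static String toSnakeCase(String camel) {
        StringBuilder sb = new StringBuilder();
        for (char c : camel.toCharArray()) {
            if (Character.isUpperCase(c)) {
                // Insert an underscore and lowercase the letter.
                sb.append('_').append(Character.toLowerCase(c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSnakeCase("mimeType"));    // mime_type
        System.out.println(toSnakeCase("mediaChunks")); // media_chunks
    }
}
```

With the naming strategy configured on the mapper, the records can keep idiomatic Java names while the serialized JSON matches what the remote API expects.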

[Read More]

Tags akka java ai app-development google