Building a self-serve ETL pipeline for third-party data ingestion


An article by Nikolaos Tsipas from Skyscanner, written with the help of colleagues Omar Kooheji and Michael Okarimia, about how to import datasets from external sources and make them available for querying. Examples of imported data include analytics metrics, advertising data, and currency exchange rates, all of which are used by Skyscanner engineers and data scientists.

The main entry point for data at Skyscanner is their Kafka-based streaming platform, which allows near real-time processing and archiving.
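
The summary doesn't show how an ingested dataset actually reaches that platform, but as a loose illustration, publishing a single record could look roughly like the Python sketch below. The kafka-python client, broker address, topic name, and payload shape are all assumptions for illustration, not details from the article.

```python
import json

from kafka import KafkaProducer  # kafka-python client, assumed for this sketch

# Hypothetical broker address; the article does not expose the real platform details.
producer = KafkaProducer(
    bootstrap_servers="kafka.data-platform.internal:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# e.g. a currency exchange rate pulled from an external provider
# (topic name and payload are hypothetical)
producer.send("third-party.fx-rates", {"pair": "GBP/USD", "rate": 1.27, "as_of": "2020-01-15"})
producer.flush()
```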

For a solution that scales across the company, they approached the problem with an emphasis on:

  • Minimizing dependency on the Data Platform engineers when onboarding new datasets – preventing the platform team's availability from becoming a blocker
  • Moving ETL pipeline ownership to the user – allowing users to own the data they produce
  • Automating boilerplate code and config generation – ensuring that infrastructure, permissions and deployment setup are abstracted away from users
  • Scalability and cost management – maintaining flexibility and cost efficiency, and ensuring the solution is future-proof

They use managed services provided by AWS to enable the various ETL stages and have adopted Cookiecutter for project templating.
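
The article goes into more detail on what those generated projects contain; purely as a hedged sketch, onboarding a new pipeline from such a template might look like the following use of Cookiecutter's Python API. The template repository URL and context keys are hypothetical, not taken from the article.

```python
from cookiecutter.main import cookiecutter  # Cookiecutter's Python API

# Hypothetical template repository and context values -- the real template
# and its variables are not listed in the summary.
cookiecutter(
    "https://github.example.com/data-platform/etl-pipeline-template",
    no_input=True,
    extra_context={
        "dataset_name": "fx_rates",
        "source_system": "external-provider",
        "schedule": "daily",
    },
)
```

Generating projects this way is what lets the boilerplate, permissions, and deployment setup stay abstracted away from the users who own the resulting pipelines.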

Read the rest of this excellent article to learn whether they opted for AWS Batch, Glue, or both. A high-level illustration of the third-party data ETL pipeline is also provided. Nice one!

[Read More]

Tags: big-data data-science software-architecture