Introduction to Apache Spark and its Datasets


This article introduces the big data ecosystem and the role of Apache Spark within it. It also covers distributed database systems, the backbone of big data. In today’s world, data is the fuel: almost every electronic device collects data that is used for business purposes. By Abhishek Jaiswal.

The article also discusses Resilient Distributed Datasets (RDDs), transformations, and actions:

  • What is Apache Spark?
  • Apache Spark Architecture
  • Spark RDDs are immutable: they can’t be modified, only replaced by new RDDs
  • Spark RDDs are lazily evaluated: transformations are computed only when an action asks for a result, which the article credits with helping to protect data integrity
  • Spark supports distributed SQL that is built on top of RDDs
  • Spark supports various machine learning models, including CNNs and NLP models
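
The two RDD properties in the list above, immutability and lazy evaluation, can be sketched in a few lines of plain Python. This is only a toy, single-machine model: the class name `MiniRDD` and its internals are hypothetical and are not part of the Spark API. It assumes that a transformation merely records a step and returns a new object, while an action runs the recorded steps.

```python
from functools import reduce


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, NOT the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # input data, frozen as a tuple
        self._steps = tuple(steps)    # recorded transformations, not yet run

    # Transformations: return a NEW MiniRDD and do no work yet.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._steps + (("filter", pred),))

    # Actions: run the recorded steps and return a plain Python value.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            # Inside a method, map/filter here are the Python built-ins.
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def reduce(self, fn):
        return reduce(fn, self.collect())


calls = []  # records every element that `double` actually processes


def double(x):
    calls.append(x)
    return x * 2


numbers = MiniRDD([1, 2, 3, 4])
doubled = numbers.map(double)          # transformation: nothing runs yet
big = doubled.filter(lambda x: x > 4)  # transformation: still nothing runs

lazy_before_action = (calls == [])     # True: `double` has not been called
result = big.collect()                 # action: the pipeline runs now -> [6, 8]
original = numbers.collect()           # the source is untouched -> [1, 2, 3, 4]
```

In real PySpark the corresponding RDD methods are also called `map`, `filter`, `collect` and `reduce`, but there the work is split across the cluster rather than run in a single process.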

Spark's architecture consists of a driver node (which holds the SparkContext), a cluster manager, and worker nodes. Spark works in a distributed manner, the same as Hadoop, but unlike Hadoop MapReduce, it computes in memory rather than writing intermediate results to disk. Good read!


Tags big-data data-science database miscellaneous