Apache Spark Modules and their Dependencies

[Figure: Apache Spark Modules]

As you can see, the module spark-core is the foundation for all the others. It provides the implementation of the Spark computing engine: RDDs, scheduling, deployment, executors, storage, shuffle, …

The module spark-sql, together with spark-hive and spark-catalyst, lets you query structured data as a distributed dataset using SQL. The module spark-hive provides the ability to interact with Hive, and the module spark-catalyst serves as the query optimization framework for Spark.

The module spark-mllib is a scalable machine learning library that builds on Spark's computing power. spark-mllib can also run on streaming data or use SQL queries to extract data.

The modules spark-streaming and spark-graphx make it easy to build scalable, fault-tolerant streaming applications and graph-parallel computations, respectively.
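
The relationships described above can be sketched as a small dependency graph. This is only an illustration: the edges in DEPENDENCIES are an assumption based on the descriptions in this post, not a copy of Spark's build files, and they can differ between Spark versions. The names DEPENDENCIES, build_order and transitive_dependencies are made up for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed direct dependencies: module -> modules it depends on.
# These edges are an illustration, not taken from Spark's POM files.
DEPENDENCIES = {
    "spark-core": set(),
    "spark-catalyst": {"spark-core"},
    "spark-sql": {"spark-core", "spark-catalyst"},
    "spark-hive": {"spark-core", "spark-sql"},
    "spark-streaming": {"spark-core"},
    "spark-graphx": {"spark-core"},
    "spark-mllib": {"spark-core", "spark-sql", "spark-streaming", "spark-graphx"},
}


def build_order(deps):
    """Return the modules so that each one comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def transitive_dependencies(module, deps):
    """Collect every module reachable from `module` via dependency edges."""
    seen = set()
    stack = list(deps[module])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps[current])
    return seen


if __name__ == "__main__":
    # spark-core has no dependencies, so it is always first in the order.
    print(build_order(DEPENDENCIES))
    # Everything spark-hive needs, directly or indirectly.
    print(sorted(transitive_dependencies("spark-hive", DEPENDENCIES)))
```

Under these assumed edges, spark-core shows up in the transitive dependencies of every other module, which is what "foundation for all the others" means in practice.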


