Apache Spark:
Apache Spark is an in-memory cluster-computing technology that increases the processing speed of applications.
Spark uses Hadoop only for storage, since it has its own cluster-management computation.
Note:
- Designed for fast computation.
- It is based on Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.
- It reduces the management burden of maintaining separate tools.
Features of Apache Spark:
- Speed
- Supports multiple languages
- Advanced Analytics
- Runs Everywhere
1.Speed:
- Spark can run an application in a Hadoop cluster up to 100 times faster in memory, and up to 10 times faster when running on disk.
- It achieves this speed by storing the intermediate processing data in memory.
2.Supports multiple languages:
- Spark provides built-in APIs in Java, Scala, Python, and R.
3.Advanced Analytics:
- Combines SQL, streaming, and complex analytics.
- Spark supports more than just 'map' and 'reduce': it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
4.Runs Everywhere:
- Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
Components/Modules of Spark:
The following are the components/modules of Apache Spark:
- Spark core
- Spark SQL
- Spark Streaming
- MLlib (Machine Learning Library)
- GraphX
1)Spark core:
Spark Core provides in-memory computing and the ability to reference datasets in external storage systems.
2)Spark SQL:
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (later renamed DataFrame).
Note:
Provides support for structured and semi-structured data.
3)Spark Streaming:
Spark Streaming builds on Spark Core to perform streaming analytics on small batches (micro-batches) of data.
4)MLlib (Machine Learning Library):
MLlib is a distributed machine-learning framework on top of Spark Core.
Note:
Spark MLlib is about nine times as fast as the Hadoop disk-based version of Apache Mahout.
5)GraphX:
GraphX is a distributed graph-processing framework on top of Spark.
Spark Architecture Execution flow:
The Spark architecture execution flow includes the following components:
- Driver Program
- Cluster Manager
- Worker nodes
- Executor
Apache Spark - Cluster modes:
Below are the cluster modes in Apache Spark:
- Local mode
- YARN
- Mesos
- Standalone