Sunday 1 October 2017

2) Spark - Resilient Distributed Dataset

What is an RDD?
  • The RDD (Resilient Distributed Dataset) is Spark's core abstraction.
  • It is an immutable, distributed collection of objects. Spark distributes the data in an RDD to different nodes across the cluster to achieve parallelism.
Note:
Resilient means that lost data can be recomputed from its history, which makes an RDD fault tolerant.


In simple terms, a Resilient Distributed Dataset is:
  • Collection
  • Distributed
  • In-memory
  • Resilient
How to create RDDs?
There are two ways to create RDDs:
  1. Parallelizing an existing collection
  2. Referencing an external dataset
1. Parallelizing an existing collection:
An RDD can be created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program.

Example: python spark



 Example: scala spark



Note: The number of partitions can be set with the second parameter of the parallelize method:
sc.parallelize(list, 10)
2. Referencing an external dataset:
An RDD can also reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Cassandra, or any data source offering a Hadoop InputFormat.

Text file RDDs can be created using SparkContext’s textFile method.

Below are some of the SparkContext methods for reading files:
  • textFile(path)
  • sequenceFile(path)
  • objectFile(path) (Scala and Java; PySpark provides pickleFile(path) instead)
 
Example: python spark


 Example: scala spark



Note:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs.

RDD Operations:
RDDs support two types of operations:
  1. transformations
  2. actions
i) Transformations:
Transformations create a new dataset from an existing one.

Example:
map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.

Example in python spark:


Example in scala spark:




ii) Actions:
Actions return a value to the driver program after running a computation on the dataset.

Example:

reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.


Example in python spark:



Example in scala spark:



Note:
  • All transformations in Spark are lazy: they do not compute their results right away.
  • They are computed only when an action needs their results.
  • By default, each transformed RDD may be recomputed each time you run an action on it.
We can also persist an RDD in memory using the persist (or cache) method, so that later actions reuse the stored data instead of recomputing it.
