Share my learning's: 2) Spark - Resilient Distributed Dataset

What is an RDD?

An RDD (Resilient Distributed Dataset) is Spark's core abstraction.

It is an immutable, distributed collection of objects. Spark distributes the data in an RDD across the nodes of the cluster so that it can be processed in parallel.

Note:
Resilient means that a lost part of an RDD can be recomputed from its history (lineage), which is what makes it fault tolerant.

In short, a Resilient Distributed Dataset is:

Collection
Distributed
In-memory
Resilient

How to create RDDs?
There are two ways to create RDDs:

Parallelizing an existing collection
Referencing an external dataset

1. Parallelizing an existing collection:
A parallelized collection is created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program.

Example: python spark

Example: scala spark

Note: The number of partitions can be set with the second parameter of the parallelize method:

sc.parallelize(list, 10)

2. Referencing an external dataset:
An RDD can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Cassandra or any data source offering a Hadoop InputFormat.

Text file RDDs can be created using SparkContext’s textFile method.

SparkContext provides the following methods for reading different types of files:

textFile(path)

sequenceFile(path)

objectFile(path) (Scala and Java API; in PySpark the closest method is pickleFile(path))

Example: python spark

Example: scala spark

Note:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs.

RDD Operations:
RDDs support two types of operations:

transformations
actions

i) Transformations:
Transformations create a new dataset from an existing one.

Example:
map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.

Example in python spark:

Example in scala spark:

ii) Actions:
Actions return a value to the driver program after running a computation on the dataset.

Example:
reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.

Example in python spark:

Example in scala spark:

Note:

All transformations in Spark are lazy: they do not compute their results right away.

They are computed only when an action needs their results.

By default, each transformed RDD may be recomputed each time you run an action on it.

You can also keep an RDD in memory using the persist (or cache) method, so that later actions reuse it instead of recomputing it.


Sunday, 1 October 2017
