Share my learning's: May 2018

Tuesday, 29 May 2018

4)More about Spark RDD Operations - Actions

Actions are the one of the RDD operation:

Actions:
Actions, which return a value to the driver program after running a computation on the dataset.

Note:
Actions may trigger a previously constructed, lazy RDD to be evaluated.

1)reduce(func):
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).

Note:
Function should be commutative and associative so that it can be computed correctly in parallel.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).

Pictorial view:

Example: Python spark:

Example: Scala spark:

2)collect():
Returns all the items in the RDD, as a list to driver program.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.

Pictorial view:

Example: Python spark:

Example: Scala spark:

3)count():
Returns number of elements in the RDD.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
ii)Math/Statistical
count()	Return the number of elements in the dataset.

Example: Python spark:

Example: Scala spark:

4)first():
Return the first element in the RDD.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
first()	Return the first element of the dataset.
ii)Math/Statistical
count()	Return the number of elements in the dataset.

Example: Python spark:

Example: Scala spark:

5)take(n):
Return an array with the first n elements of the RDD.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
first()	Return the first element of the dataset.
take(n)	Return an array with the first n elements of the dataset.
ii)Math/Statistical
count()	Return the number of elements in the dataset.

Example: Python spark:

Example: Scala spark:

6)takeSample(withReplacement:boolean, num:int, [seed]):
Return an array with a random sample of num elements of the dataset, with or without replacement.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
first()	Return the first element of the dataset.
take(n)	Return an array with the first n elements of the dataset.
ii)Math/Statistical
count()	Return the number of elements in the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement,

Example: Python spark:

Example: Scala spark:

7)takeOrdered(n, [ordering]):
Return the first n elements of the RDD using either their natural order or a custom comparator.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
first()	Return the first element of the dataset.
take(n)	Return an array with the first n elements of the dataset.
ii)Math/Statistical
count()	Return the number of elements in the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement,
iii)Set Theory/Relational
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.

Example: Python spark:

Example: Scala spark:

8)saveAsTextFile(path):
Save RDD as text file, using string representations of elements.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
first()	Return the first element of the dataset.
take(n)	Return an array with the first n elements of the dataset.
ii)Math/Statistical
count()	Return the number of elements in the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement,
iii)Set Theory/Relational
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.
iv)Data structure/ io
saveAsTextFile(path)	Write the elements of the dataset as a text file (or set of text files) in a given directory in the local, HDFS or any other Hadoop-supported file system.

Pictorial view:

Syntax:

saveAsTextFile(path,compresessioncode-class)

Parameters:

path – path to file
compressionCodecClass – (None by default) string i.e. “org.apache.hadoop.io.compress.GzipCodec”

Example: Python spark:

Example: Scala spark:

Since the default parallelism is 4, so during export, the RDD is split into fours partitions on local directory.

9)saveAsSequenceFile(path):

Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
first()	Return the first element of the dataset.
take(n)	Return an array with the first n elements of the dataset.
ii)Math/Statistical
count()	Return the number of elements in the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement,
iii)Set Theory/Relational
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.
iv)Data structure/ io
saveAsTextFile(path)	Write the elements of the dataset as a text file (or set of text files) in a given directory in the local, HDFS or any other Hadoop-supported file system.
saveAsSequenceFile(path)	Saves RDD as sequencefile with key – value pairs

Example: Python spark:

Example: Scala spark:

10)saveAsObjectFile(path):
Saves RDD as an object fil

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
first()	Return the first element of the dataset.
take(n)	Return an array with the first n elements of the dataset.
ii)Math/Statistical
count()	Return the number of elements in the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement,
iii)Set Theory/Relational
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.
iv)Data structure/ io
saveAsTextFile(path)	Write the elements of the dataset as a text file (or set of text files) in a given directory in the local, HDFS or any other Hadoop-supported file system.
saveAsSequenceFile(path)	Saves RDD as sequencefile with key – value pairs
saveAsObjectFile(path)	Saves RDD as an object file

Example: Python spark:

Note:
Object file format does not support in python spark.
Example: Scala spark:

11)countByKey():Count the number of elements for each key, and return the result as a dictionary.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
first()	Return the first element of the dataset.
take(n)	Return an array with the first n elements of the dataset.
ii)Math/Statistical
count()	Return the number of elements in the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement,
countByKey()	Returns a hashmap of (K, Int) pairs with the count of each key.
iii)Set Theory/Relational
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.
iv)Data structure/ io
saveAsTextFile(path)	Write the elements of the dataset as a text file (or set of text files) in a given directory in the local, HDFS or any other Hadoop-supported file system.
saveAsSequenceFile(path)	Saves RDD as sequencefile with key – value pairs
saveAsObjectFile(path)	Saves RDD as an object file

Pictorial view:

Example: Python spark:

Example: Scala spark:

12)foreach(func):
To iterate through all the items in RDD and apply function to all.

Note:
Helpful, when we want to insert items in RDD.

Actions	Usage
i)General Actions
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
collect()	Return all the elements of the dataset as an array at the driver program.
first()	Return the first element of the dataset.
take(n)	Return an array with the first n elements of the dataset.
foreach(func)	Run a function func on each element of the dataset.
ii)Math/Statistical
count()	Return the number of elements in the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement,
countByKey()	Returns a hashmap of (K, Int) pairs with the count of each key.
iii)Set Theory/Relational
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.
iv)Data structure/ io
saveAsTextFile(path)	Write the elements of the dataset as a text file (or set of text files) in a given directory in the local, HDFS or any other Hadoop-supported file system.
saveAsSequenceFile(path)	Saves RDD as sequencefile with key – value pairs
saveAsObjectFile(path)	Saves RDD as an object file

Example: Python spark:

Example: Scala spark:

Please click next to proceed further ==> Next

3)More about Spark RDD Operations - Transformations

This explanation inspired from https://training.databricks.com/visualapi.pdf

Apache Spark RDD Operations:

Apache Spark RDD supports two types of Operations

Transformations
Actions

1.RDD Transformation:
Spark Transformation is a function that produces new RDD from the existing RDDs.

Note:
It takes RDD as input and produces one or more RDD as output.

There are two types of transformations:

Narrow transformations
Wide transformations

i.Narrow transformations:
Each partition of the parent RDD is used by at most one partition of the child RDD.

Note:
Narrow transformations are the result of map(), filter().
ii.Wide transformations:
Multiple child RDD partitions may depend on a single parent RDD partition.

Note:
Wide transformations are the result of groupbyKey() and reducebyKey().

Let's we go through all the transformations one by one:

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.

1)map(func)
Return a new RDD formed by passing each element of the source through a function func.

Pictorial view:
Let's consider the RDD: x is the parent RDD:

Map transformation, to apply for each element in the parent RDD:
1)map transformation on first element

2)map transformation on second element

3)map transformation on third element

4)After map(func) applied , new RDD formation

Example: python spark:
Syntax:

map(func, preserverspatitions = False)

Example: scala spark:

2)filter(func):

keep item if function returns true

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.

Pictorial view:
Let's consider the RDD: x is the parent RDD:

1)filter transformation on first element

2)filter transformation on second element

3)filter transformation on third element

4)After filter(func) applied , new RDD formation

Example: python spark:
Syntax:

filter(func)

Example: scala spark:

3)flatMap(func):
Each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

Pictorial view:
Let's consider the RDD: x is the parent RDD:

1)flatMap transformation on first element

2)flatMap transformation on second element

3)flatMap transformation on third element

4)After flatMap(func) applied , new RDD formation

Example: python spark:
Syntax:

flatMap(func)

Example: scala spark:

3)mapPartitions(func):
Similar to map, but runs separately on each partition (block) of the RDD, for performance optimization.

Note:
so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD

Pictorial view:

Example: python spark:
Syntax:

mapPartitions(Iterable<T> funciton, pserversPatitioning= False)

Example: scala spark:

Note:
Results [6,15,24] are created with mapPartitions loops through 3 partitions.

Partion 1: 1+2+3 = 6

Partition 2: 4+5+6 = 15

Partition 3: 7+8+9 = 24

To check the defualt parallelism:

>>> print sc.defaultParallelism
4

4)mapPartitionsWithIndex(func):
Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,

Note:
func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,

Pictorial view:

Example: python spark:

Example: scala spark:

Note:
glom() flattens elements on the same partition

5)sample(withReplacement, fraction, seed):
Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,
ii)Math/Statistical
sample(withReplacement, fraction, seed)	Sample a fraction of the data, with or without replacement, using a given random number generator seed.

Pictorial view:

Example: python spark:

Example: scala spark:

6)union(otherDataset):
Return a new dataset that contains the union of two RDDs elements

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,
ii)Math/Statistical
sample(withReplacement, fraction, seed)	Sample a fraction of the data, with or without replacement, using a given random number generator seed.
ii)Set Theory/Relational
union(otherDataset)	Return a new dataset that contains the union of two RDDs elements

Pictorial view:

Example: python spark:

Example: scala spark:

Note:Union can take duplicates as well,

7)intersection(otherDataset):
Return a new RDD that contains the intersection of elements in both RDD's

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,
ii)Math/Statistical
sample(withReplacement, fraction, seed)	Sample a fraction of the data, with or without replacement, using a given random number generator seed.
ii)Set Theory/Relational
union(otherDataset)	Return a new dataset that contains the union of two RDDs elements
intersection(otherDataset)	Return a new RDD that contains the intersection of elements in both RDD's

Example: python spark:

Example: scala spark:

8)distinct([numTasks])):
Return a new dataset that contains the distinct elements of the source RDD.

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,
ii)Math/Statistical
sample(withReplacement, fraction, seed)	Sample a fraction of the data, with or without replacement, using a given random number generator seed.
ii)Set Theory/Relational
union(otherDataset)	Return a new dataset that contains the union of two RDDs elements
intersection(otherDataset)	Return a new RDD that contains the intersection of elements in both RDD's
distinct([numTasks]))	Return a new dataset that contains the distinct elements of the source RDD

Pictorial view:

Example: python spark:

Example: scala spark:

9)groupByKey([numTasks]):When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,
ii)Math/Statistical
sample(withReplacement, fraction, seed)	Sample a fraction of the data, with or without replacement, using a given random number generator seed.
ii)Set Theory/Relational
union(otherDataset)	Return a new dataset that contains the union of two RDDs elements
intersection(otherDataset)	Return a new RDD that contains the intersection of elements in both RDD's
distinct([numTasks]))	Return a new dataset that contains the distinct elements of the source RDD
groupByKey([numTasks])	When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

Pictorial view:

After groupByKey transformation:

Example: python spark:

Example: scala spark:

10)reduceByKey(func, [numTasks]):
Returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.

Transformations	Usage
i)General Transformations
map(func)	Return a new RDD formed by passing each element of the source through a function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,
reduceByKey(func, [numTasks])	Operates on key, value pairs again, but the func must be of type (V,V) => V.
ii)Math/Statistical
sample(withReplacement, fraction, seed)	Sample a fraction of the data, with or without replacement, using a given random number generator seed.
iii)Set Theory/Relational
union(otherDataset)	Return a new dataset that contains the union of two RDDs elements
intersection(otherDataset)	Return a new RDD that contains the intersection of elements in both RDD's
distinct([numTasks]))	Return a new dataset that contains the distinct elements of the source RDD
groupByKey([numTasks])	When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

Example: python spark:

Example: scala spark:

Tuesday, 29 May 2018

4)More about Spark RDD Operations - Actions

3)More about Spark RDD Operations - Transformations

Fundamentals of Python programming