Share my learning's: October 2017

Tuesday, 10 October 2017

Scala tuples

Scala tuple:
A Scala tuple is an instance of a class that combines a fixed number of items so that they can be passed around as a whole.

Note:

Unlike an array or list, a tuple can hold objects of different types. Tuples are also immutable.
A tuple isn't actually a collection; it's a series of classes named Tuple2, Tuple3, etc., through Tuple22.

Analogy:
Think of a tuple as a little bag or container that you can use to hold things and pass them around.

1)Creating tuples:
Tuples can be created in two ways:

Enclosing the elements in parentheses
Creating a tuple with ->

i)Enclosing the elements in parentheses:

Ex:

val tupleex = (1,"Mano")

To see which tuple class was created:

tupleex.getClass()

ii)Creating a tuple with ->:
We can also create a two-element tuple with ->, which is mostly useful when building a Map.
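As a minimal sketch of the arrow form, written in REPL/script style, the snippet below builds the same pair both ways and then uses -> for Map entries. The names viaParens, viaArrow and ages, and the entry "Ravi", are made up for illustration.

```scala
// Both forms build the same Tuple2.
val viaParens = (1, "Mano")
val viaArrow = 1 -> "Mano"

// The arrow form reads naturally when listing key/value pairs for a Map.
// ("Ravi" is a hypothetical second entry.)
val ages = Map("Mano" -> 24, "Ravi" -> 30)

println(viaParens == viaArrow)
println(ages("Mano"))
```

Note that -> only ever builds pairs; for three or more elements the parentheses form is the one to use.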

2)Accessing tuple elements:

Tuple elements can be accessed in several ways:

Using underscore with position
Use variable names to access tuple elements
Iterating over a Scala tuple
The tuple toString method

i)Using underscore with position:
We can access tuple elements using an underscore syntax. The first element is accessed with _1, the second element with _2, and so on.

Ex:
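A small sketch of positional access, reusing the tupleex value from the creation example (the names id and name are chosen here only for illustration):

```scala
val tupleex = (1, "Mano")

// Positions start at 1, not 0.
val id = tupleex._1    // first element, an Int
val name = tupleex._2  // second element, a String

println(id)
println(name)
```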

ii)Use variable names to access tuple elements:
When working with a Scala tuple, we can also assign names to its elements.

This is handy when returning miscellaneous elements from a method.

Create a method that returns a tuple:

def tuplemeth = (1,"Mano",24)

Create variables to hold tuple elements:

val(id,name,age) = tuplemeth

We can skip elements we do not need by putting an underscore placeholder in their position.

val(id,name,_) = tuplemeth

iii)Iterating over a Scala tuple
As mentioned, a tuple is not a collection; it doesn't descend from any of the collection traits or classes. However, we can treat it a little bit like a collection by using its productIterator method.
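A minimal sketch of iterating with productIterator; the value name person is an assumption made for this example. The iterator is typed Iterator[Any], so the static types of the individual elements are not preserved.

```scala
val person = (1, "Mano", 24)

// productIterator walks the elements in order, typed as Any.
person.productIterator.foreach(println)

// It can also be materialized into a real collection when needed.
val asList = person.productIterator.toList
```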

iv)The tuple toString method:
The tuple toString method gives you a good representation of a tuple.
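For illustration, here is what toString produces for a three-element tuple (person and text are names picked for this sketch):

```scala
val person = (1, "Mano", 24)

// Elements are comma-separated inside parentheses; strings are not quoted.
val text = person.toString
println(text)  // prints (1,Mano,24)
```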

Sunday, 1 October 2017

2)Spark- Resilient Distributed Dataset

What is RDD?

An RDD (Resilient Distributed Dataset) is Spark's core abstraction.

It is an immutable, distributed collection of objects. Spark distributes the data in an RDD to different nodes across the cluster to achieve parallelism.

Note:
Resilient means that lost data can be recomputed from its history (lineage), which in turn makes an RDD fault tolerant.

In simple terms, a Resilient Distributed Dataset is:

Collection
Distributed
In-memory
Resilient

How to create RDDs?
There are two ways to create RDDs:

Parallelizing an existing collection
Referencing an external dataset

1.Parallelizing an existing collection:
A parallelized collection is created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program.

Example: python spark

Example: scala spark
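A minimal sketch of parallelize, assuming Spark is on the classpath and running in local mode. The app name and the value names are made up for illustration; in spark-shell a context named sc already exists, so the two setup lines would be skipped there.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local-mode context for experimentation (assumed setup, not needed in spark-shell).
val conf = new SparkConf().setAppName("parallelize-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

val list = List(1, 2, 3, 4, 5)

// Distribute the driver-side collection across the cluster as an RDD.
val rdd = sc.parallelize(list)
val collected = rdd.collect().toList

// An optional second argument asks for a specific number of partitions.
val rdd10 = sc.parallelize(list, 10)
val partitions = rdd10.getNumPartitions

sc.stop()
```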

Note: The number of partitions can be set with the second parameter of the parallelize method:

sc.parallelize(list, 10)

2.Referencing an external dataset:
An RDD can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Cassandra or any data source offering a Hadoop InputFormat.

Text file RDDs can be created using SparkContext’s textFile method.

SparkContext provides these methods for reading different file types:

textFile(path)

sequenceFile(path)

objectFile(path)

Example:python spark

Example: scala spark
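A minimal sketch of textFile, assuming Spark on the classpath and a local master. To stay self-contained it first writes a throwaway temp file; the file name prefix, app name and value names are assumptions for this example only.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Local-mode context (assumed setup; spark-shell already provides sc).
val sc = new SparkContext(
  new SparkConf().setAppName("textfile-sketch").setMaster("local[*]"))

// Write a small temporary file so there is something to read.
val path = Files.createTempFile("rdd-sketch", ".txt")
Files.write(path, "first line\nsecond line\n".getBytes("UTF-8"))

// Each element of the resulting RDD is one line of the file.
val lines = sc.textFile(path.toUri.toString)
val lineCount = lines.count()

sc.stop()
```

In a real job the argument would typically be an HDFS or shared-filesystem path rather than a temp file.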

Note:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs.
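A rough sketch of wholeTextFiles under the same assumptions (Spark on the classpath, local master). The directory, the file names a.txt and b.txt, and their contents are invented for the example.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("wholetextfiles-sketch").setMaster("local[*]"))

// Create a temporary directory with two small files.
val dir = Files.createTempDirectory("rdd-sketch-dir")
Files.write(dir.resolve("a.txt"), "alpha".getBytes("UTF-8"))
Files.write(dir.resolve("b.txt"), "beta".getBytes("UTF-8"))

// One (filename, content) pair per file, rather than one element per line.
val files = sc.wholeTextFiles(dir.toUri.toString)
val contents = files.values.collect().sorted.toList

sc.stop()
```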

RDD Operations:
RDDs support two types of operations:

transformations
actions

i)Transformations:
Transformations create a new dataset from an existing one.

Example:
map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.

Example in python spark:

Example in scala spark:
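A minimal sketch of the map transformation, assuming Spark on the classpath and a local master (the app name and value names are illustrative). The collect call at the end is an action, used here only to bring the results back for inspection.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("map-sketch").setMaster("local[*]"))

val numbers = sc.parallelize(List(1, 2, 3, 4))

// map returns a new RDD; the original RDD is left unchanged.
val doubled = numbers.map(_ * 2)

val result = doubled.collect().toList

sc.stop()
```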

ii)Actions:
Actions return a value to the driver program after running a computation on the dataset.

Example:
reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.

Example in python spark:

Example in scala spark:
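A minimal sketch of the reduce action under the same assumptions (Spark on the classpath, local master, illustrative names). The function passed to reduce should be associative and commutative so that partitions can be combined in any order.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("reduce-sketch").setMaster("local[*]"))

val numbers = sc.parallelize(List(1, 2, 3, 4))

// reduce combines all elements and returns a plain value to the driver.
val sum = numbers.reduce(_ + _)

sc.stop()
```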

Note:

All transformations in Spark are lazy: they do not compute their results right away.

They are computed only when an action needs their results.

By default, each transformed RDD may be recomputed each time you run an action on it.

To avoid this, we can keep an RDD in memory using the persist (or cache) method.
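The sketch below illustrates laziness and caching together, assuming Spark on the classpath and a local master (names are illustrative). Nothing is computed until the first action runs; the second action can then reuse the cached partitions instead of recomputing the map.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("cache-sketch").setMaster("local[*]"))

// Only the lineage is recorded here; no computation happens yet.
val squares = sc.parallelize(1 to 5).map(n => n * n)

// Mark the RDD for in-memory persistence; this is still lazy.
squares.cache()

val total = squares.reduce(_ + _)  // first action: computes and caches
val count = squares.count()        // second action: can reuse the cache

sc.stop()
```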
