What is RDD?
- RDD (Resilient Distributed Dataset) is Spark's core abstraction.
- It is an immutable, distributed collection of objects; Spark distributes the data in an RDD across the nodes of the cluster to achieve parallelism.
Resilient means the dataset can be recomputed from its history (lineage), which in turn makes it fault tolerant.
In simple terms, a Resilient Distributed Dataset is:
- Collection
- Distributed
- In-memory
- Resilient
There are two ways to create RDDs:
- Parallelizing an existing collection
- Referencing an external dataset
1. Parallelizing an existing collection:
An RDD can be created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program.
Example in PySpark:
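A minimal sketch (the list contents are placeholders; in the pyspark shell, a SparkContext named sc is already available):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()      # reuses the shell's context if one already exists

    data = [1, 2, 3, 4, 5]               # an existing collection in the driver program
    dist_data = sc.parallelize(data)     # distribute it as an RDD
    print(dist_data.collect())           # [1, 2, 3, 4, 5]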
Note: The number of partitions can be set with the second parameter of the parallelize method, e.g. sc.parallelize(data, 10).
2. Referencing an external dataset:
An RDD can also be created by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Cassandra, or any data source offering a Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method.
Below are the methods for reading different types of files:
- textFile(path)
- sequenceFile(path)
- objectFile(path)
Example in PySpark:
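A minimal sketch (the path "data.txt" is a placeholder; it could also be an HDFS or S3 URI):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    lines = sc.textFile("data.txt")      # one RDD element per line of the file
    print(lines.count())                 # number of lines
    print(lines.first())                 # first line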
Note:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs.
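For instance, a small sketch (the directory "docs/" is a placeholder containing several small text files):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    files = sc.wholeTextFiles("docs/")    # RDD of (filename, content) pairs
    for name, content in files.take(2):   # inspect the first two files
        print(name, len(content))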
RDD Operations:
RDDs support two types of operations:
- transformations
- actions
i) transformations: create a new dataset from an existing one.
Example:
map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.
Example in PySpark:
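A minimal sketch (assuming an available SparkContext, as in the earlier examples):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    words = sc.parallelize(["spark", "rdd", "transformations"])
    lengths = words.map(lambda w: len(w))   # transformation: returns a new RDD, nothing is computed yet
    print(lengths.collect())                # [5, 3, 15]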
ii) actions: return a value to the driver program after running a computation on the dataset.
Example:
reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.
Example in PySpark:
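A minimal sketch (again assuming an available SparkContext):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    numbers = sc.parallelize([1, 2, 3, 4, 5])
    total = numbers.reduce(lambda a, b: a + b)   # action: aggregates the elements and returns the result to the driver
    print(total)                                 # 15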
Note:
- All transformations in Spark are lazy: they do not compute their results right away.
- The transformations are computed only when an action needs their results (see the sketch after this list).
- By default, each transformed RDD may be recomputed each time you run an action on it.
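A small sketch illustrating this lazy behaviour (the data is a placeholder):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    words = sc.parallelize(["a", "bb", "ccc"])
    lengths = words.map(len)                     # lazy: only the transformation is recorded
    total = lengths.reduce(lambda a, b: a + b)   # the action triggers the actual computation
    print(total)                                 # 6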