Common RDD operations
Element-wise transformations
map()
The map() transformation takes in a function and applies it to each element in the RDD with the result of the function being the new value of each element in the resulting RDD.
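For example, a minimal PySpark sketch (assuming an existing SparkContext named sc) that squares each number in an RDD:

```python
nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x)  # apply the function to every element
print(squared.collect())             # [1, 4, 9, 16]
```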
Info: It is useful to note that map()'s return type does not have to be the same as its input type. For example, map() can operate on an RDD of strings and return an RDD of doubles.
flatMap()
Sometimes we want to produce multiple output elements for each input element. The operation to do this is called flatMap(). Like map(), flatMap() operates on each element, but returns an iterator with the return values. flatMap() “flattens” the iterators returned to it, so that instead of ending up with an RDD of lists we have an RDD of the elements in those lists.
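For instance, a small sketch of splitting an RDD of lines into an RDD of words (again assuming a SparkContext sc):

```python
lines = sc.parallelize(["hello world", "hi"])
words = lines.flatMap(lambda line: line.split(" "))  # each line yields several words
print(words.first())    # "hello"
print(words.collect())  # ["hello", "world", "hi"]
```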
Basic RDD transformations on an RDD containing {1, 2, 3, 3}
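For illustration, a sketch of a few of these transformations in PySpark (the names sc and rdd are assumed):

```python
rdd = sc.parallelize([1, 2, 3, 3])
print(rdd.map(lambda x: x + 1).collect())            # [2, 3, 4, 4]
print(rdd.flatMap(lambda x: range(x, 4)).collect())  # [1, 2, 3, 2, 3, 3, 3]
print(rdd.filter(lambda x: x != 1).collect())        # [2, 3, 3]
print(rdd.distinct().collect())                      # [1, 2, 3] (order not guaranteed)
```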
Two-RDD transformations on RDDs containing {1, 2, 3} and {3, 4, 5}
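A similar sketch for the set-like operations between two RDDs (sample data taken from the caption above):

```python
rdd = sc.parallelize([1, 2, 3])
other = sc.parallelize([3, 4, 5])
print(rdd.union(other).collect())         # [1, 2, 3, 3, 4, 5]
print(rdd.intersection(other).collect())  # [3]
print(rdd.subtract(other).collect())      # [1, 2] (order not guaranteed)
print(rdd.cartesian(other).collect())     # [(1, 3), (1, 4), (1, 5), (2, 3), ...]
```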
Actions
Basic actions on an RDD containing {1, 2, 3, 3}
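A quick sketch of some common actions on that RDD (again assuming sc):

```python
rdd = sc.parallelize([1, 2, 3, 3])
print(rdd.collect())                    # [1, 2, 3, 3]
print(rdd.count())                      # 4
print(rdd.countByValue())               # defaultdict: {1: 1, 2: 1, 3: 2}
print(rdd.take(2))                      # [1, 2]
print(rdd.reduce(lambda x, y: x + y))   # 9
```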
Refer to the Spark v2.1.0 documentation for more RDD operators in PySpark.
aggregate()
The aggregate() action does not require the return type to be the same as the RDD's element type. As with fold(), we supply an initial zero value of the type we want to return. We then provide two functions: the first combines the elements from our RDD with the accumulator, and the second merges two accumulators.
Consider an RDD of durations, for which we would like to compute both the total duration and the number of elements in the RDD (.count()).
Suppose partition #1 holds [0, 0.1, 0.2] and partition #2 holds [0.4, 0]. To get the total duration we could do a .reduce(add) action, and the number of elements could be obtained using .count(); however, this requires iterating through the RDD twice. To obtain the (sum, count) equivalent of (.reduce(add), .count()) in a single pass, we use the .aggregate() method.
The aggregate() call would look something like this:
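A minimal PySpark sketch (assuming an existing SparkContext named sc; the variable names are illustrative):

```python
durations = sc.parallelize([0, 0.1, 0.2, 0.4, 0], 2)  # two partitions

sum_count = durations.aggregate(
    (0, 0),                                           # zero value: (sum, count)
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # combine a value with the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # merge two accumulators
)
print(sum_count)  # (0.7, 5), up to floating-point rounding
```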
In partition #1, the task iterates through each element, starting from the initial accumulator value (0, 0). For each element (beginning with the first element, 0), it runs the combine-value-with-accumulator function (the first function passed to aggregate()):
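A hand-traced sketch of that sequence (seq_op is a hypothetical name for the first function):

```python
seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)

acc = (0, 0)            # initial zero value
acc = seq_op(acc, 0)    # -> (0, 1)
acc = seq_op(acc, 0.1)  # -> (0.1, 2)
acc = seq_op(acc, 0.2)  # -> (0.3, 3): partition #1 yields (sum, count) = (0.3, 3)
```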
Similarly, partition #2, which holds [0.4, 0], yields an accumulated (sum, count) of (0.4, 2). The per-partition results are then passed to the combine-accumulators function (the second function).
Combining the accumulators (0.3, 3) and (0.4, 2) yields a final (sum, count) of (0.7, 5). Using .aggregate(...) means the RDD is iterated through only once, which improves efficiency.