PySpark Notes
Apache Spark
An in-memory data processing framework designed for large-scale distributed data processing.
In-memory processing in Apache Spark refers to the ability to store and process data in the memory (RAM) of the cluster's nodes rather than on disk. This feature significantly boosts performance, especially for iterative algorithms and interactive data analysis.
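A minimal sketch of keeping a DataFrame in executor memory with cache() (the input path and the status column are made-up placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical input path; any dataset works the same way
df = spark.read.parquet("/data/events.parquet")

# Ask Spark to keep the data in memory once it has been computed
df.cache()

df.count()                             # first action reads from disk and caches
df.filter(df.status == "ok").count()   # later actions reuse the in-memory data
```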
Resilient Distributed Datasets (RDDs) are a fundamental data structure in Apache Spark, designed to handle large-scale data processing tasks efficiently. They are immutable, distributed collections of objects that can be processed in parallel across a cluster. Here’s a detailed explanation of RDDs:
Key Features of RDDs
- Immutability:
- Once an RDD is created, it cannot be altered. However, transformations can be applied to produce new RDDs.
- This immutability ensures consistency and simplifies parallel processing.
- Distributed:
- RDDs are distributed across multiple nodes in a cluster, enabling parallel processing.
- This distribution allows Spark to handle large datasets that wouldn't fit on a single machine.
- Fault Tolerance:
- RDDs achieve fault tolerance through lineage information. If a partition of an RDD is lost, it can be recomputed from the original data using the sequence of transformations (lineage).
- This design eliminates the need for data replication.
- Lazy Evaluation:
- Transformations on RDDs are lazily evaluated, meaning they are not computed immediately. Instead, Spark builds a lineage graph of transformations.
- The actual computation happens only when an action (e.g., count, collect) is invoked.
- Operations:
- RDDs support two types of operations: transformations and actions (see the sketch after this list).
- Transformations (e.g., map, filter, reduceByKey): Create new RDDs from existing ones. These are lazy.
- Actions (e.g., collect, count, saveAsTextFile): Trigger the execution of transformations and return results to the driver or write them to storage.
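A small sketch of the transformation/action split on an RDD, using arbitrary sample values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: only recorded in the lineage, nothing runs yet
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Action: triggers execution of the recorded lineage
print(evens.collect())  # [4, 8]
```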
Way faster than Hadoop MapReduce, largely because intermediate results stay in memory instead of being written to disk between stages!
Apache Spark incorporates libraries with composable APIs for:
- MLlib - for machine learning
- Spark SQL - for structured data and SQL queries (see the sketch after this list)
- Structured Streaming - for working with real-time data
- GraphX - for graph processing
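A minimal Spark SQL sketch (the view name and rows are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Hypothetical in-line data; in practice this would come from files or tables
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```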
Spark can read data from multiple sources.
Spark Driver
Responsible for starting the Spark session. It requests resources like CPU and memory from the cluster manager for the Spark executors.
Spark Session
Entry point for all Spark applications.
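A minimal sketch of creating the SparkSession entry point (the app name and config value are placeholders):

```python
from pyspark.sql import SparkSession

# Single entry point for the DataFrame and SQL APIs
spark = (
    SparkSession.builder
    .appName("my-app")                       # placeholder application name
    .config("spark.executor.memory", "2g")   # example resource setting
    .getOrCreate()
)

print(spark.version)
```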
Lazy Evaluation
Transformations don’t compute the data immediately. They keep a record of the filters, joins, maps, and other functions as a lineage and execute them only when an action is called.
This lets Spark analyze the lineage and compute the result in the most optimal way.
The lineage is saved as a DAG (directed acyclic graph).
Transformations: orderBy(), groupBy(), filter(), select(), join()
Actions: show(), take(), count(), collect(), save()
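A brief DataFrame sketch of lazy evaluation with these transformations and an action (column names and rows are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 5), ("a", 3)], ["key", "value"])

# Transformations only build up the query plan (the DAG); nothing runs yet
result = (
    df.filter(F.col("value") > 1)
      .groupBy("key")
      .sum("value")
      .orderBy("key")
)

# The action triggers the optimized plan to execute
result.show()
```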
Narrow vs Wide Transformations
Narrow Transformations
Narrow transformations are those where each input partition contributes to exactly one output partition. These transformations do not require data shuffling across the network, making them more efficient in terms of performance and resource usage.
Characteristics of Narrow Transformations:
- No Data Shuffling: Data is not transferred across the network, which minimizes the overhead.
- Pipelining: Multiple narrow transformations can be pipelined and executed in a single stage.
- Local Dependencies: Each output partition depends only on a small subset of input partitions.
Examples of Narrow Transformations:
- map: Applies a function to each element of the RDD and returns a new RDD.
rdd2 = rdd.map(lambda x: x * 2)
- filter: Selects elements that satisfy a predicate.
rdd2 = rdd.filter(lambda x: x > 2)
- flatMap: Similar to map, but each input element can be mapped to zero or more output elements.
rdd2 = rdd.flatMap(lambda x: [x, x * 2])
- sample: Samples a subset of elements from the RDD.
rdd2 = rdd.sample(False, 0.5)
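A self-contained sketch chaining these narrow transformations (input values are arbitrary); since none of them shuffle data, Spark pipelines them into a single stage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4], numSlices=2)

# Each step maps an input partition to exactly one output partition,
# so the whole chain runs as one pipelined stage with no shuffle
result = (
    rdd.map(lambda x: x * 2)           # [2, 4, 6, 8]
       .filter(lambda x: x > 2)        # [4, 6, 8]
       .flatMap(lambda x: [x, x + 1])  # [4, 5, 6, 7, 8, 9]
)

print(result.collect())
```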
Wide Transformations
Wide transformations are those where each input partition contributes to multiple output partitions. These transformations typically involve shuffling data across the network, which can be resource-intensive and time-consuming.
Characteristics of Wide Transformations:
- Data Shuffling: Data is redistributed across the network to produce the output RDD.
- Stage Boundaries: Wide transformations usually result in the creation of new stages in the DAG, separated by shuffles.
- Global Dependencies: Each output partition can depend on multiple input partitions.
Examples of Wide Transformations:
- groupByKey: Groups values with the same key.
rdd2 = rdd.groupByKey()
- reduceByKey: Aggregates values with the same key using a specified associative reduce function.
rdd2 = rdd.reduceByKey(lambda x, y: x + y)
- join: Joins two RDDs by their keys.
rdd3 = rdd1.join(rdd2)
- distinct: Returns a new RDD with distinct elements.
rdd2 = rdd.distinct()
- repartition: Changes the number of partitions in the RDD.
rdd2 = rdd.repartition(4)
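A self-contained sketch of a wide transformation (the key-value pairs are made up); reduceByKey shuffles data, so it introduces a stage boundary in the DAG:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wide-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey shuffles records so all values for a key end up in the same
# partition, then applies the associative reduce function per key
totals = pairs.reduceByKey(lambda x, y: x + y)

print(sorted(totals.collect()))  # [('a', 4), ('b', 6)]
```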