PySpark Notes
Apache Spark
An in-memory data processing framework designed for large-scale distributed data processing.
In-memory processing in Apache Spark refers to the ability to store and process data in the memory (RAM) of the cluster's nodes rather than on disk. This feature significantly boosts performance, especially for iterative algorithms and interactive data analysis.
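A minimal sketch of keeping a DataFrame in executor memory with cache() (the input path and the status column are made-up placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical input path; any dataset works the same way
df = spark.read.parquet("/data/events.parquet")

# Ask Spark to keep the data in memory once it has been computed
df.cache()

df.count()                             # first action reads from disk and caches
df.filter(df.status == "ok").count()   # later actions reuse the in-memory data
```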
Resilient Distributed Datasets (RDDs) are a fundamental data structure in Apache Spark, designed to handle large-scale data processing tasks efficiently. They are immutable, distributed collections of objects that can be processed in parallel across a cluster. Here’s a detailed explanation of RDDs:
Key Features of RDDs
- Immutability:
- Once an RDD is created, it cannot be altered. However, transformations can be applied to produce new RDDs.
- This immutability ensures consistency and simplifies parallel processing.
- Distributed:
- RDDs are distributed across multiple nodes in a cluster, enabling parallel processing.
- This distribution allows Spark to handle large datasets that wouldn't fit on a single machine.
- Fault Tolerance:
- RDDs achieve fault tolerance through lineage information. If a partition of an RDD is lost, it can be recomputed from the original data using the sequence of transformations (lineage).
- This design eliminates the need for data replication.
- Lazy Evaluation:
- Transformations on RDDs are lazily evaluated, meaning they are not computed immediately. Instead, Spark builds a lineage graph of transformations.
- The actual computation happens only when an action (e.g., count, collect) is invoked.
- Operations:
- RDDs support two types of operations: transformations and actions (see the sketch after this list).
- Transformations (e.g., map, filter, reduceByKey): Create new RDDs from existing ones. These are lazy.
- Actions (e.g., collect, count, saveAsTextFile): Trigger the execution of transformations and return results to the driver or write them to storage.
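A small sketch of the transformation/action split on an RDD, using arbitrary sample values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: only recorded in the lineage, nothing runs yet
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Action: triggers execution of the recorded lineage
print(evens.collect())  # [4, 8]
```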
Way faster than Hadoop MapReduce, largely because intermediate results stay in memory instead of being written to disk between stages!
Apache Spark incorporates libraries with composable APIs for:
- MLlib - for machine learning
- Spark SQL - for structured data and SQL queries (see the sketch after this list)
- Structured Streaming - for working with real-time data
- GraphX - for graph processing
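A minimal Spark SQL sketch (the view name and rows are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Hypothetical in-line data; in practice this would come from files or tables
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```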
Spark can read data from multiple sources.
Spark Driver
Responsible for starting the Spark session. It requests resources like CPU and memory from the cluster manager for the Spark executors.
Spark Session
Entry point for all Spark applications.
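A minimal sketch of creating the SparkSession entry point (the app name and config value are placeholders):

```python
from pyspark.sql import SparkSession

# Single entry point for the DataFrame and SQL APIs
spark = (
    SparkSession.builder
    .appName("my-app")                       # placeholder application name
    .config("spark.executor.memory", "2g")   # example resource setting
    .getOrCreate()
)

print(spark.version)
```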
Lazy Evaluation
Transformations don’t compute the data immediately. They keep a record of the filters, joins, maps, and other functions as a lineage and execute them only when an action is called.
This lets Spark analyze the lineage and compute the result in the most optimal way.
The lineage is saved as a DAG (directed acyclic graph).
Transformations: orderBy(), groupBy(), filter(), select(), join()
Actions: show(), take(), count(), collect(), save()
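A brief DataFrame sketch of lazy evaluation with these transformations and an action (column names and rows are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 5), ("a", 3)], ["key", "value"])

# Transformations only build up the query plan (the DAG); nothing runs yet
result = (
    df.filter(F.col("value") > 1)
      .groupBy("key")
      .sum("value")
      .orderBy("key")
)

# The action triggers the optimized plan to execute
result.show()
```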
Narrow vs Wide Transformations
Narrow Transformations
Narrow transformations are those where each input partition contributes to exactly one output partition. These transformations do not require data shuffling across the network, making them more efficient in terms of performance and resource usage.
Characteristics of Narrow Transformations:
- No Data Shuffling: Data is not transferred across the network, which minimizes the overhead.
- Pipelining: Multiple narrow transformations can be pipelined and executed in a single stage.
- Local Dependencies: Each output partition depends only on a small subset of input partitions.
Examples of Narrow Transformations:
- map: Applies a function to each element of the RDD and returns a new RDD.
rdd2 = rdd.map(lambda x: x * 2)
- filter: Selects elements that satisfy a predicate.
rdd2 = rdd.filter(lambda x: x > 2)
- flatMap: Similar to map, but each input element can be mapped to zero or more output elements.
rdd2 = rdd.flatMap(lambda x: [x, x * 2])
- sample: Samples a subset of elements from the RDD.
rdd2 = rdd.sample(False, 0.5)
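A self-contained sketch chaining these narrow transformations (input values are arbitrary); since none of them shuffle data, Spark pipelines them into a single stage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4], numSlices=2)

# Each step maps an input partition to exactly one output partition,
# so the whole chain runs as one pipelined stage with no shuffle
result = (
    rdd.map(lambda x: x * 2)           # [2, 4, 6, 8]
       .filter(lambda x: x > 2)        # [4, 6, 8]
       .flatMap(lambda x: [x, x + 1])  # [4, 5, 6, 7, 8, 9]
)

print(result.collect())
```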
Wide Transformations
Wide transformations are those where each input partition contributes to multiple output partitions. These transformations typically involve shuffling data across the network, which can be resource-intensive and time-consuming.
Characteristics of Wide Transformations:
- Data Shuffling: Data is redistributed across the network to produce the output RDD.
- Stage Boundaries: Wide transformations usually result in the creation of new stages in the DAG, separated by shuffles.
- Global Dependencies: Each output partition can depend on multiple input partitions.
Examples of Wide Transformations:
- groupByKey: Groups values with the same key.
rdd2 = rdd.groupByKey()
- reduceByKey: Aggregates values with the same key using a specified associative reduce function.
rdd2 = rdd.reduceByKey(lambda x, y: x + y)
- join: Joins two RDDs by their keys.
rdd3 = rdd1.join(rdd2)
- distinct: Returns a new RDD with distinct elements.
rdd2 = rdd.distinct()
- repartition: Changes the number of partitions in the RDD.
rdd2 = rdd.repartition(4)
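A self-contained sketch of a wide transformation (the key-value pairs are made up); reduceByKey shuffles data, so it introduces a stage boundary in the DAG:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wide-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey shuffles records so all values for a key end up in the same
# partition, then applies the associative reduce function per key
totals = pairs.reduceByKey(lambda x, y: x + y)

print(sorted(totals.collect()))  # [('a', 4), ('b', 6)]
```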