PySpark Notes

Apache Spark

An in-memory data processing framework designed for large-scale distributed data processing.

In-memory processing in Apache Spark refers to the ability to store and process data in the memory (RAM) of the cluster's nodes rather than on disk. This feature significantly boosts performance, especially for iterative algorithms and interactive data analysis.
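
A minimal sketch of what this looks like in PySpark, assuming a local session (the app name is just an illustrative placeholder). cache() marks the DataFrame to be kept in executor memory; the first action materializes it there.

```python
from pyspark.sql import SparkSession

# A local session for experimenting; the app name is an arbitrary placeholder
spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

df = spark.range(1_000_000)   # a DataFrame of numbers 0..999,999

df.cache()                    # ask Spark to keep the data in executor RAM
df.count()                    # the first action materializes and caches it

# Later actions read the cached copy instead of recomputing from scratch
print(df.filter(df.id % 2 == 0).count())   # 500000
```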

Resilient Distributed Datasets (RDDs) are a fundamental data structure in Apache Spark, designed to handle large-scale data processing tasks efficiently. They are immutable, distributed collections of objects that can be processed in parallel across a cluster. Here’s a detailed explanation of RDDs:

Key Features of RDDs

  1. Immutability:
    • Once an RDD is created, it cannot be altered. However, transformations can be applied to produce new RDDs.
    • This immutability ensures consistency and simplifies parallel processing.
  2. Distributed:
    • RDDs are distributed across multiple nodes in a cluster, enabling parallel processing.
    • This distribution allows Spark to handle large datasets that wouldn't fit on a single machine.
  3. Fault Tolerance:
    • RDDs achieve fault tolerance through lineage information. If a partition of an RDD is lost, it can be recomputed from the original data using the sequence of transformations (lineage).
    • This design eliminates the need for data replication.
  4. Lazy Evaluation:
    • Transformations on RDDs are lazily evaluated, meaning they are not computed immediately. Instead, Spark builds a lineage graph of transformations.
    • The actual computation happens only when an action (e.g., count, collect) is invoked.
  5. Operations:
    • RDDs support two types of operations: transformations and actions.
      • Transformations (e.g., map, filter, reduceByKey): Create new RDDs from existing ones. These are lazy.
      • Actions (e.g., collect, count, saveAsTextFile): Trigger the execution of transformations and return results to the driver or write them to storage.
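
A small sketch to make the transformation/action split concrete, reusing the `spark` session from the earlier snippet; the input values are arbitrary illustration data:

```python
sc = spark.sparkContext                      # the RDD API lives on the SparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])        # an immutable, distributed collection

# Transformations: lazy, each returns a new RDD and extends the lineage
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# Action: triggers the actual computation and returns results to the driver
print(evens.collect())                       # [4, 16]
```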

Way faster than Hadoop MapReduce, largely because intermediate results stay in memory instead of being written to disk between stages!

Apache Spark incorporates libraries with composable APIs for:

  1. MLlib (machine learning)
  2. Spark SQL
  3. Structured Streaming - for interacting with real-time data
  4. GraphX (graph processing)

Spark can read data from multiple sources, such as CSV, JSON, Parquet, and JDBC databases.
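
As a quick illustration of the DataFrameReader API (assuming the `spark` session from the sketches above; the file paths are hypothetical placeholders):

```python
# Each reader method returns a DataFrame backed by the given source
csv_df = spark.read.option("header", "true").csv("data/people.csv")   # CSV with a header row
json_df = spark.read.json("data/events.json")                         # newline-delimited JSON
parquet_df = spark.read.parquet("data/metrics.parquet")               # columnar Parquet files
```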

Spark Driver: Responsible for starting the Spark session. It requests resources such as CPU and memory from the cluster manager for the Spark executors.

Spark Session: The entry point for every Spark application.
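
A minimal sketch of creating one; the app name and the executor-memory setting are illustrative choices, not requirements:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("notes-demo")                    # illustrative app name
    .config("spark.executor.memory", "2g")    # part of what the driver requests from the cluster manager
    .getOrCreate()                            # reuses an existing session if one is already running
)
```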

Lazy Evaluation

Transformations don’t compute the data immediately. Instead, Spark records the filters, joins, maps, and other functions as a lineage and executes them only when an action is invoked.

This lets Spark analyze the whole lineage and plan the most efficient way to compute it.

The lineage is saved as a DAG (directed acyclic graph).

Transformations: orderBy(), groupBy(), filter(), select(), join()

Actions: show(), take(), count(), collect(), save()
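
A quick sketch of laziness in action, assuming the `spark` session from above: nothing runs until count() is called.

```python
df = spark.range(100).toDF("n")       # numbers 0..99

# These only extend the lineage (DAG); no job is launched yet
filtered = df.filter(df.n > 10)
projected = filtered.select("n")

# The action triggers optimization and execution of the whole lineage
print(projected.count())              # 89
```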

Narrow vs Wide Transformations

Narrow Transformations

Narrow transformations are those where each input partition contributes to exactly one output partition. These transformations do not require data shuffling across the network, making them more efficient in terms of performance and resource usage.

Characteristics of Narrow Transformations:

  • No shuffle: data never moves between executors over the network.
  • Each input partition contributes to exactly one output partition.
  • Several narrow transformations can be pipelined together into a single stage.

Examples of Narrow Transformations: map(), filter(), flatMap(), union(), mapPartitions()
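
A sketch showing that narrow transformations leave the partitioning alone, assuming the SparkContext `sc` from the earlier RDD snippet:

```python
rdd = sc.parallelize(range(8), 4)        # 8 numbers spread over 4 partitions

mapped = rdd.map(lambda x: x * 10)       # narrow: works partition-by-partition
filtered = mapped.filter(lambda x: x > 20)

# No shuffle happened, so the partition count is unchanged
print(rdd.getNumPartitions(), filtered.getNumPartitions())   # 4 4
```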

Wide Transformations

Wide transformations are those where each input partition contributes to multiple output partitions. These transformations typically involve shuffling data across the network, which can be resource-intensive and time-consuming.

Characteristics of Wide Transformations:

  • Require a shuffle: records with the same key must move between executors over the network.
  • Each output partition can depend on data from many input partitions.
  • Introduce a stage boundary in the DAG, making them the main performance cost to watch.

Examples of Wide Transformations: groupBy(), groupByKey(), reduceByKey(), join(), distinct(), orderBy(), repartition()
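
And a sketch of a wide transformation, again assuming `sc` from above: values for the same key must be pulled together across partitions, which forces a shuffle.

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey shuffles records so all values for a key meet on one partition
totals = pairs.reduceByKey(lambda x, y: x + y)

print(sorted(totals.collect()))       # [('a', 4), ('b', 6)]
```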