The title of today’s post is somewhat careless (I am not really revisiting, just sort of…). But it has a purpose: to reintroduce a topic I only sketched earlier, in a post that was a re-share from Adrian Colyer’s excellent blog The Morning Paper. A recent post on that same blog rekindled my interest in revisiting resilient distributed datasets (RDDs) in the context of in-memory cluster computing, a timely topic for data engineers and data scientists.
The reference list of today’s post from The Morning Paper included the paper I thought I should review here. I treat this post as a reference for my own further study, recognizing that I am not an expert, but I am interested enough in the subject to learn it as deeply as I can. Indeed, the paper touches on topics I feel closer to, such as interactive data mining, open-source cluster- and stream-computing frameworks like Spark and its Scala API, developments in data analytics such as MapReduce, and database query languages such as SQL. Having said this, let us move to the paper:
The research for this paper was carried out by a sizeable group of researchers from the University of California at Berkeley; I counted nine authors. But the topic is understandably an involved one, where the interaction between academic research and direct practical or business applicability is intense, and such work often requires a large team.
The main contribution is a resilient distributed dataset framework that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner:
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks
From the abstract above we can immediately see the authors’ interesting proposal: to handle the iterative algorithms that current computing frameworks process inefficiently by performing coarse-grained in-memory computations rather than fine-grained updates. One might have thought that a bad idea, but the details throughout the paper dispel that concern.
So what was the main motivation? Something closely related to this outline:
Cluster computing frameworks like MapReduce and Dryad have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance. Although current frameworks provide numerous abstractions for accessing a cluster’s computational resources, they lack abstractions for leveraging distributed memory. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. Another compelling use case is interactive data mining, where a user runs multiple ad hoc queries on the same subset of the data. Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.
Recognizing this problem, researchers have developed specialized frameworks for some applications that require data reuse. For example, Pregel is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop offers an iterative MapReduce interface. However, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them.
This is the proposed solution to the problem at hand:
In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
But, echoing today’s The Morning Paper review on the debugging of distributed programs, the implementation of RDDs still has some challenges to overcome, as this paper rightly describes. They concern designing the application programming interface (API) in a way that minimizes the costs involved:
The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage on clusters, such as distributed shared memory, key-value stores, databases, and Piccolo, offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.
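To make that contrast concrete, here is a minimal sketch in plain Scala (illustrative names only, nothing from Spark’s internals): fine-grained mutation forces a log entry per update, while a coarse-grained transformation is recorded once and can simply be replayed on its stable input.

```scala
// Sketch only: the two fault-tolerance strategies the excerpt contrasts.

final case class CellUpdate(key: String, value: Int)

// Fine-grained: rebuild mutable state by replaying a per-cell update log.
// The log (or a replica) must grow with every single write.
def replayLog(log: Seq[CellUpdate]): Map[String, Int] =
  log.foldLeft(Map.empty[String, Int])((state, u) => state + (u.key -> u.value))

// Coarse-grained: one transformation over the whole dataset, recorded once.
// Recovery is just re-running it on the stable input.
def transform(input: Vector[Int]): Vector[Int] =
  input.filter(_ % 2 == 0).map(_ * 10)

val state = replayLog(Seq(CellUpdate("a", 1), CellUpdate("b", 2), CellUpdate("a", 3)))
assert(state == Map("a" -> 3, "b" -> 2))

assert(transform(Vector(1, 2, 3, 4)) == Vector(20, 40))
```

The cost asymmetry is the point: the fine-grained log scales with the number of updates, whereas the coarse-grained “log” is constant-size per transformation.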
The RDD abstraction in a coarse-grained setting seemed powerful enough to the authors, and it provides the necessary trade-off between efficiency and reusability:
Although an interface based on coarse-grained transformations may at first seem limited, RDDs are a good fit for many parallel applications, because these applications naturally apply the same operation to multiple data items. Indeed, we show that RDDs can efficiently express many cluster programming models that have so far been proposed as separate systems, including MapReduce, DryadLINQ, SQL, Pregel and HaLoop, as well as new applications that these systems do not capture, like interactive data mining. The ability of RDDs to accommodate computing needs that were previously met only by introducing new frameworks is, we believe, the most credible evidence of the power of the RDD abstraction.
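As a toy illustration of that expressiveness, the MapReduce model itself can be sketched with the flatMap/groupBy style of operators that RDDs provide. Plain Scala collections stand in for RDDs here, and `mapper`/`reducer` are hypothetical user functions:

```scala
// Sketch: MapReduce expressed as coarse-grained collection transformations.
// A word count, with local Seqs standing in for distributed RDDs.

def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

def reducer(values: Seq[Int]): Int = values.sum

def mapReduce(input: Seq[String]): Map[String, Int] =
  input
    .flatMap(mapper)                                       // "map" phase
    .groupBy { case (k, _) => k }                          // shuffle by key
    .map { case (k, kvs) => k -> reducer(kvs.map(_._2)) }  // "reduce" phase

val counts = mapReduce(Seq("a b a", "b c"))
assert(counts == Map("a" -> 2, "b" -> 2, "c" -> 1))
```

In an actual RDD system each stage would run partitioned across the cluster, but the dataflow shape is the same.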
This belief rests on sound judgement and on first-hand experience with Spark and its Scala implementation:
We have implemented RDDs in a system called Spark, which is being used for research and production applications at UC Berkeley and several companies. Spark provides a convenient language-integrated programming interface similar to DryadLINQ in the Scala programming language. In addition, Spark can be used interactively to query big datasets from the Scala interpreter.
We believe that Spark is the first system that allows a general-purpose programming language to be used at interactive speeds for in-memory data mining on clusters. We evaluate RDDs and Spark through both microbenchmarks and measurements of user applications. We show that Spark is up to 20× faster than Hadoop for iterative applications, speeds up a real-world data analytics report by 40×, and can be used interactively to scan a 1 TB dataset with 5–7s latency
Resilient Distributed Datasets (RDDs) abstractions
In this section I briefly outline the resilient distributed dataset (RDD) abstraction as implemented in the paper:
Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs. We call these operations transformations to differentiate them from other operations on RDDs. Examples of transformations include map, filter, and join.
RDDs do not need to be materialized at all times. Instead, an RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage. This is a powerful property: in essence, a program cannot reference an RDD that it cannot reconstruct after a failure.
Finally, users can control two other aspects of RDDs: persistence and partitioning. Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage). They can also ask that an RDD’s elements be partitioned across machines based on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will be joined together are hash-partitioned in the same way.
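A minimal sketch of the hash-partitioning idea from the excerpt above, with a made-up partition count and made-up records; the point is that two datasets partitioned by the same function can be joined partition-by-partition, without moving data between partitions:

```scala
// Sketch: co-partitioning two keyed datasets so a join needs no shuffle.

val numPartitions = 4

def partitionOf(key: String): Int =
  math.floorMod(key.hashCode, numPartitions)

def hashPartition[V](records: Seq[(String, V)]): Map[Int, Seq[(String, V)]] =
  records.groupBy { case (k, _) => partitionOf(k) }

val users  = Seq("alice" -> 30, "bob" -> 25)
val visits = Seq("alice" -> "/home", "bob" -> "/cart")

val pUsers  = hashPartition(users)
val pVisits = hashPartition(visits)

// Same partitioner on both sides: a key always lands in the same-numbered
// partition, so the join below never crosses partition boundaries.
val joined = for {
  p      <- 0 until numPartitions
  lookup  = pUsers.getOrElse(p, Seq.empty).toMap
  (k, v) <- pVisits.getOrElse(p, Seq.empty)
  if lookup.contains(k)
} yield k -> (lookup(k), v)

assert(joined.toMap == Map("alice" -> (30, "/home"), "bob" -> (25, "/cart")))
```

This is exactly the placement optimization the quote describes: ensuring the two sides of a join are hash-partitioned in the same way.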
And now the Spark Programming Interface:
Spark exposes RDDs through a language-integrated API similar to DryadLINQ and FlumeJava, where each dataset is represented as an object and transformations are invoked using methods on these objects.
Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter). They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. Examples of actions include count (which returns the number of elements in the dataset), collect (which returns the elements themselves), and save (which outputs the dataset to a storage system). Like DryadLINQ, Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations.
In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, through flags to persist. Finally, users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
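The lazy-evaluation contract described above can be sketched with a toy “RDD” over a local Seq. `ToyRDD`, `source`, and the counter are illustrative stand-ins, not Spark’s actual classes:

```scala
// Sketch: transformations record work lazily; actions force evaluation;
// persist() caches the materialized result for reuse.

var evaluations = 0 // counts how many times the source pipeline actually runs

class ToyRDD[T](compute: () => Seq[T]) {
  private var cached: Option[Seq[T]] = None
  private var persisted = false

  // Transformations only record the computation; nothing runs yet.
  def map[U](f: T => U): ToyRDD[U] = new ToyRDD(() => materialize().map(f))
  def filter(p: T => Boolean): ToyRDD[T] = new ToyRDD(() => materialize().filter(p))

  def persist(): this.type = { persisted = true; this }

  private def materialize(): Seq[T] = cached.getOrElse {
    val result = compute()
    if (persisted) cached = Some(result)
    result
  }

  // Actions trigger the actual computation.
  def count(): Int = materialize().size
  def collect(): Seq[T] = materialize()
}

def source(data: Seq[Int]): ToyRDD[Int] =
  new ToyRDD(() => { evaluations += 1; data })

val base  = source(1 to 6).persist()
val evens = base.filter(_ % 2 == 0)
assert(evaluations == 0)          // defining transformations ran nothing
assert(evens.count() == 3)        // first action forces evaluation
assert(evens.collect() == Seq(2, 4, 6))
assert(evaluations == 1)          // persist() meant the source ran only once
```

Real Spark adds spill-to-disk and replication options to persist, but the laziness-plus-caching shape is the same.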
Figure 1 explained (Console Log Mining):
Suppose that a web service is experiencing errors and an operator wants to search terabytes of logs in the Hadoop filesystem (HDFS) to find the cause. Using Spark, the operator can load just the error messages from the logs into RAM across a set of nodes and query them interactively. She would first type the following Scala code:
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure.
The user can also perform further transformations on the RDD and use their results, as in the following lines:
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).
Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.
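The recovery step described above can be sketched as follows, with made-up log data: a partitioned base dataset, a lost derived partition, and recomputation of only that partition from its lineage:

```scala
// Sketch: lineage-based recovery of a single lost partition.
// Base data is partitioned; the derived "errors" dataset keeps one
// partition per base partition.

val linesPartitions: Vector[Vector[String]] = Vector(
  Vector("ERROR disk full", "INFO ok"),
  Vector("WARN slow", "ERROR HDFS timeout"),
  Vector("ERROR MySQL down")
)

// The recorded lineage: how each errors partition derives from lines.
def deriveErrors(part: Vector[String]): Vector[String] =
  part.filter(_.startsWith("ERROR"))

// Initially computed errors partitions.
var errorsPartitions: Vector[Option[Vector[String]]] =
  linesPartitions.map(p => Some(deriveErrors(p)))

// Simulate losing partition 1 (its node failed).
errorsPartitions = errorsPartitions.updated(1, None)

// Recovery: re-run the transformation on only the corresponding base
// partition; no other partition is touched and no checkpoint is needed.
val recovered: Vector[Vector[String]] = errorsPartitions.zipWithIndex.map {
  case (Some(part), _) => part
  case (None, i)       => deriveErrors(linesPartitions(i))
}

assert(recovered(1) == Vector("ERROR HDFS timeout"))
assert(recovered.flatten.size == 3)
```

In Spark the per-partition recomputations would also run in parallel on different nodes, which is the advantage the excerpt highlights over whole-program rollback.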
The advantages of RDDs are sketched in a table in the paper, comparing them with distributed shared memory (DSM):
The main difference between RDDs and DSM is that RDDs can only be created (“written”) through coarse-grained transformations, while DSM allows reads and writes to each memory location. This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage. Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.
It is also transparently clear what the RDD framework is not suitable for:
Applications Not Suitable for RDDs: As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to all elements of a dataset. In these cases, RDDs can efficiently remember each transformation as one step in a lineage graph and can recover lost partitions without having to log large amounts of data. RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler. For these applications, it is more efficient to use systems that perform traditional update logging and data checkpointing, such as databases, RAMCloud, Percolator and Piccolo. Our goal is to provide an efficient programming model for batch analytics and leave these asynchronous applications to specialized systems.
The paper further details the Spark programming interface, the representation of RDDs, the implementation and evaluation of the proposal, as well as related work in the field. I encourage all interested readers to read the full paper (well worth it…). For now I will finish with the paper’s fine main conclusions:
We have presented resilient distributed datasets (RDDs), an efficient, general-purpose and fault-tolerant abstraction for sharing data in cluster applications. RDDs can express a wide range of parallel applications, including many specialized programming models that have been proposed for iterative computation, and new applications that these models do not capture. Unlike existing storage abstractions for clusters, which require data replication for fault tolerance, RDDs offer an API based on coarse-grained transformations that lets them recover data efficiently using lineage. We have implemented RDDs in a system called Spark that outperforms Hadoop by up to 20× in iterative applications and can be used interactively to query hundreds of gigabytes of data. We have open sourced Spark at spark-project.org as a vehicle for scalable data analysis and systems research.
Inserted Images: from the paper link
Featured Image: Introduction to Big Data with Apache Spark