For the start of the second week of 2017 I just discovered another good blog about data science and computing systems, which I thought worth sharing here in The Information. It features some nice posts on important topics in distributed systems and cloud computing, such as dataflow graphs, big data processing and Byzantine fault tolerance. The researcher behind it, Murat Demirbas, is an Associate Professor in the Computer Science and Engineering Department at the University at Buffalo, SUNY, in the United States.
I thought this post from 2014 interesting as a way to introduce some new topics to cover here in this blog. For instance, the MapReduce programming model, heavily used for big data processing, hasn't yet been properly analyzed here, especially when it comes to its limitations for large-scale iterative graph processing, which stem from its functional, stateless style. In this post we can read about some alternatives and new frameworks designed to deal with this (I just need to find out whether the developments described in the post are still relevant today, two and a half years later).
Naiad is an open-source project that has been receiving a lot of attention recently. I expect we will hear more about Naiad, because it is very useful for low-latency real-time querying and high-throughput incremental processing of streaming big data. What is not to like?
Naiad is especially useful for incremental processing of graphs. As has been observed before, MapReduce is a poor fit for graph processing because of the large number of iterations graph applications need. MapReduce follows a functional, stateless model, so each iteration requires passing the entire state of the graph from one stage to the next, which is inefficient. And for real-time applications, the batch-processing delays of MapReduce become unacceptable.
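To make the inefficiency concrete, here is a small illustrative Python sketch (not real Hadoop code) of one PageRank iteration written MapReduce-style: because the model is stateless between stages, the whole rank vector is re-emitted and re-shuffled on every single iteration.

```python
# Sketch (illustrative Python, not real Hadoop code) of iterative PageRank
# expressed as repeated map + reduce passes. Since the model keeps no state
# between stages, the ENTIRE rank vector must be re-emitted and re-shuffled
# each iteration -- the inefficiency noted above.
from collections import defaultdict

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # toy input graph
ranks = {v: 1.0 / len(graph) for v in graph}
DAMPING = 0.85

def map_phase(graph, ranks):
    # emit each vertex's rank share to its out-neighbors
    for v, out in graph.items():
        for w in out:
            yield (w, ranks[v] / len(out))

def reduce_phase(pairs, n):
    # sum the contributions per vertex, then apply damping
    sums = defaultdict(float)
    for w, share in pairs:
        sums[w] += share
    return {v: (1 - DAMPING) / n + DAMPING * sums[v] for v in graph}

for _ in range(10):  # each iteration is a full map + shuffle + reduce pass
    ranks = reduce_phase(map_phase(graph, ranks), len(graph))
```

A dedicated graph-processing system avoids this by keeping the graph resident at the workers and shipping only the small per-iteration updates.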
The developer supplies a logical graph to Naiad to describe the dataflow of the computation. (Don't confuse this with the large-scale input graph that Naiad computes on.) The edges in this graph show dataflow; the vertices are stateful computation stages.
Figure 1 shows the overall architecture, with two main separate components: (1) incremental processing of incoming updates and (2) low-latency real-time querying of the application state. The query path is cleanly separated from the update path: queries are served from a slightly stale version of the current application state. This way, the query path does not get blocked or incur delays due to update processing. It also avoids complexity: if queries shared the same path with updates, they could observe partially processed, incomplete or inconsistent states, and we would have to worry about how to prevent this.
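The separation can be sketched in a few lines of Python (hypothetical names, not Naiad's actual code): updates mutate a live state, and a consistent snapshot is published only at batch boundaries, so queries never observe a half-applied batch and never block the update path.

```python
# Sketch of the update/query separation described above (hypothetical API,
# not Naiad's actual code). Updates mutate the live state; queries read only
# the last published, slightly stale snapshot.
class SnapshotStore:
    def __init__(self):
        self._live = {}       # state the update path mutates
        self._snapshot = {}   # immutable-by-convention view for queries

    def apply_batch(self, updates):
        # update path: apply a whole batch of updates, then publish
        for key, value in updates:
            self._live[key] = value
        self._snapshot = dict(self._live)  # swap in the new query view

    def query(self, key):
        # query path: reads the last published snapshot only
        return self._snapshot.get(key)

store = SnapshotStore()
store.apply_batch([("pagerank:a", 0.33), ("pagerank:b", 0.33)])
store._live["pagerank:a"] = 0.5   # mid-batch update, not yet published
print(store.query("pagerank:a"))  # still sees the stale-but-consistent 0.33
```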
Loops in the dataflow graph
The logical dataflow graph can have loops, and even nested loops. (Note that, in contrast, a MapReduce computation dataflow does not have any loops; it is a chain of stages, where each stage consumes the output of the previous stage and produces the input for the next.)
The loop concept in the dataflow graph is very useful, as it enables new applications that may not be possible to compute with MapReduce-like frameworks (at least in a reasonably efficient manner). Loops are a natural way of dealing with iterative graph processing, as in PageRank and machine learning applications.
Naiad even allows nested loops. But, as useful as they are, loops complicate the job of a stream processing system significantly. How do you keep track of when data can be purged, and ensure that data doesn't keep looping in the system forever? How do you differentiate between older data still circulating in the system and new data that is just entering the loop? To deal with these issues the paper introduces an epoch-based timestamp to label the data being processed. These timestamps make the model suitable for tracking global progress in iterative algorithms in a local manner. Progress tracking looks like a deep topic; the Naiad paper refers readers to a separate 2013 paper for the full explanation of the progress-tracking algorithm.
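The timestamp idea can be illustrated with a simplified Python sketch (simplified from the paper; the function names are mine): a timestamp is an input epoch plus one loop counter per nested loop the record is currently inside, and tuples compare lexicographically, which is exactly what lets the system tell old circulating data apart from newly entering data.

```python
# Simplified sketch of timely dataflow's timestamps: (epoch, c1, c2, ...)
# with one counter per nested loop. Lexicographic tuple comparison orders
# work, so old data looping in the system is distinguishable from new data.
def ingress(ts):
    # entering a loop appends a counter starting at 0
    return ts + (0,)

def feedback(ts):
    # going around the loop increments the innermost counter
    return ts[:-1] + (ts[-1] + 1,)

def egress(ts):
    # leaving the loop strips the innermost counter
    return ts[:-1]

old = feedback(feedback(ingress((1,))))  # epoch 1, third trip around the loop
new = ingress((2,))                      # epoch 2, just entered the loop
assert old == (1, 2) and new == (2, 0)
assert old < new      # later epochs dominate, regardless of loop progress
assert egress(old) == (1,)
```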
The paper calls the resulting model the timely dataflow model. In sum, the timely dataflow model supports:
+ structured loops allowing feedback in the dataflow,
+ stateful dataflow vertices capable of consuming and producing records without global coordination, and
+ notifications for vertices once they have received all records for a given round of input or loop iteration.
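The last two points can be sketched together, modeled loosely on the vertex callbacks described in the paper (the surrounding runtime, which decides when a notification is safe to deliver, is stubbed out here): records arrive without coordination via one callback, and a second callback fires once a round is complete.

```python
# Sketch of a stateful vertex, loosely modeled on the callbacks in the
# timely dataflow paper. The runtime that tracks progress and decides when
# a notification is safe to deliver is stubbed out.
from collections import defaultdict

class CountingVertex:
    """Counts records per timestamp; emits one total per completed round."""
    def __init__(self, send):
        self.counts = defaultdict(int)
        self.send = send  # downstream edge, supplied by the runtime

    def on_recv(self, record, ts):
        # may be called many times, in any order, without coordination
        self.counts[ts] += 1

    def on_notify(self, ts):
        # the runtime guarantees no more records will arrive for ts
        self.send((ts, self.counts.pop(ts)))

out = []
v = CountingVertex(out.append)
for r in ["x", "y", "z"]:
    v.on_recv(r, ts=0)
v.on_notify(ts=0)  # all of round 0 has been delivered
print(out)  # [(0, 3)]
```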
The logical dataflow graph, which denotes the stages of computation and the dataflow between them, is mapped onto the physical worker machines in a many-to-one fashion. Each worker may host several stages of the dataflow graph. The workers are data nodes: they keep a portion of the input data (usually a large-scale input graph, such as the Twitter follower graph) in memory. So it makes sense to move the computation (the dataflow graph stages) to the data (the partitioned input graph).
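A minimal sketch of this placement idea, assuming a simple hash-partitioning scheme (my assumption for illustration, not taken from the paper): vertices of the input graph are assigned to workers by a stable hash, so each worker holds its slice of the graph in memory and stages run where the data already lives.

```python
# Sketch of moving computation to data: the input graph is hash-partitioned
# across workers by source vertex (the partitioning scheme is an assumption
# for illustration, not taken from the paper).
import zlib
from collections import defaultdict

NUM_WORKERS = 4

def owner(vertex_id):
    # stable hash, so a vertex always maps to the same worker
    return zlib.crc32(vertex_id.encode()) % NUM_WORKERS

edges = [("a", "b"), ("b", "c"), ("c", "a")]
partitions = defaultdict(list)
for src, dst in edges:
    # each edge is stored with the worker that owns its source vertex
    partitions[owner(src)].append((src, dst))

print(sum(len(p) for p in partitions.values()))  # 3: every edge placed once
```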
At this point the blog post shows an example Naiad program. The object-oriented, Java-like nature of the MapReduce and Naiad APIs is easily visible there (Naiad itself is actually built in C# on .NET).
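This is not the listing from the blog post, but a loose Python transliteration of the shape such a program takes (hypothetical names; the real Naiad API is C#/LINQ-style): the program builds the logical dataflow as a chain of stages, then feeds input through it.

```python
# Loose Python transliteration of the shape of a dataflow program
# (hypothetical names, not Naiad's real API): build the logical graph of
# stages, then push records through it.
class Stage:
    def __init__(self, fn, downstream=None):
        self.fn = fn
        self.downstream = downstream

    def push(self, record):
        out = self.fn(record)
        if out is not None and self.downstream:
            self.downstream.push(out)

# build the logical graph: lowercase -> filter short words -> collect
results = []
collect = Stage(results.append)
keep_long = Stage(lambda w: w if len(w) > 3 else None, collect)
lowercase = Stage(str.lower, keep_long)

for line in ["Naiad Rocks", "big data"]:
    for word in line.split():
        lowercase.push(word)

print(results)  # ['naiad', 'rocks', 'data']
```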
The paper has extensive evaluation results. Naiad was deployed on up to 64 computers, and scalability results are shown for throughput, global barrier latency, progress tracking and speedup. PageRank (on the Twitter follower graph), logistic regression (as an example of batch iterative machine learning) and the k-Exposure algorithm (for Twitter topics) are used as examples.
The paper is 16 pages long, and packed with information. But several things remain unclear to me after reading.
How does Naiad do rate control? Within a loop, at each epoch a larger neighborhood of a vertex may get affected or triggered (think of a PageRank-like spreading application). How does this not cause an input avalanche? How does Naiad throttle sending so that it initiates only as much work as the worker nodes can consume?
It is not clear whether we can implement tightly coordinated applications in Naiad. By tightly coordinated applications I mean applications that require multihop transactions on the input graph, such as graph coloring and graph subcoloring.