The part III of Fault-tolerant multi-agent optimization continues today with the third installment. Following the conclusions of the previous parts I and II of this series, this post extends the results to a global cost function with the definition of a family of functions *C* (normalization functions) between two weaker versions of the problem for crash faults Byzantine faults, respectively:

## Fault-Tolerant Multi-Agent Optimization: Part III

*Abstract:*

We study fault-tolerant distributed optimization of a sum of convex (cost) functions with real-valued scalar input/output in the presence of crash faults or Byzantine faults. In particular, the goal is to optimize a global cost function 1/n(∑i∈Vhi(x)), where V={1,…,n} is the collection of agents, and hi(x) is agent i‘s local cost function, which is initially known only to agent i. Since the above global cost function cannot be optimized exactly in presence of crash faults or Byzantine faults, we define two weaker versions of the problem for crash faults and Byzantine faults, respectively.

When some agents may crash, the goal for the weaker problem is to generate an output that is an optimum of a function formed asC( ∑i∈N hi(x)+ ∑i∈F αihi(x)),

where N is the set of non-faulty agents, F is the set of faulty agents (crashed agents), 0≤αi≤1 for each i∈F and C is a normalization constant such that C(|N|+∑i∈Fαi)=1. We present an iterative algorithm in which each agent only needs to perform local computation, and send one message per iteration.

When some agents may be Byzantine, the system cannot take full advantage of the data kept by non-faulty agents. The goal for the associated weaker problem is to generate an output that is an optimum of a function formed as∑i∈N αihi(x),

such that αi≥0 for each i∈N and ∑i∈N αi=1. We present an iterative algorithm, where only local computation is needed and only one message per agent is sent in each iteration, that ensures that at least |N|−f agents have weights (αi‘s) that are lower bounded by 1/2(|N|−f).

### Problem Formulation

The authors of this interesting paper formulate a hypothesis of an iterative algorithm, deployed to synchronous systems but generalizable to asynchronous systems (discussed only in the final part of the paper), and which finds applications in the domain of large-scale distributed machine learning, where there exists either a data transmission constraint and/or a privacy issue:

This problem finds its applications in the domain of large-scale distributed machine learning, where data are generated at different locations and the data center at each location is not allowed to transmit all the locally collected data to other centers either due to transmission capacity constraint or due to privacy issue. This problem is well-studied in the scenario where each agent is reliable throughout any execution of an algorithm [7,15,21]. In this work, we consider the fault-tolerant version of this problem. In particular, we consider the setting where up to f of the n agents may crash or be Byzantine faulty.

The problems 1 and 2 were set by a figure as follows:

Problem 1

x˜ ∈ arg min x∈R C X i∈N hi(x) +∑ i∈F αihi(x) !

such that

∀i ∈ F, 0 ≤ αi ≤ 1 and C |N | + ∑i∈F αi ! = 1

Problem 2

with parameters β, γ, β ≥ 0

x˜∈ arg min x∈R ∑ i∈N αihi(x)

such that

∀ ∑ i ∈ N , αi ≥ 0, i∈N αi = 1, and

∑ i∈N 1(αi > β) ≥ γ

( Fig. 1: Problem formulations: All non-faulty agents must output an identical value x˜ ∈ R that satisfies the constraints specified in each problem formulation. )

We will say that Problem 1 or 2 is solvable if there exists an algorithm that will find a solution for the problem (satisfying all its constraints) for all admissible local cost functions, and all possible behaviors of faulty agents. Our problem formulations require that all non-faulty agents output asymptotically identical x˜ ∈ R, while satisfying the constraints imposed by the problem (as listed in Figure 1). Thus, the traditional fault-tolerant consensus [9] problem, which also imposes a similar agreement condition, is a special case of our optimization problem.1 Therefore, the lower bound of n > 3f for Byzantine consensus [9] also applies to our problem. Hence we assume that n > 3f.

The Algorithm 1 and 2 for the Problems 1 and 2 will be sketched for a reference about the computations involved in the paper:

for agent j at iteration t:Algorithm 1

Step 1:

Send xj [t − 1] to all the agents (including agent j itself).

Step 2:

Upon receiving xi[t−1] from agent i, compute h′j(xi[t−1]) – the gradient of function hj(·)at xi[t − 1] – and send it back to agent i.

Step 3:

Let R1j [t − 1] denote the set of gradients of the form h′i(xj [t − 1]) received as a result ofstep 1 and step 2. Update sj as

sj [t] = xj [t − 1] −λ[t − 1] / | R1j [t − 1] | (∑(i∈R1j) [t−1] h′i(xj [t − 1]) ) (2)

Step 4:

Send sj [t] to all the agents (including agent j itself).

Step 5:

Let R²j [t − 1] denote the set of auxiliary variables si[t] received as a result of step 4.

Update xj as

xj [t] =1 / |R²j [t − 1]| ∑(i∈R²j[t−1])si[t]. (3)

Steps 1, 2 and 3 correspond to the first round of information exchange, and step 4 corresponds to the second round of information exchange. We will show that Algorithm 1 correctly solves Problem 1. Intuitively speaking, the first round of information exchange corresponds to the standard gradient-method iterate, which drives each local estimate to a global optimum; the second round of information exchange forces all local estimates at non-faulty agents to reach consensus. Algorithm 2 will achieve a similar goal with a single round of exchange.

After a lengthy and thorough sketch of proofs, lemma and theorems (somewhat very lengthy), the authors presented us with the Algorithm 2:

In Algorithm 1, in each iteration t ≥ 1, there are two rounds of information exchange.Next we will present a simple algorithm which only requires one message sent by each agent per iteration. In this algorithm, each agent j maintains one local estimate xj , where xj [0] is an arbitrary input at agent j.

Algorithm 2

for agent j for iteration t ≥ 1:

Step 1:

Compute h′j(xj [t−1])– the gradient of local function hj(·) at point xj [t−1], and send the tuple (xj [t − 1], h′j (xj [t − 1])) to all the agents (including agent j itself).

Step 2: Let Rj [t − 1] denote the set of tuples of the form (xi[t − 1], h′i(xi[t − 1])) received as a

result of step 1. Update xj as

xj [t] = 1/|Rj [t − 1]| (∑(i∈Rj[t−1]) xi[t − 1] − λ[t − 1]h′i(xi[t − 1]))(61)

### Discussion and Conclusions

Following the presentation of the Algorithm 2 and 3 – this third one regarding the synchronous byzantine consensus algorithm, which I do not present here but recommend a further check by the readers, as well as all the appendices section -, the authors finally discuss and conclude the paper, with their proposal for the generalization to asynchronous byzantine fault crash algorithms (with a further Algorithm 4 and final Theorem 5 proposals):

So far, a synchronous system is considered. In an asynchronous system, when there are up to f crash faults, Problem 1 is not solvable, since it is possible that every agent in the system is non-faulty,but f agents are slow. In this case, the system will mistakenly “treat” the slow agents as crashed agents. Consequently, the weights of the slow agents may be strictly smaller than the other agents. Despite the impossibility of solving Problem 1 in asynchronous system, nevertheless, Problem 2can be solved with β ≥ 1 n and γ ≥ |N| − f. In particular, Algorithm 2 can be easily adapted for asynchronous system by modifying the receiving step (step 2). For completeness, we list out the algorithm for crash faults.(…)

In an asynchronous system, when there are up to f Byzantine faults, simple iterative algorithms like Algorithm 3 may not exist, observing that it is impossible to achieve Byzantine consensus with single round of message exchange with only n = 3f + 1 agents. In contrast, when the algorithm introduced in [1] is used as a communication mechanism in each iteration, we believe that Algorithm 3 can be modified such that it can solve Problem 2 with β ≥ 1 2(|N |−f) and γ ≥ |N | −2f. There may be a tradeoff between the system size n and the communication load in each iteration. We leave this problem for future exploration. Note that the definition of admissibility of the local functions in this report is slightly different from that in [19]. Comparing to [19], stronger assumptions are used in proving the correctness of the three iterative algorithms developed in this work. In particular, we require that the local functions have to have L–Lipschitz derivatives. Whether such assumptions are necessary or not is still open, and we leave this for future exploration as well.

*Featured Image: Solmaz S. Kia*