This is the review of the paper “Knowledge and Common Knowledge in a Distributed Environment” by Joseph Halpern and Yoram Moses, who won the 2009 Dijkstra Prize in Distributed Computing for their research of knowledge in the multi-agent setting.

Designing distributed systems is hard. Hard, because the majority of specifications imposes global constraints on the system, but which independent agents must co-operate to satisfy using only local information. Hard, because communication is the modus operandi of distributed systems: often all that the agents can do is to send or receive messages.

The inability to gauge the global state of the system, synchronization problems and faulty communication channels are just a few of the challenges that any developer meets in practice. Dealing with them requires the ability to track the diffusion of knowledge among the nodes and differentiate between states which appear similar on the first sight.

Bounding this uncertainty requires understanding how system controls affect its behavior, and quantifying the feedback from groups of agents emerging in the process of solving a global problem. Halpern and Moses formalize the change in the knowledge state and progression up the knowledge hierarchy, building a solid ground for the design and analysis of distributed protocols.

The authors provide two environments where knowledge management proves to be crucial for understanding the solution or its impossibility: the muddy children puzzle and the coordinated attack problem.

The muddy children puzzle

Distributed protocols often involve agents with some information about other nodes in the system. The following vignette captures this common layout:

A group of \(n\) children are playing together. During play, \(k\) of them, where \(k > 1\), get mud on their foreheads, but all stay silent as one of the parents approaches and says: "At least one of you has mud on your face". The parent then continues to ask the following question again and again: "Can any of you prove that you have mud on your forehead?". Assuming that all children understand the situation, have strong deductive abilities and tell the truth simultaneously, what happens next?

You might want to think about the problem on your own. The authors give the solution: all the kids say no \(k - 1\) times, but on the \(k\)th time all the dirty children say yes.

Before the kids start answering, each one of them already knows the truth of what the parent is going to say. Nevertheless, repeating the same question multiple times still manages to provide useful information. Halpern and Moses proceed to lay out a logical framework and define a knowledge hierarchy which enables rigorous reasoning about knowledge in distributed systems.

Building the knowledge hierarchy

Shedding light on the muddy children puzzle requires us to consider the collective knowledge of independent agents. Here we use two properties of knowledge as formalized in the framework:

  • what an agent knows depends only on its local history: the initial information and any events observed since the beginning
  • an agent knows only the truth

Then we are ready to classify the types of group knowledge:

Example: knowledge distribution
  • Unknown facts: none of the group members know the fact
  • Distributed knowledge: someone learning everything that each member of the group knows would know the fact
  • Someone knows: some members of the group know the fact
  • Everyone knows: every member of the group knows the fact
  • Everyone knows \(k\)-recursively: every member knows (that every member knows)\(^{k-1}\) the fact
  • Common knowledge: everyone knows \(k\)-recursively for all \(k \geq 1\)

Despite the clear hierarchical nature of the classification, these notions might coincide with each other, as in the case of a distributed system consisting of parallel processes with the common memory. On the other hand, if the processors are instead connected via some communication channel and each one of them has its own dedicated memory, then the distinction between classes is strict in the sense that there exists a task which requires an action such that one level of the hierarchy would suffice, but no other lower level would. In the case of the muddy children puzzle, everyone knows \(k\)-recursively is enough to obtain confessions from all of the kids with dirty faces, but everyone knows \((k-1)\)-recursively is not. Indeed, the parent's phrase guarantees that the children have common knowledge of the fact and that each child knows it.

This is a very common situation in distributed systems: communication helps the agents climb up the knowledge hierarchy. Two concepts in this view are of utmost importance: fact discovery, which promotes distributed knowledge to the explicit state (so that at least someone knows the fact) and fact publication, which makes the fact common knowledge. Halpern and Moses focus on the latter.

Common knowledge and fact publication are fundamental to reaching an agreement, starting a convention and deciding on a coordinated action. Halpern and Moses discuss two major ways of obtaining the common knowledge of a fact. The first one, group membership, is evident on the roads: law-abiding drivers must obtain a license before embarking on a trip. The second one is co-presence with the occurrence of the fact. If instead the parent told the same information in private, without other children taking a notice, then no common knowledge is available to the kids for deduction.

In the context of distributed systems, the fact that each agent is a node in the distributed protocol implies community membership, with common knowledge preset before the protocol starts running. Ensuring co-presence through messages, however, is not a trivial task.

Reasoning about the protocols

The coordinated attack problem is a nice example of all the subtleties that arise from using communication channels for simulating co-presence:

Two generals, each leading a separate army, are preparing to attack the city in the valley below the hilltops where the armies stay. If one army attacks without the other, they lose, but they have not agreed on the initial plan on where to launch the assault. Neither general is willing to attack until receiving the confirmation from the other army, sent by a messenger through the valley filled with the enemy. There are no other means of communication. The messenger might get lost or captured by the defenders of the city, but on this particular night, the communication is smooth. How long will the coordination take?

The fact that the coordination takes forever is a well-known result: the generals never reach an agreement and never attack. Proving such impossibility results is often a daunting task without formal tools that capture all the possibilities. The logical framework for knowledge that the authors develop provides all the necessary components.

The notion of a trace (called run by the authors) is the building block of the model we are going to consider. The trace is a description of how the system evolves, from start to the end of execution. The paper focuses on deterministic protocols, and thus the internal state of each processor is a function of the history that we can recover from the trace.

Halpern and Moses derive two key results, giving a logical interpretation of the model, which not only gives us the power of propositional logic, but also enables us to reason about the knowledge in a system in terms of modal operators. First of all, achieving common knowledge requires simultaneity: whenever a group learns a new fact, the change in local histories of its members must be simultaneous. For the coordinated attack problem, any protocol must account for the fact that the decision to attack, if ever reached, is common knowledge, and thus the generals must act simultaneously. Finally, the authors prove that nothing can become common knowledge unless the fact is common knowledge in absence of communication. Thus, no army ever launches an assault in the coordinated attack problem, given a correct protocol. Even when the channel of communication is reliable, but the message delivery time is unbounded, agents still cannot achieve common knowledge.

Bringing common knowledge to practice

Relaxing the requirement of simultaneity to account for temporal imprecision, the authors show that, in the strict sense, practical distributed systems still cannot attain common knowledge in the absence of a global clock. When considering the muddy children puzzle, for example, we can say that some children could overhear what the parent says in the beginning, so they never attain common knowledge to start with. We seem to get stuck in the paradox: agreement and consensus are key to many real well-performing distributed systems on the one hand, and on the other — each of us is likely to have the gut feeling that the intuitive common knowledge is not a rare phenomenon.

The issue is that the notion of coordination that we have used so far is too strong. Many applications do not require simultaneous coordination. The authors offer three ways of introducing weaker states of common knowledge.

Suppose that all nodes receive a message within some epsilon time interval after it is sent. This \(\epsilon\)-common knowledge is attainable, but may require additional overhead for deciding the optimal value of epsilon.

Now, consider the case when the communication is asynchronous: the delivery times have no bounds, but channels guarantee that messages eventually reach the destinations. Unfortunately, this eventual common knowledge does not resolve the issue of the coordinated attack: any protocol that guarantees that whenever either army attacks, the other also eventually attacks, is such that neither party necessarily attacks. If you want to perform an action at all sites within a preset time period, asynchronous communication is of no use for coordination.

Finally, the authors suggest using relativistic time for attaining timestamped common knowledge in many practical cases by associating a timestamp with all knowledge. For example, distributed protocols often work in phases, which provide a natural interpretation as internal clocks. Paxos and atomic broadcast are just two algorithms that leverage timestamped common knowledge for consensus.

What's next?

The proposed framework suffers from the curse of generality: in the attempt to capture the complexity of the largest possible scope, the model makes several simplifying assumptions and discards specifics which may prove to be useful in designing a more optimal solution. For example, the paper looks only at deterministic protocols, which do a poor job at reflecting most real systems, where crashes and faults are common.

Despite these limitations, the paper has proven to be seminal not only in the distributed systems research, but also in AI security and game theory, showing that knowledge management on a group level is a viable approach for revealing the implementation subtleties and analyzing the performance of multi-agent systems.