Google B4 SDN-based WAN Paper Review

Review, Network


Problem

How to create an effective and efficient SDN-based WAN?

Introduction

WAN packet loss is typically considered unacceptable, which leads to overprovisioning of expensive WAN links. Yet Google's traffic classes have different priorities, and lower-priority traffic can defer to more important traffic; this fact can be leveraged to reduce the total bandwidth that must be provisioned.
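
To make the deferral idea concrete, here is a minimal sketch (my own illustration, not code from the paper) of how a link sized for the high-priority peak can still carry bulk traffic by pushing it to idle periods; the capacities and demands are made-up numbers.

```python
# Sketch (not from the paper): low-priority bulk traffic yields to
# high-priority user-facing traffic when a link is busy, so the link can be
# sized for the high-priority peak rather than the combined peak.

LINK_CAPACITY_GBPS = 100  # hypothetical link size

def allocate(high_priority_demand, low_priority_demand, capacity=LINK_CAPACITY_GBPS):
    """Serve high-priority demand first; low-priority traffic absorbs any shortfall."""
    high = min(high_priority_demand, capacity)
    low = min(low_priority_demand, capacity - high)
    deferred = low_priority_demand - low  # carried over to a later, idle period
    return high, low, deferred

# Peak hour: user-facing traffic fills most of the link, bulk copies back off.
print(allocate(high_priority_demand=80, low_priority_demand=60))  # (80, 20, 40)
# Off-peak: the deferred bulk traffic drains using otherwise idle capacity.
print(allocate(high_priority_demand=20, low_priority_demand=40))  # (20, 40, 0)
```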

Previous Work

OpenFlow has been proposed as an abstraction for forwarding-plane APIs. RCP/RouterFlow proposes aggregating route computation from multiple switches into a single place, which lays the basis for separating the control plane from the forwarding plane. Traffic engineering algorithms have also been proposed, but they do not support global optimization at runtime.

Implementation

The SDN-based WAN introduces several advantages. It supports rapid iteration on existing protocols and the introduction of new ones. It also enables straightforward capacity management via a central traffic engineering server and simplifies management of the WAN.
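
As a rough illustration of what central capacity management buys, the sketch below (all names and numbers are hypothetical, not the paper's API) shows a TE server's global view of usable capacity, where draining a link for maintenance is a single state change rather than per-router reconfiguration.

```python
# Hypothetical sketch of centralized capacity management: one server holds
# the full topology, so operators change capacity or drain links in one place.

from dataclasses import dataclass

@dataclass
class Link:
    src: str
    dst: str
    capacity_gbps: float
    drained: bool = False  # operator can take a link out of service centrally

def usable_capacity(links):
    """Total usable capacity between each site pair, as seen by the TE server."""
    caps = {}
    for l in links:
        if not l.drained:
            caps[(l.src, l.dst)] = caps.get((l.src, l.dst), 0.0) + l.capacity_gbps
    return caps

links = [Link("A", "B", 100), Link("A", "B", 100), Link("B", "C", 40)]
print(usable_capacity(links))   # {('A', 'B'): 200.0, ('B', 'C'): 40.0}
links[0].drained = True         # drain one A-B link for maintenance
print(usable_capacity(links))   # the TE server recomputes with the new view
```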

For hardware, the authors use a spine-leaf architecture built from high-speed merchant silicon, with an embedded processor running Linux and an agent that supports an extended version of OpenFlow. For the controllers, the authors use the Paxos algorithm to elect a master. For traffic engineering, they build a centralized TE service that supports tunneling and failure handling.
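
The TE allocation itself is, at its core, a priority-weighted sharing problem. The toy sketch below waterfills a single tunnel's capacity among weighted demands; it is a heavy simplification of the paper's bandwidth-function-based allocation across multiple tunnels, and every name in it is illustrative rather than taken from the paper.

```python
# Toy sketch, loosely in the spirit of B4's TE allocation but much simplified:
# progressively fill weighted demands until capacity runs out or every demand
# is satisfied; a satisfied demand's leftover share is redistributed.

def weighted_fill(demands, weights, capacity):
    alloc = {k: 0.0 for k in demands}
    remaining = capacity
    active = set(demands)
    while remaining > 1e-9 and active:
        total_w = sum(weights[k] for k in active)
        # Give each active flow its weighted share of what is left this round.
        share = {k: remaining * weights[k] / total_w for k in active}
        for k in list(active):
            give = min(share[k], demands[k] - alloc[k])
            alloc[k] += give
            remaining -= give
            if demands[k] - alloc[k] < 1e-9:
                active.discard(k)  # demand met; its share goes back to the pool
    return alloc

# Copy traffic (weight 1) defers to user-facing traffic (weight 10).
print(weighted_fill({"user": 60, "copy": 80}, {"user": 10, "copy": 1}, capacity=100))
# -> roughly {'user': 60.0, 'copy': 40.0}
```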

Insights

One of the most important insights is the observation that WAN traffic falls into different priority classes and should be managed accordingly. This insight underpins the usefulness of the SDN solution over the conventional WAN approach. From a global outage, the authors draw a few lessons about an SDN-based WAN: the latency between the OpenFlow agent and the controller is of great importance, and the TE server must be able to adapt to controller failures.

The SDN-based solution works well for Google's network traffic pattern but may not generalize to all WANs. At Google, the solution has already been deployed and in use for three years, achieving 99% uptime.