What the Hell is BGP and How Did It Cripple Google?

What is BGP?

The Border Gateway Protocol (BGP) is the routing protocol that carries traffic between networks across the Internet. That makes it a pretty important protocol, and it can also be one of the hardest to understand.

There are two main aspects of Internet routing: the fine-grained internal portions, managed by an Interior Gateway Protocol (IGP) such as Open Shortest Path First (OSPF), and the interconnections between Autonomous Systems (AS), managed by BGP.

BGP basics

  • The current version of BGP is BGP version 4, defined in RFC 4271.
  • BGP is a path-vector protocol that provides routing information for autonomous systems on the Internet via its AS-Path attribute.
  • BGP is an application-layer protocol that runs on top of TCP (port 179). In this respect it is much simpler than OSPF, because it doesn’t have to worry about the things TCP will handle, such as reliable, ordered delivery.
  • Peers that have been manually configured to exchange routing information will form a TCP connection and begin speaking BGP. There is no discovery in BGP.
  • An important aspect of BGP is that the AS-Path itself is an anti-loop mechanism. Routers will not import any routes that contain themselves in the AS-Path.
The image above is an example of BGP from the Cisco BGP Case Studies.
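
The AS-Path loop check from the list above can be sketched in a few lines of Python. This is only an illustration: the function names `accepts_route` and `advertise` are made up for this post, and the AS numbers come from the private-use range.

```python
from __future__ import annotations


def accepts_route(local_asn: int, as_path: list[int]) -> bool:
    """Return True if a router in local_asn may import a route with this AS-Path."""
    # A path that already contains our own AS number has looped back to us.
    return local_asn not in as_path


def advertise(local_asn: int, as_path: list[int]) -> list[int]:
    """Prepend our own AS number before passing the route to an external peer."""
    return [local_asn] + as_path


# AS 65001 originates a prefix, which is then passed on by 65002 and 65003.
path = advertise(65001, [])
path = advertise(65002, path)
path = advertise(65003, path)

print(path)                        # [65003, 65002, 65001]
print(accepts_route(65004, path))  # True: 65004 is not in the path yet
print(accepts_route(65001, path))  # False: the route has come back to its origin
```

Because every AS prepends its own number before re-advertising, the path doubles as a record of where the route has already been.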

Why do you need to understand BGP?

When BGP is configured incorrectly, it can cause massive availability and security problems, as Google discovered for a second time on November 12th, 2018, when its popular Google Cloud addresses began hopping through dozens of different countries. A BGP hijack is the illegitimate takeover of groups of IP addresses by corrupting the Internet routing tables maintained using the Border Gateway Protocol. Google is not the only company to be targeted by these malicious attacks. Clearly, BGP is significant. Here we'll provide a short overview of how BGP works, along with the problems it solves and causes.

Autonomous systems

First, a little terminology. In the world of BGP, each routing domain is known as an autonomous system, or AS. What BGP does is help choose a path through the Internet, usually by selecting the route that traverses the fewest autonomous systems: the shortest AS path.

You might need BGP, for example, if your corporate network is connected to two large ISPs. To use BGP you would need an AS number, which you can get from a regional Internet registry such as the American Registry for Internet Numbers (ARIN).

Once BGP is enabled, your router will receive a list of Internet routes from your BGP neighbors, who in this case will be your two ISPs. It will then scrutinize them to find the routes with the shortest AS paths, and these will be put into the router's routing table. Routers generally, but not always, choose the shortest AS path, and BGP only knows about paths from the updates it receives. Keep in mind that if your company or network only connects to a single ISP, BGP is not necessary, since there is only one path to the Internet.

Route updates

Unlike Routing Information Protocol (RIP), a distance-vector routing protocol that uses hop count as its routing metric, BGP does not periodically broadcast its entire routing table. When a session is first established, your peer will hand over its entire table. After that, everything relies on the incremental updates received.

Route updates are stored in a Routing Information Base (RIB). A routing table will only store one route per destination, but the RIB usually contains multiple paths to a destination. It is up to the router to decide which routes will make it into the routing table, and therefore which paths will actually be used. In the event that a route is withdrawn, another route to the same place can be taken from the RIB.

The RIB is only used to keep track of routes that could possibly be used. If a route withdrawal is received and the route only existed in the RIB, it is silently deleted from the RIB. No update is sent to peers. RIB entries never time out. They persist until the route is withdrawn or the session with the peer that announced it goes down.
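
To make the RIB versus routing table distinction concrete, here is a minimal Python sketch. The `Rib` class, the peer names, and the AS numbers are all invented for illustration, and it picks the winner by AS-path length only, which is a big simplification of what a real router does.

```python
from collections import defaultdict


class Rib:
    """Toy RIB: keeps every path learned per prefix, keyed by the announcing peer."""

    def __init__(self):
        # prefix -> {peer name: AS path learned from that peer}
        self.paths = defaultdict(dict)

    def update(self, prefix, peer, as_path):
        """Store (or replace) the path a peer announced for a prefix."""
        self.paths[prefix][peer] = as_path

    def withdraw(self, prefix, peer):
        """Drop the path a peer has withdrawn; other peers' paths are kept."""
        self.paths[prefix].pop(peer, None)

    def best(self, prefix):
        """Return the (peer, as_path) that would go into the routing table."""
        candidates = self.paths.get(prefix, {})
        if not candidates:
            return None
        return min(candidates.items(), key=lambda item: len(item[1]))


rib = Rib()
rib.update("203.0.113.0/24", "isp_a", [64500, 64510])
rib.update("203.0.113.0/24", "isp_b", [64501, 64520, 64510])
print(rib.best("203.0.113.0/24"))   # ('isp_a', [64500, 64510])

# isp_a withdraws the route, so the longer path via isp_b takes over.
rib.withdraw("203.0.113.0/24", "isp_a")
print(rib.best("203.0.113.0/24"))   # ('isp_b', [64501, 64520, 64510])
```

The backup path was sitting in the RIB the whole time, which is why the router can switch over without asking anyone for a fresh copy of the table.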

BGP path attributes

In many cases, there will be multiple routes to the same destination. BGP therefore uses path attributes to decide how to route traffic to specific networks. The easiest of these to understand is the shortest AS_Path. What this means is that the path which traverses the fewest ASes "wins."

Another important attribute is Multi_Exit_Disc (multi-exit discriminator, or MED). This makes it possible to tell a remote AS that, if there are multiple entry points into your network, a specific one is preferred. The Origin attribute specifies the origin of a routing update. If BGP has multiple routes, then origin is one of the factors in determining the preferred route.
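
Here is a rough Python sketch of how the three attributes above can act as successive tie-breakers. It is deliberately incomplete: a real BGP decision process looks at more than this (local preference, for one, and MED is normally only compared between routes from the same neighboring AS). The `Route` class and the sample values are assumptions made up for this example.

```python
from dataclasses import dataclass

# Lower rank is preferred: IGP beats EGP, which beats INCOMPLETE.
ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}


@dataclass
class Route:
    next_hop: str
    as_path: tuple
    origin: str
    med: int = 0


def preference_key(route):
    """Sort key: shortest AS path first, then best origin, then lowest MED."""
    return (len(route.as_path), ORIGIN_RANK[route.origin], route.med)


def best_route(routes):
    """Pick the most preferred route under the simplified ordering above."""
    return min(routes, key=preference_key)


candidates = [
    Route("192.0.2.1", (64500, 64510), "IGP", med=50),
    Route("192.0.2.2", (64500, 64510), "IGP", med=10),
    Route("192.0.2.3", (64501, 64520, 64510), "IGP", med=0),
]

# The first two tie on AS-path length and origin, so the lower MED decides.
print(best_route(candidates).next_hop)   # 192.0.2.2
```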

BGP issues

To get a true sense of how BGP works, it's important to spend some time talking about the issues that plague the Internet.

First, we have a very big problem with routing table growth. If someone decides to deaggregate a network that used to be a single /16 network, they could potentially start advertising hundreds of new routes. Every router on the Internet will get every new route when this happens. People are constantly pressured to aggregate, or combine multiple routes into a single advertisement. Aggregation isn't always possible, especially if you want to break up a /19 into two geographically separate /20s. By 2018 the global IPv4 routing table held more than 700,000 routes, and for a time it appeared to be growing exponentially.
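
The arithmetic behind deaggregation is easy to check with Python's standard `ipaddress` module. The prefixes below are arbitrary private addresses chosen for the example.

```python
import ipaddress

# One /16 announced as /24s becomes 256 separate routes.
block = ipaddress.ip_network("10.20.0.0/16")
more_specifics = list(block.subnets(new_prefix=24))
print(len(more_specifics))   # 256

# Aggregation is the reverse: contiguous prefixes collapse into one announcement.
aggregated = list(ipaddress.collapse_addresses(more_specifics))
print(aggregated)            # [IPv4Network('10.20.0.0/16')]

# Splitting a /19 between two sites yields two /20s that must be announced separately.
site_a, site_b = ipaddress.ip_network("10.30.0.0/19").subnets(prefixlen_diff=1)
print(site_a, site_b)        # 10.30.0.0/20 10.30.16.0/20
```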

Second, there is always a concern that someone will "advertise the Internet." If some large ISP's customer suddenly decides to advertise everything, and the ISP accepts the routes, all of the Internet's traffic will be sent to the small customer's AS. There's a simple solution to this. It's called route filtering. It's quite simple to set up filters so that your routers won't accept routes from customers that you aren't expecting, but many large ISPs will still accept the equivalent of "default" from peers that have no likelihood of being able to provide transit.
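
As a sketch of what such a filter does, the Python below accepts an announcement only if it falls inside a prefix the customer is expected to announce. Real routers express this with prefix lists or route maps in their own configuration language; the `build_filter` helper and the `max_length` cutoff of /24 are assumptions for illustration.

```python
import ipaddress


def build_filter(expected_prefixes, max_length=24):
    """Return a function that accepts only announcements inside the expected prefixes."""
    allowed = [ipaddress.ip_network(p) for p in expected_prefixes]

    def accept(announced):
        net = ipaddress.ip_network(announced)
        # Reject anything more specific than we are willing to carry.
        if net.prefixlen > max_length:
            return False
        # Accept only routes covered by a prefix registered for this customer.
        return any(
            net.version == a.version and net.subnet_of(a) for a in allowed
        )

    return accept


customer_filter = build_filter(["203.0.113.0/24"])

print(customer_filter("203.0.113.0/24"))   # True: the expected prefix
print(customer_filter("0.0.0.0/0"))        # False: the customer "advertising the Internet"
print(customer_filter("192.0.2.0/24"))     # False: someone else's address space
```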

Finally, we come to flapping. BGP has a mechanism to "hold down" routes that appear to be flaky. Routes that flap, or come and go, usually aren't reliable enough to send traffic to. If routes flap frequently, the load on all Internet routers will increase due to the processing of updates every time someone disappears and reappears. Dampening prevents BGP peers from acting on every routing update from flapping peers. The amount of time a route spends in hold-down increases with every flap. It's annoying when you have a faulty link, since it can be more than an hour before you can get to many Internet sites, but it is very necessary.
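
One common way to implement dampening is a penalty counter that grows with each flap and decays exponentially over time, in the spirit of RFC 2439. The Python sketch below follows that idea. The class name and every threshold in it are illustrative assumptions, not values taken from any particular router.

```python
class FlapDamper:
    """Toy route-flap damper: penalty rises per flap and halves every half_life_s."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_at=750.0, half_life_s=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay of the accumulated penalty since the last event.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def flap(self, now):
        """Record a flap at time `now` (seconds) and suppress if over the limit."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True

    def is_usable(self, now):
        """A suppressed route becomes usable again once the penalty decays enough."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return not self.suppressed


damper = FlapDamper()
for t in (0, 60, 120):          # three flaps within two minutes
    damper.flap(t)

print(damper.is_usable(120))    # False: the route is now held down
print(damper.is_usable(1020))   # False: 15 minutes later the penalty is still too high
print(damper.is_usable(1920))   # True: after 30 minutes it has decayed below reuse_at
```

Every extra flap adds to the penalty, so a route that keeps bouncing needs more half-lives to decay back under the reuse threshold.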

This quick discussion of BGP should be enough to get you thinking the right way about the protocol but is by no means comprehensive. Spend some time reading the RFCs if you're tasked with operating a BGP router. Your peers will appreciate it.

Google Affected

On November 12th, 2018, while trying to connect to this blog, which is hosted on Google Cloud, I began to notice connection issues between my home ISP and the IP address provided by Google. When running a traceroute, I saw that instead of heading toward the middle of the United States, my traffic was taking a trip around the world through countries such as Russia. Reviewing traffic from a workplace showed that the problem was not isolated to a single location but was affecting multiple users on multiple different ISPs. The path the traffic had taken was concerning, but after looking at the paths for multiple Google and non-Google sites, I concluded that an upstream issue was the cause. It turns out the issue was on a more global scale.

Image provided by RIPE NCC, showing a snapshot of the BGP routing announcements that led to Google traffic being routed in a roundabout path through China Telecom.

This incident at a minimum caused a massive denial of service to G Suite and Google Search. However, it also put valuable Google traffic in the hands of ISPs in countries with a long history of Internet surveillance. While the traffic might have been encrypted, there is always a question of whether its cryptographic strength is enough to keep the data from being decrypted, either now or in the future. I do not know whether this was a misconfiguration or a malicious act, but the leaked routes propagated from China Telecom, via TransTelecom, to NTT and other transit ISPs.

The incident that hit this Internet giant really shows the weakness in the infrastructure of the Internet. BGP was designed as a chain of trust between ISPs and universities, in which announcements from different points of origin were supposed to be accepted blindly. That trust has not evolved to reflect the complex commercial and security issues of current times. Verification methods such as Route Origin Authorization (ROA) exist, but very few ISPs use them. Even an Internet giant like Google is not immune to BGP hijacks and leaks. Google took nearly two hours to resolve the issue.

In the absence of guarantees, enterprises need to continuously monitor their BGP routes and detect such incidents quickly in order to mitigate any service impact to their business. Additionally, the underlying infrastructure of the Internet needs to be reevaluated going forward. Until next time!
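
For readers wondering what an ROA check looks like, here is a small Python sketch of route origin validation along the lines of RFC 6811. A route is "valid" if some ROA covers the prefix with a matching origin AS and an acceptable prefix length, "invalid" if ROAs cover it but none match, and "not-found" otherwise. The `Roa` tuple, the `validate` function, and the sample data are invented for illustration. Note that this only checks who originates a prefix, not the path the announcement took.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    prefix: str
    max_length: int
    origin_asn: int


def validate(prefix, origin_asn, roas):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Skip ROAs that do not cover the announced prefix at all.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        # A covering ROA matches if the origin AS and prefix length both fit.
        if roa.origin_asn == origin_asn and net.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


roas = [Roa("203.0.113.0/24", 24, 64496)]

print(validate("203.0.113.0/24", 64496, roas))    # valid
print(validate("203.0.113.0/24", 64511, roas))    # invalid: wrong origin AS
print(validate("203.0.113.128/25", 64496, roas))  # invalid: more specific than allowed
print(validate("198.51.100.0/24", 64496, roas))   # not-found: no ROA covers it
```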

References:

https://tools.ietf.org/html/rfc4271

https://blog.thousandeyes.com/internet-vulnerability-takes-down-google/

https://en.wikipedia.org/wiki/Border_Gateway_Protocol

https://www.arin.net/

https://bgpmon.net/securing-bgp-routing-with-rpki-and-roas/