Ragib Hasan's Computer Science Blog: Matthew Caesar's talk on Trustworthy Internet Infrastructure

On 11th March, 2009, I attended UIUC CS Assistant Prof. Matthew Caesar's talk on trustworthy Internet Infrastructure. This was part of the ITI/UIUC TSS Seminar Series.

Matt talked about how to make the Internet routing system trustworthy. Internet infrastructure is very complex. Some people consider it as the most complex thing ever designed. And modeling internet operation is difficult ...

Because of the enormous cost of changing the way the Internet operates, traditional approaches have been to focus on incremental workarounds. Any radical redesign would be useless as it is not practical to deploy.

However, the quick hacks are not clearly working, as we face many problems.

Matt pondered, what if we could start over? Start from scratch and focus on a clean slate approach? He argued that any redesign is not going to happen because of the enormous task of changing routers. So, the best we can do is to work with the existing schemes and make them trustworthy.

Caesar talked about 3 schemes towards trustworthy Internet. The first is easier network monitoring, second, making making network operations/debugging/monitoring simpler to design. and third, making network more available and resilient, with high uptimes.

Among these, Matt talked about Bug tolerant networks.

Building bug tolerant networks with bug tolerant routers

Today's routers have a lot of complex software running on them. Because of the complexity, bugs abound, causing a string of high profile vulnerabilities, outages. Like any complex software , complexity is the cause of bugs, which cause serious issues, that propagate , ISPs often run routers in a way that errors are cascaded.

A recent example was Misconfiguration at AS 47868, Supronet, that started as a BGP error. BGP routing updates, has path vectors, A particular isp supronet misconfigured their routers , a typo, they were running a mikrotik router, this vendor has a bug in router, and cisco also has a bug in their OS, long AS path would trigger buffer overflow. FEb 16, 2009, 1/3 of internet was slow.

Challenges of router bugs

Bugs are different from traditional failures. fail stop failures cause routers to stop working. Bugs cause misbehavior, violate protocol, need vendor repair, exploitable by attackers.

Vikram Adve asked what code was the problem -- can a malicious guy write bad code to run in the router. Caesar replied that's not the main issue they are concerned with. Carl Gunter asked how similar this is to active networking, it seems like a similar concept. Caesar said this is similar to active networks.

Solving the bug problem

Caesar's main research here focuses on how to run multiple copies of router software on the router. A single copy/implementation may have bugs, but with voting among multiple different implementations, bugs can be filtered out.

So, he and his team built a router that had 3 routing software daemons (Quaga, zorp, etc.) running in parallel. These run on top of a hypervisor. After these run, a voting tool decides the final result. The hypervisor gives the illusion of a single routing daemon.

Nitin Vaidya asked, does Caesar assume that similar specs used by all the implementations? Has Caesar thought if two BGP implementations can have different results? Caesar replied that yes, because there are some non-determinism, so results may vary between implementations.

Bill Sanders asked, there are lots of work in voting issues in faut tolerant computing, . Caesar said, the non-determinism is disabled by ISP operators. because for traffic engineering, they need to make things deterministic.

Majoz (?) asked what you do when you have a buggy daemon ... do you remove it? Caesar said they haven't thought of it, but you can patch/update a daemon, or give weights.

Someone asked how much overhead is there? Caesar said update has some additional latency, they implemented it, overheads are 10s of milliseconds to process. But that's OK, because routers only exchange updates every several seconds.

Carl Gunter remarked that most problems are caused by misconfiguration, so what if someone misconfigured all 3 of the daemons. Caesars said that, configuration issues are orthogonal to their research.

Someone asked about the overhead with additional instances of daemons. Caesar replied that each daemons can be run on a separate core in multicore proc (routers have desktop like processors).

Bill Sanders asked if Matt knew about papers or reports that document quantifiably how diverse different copies of software are to produce the same things. Caesar commented it is similar to Knight-Levenson experiment, but Sanders said that was a 1985 experiment done with a few group of students. He then pondered, how much diversity do you need here?

Anecdotal evidence by Bill Sanders - they had a big DARPA projects, two different operating systems, turned off all unneeded service. and then looked how many advisories are common to both operating systems. the answer was 60%. Is it good or bad number? They don't know.

Carl Gunter said that there is a much better chance of replicating the knight experiment because the code size is small. Sanders commented what fraction of error will be caught by the small piece of errors doing voting.

Vikram Adve said: if you are running 3 copies of code, then isn't performance reduced by a factor of 3. Caesar said the bottleneck is in the hardware in data plane.

When sending updates to neighbors, the router also uses voting to choose the update to advertise.

Finally, Matt presented the research questions.

how to do voting?
nature o routing bugs?
how to achieve diversity?

They also looked into different voting strategies:

Wait for consensus: output majority. Downside = wait for consensus to happen, that slows down updates.

Master slave - always output master's answer, but switch to slave on buggy behavior.

Another aspect Matt had little time to talk about was the Study of nature of router bugs. They looked at Bugzilla bug db, and characterized bugs, most are from se fault/crash/freeze.

Wrong behavior (Bugs) = 43%, router crash = 57%.

There was a big diversity across virtual routers, different techniques, code bases, software versions, configurations,

And then he ran out of time.

My take: Interesting idea. But the problem is, often all different implementations of a particular algorithm or routing scheme reuse some components (e.g. random number generator? Some existing common library?). Often, security vulnerabilities are in the common library code, rather than in any particular implementation. How much dependability do you get by using 3 different copies? I'll be very interested in looking into more numbers. I hope Matt's paper on this has more experimental results.

My take2: And obviously, I'm interested to know how secure provenance can solve the malicious router update problem. It seems that by considering provenance of an update, it may be possible to filter out bad/tampered update info.