IPv6 dual-stack client loss in Norway

By Tore Anderson, Redpill Linpro AS

With the kind assistance of two of my customers, A-Pressen Digitale Medier and VG Multimedia, I've been able to measure end user behaviour towards dual-stacked web sites. APDM and VG are both interested in making their content available over IPv6, but wanted first to make sure that this did not cause any unwanted consequences. Both APDM and VG are Norwegian-language online news publishers - their users are completely ordinary and non-technical.

The primary purpose of the measurement is therefore to determine to what extent we lose expected HTTP accesses when the end user's web browser is given the choice of accessing the content via either IPv4 or IPv6 (that is, from a dual-stack hostname), compared to the situation where the web browser has only one choice: IPv4. The assumption is that, all else being equal, a larger loss of accesses to the dual-stack hostname indicates that end users/clients have some kind of difficulty accessing content available over both IPv4 and IPv6. I use the term client loss when referring to this unexpected additional loss of accesses from the dual-stack hostname compared to that of the IPv4-only hostname. (I also describe the full setup and calculations used further down.)

The secondary purpose of the measurement is to determine why I observe client loss - what are the underlying causes? The answer appears to be old versions of the Opera web browser and of Mac OS X. In certain cases these prefer transitional IPv6 connectivity (6to4, Teredo) over the more reliable IPv4, which makes them less likely to succeed in contacting a dual-stack web server than a single-stack IPv4 one.

UPDATE 2010-12-21: Today, we deployed IPv6 on both A-pressen's and VG's sites. This means that the brokenness percentage from now on will likely be much understated, as broken users will tend not to reach the measurement rig in the first place. The brokenness percentage in the period 2010-12-14 to 2010-12-20 was 0.024%. (I've saved a snapshot of how this page looked on that day.)

UPDATE 2011-05-09: Last week (2nd-8th of May) we turned off IPv6 in order to get a new (and final) brokenness measurement. The result was a brokenness percentage of 0.015%. This result was presented at the IPv6-Kongress; the slide deck is available here.

Current status

The first graph shows the current overall client loss, while the second one shows a breakdown of the IPv6 traffic I see to the dualstack host.
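For reference, the 6to4/Teredo/native breakdown is based purely on the well-known address prefixes. The following is a minimal sketch of that kind of classification (my own illustration, not the exact script behind the graphs):

<?php
// Rough sketch (not the exact script behind the graphs): classify a client's
// IPv6 address as 6to4, Teredo or native based on its well-known prefix.
function classify_ipv6($addr) {
    $bin = inet_pton($addr);
    if ($bin === false || strlen($bin) != 16) {
        return 'not IPv6';
    }
    if (substr($bin, 0, 2) === "\x20\x02") {         // 2002::/16
        return '6to4';
    }
    if (substr($bin, 0, 4) === "\x20\x01\x00\x00") { // 2001:0::/32
        return 'Teredo';
    }
    return 'native';
}

echo classify_ipv6('2002:5018:2801::1'), "\n"; // 6to4
echo classify_ipv6('2a02:c0:1010:2::2'), "\n"; // native
?>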



The Mac OS X problem

Mac OS X has a problem in that versions older than 10.6.5 will prefer 6to4-based IPv6 over IPv4. That is very unfortunate, as 6to4 is much less reliable than IPv4. Most of the 6to4 traffic I see from OS X hosts uses EUI-64 derived IPv6 addresses, indicating that some other device in the end user's network is performing the 6to4 tunneling. I've made an ASCII art illustration of such a network. The following numbers and graphs are intended to show how the situation will look when all Mac OS X users have upgraded to version 10.6.5 or above. Unfortunately, the fix is only installable for users that are already running 10.6 "Snow Leopard"; for the users of 10.5 "Leopard" and 10.4 "Tiger", no patch is available.
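As an aside, spotting EUI-64 derived addresses is easy: a (modified) EUI-64 interface identifier has the bytes 0xff and 0xfe inserted in the middle of it. A minimal check, just for illustration:

<?php
// Illustration only: an interface identifier derived from a MAC address
// (modified EUI-64) has 0xff 0xfe inserted in the middle of its 64 bits,
// i.e. at byte offsets 11 and 12 of the 16-byte address.
function looks_like_eui64($addr) {
    $bin = inet_pton($addr);
    return $bin !== false && strlen($bin) == 16
        && $bin[11] === "\xff" && $bin[12] === "\xfe";
}

var_dump(looks_like_eui64('2002:5018:2801:0:219:e3ff:fe01:2345')); // true
var_dump(looks_like_eui64('2002:5018:2801::1'));                   // false
?>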

The graphs are generated by simply removing all log lines that contain "Mac OS X" in the User-Agent field prior to running the calculations.
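In other words, something along these lines (a trivial sketch of the filtering step, reading the access log on stdin):

<?php
// Sketch of the filtering step: anything whose User-Agent field mentions
// Mac OS X is dropped before the calculations run.
while (($line = fgets(STDIN)) !== false) {
    if (strpos($line, 'Mac OS X') === false) {
        echo $line;
    }
}
?>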

With Mac OS X out of the picture, the amount of 6to4 traffic drops significantly (and with it the amount of IPv6 traffic in total).

To further emphasise Mac OS X's IPv6 problems, I've made the following graphs and numbers that show the client loss amongst OS X-based clients only. Client loss is much, much higher than on the internet in general, and so is the use of 6to4.


The following graph shows the distribution of Mac OS X versions I see in my logs. Regarding the different versions shown:


In a perfect world...

These numbers and graphs show my idea of an ideal situation, where all Opera users have upgraded to 10.50 or later, and all Mac OS X users have upgraded to 10.6.5 or later. The client loss number is very close to 0% at this point - I believe that in this situation, dualstack client loss would no longer be a concern for most content providers.
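For the curious, this filtering amounts to inspecting the User-Agent strings. The sketch below is my own rough approximation of the idea (real User-Agent parsing has more corner cases than this), where ideal_world_keep() decides whether a log line is kept:

<?php
// Rough approximation of the "ideal world" filter, not the exact script used
// for the graphs: keep a log line only if the client is neither an Opera
// older than 10.50 nor a Mac OS X older than 10.6.5.
function ideal_world_keep($ua) {
    if (strpos($ua, 'Opera') !== false) {
        // Opera 10 and newer report their real version as "Version/x.yy".
        if (preg_match('#Version/(\d+)\.(\d+)#', $ua, $m)) {
            return $m[1] > 10 || ($m[1] == 10 && $m[2] >= 50);
        }
        return false; // no Version/ token => older than 10.00
    }
    if (strpos($ua, 'Mac OS X') !== false) {
        // Versions appear as e.g. "Mac OS X 10_6_5" or "Mac OS X 10.4".
        if (preg_match('/Mac OS X (\d+)[._](\d+)(?:[._](\d+))?/', $ua, $m)) {
            $minor = (int) $m[2];
            $patch = isset($m[3]) ? (int) $m[3] : 0;
            return $minor > 6 || ($minor == 6 && $patch >= 5);
        }
        return false; // version unknown => assume not upgraded
    }
    return true; // everything else is left in the log
}
?>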

When there's barely any client loss, there's barely any transitional IPv6:

Any remaining problems are really hard to distinguish from statistical noise at this point. However, I have some ideas/guesses:

Test setup

My customers have added the following code (replacing "cust" with their real domain name) to the HTML templates on their (IPv4-only) main sites:

<iframe src="http://dualstack-exp.cust.no/linkgen-cust.php" width="1" height="1" frameborder="0"></iframe>

This script is served from a web server under my control. It generates two randomly ordered image links to a 1x1 pixel PNG:

<?php
// Generate a unique ID for this page view and log it together with the
// client's address.
$rand = rand();
$id = time().":".$rand;
error_log('linkgen: '.$id.' '.$_SERVER["REMOTE_ADDR"]);

// Emit the two 1x1 image links in random order, so that neither hostname
// is systematically requested first.
if($rand % 2) {
    print '<img src="http://dualstack.cust.no/1x1-cust.png?id='.$id.'&src='.$_SERVER["REMOTE_ADDR"].'" height="1" width="1" alt="">'."\n";
    print '<img src="http://ipv4-only.cust.no/1x1-cust.png?id='.$id.'&src='.$_SERVER["REMOTE_ADDR"].'" height="1" width="1" alt="">'."\n";
} else {
    print '<img src="http://ipv4-only.cust.no/1x1-cust.png?id='.$id.'&src='.$_SERVER["REMOTE_ADDR"].'" height="1" width="1" alt="">'."\n";
    print '<img src="http://dualstack.cust.no/1x1-cust.png?id='.$id.'&src='.$_SERVER["REMOTE_ADDR"].'" height="1" width="1" alt="">'."\n";
}
?>

All of the DNS hostnames in question have a five-second TTL:

dualstack.cust.no.	5	IN	A	87.238.40.2
dualstack.cust.no.	5	IN	AAAA	2a02:c0:1010:2::2
ipv4-only.cust.no.	5	IN	A	87.238.40.3
dualstack-exp.cust.no.	5	IN	A	87.238.40.4

Everything is hosted on the same web server. The MTU is set to 1280, and the TCP MSS to 1220 for IPv6 and 1240 for IPv4. The PNG is small enough that the entire HTTP response fits comfortably inside a single packet anyway - the only thing I've seen require fragmentation is HTTP requests with very long headers.

When determining the client loss, I simply parse the HTTP access logs on the test server, and count the number of hits to the linkgen script (N) and to the 1x1 PNG via the ipv4-only hostname (Ns) and the dualstack hostname (Nd). Hits that re-use an already seen ID string (on the same hostname) are discarded. With all else equal, the assumption is that we should see an identical number of 1x1 PNG hits on the dualstack and the ipv4-only hostnames. If there is a difference, it is considered client loss. So for instance, if we have N = 1000, Ns = 995, and Nd = 994:

The measured client loss is then 100% * (Ns - Nd) / N = 100% * (995 - 994) / 1000 = 0.1%.
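In code, the counting boils down to something like this minimal sketch (my illustration only; the real scripts of course also deal with the actual Apache log format, the timeout variant described below, and so on):

<?php
// Simplified sketch of the counting step (not the actual parser): assume the
// access log has already been reduced to lines of "<hostname> <id>", e.g.
// "dualstack.cust.no 1292322229:1987163041", using the hostnames from the
// test setup above.
$n = 0;            // hits to the linkgen script
$seen = array();   // "<hostname> <id>" combinations already counted
$hits = array('dualstack.cust.no' => 0, 'ipv4-only.cust.no' => 0);

while (($line = fgets(STDIN)) !== false) {
    list($host, $id) = explode(' ', trim($line), 2);
    if ($host === 'dualstack-exp.cust.no') {
        $n++;                       // a linkgen run
        continue;
    }
    if (!isset($hits[$host]) || isset($seen["$host $id"])) {
        continue;                   // unknown hostname, or a duplicate hit
    }
    $seen["$host $id"] = true;
    $hits[$host]++;
}

$ns = $hits['ipv4-only.cust.no'];
$nd = $hits['dualstack.cust.no'];
printf("N=%d Ns=%d Nd=%d client loss=%.3f%%\n", $n, $ns, $nd,
       $n ? 100 * ($ns - $nd) / $n : 0);
?>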

The 10 second timeout

Since the IFRAME and PNGs are loaded in the background, it is likely that some of the apparently successful hits to the 1x1 PNG only occur after an initial attempt via IPv6 timed out - the user will generally not notice this. However, if such a timeout were to happen on the main site itself, it would cause an unacceptable service degradation. The 10 second timeout variant is an attempt to compensate for this effect. What I do is simply discard all 1x1 PNG requests that occur more than 10 seconds after the linkgen script that generated the IMG links has run (including ones to the ipv4-only hostname); the remaining log file is then processed in the same way as in the no-timeout variant. The graphs included on this page all apply the 10 second timeout; however, you can find graphs without the timeout applied here.
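Since the ID string embeds the output of time() from the linkgen run, applying the timeout only requires comparing that timestamp with the time of the PNG request. A minimal sketch of the check (again just an illustration, not the real script):

<?php
// Sketch of the 10 second timeout filter: the id parameter embeds the Unix
// timestamp from the linkgen run (time().":".$rand), so a PNG hit can simply
// be compared against it. $hit_time is the Unix timestamp of the log line,
// $id the value of the id= query parameter.
function within_timeout($hit_time, $id, $timeout = 10) {
    list($generated) = explode(':', $id);
    return ($hit_time - (int) $generated) <= $timeout;
}

var_dump(within_timeout(1292322235, '1292322229:1987163041')); // true  (6s)
var_dump(within_timeout(1292322245, '1292322229:1987163041')); // false (16s)
?>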

The assumption is that 10 seconds will not be sufficient for any application to fall back from a failed initial IPv6 attempt to IPv4. The quickest systemic fallback time we've been able to identify is about 21 seconds (non-Opera browsers on Windows). I've made a graph that compares the 10 second timeout to 5 and 20 second ones - it shows very little difference between the three timeouts, which I think means that the assumption holds.

Resources and links

Various interesting pages I've come across; please feel free to tip me about others!

Acknowledgements

I'd like to thank APDM and VG for allowing me to perform experiments on their readers, my own employer Redpill Linpro for encouraging me to use time on this, and Steinar H. Gunderson from Google for helping out tremendously all along.

Also I'd like to thank Opera Software for working with me and fixing the problem in their browser, Apple for fixing Mac OS X, and Fedora, Canonical, Gentoo, Novell, Mandriva, and Debian for applying my patches to glibc in their respective Linux distributions.