It’s time to change the web and stop paying bandwidth toll booths
Meet Edgemesh: the next generation of content distribution.
This web page is about 3.4 megabytes in size (compressed), and when you opened it in your browser a complicated (but well-orchestrated) dance occurred which moved the data from the LinkedIn servers to your device across an interconnected edge network called a Content Distribution Network (CDN). LinkedIn uses multiple CDN providers for different domains. For example, the domains http://www.linkedin.com and http://static.licdn.com are both served via Level3 Communications, and the core platform content (located at http://platform.linkedin.com ) is served up by the godfather of all CDNs: Akamai. It all happens in a split second, and it's been this way since the late 1990s, when Akamai pioneered the idea.
The hidden cost of the web: bandwidth
But there is a hidden transaction that took place: LinkedIn had to pay for the bandwidth it used to get this content to you. It's not much for this 3.4 MB page served up to a user in, say, New York. For a New York edge it only costs LinkedIn about 0.015 cents. If you're reading this from São Paulo, Brazil (bom dia!), it's a little bit more at about 0.091 cents. If you're in Sydney (g'day mate), then LinkedIn had to cough up roughly 0.33 cents.
It's not that much money, or so it seems, but it's a BIG business. The CDN market is roughly a $4 billion market today, and that's excluding the raw bandwidth prices everyone is paying on "the cloud". A little back-of-the-envelope math shows us what we're talking about:
Using some guesstimates, we can break down the total monthly CDN fees to bring LinkedIn to users everywhere, and it's probably somewhere in the $12-million-per-year range. This seems plausible given Dan Rayburn's estimate that Level3's 2015 CDN revenue was ~$170mm/year.
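For the curious, that back-of-the-envelope math fits in a few lines. The per-page costs come from the figures above; the page-view count and regional traffic mix are made-up illustrative inputs, not LinkedIn's real numbers:

```javascript
// Rough monthly CDN bill. Per-page-load costs are the figures quoted above;
// the page-view count and traffic mix are assumed for illustration only.
const COST_PER_PAGE_USD = { newYork: 0.00015, saoPaulo: 0.00091, sydney: 0.0033 };
const TRAFFIC_SHARE     = { newYork: 0.5,     saoPaulo: 0.2,     sydney: 0.3 };
const PAGE_VIEWS_PER_MONTH = 1e9; // assumed

let monthlyUsd = 0;
for (const region in COST_PER_PAGE_USD) {
  monthlyUsd += PAGE_VIEWS_PER_MONTH * TRAFFIC_SHARE[region] * COST_PER_PAGE_USD[region];
}
console.log(`~$${(monthlyUsd * 12 / 1e6).toFixed(1)}M per year`);
```

With these assumed inputs the annual figure lands in the low tens of millions, the same ballpark as the ~$12M estimate above.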
We can then look across the web and see what all our favorite sites are paying, and the numbers start to add up quickly. The New York Times, with all those beautiful graphics and images, racks up a hefty bill, but it's nothing compared to the millions of photos Pinterest is serving up to millions of devices every month. Undoubtedly there are further savings via pre-payment and minimum monthly commitments, so these shouldn't be taken as hard numbers, but they're Engineer Estimate Accurate ®.
If CDNs are so expensive, let's just use "the Cloud™"!
The Cloud is amazing and it's everywhere. Never before has it been so easy to spin up servers anywhere on the globe, allowing you to co-locate (cloud-locate?) your services near your customers. As a former operator of a cloud computing company and current board member of a major telecommunications company, I would like to think I have some experience in this field, and I feel it's time to come clean with some truth. It's true that doing work on data (compute) in the cloud is cheap, and storing data in the cloud is insanely cheap… but moving data out of the cloud? No. That's not cheap. Let's take a look. If you head on over to the AWS pricing page, you'll quickly find its bandwidth pricing to get data OUT of the cloud is extremely well hidden. The screenshot below shows US East (N. Virginia) per-GB bandwidth rates out to the internet.
Yes, that's roughly 2x the CDN rate for the US. The secret to cloud margins (aside from scale economics on storage and compute) is to charge an absurd amount for bandwidth. Inbound data is almost always free, but outbound? Pay the toll, please. To add complexity atop complexity, you can use a CDN alongside your cloud provider. I think Microsoft's Azure does an excellent job of explaining this here; take note of the small print and the strange "Zone"-based pricing.
Google is (along with Akamai) one of the only companies that peers with over 1,000 networks, so surely they have some sensible rates? Take a look at the GCE Internet Egress Rates, and then go punch something. My favorite aspect of that pricing is that the cost is asymmetric: it costs $0.12/GB to go from the US to Japan, but $0.14/GB to come back?
OK, this is terrible
If you've made it this far, you're probably thinking there has to be a better way. We thought the same thing, and so we got to work on solving this problem. To start, let's take a quick tour through the antiques roadshow of the web to see how we got here and where we need to go.
1990s : Centralized Servers
In the beginning (the 1990s) there was the server. Not a virtual machine, and not some 2U pizza box, but a server: a hulking, fire-breathing, disk-spinning, power-gulping server. The kind of machine that, while running, could heat an entire apartment complex.
The average web page in 1998 was around 170 KB, or roughly 15x smaller than today. In those days a centralized server held the content, and your browser would make the pilgrimage from your home to wherever the content was. Did it take a while? Yeah. A study by Jakob Nielsen noted that in 1998 users would tolerate page load times of 15 seconds. To give you a feeling for what an eternity that is today, note that a packet can go from a server in London to another server in New York and back 234 times and still have a few milliseconds to spare.
It was so bad in fact that many users believed the WWW stood for “World Wide Wait”.
2000s: The Age of Akamai
The idea was simple: in order to make the web run faster, we should have many smaller servers distributed around the world and direct users to the nearest one. The company founded on that idea was Akamai, and it has been, by any measure, a runaway success.
Today Akamai has more than 200,000 servers in over 120 countries and is one of the most interconnected networks on the planet. 1 out of every 3 companies in the Fortune 500 uses Akamai. Akamai has been so successful, in fact (2015 revenues of $2.2+ billion), that competitors have been few and far between. The sheer vastness of the Akamai network means any competitor would need to be at least as large and scalable to win large customers.
The problem(s) with the old design
How exactly does a CDN work? It's pretty simple. A CDN is a server (often called an edge server) that sits between your users and your content (called the origin). When a user requests a page (say https://cnn.com), the Domain Name System (DNS) converts that name to an IP address and directs the user to the nearest edge server.
The CDN server then checks if it has the content you've requested and, if so, serves the page. If it doesn't have the content (a cache miss), it simply requests the asset from the origin and then serves the page.
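In code, the edge server's decision comes down to a few lines. This is a toy model (a Map standing in for a real cache, a stub standing in for the origin), not any particular CDN's implementation:

```javascript
// Toy model of an edge server: serve from cache on a hit,
// fetch from the origin (and cache it) on a miss.
const edgeCache = new Map();

function originFetch(path) {
  return `<html>content of ${path}</html>`; // stand-in for a real origin request
}

function edgeServe(path) {
  if (edgeCache.has(path)) {
    return { body: edgeCache.get(path), cache: 'HIT' };
  }
  const body = originFetch(path); // cache miss: go to the origin…
  edgeCache.set(path, body);      // …store the asset for the next visitor…
  return { body, cache: 'MISS' }; // …and serve the page.
}
```

The first visitor to an edge pays the round trip to the origin; everyone after them gets the cached copy.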
This all happens transparently to the user, who doesn't know they're going to an edge server rather than an origin server. The DNS lookup (which here is technically called DNS hijacking) abstracts away all the complexity of finding the server nearest to you. If this seems like a lot of power to entrust to a vendor, it is. Luckily, big DNS providers don't fail often, but when they do, it tends to take down everyone.
So, as I said, most of the time this all works seamlessly.
Except when it doesn't. To the right is an example of the kind of errors that can occur when you place a server between your origin and your users. At the time of this post, https://cnn.com was still returning an SSL error from its upstream CDN provider.
At the end of the day, there’s no getting around the fact that your online business is dependent on your chosen edge partner. Unfortunately, those edge partners all use the same basic networks and tend to (because of the economics of peering) exist in the same data centers.
Below is a graph from the team at Cedexis which shows the response time (latency) over 30 days for a collection of CDN providers. If it looks like they are all highly correlated, it’s because they are. In our research we found the correlation of latency across all CDN providers in a region was ~84%.
Edgemesh : A new model for content distribution
So let’s recap:
- Distributing smaller servers around the world is a good idea
- Paying bandwidth fees at different rates around the world is a bad idea
- Hijacking a customer’s DNS is both a tough sell and a bad idea
- Using the traditional network design will increase your correlated failure rate which is a bad idea
- And finally… if we can do something that is 100% software and works with the existing edge infrastructure, then that's a good idea
Step 1: Move the edge server into the browser
Step 2: Replicate local caches between browsers in a peer to peer manner
While your users are reading the page, the edgemesh client calls out to other (nearby) browsers and asks for new content you may need next. This is done using a peer-to-peer protocol called WebRTC. If you need an asset which isn't in your local cache, then before the browser goes to the origin (or CDN), the edgemesh client checks whether any nearby user has the required content. If one does, the browser simply downloads the content from the nearest user. The effect is zero bandwidth fees for the website, and a much faster page load for the user.
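The lookup order just described (local cache first, then nearby peers, then the origin) can be sketched as follows. Peers are modeled here as plain Maps; a real deployment would reach them over WebRTC data channels, which this toy model omits:

```javascript
// Asset lookup order in the mesh: local cache → nearby peers → origin/CDN.
// Peer caches are plain Maps here; in reality they'd sit behind WebRTC.
function fetchAsset(url, localCache, peers, originFetch) {
  if (localCache.has(url)) {
    return { body: localCache.get(url), source: 'local' };
  }
  for (const peerCache of peers) {
    if (peerCache.has(url)) {          // a nearby browser already has it:
      const body = peerCache.get(url); // download peer-to-peer, zero CDN bytes
      localCache.set(url, body);       // replicate into our own cache
      return { body, source: 'peer' };
    }
  }
  const body = originFetch(url);       // nobody nearby has it: hit the origin
  localCache.set(url, body);
  return { body, source: 'origin' };
}
```

Note the replication step: every peer download also seeds the local cache, so popular assets spread through the mesh on their own.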
Step 3: Render the right asset, and optimize!
If you're reading this post on a mobile device, here's what happened: your device downloaded all the images in this post as they were, then resized them down to fit your screen. So you downloaded a ~500 KB image and then resized it down to ~200 KB. Why not just get the optimal image? Well, that's a lot of work for the website operator. There are dozens of image formats and even more screen sizes. Storing (and paying for cache storage on) all those formats and sizes is cost-prohibitive.
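Serving the "optimal image" largely boils down to picking the smallest stored variant that still covers the device's pixels. A minimal sketch, with made-up variant widths:

```javascript
// Pick the smallest pre-rendered image variant that is at least as wide as
// the device needs (CSS px × devicePixelRatio). Widths are illustrative.
const VARIANT_WIDTHS = [320, 640, 1024, 2048];

function pickVariant(cssWidth, devicePixelRatio) {
  const needed = cssWidth * devicePixelRatio;
  for (const w of VARIANT_WIDTHS) {
    if (w >= needed) return w; // first variant wide enough for the screen
  }
  return VARIANT_WIDTHS[VARIANT_WIDTHS.length - 1]; // fall back to the largest
}
```

A 375 px-wide phone at 2x devicePixelRatio needs 750 physical pixels, so it gets the 1024 px variant instead of wasting bandwidth on the 2048 px one.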
Step 4: Cache and Scale!
Here is where we get to the best part. Today the web has an inherent bottleneck and single point of failure: the upstream server that delivers the content. As your web traffic increases (hello, Cyber Monday), the load on those edge servers increases, and eventually, if pushed hard enough, they will fail. But with edgemesh the edge servers are the users, so you get reverse scaling: the more users on the site, the faster and more resilient it is!
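A toy model makes "reverse scaling" concrete. If we assume the chance that a nearby peer already holds an asset grows with mesh size, the load on the origin plateaus instead of growing linearly with the audience (the request rate and the constant K below are made-up numbers, not measurements):

```javascript
// Toy model of reverse scaling: assume the peer hit rate rises with mesh
// size, hitRate = users / (users + K). Origin load then approaches a fixed
// ceiling of REQS_PER_USER * K instead of growing with the audience.
const REQS_PER_USER = 10; // assumed requests/sec per active user
const K = 1000;           // assumed mesh "warm-up" constant

const hitRate    = (users) => users / (users + K);
const originLoad = (users) => users * REQS_PER_USER * (1 - hitRate(users));
```

With a classic CDN, a million users at 10 requests/sec each would mean 10 million requests/sec upstream; in this model the origin sees a ceiling of about 10,000 regardless of audience size.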
With edgemesh, we've fundamentally changed the physics of content distribution. Although this may seem like magic, in reality we are simply following the natural progression of the web toward a more distributed and resilient design. There is a lot of technology we've glossed over here, and if it sounds like this would be slow, just check out our site; it is, of course, edgemesh powered.
In our current customer deployments we’re finding that edgemesh can reduce bandwidth costs by up to 80% while simultaneously decreasing page load time by as much as a factor of 2.
Until next time!