Disturbed Distributed Computing

I don't care if it works on your machine! We are not shipping your machine!
- Vidiu Platon
Stig Lindqvist, @stojg

feel free to interrupt me at any point to ask questions or have opinions.

I'm not an expert, but I do have severe post-traumatic stress from dealing with networked computers.

What is distributed computing?

A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages.

  • browsers
  • routers
  • load balancers
  • web servers
  • database servers
  • asset storage

a.k.a. the internet

Why distributed computing?

  • performance
  • availability
  • scalability
  • reliability
  • price
  • upgradability

the network

The Network is reliable

this can't possibly fail, the network is reliable

                
echo file_get_contents("http://52.213.12.13");
                
            

The Network is not reliable

Nothing has 100% uptime
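
A minimal sketch of taking that seriously in the snippet above: check whether the request failed instead of echoing the result blindly. `fetchOrNull` is a hypothetical helper name, not something from the talk.

```php
<?php
// Hypothetical helper: returns the response body, or null when the request fails.
function fetchOrNull(string $url): ?string
{
    // @ silences the warning file_get_contents() raises on failure;
    // the false return value is checked explicitly instead.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

// Usage (not executed here, since it needs the network):
// $body = fetchOrNull("http://52.213.12.13");
// echo $body ?? "request failed, doing something else instead\n";
```

The caller is now forced to decide what happens when the network says no.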

The Network has zero latency

this can't timeout, the network is using quantum entanglement

                
echo file_get_contents("http://evil-corp.moon");
                
            

The Network has latency

Operation                            Time
CPU instruction                      1 ns
Datacenter roundtrip                 0.5 ms
Disk seek                            10 ms
Wellington -> Auckland roundtrip     10 ms
New Zealand -> U.S.A. roundtrip      150 ms

Approximate timing for various operations on a typical PC: http://norvig.com/21-days.html#answers
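
Since a roundtrip can take anywhere from half a millisecond to forever, one hedged option is to put an explicit upper bound on the wait. The sketch below uses the `timeout` option of PHP's HTTP stream context (a read timeout in seconds); `fetchWithTimeout` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: fetch a URL, but give up when reading takes
// longer than $seconds. Returns null on timeout or any other failure.
function fetchWithTimeout(string $url, float $seconds): ?string
{
    $context = stream_context_create([
        'http' => ['timeout' => $seconds],
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Usage (not executed here, since it needs the network):
// $body = fetchWithTimeout("http://evil-corp.moon", 2.0);
```

Without an explicit value, PHP falls back to the `default_socket_timeout` ini setting, which may be far longer than a visitor is willing to wait.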

The Network has infinite bandwidth

this will take no time at all, the network is broader than the universe

                
echo file_get_contents("http://linux.com/linux_kernel.iso");
            
            

The Network has limited bandwidth

Smartphone photo ~ 3 MB

Link speed       Throughput
56 kbit/s        0.002 images / second
24 Mbit/s        1 image / second
50 Mbit/s        2 images / second
100 Mbit/s       4 images / second
1000 Mbit/s      40 images / second
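
The numbers above are plain arithmetic: link speed in bits per second divided by the photo size in bits. A small sketch, assuming decimal megabytes (3 MB = 3,000,000 bytes) and ignoring protocol overhead; `imagesPerSecond` is a hypothetical helper name.

```php
<?php
// Images per second for a link speed in bits/s and an image size in bytes.
function imagesPerSecond(float $bitsPerSecond, float $imageBytes): float
{
    return $bitsPerSecond / ($imageBytes * 8);
}

$photoBytes = 3 * 1000 * 1000; // ~3 MB, decimal megabytes assumed
foreach ([56e3, 24e6, 50e6, 100e6, 1000e6] as $speed) {
    printf("%12.0f bit/s  %.3f images/s\n", $speed, imagesPerSecond($speed, $photoBytes));
}
```

The slide rounds the results down; real throughput will be lower still once headers, retransmits and shared links are taken into account.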

The Network is secure


$url = "http://secret.com?username=admin&password=horse";
echo file_get_contents($url);
            

The Network is not secure


$ traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 64 hops max, 52 byte packets
1  router.silverstripe.com (192.168.1.1)
2  igw-int.knossos.net.nz (202.160.48.64)
3  gw-49.knossos.net.nz (202.160.49.1)
4  xgw2.knossos.net.nz (202.160.48.6)
5  tengigabitethernet0-2-0-972.wnmur-rt1.fx.net.nz (131.203.245.129)
6  tengige0-0-2-0-310.aktnz-art1.fx.net.nz (202.53.187.202)
7  ten-0-3-0-1002.bdr01.akl05.akl.vocus.net.nz (175.45.102.57)
8  ten-0-2-0-3.cor01.alb01.akl.vocus.net.nz (114.31.202.88)
9  ten-0-2-1-0.cor03.syd03.nsw.vocus.net.au (114.31.202.36)
10  ten-1-2-0.bdr03.syd03.nsw.vocus.net.au (114.31.192.53)
11  as15169.cust.bdr01.syd03.nsw.vocus.net.au (114.31.201.18)
12  216.239.41.77 (216.239.41.77)
13  216.239.41.79 (216.239.41.79)
14  google-public-dns-a.google.com (8.8.8.8)
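
Every hop in that list can read a plain HTTP request, including a password in the query string. A hedged sketch of the usual mitigation: send credentials in a POST body over HTTPS with certificate verification enabled. `loginContext` is a hypothetical helper name, and the endpoint in the usage comment is only the example host from the earlier slide.

```php
<?php
// Hypothetical helper: build a stream context that sends the credentials
// in a POST body instead of the URL, and verifies the server certificate.
function loginContext(string $username, string $password)
{
    return stream_context_create([
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query([
                'username' => $username,
                'password' => $password,
            ]),
        ],
        'ssl' => [
            'verify_peer'      => true,
            'verify_peer_name' => true,
        ],
    ]);
}

// Usage (not executed here): note the https:// scheme.
// echo file_get_contents("https://secret.com/", false, loginContext("admin", "horse"));
```

TLS only protects the traffic between the two endpoints; the server on the other side still sees the password.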
             

The C.A.P. theorem

Theorem

a general proposition not self-evident but proved by a chain of reasoning; a truth established by means of accepted truths.

C.A.P.

Consistency | Availability | Partition tolerance

The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties. However, by explicitly handling partitions, designers can optimize consistency and availability, thereby achieving some trade-off of all three.

wat?

Consistency

a read is guaranteed to return the most recent write for a given client.

Availability

a non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).

Partition tolerance

the system will continue to function when network partitions occur.

Consistency & Partition tolerance

data writes wait until the partition has been resolved

data reads wait until the partition has been resolved

reader/writer deals with conflicts and timeouts

  • git push
  • atomic reads and writes, e.g. some SQL modes

Availability & Partition tolerance

data reads always work, but no guarantees of accuracy.

data writes always work, but no guarantees that they will be kept.

  • Content Delivery Networks
  • Facebook
  • AWS
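
The difference between the two choices can be shown with a toy model. This is an illustration only, not a real datastore: `ToyReplica` is a hypothetical class representing a single replica that may be cut off from its peers.

```php
<?php
// Toy model: one replica that may be partitioned from the rest of the cluster.
class ToyReplica
{
    /** @var bool true while the replica cannot reach its peers */
    public $partitioned = false;
    private $value = 'old';
    private $mode;

    public function __construct(string $mode) // 'CP' or 'AP'
    {
        $this->mode = $mode;
    }

    public function read(): string
    {
        if ($this->partitioned && $this->mode === 'CP') {
            // CP: refuse to answer rather than risk returning stale data.
            throw new RuntimeException('partitioned, try again later');
        }
        // AP (or no partition): always answer, possibly with a stale value.
        return $this->value;
    }
}
```

During a partition the CP replica makes the caller handle an error, while the AP replica answers immediately with whatever it last knew.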

"2 out of 3" is not actually true

without partition tolerance, you don't have strong consistency or high availability, you have a non-distributed system.

Sliding

As you slide between non-functional requirements for features, you slide the ratio between CP and AP.

Stories from the trenches

.. or how to deal with this as a user / client

Wrestling with AWS/cloud

Starting an instance

Super naive


$instances = runInstances([...]);
foreach($instances as $instance) {
    createTags($instance->ID, 'Name', "myserver");
}
            

Dealing with network errors


try {
    $instances = runInstances([...]);
} catch(ConnectionException $e) {
    echo ':sadtrombone:';
    exit(1);
}
            

Dealing with timeouts


try {
    $instances = runInstances([...]);
} catch(ConnectionException $e) {
    echo ':sadtrombone:';
    exit(1);
} catch(TimeoutException $e) {
    echo ':hippo:';
    exit(1);
}
            

Dealing with other API errors


try {
    $instances = runInstances([...]);
} catch(ConnectionException $e) {
    echo ':sadtrombone:';
    exit(1);
} catch(TimeoutException $e) {
    echo ':hippo:';
    exit(1);
} catch(ThrottleException $e) {
    echo ':headdesk:';
    exit(1);
} catch(InstanceLimitException $e) {
    echo ':picard:';
    exit(1);
} catch(Exception $e) {
    echo ':ive_stopped_caring_at_this_point:';
    exit(1);
}
            

Retries with exponential backoff and random jitter


define('MAX_RETRIES', 5);
$instances = runInstances([...]);
$retries = 0;
do {
    // exponential backoff plus random jitter, assuming the delays are
    // meant as milliseconds; usleep() takes microseconds, hence * 1000
    $sleepTime = pow(2, $retries) * 100 + rand(0, 100);
    usleep($sleepTime * 1000);
    $status = getStatus($instances);

    if($status == SUCCESS) $retry = false;
    elseif($status == NOT_READY) $retry = true;
    elseif($status == THROTTLED) $retry = true;
    else $retry = false; // some random error, stop trying
    $retries++;
} while($retry && $retries < MAX_RETRIES);
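
The same idea can be pulled out into a reusable helper. This is a sketch under assumptions, not the code from the talk: `backoffDelayMs` and `retryWithBackoff` are hypothetical names, the delays are in milliseconds, and the operation is any callable that returns true on success.

```php
<?php
// Hypothetical helper: delay in milliseconds before retry number $attempt
// (0-based): base * 2^attempt, capped, plus up to $jitterMs of random jitter.
function backoffDelayMs(int $attempt, int $baseMs = 100, int $capMs = 10000, int $jitterMs = 100): int
{
    $exponential = min($capMs, $baseMs * (2 ** $attempt));
    return $exponential + random_int(0, $jitterMs);
}

// Hypothetical helper: call $operation until it returns true or the attempts
// run out. $sleep is injectable so that tests do not have to actually wait.
function retryWithBackoff(callable $operation, int $maxRetries = 5, ?callable $sleep = null): bool
{
    $sleep = $sleep ?? function (int $ms): void {
        usleep($ms * 1000); // usleep() takes microseconds
    };
    for ($attempt = 0; $attempt < $maxRetries; $attempt++) {
        if ($operation() === true) {
            return true;
        }
        if ($attempt < $maxRetries - 1) {
            $sleep(backoffDelayMs($attempt)); // no point sleeping after the last try
        }
    }
    return false;
}
```

The jitter matters when many clients are throttled at the same moment: without it they would all retry in lockstep and get throttled again.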
            

Transactions


try {
    $instance = runInstance([...]);
    createTags($instance);
    stopInstance($instance);
    $ami = createImage($instance);
    createTags($ami);
    copyAMI($ami, 'client-account');
    deleteAMI($ami);
} catch(Exception $e) {
    // we can:
    // 1) retry the last step a couple of times
    // 2) but eventually we need to undo all the things
}
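
One way to sketch "undo all the things" is to pair every step with a compensating action and unwind in reverse order when a step fails. This is an assumption about how it could be structured, not the approach from the talk; `runWithCompensation` is a hypothetical helper and the steps are plain callables rather than real AWS calls.

```php
<?php
// Hypothetical helper: run steps in order. Each step is a pair
// [callable $do, callable $undo]. If a step throws, the undo callbacks
// of the steps that already succeeded are run, newest first.
function runWithCompensation(array $steps): bool
{
    $undoStack = [];
    try {
        foreach ($steps as [$do, $undo]) {
            $do();
            $undoStack[] = $undo;
        }
        return true;
    } catch (Exception $e) {
        foreach (array_reverse($undoStack) as $undo) {
            try {
                $undo();
            } catch (Exception $undoError) {
                // An undo can fail too; log it and keep unwinding.
                error_log('undo failed: ' . $undoError->getMessage());
            }
        }
        return false;
    }
}
```

Unlike a database transaction this is not atomic: other clients can see the half-finished state, and a failed undo leaves garbage behind that something else has to clean up.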
            

Availability & Partition tolerance?

or

Consistency & Partition tolerance?

Dealing with The State

my archenemy

Scaling out a web site

Need moar servers

separate out the database

Sessions

need separate session storage so the visitors can bounce

assets and caches

uploadable artifacts and clusterwide cache

.. and actually

the LBs and services need to be redundant

easily scalable

now we can scale the webservers without worrying

Oh dear

Each arrow is a network connection that needs to handle:

  • network unreliability
  • network latency
  • network bandwidth
  • network security

Endnote

computing is easy to scale

avoid state if possible

be defensive and paranoid

.. or toss it over the wall to the sysadmins