I don't care if it works on your machine! We are not shipping your machine!
- Vidiu Platon
Stig Lindqvist (@stojg)
feel free to interrupt me at any point to ask questions or have opinions.
I'm not an expert, but I do have severe post-traumatic stress from dealing with networked computers.
A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages.
this can't possibly fail, the network is reliable
echo file_get_contents("http://52.213.12.13");
Nothing has 100% uptime
this can't timeout, the network is using quantum entanglement
echo file_get_contents("http://evil-corp.moon");
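Both calls above assume the request always returns, and returns quickly. A minimal sketch of a more defensive version follows. The helper name `fetchWithTimeout` and the 2 second limit are assumptions made for illustration, not part of the talk:

```php
<?php
// Hypothetical helper: fetch a URL, giving up after $seconds.
// Returns the body as a string, or null when the request fails or times out.
function fetchWithTimeout(string $url, float $seconds = 2.0): ?string
{
    $context = stream_context_create([
        'http' => [
            'timeout' => $seconds, // timeout for the HTTP stream, in seconds
        ],
    ]);
    // @ silences the PHP warning; failure is reported through the return value
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

With this, `fetchWithTimeout("http://evil-corp.moon")` should return null on failure, so the caller is forced to decide what to do when the network does not cooperate.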
CPU instruction | 1 ns |
Datacenter roundtrip | 0.5 ms |
Disk seek | 10 ms |
Wellington -> Auckland roundtrip | 10 ms |
New Zealand -> U.S.A roundtrip | 150 ms |
Approximate timing for various operations on a typical PC: http://norvig.com/21-days.html#answers
this will take no time at all, the network is broader than the universe
echo file_get_contents("http://linux.com/linux_kernel.iso");
Smartphone photo ~ 3MB
56 kbit/s | 0.002 | images / second |
24 Mbit/s | 1 | image / second |
50 Mbit/s | 2 | images / second |
100 Mbit/s | 4 | images / second |
1000 Mbit/s | 40 | images / second |
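The figures in the table can be reproduced with a little arithmetic. This is a rough sketch that assumes decimal units (1 MB = 8 Mbit) and ignores protocol overhead; the function name `imagesPerSecond` is made up for the example:

```php
<?php
// Rough throughput: how many ~3 MB photos fit through a link per second.
// Assumes 1 MB = 8 Mbit and no protocol overhead.
function imagesPerSecond(float $linkMbitPerSecond, float $imageMegabytes = 3.0): float
{
    $imageMbit = $imageMegabytes * 8; // 3 MB is about 24 Mbit
    return $linkMbitPerSecond / $imageMbit;
}

// Link speeds from the table, expressed in Mbit/s (56 kbit/s = 0.056 Mbit/s)
foreach ([0.056, 24, 50, 100, 1000] as $mbit) {
    printf("%8.3f Mbit/s -> %.3f images/second\n", $mbit, imagesPerSecond($mbit));
}
```

The table rounds the results down, so the computed values come out slightly higher than the listed ones for the faster links.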
$url = "http://secret.com?username=admin&password=horse";
echo file_get_contents($url);
$ traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 64 hops max, 52 byte packets
1 router.silverstripe.com (192.168.1.1)
2 igw-int.knossos.net.nz (202.160.48.64)
3 gw-49.knossos.net.nz (202.160.49.1)
4 xgw2.knossos.net.nz (202.160.48.6)
5 tengigabitethernet0-2-0-972.wnmur-rt1.fx.net.nz (131.203.245.129)
6 tengige0-0-2-0-310.aktnz-art1.fx.net.nz (202.53.187.202)
7 ten-0-3-0-1002.bdr01.akl05.akl.vocus.net.nz (175.45.102.57)
8 ten-0-2-0-3.cor01.alb01.akl.vocus.net.nz (114.31.202.88)
9 ten-0-2-1-0.cor03.syd03.nsw.vocus.net.au (114.31.202.36)
10 ten-1-2-0.bdr03.syd03.nsw.vocus.net.au (114.31.192.53)
11 as15169.cust.bdr01.syd03.nsw.vocus.net.au (114.31.201.18)
12 216.239.41.77 (216.239.41.77)
13 216.239.41.79 (216.239.41.79)
14 google-public-dns-a.google.com (8.8.8.8)
a general proposition not self-evident but proved by a chain of reasoning; a truth established by means of accepted truths.
Consistency | Availability | Partition tolerance
The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties. However, by explicitly handling partitions, designers can optimize consistency and availability, thereby achieving some trade-off of all three.
a read is guaranteed to return the most recent write for a given client.
a non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).
the system will continue to function when network partitions occur.
data writes wait until the partition has been resolved
data reads wait until the partition has been resolved
reader/writer deals with conflicts and timeouts
data reads always work, but no guarantees of accuracy.
data writes always work, but no guarantees that it will be kept.
without partition tolerance, you don't have strong consistency or high availability, you have a non-distributed system.
As you slide between non-functional requirements for features, you slide the ratio between CP and AP.
.. or how to deal with this as a user / client
Starting an instance
$instances = runInstances([...]);
foreach($instances as $instance) {
createTags($instance->ID, 'Name', "myserver");
}
try {
$instances = runInstances([...]);
} catch(ConnectionException $e) {
echo ':sadtrombone:';
exit(1);
}
try {
$instances = runInstances([...]);
} catch(ConnectionException $e) {
echo ':sadtrombone:';
exit(1);
} catch(TimeoutException $e) {
echo ':hippo:';
exit(1);
}
try {
$instances = runInstances([...]);
} catch(ConnectionException $e) {
echo ':sadtrombone:';
exit(1);
} catch(TimeoutException $e) {
echo ':hippo:';
exit(1);
} catch(ThrottleException $e) {
echo ':headdesk:';
exit(1);
} catch(InstanceLimitException $e) {
echo ':picard:';
exit(1);
} catch(Exception $e) {
echo ':ive_stopped_caring_at_this_point:';
exit(1);
}
define('MAX_RETRIES', 5);
$instances = runInstances([...]);
$retries = 0;
do {
// exponential backoff with jitter, in milliseconds
$sleepTime = pow(2, $retries) * 100 + rand(0, 100);
usleep($sleepTime * 1000); // usleep() takes microseconds
$status = getStatus($instances);
if($status == SUCCESS) $retry = false;
elseif($status == NOT_READY) $retry = true;
elseif($status == THROTTLED) $retry = true;
else $retry = false; // some random error, stop trying
$retries++;
} while($retry && $retries < MAX_RETRIES);
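The retry loop above can be pulled out into a reusable helper. This is only a sketch: `retryWithBackoff` and its parameters are hypothetical names, and unlike the loop on the slide it tries the operation first and sleeps only between attempts. The sleep function is injectable so the delays can be checked without actually waiting.

```php
<?php
// Hypothetical helper: call $operation until it succeeds or we run out of tries.
// $operation returns true when done and false when it should be retried.
// $sleep receives the delay in milliseconds; the default really sleeps.
function retryWithBackoff(callable $operation, int $maxRetries = 5, ?callable $sleep = null): bool
{
    $sleep = $sleep ?? function (int $milliseconds): void {
        usleep($milliseconds * 1000); // usleep() takes microseconds
    };
    for ($attempt = 0; $attempt < $maxRetries; $attempt++) {
        if ($operation()) {
            return true;
        }
        if ($attempt < $maxRetries - 1) {
            // 100 ms, 200 ms, 400 ms, ... plus up to 100 ms of random jitter
            $sleep((2 ** $attempt) * 100 + random_int(0, 100));
        }
    }
    return false; // gave up after $maxRetries attempts
}
```

The jitter is there so that many clients that were throttled at the same moment do not all come back at the same moment.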
try {
$instance = runInstance([...]);
createTags($instance);
stopInstance($instance);
$ami = createImage($instance);
createTags($ami);
copyAMI($ami, 'client-account');
deleteAMI($ami);
} catch(Exception $e) {
// we can:
// 1) retry the last step a couple of times
// 2) but eventually we need to undo all the things
}
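One way to "undo all the things" is to pair every step with its own undo and unwind in reverse when something fails. A minimal sketch, assuming each step can be expressed as a `[do, undo]` pair of callables; `runWithRollback` is a made-up name, not an API from the talk:

```php
<?php
// Run a list of steps in order. Each step is a pair [do, undo].
// If any step throws, the undo callbacks of the steps that already
// succeeded run in reverse order, then the exception is rethrown.
function runWithRollback(array $steps): void
{
    $undoStack = [];
    try {
        foreach ($steps as [$do, $undo]) {
            $do();
            $undoStack[] = $undo; // only completed steps get undone
        }
    } catch (Exception $e) {
        foreach (array_reverse($undoStack) as $undo) {
            try {
                $undo();
            } catch (Exception $undoError) {
                // An undo can fail too; log it and keep unwinding.
                error_log('undo failed: ' . $undoError->getMessage());
            }
        }
        throw $e;
    }
}
```

For the sequence on the slide, the undo for starting an instance would be whatever call terminates it, and the undo for creating an image would be the call that deletes it. An undo is itself a network call, so it can fail for the same reasons as the step it reverses.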
or
my archenemy
separate out the database
need separate session storage so the visitors can bounce
uploadable artifacts and clusterwide cache
the LBs and services need to be redundant
now we can scale the webservers without worrying
Each arrow is a network connection that needs to handle:
computing is easy to scale
avoid state if possible
be defensive and paranoid
.. or toss it over the wall to the sysadmins