Xuankang Lin

在准备Qual2的时候看到了几个mutual exclusion algorithm, 那篇paper里举这么几个例子是为了说明他们的new program logic的可用性。但是一直拖到了现在505 TA的课上也说到了类似的东西才终于来写一篇总结..

其实也并没有特别深入的去了解对比每一个算法，也不全，不过有写总比没写好，对吧..

Spinlock

「自旋锁」这个名字其实很不直白，本质其实就是一个thread busy wait, 直到这个lock真正空出来。Wiki上说，spinlock的优势在于，当context switch的overhead比较大，而预期等待时间比较短的时候，用spinlock直接死等会更好。

一种实现如下：

lock(l):
	r := 0
	while (r == 0) {
		r := CAS(l, 0, m)  // m is thread ID, l == 0 means available
	}

如果还不太了解CAS, 出门右转wiki. 它表示一个atomic operation内，只有当l == 0的时候才将其设置成m.

Peterson’s Algorithm

直接上代码吧：

bool flag[0] = false;
bool flag[1] = false;
int turn;

P0:
	flag[0] = true;
	turn = 1; // P0_gate
	while (flag[1] and turn == 1) {
		// busy wait
	}
	// in critical section
	flag[0] = false;

P1:
	flag[1] = true;
	turn = 0; // P1_gate
	while (flag[0] and turn == 0) {
		// busy wait
	}
	// in critical section
	flag[1] = false;

显然，这个有个要求就是：write propagates immediately and atomically. 所以relaxed memory model下、或者compiler乱序优化一下就跪了。

这里为啥能够达到mutual exclusion呢？当Pi在critical section中的时候，显然flag[i] == true, 此时，P_{1-i}有这么几种情况：（只有这几个地方修改了相关变量）

flag[1-i] = true;以及turn = i都还没有执行。所以一会儿它执行到while loop的时候，将会再次更新turn为i, 因此满足了while (flag[i] and turn == i)的条件，进入死等状态。
执行了flag[1-i] = true;, 但是还没有执行turn = i. 这个和第一种情况没差别，这个process一会执行的时候也是会更新turn然后堵在while loop那里busy waiting.
已经都执行了flag[1-i] = true; turn = i的设置，但是既然Pi已经进入了critical section, 就说明Pi这里的flag[1-i] && turn == 1-i的while loop条件没有满足，因此turn肯定是设置成了i, 于是P_{1-i}那里就会busy waiting. 不会进入critical section.

综上所有条件都说明会busy waiting, 不会同时进入critical section.

那么会不会卡在路口那里俩人都不得动弹呢？不会的。因为现在有了”write propagates immediately and atomically”的假定，所以即使两个process都执行完所有的设置语句来到while loop面前了，它们俩的turn变量赋值一定会决出一个胜负。所以一定有一个人进入while loop busy wait, 另一个人直接进入critical section. 反正遍历所有可能的执行顺序组合，都会出现一个process在critical section里头，另外的一个在外头busy waiting的情形啦。

此外更进一步的guarantee是：一个process在离开critical section之后，假设有其它process正在等待，那么就不会立马重新进入critical section. 所以每个process都能有bounded-waiting guarantee.

以上只是对于2 processes的情况，关于如何适应>2 proc的情况，我就没有看了。

话说，我感觉Dekker’s Algorithm和Peterson’s Algorithm超像的，不过应该是因为更晚才发布，Peterson的这个确实更加清晰明了，网上也确实有人好奇问过这个问题的。

Bakery Algorithm

这是Lamport大神提出的，for multiple threads.

Initalization:
	flag: array [1..N] of bool = { false };
	label: array [1..N] of int = { 0 }; // assume no bound

For Proc i that wishes to enter critical section:
	flag[i] = true; // enter the doorway
	label[i] = 1 + max(label[i], ..., label[N]); get ticket
	flag[i] = false; // exit the doorway
	for j = 1 to N such that j≠i {
		while (flag[j]); // wait until j is not in doorway
		while (label[j]≠0 and (label[j], j) << (label[i], i); // wait until j is not "ahead"
	}
	// critical section
	label[i] = 0; // exit section

Why this works? 直接用Lamport大神最开始的比喻就很好理解了：

在一个bakery store, 进来的顾客要先拿号，上头的”doorway”就是拿号的过程。注意，这里是允许几个threads拿到了同样的号的，那样的话就break the ties using thread ID, 就是同时比较label[i], i的过程。号存在label[i]里。
然后拿好号之后就开始排队，准确的说就是等我拿到的号最小了，那么就轮到我了。这里需要先等待没人正在doorway取号中，为什么呢？
- 当这个等待过后，若是有人再去取号，获得的号肯定是比我的大的，所以不用操心。
- 而如果不执行这个wait就直接进入比号的阶段的话，我看到网上说，好像是因为max()此时并不假定是atomic的。那我就能理解了。

4-slot Algorithm

Simpson’s 4-slot Algorithm, 这个在那篇paper里举例说到了，在505的课程上没有打算提到。这个算法是设计用来确保 a single write & a single reader 之间的mutual exclusion的。

其代码其实很简单，但是就是很绕不好看懂！Richard Bornat有专门一个slides来介绍这个，里头是这么写的：

Programming is really hard: only nine lines, no CAS, and you still can’t understand it

哈哈哈哈哈，我竟无法反驳..

global variables:
	reading : bool, initialized to 0 // the pair about to be, being, or last read
	latest : bool, initialized to 0 // the pair last written
	slot : [0..1], both initialized to 0
	data : [0..1][0..1], all initialized to null

Writer (item : datatype):
	pair := ¬reading
	index = ¬slot[pair]
	data[pair,index] = item;
	slot[pair] := index;
	latest := pair;

Reader:
	pair := latest
	reading := pair
	index := slot[pair]
	return data[pair, index]

代码理解如下：

上边里头的pair以及index都是local variable.
关于slot[0]以及slot[1], 原文解释如下：Each element of this array indicates the index of the slot which contains the latest data within the corresponding pair.
对于writer来说，挑¬reading作为新pair很显然的，然后第一次slot[pair]读到的值是上一次写的位置，所以这一次就要换一个位置了。slot[reading]和slot[¬reading]是独立存档的，所以没事。
Why it works? 简要的说，它存了好几份copy在data[0..1][0..1]里，因此可以做到你写你的，我读我的。注意，这里reader是先更新了reading再真正去读数据，而writer是写好之后再更新latest. 一旦reader声明了我要读之后，writer如果还有动作，则要嘛是有新的东西要开始写——那肯定是写在另一个pair里了；要嘛是有新的东西已经写到一半只是还没有更新latest, 那么看情况吧，reader读到latest written (not published yet)还是second to latest都有可能。
显然，writer这里需要是latest更新的时候，之前的update都publish出去的呀。因此，再一次，这个算法的未修改版本（i.e. 未加fences)对于relaxed memory models或者compiler reordering都是不能继续保证正确性的。

经paper中提醒我才注意到，这个algorithm里是没有岔路的哦，一条路走到底！设计这么个算法是要满足3个属性的：

Asynchronism: writer和reader不会影响对方的执行时间，即我不block你
Coherence: 原句是：interleaved access to any data record by the writer and the reader is not permitted. 不过这句话本身也不直白。我直接理解为各种memory model里经常出现的那个coherence了，即所有的write是有一个total order的。各种read也不会先读到后来的write再读到更早的write.
Freshness: Writer写好的”latest complete data record”一定要被reader读到，一complete即available.

电视机前的小朋友们还有可能会问：

为啥2-slot不行呢：paper上说：

The two-slot mechanism fails because there are circumstances in which the writer requires a slot for new data but the reader is still busy with the slot containing the last-but-one data item.

而如果硬是要使用2-slot的话，就只能要嘛牺牲coherence, 要嘛牺牲freshness了。那么为啥4-slot就解决了这个问题了呢：4-slot直接写到另一个pair里去了呀，然后如果reader一直沉迷在旧数据中不可自拔，而writer一直有新数据来的话，那么reader从最老的一两个数据，跳过其间的很多个数据，直接来读最新的一两个数据也是完全没有问题的。
为啥3-slot不行呢：paper上(P21)讲了2 flaws, 但是并没有很理解.. 也无法复述在这里.. 似乎是关于interleaving of control variables时的预料外情况吧。4-slot中关于这个的话，是分了2个pair来分别分档进度的，应该能好一些。
为啥4-slot就足够了呢：呃.. 上边一长串已经大约解说了为啥it works, 至于为啥选4这个数字，仅仅是因为2x2的吧。

关于这个算法就先这样吧，知道大致的思路了，以后若是有具体用到的地方再来细细探索吧。