Please do not hesitate to correct me if I am wrong anywhere. Thanks in advance. 🙂
This new paper will appear in the upcoming POPL'17. It's related to something I have been thinking about for a while, and as you'll see, the idea in the paper is very inspiring; I quite like it.
I usually use the shorter term "Memory Models" for them. They essentially define which values a READ may return in different scenarios, mostly in a concurrent setting.
Litmus tests are essentially tiny concurrent program snippets with a final assertion specifying the behavior in question.
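The classic store-buffering (SB) test, for example, asks whether both threads can end up reading 0. A minimal Python sketch (my own illustration, not the paper's tooling) that enumerates all sequentially consistent interleavings shows that SC forbids that outcome:

```python
# Store-buffering (SB) litmus test, sketched as two instruction lists.
# Each instruction is (kind, location, value-or-register).
T0 = [("write", "x", 1), ("read", "y", "r0")]
T1 = [("write", "y", 1), ("read", "x", "r1")]

def interleavings(a, b):
    # All merges of the two threads that preserve each thread's order.
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run_sc(schedule):
    # Execute one interleaving against a single shared memory (= SC).
    mem, regs = {"x": 0, "y": 0}, {}
    for kind, loc, arg in schedule:
        if kind == "write":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r0"], regs["r1"]

reachable = {run_sc(s) for s in interleavings(T0, T1)}
print(sorted(reachable))  # (0, 0) never appears under SC
```

The SB litmus test with final condition r0=0 ∧ r1=0 therefore "must fail" on SC, while weaker models like TSO allow it.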
Generally it is enough for the final condition to refer to a specific state rather than an abstract assertion. More expressive litmus tests may exist, but this paper only considers this concrete form, for the purpose of its Execution→LitmusTest synthesis.
First of all, the authors re-formulate 4 kinds of problems as one general problem:
- About generating litmus tests for one model.
- About comparing different models.
- About verified optimization or fence insertion: can strengthening a program ever introduce additional behaviors?
- Such strengthening (refinement) is supposed to only reduce behaviors, just like "optimization", isn't it? No unprecedented behavior is expected.
- About verified compilation on RMM.
- Again, the concern is introducing new behaviors during compilation: behaviors that are disallowed in the source model but permitted in the target model.
They formulate the new general problem as g(M, N, ▲): find litmus tests (P, σ) and (Q, σ), where P is the program and σ is the final state whose observability is checked, satisfying the following properties:
- M ⊭ (P, σ): the litmus test (P, σ) does not conform to M.
- N ⊨ (Q, σ): the litmus test (Q, σ) conforms to N.
- P ▲ Q: the two litmus test programs are related by some relation ▲.
// In the paper, ▲ is actually a black triangle pointing to the right, but I didn't know how to type that, so I just picked a similar symbol.
Then the 4 kinds of problems are instantiations of this general problem.
About generating litmus tests for one model
g(M, ⊤, id), where ⊤ is the model that allows all executions and id is the identity relation.
- i.e., Find those litmus tests that do not conform to M.
About comparing different models.
g(M, N, id).
- i.e., Find those litmus tests that don’t conform to M but conform to N.
About verified optimization or fence insertion – can strengthening a program ever introduce additional behaviors?
g(M, M, 'is weaker than').
- i.e., Find some programs that cannot reach some final state in M, but can do so after some strengthening.
About verified compilation on RMM.
g(M, N, 'compiles to').
- i.e., Find some programs that cannot reach some final state in M, but whose compiled programs can do so on the real architecture N.
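To see how the instantiations plug into g, here is a toy rendering over a tiny finite universe (the universe, the model sets, and all names are my own, not the paper's):

```python
# A "model" here is just the set of litmus tests (program, final state)
# it allows; g searches for witness pairs.
def g(M, N, rel, programs, states):
    # Yield ((P, s), (Q, s)) with M ⊭ (P, s), N ⊨ (Q, s), and P ▲ Q.
    for P in programs:
        for Q in programs:
            if not rel(P, Q):
                continue
            for s in states:
                if (P, s) not in M and (Q, s) in N:
                    yield (P, s), (Q, s)

identity = lambda p, q: p == q
programs, states = ["SB", "SB+fences"], ["r0=0,r1=0"]

top = {(p, s) for p in programs for s in states}  # ⊤ allows everything
SC  = set()                           # toy SC: forbids the weak SB outcome
TSO = {("SB", "r0=0,r1=0")}           # toy TSO: allows it, unless fenced

# Generating tests for one model: g(M, ⊤, id)
print(list(g(SC, top, identity, programs, states)))
# Comparing two models: g(M, N, id)
print(list(g(SC, TSO, identity, programs, states)))
```

The second query finds exactly the SB test: forbidden by (toy) SC, allowed by (toy) TSO.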
In their problem definition, to actually reason about the two given models, M, N, and the relation ▲ must themselves be defined (e.g. in Fig. 11) on top of their Event / Execution hierarchy, in a way that can be easily encoded and delegated to SAT solvers. We will see.
The overall workflow, in my comprehension, can be summarized as follows:
- In order to reason about models M / N, try to find some witness litmus tests
- But directly reasoning on the litmus test program level is intractable, so reduce the problem to execution level.
- First find some concrete satisfying executions X and Y.
- Then recover the litmus tests from concrete executions X and Y.
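The steps above can be sketched as a brute-force search loop (a sketch under my own toy assumptions, where models are predicates over executions and the executions are opaque labels):

```python
def find_witness(M, N, rel, executions):
    # Execution-level search for X, Y with M ⊭ X, N ⊨ Y, and X △ Y.
    for X in executions:
        if M(X):                 # we want M to reject X
            continue
        for Y in executions:
            if N(Y) and rel(X, Y):
                return X, Y      # candidates to lift back to litmus tests
    return None

# Toy universe: "weak" stands for an execution SC rejects but TSO allows.
executions = ["sc-exec", "weak"]
SC  = lambda x: x != "weak"
TSO = lambda x: True
print(find_witness(SC, TSO, lambda x, y: x == y, executions))
# → ('weak', 'weak')
```

In the real tool this inner search is delegated to a SAT solver rather than enumerated.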
I’ll explain these steps in more detail.
g(M, N, ▲) requires finding some (P, σ) that must fail on model M; this is essentially asking for ∀ executions of P reaching σ, the execution fails (is inconsistent with M). This universal quantifier ∀ makes everything intractable.
Their insight is to transform the litmus-test-level search problem into an execution-level search problem, which they demonstrate to be tractable, and to reconstruct the litmus tests from the executions afterwards.
They want to search for two executions X and Y such that:
- execution X (corresponding to P) satisfies M ⊭ X,
- execution Y (corresponding to Q) satisfies N ⊨ Y,
- X △ Y.
// Again, in the paper the relation △ is a white triangle pointing to the right, which I don't know how to type either.
Note that there is no σ at this point: the final state of the found execution X will serve as σ, and when reconstructing the litmus tests, both X and Y must allow the same σ as the final condition.
But these alone are obviously not enough, because this finds just one execution X that fails on M. For the litmus test program P corresponding to X, this only means P may fail. To ensure that P must fail, they need to impose an extra requirement: X ∈ Dead_M. The definition of X ∈ Dead basically says: any other execution of the same litmus test program P that is consistent with M will not reach the final state. In other words, all executions of P that do reach this final state are inconsistent with M; hence "P must fail on M".
Above is the semantic definition; the authors present a syntactic approximation later in the paper for the real-world implementation. Being an approximation, it loses completeness, but according to them it still preserves soundness for executions of up to 8 events.
Honestly however, I am not 100% sure why those rules would work and suffice. I’ll omit that part in this note..
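The semantic requirement itself can at least be sketched by brute force (my own toy, not the paper's syntactic approximation):

```python
# P must fail on M iff every execution of P that reaches σ is
# inconsistent with M (equivalently: no M-consistent execution reaches σ).
def must_fail(executions, sigma):
    # executions: (final_state, consistent_with_M) pairs
    return all(not consistent
               for state, consistent in executions
               if state == sigma)

# Toy executions of SB under an SC-like judgment: the (0,0) outcome
# only arises from an inconsistent execution.
execs = [("0,1", True), ("1,0", True), ("1,1", True), ("0,0", False)]
print(must_fail(execs, "0,0"))  # → True : (P, "0,0") must fail on M
print(must_fail(execs, "1,1"))  # → False: a consistent execution reaches σ
```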
They utilize SAT solvers for the execution search. In my opinion, their encoding framework is also worth mentioning.
Executions, which I sometimes call execution traces or execution graphs (the latter may be more accurate), are defined on top of Events, as usual.
However, their events are objects carrying nothing but an ID tag. Events are usually defined to contain at least the following information altogether:
- which Thread
- action type: Read/Write/Fence/Lock/Unlock/etc.
But in this paper, all those attached properties are instead represented as set membership. For example, E ∈ ReadSet means E is a Read event.
There are several advantages, according to the authors. The most important one, in my opinion, is that the goal is to enable comparing different models: different models could have different underlying event structures, and hardcoding them in some struct makes comparison hard, whereas encoding them as set membership makes it easier, and easier for SAT solvers in particular.
- Events are defined upon set properties.
- Executions are defined based on Events.
- Memory model axioms are defined upon Executions & Events.
Essentially, everything is now friendly with set manipulation, therefore friendly with SAT solvers.
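Here is how I would render the set-based encoding (a Python sketch with made-up names; the paper hands these sets to a SAT solver, not to Python). An axiom like acyclic(po ∪ rf) then becomes a plain check over sets of edges:

```python
from collections import defaultdict

# Events are bare IDs; every attribute is membership in a set.
EV       = {1, 2, 3, 4}
ReadSet  = {2, 4}              # E ∈ ReadSet  means E is a read
WriteSet = {1, 3}              # E ∈ WriteSet means E is a write
po = {(1, 2), (3, 4)}          # program order
rf = {(3, 2), (1, 4)}          # reads-from: write → read

def acyclic(edges):
    # Kahn-style check: repeatedly remove nodes whose predecessors are
    # all removed already; the graph is acyclic iff everything goes.
    nodes = {n for e in edges for n in e}
    preds = defaultdict(set)
    for a, b in edges:
        preds[b].add(a)
    removed, changed = set(), True
    while changed:
        changed = False
        for n in nodes - removed:
            if preds[n] <= removed:
                removed.add(n)
                changed = True
    return removed == nodes

print(acyclic(po | rf))                      # → True
print(acyclic(po | rf | {(2, 3), (4, 1)}))   # → False: a cycle appears
```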
They first define a tiny DSL for litmus test programs. For the sake of simplicity, they also impose some well-formedness properties, e.g., different WRITEs store different values.
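A hypothetical rendering of such a DSL plus the distinct-write-values well-formedness check (the names are mine, not the paper's):

```python
from dataclasses import dataclass

# A program is a list of threads; each thread is a list of instructions.
@dataclass(frozen=True)
class Write:
    loc: str
    val: int

@dataclass(frozen=True)
class Read:
    loc: str
    reg: str

def well_formed(program):
    # Every write stores a distinct value, so that the value a read
    # observes unambiguously identifies which write it read from.
    vals = [i.val for t in program for i in t if isinstance(i, Write)]
    return len(vals) == len(set(vals))

sb = [[Write("x", 1), Read("y", "r0")],
      [Write("y", 2), Read("x", "r1")]]
print(well_formed(sb))  # → True
```

This distinct-values trick is exactly what makes recovering reads-from edges from a final state feasible.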
Then the next (final) step is to recover litmus tests from the found executions. There is one inference rule in the paper (§3.2): given (X, Y) ∈ g'(M, N, △), the goal is to find some P / Q / σ such that:
- (P, σ) ∈ litmin(X)
- (Q, σ) ∈ lit(Y)
- P ▲ Q
Their approach is to define a more abstract semantic predicate lit′(X, P, σ, disabled, failures), which connects the execution and the litmus test semantically.
- X is the input and (P, σ) is the output.
- "disabled" refers to those instructions in P's branching statements that were not executed because the condition test went the other way.
- "failures" refers to those CAS instructions in P whose compare-and-swap failed when executed.
lit′() holds whenever there is a bijection μ between P and X; essentially, the instructions in P and the events in X are all matched, one to one.
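A toy version (mine, vastly simplified) of checking that such a one-to-one matching exists, with instructions and events both reduced to (kind, location) pairs:

```python
from itertools import permutations

def exists_bijection(instrs, events, compatible):
    # A bijection exists iff the two sides have equal size and some
    # ordering of the events matches the instructions pointwise.
    if len(instrs) != len(events):
        return False
    return any(all(compatible(i, e) for i, e in zip(instrs, perm))
               for perm in permutations(events))

instrs = [("W", "x"), ("R", "y")]
events = [("R", "y"), ("W", "x")]
print(exists_bijection(instrs, events, lambda i, e: i == e))  # → True
```

The real predicate carries much more per-pair structure (values, threads, dependencies), but the shape is the same.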
Both lit() and litmin() are defined on top of lit′():
- lit() allows any instantiation of disabled and failures.
- litmin() instantiates both disabled and failures to ∅.
litmin() is used here to generate P, while lit() is used for Q. I am not 100% certain why the minimal instantiation is used for P, but I guess it's due to the P ▲ Q requirement later on: weaker constraints may leave more room when searching for a satisfying Q.
As mentioned before, there is one last step: ensuring that X is a dead execution of the corresponding P. That part is not entirely clear to me yet, so I'd suggest reading the original paper yourselves. The authors were able to check that syntactic approximation ⊆ semantic deadness for all executions with no more than 8 events.
Their feasibility demonstrations are very impressive and convincing: they manage to replay many existing problem instances and discover new ones, and their tool can generate simpler litmus tests. I'll omit all those details here for simplicity.