NSEC20 QR-Code. part 2/2 - Fri, May 22, 2020 - Jean Privat
QR code challenge part 2
We extracted the graph representations and transcribed the algorithm in python to progress on a graph. Basically, each graph is a weirdly encoded deterministic finite automaton (DFA); there are 278 automata with approximately 700 states each.
While playing around on some automaton, we see that a valid input length is always 64 chars. Since there is only 2 valid input characters, there is 2^64 inputs to check. No way to brute force that.
We need to find a string that match each one of these 278 automata. We assume there is only a single string that matches each of them, it means that for a given incorrect input, at least one automaton will reject it.
Instead of trying each 2^64 qr strings we can just progress on each automaton in parallel and backtrack the search when a prefix is refused by any automata, no need to progress further.
If we are lucky, we will have something like
- q -> ok
- qq -> ok
- qqq -> refused by automaton 34, so backtrack
- qqr -> refused by automaton 124, so backtrack
- qr -> ok
- qrq -> ok
- etc.
class DNode:
# ...
def get_next_node_for(self, elem):
nexts = []
for node in self.nodes:
nextnode = node.get_next_node_for(elem)
if nextnode is None:
# tous avancent ou personne !
return None
nexts.append(nextnode)
n = self.dg.get_dnode(nexts)
return n
For a given prefix string, each automaton associates it to a single state (single because the D in DFA); so with 278 automata we have 278 states. It means that a prefix corresponds to a tuple of 278 states.
Eg. qrqqr
is
- state 45 in automaton 0
- state 12 in automaton 1
- state 205 in automaton 3
- …
- state 187 in automaton 277
So qrqqr
is the tuple (45, 12, 105, …, 187)
To avoid loops and optimize equivalent prefixes, we memorize the tuples of visited states. If we meet a tuple twice: it means that we are in a loop or that independent prefixes share common suffixes, no need to visit it again.
If we are lucky, we will have something like
- …
- qrq -> new tuple of states (145, 64, …, 135), we progress
- qrqq -> …
- …
- qrqrq -> again tuple of states (145, 64, …, 135), we are in a loop, we backtrack
- qrqrr -> …
- …
- rrq -> tuple of states (145, 64, …, 135) already known, no need to visit again, we backtrack
- rrr -> …
- …
So, if we are lucky, we will backtrack early and the search space will be way lower that 2^64. If we are unlucky, the number of tuples is around 700^278 (ie the combinations of each state of each automaton), so we will keep the bound of 2^64 with an additional memory cost due to tuple memorization.
class DGraph:
# ...
def solve(self):
s = self.start
s.len = 0
todo = collections.deque([s])
seen = {s: ""}
i = 0
while len(todo) > 0:
i = i + 1
n = todo.popleft()
s = seen[n]
if (i % 10001 == 0):
print(f"[{i}+{len(todo)}] {n.name}: {s}")
if n.exit_node:
print("SOLVED: " + s)
return s
nq = n.get_next_node_for(227)
if nq is not None and nq not in seen:
seen[nq] = s + 'q'
nq.len = n.len + 1
todo.append(nq)
nr = n.get_next_node_for(229)
if nr is not None and nr not in seen:
seen[nr] = s + 'r'
nr.len = n.len + 1
todo.append(nr)
But, we weren’t lucky. The automaton visit, with memorization and in Python, did not converge fast enough and did consume too much memory.
We needed to beef up our approach, data structures and efficiency; we didn’t even look at a single automaton to see if there is dead ends, useless states, loops or whatever. What we did was based on assumption that a basic parallel walk approach will possibility success and that the combination of 278 automata of 700 states will behave nicely.
Years ago, I developed nitcc that included a quick and dirty regular expression and finite automata processing for a compilation course. Let’s reuse it.
We used the python script to convert the weirdly encoded graph data structure and generate a proper by-the-book DFA and then import them in Nit to process with nitcc.
Here the first automaton:
And there are 277 other automata like this one.
Lots of paths with a lot of branching that goes to 4 acceptation states followed by a bunch of dead-end states with some loops. So basically
- an automaton rejects a string very late, so no early backtracking
- a lot of branches, so the number of tuples explodes, and the memorization is less effective and costs a lot
- dead-end states with some loops after the acceptation states
- no loops before the acceptation states
Progressing in parallel on 278 of these automata is insanely hard.
DFA can be trimmed : just remove dead-end states that do not lead to an acceptation and even minimized to basically find the smallest equivalent DFA. Those transformations are text-book common and these algorithms are likely implemented in any DFA library, including some quick and dirty versions in nitcc.
Here the first automaton after minimization:
Only 105 states, straightforward transitions, and there is 277 other minimized automata like this one.
Could progressing in parallel on 278 minimized automata be done ? The number of tuples is now around 70^278 (way better than 700^278 but sill insane) but most of these combinations are impossible so let’s try in case of we are lucky or too tired to do basic estimation.
We weren’t lucky. Progressing is way faster than with python code (thanks to automaton minimization and compiled code) but we hit the memory limit soon enough.
And the CTF ended. We were tired and sad. Later Towel, the challenge designer, gave a hint to the participants « use z3 ». Dafuck ?
How could z3 help ? It’s an SMT solver, basically it can solve logic and arithmetic systems. But… can we convert our automata problem into an easy and small SAT problem?
If we look at the first minimized automaton, there is only 3 possible paths with only few constraints. We can easily convert it into a very short logic formula
char12 is q and char48 is q
or char12 is q and char48 is r and char52 is q
or char12 is r
And there is 277 other logic formulas like this one.
In z3, we represented each input character by boolean value, eg. q45 for the 45th character, where true means it’s ‘q’ and false means it’s ‘r’. The first automaton becomes:
(or (and q12 q48) (and q12 (not q48) q52) (and (not q12) ))
We generate and write a big ‘(and …)’ around all the 278 formulas
redef class Automaton
fun to_z3: String
do
var res = new Array[String]
to_z3_work(start, 1, "(and ", res)
if res.length > 1 then
return "(or " + res.join(" ") + ")"
else
return res.first
end
end
fun to_z3_work(n: State, len: Int, prefix: String, res: Array[String])
do
var nq = n.trans_char('q')
var nr = n.trans_char('r')
if nq == nr then
if nq == null then
res.add prefix + ")"
return
end
to_z3_work(nq, len+1, prefix, res)
return
end
if nq != null then
to_z3_work(nq, len+1, prefix + " q{len}", res)
end
if nr != null then
to_z3_work(nr, len+1, prefix + " (not q{len}) ", res)
end
end
end
# ...
var t = new Template
for c in [1..64] do t.add("(declare-const q{c} Bool)\n")
t.add("(assert (and\n{z3.join("\n")}\n))\n(check-sat)\n(get-model)\n")
t.write_to_file("system.z3")
Then feed it to z3 and grab a coffee.
$ time z3 system.z3
sat
(model
(define-fun q48 () Bool
true)
(define-fun q52 () Bool
[...]
(define-fun q43 () Bool
false)
)
real 0m0,049s
user 0m0,015s
sys 0m0,008s
That was fast ! No coffee =(.
Now, get the code in ascii. Too bored to parse the z3 output is a sane way, sed and others will do the job…
$ z3 system.z3 | sed 's/q\(.\) /q0\1 /;/Bool/N;s/\n//' | sort | sed 's/.*true.*/q/p;s/.*false.*/r/p;d' | tr -d '\n'
rqrqrrqrqrqqqrrrrqrqqqrqrrqqrqqrqrrrrqqrrrrrqqrqqrqrqqqrrrqrrqrq