| Prog Rev | Nsec20

QR code challenge part 2

We extracted the graph representations and transcribed the algorithm in python to progress on a graph. Basically, each graph is a weirdly encoded deterministic finite automaton (DFA); there are 278 automata with approximately 700 states each.

While playing around on some automaton, we see that a valid input length is always 64 chars. Since there is only 2 valid input characters, there is 2^64 inputs to check. No way to brute force that.

We need to find a string that match each one of these 278 automata. We assume there is only a single string that matches each of them, it means that for a given incorrect input, at least one automaton will reject it.

Instead of trying each 2^64 qr strings we can just progress on each automaton in parallel and backtrack the search when a prefix is refused by any automata, no need to progress further.

If we are lucky, we will have something like

q -> ok
qq -> ok
qqq -> refused by automaton 34, so backtrack
qqr -> refused by automaton 124, so backtrack
qr -> ok
qrq -> ok
etc.

class DNode:
    # ...
    def get_next_node_for(self, elem):
        nexts = []
        for node in self.nodes:
            nextnode = node.get_next_node_for(elem)
            if nextnode is None:
                # tous avancent ou personne !
                return None
            nexts.append(nextnode)
        n = self.dg.get_dnode(nexts)
        return n

For a given prefix string, each automaton associates it to a single state (single because the D in DFA); so with 278 automata we have 278 states. It means that a prefix corresponds to a tuple of 278 states.

Eg. qrqqr is

state 45 in automaton 0
state 12 in automaton 1
state 205 in automaton 3
…
state 187 in automaton 277

So qrqqr is the tuple (45, 12, 105, …, 187)

To avoid loops and optimize equivalent prefixes, we memorize the tuples of visited states. If we meet a tuple twice: it means that we are in a loop or that independent prefixes share common suffixes, no need to visit it again.

If we are lucky, we will have something like

…
qrq -> new tuple of states (145, 64, …, 135), we progress
qrqq -> …
…
qrqrq -> again tuple of states (145, 64, …, 135), we are in a loop, we backtrack
qrqrr -> …
…
rrq -> tuple of states (145, 64, …, 135) already known, no need to visit again, we backtrack
rrr -> …
…

So, if we are lucky, we will backtrack early and the search space will be way lower that 2^64. If we are unlucky, the number of tuples is around 700^278 (ie the combinations of each state of each automaton), so we will keep the bound of 2^64 with an additional memory cost due to tuple memorization.

class DGraph:
    # ...
    def solve(self):
        s = self.start
        s.len = 0
        todo = collections.deque([s])
        seen = {s: ""}
        i = 0
        while len(todo) > 0:
            i = i + 1

            n = todo.popleft()
            s = seen[n]
            if (i % 10001 == 0):
                print(f"[{i}+{len(todo)}] {n.name}: {s}")
            if n.exit_node:
                print("SOLVED: " + s)
                return s
            nq = n.get_next_node_for(227)
            if nq is not None and nq not in seen:
                seen[nq] = s + 'q'
                nq.len = n.len + 1
                todo.append(nq)
            nr = n.get_next_node_for(229)
            if nr is not None and nr not in seen:
                seen[nr] = s + 'r'
                nr.len = n.len + 1
                todo.append(nr)

But, we weren’t lucky. The automaton visit, with memorization and in Python, did not converge fast enough and did consume too much memory.

We needed to beef up our approach, data structures and efficiency; we didn’t even look at a single automaton to see if there is dead ends, useless states, loops or whatever. What we did was based on assumption that a basic parallel walk approach will possibility success and that the combination of 278 automata of 700 states will behave nicely.

Years ago, I developed nitcc that included a quick and dirty regular expression and finite automata processing for a compilation course. Let’s reuse it.

We used the python script to convert the weirdly encoded graph data structure and generate a proper by-the-book DFA and then import them in Nit to process with nitcc.

Here the first automaton:

And there are 277 other automata like this one.

Lots of paths with a lot of branching that goes to 4 acceptation states followed by a bunch of dead-end states with some loops. So basically

an automaton rejects a string very late, so no early backtracking
a lot of branches, so the number of tuples explodes, and the memorization is less effective and costs a lot
dead-end states with some loops after the acceptation states
no loops before the acceptation states

Progressing in parallel on 278 of these automata is insanely hard.

DFA can be trimmed : just remove dead-end states that do not lead to an acceptation and even minimized to basically find the smallest equivalent DFA. Those transformations are text-book common and these algorithms are likely implemented in any DFA library, including some quick and dirty versions in nitcc.

Here the first automaton after minimization:

Only 105 states, straightforward transitions, and there is 277 other minimized automata like this one.

Could progressing in parallel on 278 minimized automata be done ? The number of tuples is now around 70^278 (way better than 700^278 but sill insane) but most of these combinations are impossible so let’s try in case of we are lucky or too tired to do basic estimation.

We weren’t lucky. Progressing is way faster than with python code (thanks to automaton minimization and compiled code) but we hit the memory limit soon enough.

And the CTF ended. We were tired and sad. Later Towel, the challenge designer, gave a hint to the participants « use z3 ». Dafuck ?

How could z3 help ? It’s an SMT solver, basically it can solve logic and arithmetic systems. But… can we convert our automata problem into an easy and small SAT problem?

If we look at the first minimized automaton, there is only 3 possible paths with only few constraints. We can easily convert it into a very short logic formula

char12 is q and char48 is q
or char12 is q and char48 is r and char52 is q
or char12 is r

And there is 277 other logic formulas like this one.

In z3, we represented each input character by boolean value, eg. q45 for the 45th character, where true means it’s ‘q’ and false means it’s ‘r’. The first automaton becomes:

(or (and q12 q48) (and q12 (not q48) q52) (and (not q12) ))

We generate and write a big ‘(and …)’ around all the 278 formulas

redef class Automaton
          fun to_z3: String
          do
                  var res = new Array[String]
                  to_z3_work(start, 1, "(and ", res)
                  if res.length > 1 then
                          return "(or " + res.join(" ") + ")"
                  else
                          return res.first
                  end
          end
  
          fun to_z3_work(n: State, len: Int, prefix: String, res: Array[String])
          do
                  var nq = n.trans_char('q')
                  var nr = n.trans_char('r')
                  if nq == nr then
                          if nq == null then
                                  res.add prefix + ")"
                                  return
                          end
                          to_z3_work(nq, len+1, prefix, res)
                          return
                  end
                  if nq != null then
                          to_z3_work(nq, len+1, prefix + " q{len}", res)
                  end
                  if nr != null then
                          to_z3_work(nr, len+1, prefix + " (not q{len}) ", res)
                  end
          end
end

# ...

var t = new Template
for c in [1..64] do t.add("(declare-const q{c} Bool)\n")
t.add("(assert (and\n{z3.join("\n")}\n))\n(check-sat)\n(get-model)\n")
t.write_to_file("system.z3")

Then feed it to z3 and grab a coffee.

$ time z3 system.z3
sat
(model
(define-fun q48 () Bool
true)
(define-fun q52 () Bool
[...]
(define-fun q43 () Bool
false)
)

real 0m0,049s
user 0m0,015s
sys 0m0,008s

That was fast ! No coffee =(.

Now, get the code in ascii. Too bored to parse the z3 output is a sane way, sed and others will do the job…

$ z3 system.z3 | sed 's/q\(.\) /q0\1 /;/Bool/N;s/\n//' | sort | sed 's/.*true.*/q/p;s/.*false.*/r/p;d' | tr -d '\n'
rqrqrrqrqrqqqrrrrqrqqqrqrrqqrqqrqrrrrqqrrrrrqqrqqrqrqqqrrrqrrqrq

NSEC20 QR-Code. part 2/2 - Fri, May 22, 2020 - Jean Privat

QR code challenge part 2

Back to Home