Angluin-style learning for weighted automata
We show that weighted automata over the reals can be learned efficiently in Angluin's supervised scenario, also called the exact learning framework.
This post uses all the notations of the previous post, and the result presented here was proved in this paper.
The scenario is Angluin-style learning, also called exact learning: a master knows a function f : \Sigma^{\star} \to \mathbb{R} and can answer two types of queries.
* membership query: when asked about a word w \in \Sigma^{\star}, the master replies with f(w),
* equivalence query: when asked whether f = \mathcal{A} for a given weighted automaton \mathcal{A}, the master replies YES or gives a counter-example, i.e. a word z such that \mathcal{A}(z) \neq f(z).
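To fix ideas, here is a toy sketch of the master in Python; the names Master, membership_query and equivalence_query are ours, and searching a finite stock of test words stands in for a genuine equivalence check.

```python
from typing import Callable, Optional

class Master:
    """A toy master for a target function f from words to reals."""
    def __init__(self, f: Callable[[str], float], test_words: list[str]):
        self.f = f
        self.test_words = test_words  # stand-in: a real master decides equivalence exactly

    def membership_query(self, w: str) -> float:
        return self.f(w)

    def equivalence_query(self, hypothesis: Callable[[str], float]) -> Optional[str]:
        # returns None for YES, or a counter-example z with hypothesis(z) != f(z)
        for z in self.test_words:
            if abs(hypothesis(z) - self.f(z)) > 1e-9:
                return z
        return None
```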
Theorem: (Beimel, Bergadano, Bshouty, Kushilevitz, Varricchio, 2000)
Weighted automata are efficiently learnable.
The algorithm maintains two sets X, Y of finite words of the same size, such that the matrix H_f(X,Y) has full rank, and either X is empty or \varepsilon \in X.
The first question is how to initialise the data, and then how to maintain it.
For initialisation, we set X = Y = \emptyset.
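Concretely, the maintained data is the matrix H_f(X,Y) of values f(xy) for x \in X and y \in Y, where each entry costs one membership query. The sketches below use a small hypothetical helper for such blocks:

```python
import numpy as np

def hankel_block(f, rows, cols):
    # H_f(rows, cols)[i, j] = f(rows[i] . cols[j]): one membership query per entry
    return np.array([[f(x + y) for y in cols] for x in rows])
```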
We now explain how to extend the data. This goes in two steps:
* constructing an automaton and submitting it to the master,
* expanding the matrix using the counter-example given by the master.
We construct a hypothesis automaton. The first step is to fill in H_f(Xa,Y) for each letter a and H_f(X, \left\{\varepsilon \right\}), using O(n^2) membership queries, where n = |X|.
The automaton \mathcal{A} is as follows.
* The set of states is X.
* The initial vector is e_1, the unit vector of the coordinate of \varepsilon in X.
* The final vector is H_f(X, \left\{\varepsilon \right\}).
* The transition is defined to satisfy
H_f(Xa,Y) = \Delta(a) \cdot H_f(X,Y).
Equivalently \Delta(a) = H_f(Xa,Y) \cdot H_f(X,Y)^{-1}.
Note that we do not claim that this automaton satisfies \mathcal{A}(xy) = f(xy) for x \in X, y \in Y.
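Here is a sketch of this construction, assuming the hankel_block helper above, that \varepsilon is the first element of X, and that H_f(X,Y) is invertible, which the invariant guarantees; build_hypothesis and run are our names.

```python
def build_hypothesis(f, X, Y, alphabet):
    n = len(X)
    H_inv = np.linalg.inv(hankel_block(f, X, Y))   # H_f(X, Y)^{-1}, full rank by invariant
    alpha = np.zeros(n); alpha[0] = 1.0            # initial vector e_1, the row of epsilon
    eta = hankel_block(f, X, [''])[:, 0]           # final vector H_f(X, {epsilon})
    # transitions: Delta(a) = H_f(Xa, Y) . H_f(X, Y)^{-1}
    Delta = {a: hankel_block(f, [x + a for x in X], Y) @ H_inv for a in alphabet}

    def run(w):
        # A(w) = alpha . Delta(w) . eta, with Delta(w) = Delta(w_1) ... Delta(w_k)
        v = eta
        for a in reversed(w):
            v = Delta[a] @ v
        return float(alpha @ v)

    return alpha, Delta, eta, run
```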
We submit \mathcal{A} as an equivalence query to the master, and assume that we get in return a word z such that \mathcal{A}(z) \neq f(z).
We say that a word w is correct if \Delta(w) \cdot \eta = H_f(Xw, \left\{\varepsilon \right\}).
We observe that \varepsilon is correct, since \eta = H_f(X, \{\varepsilon\}) by definition, and that z is not correct: if it were, instantiating at the coordinate of \varepsilon would give \mathcal{A}(z) = \alpha \cdot \Delta(z) \cdot \eta = f(z).
Let w be a suffix of z which is correct and such that aw, the next longer suffix of z, is not correct. Thanks to the two observations above, such a word exists and can be found by checking the suffixes of z in order of increasing length.
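A sketch of this search, in the same conventions as above; checking whether a suffix is correct costs |X| membership queries.

```python
def is_correct(f, X, Delta, eta, w):
    v = eta
    for a in reversed(w):
        v = Delta[a] @ v                            # Delta(w) . eta
    return np.allclose(v, np.array([f(x + w) for x in X]))  # vs H_f(Xw, {epsilon})

def find_breakpoint(f, X, Delta, eta, z):
    # epsilon is correct and z is not, so scanning the suffixes of z by
    # increasing length must cross from correct to incorrect at some point
    for k in range(1, len(z) + 1):
        w, aw = z[len(z) - k + 1:], z[len(z) - k:]
        if is_correct(f, X, Delta, eta, w) and not is_correct(f, X, Delta, eta, aw):
            return aw[0], w                         # the letter a and the word w
    raise AssertionError("z was supposed to be a counter-example")
```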
Since w is correct, by definition:
\Delta(w) \cdot \eta = H_f(Xw, \left\{\varepsilon \right\}).
And aw is not correct, so there exists x \in X such that
(\Delta(aw) \cdot \eta)(x) \neq f(xaw).
Lemma:
H_f(X \cup \{xa\}, Y \cup \{w\}) has full rank.
Since H_f(X,Y) has full rank, it is enough to show that the last column is linearly independent from the others. Assume towards contradiction that this is not the case: then there exists \lambda \in \mathbb{R}^Y such that H_f(X \cup \{xa\}, \{w\}) = H_f(X \cup \{xa\}, Y) \cdot \lambda.
This implies
\begin{array}{ccll}
\Delta(aw) \cdot \eta & = & \Delta(a) \cdot \Delta(w) \cdot \eta & \\
& = & \Delta(a) \cdot H_f(Xw, \{\varepsilon\}) & \text{since } w \text{ is correct} \\
& = & \Delta(a) \cdot H_f(X, \{w\}) & \text{both have entries } f(xw) \text{ for } x \in X \\
& = & \Delta(a) \cdot H_f(X, Y) \cdot \lambda & \text{restricting the assumption to the rows in } X \\
& = & H_f(Xa, Y) \cdot \lambda & \text{by definition of } \Delta(a)
\end{array}
Instantiating at x, we get (\Delta(aw) \cdot \eta)(x) = H_f(\{xa\}, Y) \cdot \lambda = f(xaw), where the last equality is the assumption restricted to the row xa. This contradicts the choice of x, i.e. the fact that aw is not correct.
Note that the lemma implies that xa \notin X and w \notin Y, so we do maintain that X and Y have the same size.
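In code, the extension step locates the offending row x and grows both sets; by the lemma, the enlarged block H_f(X,Y) keeps full rank, so the invariant is preserved. A sketch (extend is our name):

```python
def extend(f, X, Y, Delta, eta, a, w):
    v = eta
    for b in reversed(a + w):
        v = Delta[b] @ v                            # Delta(aw) . eta
    for i, x in enumerate(X):
        if not np.isclose(v[i], f(x + a + w)):      # the row where aw fails
            X.append(x + a)
            Y.append(w)
            return
    raise AssertionError("aw was supposed to be incorrect")
```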
How long does the algorithm run?
Let n = \textrm{rank}(H_f); we claim that the automaton constructed when X and Y have size n computes f. Indeed, if it did not, we could extend X and Y as the algorithm does, obtaining a submatrix of H_f of rank n+1, which is impossible.
It follows that the total number of queries is:
* at most n equivalence queries,
* O(n^3) membership queries.
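Putting everything together, here is a sketch of the whole learner under the assumptions of the previous snippets. The bootstrap when X is empty is one natural reading of the initialisation: the first counter-example z satisfies f(z) \neq 0, so taking X = \{\varepsilon\} and Y = \{z\} yields a non-zero 1 \times 1 matrix, which has full rank.

```python
def learn(master, alphabet):
    X, Y = [], []
    while True:
        if X:
            _, Delta, eta, run = build_hypothesis(master.membership_query, X, Y, alphabet)
        else:
            run = lambda w: 0.0                     # the empty hypothesis
        z = master.equivalence_query(run)
        if z is None:
            return run                              # the hypothesis computes f
        if not X:
            X, Y = [''], [z]                        # bootstrap: H_f(X, Y) = [f(z)] is non-zero
        else:
            a, w = find_breakpoint(master.membership_query, X, Delta, eta, z)
            extend(master.membership_query, X, Y, Delta, eta, a, w)
```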
Note that the algorithm we presented here is based on the construction of the backward automaton (see the previous post). A dual algorithm can be constructed using the forward automaton, which is actually what is done in the original paper.