Weighted automata and matrix factorisations
We discuss extensions of Fliess' theorem, which says that the size of the smallest weighted automaton recognising a function is exactly the rank of its Hankel matrix. The goal is to extend this theorem to subclasses of weighted automata such as probabilistic automata.
Many thanks to the Bellairs Barbados 2019 workshop on Learning and Verification for many discussions on these questions, and in particular Borja Balle, Alexander Clark, Gerco van Heerdt, Pierre Ohlmann, and Joshua Moerman.
Recall that for a formal series f : \Sigma^* \to \mathbb{R}, the Hankel matrix H_f is the infinite matrix indexed by words with H_f(u,v) = f(uv). The following theorem is the basis for weighted automata minimisation as well as learning.
Theorem: (Fliess ’74)
* Any weighted automaton recognising f has at least \textrm{rank}(H_f) many states,
* There exists a weighted automaton recognising f with \textrm{rank}(H_f) many states.
Different matrix factorisations
We will take a slightly different point of view than in the post about minimisation since here the central notion is matrix factorisation.
Let M be a matrix, say in \mathbb{R}^{n \times m}. A factorisation of M of dimension d is M = AW where A \in \mathbb{R}^{n \times d} and W \in \mathbb{R}^{d \times m}. One way to understand this is by looking at the rows of M: the factorisation M = AW says that each row of M is a linear combination of the rows of W. In other words, the rows of M lie in the vector space spanned by the rows of W.
We say that a factorisation is
* non-negative if the entries of A and W are non-negative,
* residual if the rows of W are rows of M.
This yields different notions of rank: for instance, the non-negative rank of M is the smallest d such that there exists a non-negative factorisation of M of dimension d. The usual notion of rank corresponds to general factorisations. Note that the residual rank is just the rank: indeed, a basis of the vector space spanned by the rows of M can always be chosen among the rows of M themselves, so W can be taken to consist of rows of M. This is not true anymore for non-negative factorisations, so we have three interesting notions of factorisations: general, non-negative, and residual non-negative.
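To make these definitions concrete, here is a minimal numpy sketch (the matrix is toy data chosen purely for illustration): it exhibits a factorisation M = AW of dimension 2 which is both non-negative and residual, and checks that the inner dimension bounds the rank.

```python
import numpy as np

# Toy 3 x 4 matrix M given by a factorisation M = A W of dimension d = 2.
# Both factors are non-negative, so the factorisation is non-negative;
# the rows of W are the first two rows of M, so it is also residual.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])            # shape (3, 2)
W = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])  # shape (2, 4)
M = A @ W                             # shape (3, 4)

# The inner dimension d = 2 bounds the rank: every row of M lies in the
# span of the two rows of W.
print(np.linalg.matrix_rank(M))       # prints 2
```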
There’s another interesting property of factorisation, called restricted, which requires that the vector space spanned by the rows of W is equal to the vector space spanned by the rows of M. It does not seem to play a role for weighted automata?
Unfortunately, computing the non-negative rank of a matrix, and a corresponding minimal factorisation, is NP-hard (by Vavasis, see here), which contrasts with the usual rank, for which a minimal factorisation can be computed in polynomial time.
The residual condition was introduced by Donoho and Stodden in 2003. Later, Arora, Ge, Kannan, and Moitra showed in this paper that there is a polynomial-time algorithm for computing the non-negative residual factorisation of a matrix.
The algorithm is actually very simple:
* the first step is to construct W: select those rows of M which are not convex combinations of the other rows of M; these become the rows of W. This can be done in polynomial time, since checking whether a vector is a convex combination of other vectors reduces to linear programming.
* the second step is to construct A, which is easily done knowing M and W.
Although the first step of the algorithm can indeed be implemented in polynomial time, in practice the following greedy algorithm is often used: the set of rows is picked online, adding at each step the row which is furthest away from the convex hull of the rows picked so far. This gives an overapproximation of the smallest set of rows, but it is much faster. A sketch of the exact, LP-based first step is given below.
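Here is a possible Python sketch of that exact first step, under the assumption that rows are compared via convex combinations as described above: the convex-hull membership test is phrased as a feasibility linear program and solved with scipy.optimize.linprog. The function names are mine, and no attention is paid to numerical robustness.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, rows):
    """Is the vector x a convex combination of the given rows?
    Feasibility LP: find lambda >= 0 with sum(lambda) = 1 and lambda @ rows = x."""
    rows = np.asarray(rows)
    if rows.size == 0:
        return False
    k = rows.shape[0]
    A_eq = np.vstack([rows.T, np.ones((1, k))])   # rows.T @ lambda = x, sum(lambda) = 1
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.success

def select_rows(M):
    """First step of the algorithm: keep the indices of the rows of M
    that are not convex combinations of the other rows of M."""
    M = np.asarray(M)
    return [i for i in range(M.shape[0])
            if not in_convex_hull(M[i], np.delete(M, i, axis=0))]
```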
Subclasses of weighted automata
In this post we are interested in the following subclasses of weighted automata:
* non-negative automata: all weights are non-negative,
* probabilistic (generative) automata: non-negative, for all states q we have \sum_{a \in \Sigma, q' \in Q} \Delta(a)(q,q') + \beta(q) = 1, and \alpha is a distribution over states; this means that f induces a distribution over words,
* residual automata: for all states q, there exists a word w such that \alpha \Delta(w) is zero everywhere except at q, where it is non-zero. We sometimes say that w is an anchor for q.
The notion of anchor appears in this paper by Stratos, Collins, and Hsu, called “Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models”.
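To fix ideas, here is a small Python sketch of a weighted automaton as a triple (\alpha, \Delta, \beta), together with a check of the (generative) probabilistic condition above. The representation and function names are mine, purely for illustration.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class WeightedAutomaton:
    alpha: np.ndarray   # initial vector, shape (n,)
    Delta: dict         # letter -> transition matrix of shape (n, n)
    beta: np.ndarray    # final vector, shape (n,)

    def weight(self, word):
        """Weight of a word: alpha . Delta(w_1) ... Delta(w_k) . beta."""
        v = self.alpha
        for a in word:
            v = v @ self.Delta[a]
        return float(v @ self.beta)

def is_generative_probabilistic(aut, tol=1e-9):
    """All weights non-negative, alpha a distribution over states, and for
    every state q: sum over a, q' of Delta(a)(q, q') plus beta(q) equals 1."""
    nonneg = (aut.alpha >= -tol).all() and (aut.beta >= -tol).all() \
        and all((D >= -tol).all() for D in aut.Delta.values())
    alpha_dist = abs(aut.alpha.sum() - 1.0) <= tol
    outgoing = sum(D.sum(axis=1) for D in aut.Delta.values()) + aut.beta
    return bool(nonneg and alpha_dist and np.allclose(outgoing, 1.0, atol=tol))
```

Checking the residual condition is less immediate, since it requires exhibiting an anchor word for every state.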
A closer look at the proof of Fliess’ theorem
Let us get back to the proof of Fliess’ theorem and see how to relate the type of factorisation of H_f we are using to the type of weighted automaton we consider.
The lower bound
Let \mathcal{A} be a weighted automaton recognising f. Define P_\mathcal{A}(\cdot,q) \in \mathbb{R}^{\Sigma^*} to be the formal series recognised by the weighted automaton \mathcal{A} with e_q as final vector, and S_\mathcal{A}(q,\cdot) \in \mathbb{R}^{\Sigma^*} to be the formal series recognised by \mathcal{A} with e_q as initial vector. We view P_\mathcal{A} as a \Sigma^* \times Q matrix and S_\mathcal{A} as a Q \times \Sigma^* matrix.
We have
H_f = P_\mathcal{A} \cdot S_\mathcal{A}
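Spelled out entrywise, this is simply the computation

H_f(u,v) = f(uv) = \alpha \Delta(u) \Delta(v) \beta = \sum_{q \in Q} (\alpha \Delta(u))(q) \cdot (\Delta(v) \beta)(q) = \sum_{q \in Q} P_\mathcal{A}(u,q) \, S_\mathcal{A}(q,v).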
A few remarks are in order
* without further assumptions, this implies that the rank of H_f is at most |Q|, the number of states of \mathcal{A}, which is exactly the lower bound of the theorem,
* if \mathcal{A} is non-negative, then this is a non-negative factorisation,
* if \mathcal{A} is residual, then this is a residual factorisation.
The upper bound
We write H_f = A W. We let Q index the inner dimension, i.e. the rows of W. Let w be a finite word; we have
H_f(w) = \sum_{q \in Q} A(w,q) W(q),
where H_f(w) is the row corresponding to w in H_f and W(q) the row corresponding to q in W.
Let us assume that the factorisation H_f = AW is residual (recall that a factorisation of minimal dimension can always be chosen residual, but not necessarily both residual and non-negative at the same time). This means that Q is a set of words!
We construct a weighted automaton \mathcal{A} as follows
* the set of states is Q (did you see that coming?)
* the initial vector is defined by \alpha(q) = A(\varepsilon,q)
* the transitions are defined by \Delta(a)(q,q') = A(q a,q')
* the final vector is defined by \beta(q) = H_f(q,\varepsilon), that is, \beta(q) = f(q)
To see that \mathcal{A} recognises f we do a simple induction on words, sketched below.
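A possible way to phrase the induction: one checks that for every word w, as rows of the Hankel matrix,

\sum_{q \in Q} (\alpha \Delta(w))(q) \, H_f(q) = H_f(w).

The base case w = \varepsilon is the factorisation H_f(\varepsilon) = \sum_{q} A(\varepsilon,q) W(q) together with the residual assumption W(q) = H_f(q), and the inductive step from w to wa uses the definition of \Delta(a) together with the identity H_f(qa)(v) = H_f(q)(av). Evaluating both sides at the column \varepsilon then gives \alpha \Delta(w) \beta = H_f(w)(\varepsilon) = f(w).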
Note that if the factorisation is non-negative, then the automaton \mathcal{A} is non-negative. If furthermore H_f is row stochastic, then \mathcal{A} is (generative) probabilistic.
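Here is a minimal Python sketch of this construction on a finite block of the Hankel matrix. The dictionary-based representation and the function name are mine: H maps a pair of words (u, v) to f(uv), anchors is the list of words indexing Q, and A maps (u, q) to the coefficient of the anchor row q in the row of u.

```python
import numpy as np

def automaton_from_residual_factorisation(H, A, anchors, alphabet):
    """Build (alpha, Delta, beta) as in the construction above.
    Assumes A[(u, q)] is defined at least for u the empty word '' and
    for u = q + a with q an anchor and a a letter."""
    alpha = np.array([A[('', q)] for q in anchors])
    Delta = {a: np.array([[A[(q + a, qp)] for qp in anchors] for q in anchors])
             for a in alphabet}
    beta = np.array([H[(q, '')] for q in anchors])   # beta(q) = f(q)
    return alpha, Delta, beta
```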
Let us get back to the general case where the factorisation is not assumed to be residual. It is not clear what can be said, or how to construct an automaton, because the rows of W can no longer be interpreted as rows of H_f.
The residual variant of Fliess’ theorem
We have proved the following.
Theorem:
The non-negative residual rank of H_f is equal to the size of the smallest non-negative residual weighted automaton recognising f. If furthermore H_f is row stochastic, then this is equal to the size of the smallest probabilistic (generative) residual automaton recognising f.
Open questions
Can we drop the residual assumption in the previous theorem? Meaning: is the non-negative rank of H_f equal to the size of the smallest non-negative weighted automaton recognising f? For the lower bound, the proof goes through. I have counterexamples for the upper bound, but it seems to almost hold: it seems that it is enough to add the topological closure of the set of rows. I do not have a proof of this!
Another line of inquiry is how much the residual theorem can be used for learning. It is not yet clear to me how to adapt the Angluin-style learning algorithm from this post. Note that it is always possible to learn a weighted automaton using the previous algorithm; the goal here would be to directly learn a probabilistic automaton. One solution is to learn a weighted automaton (using polynomially many membership and equivalence queries) and then to turn it into a probabilistic one, but can we do this efficiently (in terms of complexity)? We know that there are examples where the anchor words have exponential length.