Wednesday 28 November 2012

Introduction to Automata Theory

Here, we introduce the field of automata theory.

The most general notion of abstract automaton is called a state transition machine (STM), a formalism used to model machines or computations by a 3-tuple (L, S, T):
  • L is a linguistic structure symbolising the input that the machine will manipulate, as well as any output.
  • S is a state structure with a set of states Q, a single starting state in Q and a set of final states in Q.
  • T is an action structure containing a set of labels and a transition function, modelling the behaviour of the machine. That is, given a state in Q and specific constraints from a label (input/output symbols, costs, capacities, probabilities, etc.), the function determines which states the machine may switch to.
  Note that there are countably many ways to define an STM, and the literature on STMs is relatively thin and inconsistent. Suffice it to say that STMs are a blueprint for constructing any form of automaton.
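
  As a concrete illustration, here is one possible Python encoding of the 3-tuple; this is a minimal sketch, and the field names are our own choice rather than standard terminology:

    from typing import Callable, FrozenSet, NamedTuple

    class STM(NamedTuple):
        # L: linguistic structure (here reduced to an input/output alphabet)
        alphabet: FrozenSet[str]
        # S: state structure (set of states, one start state, final states)
        states: FrozenSet[str]
        start: str
        finals: FrozenSet[str]
        # T: action structure: labels plus a transition function mapping a
        # (state, label) pair to the set of possible successor states
        labels: FrozenSet[str]
        transition: Callable[[str, str], FrozenSet[str]]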

   On the other hand, there are uncountably many different types of automata and, much to our despair, there is no unifying classification for all of them...

   Automata can then be organised by the machinery or dynamical system they are defined to model (cells - cellular automata, neurons - neural networks, computation - Turing machines, stochastic languages - probabilistic automata, etc.); or by the extent of their computational power, which is reflected in which algorithms they can simulate or, equivalently, which languages they can accept:
  • Finite State Machines/Automata (FSM/FSA) accept regular languages.
  • Pushdown Automata (PDA) accept context-free languages.
  • Linear Bounded Automata (LBA) accept context-sensitive languages.
  • Turing Machines (TM), including quantum TMs, accept recursively enumerable languages.
  • Stream Automata (\omega-Automata) accept \omega-languages.
  • Inductive Turing Machines (ITM) accept super-recursive languages.
  In the following segment we will focus on finite-state automata, that is: machines that take a string as input, parse it character by character by changing states according to their transition function, and eventually halt (since strings are finite); the string is accepted if the automaton halts in an accepting state and rejected otherwise.

An FSA is a 5-tuple (Q, \Sigma, \delta, q_0, F), where:
  • Q is a finite set of states
  • \Sigma is a finite alphabet
  • \delta is a transition function: Q x \Sigma -> Q or P(Q) (where P(Q) is the powerset of Q)
  • q_0 is the start state
  • F is the set of accepting states
FSAs are characterised by having finite memory that depends only on the finite number of states the automaton can reach. They differ on the form of their transition function:
  • Deterministic FSA (DFSA): \delta : Q x \Sigma -> Q
    • Moore machine: a DFSA producing output via a function \lambda : Q -> \Gamma, where \Gamma is an output alphabet (output depends only on the current state)
    • Mealy machine: a DFSA producing output via a function \lambda : Q x \Sigma -> \Gamma (output depends on the current state and the input symbol)
  • Non-deterministic FSA (NFSA): \delta : Q x \Sigma -> P(Q)
  FSAs can only be in a single state at once (like most non-quantum automata): the state a DFSA is in is deterministic (we know which state it is in with probability 1), whereas the state an NFSA is in is non-deterministic (we do not know which state it is in; it could be any state within the element of P(Q) returned by \delta). An NFSA accepts an input if at least one of the states it could end in is accepting.
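
  To make this concrete, here is a minimal Python sketch of a DFSA; the encoding and the example language (strings over {a, b} with an even number of a's) are our own:

    # A DFSA given as a 5-tuple; delta is a dict from (state, symbol) to state.
    def run_dfsa(states, alphabet, delta, q0, finals, word):
        q = q0
        for c in word:
            if c not in alphabet:
                raise ValueError(f"symbol not in alphabet: {c!r}")
            q = delta[(q, c)]   # deterministic: exactly one successor state
        return q in finals      # accept iff we halt in an accepting state

    # Example: strings over {a, b} containing an even number of a's.
    delta = {("even", "a"): "odd", ("even", "b"): "even",
             ("odd", "a"): "even", ("odd", "b"): "odd"}
    assert run_dfsa({"even", "odd"}, {"a", "b"}, delta, "even", {"even"}, "abab")
    assert not run_dfsa({"even", "odd"}, {"a", "b"}, delta, "even", {"even"}, "ab")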

  Notice that the transition function for NFSAs differs by the fact that its range is the powerset of the range for that of DFSAs; thus it is possible to model any NFSA with a DFSA (the reverse is trivial), but doing so requires a number of states bounded by the cardinality of the powerset of the states in the NFSA - recall that we proved it to be 2^{#states in NFSA}. Hence non-determinism is advantageous for succinctness (a complexity concern) rather than for expressive power (a computability concern).
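
  The standard translation is the subset construction; here is a minimal Python sketch, assuming an NFSA given as a dict mapping (state, symbol) to a set of states (no \epsilon-moves):

    def nfsa_to_dfsa(alphabet, delta, q0, finals):
        # Each DFSA state is a frozenset of NFSA states the machine could be in.
        start = frozenset({q0})
        dfa_delta, seen, todo = {}, {start}, [start]
        while todo:
            S = todo.pop()
            for c in alphabet:
                # Union of the NFSA successors of every state in S on symbol c.
                T = frozenset(q for s in S for q in delta.get((s, c), ()))
                dfa_delta[(S, c)] = T
                if T not in seen:
                    seen.add(T)
                    todo.append(T)
        dfa_finals = {S for S in seen if S & finals}  # could be in an accepting state
        return seen, dfa_delta, start, dfa_finals

  In the worst case the loop visits every subset of the NFSA's states, which is exactly the 2^{#states in NFSA} bound above.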

  There is a one-to-one correspondence between all languages that can be formed from a regular expression and all languages that can be accepted by an FSA, namely both these classes of languages are equal - they are the class of regular languages:
  •  prove L(R) \subseteq L(M): use the recursive definition of regular expressions to show that FSAs can be built to accept any language induced by the base cases and, by construction, by the inductive cases (see the sketch after this list).
  •  prove L(M) \subseteq L(R): use state elimination on an arbitrary FSA: show how a regular expression can express any intermediary state of the FSA, thus how any FSA can be reduced to a single start state with regular expression transitions to accepting states. The union of all these REs is the regexp generating the language the FSA accepts.
  •  conclude that L(M) = L(R), that is: REs and FSAs are equivalently suitable to generate/accept any regular language.
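
  The constructive direction is usually carried out with Thompson's construction: build a small \epsilon-NFSA fragment for each base case and combine fragments for +, o and *. A minimal sketch in our own encoding, where moves maps a (state, symbol) pair to a set of states and the symbol None marks an \epsilon-move:

    counter = 0
    def new_state():
        global counter
        counter += 1
        return counter

    def add(moves, q, c, r):
        moves.setdefault((q, c), set()).add(r)

    def lit(c):
        # Base case: a fragment accepting exactly the one-symbol string c.
        s, a = new_state(), new_state()
        m = {}
        add(m, s, c, a)
        return s, a, m

    def union(f, g):
        # R+S: a new start state with epsilon-moves into both fragments.
        (fs, fa, fm), (gs, ga, gm) = f, g
        s, a = new_state(), new_state()
        m = {**fm, **gm}
        add(m, s, None, fs); add(m, s, None, gs)
        add(m, fa, None, a); add(m, ga, None, a)
        return s, a, m

    def concat(f, g):
        # RoS: an epsilon-move from f's accept state to g's start state.
        (fs, fa, fm), (gs, ga, gm) = f, g
        m = {**fm, **gm}
        add(m, fa, None, gs)
        return fs, ga, m

    def star(f):
        # R*: loop back with epsilon-moves, and accept the empty string.
        fs, fa, fm = f
        s, a = new_state(), new_state()
        m = dict(fm)
        add(m, s, None, fs); add(m, s, None, a)
        add(m, fa, None, fs); add(m, fa, None, a)
        return s, a, m

    # Example: a fragment for (a+b)*, which accepts any word over {a, b}.
    frag = star(union(lit("a"), lit("b")))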

  We provided a small foundation in automata theory, which supplies powerful formalisms to model computations. We also briefly glanced at the standard formalism of Computation, Turing Machines, which are based on automata (using the same STM construct). We tentatively extrapolate our remark about FSAs being as powerful a formalism as REs to TMs being as powerful a formalism as another formalism for Computation, the Lambda-Calculus.

Monday 26 November 2012

Introduction to Formal Language Theory

Here, we introduce the field of formal language theory.

There are 4 types of languages:

  • natural language - language arising naturally within societies
  • constructed language - language consciously defined within societies
  • artificial language - language arising from artificial situations
  • formal language - language axiomatically defined
We now use set theory to construct the theory of formal languages.

  An alphabet is a set \Sigma whose elements are symbols (characters). We define the object of string (word), akin to tuple, by introducing the concatenation operation o on symbols and strings: the concatenation a o b = ab of symbols or strings a and b is itself a string.
  For this new type of object, we define empty string \epsilon (akin to empty set), length (akin to cardinality), substring relation (akin to subset relation), and string exponentiation (repeated concatenation).
  Finally, we define a new set operator that makes sense on alphabets or sets of strings, the Kleene star * (akin to infinite powersetting): an operator on a set S defined inductively as constructing a first 'powerset' by forming all possible concatenations of elements in S U {\epsilon} (hence it will properly contain S and \epsilon), then constructing all concatenations within that new set, and doing so iteratively infinitely. Then if S is a finite set of symbols or strings, S* will be a countably infinite set closed under o, namely the set of all possible words that can be made by joining together elements of S.
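
  A minimal Python sketch (our own) makes the construction concrete: the generator below lazily enumerates S*, deduplicating words reachable by several different concatenations:

    from itertools import count, islice, product

    def kleene_star(S):
        # Lazily yield every element of S* exactly once, by number of factors;
        # the empty string epsilon is produced by the 0-fold concatenation.
        seen = set()
        for n in count(0):
            for parts in product(sorted(S), repeat=n):
                w = "".join(parts)
                if w not in seen:
                    seen.add(w)
                    yield w

    print(list(islice(kleene_star({"a", "b"}), 7)))
    # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']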

  We can now define a language L over an alphabet \Sigma as a subset of \Sigma* containing exactly the valid words of the language. Moreover, all usual set theoretic operations still hold, and we can extend the string operations of concatenation and exponentiation to languages.
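
  For instance, both operations lift to languages element-wise; a short sketch:

    def lang_concat(L1, L2):
        # L1 o L2: every concatenation of a word of L1 with a word of L2.
        return {u + v for u in L1 for v in L2}

    def lang_pow(L, n):
        # L^n: L concatenated with itself n times, with L^0 = {epsilon}.
        result = {""}
        for _ in range(n):
            result = lang_concat(result, L)
        return result

    assert lang_concat({"a", "ab"}, {"b"}) == {"ab", "abb"}
    assert lang_pow({"ab"}, 2) == {"abab"}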


  This paragraph will be a short digression. Notice that in a language all words are finite, so, in order to avoid insulting Cantor's teachings, we also define an \omega-language over \Sigma as a subset of \Sigma^\omega (constructed by infinitely concatenating elements of \Sigma). Also, we can denote by \Sigma^\infty = \Sigma* U \Sigma^\omega the set of all finite and infinite words over \Sigma.

  To start categorizing formal languages, we first need a formalism to describe the syntax of a language: the concept of (generative) formal grammars. We informally define a formal grammar as the set of production rules necessary to construct all the words in a language; it thus formalizes the language's syntax.
  Then we partition the set of all formal languages into 2 major kinds: recursively enumerable and non-recursively enumerable languages - notions central to computability/recursion theory. Suffice it to say that to build a non-recursively enumerable language, one has to use non-straightforward techniques like Cantor's diagonalization, which are outside the scope of formal grammars' production rules.

The Chomsky hierarchy provides a way to categorize formal grammars:

  • Unrestricted grammars generate recursively enumerable languages. This class includes all formal grammars and contains properly all context-sensitive grammars.
  • Context-sensitive grammars generate context-sensitive languages. This class contains properly all context-free grammars.
  • Context-free grammars generate context-free languages. This class contains properly all regular grammars.
  • Regular grammars generate regular languages. This class of grammars has the same expressiveness as regular expressions, that is, both are equivalently suitable to generate the class of regular languages (see the example productions after this list).
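  To make one of these proper containments concrete, here are two standard textbook grammars: a context-free grammar generating {a^n b^n : n >= 0}, a language that is provably not regular, next to a regular (right-linear) grammar generating (ab)*:

    Context-free, not regular:   S -> aSb | \epsilon
    Regular (right-linear):      S -> aA | \epsilon
                                 A -> bS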
  We now introduce regular expressions as a convenient, compact notation for representing regular grammars and regular languages. That is, a given regular expression will provide a way to construct/describe all words of a given regular language; the procedure used is known as pattern matching.
  Given a finite alphabet \Sigma, the set of regular expressions RE is defined recursively as the smallest set satisfying:
Basis: empty set, empty string and any string literal (symbol) in \Sigma belong to RE.
Inductive construction: if R,S belong to RE, then R+S (alternation), RoS (concatenation, abbreviated RS) and R* (Kleene star) belong to RE.

  Thus for a given regular language, we can construct a regular expression - a pattern from which every word in the language can be generated. The string operations o and * carry over to this concept of pattern, as does alternation (+), the analogue of set union: it will match a string that, at a given position within it, matches either of the alternated regexps.

  Procedurally: the regexp A*(R+S)BB will match any string that starts with zero or more literals 'A', followed by a pattern matching either R or S, and ending with 2 literals 'B'; for example, if \Sigma = {A, B, C} and R = C* and S = CA, the strings BB, AAABB, CCCBB, AAACBB, CABB, AACABB, etc. are in the language L(A*(C*+CA)BB).
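
  These examples can be checked mechanically; a small sketch using Python's re module, in which alternation is written | rather than + and concatenation is implicit:

    import re

    # A*(C*+CA)BB in the notes' notation becomes A*(C*|CA)BB in re syntax.
    pattern = re.compile(r"A*(C*|CA)BB")

    for w in ["BB", "AAABB", "CCCBB", "AAACBB", "CABB", "AACABB"]:
        assert pattern.fullmatch(w)       # every example from the text matches
    assert not pattern.fullmatch("AB")    # a single trailing B is not enough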

  We provided a brief basis in formal language theory, the field of study concerned with the syntactical aspects of formalised languages. This basis is essential to domains in TCS such as programming language theory, in which words of a formal language are given specific semantics, and computational complexity theory and mathematical logic, where this formalism is chosen to express decision problems and logical statements.

Wednesday 7 November 2012

Divide and Conquer

Here, we develop results for divide and conquer algorithms.

  The divide and conquer strategy consists of dividing a problem into subproblems until these become trivial to solve, then combining the results iteratively (bottom-up) to provide the solution to the original problem. The paradigm is very efficient for certain kinds of problems where that design makes sense: sorting (quicksort, mergesort), searching (binary search), multiplying (Karatsuba, Strassen), etc. This strategy is naturally implemented using recursion: if the problem is trivial, solve it; else make recursive calls to solve the subproblems and combine their return values in a specific way.
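
  A minimal Python sketch of the pattern, using mergesort as the running example:

    def mergesort(xs):
        if len(xs) <= 1:                 # trivial subproblem: already sorted
            return xs
        mid = len(xs) // 2
        left = mergesort(xs[:mid])       # recursive calls on the two halves
        right = mergesort(xs[mid:])
        return merge(left, right)        # combine the partial results

    def merge(left, right):
        # Merge two sorted lists into one sorted list in linear time.
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        return out + left[i:] + right[j:]

    assert mergesort([3, 1, 4, 1, 5]) == [1, 1, 3, 4, 5]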

  The time complexity (the function describing the amount of time the algorithm will take to terminate depending on the size of its input) of such algorithms is going to depend on the number of recursive calls, the time each of these recursive calls takes, and the time complexity of the function that combines the return values of the recursive calls. Hence, the time complexity is going to take the form of a recurrence relation: for the trivial cases, the time is fixed; for the nontrivial cases, the running time given an input of size n will depend on the time given an input of size n' < n. To obtain a closed form (depending only on the size of the input) for the time complexity, we use the method of repeated substitution.
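
  For example, take mergesort's recurrence T(n) = 2T(n/2) + cn with T(1) = c, and assume n = 2^k for some k; repeated substitution gives:

    T(n) = 2T(n/2) + cn
         = 2(2T(n/4) + cn/2) + cn = 4T(n/4) + 2cn
         = ...
         = 2^k T(n/2^k) + kcn
         = cn + cn*log_2(n)        (since 2^k = n and T(1) = c)

  so T(n) is in \Theta(n log n).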

  Since the validity of the obtained closed form relies on the assumptions made on k (i.e. on n being an exact power of the dividing factor), there are several more steps to take in order to prove the result for all inputs: since we are interested in the time behaviour, we will actually end up using that closed form to set a bound on the time complexity. We can prove inductively that T(n) is a non-decreasing function, then use the closed form to set a lower bound, upper bound and/or tight bound on T. For convenience:

Recall the asymptotic complexity bounds (big O notation):
  • Upper asymptotic bound: T(n) is in O(f(n)) iff there exist constants c > 0 and n_0 such that T(n) <= c*f(n) for all n >= n_0.
  • Tight asymptotic bound: T(n) is in \Theta(f(n)) iff T(n) is in both O(f(n)) and \Omega(f(n)).
  • Lower asymptotic bound: T(n) is in \Omega(f(n)) iff there exist constants c > 0 and n_0 such that T(n) >= c*f(n) for all n >= n_0.

  A general recipe for deriving asymptotic bounds for the recurrence relations arising as time complexities of divide and conquer algorithms is the Master Theorem. Given a relation of the form T(n) = aT(n/b) + f(n) for n bigger than a given threshold and f(n) in \Theta(n^d), where a is the number of recursive calls, b is the dividing factor and f(n) represents the time complexity of the dividing and recombining procedure, the master theorem provides a tight bound for T as follows:
  • if a < b^d then T(n) in \Theta(n^d)
  • if a = b^d then T(n) in \Theta(log(n)*n^d)
  • if a > b^d then T(n) in \Theta(n^{log_b(a)})
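
  A small Python helper applying the three cases (our own sketch; it assumes f(n) in \Theta(n^d) exactly as above):

    from math import log

    def master_theorem(a, b, d):
        # Tight bound for T(n) = aT(n/b) + Theta(n^d), by the three cases.
        if a < b ** d:
            return f"Theta(n^{d})"
        if a == b ** d:
            return f"Theta(n^{d} log n)"
        return f"Theta(n^{log(a, b):.3f})"   # exponent log_b(a)

    assert master_theorem(2, 2, 1) == "Theta(n^1 log n)"  # mergesort: Theta(n log n)
    assert master_theorem(1, 2, 0) == "Theta(n^0 log n)"  # binary search: Theta(log n)
    print(master_theorem(7, 2, 2))  # Strassen: Theta(n^2.807)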
 Notice that the theorem assumes a relation with a constant b, i.e. the subproblems must be of roughly the same size. Another generalisation of this procedure for deriving bounds on the time complexity of divide and conquer recurrences, usable when the subproblems have different sizes, is the Akra-Bazzi method.