Lexlog: Theoretical Computer Science: Introduction to Formal Language Theory

Here, we introduce the field of formal language theory.

There are 4 types of languages:

natural language - language arising naturally within societies
constructed language - language consciously defined within societies
artificial language - language arising from artificial situations
formal language - language axiomatically defined

We now use set theory to construct the theory of formal languages.

An alphabet is a set \Sigma whose elements are symbols (characters). We define the object of string (word), akin to tuple, by introducing the concatenation operation o on symbols and strings (the concatenation of symbols (or strings) a, b in \Sigma, a o b = ab is a string).

For this new type of object, we define empty string \epsilon (akin to empty set), length (akin to cardinality), substring relation (akin to subset relation), and string exponentiation (repeated concatenation).

Finally, we define a new set operator that makes sense on alphabets or sets of strings, the Kleene star * (akin to infinite powersetting): an operator on a set S defined inductively as constructing the first 'powerset' by forming all possible concatenations of elements in S U {\epsilon} (hence it will contain S and \epsilon properly), then constructing all concatenations within that new set, and doing so iteratively infinitely. Then if S is a finite set of symbols or strings, S* will be a countably infinite set closed under o, namely the set of all possible words that can be made by joining together elements of \sigma.

We can now define a language L over an alphabet \Sigma as the subset of \Sigma* that contains all valid words in the language. Moreover, all usual set theoretic operations still hold and we can extend the string operations of concatenation and exponentiation to languages.

This paragraph will be a short digression. Notice that in a language all words are finite, so, in order to avoid insulting Cantor's teachings, we also define an \omega-language over \Sigma as a subset of \Sigma^\omega (constructed by infinitely combining elements in \Sigma). Also, we can denote \Sigma^\inf = \Sigma* U \Sigma^\omega as the set of all finite and infinite words of \Sigma.

To start categorizing formal languages, we first need a formalism to describe the syntax of a language: the concept of (generative) formal grammars. We informally define a formal grammar as the set of production rules necessary to construct all the words in a language, thus it formalizes its syntax.

Then we partition the set of all formal languages into 2 major kinds of languages: recursively enumerable and non-recursively enumerable languages - notions central to computability/recursion theory. Suffice to say that to build a non-recursively enumerable language, one has to use non-straight forward techniques like Cantor's diagonalization; which is outside the scope of formal grammars' production rules.

The Chomsky hierarchy provides a way to categorize formal grammars:

Unrestricted grammars generate recursively enumerable languages. This class includes all formal grammars and contains properly all context-sensitive grammars.
Context-sensitive grammars generate context-sensitive languages. This class contains properly all context-free grammars.
Context-free grammars generate context-free languages. This class contains properly all regular grammars.
Regular grammars generate regular languages. This class of grammars has the same expressiveness as regular expressions, that is both are equivalently suitable to generate the class of regular languages.

We now introduce regular expressions as a convenient, compact notation for representing regular grammars and regular languages. That is, a given regular expression will provide a way to construct/describe all words of a given regular language; the procedure used is known as pattern matching.
Given a finite alphabet \Sigma, the set of regular expression RE is defined recursively as the smallest set described by:
Basis: empty set, empty string and any string literal (symbol) in \Sigma belong to RE.
Inductive construction: if R,S belong to RE, then R+S (alternation), RoS (concatenation, abbreviated RS) and R* (Kleene star) belong to RE.

Thus for a given regular language, we can construct a regular expression - a pattern from which every word in the language can be generated. The string operations o and * hold for this concept of pattern as well as alternation (+), it is the analogue of set union: it will match a string that, at a given position within it, matches either of the disjuncted regexps.
Procedurally: the regexp A*(R+S)BB will match any string that starts with 0 to n literals 'A' followed by a pattern matching either R or S and ending with 2 literals 'B'; for example if \Sigma = {A, B, C} and R = C* and S = CA, the strings BB, AAABB, CCCBB, AAACBB, CABB, AACABB, etc. are in the language L(A*(C*+BA*)BB).

We provided a brief basis in formal language theory, the field of study concerned with the syntactical aspects of formalised language. This basis is essential to domains in TCS such as programming language theory in which words of a formal language are given specific semantics and computational complexity theory and mathematical logic where this formalism is chosen to express decision problems and logical statements.

Lexlog: Theoretical Computer Science

Monday, 26 November 2012

Introduction to Formal Language Theory

No comments:

Post a Comment

About Me