Thursday, March 31, 2011

Non-Regular Expressions

Many features found in modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory. The pattern for these strings is (.*)\1.
The language of squares is not regular, nor is it context-free. Pattern matching with an unbounded number of backreferences, as supported by numerous modern tools, is NP-complete (see [11], Theorem 6.2).
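To see this concretely, here is a minimal sketch using Python's re module (the pattern below uses (.+)\1 rather than (.*)\1 so that the repeated part must be nonempty):

    import re

    # A backreference (\1) recalls the exact text captured by group 1,
    # so (.+)\1 matches "squares": some nonempty string repeated twice.
    square = re.compile(r'(.+)\1')

    for s in ['papa', 'WikiWiki', 'paper']:
        print(s, bool(square.fullmatch(s)))
    # papa True, WikiWiki True, paper False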
However, many tools, libraries, and engines that provide such constructions still use the term regular expression for their patterns. This has led to a nomenclature where the term regular expression has different meanings in formal language theory and pattern matching. For this reason, some people have taken to using the term regex or simply pattern to describe the latter. Larry Wall, author of the Perl programming language, writes in an essay about the design of Perl 6:
'Regular expressions' [...] are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).


There are at least three different algorithms that decide if and how a given regular expression matches a string.
The oldest and fastest two rely on a result in formal language theory that allows every nondeterministic finite automaton (NFA) to be transformed into a deterministic finite automaton (DFA). The DFA can be constructed explicitly and then run on the resulting input string one symbol at a time. Constructing the DFA for a regular expression of size m has the time and memory cost of O(2^m), but it can be run on a string of size n in time O(n). An alternative approach is to simulate the NFA directly, essentially building each DFA state on demand and then discarding it at the next step, possibly with caching. This keeps the DFA implicit and avoids the exponential construction cost, but running cost rises to O(nm). The explicit approach is called the DFA algorithm and the implicit approach the NFA algorithm. As both can be seen as different ways of executing the same DFA, they are also often called the DFA algorithm without making a distinction. These algorithms are fast, but using them for recalling grouped subexpressions, lazy quantification, and similar features is tricky.[12][13]
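To make the table-driven DFA approach concrete, here is a minimal sketch with a hand-built DFA for the regex ab*c (only the finished table is shown; building such a table from an arbitrary regex is the expensive step):

    # DFA for ab*c over the alphabet {a, b, c}: state 0 = start,
    # state 2 = accept; a missing table entry means the dead state.
    DFA = {
        (0, 'a'): 1,
        (1, 'b'): 1,
        (1, 'c'): 2,
    }
    ACCEPT = {2}

    def run(s):
        state = 0
        for ch in s:                       # one table lookup per symbol: O(n)
            state = DFA.get((state, ch))
            if state is None:
                return False
        return state in ACCEPT

    print(run('abbbc'), run('ac'), run('abcb'))  # True True False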
The third algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. This behavior can cause a security problem called Regular expression Denial of Service (ReDoS), which an attacker can trigger by supplying input that drives a vulnerable engine into its worst case. More sophisticated implementations will often identify and speed up or abort common cases where they would otherwise run slowly.
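The blowup is easy to observe. The sketch below (Python's backtracking re engine) times (a|aa)*b against runs of a's with the required b missing; each extra a multiplies the work by roughly the golden ratio, so keep n small:

    import re
    import time

    pattern = re.compile(r'(a|aa)*b')

    for n in (18, 22, 26, 30):        # larger n can take far longer
        s = 'a' * n                   # no 'b': every way of splitting fails
        t0 = time.perf_counter()
        pattern.match(s)
        print(n, round(time.perf_counter() - t0, 4), 'seconds')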
Although backtracking implementations only give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power. For example, any implementation which allows the use of backreferences, or implements the various extensions introduced by Perl, must use a backtracking implementation.
Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.

In theoretical terms, any token set can be matched by regular expressions as long as it is pre-defined. In terms of historical implementations, regular expressions were originally written to use ASCII characters as their token set, though regular expression libraries have since supported numerous other character sets. Many modern regular expression engines offer at least some support for Unicode. In most respects it makes no difference what the character set is, but some issues do arise when extending regular expressions to support Unicode.
  • Supported encoding. Some regular expression libraries expect to work on some particular encoding instead of on abstract Unicode characters. Many of these require the UTF-8 encoding, while others might expect UTF-16, or UTF-32. In contrast, Perl and Java are agnostic on encodings, instead operating on decoded characters internally.
  • Supported Unicode range. Many regular expression engines support only the Basic Multilingual Plane, that is, the characters which can be encoded with only 16 bits. Currently, only a few regular expression engines (e.g., Perl's and Java's) can handle the full 21-bit Unicode range.
  • Extending ASCII-oriented constructs to Unicode. For example, in ASCII-based implementations, character ranges of the form [x-y] are valid wherever x and y are codepoints in the range [0x00,0x7F] and codepoint(x) ≤ codepoint(y). The natural extension of such character ranges to Unicode would simply change the requirement that the endpoints lie in [0x00,0x7F] to the requirement that they lie in [0,0x10FFFF]. However, in practice this is often not the case. Some implementations, such as that of gawk, do not allow character ranges to cross Unicode blocks. A range like [0x61,0x7F] is valid since both endpoints fall within the Basic Latin block, as is [0x0530,0x0560] since both endpoints fall within the Armenian block, but a range like [0x0061,0x0532] is invalid since it includes multiple Unicode blocks. Other engines, such as that of the Vim editor, allow block-crossing but limit the number of characters in a range to 128.
  • Case insensitivity. Some case-insensitivity flags affect only the ASCII characters. Other flags affect all characters. Some engines have two different flags, one for ASCII, the other for Unicode. Exactly which characters belong to the POSIX classes also varies.
  • Cousins of case insensitivity. As ASCII has case distinction, case insensitivity became a logical feature in text searching. Unicode introduced alphabetic scripts without case like Devanagari. For these, case sensitivity is not applicable. For scripts like Chinese, another distinction seems logical: between traditional and simplified. In Arabic scripts, insensitivity to initial, medial, final, and isolated position may be desired. In Japanese, insensitivity between hiragana and katakana is sometimes useful.
  • Normalization. Unicode introduced combining characters. Like old typewriters, plain letters can be followed by one or more non-spacing symbols (usually diacritics like accent marks) to form a single printing character. Consider a letter with both a grave and an acute accent mark. That might be written with the grave appearing before the acute, or vice versa. As a consequence, two different code sequences can result in identical character display.
  • New control codes. Unicode introduced, amongst others, byte order marks and text direction markers. These codes might have to be dealt with in a special way.
  • Introduction of character classes for Unicode blocks, scripts, and numerous other character properties. Block properties are much less useful than script properties, because a block can have code points from several different scripts, and a script can have code points from several different blocks.[14] In Perl and the java.util.regex library, properties of the form \p{InX} or \p{Block=X} match characters in block X and \P{InX} or \P{Block=X} matches code points not in that block. Similarly, \p{Armenian}, \p{IsArmenian}, or \p{Script=Armenian} matches any character in the Armenian script. In general, \p{X} matches any character with either the binary property X or the general category X. For example, \p{Lu}, \p{Uppercase_Letter}, or \p{GC=Lu} matches any upper-case letter. Binary properties that are not general categories include \p{White_Space}, \p{Alphabetic}, \p{Math}, and \p{Dash}. Examples of non-binary properties are \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, and \p{Numeric_Value=10}.
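As an illustration, the third-party Python regex package (not the standard-library re module) supports these property escapes; a small sketch, assuming that package is installed:

    # pip install regex  -- the standard-library re module lacks \p{...}
    import regex

    print(regex.findall(r'\p{Lu}', 'Año Nuevo 2011'))             # ['A', 'N']
    print(bool(regex.search(r'\p{Script=Armenian}', 'Հայերեն')))  # True
    print(regex.findall(r'\p{Nd}', 'room 42'))                    # ['4', '2']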
Regular expressions are useful in the production of syntax highlighting systems, data validation, and many other tasks.
While regular expressions would be useful on search engines such as Google, processing them across the entire database could consume excessive computer resources depending on the complexity and design of the regex. Although in many cases system administrators can run regex-based queries internally, most search engines do not offer regex support to the public.

Regular Expressions

In computing, a regular expression, also referred to as regex or regexp, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
The following examples illustrate a few specifications that could be expressed in a regular expression:
  • The sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"
  • The sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"
  • The word "car" when it appears as an isolated word
  • The word "car" when preceded by the word "blue" or "red"
  • The word "car" when not preceded by the word "motor"
  • A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$100" or "$245.99").
Regular expressions can be much more complex than these examples.
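One plausible rendering of the specifications above in Python's re syntax (the sample text and the exact patterns are illustrative, not canonical):

    import re

    text = 'The blue car and the motor car passed a bicarbonate truck for $245.99'

    patterns = {
        'car anywhere':              r'car',
        'car as an isolated word':   r'\bcar\b',
        'car after blue or red':     r'\b(?:blue|red) car\b',
        'car not preceded by motor': r'\b(?<!motor )car\b',
        'dollar amount':             r'\$\d+(?:\.\d{2})?',
    }
    for name, pat in patterns.items():
        print(name, '->', re.findall(pat, text))

    # "car" with other characters between the letters, within one word:
    print(re.search(r'c\w*a\w*r', 'Icelander').group())   # 'celander'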
Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. Some of these languages, including Perl, Ruby, Awk, and Tcl, have fully integrated regular expressions into the syntax of the core language itself. Others like C, C++, .NET, Java, and Python instead provide access to regular expressions only through libraries. Utilities provided by Unix distributions—including the editor ed and the filter grep—were the first to popularize the concept of regular expressions.
As an example of the syntax, the regular expression \bex can be used to search for all instances of the string "ex" that occur after "word boundaries" (signified by the \b). Thus \bex will find the matching string "ex" in two possible locations, (1) at the beginning of words, and (2) between two characters in a string, where one is a word character and the other is not a word character. For instance, in the string "Texts for experts", \bex matches the "ex" in "experts" but not in "Texts" (because the "ex" occurs inside a word and not immediately after a word boundary).
Many modern computing systems provide wildcard characters in matching filenames from a file system. This is a core capability of many command-line shells and is also known as globbing. Wildcards differ from regular expressions in generally expressing only limited forms of patterns.
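The gap in power can be seen directly in Python, where fnmatch.translate rewrites a glob as the regular expression it abbreviates:

    import fnmatch
    import re

    # A glob is a restricted pattern; translate() shows its regex equivalent
    # (the exact output string varies across Python versions).
    print(fnmatch.translate('*.txt'))
    print(bool(re.match(fnmatch.translate('report?.csv'), 'report1.csv')))  # True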

A regular expression, often called a pattern, is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements. For example, the set containing the three strings "Handel", "Händel", and "Haendel" can be described by the pattern H(ä|ae?)ndel (or alternatively, it is said that the pattern matches each of the three strings). In most formalisms, if there is any regex that matches a particular set then there is an infinite number of such expressions. Most formalisms provide the following operations to construct regular expressions.
Boolean "or"
A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
Grouping
Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" and "grey".
Quantification
A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk * (derived from the Kleene star), and the plus sign + (Kleene cross).
?
The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both "color" and "colour".
*
The asterisk indicates there are zero or more of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
+
The plus sign indicates that there is one or more of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations +, −, ×, and ÷. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
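A small sketch of these operators in Python's re dialect (re.fullmatch tests whether the whole string is in the described set):

    import re

    tests = {
        r'colou?r': ['color', 'colour', 'colouur'],
        r'ab*c':    ['ac', 'abc', 'abbbc', 'abb'],
        r'ab+c':    ['ac', 'abc', 'abbbc'],
    }
    for pat, words in tests.items():
        for w in words:
            print(pat, w, bool(re.fullmatch(pat, w)))

    # The two equivalent Händel patterns accept the same three strings:
    names = ['Handel', 'Händel', 'Haendel']
    print([n for n in names if re.fullmatch(r'H(ä|ae?)ndel', n)])
    print([n for n in names if re.fullmatch(r'H(a|ae|ä)ndel', n)])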
The precise syntax for regular expressions varies among tools and with context; more detail is given in the Syntax section.

The origins of regular expressions lie in automata theory and formal language theory, both of which are part of theoretical computer science. These fields study models of computation (automata) and ways to describe and classify formal languages. In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets.[1] The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions. Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions ("grep" is a word derived from the command for regular expression searching in the ed editor: g/re/p where re stands for regular expression[2]). Since that time, many variations of Thompson's original adaptation of regular expressions have been widely used in Unix and Unix-like utilities including expr, AWK, Emacs, vi, and lex.
Perl and Tcl regular expressions were derived from a regex library written by Henry Spencer, though Perl later expanded on Spencer's library to add many new features.[3] Philip Hazel developed PCRE (Perl Compatible Regular Expressions), which attempts to closely mimic Perl's regular expression functionality and is used by many modern tools including PHP and Apache HTTP Server. Part of the effort in the design of Perl 6 is to improve Perl's regular expression integration, and to increase their scope and capabilities to allow the definition of parsing expression grammars.[4] The result is a mini-language called Perl 6 rules, which are used to define Perl 6 grammar as well as provide a tool to programmers in the language. These rules maintain existing features of Perl 5.x regular expressions, but also allow BNF-style definition of a recursive descent parser via sub-rules.
The use of regular expressions in structured information standards for document and database modeling started in the 1960s and expanded in the 1980s when industry standards like ISO SGML (whose precursor was ANSI "GCA 101-1983") were consolidated. The kernel of these structure specification language standards is regular expressions. Simple use is evident in the DTD element group syntax.

Regular expressions describe regular languages in formal language theory. They have thus the same expressive power as regular grammars. Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. The following definition is standard, and found as such in most textbooks on formal language theory.[5][6] Given a finite alphabet Σ, the following constants are defined:
  • (empty set) ∅ denoting the set ∅, which contains no strings at all.
  • (empty string) ε denoting the set containing only the "empty" string, which has no characters at all.
  • (literal character) a in Σ denoting the set containing only the character a.
The following operations are defined:
  • (concatenation) RS denoting the set { αβ | α in R and β in S }. For example {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}.
  • (alternation) R | S denoting the set union of R and S. For example {"ab", "c"}|{"ab", "d", "ef"} = {"ab", "c", "d", "ef"}.
  • (Kleene star) R* denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating any finite number (including zero) of strings from R. For example, {"0","1"}* is the set of all finite binary strings (including the empty string), and {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "abcab", ... }.
To avoid parentheses it is assumed that the Kleene star has the highest priority, then concatenation and then set union. If there is no ambiguity then parentheses may be omitted. For example, (ab)c can be written as abc, and a|(b(c*)) can be written as a|bc*. Many textbooks use the symbols ∪, +, or ∨ for alternation instead of the vertical bar.
Examples:
  • a|b* denotes {ε, a, b, bb, bbb, ...}
  • (a|b)* denotes the set of all strings with no symbols other than a and b, including the empty string: {ε, a, b, aa, ab, ba, bb, aaa, ...}
  • ab*(c|ε) denotes the set of strings starting with a, then zero or more bs and finally optionally a c: {a, ac, ab, abc, abb, abbc, ...}
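These three examples can be checked mechanically; a small sketch in Python's re dialect, where (?:c)? stands in for (c|ε) since ε has no literal syntax:

    import re

    examples = {
        'a|b*':     r'a|b*',
        '(a|b)*':   r'(?:a|b)*',
        'ab*(c|ε)': r'ab*(?:c)?',
    }
    words = ['', 'a', 'b', 'bb', 'ab', 'ba', 'ac', 'abc', 'abb', 'c']
    for name, pat in examples.items():
        print(name, [w or 'ε' for w in words if re.fullmatch(pat, w)])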

Expressive power and compactness

The formal definition of regular expressions is purposely parsimonious and avoids defining the redundant quantifiers ? and +, which can be expressed as follows: a+ = aa*, and a? = (a|ε). Sometimes the complement operator is added, to give a generalized regular expression; here R^c matches all strings over Σ* that do not match R. In principle, the complement operator is redundant, as it can always be circumscribed by using the other operators. However, the process for computing such a representation is complex, and the result may require expressions of doubly exponential size.[7][8]
Regular expressions in this sense can express the regular languages, exactly the class of languages accepted by deterministic finite automata. There is, however, a significant difference in compactness. Some classes of regular languages can only be described by deterministic finite automata whose size grows exponentially in the size of the shortest equivalent regular expressions. The standard examples here are the languages Lk consisting of all strings over the alphabet {a,b} whose kth-from-last letter equals a. On the one hand, a regular expression describing L4 is given by (a|b)*a(a|b)(a|b)(a|b). Generalizing this pattern to Lk gives the expression
(a|b)*a(a|b)(a|b)···(a|b), with k−1 copies of (a|b) following the a.
On the other hand, it is known that every deterministic finite automaton accepting the language Lk must have at least 2^k states. Luckily, there is a simple mapping from regular expressions to the more general nondeterministic finite automata (NFAs) that does not lead to such a blowup in size; for this reason NFAs are often used as alternative representations of regular languages. NFAs are a simple variation of the type-3 grammars of the Chomsky hierarchy.[5]
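The blowup itself is easy to demonstrate: build the obvious (k+1)-state NFA for Lk and count how many states the powerset construction reaches. A sketch (the particular NFA encoding is an assumption of this sketch, not taken from the text):

    from itertools import chain

    def nfa_lk(k):
        # State 0 loops on both letters and "guesses" that the current 'a'
        # is the kth-from-last symbol; states 1..k then count the remainder.
        delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
        for i in range(1, k):
            delta[(i, 'a')] = delta[(i, 'b')] = {i + 1}
        return delta

    def reachable_dfa_states(k):
        delta = nfa_lk(k)
        start = frozenset({0})
        seen, stack = {start}, [start]
        while stack:
            subset = stack.pop()
            for c in 'ab':
                step = frozenset(chain.from_iterable(
                    delta.get((q, c), ()) for q in subset))
                if step not in seen:
                    seen.add(step)
                    stack.append(step)
        return len(seen)

    for k in range(1, 8):
        print(k, reachable_dfa_states(k))   # 2, 4, 8, ... = 2**k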
Finally, it is worth noting that many real-world "regular expression" engines implement features that cannot be described by the regular expressions in the sense of formal language theory; see below for more on this.

Deciding equivalence of regular expressions

As the examples show, different regular expressions can express the same language: the formalism is redundant.
It is possible to write an algorithm that, for two given regular expressions, decides whether the described languages are equal: it reduces each expression to a minimal deterministic finite state machine and determines whether the two machines are isomorphic (equivalent).
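The full decision procedure is beyond a short sketch, but the idea of language comparison can be illustrated with a bounded heuristic: enumerate all strings up to a fixed length and look for a witness on which the two expressions disagree (this proves inequality when it finds one, but cannot prove equality):

    import re
    from itertools import product

    def disagreement(p1, p2, alphabet='ab', max_len=8):
        r1, r2 = re.compile(p1), re.compile(p2)
        for n in range(max_len + 1):
            for chars in product(alphabet, repeat=n):
                w = ''.join(chars)
                if bool(r1.fullmatch(w)) != bool(r2.fullmatch(w)):
                    return w          # witness string separating the languages
        return None                   # no disagreement found up to max_len

    print(disagreement(r'(?:a|b)*', r'(?:a*b*)*'))    # None: same language
    print(disagreement(r'a(?:a|b)*', r'(?:a|b)*a'))   # 'ab': a witness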
To what extent can this redundancy be eliminated? Kleene star and set union are required to find an interesting subset of regular expressions that is still fully expressive, but perhaps their use can be restricted. This is a surprisingly difficult problem. As simple as regular expressions are, there is no method to systematically rewrite them to some normal form. The lack of an axiomatization in the past led to the star height problem. Recently, Dexter Kozen axiomatized regular expressions with Kleene algebra.[9]

Informally: a regular expression (regex or regexp for short) is a special text string for describing a search pattern; you can think of regular expressions as wildcards on steroids.

Non-deterministic Finite Automata

In the theory of computation, a nondeterministic finite state machine or nondeterministic finite automaton (NFA) is a finite state machine where for each pair of state and input symbol there may be several possible next states. This distinguishes it from the deterministic finite automaton (DFA), where the next possible state is uniquely determined. Although the DFA and NFA have distinct definitions, it may be shown in the formal theory that they are equivalent, in that, for any given NFA, one may construct an equivalent DFA, and vice-versa: this is the powerset construction. Both types of automata recognize only regular languages. Non-deterministic finite state machines are sometimes studied by the name subshifts of finite type. Non-deterministic finite state machines are generalized by probabilistic automata, which assign a probability to each state transition.
Nondeterministic finite automata were introduced in 1959 by Michael O. Rabin and Dana Scott,[1] who also showed their equivalence to deterministic finite automata.
An NFA, similar to a DFA, consumes a string of input symbols. For each input symbol it transitions to a new state until all input symbols have been consumed.
Unlike a DFA, it is non-deterministic in that, for any input symbol, its next state may be any one of several possible states. Thus, in the formal definition, the next state is an element of the power set of states. This element, itself a set, represents some subset of all possible states to be considered at once.
An extension of the NFA is the NFA-lambda (also known as NFA-epsilon or the NFA with epsilon moves), which allows a transformation to a new state without consuming any input symbols. For example, if it is in state 1, with the next input symbol an a, it can move to state 2 without consuming any input symbols, and thus there is an ambiguity: is the system in state 1, or state 2, before consuming the letter a? Because of this ambiguity, it is more convenient to talk of the set of possible states the system may be in. Thus, before consuming letter a, the NFA-epsilon may be in any one of the states out of the set {1,2}. Equivalently, one may imagine that the NFA is in state 1 and 2 'at the same time': and this gives an informal hint of the powerset construction: the DFA equivalent to an NFA is defined as the one that is in the state q={1,2}. Transformations to new states without consuming an input symbol are called lambda transitions or epsilon transitions. They are usually labeled with the Greek letter λ or ε.
The notion of accepting an input is similar to that for the DFA. When the last input symbol is consumed, the NFA accepts if and only if there is some set of transitions that will take it to an accepting state. Equivalently, it rejects, if, no matter what transitions are applied, it would not end in an accepting state.

Two similar types of NFAs are commonly defined: the NFA and the NFA with ε-moves. The ordinary NFA is defined as a 5-tuple, (Q, Σ, T, q0, F), consisting of
  • a finite set of states Q
  • a finite set of input symbols Σ
  • a transition function T : Q × Σ → P(Q).
  • an initial (or start) state q0 ∈ Q
  • a set of states F distinguished as accepting (or final) states, F ⊆ Q.
Here, P(Q) denotes the power set of Q. The NFA with ε-moves (also sometimes called NFA-epsilon or NFA-lambda) replaces the transition function with one that allows the empty string ε as a possible input, so that one has instead
T : Q × (Σ ∪{ε}) → P(Q).
It can be shown that ordinary NFA and NFA with epsilon moves are equivalent, in that, given either one, one can construct the other, which recognizes the same language.
The machine starts in the specified initial state and reads in a string of symbols from its alphabet. The automaton uses the state transition function T to determine the next state using the current state, and the symbol just read or the empty string. However, "the next state of an NFA depends not only on the current input event, but also on an arbitrary number of subsequent input events. Until these subsequent events occur it is not possible to determine which state the machine is in" [2]. If, when the automaton has finished reading, it is in an accepting state, the NFA is said to accept the string, otherwise it is said to reject the string.
The set of all strings accepted by an NFA is the language the NFA accepts. This language is a regular language.
For every NFA a deterministic finite state machine (DFA) can be found that accepts the same language. Therefore it is possible to convert an existing NFA into a DFA for the purpose of implementing a (perhaps) simpler machine. This can be performed using the powerset construction, which may lead to an exponential rise in the number of necessary states. A formal proof of the powerset construction can be found in the literature.

There are many ways to implement an NFA:
  • Convert to the equivalent DFA. In some cases this may cause exponential blowup in the size of the automaton and thus auxiliary space proportional to the number of states in the NFA (as storage of the state value requires at most one bit for every state in the NFA)
  • Keep a set data structure of all states which the machine might currently be in. On the consumption of the last input symbol, if one of these states is a final state, the machine accepts the string. In the worst case, this may require auxiliary space proportional to the number of states in the NFA; if the set structure uses one bit per NFA state, then this solution is exactly equivalent to the above.
  • Create multiple copies. For each n way decision, the NFA creates up to n − 1 copies of the machine. Each will enter a separate state. If, upon consuming the last input symbol, at least one copy of the NFA is in the accepting state, the NFA will accept. (This, too, requires linear storage with respect to the number of NFA states, as there can be one machine for every NFA state.)
  • Explicitly propagate tokens through the transition structure of the NFA and match whenever a token reaches the final state. This is sometimes useful when the NFA should encode additional context about the events that triggered the transition. (For an implementation that uses this technique to keep track of object references have a look at Tracematches [3].)
The following example describes an NFA M, with a binary alphabet, which determines whether the input contains an even number of 0s or an even number of 1s. (Note that 0 occurrences is an even number of occurrences as well.) Let M = (Q, Σ, T, s0, F) where
  • Σ = {0, 1},
  • Q = {s0, s1, s2, s3, s4},
  • E({s0}) = { s0, s1, s3 } (the ε-closure of the start state),
  • F = {s1, s3}, and
  • The transition function T can be defined by this state transition table:
          0        1        ε
  S0      {}       {}       {S1, S3}
  S1      {S2}     {S1}     {}
  S2      {S1}     {S2}     {}
  S3      {S3}     {S4}     {}
  S4      {S4}     {S3}     {}

The state diagram for M is given in the figure NFAexample.svg.
M can be viewed as the union of two DFAs: one with states {S1, S2} and the other with states {S3, S4}.
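A sketch of the second implementation strategy above (keep a set of current states), run on M with ε-closure handling; the transition table is transcribed from the table above:

    T = {
        ('s0', '0'): set(),  ('s0', '1'): set(),  ('s0', 'ε'): {'s1', 's3'},
        ('s1', '0'): {'s2'}, ('s1', '1'): {'s1'},
        ('s2', '0'): {'s1'}, ('s2', '1'): {'s2'},
        ('s3', '0'): {'s3'}, ('s3', '1'): {'s4'},
        ('s4', '0'): {'s4'}, ('s4', '1'): {'s3'},
    }
    FINAL = {'s1', 's3'}

    def eps_closure(states):
        closure, stack = set(states), list(states)
        while stack:
            q = stack.pop()
            for r in T.get((q, 'ε'), ()):
                if r not in closure:
                    closure.add(r)
                    stack.append(r)
        return closure

    def accepts(word):
        current = eps_closure({'s0'})
        for ch in word:
            current = eps_closure({r for q in current
                                     for r in T.get((q, ch), ())})
        return bool(current & FINAL)

    print(accepts('1001'))   # True: two 0s (an even number)
    print(accepts('10'))     # False: odd number of 0s and of 1s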

Finite Automata

Finite automata are computing devices that accept or recognize regular languages and are used to model the operations of many systems found in practice. Their operations can be simulated by a very simple computer program. The kinds of systems finite automata can model, and a computer program to simulate their operations, are discussed below.
A finite-state machine (FSM) or finite-state automaton (plural: automata), or simply a state machine, is a mathematical abstraction sometimes used to design digital logic or computer programs. It is a behavior model composed of a finite number of states, transitions between those states, and actions, similar to a flow graph in which one can inspect the way logic runs when certain conditions are met. It has finite internal memory; an input feature that reads symbols in a sequence, one at a time, without going backward; and an output feature, which may be in the form of a user interface, once the model is implemented. The operation of an FSM begins from one of the states (called a start state) and goes through transitions, depending on input, to different states; it can end in any of the available states, but only a certain set of states mark a successful flow of operation (called accept states).
Finite-state machines can solve a large number of problems, among which are electronic design automation, communication protocol design, parsing and other engineering applications. In biology and artificial intelligence research, state machines or hierarchies of state machines are sometimes used to describe neurological systems, and in linguistics to describe the grammars of natural languages.


Fig. 1 Example of a simple finite state machine

A current state is determined by past states of the system. As such, it can be said to record information about the past, i.e., it reflects the input changes from the system start to the present moment. The number and names of the states typically depend on the different possible states of the memory, e.g. if the memory is three bits long, there are 8 possible states. A transition indicates a state change and is described by a condition that would need to be fulfilled to enable the transition. An action is a description of an activity that is to be performed at a given moment. There are several action types:
Entry action
which is performed when entering the state
Exit action
which is performed when exiting the state
Input action
which is performed depending on present state and input conditions
Transition action
which is performed when performing a certain transition
An FSM can be represented using a state diagram (or state transition diagram) as in figure 1 above. Besides this, several state transition table types are used. The most common representation is shown below: the combination of current state (e.g. B) and input (e.g. Y) shows the next state (e.g. C). The complete actions information can be added only using footnotes. An FSM definition including the full actions information is possible using state tables (see also VFSM).

State transition table
Current state →
Input ↓          State A     State B     State C
Input X          ...         ...         ...
Input Y          ...         State C     ...
Input Z          ...         ...         ...

In addition to their use in modeling reactive systems presented here, finite state automata are significant in many different areas, including electrical engineering, linguistics, computer science, philosophy, biology, mathematics, and logic. Finite state machines are a class of automata studied in automata theory and the theory of computation. In computer science, finite state machines are widely used in modeling of application behavior, design of hardware digital systems, software engineering, compilers, network protocols, and the study of computation and languages.

Acceptors and recognizers
Fig. 2 Acceptor FSM: parsing the word "nice"
Acceptors and recognizers (also sequence detectors) produce a binary output, saying either yes or no to answer whether the input is accepted by the machine or not. All states of the FSM are said to be either accepting or not accepting. When all input has been processed, if the current state is an accepting state, the input is accepted; otherwise it is rejected. As a rule, the input is a sequence of symbols (characters); actions are not used. The example in figure 2 shows a finite state machine which accepts the word "nice". In this FSM the only accepting state is number 7.
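A minimal sketch of such an acceptor (the state numbering here is illustrative and does not follow Fig. 2): it advances one state per expected letter and rejects on anything else:

    WORD = 'nice'

    def accepts(text):
        state = 0                     # state i: first i letters matched
        for ch in text:
            if state < len(WORD) and ch == WORD[state]:
                state += 1
            else:
                return False          # implicit dead (non-accepting) state
        return state == len(WORD)     # the sole accepting state

    print(accepts('nice'), accepts('nic'), accepts('niece'))
    # True False False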
The machine can also be described as defining a language, which would contain every word accepted by the machine but none of the rejected ones; we say then that the language is accepted by the machine. By definition, the languages accepted by FSMs are the regular languages—that is, a language is regular if there is some FSM that accepts it.

Start state

The start state is usually shown drawn with an arrow "pointing at it from nowhere" (Sipser (2006) p. 34).

Accept (or final) states

Fig. 3: Representation of a finite-state machine; this example shows one that determines whether a binary number has an odd or even number of 0's, where S1 is an accepting state.
Accept states (also referred to as accepting or final states) are those at which the machine reports that the input string, as processed so far, is a member of the language it accepts. An accept state is usually represented by a double circle.
An example of an accepting state appears in the diagram to the right: a deterministic finite automaton (DFA) that detects whether the binary input string contains an even number of 0's.
S1 (which is also the start state) indicates the state at which an even number of 0's has been input. S1 is therefore an accepting state. This machine will finish in an accept state if the binary string contains an even number of 0's (including any binary string containing no 0's). Examples of strings accepted by this DFA are ε (the empty string), 1, 11, 111, 00, 010, 1010, 10110, and so on.
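A sketch of this machine in code (the name S2 for the non-accepting state is an assumption; the text names only S1):

    def even_zeros(bits):
        state = 'S1'                      # S1: even number of 0's so far
        for b in bits:
            if b == '0':                  # a 0 toggles between S1 and S2
                state = 'S2' if state == 'S1' else 'S1'
        return state == 'S1'              # accept iff we end in S1

    for w in ('', '1', '11', '00', '010', '1010', '10110'):
        print(w or 'ε', even_zeros(w))    # every listed string is accepted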

Transducers

Transducers generate output based on a given input and/or a state using actions. They are used for control applications and in the field of computational linguistics. Here two types are distinguished:
Moore machine
The FSM uses only entry actions, i.e., output depends only on the state. The advantage of the Moore model is a simplification of the behaviour. Consider an elevator door. The state machine recognizes two commands: "command_open" and "command_close" which trigger state changes. The entry action (E:) in state "Opening" starts a motor opening the door, the entry action in state "Closing" starts a motor in the other direction closing the door. States "Opened" and "Closed" stop the motor when fully opened or closed. They signal to the outside world (e.g., to other state machines) the situation: "door is open" or "door is closed".
Fig. 4 Transducer FSM: Mealy model example
Mealy machine
The FSM uses only input actions, i.e., output depends on input and state. The use of a Mealy FSM often leads to a reduction in the number of states. The example in figure 4 shows a Mealy FSM implementing the same behaviour as in the Moore example (the behaviour depends on the implemented FSM execution model and will work, e.g., for a virtual FSM but not for an event-driven FSM). There are two input actions (I:): "start motor to close the door if command_close arrives" and "start motor in the other direction to open the door if command_open arrives". The "opening" and "closing" intermediate states are not shown.
In practice mixed models are often used.
More details about the differences and usage of Moore and Mealy models, including an executable example, can be found in the external technical note "Moore or Mealy model?"
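A rough sketch of the Moore door controller described above; the sensor events fully_open and fully_closed are assumptions of this sketch (the text names only the two commands), and the printed strings stand in for real motor control:

    ENTRY_ACTION = {
        'Opening': 'start motor, opening direction',
        'Closing': 'start motor, closing direction',
        'Opened':  'stop motor; signal "door is open"',
        'Closed':  'stop motor; signal "door is closed"',
    }
    TRANSITION = {
        ('Closed',  'command_open'):  'Opening',
        ('Opening', 'fully_open'):    'Opened',   # assumed sensor event
        ('Opened',  'command_close'): 'Closing',
        ('Closing', 'fully_closed'):  'Closed',   # assumed sensor event
    }

    state = 'Closed'
    for event in ('command_open', 'fully_open',
                  'command_close', 'fully_closed'):
        state = TRANSITION.get((state, event), state)  # ignore other events
        print(event, '->', state, '|', ENTRY_ACTION[state])

Because this is a Moore model, the action table is keyed by state alone; a Mealy version would key its actions on (state, event) pairs instead.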

Determinism

A further distinction is between deterministic (DFA) and non-deterministic (NFA, GNFA) automata. In deterministic automata, every state has exactly one transition for each possible input. In non-deterministic automata, an input can lead to one, more than one or no transition for a given state. This distinction is relevant in practice, but not in theory, as there exists an algorithm (the powerset construction) which can transform any NFA into a more complex DFA with identical functionality.
The FSM with only one state is called a combinatorial FSM and uses only input actions. This concept is useful in cases where a number of FSMs are required to work together, and where it is convenient to consider a purely combinatorial part as a form of FSM to suit the design tools.

The Unified Modeling Language has a very rich semantics and notation for describing state machines. UML state machines overcome the limitations of traditional finite state machines while retaining their main benefits. UML state machines introduce the new concepts of hierarchically nested states and orthogonal regions, while extending the notion of actions. UML state machines have the characteristics of both Mealy machines and Moore machines. They support actions that depend on both the state of the system and the triggering event, as in Mealy machines, as well as entry and exit actions, which are associated with states rather than transitions, as in Moore machines.

Wednesday, March 30, 2011

Types Of Proofs

Proof by Construction:


In mathematics, a proof is a convincing demonstration (within the accepted standards of the field) that some mathematical statement is necessarily true.[1][2] Proofs are obtained from deductive reasoning, rather than from inductive or empirical arguments. That is, a proof must demonstrate that a statement is true in all cases, without a single exception. An unproven proposition that is believed to be true is known as a conjecture.
The statement that is proved is often called a theorem.[1] Once a theorem is proved, it can be used as the basis to prove further statements. A theorem may also be referred to as a lemma, especially if it is intended for use as a stepping stone in the proof of another theorem.
Proofs employ logic but usually include some amount of natural language which usually admits some ambiguity. In fact, the vast majority of proofs in written mathematics can be considered as applications of rigorous informal logic. Purely formal proofs, written in symbolic language instead of natural language, are considered in proof theory. The distinction between formal and informal proofs has led to much examination of current and historical mathematical practice, quasi-empiricism in mathematics, and so-called folk mathematics (in both senses of that term). The philosophy of mathematics is concerned with the role of language and logic in proofs, and mathematics as a language.

Proof by Contradiction:

In logic, proof by contradiction is a form of proof that establishes the truth or validity of a proposition by showing that the proposition being false would imply a contradiction. Since by the law of bivalence a proposition must be either true or false, and its falsity has been shown impossible, the proposition must be true.
In other words, to prove a proposition P by contradiction, show that ¬P ⇒ ⊥, or its equivalent ¬P ⇒ (Q ∧ ¬Q). Then, since ¬P implies a contradiction, conclude P.
Proof by contradiction is also known as indirect proof, apagogical argument, or reductio ad impossibile. It is a particular kind of the more general form of argument known as reductio ad absurdum.
A classic proof by contradiction from mathematics is the proof that the square root of 2 is irrational. If it were rational, it could be expressed as a fraction a/b in lowest terms, where a and b are integers, at least one of which is odd. But if a/b = √2, then a² = 2b². Therefore a² must be even. Because the square of an odd number is odd, that in turn implies that a is even. This means that b must be odd because a/b is in lowest terms. On the other hand, if a is even, then a² is a multiple of 4. If a² is a multiple of 4 and a² = 2b², then 2b² is a multiple of 4, and therefore b² is even, and so is b. So b is odd and even, a contradiction. Therefore the initial assumption (that √2 can be expressed as a fraction) must be false.
The method of proof by contradiction has also been used to show that for any non-degenerate right triangle, the length of the hypotenuse is less than the sum of the lengths of the two remaining sides. The proof relies on the Pythagorean theorem. Letting c be the length of the hypotenuse and a and b the lengths of the legs, the claim is that a + b > c. As usual, we start the proof by negating the claim and assuming that a + b ≤ c. The next step is to show that this leads to a contradiction. Squaring both sides, we have (a + b)² ≤ c² or, equivalently, a² + 2ab + b² ≤ c². A triangle is non-degenerate if each edge has positive length, so we may assume that a and b are greater than 0. Therefore, a² + b² < a² + 2ab + b² ≤ c². Taking out the middle term, we have a² + b² < c². We know from the Pythagorean theorem that a² + b² = c². We now have a contradiction since strict inequality and equality are mutually exclusive. The latter was a result of the Pythagorean theorem and the former the assumption that a + b ≤ c. The contradiction means that it is impossible for both to be true and we know that the Pythagorean theorem holds. It follows that our assumption that a + b ≤ c must be false and hence a + b > c, proving the claim.
Proofs by contradiction sometimes end with the word "Contradiction!". Isaac Barrow and Baermann used the notation Q.E.A., for "quod est absurdum" ("which is absurd"), along the lines of Q.E.D., but this notation is rarely used today.[1]

A graphical symbol sometimes used for contradictions is a downwards zigzag arrow "lightning" symbol (U+21AF: ↯), for example in Davey and Priestley.[2] Others sometimes used include a pair of opposing arrows (→← or ⇒⇐), struck-out arrows (↮), a stylized form of hash (such as U+2A33: ⨳), or the "reference mark" (U+203B: ※).[3][4] The "up tack" symbol (U+22A5: ⊥) used by philosophers and logicians (see contradiction) also appears, but is often avoided due to its usage for orthogonality.

The method of proof by contradiction is to assume that a statement is not true and then to show that that assumption leads to a contradiction. In the case of trying to prove P ⇒ Q, this is equivalent to assuming P ∧ ¬Q, that is, that P is true and Q is false. This is known by its Latin name reductio ad absurdum (reduction to absurdity), since it ends with a statement that cannot be true. For many students, the method of proof by contradiction is both a tremendous gift and a trojan horse, both of which follow from how strong the method is. In fact, the apt reader might have already noticed that both the constructive method and the contrapositive method can be derived from that of contradiction.
Assume                       Prove               Contradiction
P ∧ ¬Q (which implies P)     Q (from P only)     Q ∧ ¬Q
P ∧ ¬Q (which implies ¬Q)    ¬P (from ¬Q only)   P ∧ ¬P

However, its reach goes farther even than that, since the contradiction can be anything. Even if we ignore the criticisms from constructivism, this broad scope hides what you lose; namely, you lose a well-defined direction and conclusion, both of which have to be replaced with intuition.


Proof by Induction:

Mathematical induction is a method of mathematical proof typically used to establish that a given statement is true of all natural numbers (non-negative integers). It is done by proving that the first statement in the infinite sequence of statements is true, and then proving that if any one statement in the infinite sequence of statements is true, then so is the next one.
The method can be extended to prove statements about more general well-founded structures, such as trees; this generalization, known as structural induction, is used in mathematical logic and computer science. Mathematical induction in this extended sense is closely related to recursion.
Mathematical induction should not be misconstrued as a form of inductive reasoning, which is considered non-rigorous in mathematics (see Problem of induction for more information). In fact, mathematical induction is a form of rigorous deductive reasoning.[1]


An informal description of mathematical induction can be illustrated by reference to the sequential effect of falling dominoes.

In 370 BC, Plato's Parmenides may have contained an early example of an implicit inductive proof.[2] The earliest implicit traces of mathematical induction can be found in Euclid's [3] proof that the number of primes is infinite and in Bhaskara's "cyclic method".[4] An opposite iterated technique, counting down rather than up, is found in the Sorites paradox, where one argued that if 1,000,000 grains of sand formed a heap, and removing one grain from a heap left it a heap, then a single grain of sand (or even no grains) forms a heap.
An implicit proof by mathematical induction for arithmetic sequences was introduced in the al-Fakhri written by al-Karaji around 1000 AD, who used it to prove the binomial theorem and properties of Pascal's triangle.
None of these ancient mathematicians, however, explicitly stated the inductive hypothesis. Another similar case (contrary to what Vacca has written, as Freudenthal carefully showed) was that of Francesco Maurolico in his Arithmeticorum libri duo (1575), who used the technique to prove that the sum of the first n odd integers is n2. The first explicit formulation of the principle of induction was given by Pascal in his Traité du triangle arithmétique (1665). Another Frenchman, Fermat, made ample use of a related principle, indirect proof by infinite descent. The inductive hypothesis was also employed by the Swiss Jakob Bernoulli, and from then on it became more or less well known. The modern rigorous and systematic treatment of the principle came only in the 19th century, with George Boole,[5] Charles Sanders Peirce,[6] Giuseppe Peano, and Richard Dedekind.[4]

The simplest and most common form of mathematical induction proves that a statement involving a natural number n holds for all values of n. The proof consists of two steps:
  1. The basis (base case): showing that the statement holds for the lowest value of n considered, usually n = 0 or n = 1.
  2. The inductive step: showing that if the statement holds for some n, then the statement also holds when n + 1 is substituted for n.
The assumption in the inductive step that the statement holds for some n is called the induction hypothesis (or inductive hypothesis). To perform the inductive step, one assumes the induction hypothesis and then uses this assumption to prove the statement for n + 1.
The choice between n = 0 and n = 1 in the base case is specific to the context of the proof: If 0 is considered a natural number, as is common in the fields of combinatorics and mathematical logic, then n = 0. If, on the other hand, 1 is taken as the first natural number, then the base case is given by n = 1.
This method works by first proving the statement is true for a starting value, and then proving that the process used to go from one value to the next is valid. If these are both proven, then any value can be obtained by performing the process repeatedly. It may be helpful to think of the domino effect; if one is presented with a long row of dominoes standing on end, one can be sure that:
  1. The first domino will fall
  2. Whenever a domino falls, its next neighbor will also fall,
so it is concluded that all of the dominoes will fall, and that this fact is inevitable.
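As a worked instance (the claim Maurolico proved, mentioned above), here is the induction for "the sum of the first n odd integers is n²", sketched in LaTeX:

    \textbf{Claim.} $1 + 3 + \cdots + (2n-1) = n^2$ for every $n \ge 1$.

    \textbf{Base case} ($n = 1$): the sum is just $1$, and $1 = 1^2$.

    \textbf{Inductive step:} assume $1 + 3 + \cdots + (2n-1) = n^2$ for some $n$.
    Adding the next odd integer, $2n+1$, to both sides gives
    \[
      1 + 3 + \cdots + (2n-1) + (2n+1) = n^2 + 2n + 1 = (n+1)^2 ,
    \]
    which is the statement for $n+1$. By induction, the claim holds for all $n \ge 1$.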