Using Rule number 2 and Rule number 4, we can create regular expressions that consist of any sequence of symbols from our alphabet. Rule number 2 says that if the symbol a is in the alphabet, then a is a regular expression. Rule number 4 says that if p and q are two regular expressions, then the concatenation pq is a regular expression as well. The concatenation symbol itself is invisible; we simply write the two regular expressions right after each other:
'moda'[/m/] #=> "m" – we found the substring m in the string "moda"
'moda'[/o/] #=> "o"
'moda'[/mo/] #=> "mo" - /mo/ is /m/ concatenated with /o/
'moda'[/da/] #=> "da"
'moda'[/moda/] #=> "moda" - /moda/ is /mo/ concatenated with /da/
'moda'[/mado/] #=> nil – no match, since the order was changed
There are some handy terms we usually use for parts of strings:
- Prefix: A prefix is the substring we have left if we remove zero or more symbols from the end of a string. The strings m, mo, mod, and moda are all prefixes of the string moda. Even the empty string ε is a prefix of moda.
- Suffix: A suffix is the substring that is left if we remove zero or more symbols from the beginning of a string. The strings moda, oda, da, a, and ε are all suffixes of the string moda.
- Substring: A substring is what we have left if we remove a prefix and a suffix from a string. Note that the prefix and/or the suffix can be ε. Substrings must still be consecutive in the original string. The strings od and moda, but not mda, are substrings of moda.
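The three definitions above can be sketched in a few lines of Ruby. The method names prefixes, suffixes, and substrings are hypothetical helpers written for this illustration, not part of Ruby's standard library, and the empty string "" plays the role of ε.

```ruby
# Hypothetical helpers illustrating the definitions above.

# All prefixes: keep the first n symbols, for n = 0 up to the full length.
def prefixes(s)
  (0..s.length).map { |n| s[0, n] }
end

# All suffixes: drop the first n symbols, for n = 0 up to the full length.
def suffixes(s)
  (0..s.length).map { |n| s[n..-1] }
end

# A substring is what is left after removing a suffix and then a prefix,
# i.e. it is a suffix of some prefix.
def substrings(s)
  prefixes(s).flat_map { |pre| suffixes(pre) }.uniq
end

prefixes('moda')                  #=> ["", "m", "mo", "mod", "moda"]
suffixes('moda')                  #=> ["moda", "oda", "da", "a", ""]
substrings('moda').include?('od')  #=> true
substrings('moda').include?('mda') #=> false
```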
For any regular expression p, it’s true that εp = pε = p, so we say that the empty string ε is the identity under concatenation. There is no annihilator under concatenation, i.e., there’s no regular expression 0 such that 0p = p0 = 0 holds for every regular expression p. Concatenation is not commutative, since pq is in general not equal to qp, but it is associative, since for any regular expressions p, q, and r it’s true that p(qr) = (pq)r.
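These algebraic properties can be checked on a few examples in Ruby. This is only a sketch: concat is a hypothetical helper, not a built-in, and it glues the two patterns together via Regexp#source, so any flags on the original expressions are ignored. The empty regular expression stands in for ε.

```ruby
# Hypothetical helper: concatenate two regular expressions. Each source is
# wrapped in a non-capturing group so the two parts stay separate.
def concat(left, right)
  Regexp.new("(?:#{left.source})(?:#{right.source})")
end

epsilon = Regexp.new('')  # matches only the empty string at any position

# Identity: concatenating with epsilon changes nothing.
'moda'[concat(epsilon, /mo/)]  #=> "mo"
'moda'[concat(/mo/, epsilon)]  #=> "mo"

# Not commutative: swapping the operands changes what is matched.
'moda'[concat(/mo/, /da/)]     #=> "moda"
'moda'[concat(/da/, /mo/)]     #=> nil

# Associative: the grouping of the operands does not matter.
'moda'[concat(/m/, concat(/o/, /d/))]  #=> "mod"
'moda'[concat(concat(/m/, /o/), /d/)]  #=> "mod"
```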
If we think of concatenation as a product, then regular expressions also support exponentiation. We write the exponent enclosed in braces to the right of the regular expression:
'aaa'[/aaa/] #=> "aaa"
'aaa'[/a{3}/] #=> "aaa" – yes, the string includes 3 concatenated a
'aaa'[/a{4}/] #=> nil – no, the string doesn't include 4 a
This is just syntactic sugar: any regular expression that we can write using the exponent operator can also be unfolded into explicit concatenations. There are more shortcuts for finite repeated concatenation:
'aa'[/a?/] #=> "a" – the optional operator written as question mark
'b'[/a?/] #=> "" – zero repeats of a matches the empty string
'aa'[/a{,2}/] #=> "aa" – at most two a
'aa'[/a{1,2}/] #=> "aa" – at least one a and at most two a
'a'[/a{1,2}/] #=> "a"
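To make the idea of unfolding concrete, here is a minimal sketch. The helpers unfold and same_matches? are hypothetical names introduced for this illustration, and the comparison only checks a handful of sample strings, so it is evidence rather than proof that two expressions are equivalent.

```ruby
# Hypothetical helper: unfold re{n} into n explicit copies of re.
# Regexp#source drops any flags, which is fine for these simple patterns.
def unfold(re, n)
  Regexp.new("(?:#{re.source})" * n)
end

# Hypothetical helper: do two regexps give the same first match
# on every sample string?
def same_matches?(r1, r2, samples)
  samples.all? { |s| s[r1] == s[r2] }
end

samples = ['', 'a', 'aa', 'aaa', 'b', 'baaab']

same_matches?(/a{3}/, unfold(/a/, 3), samples)  #=> true
same_matches?(/a?/, /a{0,1}/, samples)          #=> true
same_matches?(/a{,2}/, /a{0,2}/, samples)       #=> true

# The upper bound is a limit, not a target: the match stops at two a.
'aaa'[/a{1,2}/]                                 #=> "aa"
```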
We will soon see that the concatenation of two regular expressions is not the same as the concatenation of two strings. Remember that a regular expression corresponds to a set of strings. For example, if p = {a, b} and q = {c, d}, then pq = {ac, ad, bc, bd}.
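The set view of concatenation can be sketched directly with Ruby arrays standing in for finite sets of strings. The name concat_sets is a hypothetical helper for this illustration; it pairs every string from the first set with every string from the second and joins each pair.

```ruby
# Hypothetical helper: concatenate two finite sets of strings by joining
# every string of the first set with every string of the second.
def concat_sets(first, second)
  first.product(second).map(&:join)
end

concat_sets(%w[a b], %w[c d])  #=> ["ac", "ad", "bc", "bd"]

# The set containing only the empty string acts as the identity.
concat_sets([''], %w[a b])     #=> ["a", "b"]
```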
