Dot — The Regex Barbapapa

Image may be NSFW.
Clik here to view.

Remember Barbapapa — Annette Tison’s and Talus Taylor’s children’s books and films from the 1970s? The hero was a pink, pear-shaped guy with the ability to take on almost any shape whatsoever. The equivalent in Regex is dot ..

Dot is a character class — a generic character. Instead of using literal characters, like 2, a or #, you can use dot to specify that you accept almost any character.

Ruby> 'mama 2 ##'.gsub /a|2|#/, '¤' #=> "m¤m¤ ¤ ¤¤" Ruby> 'mama 2 ##'.gsub /./, '¤' #=> "¤¤¤¤¤¤¤¤¤"

There are two cultural problems with the dot, that is important to be aware of:

The character class dot . and the closure function * are together and separately the most abused features of Regex. If you use them perfunctory, you’ll often end up with too general regexes — sometimes even incorrect. Every time you intend to write ., * or even .* You should consider if you really mean something more specific.
A majority of Regex books, including the most popular one, are unclear or even entirely incorrect, as they claim that “dot matches any character.” In most cases dot matches “any character except line breaks.” It’s a very, very important difference.

Why doesn’t dot normally match line breaks?
The original implementations of Regex operated line by line. Programs like grep, handles one line at a time. Trailing line breaks are filtered out before processing. Hence, there are no line breaks. NASA engineer Larry Wall created Perl In the 1980s — the programming language that evangelized Regex more than anyhting else. The original purpose was to make report processing easier. What would then be more natural than to continue on the path of line-oriented work? Another argument is that the idiom .* would change meaning if dot matches line breaks. Perl set the agenda and now, a few decades later, we can only accept that dot typically don’t match line breaks, no matter what you and I believe is logical.

Ruby> "grey gr y gr\ny gray gr\ry".scan /gr.y/ #=> ["grey", "gr y", "gray"]

How can you force dot to match line breaks?
You set a flag. Unfortunately, this flag has different names in different Regex dialects. In Perl, it’s called single-line mode. Imagine what happens if dot matches all characters, including line breaks. Input data becomes a long line where the line break is a character like any other — hence the name. Single-line mode should not be confused with what in Perl is called multi-line mode. Multi-line mode affects the anchors $ and ^ and it’s orthogonal with single-line mode. To add more confusion, Ruby use the term multi-line line, when they mean Perl’s single-line mode. And the real multi-line mode is mandatory in Ruby — no flag available there. The best approach to this mess is if you and I call the flag Dot match all, no matter how it is written syntactically in different dialects. By the way, in Ruby you add m next to the Regex literal when we want the dot to match any character.

Ruby> "grey gr y gr\ny gray gr\ry".scan /gr.y/ #=> ["grey", "gr y", "gray"] Ruby> "grey gr y gr\ny gray gr\ry".scan /gr.y/m #=> ["grey", "gr y", "gr\ny", "gray", "gr\ry"]

And if there is no flag?
In some Regex dialects, most notably JavaScript, there’s no flag for dot match all. An workaround is to replace the dot with the idiom [\ s\ S]. This idiom matches exactly one character — either white space or anything that is not whitespace. These two classes are of course 100% of all the characters — including line breaks.

JavaScript> 'grey gr y gr\ny gray gr\ry'.match(/gr.y/g); [ 'grey', 'gr y', 'gray' ] JavaScript> 'grey gr y gr\ny gray gr\ry'.match(/gr[\s\S]y/g); [ 'grey', 'gr y', 'gr\ny', 'gray', 'gr\ry' ]

Is dot to general?
I also argued above that the dot is often abused in our community. What does that mean? Imagine that you want to find all time strings in a text. You’ve got the following specification:

Time always includes hours and minutes, sometimes even seconds.
Hours, minutes and seconds are always written with two digits
You don’t have to ignore impossible numbers, such as minute 61.
Inbetween hours, minutes and seconds you’ll find one of the separators dot . or colon :.

The results of a simple regex like \d\d.\d\d(.\d\d)? might surprise you:

Ruby> "12:34 09.00 24.56.33".scan /(\d\d.\d\d(.\d\d)?)/ #=> "12:34 09", "00 24.56"

That’s not what you wished for. Dot matches space! If you replace the item with the more specific character class [.:] you aim closer to target. You mustn’t forget that dot inside a character class means that you literally want to match the dot character ..

Ruby> "12:34 09.00 24.56.33".scan /(\d\d[.:]\d\d([.:]\d\d)?)/ #=> "12:34", "09.00", "24.56.33"

Image may be NSFW.
Clik here to view.

Dot — The Regex Barbapapa

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112