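Ruby Regular Expressions: Streamlining String Processing with Regular Expressions
Note: everything below is based on ruby 2.0.0p247 (2013-06-27) [x86_64-linux].
A handful of small examples
Basic usage
    /hay/ =~ 'haystack'   #=> 0
    %r[y].match('haystack') #=> #<MatchData "y">
Key concepts
Creating an object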
    # Match check: returns the index of the first match, or nil
    regexp =~ string
    # MatchData: Regexp#match returns a MatchData object, or nil
    match_data = regexp.match(string)
    # Regexp class
    r1 = Regexp.new('^a-z+:\s+\w+') #=> /^a-z+:\s+\w+/
    r2 = Regexp.new('cat', true)     #=> /cat/i
    r3 = Regexp.new(r2)              #=> /cat/i
    r4 = Regexp.new('dog', Regexp::EXTENDED | Regexp::IGNORECASE) #=> /dog/ix
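As a minimal sketch of how these pieces fit together: `=~` yields a position or `nil`, while `match` yields a `MatchData` object that can be queried further. The helper name `describe_match` is made up for illustration and is not part of Ruby.

```ruby
# Hypothetical helper: summarizes what Regexp#match returns for a pattern and a string.
def describe_match(regexp, string)
  m = regexp.match(string)          # MatchData on success, nil on failure
  return nil if m.nil?
  { matched: m[0], before: m.pre_match, after: m.post_match, offset: m.begin(0) }
end

p(/hay/ =~ 'haystack')              # index of the first match
p(/needle/ =~ 'haystack')           # nil when nothing matches
p describe_match(/st/, 'haystack')

r = Regexp.new('cat', Regexp::IGNORECASE)
p r.source                          # the pattern text without delimiters or flags
p r.casefold?                       # true when the ignore-case flag is set
```

Use `=~` when only a yes/no answer or a position is needed, and `match` when the captured pieces matter.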
Character Classes
General notation
| Symbol | Description | Notes |
|---|---|---|
| /./ | Any character except a newline | Does not match \n |
| /./m | Any character (the m modifier enables multiline mode) | With multiline mode enabled, a newline is matched as well |
| /\w/ | A word character ([a-zA-Z0-9_]) | |
| /\W/ | A non-word character ([^a-zA-Z0-9_]). Please take a look at Bug #4044 if using /\W/ with the /i modifier. | |
| /\d/ | A digit character ([0-9]) | |
| /\D/ | A non-digit character ([^0-9]) | |
| /\h/ | A hexdigit character ([0-9a-fA-F]) | |
| /\H/ | A non-hexdigit character ([^0-9a-fA-F]) | |
| /\s/ | A whitespace character: /[ \t\r\n\f]/ | |
| /\S/ | A non-whitespace character: /[^ \t\r\n\f]/ | |
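The shorthand classes in the table are most often combined with `scan`, `split`, or an anchored check. A small sketch follows; the helper names and the sample line are invented for illustration.

```ruby
# Hypothetical helpers built on the shorthand classes above.
def digits(text)
  text.scan(/\d+/)          # every run of ASCII digits
end

def words(text)
  text.scan(/\w+/)          # every run of [a-zA-Z0-9_]
end

def hex_string?(text)
  !!(text =~ /\A\h+\z/)     # true only if every character is a hex digit
end

line = "id=7f3a  size=42\tname=foo_bar"   # made-up sample input
p digits(line)
p words(line)
p line.split(/\s+/)         # split on runs of whitespace
p hex_string?("7f3a")
p hex_string?("7g3a")
```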
POSIX
| Symbol | Description | Notes |
|---|---|---|
| /[[:alnum:]]/ | Alphabetic and numeric character | |
| /[[:alpha:]]/ | Alphabetic character | |
| /[[:blank:]]/ | Space or tab | |
| /[[:cntrl:]]/ | Control character | |
| /[[:digit:]]/ | Digit | |
| /[[:graph:]]/ | Non-blank character (excludes spaces, control characters, and similar) | |
| /[[:lower:]]/ | Lowercase alphabetical character | |
| /[[:print:]]/ | Like [:graph:], but includes the space character | |
| /[[:punct:]]/ | Punctuation character | |
| /[[:space:]]/ | Whitespace character ([:blank:], newline, carriage return, etc.) | |
| /[[:upper:]]/ | Uppercase alphabetical character | |
| /[[:xdigit:]]/ | Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F) | |
non-POSIX
| Symbol | Description | Notes |
|---|---|---|
| /[[:word:]]/ | A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation | |
| /[[:ascii:]]/ | A character in the ASCII character set | |
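One practical difference worth illustrating: in Ruby 1.9 and later the bracket classes are Unicode-aware, whereas the shorthand classes such as `\d` and `\w` only cover ASCII. The sketch below uses `\u` escapes so that it does not depend on the source file encoding; the variable names are illustrative only.

```ruby
arabic_two = "\u06F2"   # EXTENDED ARABIC-INDIC DIGIT TWO
e_acute    = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE

p(/\d/ =~ arabic_two)            # shorthand \d: ASCII digits only
p(/[[:digit:]]/ =~ arabic_two)   # bracket class: any Unicode decimal digit
p(/\w/ =~ e_acute)               # shorthand \w: ASCII word characters only
p(/[[:alpha:]]/ =~ e_acute)      # bracket class: any Unicode letter
```

When input may contain non-ASCII text, the bracket classes (or the `\p{...}` properties) are usually the safer choice.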
Character Properties
| Symbol | Description | Notes |
|---|---|---|
| /\p{Alnum}/ | Alphabetic and numeric character | |
| /\p{Alpha}/ | Alphabetic character | |
| /\p{Blank}/ | Space or tab | |
| /\p{Cntrl}/ | Control character | |
| /\p{Digit}/ | Digit | |
| /\p{Graph}/ | Non-blank character (excludes spaces, control characters, and similar) | |
| /\p{Lower}/ | Lowercase alphabetical character | |
| /\p{Print}/ | Like \p{Graph}, but includes the space character | |
| /\p{Punct}/ | Punctuation character | |
| /\p{Space}/ | Whitespace character ([:blank:], newline, carriage return, etc.) | |
| /\p{Upper}/ | Uppercase alphabetical character | |
| /\p{XDigit}/ | Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F) | |
| /\p{Word}/ | A member of one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation | |
| /\p{ASCII}/ | A character in the ASCII character set | |
| /\p{Any}/ | Any Unicode character (including unassigned characters) | |
| /\p{Assigned}/ | An assigned character | |
| /\p{L}/ | 'Letter' | |
| /\p{Ll}/ | 'Letter: Lowercase' | |
| /\p{Lm}/ | 'Letter: Modifier' | |
| /\p{Lo}/ | 'Letter: Other' | |
| /\p{Lt}/ | 'Letter: Titlecase' | |
| /\p{Lu}/ | 'Letter: Uppercase' | |
| /\p{M}/ | 'Mark' | |
| /\p{Mn}/ | 'Mark: Nonspacing' | |
| /\p{Mc}/ | 'Mark: Spacing Combining' | |
| /\p{Me}/ | 'Mark: Enclosing' | |
| /\p{N}/ | 'Number' | |
| /\p{Nd}/ | 'Number: Decimal Digit' | |
| /\p{Nl}/ | 'Number: Letter' | |
| /\p{No}/ | 'Number: Other' | |
| /\p{P}/ | 'Punctuation' | |
| /\p{Pc}/ | 'Punctuation: Connector' | |
| /\p{Pd}/ | 'Punctuation: Dash' | |
| /\p{Ps}/ | 'Punctuation: Open' | |
| /\p{Pe}/ | 'Punctuation: Close' | |
| /\p{Pi}/ | 'Punctuation: Initial Quote' | |
| /\p{Pf}/ | 'Punctuation: Final Quote' | |
| /\p{Po}/ | 'Punctuation: Other' | |
| /\p{S}/ | 'Symbol' | |
| /\p{Sm}/ | 'Symbol: Math' | |
| /\p{Sc}/ | 'Symbol: Currency' | |
| /\p{Sk}/ | 'Symbol: Modifier' | |
| /\p{So}/ | 'Symbol: Other' | |
| /\p{Z}/ | 'Separator' | |
| /\p{Zs}/ | 'Separator: Space' | |
| /\p{Zl}/ | 'Separator: Line' | |
| /\p{Zp}/ | 'Separator: Paragraph' | |
| /\p{C}/ | 'Other' | |
| /\p{Cc}/ | 'Other: Control' | |
| /\p{Cf}/ | 'Other: Format' | |
| /\p{Cn}/ | 'Other: Not Assigned' | |
| /\p{Co}/ | 'Other: Private Use' | |
| /\p{Cs}/ | 'Other: Surrogate' | |
   Lastly, `\p{}` matches a character’s Unicode script. The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.
Examples of matching a script and of negating a property with \p{^...}:
    /\p{Arabic}/.match("\u06E9") #=> #<MatchData "\u06E9">
    /\p{^Ll}/.match("A") #=> #<MatchData "A">
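Properties are handy for pulling one kind of character out of mixed-script text. The following is a small sketch, with a made-up helper name and sample string, that relies only on the properties and scripts listed above:

```ruby
# Hypothetical helper: picks out the Han characters in a mixed-script string.
def han_chars(text)
  text.scan(/\p{Han}/)
end

mixed = "Ruby \u6B63\u5219 2.0"     # two Han characters between ASCII text
p han_chars(mixed)
p mixed.scan(/\p{Lu}/)              # uppercase letters
p mixed.scan(/\p{Nd}+/)             # runs of decimal digits
p mixed.scan(/\p{^Alpha}+/)         # \p{^...} negates the property
```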
Repetition
Constructs / quantifiers
| Symbol / quantifier | Description | Notes |
|---|---|---|
| * | Zero or more times | |
| + | One or more times | |
| ? | Zero or one times (optional) | |
| {n} | Exactly n times | |
| {n,} | n or more times | |
| {,m} | m or fewer times | |
| {n,m} | At least n and at most m times | 
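A short sketch of the quantifiers in action. The `iso_date_shape?` helper is hypothetical and only checks the shape of the string, not whether the date is valid.

```ruby
# Hypothetical validator: a date in strict YYYY-MM-DD shape (digits only, no range checks).
def iso_date_shape?(text)
  !!(text =~ /\A\d{4}-\d{2}-\d{2}\z/)
end

p iso_date_shape?("2014-01-15")
p iso_date_shape?("2014-1-15")

p "aaaa"[/a{2,3}/]     # between two and three, as many as possible
p "aaaa"[/a{,2}/]      # {,m} means zero up to m
p "baaa"[/a*/]         # * may match the empty string at position 0
p "baaa"[/a+/]         # + needs at least one character
```

Note how `a*` succeeds immediately with an empty match at the start of `"baaa"`, while `a+` has to move forward to the first `a`.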
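Modes
- Greedy (the default): matches as much as possible while still letting the overall match succeed.
- Non-greedy / lazy: matches as little as possible while still letting the overall match succeed.
 - Enabled by appending `?` to the quantifier.
 - Applies to quantifiers such as `*`, `+`, and `{n,}`.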
    /<(.+)>/.match("<a><b>") # => #<MatchData "<a><b>" 1:"a><b">
    /<(.+?)>/.match("<a><b>") # => #<MatchData "<a>" 1:"a">
    /<(.+?)>/.match("<abc><b>") # => #<MatchData "<abc>" 1:"abc">
    /<(.{1,}?)>/.match("<abc><b>") # => #<MatchData "<abc>" 1:"abc">
    /<(.{1,})>/.match("<abc><b>") # => #<MatchData "<abc><b>" 1:"abc><b">
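Grouping
Grouping gives structure to the constructs above: subpatterns are grouped, captured, and then referred back to.
Capturing
Two operations are involved: capturing and backreferencing.
    /(?<vowel>[aeiou]).\k<vowel>.\k<vowel>/.match('ototomy')
        #=> #<MatchData "ototo" vowel:"o">
Capture groups
A subpattern enclosed in () is a capture group. Groups are numbered 1, 2, 3, ... from left to right, in the order of their opening parentheses.
A named group is written in either of the following forms:
    (?<name>pat)
    # or
    (?'name'pat)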
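Once a pattern with named groups has matched, the captures can be read from the `MatchData` by symbol, by string, or by position. A minimal sketch, where `parse_date` is a made-up helper:

```ruby
# Hypothetical helper: splits a YYYY-MM-DD date found in the text into named parts.
def parse_date(text)
  m = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/.match(text)
  return nil unless m
  # Named groups can be read by symbol, by string, or by their position.
  { year: m[:year], month: m["month"], day: m[3], names: m.names }
end

p parse_date("created 2014-01-15")
p parse_date("no date here")
```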
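Atomic grouping
A group defined with (?>pat) is an atomic group.
Once an atomic group has matched, the regexp engine discards the backtracking positions inside it, so it never backtracks into the group.
    /"(.*)"/.match('"Quote"')     #=> #<MatchData "\"Quote\"" 1:"Quote">
    /"(?>.*)"/.match('"Quote"') #=> nil
    # Why the second match fails: the greedy .* also consumes the closing ", and the atomic
    # group cannot give it back, so the final " in the pattern has nothing left to match.
    # The first match succeeds only because backtracking is allowed.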
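The same effect can be seen with an even smaller pattern. In this sketch the variable names are illustrative; the two patterns differ only in whether `a+` is wrapped in an atomic group.

```ruby
backtracking = /a+ab/       # a+ may give back one "a" so that "ab" can match
atomic       = /(?>a+)ab/   # a+ keeps every "a" it consumed

p backtracking.match("aaab")
p atomic.match("aaab")      # fails: no "a" is left for the literal "ab"
p(/(?>a+)b/.match("aaab"))  # succeeds: nothing after the group needs an "a"
```

Atomic groups are mainly useful when a pattern is known not to need backtracking and failing fast is preferable.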
Non-capturing groups
    (?:pat)
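A brief sketch of the difference: both patterns below match the same text, but only the plain parentheses add an entry to the list of captures.

```ruby
p(/(ab)+(c)/.match("ababc").captures)     # the repeated group is captured too
p(/(?:ab)+(c)/.match("ababc").captures)   # (?:...) groups without capturing
```

Using `(?:...)` keeps the group numbering stable when parentheses are needed only for repetition or alternation.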
Backreferences
Use \1, \2, \k<name>, and so on to refer back to the text matched by an earlier group.
    \1
    # with name
    \k<name>
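Two common uses of backreferences, sketched with made-up helper names: inside the pattern (finding a word that is immediately repeated) and inside a replacement string (reordering captured parts).

```ruby
# Hypothetical helper: returns the first word that appears twice in a row, or nil.
def doubled_word(text)
  m = /\b(\w+) \1\b/.match(text)   # \1 must equal whatever group 1 matched
  m && m[1]
end

# Hypothetical helper: turns "First Last" into "Last, First".
def swap_names(full_name)
  # In a single-quoted replacement string, \1 and \2 refer to the capture groups.
  full_name.sub(/(\w+) (\w+)/, '\2, \1')
end

p doubled_word("this is is a test")
p doubled_word("no repeats here")
p swap_names("John Smith")
```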
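Named captures as local variables
When a regexp literal with named groups is on the left-hand side of =~, each named capture is assigned to a local variable of the same name.
    dollars = 'abc'
    /\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
    dollars #=> "3"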
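The assignment only happens when the regexp literal is the receiver, that is, when it stands on the left. The sketch below contrasts the two orders; both helper names are hypothetical.

```ruby
# Regexp literal on the left: dollars and cents become local variables.
def split_price(text)
  if /\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ text
    [dollars, cents]
  end
end

# String on the left: String#=~ is called and no local variables are assigned,
# but the captures are still available through $~.
def split_price_reversed(text)
  if text =~ /\$(?<dollars>\d+)\.(?<cents>\d+)/
    [$~[:dollars], $~[:cents]]
  end
end

p split_price('$3.67')
p split_price_reversed('$3.67')
p split_price('free')
```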
Note that an existing local variable with the same name is overwritten (dollars above changes from 'abc' to "3").
More examples of capture groups and backreferences:
    /[aeiou]\w{2}/.match("Caenorhabditis elegans") #=> #<MatchData "aen">
    /([aeiou]\w){2}/.match("Caenorhabditis elegans")
        #=> #<MatchData "enor" 1:"or">
    /I(n)ves(ti)ga\2ons/.match("Investigations")
        #=> #<MatchData "Investigations" 1:"n" 2:"ti">
    /I(?:n)ves(ti)ga\1ons/.match("Investigations")
        #=> #<MatchData "Investigations" 1:"ti">
Subexpression Calls
\g<name> invokes the subpattern of the named group again, so a pattern can be reused, even recursively.
    /\A(?<paren>\(\g<paren>*\))*\z/ .match '' # => #<MatchData "" paren:nil>
    /\A(?<paren>\(\g<paren>*\))*\z/ .match '()' # => #<MatchData "()" paren:"()">
    /\A(?<paren>\(\g<paren>*\))*\z/ .match '(())' # => #<MatchData "(())" paren:"(())">
    # ^1 start of string
    #      ^2 group paren, which matches one balanced (...)
    #           ^3 literal (
    #                 ^4 calls group paren again (recursion)
    #                      ^7 zero or more nested groups
    #                       ^^8 literal )
    #                           ^10 end of string
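Wrapped in a method, the same pattern becomes a small balanced-parentheses checker. This is only a sketch; the method name `balanced_parens?` is invented here, and the pattern accepts nothing but parentheses.

```ruby
# Hypothetical helper: true if the string consists solely of balanced parentheses.
def balanced_parens?(text)
  # \g<paren> re-enters the group, so nesting of any depth is handled.
  !/\A(?<paren>\(\g<paren>*\))*\z/.match(text).nil?
end

p balanced_parens?("(()())()")
p balanced_parens?("(()")
p balanced_parens?(")(")
p balanced_parens?("")
```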
Alternation
Use | to separate alternative subpatterns; the expression matches if any one of the alternatives matches.
    /\w(and|or)\w/.match("Feliformia") #=> #<MatchData "form" 1:"or">
    /\w(and|or)\w/.match("furandi")    #=> #<MatchData "randi" 1:"and">
    /\w(and|or)\w/.match("dissemblance") #=> nil
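Two details of `|` are easy to miss: it binds more loosely than anything else in the pattern, and alternatives are tried from left to right. A sketch with illustrative variable names:

```ruby
loose  = /^cat|dog$/        # read as (^cat) or (dog$)
strict = /^(?:cat|dog)$/    # the anchors now apply to both alternatives

p(loose  =~ "catalog")      # matches, because the line starts with "cat"
p(strict =~ "catalog")      # no match
p(strict =~ "dog")

p "foobar"[/foo|foobar/]    # the first alternative that succeeds wins
```

Grouping the alternatives, as in `strict`, is the usual way to limit how far `|` reaches.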
Anchors
Anchors fix the position at which the rest of the pattern must match; they are zero-width and consume no characters.
| Symbol | Description | Notes |
|---|---|---|
| ^ | Matches beginning of line | |
| $ | Matches end of line | |
| \A | Matches beginning of string | |
| \Z | Matches end of string. If the string ends with a newline, it matches just before the newline | |
| \z | Matches end of string | |
| \G | Matches the point where the last match finished | |
| \b | Matches word boundaries when outside brackets; backspace (0x08) when inside brackets | Word boundary |
| \B | Matches non-word boundaries | |
| (?=pat) | Positive lookahead assertion: ensures that the following characters match pat, but doesn't include those characters in the matched text | Zero-width positive lookahead |
| (?!pat) | Negative lookahead assertion: ensures that the following characters do not match pat, but doesn't include those characters in the matched text | Zero-width negative lookahead |
| (?<=pat) | Positive lookbehind assertion: ensures that the preceding characters match pat, but doesn't include those characters in the matched text | Zero-width positive lookbehind |
| (?<!pat) | Negative lookbehind assertion: ensures that the preceding characters do not match pat, but doesn't include those characters in the matched text | Zero-width negative lookbehind |
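The difference between the line anchors and the string anchors matters most when validating input that may contain newlines. A short sketch (the sample strings are made up):

```ruby
multi = "ok\nextra"

p(/^ok$/ =~ multi)        # ^ and $ work per line, so the first line matches
p(/\Aok\z/ =~ multi)      # \A and \z consider the whole string
p(/\Aok\Z/ =~ "ok\n")     # \Z tolerates one trailing newline
p(/\Aok\z/ =~ "ok\n")     # \z does not

p "a cat concat".scan(/\bcat\b/)   # \b restricts the match to the whole word
```

For validation, `\A` and `\z` are therefore usually what is wanted, not `^` and `$`.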
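The last four entries of the anchors table are assertions; see the regular expression assertion documentation for the terminology.
The following example uses a zero-width positive lookahead to show the effect:
    /(\w+)(?=abc)/.match 'defabcdef' # => #<MatchData "def" 1:"def">
    #      ^ used only to fix the position
    #                        ^ abc is found here
    #                     ^^^ what (\w+) matches: the text before that position
    /(?=abc)(\w+)/.match 'defabcdef' # => #<MatchData "abcdef" 1:"abcdef">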
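The other three assertions work the same way. The sketch below is illustrative only; the helper name and sample strings are invented.

```ruby
# Hypothetical helper: the digits that directly follow a "$" sign.
def amount_after_dollar(text)
  text[/(?<=\$)\d+/]        # lookbehind: the "$" itself is not part of the match
end

p amount_after_dollar("cost: $42")
p("foobar foobaz" =~ /foo(?!bar)/)               # negative lookahead skips "foobar"
p "price 100 USD, 200 EUR".scan(/\d+(?= USD)/)   # only numbers followed by " USD"
p "a1 b2 c3".scan(/(?<!b)\d/)                    # digits not preceded by "b"
```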
Options
| Symbol | Description | Notes |
|---|---|---|
| /pat/i | Ignore case | |
| /pat/m | Treat a newline as a character matched by . | |
| /pat/x | Ignore whitespace and comments in the pattern | With this option, comments can be written inside the regexp |
| /pat/o | Perform #{} interpolation only once | |
    float_pat = /\A
        [[:digit:]]+ # 1 or more digits before the decimal point
        (\.          # Decimal point
            [[:digit:]]+ # 1 or more digits after the decimal point
        )? # The decimal point and following digits are optional
    \Z/x
    float_pat.match('3.14') #=> #<MatchData "3.14" 1:".14">
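A sketch of the other options, plus a stricter variant of the float pattern. The variant ends in `\z`, so a trailing newline is rejected, whereas `\Z` above would accept one. The name `float_shape?` is hypothetical.

```ruby
p(/ruby/i =~ "RUBY")      # i: case-insensitive
p(/a.c/  =~ "a\nc")       # without m, "." does not match a newline
p(/a.c/m =~ "a\nc")       # with m, it does

# The same options can be passed to Regexp.new as constants.
p Regexp.new('a.c', Regexp::MULTILINE | Regexp::IGNORECASE) == /a.c/mi

# Hypothetical helper using x so the pattern can carry its own comments.
def float_shape?(text)
  pattern = /\A
    [[:digit:]]+              # integer part
    (?: \. [[:digit:]]+ )?    # optional fraction
  \z/x
  !pattern.match(text).nil?
end

p float_shape?("3.14")
p float_shape?("3.")
p float_shape?("3.14\n")
```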
Encoding
| Symbol | Description | Notes |
|---|---|---|
| /pat/u | UTF-8 | |
| /pat/e | EUC-JP | |
| /pat/s | Windows-31J | |
| /pat/n | ASCII-8BIT | |
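An encoding flag fixes the encoding of the regexp, which in turn restricts the strings it can be matched against. The sketch below is based on the behavior described for `Regexp#fixed_encoding?`; the helper `compatible?` is made up for illustration.

```ruby
# Hypothetical helper: false if matching raises an encoding error.
def compatible?(regexp, string)
  regexp =~ string
  true
rescue Encoding::CompatibilityError
  false
end

plain = /a/
utf8  = /a/u

p plain.fixed_encoding?     # an ASCII-only pattern without a flag is not fixed
p utf8.fixed_encoding?      # the u flag fixes the encoding
p utf8.encoding

euc_text = "\xa1\xa2 a".dup.force_encoding("euc-jp")   # non-ASCII EUC-JP bytes
p compatible?(plain, euc_text)
p compatible?(utf8, euc_text)
```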
Ruby-specific global variables
| Symbol | Description | Notes |
|---|---|---|
| $~ | Equivalent to Regexp.last_match | |
| $& | Contains the complete matched text | |
| $` | Contains the string before the match | |
| $' | Contains the string after the match | |
| $1, $2, and so on | Contain the text matched by the first, second, etc. capture group | |
| $+ | Contains the last capture group | |
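All of these variables are set by the most recent successful match in the current method. A minimal sketch, with a hypothetical helper that collects them into a hash:

```ruby
# Hypothetical helper: gathers the match-related globals after a successful match.
def match_parts(text)
  return nil unless text =~ /(y)(st)/
  {
    all:    $~[0],   # $~ is the MatchData, same as Regexp.last_match
    before: $`,
    match:  $&,
    after:  $',
    first:  $1,
    second: $2,
    last:   $+
  }
end

p match_parts("haystack")
p match_parts("needle")
```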
Created: 2014-01-15
Last modified: 2014-01-16