Note: everything below is based on ruby 2.0.0p247 (2013-06-27) [x86_64-linux].

A handful of small examples

Basic usage

/hay/ =~ 'haystack'   #=> 0
%r[y].match('haystack') #=> #<MatchData "y">

Main concepts

Creating Regexp objects

# Match check: returns the index of the first match, or nil
regexp =~ string

# MatchData: returns a MatchData object, or nil
match_data = regexp.match(string)

# Regexp class
r1 = Regexp.new('^a-z+:\s+\w+') #=> /^a-z+:\s+\w+/
r2 = Regexp.new('cat', true)     #=> /cat/i
r3 = Regexp.new(r2)              #=> /cat/i
r4 = Regexp.new('dog', Regexp::EXTENDED | Regexp::IGNORECASE) #=> /dog/ix
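Regexp.new is most useful when the pattern text comes from a variable rather than a literal. The following is only a sketch, not part of the original notes: the helper name `literal_regexp` is made up for illustration, and it leans on the standard `Regexp.escape` to neutralise metacharacters.

```ruby
# Hypothetical helper: build a case-insensitive regexp that matches
# the given text literally, even if it contains metacharacters.
def literal_regexp(text)
  Regexp.new(Regexp.escape(text), Regexp::IGNORECASE)
end

r = literal_regexp('a.b')
p r              # the dot has been escaped
p r.match('A.B') # matches, ignoring case
p r.match('axb') # nil, because the dot is now literal
```

Without the escape step, the dot would match any character and 'axb' would match as well.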

Character Classes

General shorthands

Pattern	Description	Notes
`/./`	Any character except a newline.	Everything except a newline
`/./m`	Any character (the m modifier enables multiline mode)	With multiline mode on, matches every character
`/\w/`	A word character (`[a-zA-Z0-9_]`)
`/\W/`	A non-word character (`[^a-zA-Z0-9_]`). Please take a look at Bug #4044 if using /\W/ with the /i modifier.
`/\d/`	A digit character (`[0-9]`)
`/\D/`	A non-digit character (`[^0-9]`)
`/\h/`	A hexdigit character (`[0-9a-fA-F]`)
`/\H/`	A non-hexdigit character (`[^0-9a-fA-F]`)
`/\s/`	A whitespace character (`[ \t\r\n\f]`)
`/\S/`	A non-whitespace character (`[^ \t\r\n\f]`)
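As a quick illustration of the shorthands in the table, here is a small sketch that combines them with String#scan. The helper names `numbers_in` and `words_in` and the sample strings are assumptions made for this example.

```ruby
# Hypothetical helper: every run of digits, converted to integers.
def numbers_in(text)
  text.scan(/\d+/).map { |digits| digits.to_i }
end

# Hypothetical helper: every run of word characters ([a-zA-Z0-9_]).
def words_in(text)
  text.scan(/\w+/)
end

p numbers_in('3 apples, 12 pears')
p words_in('foo_bar baz-42')
```

Note that the underscore counts as a word character while the hyphen does not, so 'baz-42' is split into two words.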

POSIX

Pattern	Description	Notes
`/[[:alnum:]]/`	Alphabetic and numeric character
`/[[:alpha:]]/`	Alphabetic character
`/[[:blank:]]/`	Space or tab
`/[[:cntrl:]]/`	Control character
`/[[:digit:]]/`	Digit
`/[[:graph:]]/`	Non-blank character (excludes spaces, control characters, and similar)
`/[[:lower:]]/`	Lowercase alphabetical character
`/[[:print:]]/`	Like `[:graph:]`, but includes the space character
`/[[:punct:]]/`	Punctuation character
`/[[:space:]]/`	Whitespace character (`[:blank:]`, newline, carriage return, etc.)
`/[[:upper:]]/`	Uppercase alphabetical
`/[[:xdigit:]]/`	Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)

non-POSIX

Pattern	Description	Notes
`/[[:word:]]/`	A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
`/[[:ascii:]]/`	A character in the ASCII character set
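One practical difference between the shorthands and the bracket expressions is how they treat non-ASCII text: `\w` only covers `[a-zA-Z0-9_]`, while `[[:word:]]` and `[[:alpha:]]` are Unicode-aware. A minimal sketch, assuming a UTF-8 string (the constant `CAFE` and the helper names are invented for this example):

```ruby
# The last letter is e with an acute accent (U+00E9); the escape keeps
# this source file ASCII-only.
CAFE = "caf\u00E9"

# \w is ASCII-only, so the accented letter is rejected.
def ascii_word?(text)
  !!(text =~ /\A\w+\z/)
end

# [[:word:]] follows the Unicode categories listed above.
def unicode_word?(text)
  !!(text =~ /\A[[:word:]]+\z/)
end

p ascii_word?(CAFE)
p unicode_word?(CAFE)
```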

Character Properties

Pattern	Description	Notes
`/\p{Alnum}/`	Alphabetic and numeric character
`/\p{Alpha}/`	Alphabetic character
`/\p{Blank}/`	Space or tab
`/\p{Cntrl}/`	Control character
`/\p{Digit}/`	Digit
`/\p{Graph}/`	Non-blank character (excludes spaces, control characters, and similar)
`/\p{Lower}/`	Lowercase alphabetical character
`/\p{Print}/`	Like \p{Graph}, but includes the space character
`/\p{Punct}/`	Punctuation character
`/\p{Space}/`	Whitespace character ([:blank:], newline, carriage return, etc.)
`/\p{Upper}/`	Uppercase alphabetical
`/\p{XDigit}/`	Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
`/\p{Word}/`	A member of one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
`/\p{ASCII}/`	A character in the ASCII character set
`/\p{Any}/`	Any Unicode character (including unassigned characters)
`/\p{Assigned}/`	An assigned character
`/\p{L}/`	'Letter'
`/\p{Ll}/`	'Letter: Lowercase'
`/\p{Lm}/`	'Letter: Modifier'
`/\p{Lo}/`	'Letter: Other'
`/\p{Lt}/`	'Letter: Titlecase'
`/\p{Lu}/`	'Letter: Uppercase'
`/\p{M}/`	'Mark'
`/\p{Mn}/`	'Mark: Nonspacing'
`/\p{Mc}/`	'Mark: Spacing Combining'
`/\p{Me}/`	'Mark: Enclosing'
`/\p{N}/`	'Number'
`/\p{Nd}/`	'Number: Decimal Digit'
`/\p{Nl}/`	'Number: Letter'
`/\p{No}/`	'Number: Other'
`/\p{P}/`	'Punctuation'
`/\p{Pc}/`	'Punctuation: Connector'
`/\p{Pd}/`	'Punctuation: Dash'
`/\p{Ps}/`	'Punctuation: Open'
`/\p{Pe}/`	'Punctuation: Close'
`/\p{Pi}/`	'Punctuation: Initial Quote'
`/\p{Pf}/`	'Punctuation: Final Quote'
`/\p{Po}/`	'Punctuation: Other'
`/\p{S}/`	'Symbol'
`/\p{Sm}/`	'Symbol: Math'
`/\p{Sc}/`	'Symbol: Currency'
`/\p{Sk}/`	'Symbol: Modifier'
`/\p{So}/`	'Symbol: Other'
`/\p{Z}/`	'Separator'
`/\p{Zs}/`	'Separator: Space'
`/\p{Zl}/`	'Separator: Line'
`/\p{Zp}/`	'Separator: Paragraph'
`/\p{C}/`	'Other'
`/\p{Cc}/`	'Other: Control'
`/\p{Cf}/`	'Other: Format'
`/\p{Cn}/`	'Other: Not Assigned'
`/\p{Co}/`	'Other: Private Use'
`/\p{Cs}/`	'Other: Surrogate'

   Lastly, `\p{}` matches a character’s Unicode script. The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

Matching other scripts, and negating a property with ^:

/\p{Arabic}/.match("\u06E9") #=> #<MatchData "\u06E9">
/\p{^Ll}/.match("A") #=> #<MatchData "A">
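To see a script property, a general category, and a negated property side by side, here is a short sketch. The helper names and sample strings are assumptions for illustration; the Han characters are written as escapes (U+4E2D and U+6587).

```ruby
# Hypothetical helper: all characters belonging to the Han script.
def han_chars(text)
  text.scan(/\p{Han}/)
end

# Hypothetical helper: all uppercase letters (general category Lu).
def uppercase_letters(text)
  text.scan(/\p{Lu}/)
end

# Hypothetical helper: everything that is NOT an uppercase letter.
def without_uppercase(text)
  text.scan(/\p{^Lu}/).join
end

p han_chars("Ruby \u4E2D\u6587")
p uppercase_letters('Hello World')
p without_uppercase('HeLLo')
```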

Repetition

Quantifiers

Quantifier	Description	Notes
*	Zero or more times
+	One or more times
?	Zero or one time (optional)
{n}	Exactly n times
{n,}	n or more times
{,m}	m or fewer times
{n,m}	At least n and at most m times
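A small sketch of the counted quantifiers and the optional `?` in use. The pattern names `LOOSE_DATE` and `COLOR` and the helpers are invented for this example; the date check is deliberately loose and does not validate month or day ranges.

```ruby
# Four digits, then one or two digits twice, separated by hyphens.
LOOSE_DATE = /\A\d{4}-\d{1,2}-\d{1,2}\z/

# The u is optional, so both spellings are accepted.
COLOR = /\Acolou?r\z/

def loose_date?(text)
  !!(LOOSE_DATE =~ text)
end

def color_word?(text)
  !!(COLOR =~ text)
end

p loose_date?('2014-01-15')
p loose_date?('14-01-15')
p color_word?('color')
p color_word?('colour')
```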

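Modes

Quantifiers are greedy by default: they match as much as possible while still letting the overall match succeed.
A non-greedy (lazy) quantifier matches as little as possible.
- Add ? after the quantifier to make it lazy
- This applies to quantifiers such as * + {n,}

/<(.+)>/.match("<a><b>") # => #<MatchData "<a><b>" 1:"a><b">
/<(.+?)>/.match("<a><b>") # => #<MatchData "<a>" 1:"a">
/<(.+?)>/.match("<abc><b>") # => #<MatchData "<abc>" 1:"abc">
/<(.{1,}?)>/.match("<abc><b>") # => #<MatchData "<abc>" 1:"abc">
/<(.{1,})>/.match("<abc><b>") # => #<MatchData "<abc><b>" 1:"abc><b">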

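Grouping

Grouping gives structure to the constructs above: parentheses collect part of a pattern into a unit that can be quantified, captured, and referenced later.

Capturing

Two operations are involved: capturing and referencing.

/(?<vowel>[aeiou]).\k<vowel>.\k<vowel>/.match('ototomy')
    #=> #<MatchData "ototo" vowel:"o">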

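Capture

A sub-pattern enclosed in () is a capture group; groups are numbered 1, 2, 3, ... from left to right.

A named group is written in one of the following forms:

(?<name>pat)
# or
(?'name'pat)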
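Once a match succeeds, the captured text can be read back from the MatchData object by number or by name. A minimal sketch, with the pattern name `DATE` and the helper `parse_date` made up for illustration:

```ruby
DATE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/

# Hypothetical helper: returns a hash of integers, or nil if no date is found.
def parse_date(text)
  m = DATE.match(text)
  return nil unless m
  # Named groups can be read by symbol, by string, or by position.
  { year: m[:year].to_i, month: m['month'].to_i, day: m[3].to_i }
end

p parse_date('Created 2014-01-15')
p parse_date('no date here')
p DATE.names
```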

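Atomic grouping

A group written as (?>pat) is an atomic group.

Once an atomic group has matched, the regexp engine will not backtrack into it, so part of the backtracking during matching is switched off.

/"(.*)"/.match('"Quote"')     #=> #<MatchData "\"Quote\"" 1:"Quote">
/"(?>.*)"/.match('"Quote"') #=> nil
# Why the second one fails: .* is greedy and also consumes the closing ", so the final " in the pattern has nothing left to match, and the atomic group will not give it back.
# The first one succeeds only because the engine backtracks.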
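The same effect can be shown with an even smaller pattern. This is a sketch with made-up helper names; the two patterns differ only in whether `a+` is wrapped in an atomic group.

```ruby
# a+ may give back one 'a' so that the trailing "ab" can match.
def plain_match?(text)
  !!(text =~ /\Aa+ab\z/)
end

# (?>a+) keeps every 'a' it consumed, so the following 'a' never matches.
def atomic_match?(text)
  !!(text =~ /\A(?>a+)ab\z/)
end

p plain_match?('aaab')
p atomic_match?('aaab')
```

The atomic version can never succeed on any input, which is exactly the point: the group refuses to return characters it has already taken.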

Non-capturing groups

(?:pat)
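A non-capturing group still groups, for example so that a quantifier applies to the whole unit, but it does not take up a capture number. A short sketch (the helper name `captures_with` is an assumption for this example):

```ruby
# Hypothetical helper: the list of captured groups, or nil without a match.
def captures_with(pattern, text)
  m = pattern.match(text)
  m ? m.captures : nil
end

p captures_with(/(ab)+(c)/, 'ababc')   # two groups are captured
p captures_with(/(?:ab)+(c)/, 'ababc') # only the second pair of parentheses captures
```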

Backreferences

A captured group can be referenced later in the same pattern with \1, \2, \k<name>, and so on.

\1
# with name
\k<name>
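A typical use of a backreference is spotting an accidentally repeated word. The sketch below is illustrative only: `DOUBLED_WORD` and both helpers are invented names, and the pattern only handles a single space between the two words.

```ruby
# \1 must match exactly the text that group 1 captured.
DOUBLED_WORD = /\b(\w+) \1\b/

# Hypothetical helper: the repeated word, or nil.
def doubled_word(text)
  m = DOUBLED_WORD.match(text)
  m && m[1]
end

# Hypothetical helper: in a replacement string, '\1' refers to group 1 as well.
def collapse_doubles(text)
  text.gsub(DOUBLED_WORD, '\1')
end

p doubled_word('this is is a test')
p collapse_doubles('this is is a test')
```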

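Named captures as local variables

When a regexp literal with named groups is on the left-hand side of =~, the captures are assigned to local variables with the same names.

dollars = 'abc'
/\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
dollars #=> "3"

Note that an existing local variable with the same name is overwritten: dollars changes from 'abc' to "3".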
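The same assignment works inside a method, which keeps the created variables out of the surrounding scope. A minimal sketch, with `parse_price` as a made-up helper that returns the amount in cents:

```ruby
# Hypothetical helper: "$3.67" becomes 367; returns nil when nothing matches.
def parse_price(text)
  # The regexp literal has to be on the left-hand side of =~ for the
  # local variables dollars and cents to be created.
  if /\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ text
    dollars.to_i * 100 + cents.to_i
  end
end

p parse_price('$3.67')
p parse_price('free')
```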

# More grouping and backreference examples
/[aeiou]\w{2}/.match("Caenorhabditis elegans") #=> #<MatchData "aen">
/([aeiou]\w){2}/.match("Caenorhabditis elegans")
    #=> #<MatchData "enor" 1:"or">
/I(n)ves(ti)ga\2ons/.match("Investigations")
    #=> #<MatchData "Investigations" 1:"n" 2:"ti">
/I(?:n)ves(ti)ga\1ons/.match("Investigations")
    #=> #<MatchData "Investigations" 1:"ti">

Subexpression Calls

\g<name> re-invokes the sub-pattern of the named group at that point, so a pattern can be reused and can even call itself recursively.

/\A(?<paren>\(\g<paren>*\))*\z/.match('') # => #<MatchData "" paren:nil>
/\A(?<paren>\(\g<paren>*\))*\z/.match('()') # => #<MatchData "()" paren:"()">
/\A(?<paren>\(\g<paren>*\))*\z/.match('(())') # => #<MatchData "(())" paren:"(())">
# ^1 start of string
#      ^2 the paren group, which matches one parenthesized unit
#           ^3 a literal (
#                 ^4 call the paren group again (recursion)
#                      ^5 zero or more nested paren groups
#                       ^^6 a literal )
#                           ^7 end of string
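Wrapped in a predicate, the recursive pattern above becomes a small balanced-parentheses check. This is a sketch: the constant `BALANCED` and the method `balanced?` are names chosen for this example, and the check only accepts strings made up entirely of parentheses.

```ruby
# Same pattern as above: a paren group may contain further paren groups.
BALANCED = /\A(?<paren>\(\g<paren>*\))*\z/

def balanced?(text)
  !!(BALANCED =~ text)
end

p balanced?('(()())')
p balanced?('(()')
p balanced?(')(')
```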

Alternation

Alternatives are separated with |; the group matches if any one of them matches.

/\w(and|or)\w/.match("Feliformia") #=> #<MatchData "form" 1:"or">
/\w(and|or)\w/.match("furandi")    #=> #<MatchData "randi" 1:"and">
/\w(and|or)\w/.match("dissemblance") #=> nil
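One thing worth keeping in mind is that | has very low precedence: without a group it splits the whole pattern, anchors included. The sketch below contrasts the two forms; the names `LOOSE`, `STRICT`, and the helpers are assumptions for illustration.

```ruby
# Reads as "(^cat) or (dog$)": each anchor belongs to one alternative only.
LOOSE = /^cat|dog$/

# The group limits the alternation, so both anchors apply to both words.
STRICT = /^(?:cat|dog)$/

def loose_pet?(text)
  !!(LOOSE =~ text)
end

def strict_pet?(text)
  !!(STRICT =~ text)
end

p loose_pet?('hotdog')
p strict_pet?('hotdog')
p strict_pet?('dog')
```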

Anchors

Anchors pin down where the rest of the pattern must match; they are zero-width and do not become part of the matched text.

Pattern	Description	Notes
`^`	Matches beginning of line
`$`	Matches end of line
`\A`	Matches beginning of string
`\Z`	Matches end of string. If the string ends with a newline, it matches just before that newline
`\z`	Matches end of string
`\G`	Matches the point where the last match finished
`\b`	Matches word boundaries when outside brackets; backspace (0x08) when inside brackets	Word boundary
`\B`	Matches non-word boundaries
`(?=pat)`	Positive lookahead assertion: ensures that the following characters match pat, but doesn't include those characters in the matched text	Zero-width positive lookahead
`(?!pat)`	Negative lookahead assertion: ensures that the following characters do not match pat, but doesn't include those characters in the matched text	Zero-width negative lookahead
`(?<=pat)`	Positive lookbehind assertion: ensures that the preceding characters match pat, but doesn't include those characters in the matched text	Zero-width positive lookbehind
`(?<!pat)`	Negative lookbehind assertion: ensures that the preceding characters do not match pat, but doesn't include those characters in the matched text	Zero-width negative lookbehind

The last four entries are assertions (lookarounds); see a reference on regular-expression assertions for the terminology.
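The difference between the line anchors (^ and $) and the string anchors (\A and \z) in the table is easy to overlook, and it matters when validating input. A minimal sketch with made-up helper names and sample input:

```ruby
# ^ and $ are satisfied by any single line inside the string.
def line_anchored_number?(text)
  !!(text =~ /^\d+$/)
end

# \A and \z refer to the whole string.
def string_anchored_number?(text)
  !!(text =~ /\A\d+\z/)
end

input = "42\nrm -rf /"
p line_anchored_number?(input)
p string_anchored_number?(input)
p string_anchored_number?('42')
```

For the same reason, `\Z` accepts "42\n" while `\z` does not.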

Here is the zero-width positive lookahead in action:

/(\w+)(?=abc)/.match 'defabcdef' # => #<MatchData "def" 1:"def">
#      ^ only fixes the position
#                        ^ abc is found here
#                     ^^^ what (\w+) matches: the text just before that position
/(?=abc)(\w+)/.match 'defabcdef' # => #<MatchData "abcdef" 1:"abcdef">
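The other assertions work the same way. As a sketch of a positive lookbehind and a negative lookahead (helper names and sample strings are assumptions for this example):

```ruby
# Digits that are preceded by a dollar sign; the sign itself is not returned.
def dollar_amounts(text)
  text.scan(/(?<=\$)\d+/)
end

# Words starting with "foo", unless "bar" follows immediately.
def foo_words_except_bar(text)
  text.scan(/\bfoo(?!bar)\w*/)
end

p dollar_amounts('apple $3, pear $12, 7 left')
p foo_words_except_bar('foobar foobaz food')
```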

Options

Pattern	Description	Notes
`/pat/i`	Ignore case
`/pat/m`	Treat a newline as a character matched by .
`/pat/x`	Ignore whitespace and comments in the pattern	This option makes it possible to write comments inside a regular expression
`/pat/o`	Perform #{} interpolation only once

float_pat = /\A
    [[:digit:]]+ # 1 or more digits before the decimal point
    (\.          # Decimal point
        [[:digit:]]+ # 1 or more digits after the decimal point
    )? # The decimal point and following digits are optional
\Z/x
float_pat.match('3.14') #=> #<MatchData "3.14" 1:".14">
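The i and m options from the table can be compared directly against the default behaviour. The sketch below also uses the inline form `(?i:...)`, which turns an option on for part of a pattern only; the helper name `matches?` is invented for this example.

```ruby
# Hypothetical helper: true if the pattern matches anywhere in the text.
def matches?(pattern, text)
  !!(pattern =~ text)
end

p matches?(/ruby/, 'RUBY')      # case-sensitive by default
p matches?(/ruby/i, 'RUBY')     # i ignores case
p matches?(/a.b/, "a\nb")       # . does not match a newline by default
p matches?(/a.b/m, "a\nb")      # m lets . match a newline
p matches?(/r(?i:uby)/, 'rUBY') # only "uby" is case-insensitive
p matches?(/r(?i:uby)/, 'RUBY') # the leading r is still case-sensitive
```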

Encoding

Pattern	Description	Notes
`/pat/u`	UTF-8
`/pat/e`	EUC-JP
`/pat/s`	Windows-31J
`/pat/n`	ASCII-8BIT
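The effect of an encoding modifier can be inspected with Regexp#encoding and Regexp#fixed_encoding?. This is only a small sketch with a made-up helper name; it looks at the u modifier and leaves the others aside.

```ruby
# Hypothetical helper: the regexp's encoding and whether it is fixed.
def encoding_info(pattern)
  [pattern.encoding, pattern.fixed_encoding?]
end

p encoding_info(/a/u) # UTF-8, and fixed to it
p encoding_info(/a/)  # an ASCII-only literal is not tied to one encoding
```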

Ruby-specific global variables

Variable	Description	Notes
`$~`	Equivalent to Regexp.last_match
`$&`	Contains the complete matched text
$`	Contains the string before the match
`$'`	Contains the string after the match
`$1`, `$2`, and so on	Contain the text matched by the first, second, etc. capture group
`$+`	Contains the last capture group
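A short sketch that collects these variables right after a successful match. The helper `split_around` and its hash keys are names chosen for this example; the variables themselves are set by the preceding =~.

```ruby
# Hypothetical helper: the pieces of the text around the first match,
# or nil when the pattern does not match.
def split_around(text, pattern)
  return nil unless text =~ pattern
  { before: $`, match: $&, after: $', first: $1, last: $+ }
end

p split_around('haystack', /(y)(s)/)
p split_around('haystack', /needle/)
```

These variables are local to the current method and thread, so reading them immediately after the match, as above, is the safest habit.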


Created: 2014-01-15

Last modified: 2014-01-16


Ruby regular expressions: let regular expressions streamline string processing
