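Python RegEx

In this tutorial, you will learn about regular expressions (RegEx) and use Python's re module to work with them, with the help of examples.

A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example,

^a...s$

The above code defines a RegEx pattern. The pattern is: any five-character string starting with a and ending with s.

A pattern defined using RegEx can be used to match against a string.

Expression	String	Matched?
`^a...s$`	`abs`	No match
	`alias`	Match
	`abyss`	Match
	`Alias`	No match
	`An abacus`	No match

Python has a module named re to work with RegEx. Here's an example: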

import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

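Here, we used the re.match() function to check whether pattern matches at the start of test_string. The method returns a match object if the search is successful. If not, it returns None.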
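As a quick check of the table of sample strings shown earlier, the sketch below runs the same pattern against each of the five strings. The names candidates and results are only illustrative.

```python
import re

pattern = r'^a...s$'
candidates = ['abs', 'alias', 'abyss', 'Alias', 'An abacus']

# re.match() returns a match object on success and None otherwise,
# so bool() turns each result into True or False.
results = {text: bool(re.match(pattern, text)) for text in candidates}

for text, matched in results.items():
    print(f"{text!r}: {'Match' if matched else 'No match'}")
```

Only alias and abyss should be reported as matches, in line with the table.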

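There are several other functions defined in the re module to work with RegEx. Before we explore them, let's learn about regular expressions themselves.

If you already know the basics of RegEx, jump to Python RegEx.

Specify Pattern Using RegEx

To specify regular expressions, metacharacters are used. In the above example, ^ and $ are metacharacters.

MetaCharacters

Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

[] . ^ $ * + ? {} () \ |

[] - Square brackets

Square brackets specify a set of characters you wish to match.

Expression	String	Matched?
`[abc]`	`a`	1 match
	`ac`	2 matches
	`Hey Jude`	No match
	`abc de ca`	5 matches

Here, [abc] will match if the string you are trying to match contains any of a, b or c.

You can also specify a range of characters using - inside square brackets.

[a-e] is the same as [abcde].
[1-4] is the same as [1234].
[0-39] is the same as [01239].

You can complement (invert) the character set by using the caret ^ symbol at the start of a square bracket.

[^abc] means any character except a or b or c.
[^0-9] means any non-digit character.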
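The character-set rules above can be tried out with re.findall(), which returns every match as a list. This is a minimal sketch; the sample string 'a1b22c' and the variable names are made up for illustration.

```python
import re

# Every character that belongs to the set is a separate match.
set_matches = re.findall(r'[abc]', 'abc de ca')
print(set_matches)       # ['a', 'b', 'c', 'c', 'a']

# A range: [a-e] behaves like [abcde].
range_matches = re.findall(r'[a-e]', 'Hey Jude')
print(range_matches)     # ['e', 'd', 'e']

# A leading ^ inverts the set: anything that is not a digit.
inverted_matches = re.findall(r'[^0-9]', 'a1b22c')
print(inverted_matches)  # ['a', 'b', 'c']
```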

. - Period

A period matches any single character (except newline '\n').

Expression	String	Matched?
`..`	`a`	No match
	`ac`	1 match
	`acd`	1 match
	`acde`	2 matches (contains 4 characters)
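A short sketch of the period in action, again using re.findall(). The string 'a\nb' is an extra illustrative input that shows the newline exception.

```python
import re

# Matches do not overlap, so four characters give two pairs.
pairs = re.findall(r'..', 'acde')
print(pairs)        # ['ac', 'de']

# With three characters only one full pair fits.
one_pair = re.findall(r'..', 'acd')
print(one_pair)     # ['ac']

# By default the period skips the newline character.
no_newline = re.findall(r'.', 'a\nb')
print(no_newline)   # ['a', 'b']
```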

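^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.

Expression	String	Matched?
`^a`	`a`	1 match
	`abc`	1 match
	`bac`	No match
`^ab`	`abc`	1 match
`^ab`	`acb`	No match (starts with `a` but not followed by `b`)

$ - Dollar

The dollar symbol $ is used to check if a string ends with a certain character.

Expression	String	Matched?
`a$`	`a`	1 match
	`formula`	1 match
	`cab`	No match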
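The two anchors can be checked with re.search(), which scans the whole string. The helper found() below is a hypothetical convenience function written for this sketch, not part of the re module.

```python
import re

def found(pattern, text):
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# ^ ties the pattern to the start of the string.
print(found(r'^a', 'abc'))      # True
print(found(r'^a', 'bac'))      # False
print(found(r'^ab', 'acb'))     # False

# $ ties the pattern to the end of the string.
print(found(r'a$', 'formula'))  # True
print(found(r'a$', 'cab'))      # False
```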

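* - Star

The star symbol * matches zero or more occurrences of the pattern left to it.

Expression	String	Matched?
`ma*n`	`mn`	1 match
	`man`	1 match
	`maaan`	1 match
	`main`	No match (`a` is not followed by `n`)
	`woman`	1 match

+ - Plus

The plus symbol + matches one or more occurrences of the pattern left to it.

Expression	String	Matched?
`ma+n`	`mn`	No match (no `a` character)
	`man`	1 match
	`maaan`	1 match
	`main`	No match (`a` is not followed by `n`)
	`woman`	1 match

? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern left to it.

Expression	String	Matched?
`ma?n`	`mn`	1 match
	`man`	1 match
	`maaan`	No match (more than one `a` character)
	`main`	No match (`a` is not followed by `n`)
	`woman`	1 match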
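To compare the three quantifiers side by side, the sketch below runs each pattern over the same five words used in the tables. The dictionary name matched_by is only illustrative.

```python
import re

words = ['mn', 'man', 'maaan', 'main', 'woman']

matched_by = {}
for pattern in (r'ma*n', r'ma+n', r'ma?n'):
    # Keep the words in which the pattern is found somewhere.
    matched_by[pattern] = [w for w in words if re.search(pattern, w)]
    print(pattern, matched_by[pattern])

# ma*n ['mn', 'man', 'maaan', 'woman']
# ma+n ['man', 'maaan', 'woman']
# ma?n ['mn', 'man', 'woman']
```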

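{} - Braces

Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern left to it.

Expression	String	Matched?
`a{2,3}`	`abc dat`	No match
	`abc daat`	1 match (`aa` in `daat`)
	`aabc daaat`	2 matches (`aa` in `aabc` and `aaa` in `daaat`)
	`aabc daaaat`	2 matches (`aa` in `aabc` and `aaa` in `daaaat`)

Let's try one more example. The RegEx [0-9]{2,4} (written without a space inside the braces) matches at least 2 digits but not more than 4 digits.

Expression	String	Matched?
`[0-9]{2,4}`	`ab123csde`	1 match (`123` in `ab123csde`)
	`12 and 345673`	3 matches (`12`, `3456`, `73`)
	`1 and 2`	No match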
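The brace examples above can be reproduced with re.findall(). This is a small sketch; the variable names are illustrative.

```python
import re

# At least two and at most three consecutive 'a' characters.
a_runs = re.findall(r'a{2,3}', 'aabc daaaat')
print(a_runs)       # ['aa', 'aaa']

# Runs of two to four digits; longer runs are split greedily.
digit_runs = re.findall(r'[0-9]{2,4}', '12 and 345673')
print(digit_runs)   # ['12', '3456', '73']

# Single digits are too short to match.
too_short = re.findall(r'[0-9]{2,4}', '1 and 2')
print(too_short)    # []
```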

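| - Alternation

Vertical bar | is used for alternation (or operator).

Expression	String	Matched?
`a|b`	`cde`	No match
	`ade`	1 match (`a` in `ade`)
	`acdbea`	3 matches (`a`, `b`, `a` in `acdbea`)

Here, a|b matches any string that contains either a or b.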
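A minimal sketch of alternation with re.findall(), using the same three strings as the table:

```python
import re

none_found = re.findall(r'a|b', 'cde')
print(none_found)   # []

one_found = re.findall(r'a|b', 'ade')
print(one_found)    # ['a']

three_found = re.findall(r'a|b', 'acdbea')
print(three_found)  # ['a', 'b', 'a']
```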

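() - Group

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains either a or b or c followed by xz.

Expression	String	Matched?
`(a|b|c)xz`	`ab xz`	No match
	`abxz`	1 match (`bxz` in `abxz`)
	`axz cabxz`	2 matches (`axz` and `bxz` in `axz cabxz`)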
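The grouping example can be sketched with re.finditer(), which yields one match object per match. One detail worth knowing: when the pattern contains a capturing group, re.findall() returns the text of the group rather than the whole match. The variable names below are illustrative.

```python
import re

text = 'axz cabxz'

# group() with no argument is the whole match;
# group(1) is the text captured by the parentheses.
whole_matches = []
for m in re.finditer(r'(a|b|c)xz', text):
    whole_matches.append(m.group())
    print(m.group(), m.group(1))

# With a capturing group, findall() returns only the group's text.
group_only = re.findall(r'(a|b|c)xz', text)
print(group_only)  # ['a', 'b']
```

The loop prints axz a and then bxz b.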

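\ - Backslash

Backslash \ is used to escape various characters including all metacharacters. For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by a RegEx engine in a special way.

If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. This makes sure the character is not treated in a special way. A backslash in front of a letter or digit is different: it may form a special sequence such as \d.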
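The effect of escaping can be seen by comparing the escaped and unescaped forms of the pattern. The sample string 'price: $a' is made up for this sketch. It also uses re.escape(), a function in the re module that adds the needed backslashes to a literal string.

```python
import re

text = 'price: $a'

# Unescaped, $ is the end-of-string anchor, so nothing can follow it.
unescaped = re.search(r'$a', text)
print(unescaped)           # None

# Escaped, \$ matches a literal dollar sign.
escaped = re.search(r'\$a', text)
print(escaped.group())     # $a

# re.escape() builds the escaped pattern from a literal string.
auto_escaped = re.escape('$a')
print(auto_escaped)        # \$a
```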

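Special Sequences

Special sequences make commonly used patterns easier to write. Here's a list of special sequences:

\A - Matches if the specified characters are at the start of a string.

Expression	String	Matched?
`\Athe`	`the sun`	Match
`\Athe`	`In the sun`	No match

\b - Matches if the specified characters are at the beginning or end of a word.

Expression	String	Matched?
`\bfoo`	`football`	Match
	`a football`	Match
	`afootball`	No match
`foo\b`	`the foo`	Match
	`the afoo test`	Match
	`the afootest`	No match

\B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

Expression	String	Matched?
`\Bfoo`	`football`	No match
	`a football`	No match
	`afootball`	Match
`foo\B`	`the foo`	No match
	`the afoo test`	No match
	`the afootest`	Match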
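The word-boundary tables can be verified with a short loop. Note that these patterns are written as raw strings (r'...'), because in an ordinary Python string '\b' means the backspace character. The list name samples is illustrative.

```python
import re

samples = ['football', 'a football', 'afootball']

# \bfoo: 'foo' must start at a word boundary.
at_boundary = [bool(re.search(r'\bfoo', s)) for s in samples]
print(at_boundary)       # [True, True, False]

# \Bfoo: 'foo' must NOT start at a word boundary.
not_at_boundary = [bool(re.search(r'\Bfoo', s)) for s in samples]
print(not_at_boundary)   # [False, False, True]
```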

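\d - Matches any decimal digit. Equivalent to [0-9]

Expression	String	Matched?
`\d`	`12abc3`	3 matches (`1`, `2`, `3` in `12abc3`)
`\d`	`Python`	No match

\D - Matches any character that is not a decimal digit. Equivalent to [^0-9]

Expression	String	Matched?
`\D`	`1ab34"50`	3 matches (`a`, `b`, `"` in `1ab34"50`)
`\D`	`1345`	No match

\s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v].

Expression	String	Matched?
`\s`	`Python RegEx`	1 match
`\s`	`PythonRegEx`	No match

\S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

Expression	String	Matched?
`\S`	`a b`	2 matches (`a` and `b` in `a b`)
`\S`	`   `	No match

\w - Matches any alphanumeric character (digits and letters). Equivalent to [a-zA-Z0-9_]. By the way, underscore _ is also considered an alphanumeric character.

Expression	String	Matched?
`\w`	`12&": ;c`	3 matches (`1`, `2`, `c` in `12&": ;c`)
`\w`	`%"> !`	No match

\W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

Expression	String	Matched?
`\W`	`1a2%c`	1 match (`%` in `1a2%c`)
`\W`	`Python`	No match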
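The six character-class sequences above can be checked in one go with re.findall(). This sketch reuses the strings from the tables; the dictionary name class_matches is illustrative.

```python
import re

class_matches = {
    r'\d': re.findall(r'\d', '12abc3'),        # digits
    r'\D': re.findall(r'\D', '1ab34"50'),      # non-digits
    r'\s': re.findall(r'\s', 'Python RegEx'),  # whitespace
    r'\S': re.findall(r'\S', 'a b'),           # non-whitespace
    r'\w': re.findall(r'\w', '12&": ;c'),      # word characters
    r'\W': re.findall(r'\W', '1a2%c'),         # non-word characters
}

for sequence, matches in class_matches.items():
    print(sequence, matches)
```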

\Z - Matches if the specified characters are at the end of a string.

Expression	String	Matched?
`Python\Z`	`I like Python`	1 match
	`I like Python Programming`	No match
	`Python is fun.`	No match
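The string anchors \A and \Z can be sketched the same way, using the strings from the tables. The variable names are illustrative.

```python
import re

# \A: the match must begin at the very start of the string.
starts_with_the = bool(re.search(r'\Athe', 'the sun'))
starts_inside = bool(re.search(r'\Athe', 'In the sun'))
print(starts_with_the, starts_inside)   # True False

# \Z: the match must finish at the very end of the string.
ends_with_python = bool(re.search(r'Python\Z', 'I like Python'))
ends_elsewhere = bool(re.search(r'Python\Z', 'Python is fun.'))
print(ends_with_python, ends_elsewhere)  # True False
```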

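Tip: To build and test regular expressions, you can use RegEx tester tools such as regex101. This tool not only helps you create regular expressions, but also helps you learn them.

Now that you understand the basics of RegEx, let's discuss how to use RegEx in your Python code.

Python RegEx

Python has a module named re to work with regular expressions. To use it, we need to import the module.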

import re

The module defines several functions and constants to work with RegEx.

re.findall()

The re.findall() method returns a list of strings containing all matches.

Example 1: re.findall()


# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

# Output: ['12', '89', '34']

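If the pattern is not found, re.findall() returns an empty list.

re.split()

The re.split() method splits the string where there is a match and returns a list of the resulting substrings.

Example 2: re.split()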


import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']

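If the pattern is not found, re.split() returns a list containing the original string.

You can pass the maxsplit argument to the re.split() method. It's the maximum number of splits that will occur.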


import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit=1: split only at the first occurrence
# (passed by keyword; positional maxsplit is deprecated since Python 3.13)
result = re.split(pattern, string, maxsplit=1)
print(result)

# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']

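By the way, the default value of maxsplit is 0, meaning all possible splits.

re.sub()

The syntax of re.sub() is:

re.sub(pattern, replace, string)

The method returns a string where matched occurrences are replaced with the content of the replace variable.

Example 3: re.sub()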


# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# Output: abc12de23f456

If the pattern is not found, re.sub() returns the original string.
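The "pattern not found" behaviour described for re.findall(), re.split() and re.sub() can be compared in one short sketch. The sample text and the '#' replacement are made up for illustration.

```python
import re

text = 'no digits here'
pattern = r'\d+'

found = re.findall(pattern, text)      # no matches at all
parts = re.split(pattern, text)        # nothing to split on
replaced = re.sub(pattern, '#', text)  # nothing to replace

print(found)     # []
print(parts)     # ['no digits here']
print(replaced)  # no digits here
```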

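You can pass count as a fourth argument to the re.sub() method, preferably by keyword (for example, count=1). If omitted, it defaults to 0, which replaces all occurrences.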


import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'
replace = ''

# count=1: replace only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)

# Output:
# abc12de 23
# f45 6

re.subn()

re.subn() is similar to re.sub() except that it returns a tuple of 2 items containing the new string and the number of substitutions made.

Example 4: re.subn()


# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

# Output: ('abc12de23f456', 4)

re.search()

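The re.search() method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.

If the search is successful, re.search() returns a match object; if not, it returns None.

match = re.search(pattern, str)

Example 5: re.search()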


import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

# Output: pattern found inside the string

Here, match contains a match object.
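Since the introduction used re.match() and this section uses re.search(), a brief comparison may help. The sketch below assumes the sample string 'I like Python': re.match() only tries the pattern at the start of the string, while re.search() scans the whole string.

```python
import re

text = 'I like Python'

# re.match() looks only at the beginning of the string.
at_start = re.match(r'Python', text)
print(at_start)           # None

# re.search() finds the first match anywhere in the string.
anywhere = re.search(r'Python', text)
print(anywhere.span())    # (7, 13)
```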

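Match object

You can get the methods and attributes of a match object using the dir() function.

Some of the commonly used methods and attributes of match objects are: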
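As a rough sketch of the dir() approach, the snippet below lists the public names of a match object. The sample string 'hello 12' is illustrative, and the exact list may vary between Python versions, so no output is shown.

```python
import re

match = re.search(r'\d+', 'hello 12')

# dir() also lists special names such as __class__;
# names starting with an underscore are filtered out for brevity.
public_names = [name for name in dir(match) if not name.startswith('_')]
print(public_names)
```

The names discussed in the rest of this section (group, start, end, span, re and string) should all appear in the printed list.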

match.group()

The group() method returns the part of the string where there is a match.

Example 6: Match object


import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string)

if match:
    print(match.group())
else:
    print("pattern not found")

# Output: 801 35

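Here, the match variable contains a match object.

Our pattern (\d{3}) (\d{2}) has two subgroups, (\d{3}) and (\d{2}). You can get the part of the string matched by each of these parenthesized subgroups. Here's how: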

>>> match.group(1)
'801'

>>> match.group(2)
'35'
>>> match.group(1, 2)
('801', '35')

>>> match.groups()
('801', '35')

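match.start(), match.end() and match.span()

The start() function returns the index of the start of the matched substring. Similarly, end() returns the end index of the matched substring.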

>>> match.start()
2
>>> match.end()
8

The span() function returns a tuple containing the start and end index of the matched part.

>>> match.span()
(2, 8)
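The indices returned by span() can be used directly for slicing, because the end index is exclusive. The sketch below rebuilds the match from Example 6; passing a group number to span() is shown as an extra.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

start, end = match.span()

# Slicing with the span reproduces what group() returns.
sliced = string[start:end]
print(sliced)                    # 801 35
print(sliced == match.group())   # True

# span() also accepts a group number.
print(match.span(1), match.span(2))  # (2, 5) (6, 8)
```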

match.re and match.string

The re attribute of a match object returns a regular expression object. Similarly, the string attribute returns the passed string.

>>> match.re
re.compile('(\\d{3}) (\\d{2})')

>>> match.string
'39801 356, 2102 1111'
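One possible use of these two attributes, sketched below, is to reuse the compiled pattern on the original string without keeping separate references to either. The variable names are illustrative.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# match.re is the compiled pattern; .pattern is its source text.
source = match.re.pattern
print(source)        # (\d{3}) (\d{2})

# The compiled pattern can be run again on the original string.
# With two groups, findall() returns one tuple per match.
all_pairs = match.re.findall(match.string)
print(all_pairs)     # [('801', '35'), ('102', '11')]
```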

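We have covered all commonly used methods defined in the re module. If you want to learn more, visit Python 3 re module.

Using r prefix before RegEx

When the r or R prefix is used before a regular expression, it means raw string. For example, '\n' is a new line whereas r'\n' means two characters: a backslash \ followed by n.

Within an ordinary Python string literal, backslash \ starts an escape sequence. Using the r prefix makes Python treat \ as a normal character, so it reaches the RegEx engine unchanged.

Example 7: Raw string using r prefix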


import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

# Output: ['\n', '\r']
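The difference between ordinary and raw strings can also be seen without any RegEx function, simply by measuring the strings. A minimal sketch:

```python
import re

plain_length = len('\n')    # a single newline character
raw_length = len(r'\n')     # a backslash followed by 'n'
print(plain_length, raw_length)   # 1 2

# Without the r prefix the backslash has to be doubled.
same_pattern = '\\d+' == r'\d+'
print(same_pattern)               # True

numbers = re.findall(r'\d+', 'hello 12 hi 89')
print(numbers)                    # ['12', '89']
```

Writing patterns as raw strings avoids the doubled backslashes and the invalid escape sequence warnings that recent Python versions emit for strings such as '\d'.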
