A Brief Introduction to Basic Regular Expression Usage

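Regular expressions are very useful. Entire books are devoted to them, which shows how deep the subject runs. Wherever there are people there is intrigue, and wherever there are strings there are regular expressions. A regular expression is nothing more than a pattern: put plainly, it is a string that describes the shape of other strings, and there is nothing mysterious about it.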

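The grep, sed, and awk commands we introduced earlier are text-processing tools. A regular expression is different: it is only a pattern. We hand the pattern to grep, sed, or awk, and the tool uses it to process text. To keep the focus on regular expressions themselves, we use the simplest of the three, grep, as our tool.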

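Regular expressions are also not the same as shell wildcards, although the two look similar. The * in a regular expression does not mean the same thing as the * in a wildcard, and this deserves close attention.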
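
To make the difference concrete, here is a minimal sketch. The sample string abc and the variable names are made up for illustration. In a wildcard, a* means "a followed by anything"; in a regular expression, a* means "zero or more a characters".

```shell
# Wildcard (glob): * stands for any run of characters, so a* matches "abc".
case "abc" in
  a*) glob_result="match" ;;
  *)  glob_result="no match" ;;
esac
echo "glob a* vs abc: $glob_result"

# Regex: * repeats the item before it, so ^a*$ only matches lines made of a's.
if printf 'abc\n' | grep -q '^a*$'; then
  regex_result="match"
else
  regex_result="no match"
fi
echo "regex ^a*\$ vs abc: $regex_result"

# The regex counterpart of the glob a* is ^a.* (an a followed by any characters).
if printf 'abc\n' | grep -q '^a.*'; then
  regex_dot_star="match"
else
  regex_dot_star="no match"
fi
echo "regex ^a.* vs abc: $regex_dot_star"
```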

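Besides basic regular expressions there are also extended regular expressions, which add operators such as +, ?, and (). For these we need egrep or grep -E; in this article we use egrep. (Recent GNU grep releases print a warning that egrep is obsolescent; grep -E is the equivalent spelling.)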

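Take pride in hands-on practice and be ashamed of reading without trying. This article covers only the basics, and new material will be added to this blog later. Let's play with it: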

1. ^xxx matches lines that begin with xxx:

Administrator@51B6904C3C8A485 ~/reg
$ cat test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep ^go test.txt
good good study

Administrator@51B6904C3C8A485 ~/reg
$

2. xxx$ matches lines that end with xxx:

Administrator@51B6904C3C8A485 ~/reg
$ cat test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep up$ test.txt
daydayup

Administrator@51B6904C3C8A485 ~/reg
$

Incidentally, ^$ matches an empty line: the start of the line is followed immediately by its end.
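
As a small sketch of ^$ in use, the commands below count and then remove the empty lines of a made-up three-line input. The variable names are only for illustration.

```shell
# Count empty lines: ^$ is "start of line immediately followed by end of line".
empty_count=$(printf 'good\n\nstudy\n' | grep -c '^$')
echo "empty lines: $empty_count"

# Drop the empty lines with -v, which inverts the match.
non_empty=$(printf 'good\n\nstudy\n' | grep -v '^$')
echo "$non_empty"
```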

3. A dot matches any single character except a newline:

Administrator@51B6904C3C8A485 ~/reg
$ cat test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep d.y test.txt
daydayup

Administrator@51B6904C3C8A485 ~/reg
$

Now look at an incorrect usage:

Administrator@51B6904C3C8A485 ~/reg
$ echo "w.x.y.z" | grep "w.x"
w.x.y.z

Administrator@51B6904C3C8A485 ~/reg
$

The result happens to be right, but only by luck. If you don't believe it, look at this:

Administrator@51B6904C3C8A485 ~/reg
$ echo "w_x_y_z" | grep "w.x"
w_x_y_z

Administrator@51B6904C3C8A485 ~/reg
$

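Why did filtering for w.x also select w_x_y_z? In a regular expression the dot is no longer an ordinary dot: it stands for any single character other than a newline. What if we really do want to filter for the literal string w.x? Then we have to escape the dot with \. We will cover escaping shortly; here is a warm-up: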

Administrator@51B6904C3C8A485 ~/reg
$ echo "www.x.y.z" | grep "w\.x"
www.x.y.z

Administrator@51B6904C3C8A485 ~/reg
$ echo "w_x_y_z" | grep "w\.x"

Administrator@51B6904C3C8A485 ~/reg
$

That is the correct result.
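
The backslash is not the only way to get a literal dot. The sketch below shows two alternatives next to it: a bracket expression, inside which a dot is an ordinary character, and the -F option, which makes grep treat the whole pattern as a fixed string. The variable names are made up for illustration.

```shell
input='w.x.y.z
w_x_y_z'

# Backslash escape: \. is a literal dot.
escaped=$(printf '%s\n' "$input" | grep 'w\.x')
# Bracket expression: a dot inside [] is literal.
bracket=$(printf '%s\n' "$input" | grep 'w[.]x')
# -F: no character in the pattern is special.
fixed=$(printf '%s\n' "$input" | grep -F 'w.x')

echo "$escaped"
echo "$bracket"
echo "$fixed"
```

All three print only w.x.y.z. When the search text is not meant to be a pattern at all, -F is usually the least error-prone choice.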

4. * matches zero or more repetitions of the item before it:

Administrator@51B6904C3C8A485 ~/reg
$ cat test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep d.*y test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep s.*t test.txt
good good study

Administrator@51B6904C3C8A485 ~/reg
$ grep s.t test.txt

Administrator@51B6904C3C8A485 ~/reg
$

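Note that s.t does not select the line while s.*t does. The dot in s.t demands exactly one character between s and t, but in study the t follows the s directly; .* accepts zero or more characters, so it matches.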
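
To see exactly which part of a line .* covers, the sketch below uses the -o option, which prints only the matched text. This assumes a grep that supports -o (GNU and BSD grep do; POSIX does not require it).

```shell
# .* is greedy: the match starts at the first d and runs to the last y.
matched=$(printf 'good good study\n' | grep -o 'd.*y')
echo "$matched"   # prints: d good study

# .* may also match nothing at all: in "study" the t follows the s directly.
zero=$(printf 'study\n' | grep -o 's.*t')
echo "$zero"      # prints: st
```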

5. [] specifies a set of characters. Note that [] matches exactly one character from the set:

Administrator@51B6904C3C8A485 ~/reg
$ cat test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep s[rst]u test.txt
good good study

Administrator@51B6904C3C8A485 ~/reg
$ grep s[abc]u test.txt

Administrator@51B6904C3C8A485 ~/reg
$

What set should we use to cover every English letter? Like this:

Administrator@51B6904C3C8A485 ~/reg
$ cat test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep s[a-zA-Z]u test.txt
good good study

Administrator@51B6904C3C8A485 ~/reg
$

For digits, [0-9] does the job. That one is simple, so we mention it only in passing.
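
Since [0-9] got no session of its own, here is a short sketch with made-up input. It also shows that several ranges can share one pair of brackets.

```shell
# Keep only the lines that contain at least one digit.
with_digit=$(printf 'room 42\nno digits here\nv2\n' | grep '[0-9]')
echo "$with_digit"

# [0-9a-fA-F] matches one hexadecimal digit; two of them make one byte.
hex=$(printf 'ff\nzz\n' | grep '^[0-9a-fA-F][0-9a-fA-F]$')
echo "$hex"
```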

Let's continue:

Administrator@51B6904C3C8A485 ~/reg
$ cat test1.txt
good good study
day day up
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep [a-z]ay test1.txt
day day up
daydayup

Administrator@51B6904C3C8A485 ~/reg
$

As you can see, [a-z]ay also selects the daydayup line. What if we only want lines that contain day as a separate word? That is the next topic.

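6. \ is the escape character. For example, \<xxx matches a word that begins with xxx, and xxx\> matches a word that ends with xxx. (\< and \> are not part of POSIX but are supported by GNU grep. It is also best to wrap every regular expression in quotes so the shell leaves it alone.)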

Administrator@51B6904C3C8A485 ~/reg
$ cat test1.txt
good good study
day day up
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep "\<[a-z]ay\>" test1.txt
day day up

Administrator@51B6904C3C8A485 ~/reg
$

And don't forget that the -w option also selects whole-word matches:

Administrator@51B6904C3C8A485 ~/reg
$ cat test1.txt
good good study
day day up
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep -w [a-z]ay test1.txt
day day up

Administrator@51B6904C3C8A485 ~/reg
$

The next result, however, may come as a surprise:

Administrator@51B6904C3C8A485 ~/reg
$ cat test2.txt
good good study
day day up
oh day'abc oh
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep "\<[a-z]ay\>" test2.txt
day day up
oh day'abc oh

Administrator@51B6904C3C8A485 ~/reg
$ grep -w [a-z]ay test2.txt
day day up
oh day'abc oh

Administrator@51B6904C3C8A485 ~/reg
$

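Why was the line containing day'abc selected as well? It comes down to how grep defines a word: a word is a run of letters, digits, and underscores. Any other character, including ' and the space, acts as a separator.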
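
The sketch below contrasts an apostrophe with an underscore to show where grep draws word boundaries. The input lines are made up, and -w is assumed to be available (it is in GNU and BSD grep).

```shell
# An apostrophe is not a word character, so "day" stands alone as a word here.
apostrophe=$(printf "oh day'abc oh\n" | grep -c -w 'day' || true)

# An underscore IS a word character, so "day" is only a piece of the word day_abc.
underscore=$(printf 'oh day_abc oh\n' | grep -c -w 'day' || true)

echo "apostrophe: $apostrophe, underscore: $underscore"
```

The counts are 1 and 0. The `|| true` is only there because grep exits with a nonzero status when nothing matches.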

7. ^ inside [] means negation

We know that ^xxx matches lines that begin with xxx, but a ^ at the start of [] negates the set instead:

Administrator@51B6904C3C8A485 ~/reg
$ cat test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep "d[^abcd]" test.txt
good good study

Administrator@51B6904C3C8A485 ~/reg
$ grep "d[^abcdy]" test.txt
good good study

Administrator@51B6904C3C8A485 ~/reg
$ grep "d[^abcdy ]" test.txt

Administrator@51B6904C3C8A485 ~/reg
$
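
One detail worth a quick sketch: ^ negates only when it is the first character inside the brackets. The input lines below are made up for illustration.

```shell
# [^0-9] matches any one character that is not a digit.
non_digit=$(printf '12345\n12a45\n' | grep '[^0-9]')
echo "$non_digit"   # prints: 12a45

# Anywhere else inside the brackets, ^ is an ordinary character.
caret=$(printf 'a^b\naxb\nayb\n' | grep 'a[x^]b')
echo "$caret"
```

The second command prints a^b and axb, because [x^] means "an x or a caret".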

8. Some character classes

For example, [[:lower:]] is equivalent to [a-z] (in the C locale).

Likewise, [[:upper:]] is equivalent to [A-Z].

There are quite a few others, which we won't list one by one. Here is [[:lower:]] in use:

Administrator@51B6904C3C8A485 ~/reg
$ grep "^[[:lower:]]" test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep "^[[:upper:]]" test.txt

Administrator@51B6904C3C8A485 ~/reg
$
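
A few of the other POSIX classes, sketched on made-up input: [[:digit:]] for digits, [[:space:]] for whitespace, and [[:alnum:]] for letters and digits. A class can also share its brackets with ordinary characters.

```shell
# Lines made up entirely of digits.
digits=$(printf 'abc\n123\na1\n' | grep '^[[:digit:]][[:digit:]]*$')
echo "$digits"      # prints: 123

# Lines that contain whitespace.
has_space=$(printf 'day day up\ndaydayup\n' | grep '[[:space:]]')
echo "$has_space"   # prints: day day up

# Letters, digits, or underscore: the class and _ sit in the same brackets.
ident=$(printf 'my_var\nmy-var\n' | grep '^[[:alnum:]_]*$')
echo "$ident"       # prints: my_var
```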

9. Repetition, which cannot be skipped

As mentioned earlier, * repeats the preceding character zero or more times. A quick review:

Administrator@51B6904C3C8A485 ~/reg
$ cat test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep "d.*u" test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$ grep "d.*p" test.txt
daydayup

Administrator@51B6904C3C8A485 ~/reg
$

To repeat one or more times, use +. Note that this requires extended regular expressions, so use grep -E or egrep:

Administrator@51B6904C3C8A485 ~/reg
$ echo "gd" | egrep "go*d"
gd

Administrator@51B6904C3C8A485 ~/reg
$ echo "gd" | egrep "go+d"

Administrator@51B6904C3C8A485 ~/reg
$ echo "god" | egrep "go+d"
god

Administrator@51B6904C3C8A485 ~/reg
$ echo "good" | egrep "go+d"
good

Administrator@51B6904C3C8A485 ~/reg
$

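What about repeating only zero times or once? Use a question mark. (The first command below is a reminder that plain grep treats + as an ordinary character, which is why good is not matched there.)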

Administrator@51B6904C3C8A485 ~/reg
$ echo "good" | grep "go+d"

Administrator@51B6904C3C8A485 ~/reg
$ echo "gd" | egrep "go?d"
gd

Administrator@51B6904C3C8A485 ~/reg
$ echo "god" | egrep "go?d"
god

Administrator@51B6904C3C8A485 ~/reg
$ echo "good" | egrep "go?d"

Administrator@51B6904C3C8A485 ~/reg
$

What about exactly 4 repetitions? Like this:

Administrator@51B6904C3C8A485 ~/reg
$ echo "goood" | egrep "go{4}d"

Administrator@51B6904C3C8A485 ~/reg
$ echo "gooood" | egrep "go{4}d"
gooood

Administrator@51B6904C3C8A485 ~/reg
$ echo "goooood" | egrep "go{4}d"

Administrator@51B6904C3C8A485 ~/reg
$

What about 4 or more repetitions? Like this:

Administrator@51B6904C3C8A485 ~/reg
$ echo "goood" | egrep "go{4,}d"

Administrator@51B6904C3C8A485 ~/reg
$ echo "gooood" | egrep "go{4,}d"
gooood

Administrator@51B6904C3C8A485 ~/reg
$ echo "goooood" | egrep "go{4,}d"
goooood

Administrator@51B6904C3C8A485 ~/reg
$

What about 4 to 6 repetitions? Like this:

Administrator@51B6904C3C8A485 ~/reg
$ echo "goood" | egrep "go{4,6}d"

Administrator@51B6904C3C8A485 ~/reg
$ echo "gooood" | egrep "go{4,6}d"
gooood

Administrator@51B6904C3C8A485 ~/reg
$ echo "goooood" | egrep "go{4,6}d"
goooood

Administrator@51B6904C3C8A485 ~/reg
$ echo "gooooood" | egrep "go{4,6}d"
gooooood

Administrator@51B6904C3C8A485 ~/reg
$ echo "goooooood" | egrep "go{4,6}d"

Administrator@51B6904C3C8A485 ~/reg
$

By now repetition should be clear. To summarize:

x* matches zero or more x

x+ matches one or more x

x? matches zero or one x

x{4} matches exactly 4 x

x{4,} matches 4 or more x

x{4,6} matches 4, 5, or 6 x
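
The same repetitions can be written for plain grep as well. As a sketch on made-up input: in a basic regular expression the braces take backslashes, \{4\}, and "one or more" can be spelled by writing the item once and then adding *.

```shell
# Basic regex (plain grep): interval braces need backslashes.
bre=$(printf 'goood\ngooood\ngoooood\n' | grep 'go\{4\}d')
echo "$bre"    # prints: gooood

# Extended regex (grep -E): braces are written bare.
ere=$(printf 'goood\ngooood\ngoooood\n' | grep -E 'go{4}d')
echo "$ere"    # prints: gooood

# One or more o without +: one o, then o* for any further ones.
plus=$(printf 'gd\ngod\ngood\n' | grep 'goo*d')
echo "$plus"
```

The last command prints god and good, the same lines that egrep "go+d" would select.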

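With this much covered, how would you match a 5-digit positive integer? The first attempt below has no word anchors, so it wrongly accepts a 6-digit number; adding \< and \> fixes that: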

Administrator@51B6904C3C8A485 ~/reg
$ echo "123456" | egrep "[1-9][0-9]{4}"
123456

Administrator@51B6904C3C8A485 ~/reg
$ echo "1234" | egrep "\<[1-9][0-9]{4}\>"

Administrator@51B6904C3C8A485 ~/reg
$ echo "12345" | egrep "\<[1-9][0-9]{4}\>"
12345

Administrator@51B6904C3C8A485 ~/reg
$ echo "123456" | egrep "\<[1-9][0-9]{4}\>"

Administrator@51B6904C3C8A485 ~/reg
$
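
When the whole input has to be a 5-digit integer, rather than merely contain one as a word, the line anchors ^ and $ are an alternative to \< and \>. A minimal sketch; the function name is_five_digit is hypothetical and not from the original article.

```shell
# Hypothetical helper: succeeds only if the argument is a 5-digit positive integer.
is_five_digit() {
  printf '%s\n' "$1" | grep -Eq '^[1-9][0-9]{4}$'
}

for n in 1234 12345 123456 01234; do
  if is_five_digit "$n"; then
    echo "$n: yes"
  else
    echo "$n: no"
  fi
done
```

Only 12345 is accepted. 01234 is rejected because [1-9] does not allow a leading zero.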

10. () groups several items into one unit:

Administrator@51B6904C3C8A485 ~/reg
$ echo "abababc" | egrep "(ab){3,}"
abababc

Administrator@51B6904C3C8A485 ~/reg
$ echo "abababc" | egrep "(ab){4,}"

Administrator@51B6904C3C8A485 ~/reg
$
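
To show what grouping changes, the sketch below compares ab{3} with (ab){3} and then applies a group to a slightly more realistic case, the shape of a dotted-quad address. The inputs are made up, and the last pattern checks only the shape, not that each number is at most 255.

```shell
# Without a group, {3} applies to the b alone.
no_group=$(printf 'abbb\nababab\n' | grep -E '^ab{3}$')
echo "$no_group"   # prints: abbb

# With a group, {3} applies to the whole unit ab.
group=$(printf 'abbb\nababab\n' | grep -E '^(ab){3}$')
echo "$group"      # prints: ababab

# "1 to 3 digits plus a dot", three times, then 1 to 3 more digits.
dotted=$(printf '192.168.1.10\n192.168.1\n' | grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$')
echo "$dotted"     # prints: 192.168.1.10
```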

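Also note that () is sometimes used to stand for the empty string (not a space). With the grep used here, it behaves as follows: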

Administrator@51B6904C3C8A485 ~/reg
$ echo "ab" | egrep "a b"

Administrator@51B6904C3C8A485 ~/reg
$ echo "ab" | egrep "a()b"
ab

Administrator@51B6904C3C8A485 ~/reg
$ echo "ab" | egrep "a b"

Administrator@51B6904C3C8A485 ~/reg
$ echo "a b" | egrep "a()b"

Administrator@51B6904C3C8A485 ~/reg
$ echo "a b" | egrep "a b"
a b

Administrator@51B6904C3C8A485 ~/reg
$

11. | means "or", which is easy to understand.

Administrator@51B6904C3C8A485 ~/reg
$ egrep "^g|p$" test.txt
good good study
daydayup

Administrator@51B6904C3C8A485 ~/reg
$

The pattern above selects lines that begin with g or end with p.
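
One point worth sketching is that | binds more loosely than everything else, so ^g|p$ reads as "(^g) or (p$)". Parentheses limit how far the alternation reaches. The extra input line study is made up for illustration.

```shell
# ^g|p$ : lines starting with g, or lines ending with p.
either=$(printf 'good good study\ndaydayup\nstudy\n' | grep -E '^g|p$')
echo "$either"

# ^(g|s) : the anchor applies to both alternatives inside the group.
start=$(printf 'good good study\ndaydayup\nstudy\n' | grep -E '^(g|s)')
echo "$start"
```

The first command prints good good study and daydayup; the second prints good good study and study.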

OK, that is all for now. More will be added as new material comes up.
