Regular Expression Examples

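This post covers two techniques: zero-width assertions in grep, and extracting several fields from a single line with sed.

Zero-width assertions, also called lookarounds, are among the most useful parts of regular expressions. A zero-width assertion does not consume any characters; it matches a position. ^ (start of line) and $ (end of line) behave the same way: each matches only a position. A lookaround can match any position that a sub-expression can describe, and it is written in one of four forms: (?=...), (?!...), (?<=...) and (?<!...). Here ... stands for a sub-expression, not three literal dots.

1. (?=...) positive lookahead: matches a position whose right-hand side matches the expression.
2. (?!...) negative lookahead: matches a position whose right-hand side does not match the expression.
3. (?<=...) positive lookbehind: matches a position whose left-hand side matches the expression.
4. (?<!...) negative lookbehind: matches a position whose left-hand side does not match the expression.

An example makes this concrete. The file train.txt contains a single line:

    $ cat train.txt
    "train_no":"5l0000G10241","station_train_code":"G102","start_station_telecode":"AOH","start_station_name":"上海虹桥","end_station_telecode":"VNP","end_station_name":"北京南","from_station_telecode":"AOH","from_station_name":"上海虹桥","to_station_telecode":"VNP","to_station_name":"北京南","start_time":"06:43","arrive_time":"12:17","day_difference":"0","train_class_name":"","lishi":"05:34","canWebBuy":"Y","lishiValue":"334","yp_info":"O055300630M0933000649174800008","control_train_day":"20300303","start_train_date":"20150718","seat_feature":"O3M393","yp_ex":"O0M090","train_seat_feature":"3","seat_types":"OM9","location_code":"H1","from_station_no":"01","to_station_no":"09","control_day":59,"sale_time":"1330","is_support_card":"1","note":"","gg_num":"--","gr_num":"--","qt_num":"--","rw_num":"--","rz_num":"--","tz_num":"--","wz_num":"--","yb_num":"--","yw_num":"--","yz_num":"--","ze_num":"630","zy_num":"64","swz_num":"8"

a. Extracting the train number G102. The target has the following properties:

1. To its left is: station_train_code":"
2. To its right is: "
3. The match itself is any run of characters that does not contain "

With grep this takes two lookarounds (-o prints only the matched text, -P enables Perl-compatible patterns):

    $ grep -oP '(?<=station_train_code":")[^"]+(?=")' train.txt
    G102

b. Extracting several values at once, such as G102, AOH, 上海虹桥 and 北京南. A first attempt joins two lookaround patterns with .* (shown here only for the train number and the departure station name):

    $ grep -oP '(?<=station_train_code":")[^"]+(?=").*(?<=start_station_name":")[^"]+(?=")' train.txt
    G102","start_station_telecode":"AOH","start_station_name":"上海虹桥

The key names start_station_telecode and start_station_name are not what we want, but they lie in the stretch matched by .*, so they are printed as well. Is there a way to output only parts of a match? Yes: backreferences. \1, \2, ... \9 stand for the text matched by the corresponding parenthesized group earlier in the pattern.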
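Before moving on, here is a quick look at the two negative forms from the list of four, which the examples so far have not used. The following is only a sketch under a few assumptions: grep is GNU grep built with PCRE support (the same -P option used above), the sample is a shortened ASCII-only excerpt of the train.txt line, and the function names seat_counts and other_telecodes are made up for illustration.

```shell
#!/bin/sh
# Shortened excerpt of the sample line (assumption: same key/value layout).
sample='"start_station_telecode":"AOH","end_station_telecode":"VNP","from_station_telecode":"AOH","gg_num":"--","ze_num":"630","zy_num":"64","swz_num":"8"'

# Negative lookahead: values of *_num keys that do not start with "--".
# (?<=_num":") fixes the position right after the key,
# (?!--) rejects that position when the value begins with two dashes.
seat_counts() {
    grep -oP '(?<=_num":")(?!--)[^"]+'
}

# Negative lookbehind: telecode values whose key is not start_station_telecode.
# Both lookbehinds are tested at the same position; the second one vetoes it.
other_telecodes() {
    grep -oP '(?<=station_telecode":")(?<!start_station_telecode":")[^"]+'
}

printf '%s\n' "$sample" | seat_counts       # prints 630, 64 and 8, one per line
printf '%s\n' "$sample" | other_telecodes   # prints VNP and AOH, one per line
```

Several lookarounds may be placed at the same position, as in other_telecodes; the position is accepted only if every one of them succeeds.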
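This is where sed comes in: it can extract several pieces (up to nine) from a single line. Wrap each part to be extracted in parentheses, then refer to the groups in the replacement:

    $ sed -r 's/.*station_train_code":"([^"]+).*start_station_name":"([^"]+).*end_station_name":"([^"]+).*start_time":"([^"]+).*arrive_time":"([^"]+).*ze_num":"([^"]+).*zy_num":"([^"]+).*swz_num":"([^"]+).*/\1 \2 \3 \4 \5 \6 \7 \8/' train.txt
    G102 上海虹桥 北京南 06:43 12:17 630 64 8

If you need more than nine fields from one line, a single sed substitution is no longer enough and you have to write a small program. A language with regular expression support is the most convenient; without one, fall back to split and search. Neither is complicated.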
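The split-and-search fallback can be sketched in awk, which keeps everything in the shell. This is an illustration, not the only way to do it: the helper name extract_fields and its comma-separated key argument are hypothetical, and it assumes the flat "key":"value" layout of train.txt with no escaped quotes inside values.

```shell
#!/bin/sh
# extract_fields KEYS: read lines on stdin and print, for each line, the
# values of the comma-separated KEYS, separated by single spaces.
# Hypothetical helper; a missing or unquoted key yields an empty field.
extract_fields() {
    awk -v keys="$1" '
    BEGIN { n = split(keys, want, ",") }       # split the key list once
    {
        out = ""
        for (i = 1; i <= n; i++) {
            needle = "\"" want[i] "\":\""      # the key with its quotes and colon
            pos = index($0, needle)            # plain substring search, no regex
            val = ""
            if (pos > 0) {
                rest = substr($0, pos + length(needle))
                val = substr(rest, 1, index(rest, "\"") - 1)
            }
            out = out (i > 1 ? " " : "") val
        }
        print out
    }'
}

# Small self-contained demonstration on an excerpt of the sample line.
sample='"station_train_code":"G102","start_time":"06:43","arrive_time":"12:17","ze_num":"630","zy_num":"64","swz_num":"8"'
printf '%s\n' "$sample" | extract_fields 'station_train_code,start_time,arrive_time,ze_num,zy_num,swz_num'
# prints: G102 06:43 12:17 630 64 8
```

Against the real file, the key list can be as long as needed, for example:

    $ extract_fields 'station_train_code,start_station_name,end_station_name,start_time,arrive_time,lishi,from_station_no,to_station_no,ze_num,zy_num,swz_num' < train.txt

Because the search string includes the opening quote of the key, a key such as start_time cannot accidentally match inside a longer key name.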
