Perl multi-line regex to separate comments within a paragraph

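The script below works, but it needs a kludge. One line of code, marked "kludge", makes the script do what I want, but I do not understand why that line is needed. Clearly I do not fully understand what a multi-line regex substitution ending in /mg is doing.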

Is there a more elegant way to accomplish the task?

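The script reads a file paragraph by paragraph. It partitions each paragraph into two subsets: $text and $cmnt. $text holds the left part of every line, i.e. from the first column up to the first % if there is one, or up to the end of the line if there is not. $cmnt holds the rest.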

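Motivation: the files to be read are LaTeX markup, where % announces the start of a comment. If we were reading Perl scripts instead, we could change the value of $breaker to #. Once $text is separated from $cmnt, one can search for a match across lines with something like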

print "match" if ($text =~ /WOLF\s*DOG/s);

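See the line marked "kludge". Without that line, something funny happens after the last % in the record. If there are lines of $text (material not commented out by %) after the last commented line of the record, those lines are included both at the end of $cmnt and in $text.

In the example below, this means that in record 2, without the kludge, " cat lion" is included both in $text, where it belongs, and in $cmnt.

(The kludge causes an unwanted % to appear at the end of every non-empty $cmnt. That is because the % pasted on by the kludge announces a final, fictitious, empty comment line.)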

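According to https://perldoc.perl.org/perlre.html#Modifiers, the /m regex modifier means:

Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.

Therefore I would expect the second match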

s/^([^$breaker]*)($breaker.*?)$/$2/mg

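to start at the first %, extend to the end of the line, and stop there. So even without the kludge, it should not include " cat lion" in record 2? But obviously it does, so I am misreading, or missing, some part of the documentation. I suspect it has to do with the /g regex modifier?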

#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++; 
    my $text = $_; 
    my $cmnt;
    s/[\n]*\z/$breaker/; # kludge
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==lineFeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)  # non-greedy
    {
        $cmnt    = $_; 
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);  # non-greedy
    }
    else
    {
        $cmnt    = ''; 
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}

Example file to run it on:

dog wolf % flea 
DOG WOLF % FLEA 
DOG WOLLLLLLF % FLLLLLLEA 


% what was that?
 cat lion


no comments in this line




%The last paragraph of this file is nothing but a single-line comment.

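Solution

You must also remove from $cmnt the lines that contain no comment. The s///mg substitution only rewrites the lines its pattern matches, so a line without $breaker is left in $cmnt untouched: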

use feature qw(say);
use strict;
use warnings;

my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++;
    my $text = $_;
    my $cmnt;
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)  # non-greedy
    {
        $cmnt    = $_;
        $cmnt =~ s/^[^$breaker]*?$//mg;
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);  # non-greedy
    }
    else
    {
        $cmnt    = '';
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}

Output:

RECORD 1:
******** text==
|dog wolf 
DOG WOLF 
DOG WOLLLLLLF 

|
******** cmnt==|% flea 
% FLEA 
% FLLLLLLEA 
|

RECORD 2:
******** text==
|
 cat lion

|
******** cmnt==|% what was that?

|

RECORD 3:
******** text==
|no comments in this line

|
******** cmnt==||

RECORD 4:
******** text==
||
******** cmnt==|%The last paragraph of this file is nothing but a single-line comment.
|
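
To see why the extra substitution is needed, here is a minimal sketch run on record 2 only. The helper names comments_without_cleanup and comments_with_cleanup are made up for this illustration; the two regexes are the ones from the script above.

```perl
use strict;
use warnings;

my $breaker = '%';

# Hypothetical helper: apply only the "keep $2" substitution.
# Lines that contain no $breaker never match, so they stay as they are.
sub comments_without_cleanup {
    my ($record) = @_;
    (my $cmnt = $record) =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg;
    return $cmnt;
}

# Hypothetical helper: first empty every line that has no $breaker,
# then apply the same "keep $2" substitution.
sub comments_with_cleanup {
    my ($record) = @_;
    my $cmnt = $record;
    $cmnt =~ s/^[^$breaker]*?$//mg;
    $cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg;
    return $cmnt;
}

my $record = "% what was that?\n cat lion\n";
print "without cleanup:\n|", comments_without_cleanup($record), "|\n";  # " cat lion" survives
print "with cleanup:\n|",    comments_with_cleanup($record),    "|\n";  # " cat lion" is gone
```

The second call leaves an empty line where " cat lion" used to be, which matches the RECORD 2 output shown above.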

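My main confusion was a failure to distinguish between

whether the entire record matches (here, a record may be a multi-line paragraph), and
whether a line inside the record matches.

The following script combines the insights of the two answers provided by others, and includes extensive explanation.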
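
The distinction can be made concrete with a small sketch. count_comment_lines is a hypothetical helper, and its pattern uses [^$breaker\n] (my assumption, not the pattern from the scripts) so that each match stays within one line.

```perl
use strict;
use warnings;

my $breaker = '%';

# Hypothetical helper: count the lines of a record that carry a comment.
# "my $n = () = ..." forces list context and then counts the matches.
sub count_comment_lines {
    my ($record) = @_;
    my $n = () = $record =~ /^[^$breaker\n]*$breaker.*$/mg;
    return $n;
}

my $record = "two %two comment\nthuh-ree\nfower\nfi-yiv % (he means 5)\n";

# The record as a whole "matches" as soon as one line does ...
print "record has a comment\n" if $record =~ /$breaker/;

# ... but only some of its lines match individually.
print count_comment_lines($record), " of 4 lines carry a comment\n";
```

A true result from s///mg therefore says only that at least one line was rewritten; it says nothing about the other lines.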

#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';

$/ = ''; # one paragraph at a time
while(<DATA>)
{
    $count_record++; 
    my $text = $_; 
    my $cmnt;
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    print "RECORD $count_record:";
    print "\n|"; print $_; print "|\n";
    # https://perldoc.perl.org/perlre.html#Modifiers
    # the following regex:
    # ^                     /m: ^ == start of line, not of record
    # ([^$breaker]*)        zero or more characters that are not $breaker (newlines included)
    # ($breaker.*?)         non-greedy: the first instance of $breaker, followed by everything after $breaker
    # $                     /m: $ == end   of line, not of record
    #                       /g: "globally match the pattern repeatedly in the string"
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)
    {
        $cmnt    = $_;
        # In at least one line of this record, the pattern above has matched.
        # But this does not mean every line matches. There may be any number of
        # lines inside the record that do not match /$breaker/; for these lines,
        # in spite of /g, there will be no match, and thus the exclusion of $1 and
        # printing only of $2, in the substitution below, will not take place.
        # Thus, those particular lines must be deleted from $cmnt.
        # Thus:
        $cmnt =~ s/^[^$breaker]*?$/\n/mg; # blank out every line that does not match /$breaker/ (each becomes "\n")
        # recall that /m guarantees that ^ and $ match the start and end of the line, not of the record.
        die "code error: cmnt does not match this record" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);
        if ( $text =~ /\S/ )
        {
            print "|text|==\n|$text|\n";
        }
        else
        {
            print "NO text found\n";
        }
        print "|cmnt|==\n|$cmnt|\n";
    }
    else
    {
        print "NO comment found\n";
    }
}

__DATA__
one dogs% one comment %d**n lies %statistics
two %two comment
thuh-ree
fower
fi-yiv % (he means 5)
SIX 66 % ¿666==antichrist?
seven % the seventh seal,the seven days
ate
niner
ten

As Douglass said to Lincoln ... 

%Darryl Pinckney
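
One detail the comments above leave implicit: a negated class such as [^%] also matches "\n", so $1 can swallow several comment-free lines before it reaches a %. That appears to be why the kludge in the question worked, since the appended % gave the trailing lines something to run into. A minimal sketch, with hypothetical helper names:

```perl
use strict;
use warnings;

my $breaker = '%';

# Hypothetical helper: what does $1 capture on the first match
# when the pattern from the script is used as-is?
sub first_prefix {
    my ($record) = @_;
    return $record =~ /^([^$breaker]*)($breaker.*?)$/m ? $1 : undef;
}

# Variant (an assumption, not from the original scripts): also exclude
# "\n" from the class, so that $1 cannot cross a line boundary.
sub first_prefix_one_line {
    my ($record) = @_;
    return $record =~ /^([^$breaker\n]*)($breaker.*?)$/m ? $1 : undef;
}

my $record = "thuh-ree\nfower\nfi-yiv % (he means 5)\n";

my $wide   = first_prefix($record);           # spans three lines
my $narrow = first_prefix_one_line($record);  # only "fi-yiv "

printf "as written: %d newline(s) inside \$1\n", ($wide   =~ tr/\n//);
printf "with \\n excluded: %d newline(s) inside \$1\n", ($narrow =~ tr/\n//);
```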

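The regex modifiers /m and /g are meant for a string that contains multiple lines (that is, embedded \n characters). /m lets ^ and $ match at the start and end of every line, and /g makes the regex walk through all the lines in the string instead of stopping at the first match.

Please study the following code, which should simplify the solution to the problem.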
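
As a quick illustration of what the two modifiers contribute, here is a small sketch; line_initials is a hypothetical helper written only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the first word character of every place
# where ^ is allowed to match, with or without /m.
sub line_initials {
    my ($string, $multiline) = @_;
    return $multiline
        ? [ $string =~ /^(\w)/mg ]   # ^ matches at the start of each line
        : [ $string =~ /^(\w)/g  ];  # ^ matches only at the start of the string
}

my $string = "a\nb\nc\n";
print "with /m:    @{ line_initials($string, 1) }\n";
print "without /m: @{ line_initials($string, 0) }\n";
```

With /mg the pattern is tried once per line; with /g alone it can succeed only at the very start of the string.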

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $breaker = '%';
my @records = do { local $/ = ''; <DATA> };

for( @records ) {
    my %hash = ( /(.*?)$breaker(.*)/mg );
    next unless %hash;
    say Dumper(\%hash);
}

__DATA__
dog wolf % flea 
DOG WOLF % FLEA 
DOG WOLLLLLLF % FLLLLLLEA 


% what was that?
 cat lion


no comments in this line




%The last paragraph of this file is nothing but a single-line comment.

Output

$VAR1 = {
          'DOG WOLF ' => ' FLEA ',
          'dog wolf ' => ' flea ',
          'DOG WOLLLLLLF ' => ' FLLLLLLEA '
        };

$VAR1 = {
          '' => ' what was that?'
        };

$VAR1 = {
          '' => 'The last paragraph of this file is nothing but a single-line comment.'
        };
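
Note that the hash above keeps only lines that contain a %, and two lines with the same text before the % would collide on the same key. If $text and $cmnt are both wanted as strings, a line-by-line split is one possible alternative. This is only a sketch: split_record is a hypothetical helper, and like the scripts above it does not treat an escaped \% specially.

```perl
use strict;
use warnings;

my $breaker = '%';

# Hypothetical helper: split one record into ($text, $cmnt), handling each
# line on its own so that no regex ever has to cross a newline.
sub split_record {
    my ($record) = @_;
    my ($text, $cmnt) = ('', '');
    for my $line (split /\n/, $record) {
        my $at = index($line, $breaker);
        if ($at < 0) {                               # no comment on this line
            $text .= "$line\n";
        }
        else {
            $text .= substr($line, 0, $at) . "\n";   # part before the breaker
            $cmnt .= substr($line, $at)    . "\n";   # breaker and everything after
        }
    }
    return ($text, $cmnt);
}

my ($text, $cmnt) = split_record("% what was that?\n cat lion\n");
print "text==|$text|\ncmnt==|$cmnt|\n";
```

Because every line is examined separately, neither the kludge nor the extra cleanup substitution is needed in this variant.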