deblob-psed-disabled

   1 #!/usr/bin/perl
   2     eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
   3         if $running_under_some_shell;
   4 my $startperl;
   5 my $perlpath;
   6 ($startperl = <<'/../') =~ s/\s*\z//;
   7 #!/usr/bin/perl
   8 /../
   9 ($perlpath = <<'/../') =~ s/\s*\z//;
  10 /usr/bin/perl
  11 /../
  12
  13 $0 =~ s/^.*?(\w+)[\.\w]*$/$1/;
  14
  15 # (p)sed - a stream editor
  16 # History:  Aug 12 2000: Original version.
  17 #           Mar 25 2002: Rearrange generated Perl program.
  18 #           Jul 23 2007: Fix bug in regex stripping (M.Thorland)
  19
  20 use strict;
  21 use integer;
  22 use Symbol;
  23
  24 =head1 NAME
  25
  26 psed - a stream editor
  27
  28 =head1 SYNOPSIS
  29
  30    psed [-an] script [file ...]
  31    psed [-an] [-e script] [-f script-file] [file ...]
  32
  33    s2p  [-an] [-e script] [-f script-file]
  34
  35 =head1 DESCRIPTION
  36
  37 A stream editor reads the input stream consisting of the specified files
  38 (or standard input, if none are given), processes is line by line by
  39 applying a script consisting of edit commands, and writes resulting lines
  40 to standard output. The filename `C<->' may be used to read standard input.
  41
  42 The edit script is composed from arguments of B<-e> options and
  43 script-files, in the given order. A single script argument may be specified
  44 as the first parameter.
  45
  46 If this program is invoked with the name F<s2p>, it will act as a
  47 sed-to-Perl translator. See L<"sed Script Translation">.
  48
  49 B<sed> returns an exit code of 0 on success or >0 if an error occurred.
  50
  51 =head1 OPTIONS
  52
  53 =over 4
  54
  55 =item B<-a>
  56
  57 A file specified as argument to the B<w> edit command is by default
  58 opened before input processing starts. Using B<-a>, opening of such
  59 files is delayed until the first line is actually written to the file.
  60
  61 =item B<-e> I<script>
  62
  63 The editing commands defined by I<script> are appended to the script.
  64 Multiple commands must be separated by newlines.
  65
  66 =item B<-f> I<script-file>
  67
  68 Editing commands from the specified I<script-file> are read and appended
  69 to the script.
  70
  71 =item B<-n>
  72
  73 By default, a line is written to standard output after the editing script
  74 has been applied to it. The B<-n> option suppresses automatic printing.
  75
  76 =back
  77
  78 =head1 COMMANDS
  79
  80 B<sed> command syntax is defined as
  81
  82 Z<> Z<> Z<> Z<>[I<address>[B<,>I<address>]][B<!>]I<function>[I<argument>]
  83
  84 with whitespace being permitted before or after addresses, and between
  85 the function character and the argument. The I<address>es and the
  86 address inverter (C<!>) are used to restrict the application of a
  87 command to the selected line(s) of input.
  88
  89 Each command must be on a line of its own, except where noted in
  90 the synopses below.
  91
  92 The edit cycle performed on each input line consist of reading the line
  93 (without its trailing newline character) into the I<pattern space>,
  94 applying the applicable commands of the edit script, writing the final
  95 contents of the pattern space and a newline to the standard output.
  96 A I<hold space> is provided for saving the contents of the
  97 pattern space for later use.
  98
  99 =head2 Addresses
 100
 101 A sed address is either a line number or a pattern, which may be combined
 102 arbitrarily to construct ranges. Lines are numbered across all input files.
 103
 104 Any address may be followed by an exclamation mark (`C<!>'), selecting
 105 all lines not matching that address.
 106
 107 =over 4
 108
 109 =item I<number>
 110
 111 The line with the given number is selected.
 112
 113 =item B<$>
 114
 115 A dollar sign (C<$>) is the line number of the last line of the input stream.
 116
 117 =item B</>I<regular expression>B</>
 118
 119 A pattern address is a basic regular expression (see
 120 L<"Basic Regular Expressions">), between the delimiting character C</>.
 121 Any other character except C<\> or newline may be used to delimit a
 122 pattern address when the initial delimiter is prefixed with a
 123 backslash (`C<\>').
 124
 125 =back
 126
 127 If no address is given, the command selects every line.
 128
 129 If one address is given, it selects the line (or lines) matching the
 130 address.
 131
 132 Two addresses select a range that begins whenever the first address
 133 matches, and ends (including that line) when the second address matches.
 134 If the first (second) address is a matching pattern, the second
 135 address is not applied to the very same line to determine the end of
 136 the range. Likewise, if the second address is a matching pattern, the
 137 first address is not applied to the very same line to determine the
 138 begin of another range. If both addresses are line numbers,
 139 and the second line number is less than the first line number, then
 140 only the first line is selected.
 141
 142
 143 =head2 Functions
 144
 145 The maximum permitted number of addresses is indicated with each
 146 function synopsis below.
 147
 148 The argument I<text> consists of one or more lines following the command.
 149 Embedded newlines in I<text> must be preceded with a backslash.  Other
 150 backslashes in I<text> are deleted and the following character is taken
 151 literally.
 152
 153 =over 4
 154
 155 =cut
 156
 157 my %ComTab;
 158 my %GenKey;
 159 #--------------------------------------------------------------------------
 160 $ComTab{'a'}=[ 1, 'txt', \&Emit,       '{ push( @Q, <<'."'TheEnd' ) }\n" ]; #ok
 161
 162 =item [1addr]B<a\> I<text>
 163
 164 Write I<text> (which must start on the line following the command)
 165 to standard output immediately before reading the next line
 166 of input, either by executing the B<N> function or by beginning a new cycle.
 167
 168 =cut
 169
 170 #--------------------------------------------------------------------------
 171 $ComTab{'b'}=[ 2, 'str', \&Branch,     '{ goto XXX; }'                   ]; #ok
 172
 173 =item [2addr]B<b> [I<label>]
 174
 175 Branch to the B<:> function with the specified I<label>. If no label
 176 is given, branch to the end of the script.
 177
 178 =cut
 179
 180 #--------------------------------------------------------------------------
 181 $ComTab{'c'}=[ 2, 'txt', \&Change,     <<'-X-'                           ]; #ok
 182 { print <<'TheEnd'; } $doPrint = 0; goto EOS;
 183 -X-
 184 ### continue OK => next CYCLE;
 185
 186 =item [2addr]B<c\> I<text>
 187
 188 The line, or range of lines, selected by the address is deleted.
 189 The I<text> (which must start on the line following the command)
 190 is written to standard output. With an address range, this occurs at
 191 the end of the range.
 192
 193 =cut
 194
 195 #--------------------------------------------------------------------------
 196 $ComTab{'d'}=[ 2, '',    \&Emit,       <<'-X-'                           ]; #ok
 197 { $doPrint = 0;
 198   goto EOS;
 199 }
 200 -X-
 201 ### continue OK => next CYCLE;
 202
 203 =item [2addr]B<d>
 204
 205 Deletes the pattern space and starts the next cycle.
 206
 207 =cut
 208
 209 #--------------------------------------------------------------------------
 210 $ComTab{'D'}=[ 2, '',    \&Emit,       <<'-X-'                           ]; #ok
 211 { s/^.*\n?//;
 212   if(length($_)){ goto BOS } else { goto EOS }
 213 }
 214 -X-
 215 ### continue OK => next CYCLE;
 216
 217 =item [2addr]B<D>
 218
 219 Deletes the pattern space through the first embedded newline or to the end.
 220 If the pattern space becomes empty, a new cycle is started, otherwise
 221 execution of the script is restarted.
 222
 223 =cut
 224
 225 #--------------------------------------------------------------------------
 226 $ComTab{'g'}=[ 2, '',    \&Emit,       '{ $_ = $Hold };'                 ]; #ok
 227
 228 =item [2addr]B<g>
 229
 230 Replace the contents of the pattern space with the hold space.
 231
 232 =cut
 233
 234 #--------------------------------------------------------------------------
 235 $ComTab{'G'}=[ 2, '',    \&Emit,       '{ $_ .= "\n"; $_ .= $Hold };'    ]; #ok
 236
 237 =item [2addr]B<G>
 238
 239 Append a newline and the contents of the hold space to the pattern space.
 240
 241 =cut
 242
 243 #--------------------------------------------------------------------------
 244 $ComTab{'h'}=[ 2, '',    \&Emit,       '{ $Hold = $_ }'                  ]; #ok
 245
 246 =item [2addr]B<h>
 247
 248 Replace the contents of the hold space with the pattern space.
 249
 250 =cut
 251
 252 #--------------------------------------------------------------------------
 253 $ComTab{'H'}=[ 2, '',    \&Emit,       '{ $Hold .= "\n"; $Hold .= $_; }' ]; #ok
 254
 255 =item [2addr]B<H>
 256
 257 Append a newline and the contents of the pattern space to the hold space.
 258
 259 =cut
 260
 261 #--------------------------------------------------------------------------
 262 $ComTab{'i'}=[ 1, 'txt', \&Emit,       '{ print <<'."'TheEnd' }\n"       ]; #ok
 263
 264 =item [1addr]B<i\> I<text>
 265
 266 Write the I<text> (which must start on the line following the command)
 267 to standard output.
 268
 269 =cut
 270
 271 #--------------------------------------------------------------------------
 272 $ComTab{'l'}=[ 2, '',    \&Emit,       '{ _l() }'                        ]; #okUTF8
 273
 274 =item [2addr]B<l>
 275
 276 Print the contents of the pattern space: non-printable characters are
 277 shown in C-style escaped form; long lines are split and have a trailing
 278 `C<\>' at the point of the split; the true end of a line is marked with
 279 a `C<$>'. Escapes are: `\a', `\t', `\n', `\f', `\r', `\e' for
 280 BEL, HT, LF, FF, CR, ESC, respectively, and `\' followed by a three-digit
 281 octal number for all other non-printable characters.
 282
 283 =cut
 284
 285 #--------------------------------------------------------------------------
 286 $ComTab{'n'}=[ 2, '',    \&Emit,       <<'-X-'                           ]; #ok
 287 { print $_, "\n" if $doPrint;
 288   printQ() if @Q;
 289   $CondReg = 0;
 290   last CYCLE unless getsARGV();
 291   chomp();
 292 }
 293 -X-
 294
 295 =item [2addr]B<n>
 296
 297 If automatic printing is enabled, write the pattern space to the standard
 298 output. Replace the pattern space with the next line of input. If
 299 there is no more input, processing is terminated.
 300
 301 =cut
 302
 303 #--------------------------------------------------------------------------
 304 $ComTab{'N'}=[ 2, '',    \&Emit,       <<'-X-'                           ]; #ok
 305 { printQ() if @Q;
 306   $CondReg = 0;
 307   last CYCLE unless getsARGV( $h );
 308   chomp( $h );
 309   $_ .= "\n$h";
 310 }
 311 -X-
 312
 313 =item [2addr]B<N>
 314
 315 Append a newline and the next line of input to the pattern space. If
 316 there is no more input, processing is terminated.
 317
 318 =cut
 319
 320 #--------------------------------------------------------------------------
 321 $ComTab{'p'}=[ 2, '',    \&Emit,       '{ print $_, "\n"; }'             ]; #ok
 322
 323 =item [2addr]B<p>
 324
 325 Print the pattern space to the standard output. (Use the B<-n> option
 326 to suppress automatic printing at the end of a cycle if you want to
 327 avoid double printing of lines.)
 328
 329 =cut
 330
 331 #--------------------------------------------------------------------------
 332 $ComTab{'P'}=[ 2, '',    \&Emit,       <<'-X-'                           ]; #ok
 333 { if( /^(.*)/ ){ print $1, "\n"; } }
 334 -X-
 335
 336 =item [2addr]B<P>
 337
 338 Prints the pattern space through the first embedded newline or to the end.
 339
 340 =cut
 341
 342 #--------------------------------------------------------------------------
 343 $ComTab{'q'}=[ 1, 'str',    \&Emit,       <<'-X-'                           ]; #ok
 344 { print $_, "\n" if $doPrint;
 345   $exitstatus = '-X-';
 346   last CYCLE;
 347 }
 348 -X-
 349
 350 =item [1addr]B<q>
 351
 352 Branch to the end of the script and quit without starting a new cycle.
 353
 354 =cut
 355
 356 #--------------------------------------------------------------------------
 357 $ComTab{'r'}=[ 1, 'str', \&Emit,       "{ _r( '-X-' ) }"                 ]; #ok
 358
 359 =item [1addr]B<r> I<file>
 360
 361 Copy the contents of the I<file> to standard output immediately before
 362 the next attempt to read a line of input. Any error encountered while
 363 reading I<file> is silently ignored.
 364
 365 =cut
 366
 367 #--------------------------------------------------------------------------
 368 $ComTab{'s'}=[ 2, 'sub', \&Emit,       ''                                ]; #ok
 369
 370 =item [2addr]B<s/>I<regular expression>B</>I<replacement>B</>I<flags>
 371
 372 Substitute the I<replacement> string for the first substring in
 373 the pattern space that matches the I<regular expression>.
 374 Any character other than backslash or newline can be used instead of a
 375 slash to delimit the regular expression and the replacement.
 376 To use the delimiter as a literal character within the regular expression
 377 and the replacement, precede the character by a backslash (`C<\>').
 378
 379 Literal newlines may be embedded in the replacement string by
 380 preceding a newline with a backslash.
 381
 382 Within the replacement, an ampersand (`C<&>') is replaced by the string
 383 matching the regular expression. The strings `C<\1>' through `C<\9>' are
 384 replaced by the corresponding subpattern (see L<"Basic Regular Expressions">).
 385 To get a literal `C<&>' or `C<\>' in the replacement text, precede it
 386 by a backslash.
 387
 388 The following I<flags> modify the behaviour of the B<s> command:
 389
 390 =over 8
 391
 392 =item B<g>
 393
 394 The replacement is performed for all matching, non-overlapping substrings
 395 of the pattern space.
 396
 397 =item B<1>..B<9>
 398
 399 Replace only the n-th matching substring of the pattern space.
 400
 401 =item B<p>
 402
 403 If the substitution was made, print the new value of the pattern space.
 404
 405 =item B<w> I<file>
 406
 407 If the substitution was made, write the new value of the pattern space
 408 to the specified file.
 409
 410 =back
 411
 412 =cut
 413
 414 #--------------------------------------------------------------------------
 415 $ComTab{'t'}=[ 2, 'str', \&Branch,     '{ goto XXX if _t() }'            ]; #ok
 416
 417 =item [2addr]B<t> [I<label>]
 418
 419 Branch to the B<:> function with the specified I<label> if any B<s>
 420 substitutions have been made since the most recent reading of an input line
 421 or execution of a B<t> function. If no label is given, branch to the end of
 422 the script.
 423
 424
 425 =cut
 426
 427 #--------------------------------------------------------------------------
 428 $ComTab{'w'}=[ 2, 'str', \&Write,      "{ _w( '-X-' ) }"                 ]; #ok
 429
 430 =item [2addr]B<w> I<file>
 431
 432 The contents of the pattern space are written to the I<file>.
 433
 434 =cut
 435
 436 #--------------------------------------------------------------------------
 437 $ComTab{'x'}=[ 2, '',    \&Emit,       '{ ($Hold, $_) = ($_, $Hold) }'   ]; #ok
 438
 439 =item [2addr]B<x>
 440
 441 Swap the contents of the pattern space and the hold space.
 442
 443 =cut
 444
 445 #--------------------------------------------------------------------------
 446 $ComTab{'y'}=[ 2, 'tra', \&Emit,       ''                                ]; #ok
 447 =item [2addr]B<y>B</>I<string1>B</>I<string2>B</>
 448
 449 In the pattern space, replace all characters occuring in I<string1> by the
 450 character at the corresponding position in I<string2>. It is possible
 451 to use any character (other than a backslash or newline) instead of a
 452 slash to delimit the strings.  Within I<string1> and I<string2>, a
 453 backslash followed by any character other than a newline is that literal
 454 character, and a backslash followed by an `n' is replaced by a newline
 455 character.
 456
 457 =cut
 458
 459 #--------------------------------------------------------------------------
 460 $ComTab{'='}=[ 1, '',    \&Emit,       '{ print "$.\n" }'                ]; #ok
 461
 462 =item [1addr]B<=>
 463
 464 Prints the current line number on the standard output.
 465
 466 =cut
 467
 468 #--------------------------------------------------------------------------
 469 $ComTab{':'}=[ 0, 'str', \&Label,      ''                                ]; #ok
 470
 471 =item [0addr]B<:> [I<label>]
 472
 473 The command specifies the position of the I<label>. It has no other effect.
 474
 475 =cut
 476
 477 #--------------------------------------------------------------------------
 478 $ComTab{'{'}=[ 2, '',    \&BeginBlock, '{'                               ]; #ok
 479 $ComTab{'}'}=[ 0, '',    \&EndBlock,   ';}'                              ]; #ok
 480 # ';' to avoid warning on empty {}-block
 481
 482 =item [2addr]B<{> [I<command>]
 483
 484 =item [0addr]B<}>
 485
 486 These two commands begin and end a command list. The first command may
 487 be given on the same line as the opening B<{> command. The commands
 488 within the list are jointly selected by the address(es) given on the
 489 B<{> command (but may still have individual addresses).
 490
 491 =cut
 492
 493 #--------------------------------------------------------------------------
 494 $ComTab{'#'}=[ 0, 'str', \&Comment,    ''                                ]; #ok
 495
 496 =item [0addr]B<#> [I<comment>]
 497
 498 The entire line is ignored (treated as a comment). If, however, the first
 499 two characters in the script are `C<#n>', automatic printing of output is
 500 suppressed, as if the B<-n> option were given on the command line.
 501
 502 =back
 503
 504 =cut
 505
 506 use vars qw{ $isEOF $Hold %wFiles @Q $CondReg $doPrint };
 507
 508 my $useDEBUG    = exists( $ENV{PSEDDEBUG} );
 509 my $useEXTBRE   = $ENV{PSEDEXTBRE} || '';
 510 $useEXTBRE =~ s/[^<>wWyB]//g; # gawk RE's handle these
 511
 512 my $doAutoPrint = 1;          # automatic printing of pattern space (-n => 0)
 513 my $doOpenWrite = 1;          # open w command output files at start (-a => 0)
 514 my $svOpenWrite = 0;          # save $doOpenWrite
 515
 516 # lower case $0 below as a VMSism.  The VMS build procedure creates the
 517 # s2p file traditionally in upper case on the disk.  When VMS is in a
 518 # case preserved or case sensitive mode, $0 will be returned in the exact
 519 # case which will be on the disk, and that is not predictable at this time.
 520
 521 my $doGenerate  = lc($0) eq 's2p';
 522
 523 # Collected and compiled script
 524 #
 525 my( @Commands, %Defined, @BlockStack, %Label, $labNum, $Code, $Func );
 526 $Code = '';
 527
 528 ##################
 529 #  Compile Time
 530 #
 531 # Labels
 532 #
 533 # Error handling
 534 #
 535 sub Warn($;$){
 536     my( $msg, $loc ) = @_;
 537     $loc ||= '';
 538     $loc .= ': ' if length( $loc );
 539     warn( "$0: $loc$msg\n" );
 540 }
 541
 542 $labNum = 0;
 543 sub newLabel(){
 544     return 'L_'.++$labNum;
 545 }
 546
 547 # safeHere: create safe here delimiter and  modify opcode and argument
 548 #
 549 sub safeHere($$){
 550     my( $codref, $argref ) = @_;
 551     my $eod = 'EOD000';
 552     while( $$argref =~ /^$eod$/m ){
 553         $eod++;
 554     }
 555     $$codref =~ s/TheEnd/$eod/e;
 556     $$argref .= "$eod\n";
 557 }
 558
 559 # Emit: create address logic and emit command
 560 #
 561 sub Emit($$$$$$){
 562     my( $addr1, $addr2, $negated, $opcode, $arg, $fl ) = @_;
 563     my $cond = '';
 564     if( defined( $addr1 ) ){
 565         if( defined( $addr2 ) ){
 566             $addr1 .= $addr2 =~ /^\d+$/ ? "..$addr2" : "...$addr2";
 567         } else {
 568             $addr1 .= ' == $.' if $addr1 =~ /^\d+$/;
 569         }
 570         $cond = $negated ? "unless( $addr1 )\n" : "if( $addr1 )\n";
 571     }
 572
 573     if( $opcode eq '' ){
 574         $Code .= "$cond$arg\n";
 575
 576     } elsif( $opcode =~ s/-X-/$arg/e ){
 577         $Code .= "$cond$opcode\n";
 578
 579     } elsif( $opcode =~ /TheEnd/ ){
 580         safeHere( \$opcode, \$arg );
 581         $Code .= "$cond$opcode$arg";
 582
 583     } else {
 584         $Code .= "$cond$opcode\n";
 585     }
 586     0;
 587 }
 588
 589 # Write (w command, w flag): store pathname
 590 #
 591 sub Write($$$$$$){
 592     my( $addr1, $addr2, $negated, $opcode, $path, $fl ) = @_;
 593     $wFiles{$path} = '';
 594     Emit( $addr1, $addr2, $negated, $opcode, $path, $fl );
 595 }
 596
 597
 598 # Label (: command): label definition
 599 #
 600 sub Label($$$$$$){
 601     my( $addr1, $addr2, $negated, $opcode, $lab, $fl ) = @_;
 602     my $rc = 0;
 603     $lab =~ s/\s+//;
 604     if( length( $lab ) ){
 605         my $h;
 606         if( ! exists( $Label{$lab} ) ){
 607             $h = $Label{$lab}{name} = newLabel();
 608         } else {
 609             $h = $Label{$lab}{name};
 610             if( exists( $Label{$lab}{defined} ) ){
 611                 my $dl = $Label{$lab}{defined};
 612                 Warn( "duplicate label $lab (first defined at $dl)", $fl );
 613                 $rc = 1;
 614             }
 615         }
 616         $Label{$lab}{defined} = $fl;
 617         $Code .= "$h:;\n";
 618     }
 619     $rc;
 620 }
 621
 622 # BeginBlock ({ command): push block start
 623 #
 624 sub BeginBlock($$$$$$){
 625     my( $addr1, $addr2, $negated, $opcode, $arg, $fl ) = @_;
 626     push( @BlockStack, [ $fl, $addr1, $addr2, $negated ] );
 627     Emit( $addr1, $addr2, $negated, $opcode, $arg, $fl );
 628 }
 629
 630 # EndBlock (} command): check proper nesting
 631 #
 632 sub EndBlock($$$$$$){
 633     my( $addr1, $addr2, $negated, $opcode, $arg, $fl ) = @_;
 634     my $rc;
 635     my $jcom = pop( @BlockStack );
 636     if( defined( $jcom ) ){
 637         $rc = Emit( $addr1, $addr2, $negated, $opcode, $arg, $fl );
 638     } else {
 639         Warn( "unexpected `}'", $fl );
 640         $rc = 1;
 641     }
 642     $rc;
 643 }
 644
 645 # Branch (t, b commands): check or create label, substitute default
 646 #
 647 sub Branch($$$$$$){
 648     my( $addr1, $addr2, $negated, $opcode, $lab, $fl ) = @_;
 649     $lab =~ s/\s+//; # no spaces at end
 650     my $h;
 651     if( length( $lab ) ){
 652         if( ! exists( $Label{$lab} ) ){
 653             $h = $Label{$lab}{name} = newLabel();
 654         } else {
 655             $h = $Label{$lab}{name};
 656         }
 657         push( @{$Label{$lab}{used}}, $fl );
 658     } else {
 659         $h = 'EOS';
 660     }
 661     $opcode =~ s/XXX/$h/e;
 662     Emit( $addr1, $addr2, $negated, $opcode, '', $fl );
 663 }
 664
 665 # Change (c command): is special due to range end watching
 666 #
 667 sub Change($$$$$$){
 668     my( $addr1, $addr2, $negated, $opcode, $arg, $fl ) = @_;
 669     my $kwd = $negated ? 'unless' : 'if';
 670     if( defined( $addr2 ) ){
 671         $addr1 .= $addr2 =~ /^\d+$/ ? "..$addr2" : "...$addr2";
 672         if( ! $negated ){
 673             $addr1  = '$icnt = ('.$addr1.')';
 674             $opcode = 'if( $icnt =~ /E0$/ )' . $opcode;
 675         }
 676     } else {
 677         $addr1 .= ' == $.' if $addr1 =~ /^\d+$/;
 678     }
 679     safeHere( \$opcode, \$arg );
 680     $Code .= "$kwd( $addr1 ){\n  $opcode$arg}\n";
 681     0;
 682 }
 683
 684
 685 # Comment (# command): A no-op. Who would've thought that!
 686 #
 687 sub Comment($$$$$$){
 688     my( $addr1, $addr2, $negated, $opcode, $arg, $fl ) = @_;
 689 ### $Code .= "# $arg\n";
 690     0;
 691 }
 692
 693 # stripRegex from the current command. If we're in the first
 694 # part of s///, trailing spaces have to be kept as the initial
 695 # part of the replacement string.
 696 #
 697 sub stripRegex($$;$){
 698     my( $del, $sref, $sub ) = @_;
 699     my $regex = $del;
 700     print "stripRegex:$del:$$sref:\n" if $useDEBUG;
 701     while( $$sref =~ s{^(.*?)(\\*)\Q$del\E(\s*)}{}s ){
 702         my ($lead, $sl, $rest) = ($1, $2, $3);
 703         $regex .= $lead.$sl.$del;
 704         if( !($lead =~ /(?:^|[^\\])\[^?\]?[^]]*$/) && length( $sl ) % 2 == 0 ){
 705             if( $sub && (length( $rest ) > 0) ){
 706                 $$sref = $rest . $$sref;
 707             }
 708             return $regex;
 709         }
 710         $regex .= $rest;
 711     }
 712     undef();
 713 }
 714
 715 # stripTrans: take a <del> terminated string from y command
 716 #   honoring and cleaning up of \-escaped <del>'s
 717 #
 718 sub stripTrans($$){
 719     my( $del, $sref ) = @_;
 720     my $t = '';
 721     print "stripTrans:$del:$$sref:\n" if $useDEBUG;
 722     while( $$sref =~ s{^(.*?)(\\*)\Q$del\E}{}s ){
 723         my $sl = $2;
 724         $t .= $1;
 725         if( length( $sl ) % 2 == 0 ){
 726             $t .= $sl;
 727             $t =~ s/\\\\/\\/g;
 728             return $t;
 729         }
 730         chop( $sl );
 731         $t .= $sl.$del.$3;
 732     }
 733     undef();
 734 }
 735
 736 # makey - construct Perl y/// from sed y///
 737 #
 738 sub makey($$$){
 739     my( $fr, $to, $fl ) = @_;
 740     my $error = 0;
 741
 742     # Ensure that any '-' is up front.
 743     # Diagnose duplicate contradicting mappings
 744     my %tr;
 745     for( my $i = 0; $i < length($fr); $i++ ){
 746         my $fc = substr($fr,$i,1);
 747         my $tc = substr($to,$i,1);
 748         if( exists( $tr{$fc} ) && $tr{$fc} ne $tc ){
 749             Warn( "ambiguous translation for character `$fc' in `y' command",
 750                   $fl );
 751             $error++;
 752         }
 753         $tr{$fc} = $tc;
 754     }
 755     $fr = $to = '';
 756     if( exists( $tr{'-'} ) ){
 757         ( $fr, $to ) = ( '-', $tr{'-'} );
 758         delete( $tr{'-'} );
 759     } else {
 760         $fr = $to = '';
 761     }
 762     # might just as well sort it...
 763     for my $fc ( sort keys( %tr ) ){
 764         $fr .= $fc;
 765         $to .= $tr{$fc};
 766     }
 767     # make embedded delimiters and newlines safe
 768     $fr =~ s/([{}])/\$1/g;
 769     $to =~ s/([{}])/\$1/g;
 770     $fr =~ s/\n/\\n/g;
 771     $to =~ s/\n/\\n/g;
 772     return $error ? undef() : "{ y{$fr}{$to}; }";
 773 }
 774
 775 ######
 776 # makes - construct Perl s/// from sed s///
 777 #
 778 sub makes($$$$$$$){
 779     my( $regex, $subst, $path, $global, $print, $nmatch, $fl ) = @_;
 780
 781     # make embedded newlines safe
 782     $regex =~ s/\n/\\n/g;
 783     $subst =~ s/\n/\\n/g;
 784
 785     my $code;
 786     # n-th occurrence
 787     #
 788     if( length( $nmatch ) ){
 789         $code = <<TheEnd;
 790 { \$n = $nmatch;
 791   while( --\$n && ( \$s = m ${regex}g ) ){}
 792   \$s = ( substr( \$_, pos() ) =~ s ${regex}${subst}s ) if \$s;
 793   \$CondReg ||= \$s;
 794 TheEnd
 795     } else {
 796         $code = <<TheEnd;
 797 { \$s = s ${regex}${subst}s${global};
 798   \$CondReg ||= \$s;
 799 TheEnd
 800     }
 801     if( $print ){
 802         $code .= '  print $_, "\n" if $s;'."\n";
 803     }
 804     if( defined( $path ) ){
 805         $wFiles{$path} = '';
 806         $code .= " _w( '$path' ) if \$s;\n";
 807         $GenKey{'w'} = 1;
 808     }
 809     $code .= "}";
 810 }
 811
 812 =head1 BASIC REGULAR EXPRESSIONS
 813
 814 A I<Basic Regular Expression> (BRE), as defined in POSIX 1003.2, consists
 815 of I<atoms>, for matching parts of a string, and I<bounds>, specifying
 816 repetitions of a preceding atom.
 817
 818 =head2 Atoms
 819
 820 The possible atoms of a BRE are: B<.>, matching any single character;
 821 B<^> and B<$>, matching the null string at the beginning or end
 822 of a string, respectively; a I<bracket expressions>, enclosed
 823 in B<[> and B<]> (see below); and any single character with no
 824 other significance (matching that character). A B<\> before one
 825 of: B<.>, B<^>, B<$>, B<[>, B<*>, B<\>, matching the character
 826 after the backslash. A sequence of atoms enclosed in B<\(> and B<\)>
 827 becomes an atom and establishes the target for a I<backreference>,
 828 consisting of the substring that actually matches the enclosed atoms.
 829 Finally, B<\> followed by one of the digits B<0> through B<9> is a
 830 backreference.
 831
 832 A B<^> that is not first, or a B<$> that is not last does not have
 833 a special significance and need not be preceded by a backslash to
 834 become literal. The same is true for a B<]>, that does not terminate
 835 a bracket expression.
 836
 837 An unescaped backslash cannot be last in a BRE.
 838
 839 =head2 Bounds
 840
 841 The BRE bounds are: B<*>, specifying 0 or more matches of the preceding
 842 atom; B<\{>I<count>B<\}>, specifying that many repetitions;
 843 B<\{>I<minimum>B<,\}>, giving a lower limit; and
 844 B<\{>I<minimum>B<,>I<maximum>B<\}> finally defines a lower and upper
 845 bound.
 846
 847 A bound appearing as the first item in a BRE is taken literally.
 848
 849 =head2 Bracket Expressions
 850
 851 A I<bracket expression> is a list of characters, character ranges
 852 and character classes enclosed in B<[> and B<]> and matches any
 853 single character from the represented set of characters.
 854
 855 A character range is written as two characters separated by B<-> and
 856 represents all characters (according to the character collating sequence)
 857 that are not less than the first and not greater than the second.
 858 (Ranges are very collating-sequence-dependent, and portable programs
 859 should avoid relying on them.)
 860
 861 A character class is one of the class names
 862
 863    alnum     digit     punct
 864    alpha     graph     space
 865    blank     lower     upper
 866    cntrl     print     xdigit
 867
 868 enclosed in B<[:> and B<:]> and represents the set of characters
 869 as defined in ctype(3).
 870
 871 If the first character after B<[> is B<^>, the sense of matching is
 872 inverted.
 873
 874 To include a literal `C<^>', place it anywhere else but first. To
 875 include a literal 'C<]>' place it first or immediately after an
 876 initial B<^>. To include a literal `C<->' make it the first (or
 877 second after B<^>) or last character, or the second endpoint of
 878 a range.
 879
 880 The special bracket expression constructs C<[[:E<lt>:]]> and C<[[:E<gt>:]]>
 881 match the null string at the beginning and end of a word respectively.
 882 (Note that neither is identical to Perl's `\b' atom.)
 883
 884 =head2 Additional Atoms
 885
 886 Since some sed implementations provide additional regular expression
 887 atoms (not defined in POSIX 1003.2), B<psed> is capable of translating
 888 the following backslash escapes:
 889
 890 =over 4
 891
 892 =item B<\E<lt>> This is the same as C<[[:E<gt>:]]>.
 893
 894 =item B<\E<gt>> This is the same as C<[[:E<lt>:]]>.
 895
 896 =item B<\w> This is an abbreviation for C<[[:alnum:]_]>.
 897
 898 =item B<\W> This is an abbreviation for C<[^[:alnum:]_]>.
 899
 900 =item B<\y> Match the empty string at a word boundary.
 901
 902 =item B<\B> Match the empty string between any two either word or non-word characters.
 903
 904 =back
 905
 906 To enable this feature, the environment variable PSEDEXTBRE must be set
 907 to a string containing the requested characters, e.g.:
 908 C<PSEDEXTBRE='E<lt>E<gt>wW'>.
 909
 910 =cut
 911
 912 #####
 913 # bre2p - convert BRE to Perl RE
 914 #
 915 sub peek(\$$){
 916     my( $pref, $ic ) = @_;
 917     $ic < length($$pref)-1 ? substr( $$pref, $ic+1, 1 ) : '';
 918 }
 919
 920 sub bre2p($$$){
 921     my( $del, $pat, $fl ) = @_;
 922     my $led = $del;
 923     $led =~ tr/{([</})]>/;
 924     $led = '' if $led eq $del;
 925
 926     $pat = substr( $pat, 1, length($pat) - 2 );
 927     my $res = '';
 928     my $bracklev = 0;
 929     my $backref  = 0;
 930     my $parlev = 0;
 931     for( my $ic = 0; $ic < length( $pat ); $ic++ ){
 932         my $c = substr( $pat, $ic, 1 );
 933         if( $c eq '\\' ){
 934             ### backslash escapes
 935             my $nc = peek($pat,$ic);
 936             if( $nc eq '' ){
 937                 Warn( "`\\' cannot be last in pattern", $fl );
 938                 return undef();
 939             }
 940             $ic++;
 941             if( $nc eq $del ){ ## \<pattern del> => \<pattern del>
 942                 $res .= "\\$del";
 943
 944             } elsif( $nc =~ /([[.*\\n])/ ){
 945                 ## check for \-escaped magics and \n:
 946                 ## \[ \. \* \\ \n stay as they are
 947                 $res .= '\\'.$nc;
 948
 949             } elsif( $nc eq '(' ){ ## \( => (
 950                 $parlev++;
 951                 $res .= '(';
 952
 953             } elsif( $nc eq ')' ){ ## \) => )
 954                 $parlev--;
 955                 $backref++;
 956                 if( $parlev < 0 ){
 957                     Warn( "unmatched `\\)'", $fl );
 958                     return undef();
 959                 }
 960                 $res .= ')';
 961
 962             } elsif( $nc eq '{' ){ ## repetition factor \{<i>[,[<j>]]\}
 963                 my $endpos = index( $pat, '\\}', $ic );
 964                 if( $endpos < 0 ){
 965                     Warn( "unmatched `\\{'", $fl );
 966                     return undef();
 967                 }
 968                 my $rep = substr( $pat, $ic+1, $endpos-($ic+1) );
 969                 $ic = $endpos + 1;
 970
 971                 if( $res =~ /^\^?$/ ){
 972                     $res .= "\\{$rep\}";
 973                 } elsif( $rep =~ /^(\d+)(,?)(\d*)?$/ ){
 974                     my $min = $1;
 975                     my $com = $2 || '';
 976                     my $max = $3;
 977                     if( length( $max ) ){
 978                         if( $max < $min ){
 979                             Warn( "maximum less than minimum in `\\{$rep\\}'",
 980                                   $fl );
 981                             return undef();
 982                         }
 983                     } else {
 984                         $max = '';
 985                     }
 986                     # simplify some
 987                     if( $min == 0 && $max eq '1' ){
 988                         $res .= '?';
 989                     } elsif( $min == 1 && "$com$max" eq ',' ){
 990                         $res .= '+';
 991                     } elsif( $min == 0 && "$com$max" eq ',' ){
 992                         $res .= '*';
 993                     } else {
 994                         $res .= "{$min$com$max}";
 995                     }
 996                 } else {
 997                     Warn( "invalid repeat clause `\\{$rep\\}'", $fl );
 998                     return undef();
 999                 }
1000
1001             } elsif( $nc =~ /^[1-9]$/ ){
1002                 ## \1 .. \9 => \1 .. \9, but check for a following digit
1003                 if( $nc > $backref ){
1004                     Warn( "invalid backreference ($nc)", $fl );
1005                     return undef();
1006                 }
1007                 $res .= "\\$nc";
1008                 if( peek($pat,$ic) =~ /[0-9]/ ){
1009                     $res .= '(?:)';
1010                 }
1011
1012             } elsif( $useEXTBRE && ( $nc =~ /[$useEXTBRE]/ ) ){
1013                 ## extensions - at most <>wWyB - not in POSIX
1014                 if(      $nc eq '<' ){ ## \< => \b(?=\w), be precise
1015                     $res .= '\\b(?<=\\W)';
1016                 } elsif( $nc eq '>' ){ ## \> => \b(?=\W), be precise
1017                     $res .= '\\b(?=\\W)';
1018                 } elsif( $nc eq 'y' ){ ## \y => \b
1019                     $res .= '\\b';
1020                 } else {               ## \B, \w, \W remain the same
1021                     $res .= "\\$nc";
1022                 }
1023             } elsif( $nc eq $led ){
1024                 ## \<closing bracketing-delimiter> - keep '\'
1025                 $res .= "\\$nc";
1026
1027             } else { ## \ <char> => <char> ("as if `\' were not present")
1028                 $res .= $nc;
1029             }
1030
1031         } elsif( $c eq '.' ){ ## . => .
1032             $res .= $c;
1033
1034         } elsif( $c eq '*' ){ ## * => * but \* if there's nothing preceding it
1035             if( $res =~ /^\^?$/ ){
1036                 $res .= '\\*';
1037             } elsif( substr( $res, -1, 1 ) ne '*' ){
1038                 $res .= $c;
1039             }
1040
1041         } elsif( $c eq '[' ){
1042             ## parse []: [^...] [^]...] [-...]
1043             my $add = '[';
1044             if( peek($pat,$ic) eq '^' ){
1045                 $ic++;
1046                 $add .= '^';
1047             }
1048             my $nc = peek($pat,$ic);
1049             if( $nc eq ']' || $nc eq '-' ){
1050                 $add .= $nc;
1051                 $ic++;
1052             }
1053             # check that [ is not trailing
1054             if( $ic >= length( $pat ) - 1 ){
1055                 Warn( "unmatched `['", $fl );
1056                 return undef();
1057             }
1058             # look for [:...:] and x-y
1059             my $rstr = substr( $pat, $ic+1 );
1060             if( $rstr =~ /^((?:\[:\(\w+|[><]\):\]|[^]-](?:-[^]])?)*)/ ){
1061                 my $cnt = $1;
1062                 $ic += length( $cnt );
1063                 $cnt =~ s/([\/\$])/\\$1/g; # `/', `$' are magic in Perl []
1064                 # try some simplifications
1065                 my $red = $cnt;
1066                 if( $red =~ s/0-9// ){
1067                     $cnt = $red.'\d';
1068                     if( $red =~ s/A-Z// && $red =~ s/a-z// && $red =~ s/_// ){
1069                         $cnt = $red.'\w';
1070                     }
1071                 }
1072                 $add .= $cnt;
1073
1074                 # POSIX 1003.2 has this (optional) for begin/end word
1075                 $add = '\\b(?=\\W)'  if $add eq '[[:<:]]';
1076                 $add = '\\b(?<=\\W)' if $add eq '[[:>:]]';
1077
1078             }
1079
1080             ## may have a trailing `-' before `]'
1081             if( $ic < length($pat) - 1 &&
1082                 substr( $pat, $ic+1 ) =~ /^(-?])/ ){
1083                 $ic += length( $1 );
1084                 $add .= $1;
1085                 # another simplification
1086                 $add =~ s/^\[(\^?)(\\[dw])]$/ $1 eq '^' ? uc($2) : $2 /e;
1087                 $res .= $add;
1088             } else {
1089                 Warn( "unmatched `['", $fl );
1090                 return undef();
1091             }
1092
1093         } elsif( $c eq $led ){ ## unescaped <closing bracketing-delimiter>
1094             $res .= "\\$c";
1095
1096         } elsif( $c eq ']' ){ ## unmatched ] is not magic
1097             $res .= ']';
1098
1099         } elsif( $c =~ /[|+?{}()@#\/]/ ){ ## not magic in BRE, but in Perl: \-quote
1100             $res .= "\\$c";
1101
1102         } elsif( $c eq '^' ){ ## not magic unless 1st, but in Perl: \-quote
1103             $res .= length( $res ) ? '\\^' : '^';
1104
1105         } elsif( $c eq '$' ){ ## not magic unless last, but in Perl: \-quote
1106             $res .= $ic == length( $pat ) - 1 ? '$' : '\\$';
1107
1108         } else {
1109             $res .= $c;
1110         }
1111     }
1112
1113     if( $parlev ){
1114        Warn( "unmatched `\\('", $fl );
1115        return undef();
1116     }
1117
1118     # final cleanup: eliminate raw HTs
1119     $res =~ s/\t/\\t/g;
1120     return $del . $res . ( $led ? $led : $del );
1121 }
1122
1123
1124 #####
1125 # sub2p - convert sed substitution to Perl substitution
1126 #
1127 sub sub2p($$$){
1128     my( $del, $subst, $fl ) = @_;
1129     my $led = $del;
1130     $led =~ tr/{([</})]>/;
1131     $led = '' if $led eq $del;
1132
1133     $subst = substr( $subst, 1, length($subst) - 2 );
1134     my $res = '';
1135
1136     for( my $ic = 0; $ic < length( $subst ); $ic++ ){
1137         my $c = substr( $subst, $ic, 1 );
1138         if( $c eq '\\' ){
1139             ### backslash escapes
1140             my $nc = peek($subst,$ic);
1141             if( $nc eq '' ){
1142                 Warn( "`\\' cannot be last in substitution", $fl );
1143                 return undef();
1144             }
1145             $ic++;
1146             if( $nc =~ /[\\$del$led]/ ){ ## \ and delimiter
1147                 $res .= '\\' . $nc;
1148             } elsif( $nc =~ /[1-9]/ ){ ## \1 - \9 => ${1} - ${9}
1149                 $res .= '${' . $nc . '}';
1150             } else { ## everything else (includes &): omit \
1151                 $res .= $nc;
1152             }
1153         } elsif( $c eq '&' ){ ## & => $&
1154             $res .= '$&';
1155         } elsif( $c =~ /[\$\@$led]/ ){ ## magic in Perl's substitution string
1156             $res .= '\\' . $c;
1157         } else {
1158             $res .= $c;
1159         }
1160     }
1161
1162     # final cleanup: eliminate raw HTs
1163     $res =~ s/\t/\\t/g;
1164     return ( $led ? $del : $led ) . $res . ( $led ? $led : $del );
1165 }
1166
1167
1168 sub Parse(){
1169     my $error = 0;
1170     my( $pdef, $pfil, $plin );
1171     for( my $icom = 0; $icom < @Commands; $icom++ ){
1172         my $cmd = $Commands[$icom];
1173         print "Parse:$cmd:\n" if $useDEBUG;
1174         $cmd =~ s/^\s+//;
1175         next unless length( $cmd );
1176         my $scom = $icom;
1177         if( exists( $Defined{$icom} ) ){
1178             $pdef = $Defined{$icom};
1179             if( $pdef =~ /^ #(\d+)/ ){
1180                 $pfil = 'expression #';
1181                 $plin = $1;
1182             } else {
1183                 $pfil = "$pdef l.";
1184                 $plin = 1;
1185             }
1186         } else {
1187             $plin++;
1188         }
1189         my $fl = "$pfil$plin";
1190
1191         # insert command as comment in gnerated code
1192         #
1193         $Code .= "# $cmd\n" if $doGenerate;
1194
1195         # The Address(es)
1196         #
1197         my( $negated, $naddr, $addr1, $addr2 );
1198         $naddr = 0;
1199         if(      $cmd =~ s/^(\d+)\s*// ){
1200             $addr1 = "$1"; $naddr++;
1201         } elsif( $cmd =~ s/^\$\s*// ){
1202             $addr1 = 'eofARGV()'; $naddr++;
1203         } elsif( $cmd =~ s{^(/)}{} || $cmd =~ s{^\\(.)}{} ){
1204             my $del = $1;
1205             my $regex = stripRegex( $del, \$cmd );
1206             if( defined( $regex ) ){
1207                 $addr1 = 'm '.bre2p( $del, $regex, $fl ).'s';
1208                 $naddr++;
1209             } else {
1210                 Warn( "malformed regex, 1st address", $fl );
1211                 $error++;
1212                 next;
1213             }
1214         }
1215         if( defined( $addr1 ) && $cmd =~ s/,\s*// ){
1216             if(      $cmd =~ s/^(\d+)\s*// ){
1217                 $addr2 = "$1"; $naddr++;
1218             } elsif( $cmd =~ s/^\$\s*// ){
1219                 $addr2 = 'eofARGV()'; $naddr++;
1220             } elsif( $cmd =~ s{^(/)}{} || $cmd =~ s{^\\(.)}{} ){
1221                 my $del = $1;
1222                 my $regex = stripRegex( $del, \$cmd );
1223                 if( defined( $regex ) ){
1224                     $addr2 = 'm '. bre2p( $del, $regex, $fl ).'s';
1225                     $naddr++;
1226                 } else {
1227                     Warn( "malformed regex, 2nd address", $fl );
1228                     $error++;
1229                     next;
1230                 }
1231             } else {
1232                 Warn( "invalid address after `,'", $fl );
1233                 $error++;
1234                 next;
1235             }
1236         }
1237
1238         # address modifier `!'
1239         #
1240         $negated = $cmd =~ s/^!\s*//;
1241         if( defined( $addr1 ) ){
1242             print "Parse: addr1=$addr1" if $useDEBUG;
1243             if( defined( $addr2 ) ){
1244                 print ", addr2=$addr2 " if $useDEBUG;
1245                 # both numeric and addr1 > addr2 => eliminate addr2
1246                 undef( $addr2 ) if $addr1 =~ /^\d+$/ &&
1247                                    $addr2 =~ /^\d+$/ && $addr1 > $addr2;
1248             }
1249         }
1250         print 'negated' if $useDEBUG && $negated;
1251         print " command:$cmd\n" if $useDEBUG;
1252
1253         # The Command
1254         #
1255         if( $cmd !~ s/^([:#={}abcdDgGhHilnNpPqrstwxy])\s*// ){
1256             my $h = substr( $cmd, 0, 1 );
1257             Warn( "unknown command `$h'", $fl );
1258             $error++;
1259             next;
1260         }
1261         my $key = $1;
1262
1263         my $tabref = $ComTab{$key};
1264         $GenKey{$key} = 1;
1265         if( $naddr > $tabref->[0] ){
1266             Warn( "excess address(es)", $fl );
1267             $error++;
1268             next;
1269         }
1270
1271         my $arg = '';
1272         if(      $tabref->[1] eq 'str' ){
1273             # take remainder - don't care if it is empty
1274             $arg = $cmd;
1275             $cmd = '';
1276
1277         } elsif( $tabref->[1] eq 'txt' ){
1278             # multi-line text
1279             my $goon = $cmd =~ /(.*)\\$/;
1280             if( length( $1 ) ){
1281                 Warn( "extra characters after command ($cmd)", $fl );
1282                 $error++;
1283             }
1284             while( $goon ){
1285                 $icom++;
1286                 if( $icom > $#Commands ){
1287                     Warn( "unexpected end of script", $fl );
1288                     $error++;
1289                     last;
1290                 }
1291                 $cmd = $Commands[$icom];
1292                 $Code .= "# $cmd\n" if $doGenerate;
1293                 $goon = $cmd =~ s/\\$//;
1294                 $cmd =~ s/\\(.)/$1/g;
1295                 $arg .= "\n" if length( $arg );
1296                 $arg .= $cmd;
1297             }
1298             $arg .= "\n" if length( $arg );
1299             $cmd = '';
1300
1301         } elsif( $tabref->[1] eq 'sub' ){
1302             # s///
1303             if( ! length( $cmd ) ){
1304                 Warn( "`s' command requires argument", $fl );
1305                 $error++;
1306                 next;
1307             }
1308             if( $cmd =~ s{^([^\\\n])}{} ){
1309                 my $del = $1;
1310                 my $regex = stripRegex( $del, \$cmd, "s" );
1311                 if( ! defined( $regex ) ){
1312                     Warn( "malformed regular expression", $fl );
1313                     $error++;
1314                     next;
1315                 }
1316                 $regex = bre2p( $del, $regex, $fl );
1317
1318                 # a trailing \ indicates embedded NL (in replacement string)
1319                 while( $cmd =~ s/(?<!\\)\\$/\n/ ){
1320                     $icom++;
1321                     if( $icom > $#Commands ){
1322                         Warn( "unexpected end of script", $fl );
1323                         $error++;
1324                         last;
1325                     }
1326                     $cmd .= $Commands[$icom];
1327                     $Code .= "# $Commands[$icom]\n" if $doGenerate;
1328                 }
1329
1330                 my $subst = stripRegex( $del, \$cmd );
1331                 if( ! defined( $regex ) ){
1332                     Warn( "malformed substitution expression", $fl );
1333                     $error++;
1334                     next;
1335                 }
1336                 $subst = sub2p( $del, $subst, $fl );
1337
1338                 # parse s/// modifier: g|p|0-9|w <file>
1339                 my( $global, $nmatch, $print, $write ) =
1340                   ( '',      '',      0,      undef );
1341                 while( $cmd =~ s/^([gp0-9])// ){
1342                     $1 eq 'g' ? ( $global = 'g' ) :
1343                     $1 eq 'p' ? ( $print  = $1  ) : ( $nmatch .= $1 );
1344                 }
1345                 $write = $1 if $cmd =~ s/w\s*(.*)$//;
1346                 ### $nmatch =~ s/^(\d)\1*$/$1/; ### may be dangerous?
1347                 if( $global && length( $nmatch ) || length( $nmatch ) > 1 ){
1348                     Warn( "conflicting flags `$global$nmatch'", $fl );
1349                     $error++;
1350                     next;
1351                 }
1352
1353                 $arg = makes( $regex, $subst,
1354                               $write, $global, $print, $nmatch, $fl );
1355                 if( ! defined( $arg ) ){
1356                     $error++;
1357                     next;
1358                 }
1359
1360             } else {
1361                 Warn( "improper delimiter in s command", $fl );
1362                 $error++;
1363                 next;
1364             }
1365
1366         } elsif( $tabref->[1] eq 'tra' ){
1367             # y///
1368             # a trailing \ indicates embedded newline
1369             while( $cmd =~ s/(?<!\\)\\$/\n/ ){
1370                 $icom++;
1371                 if( $icom > $#Commands ){
1372                     Warn( "unexpected end of script", $fl );
1373                     $error++;
1374                     last;
1375                 }
1376                 $cmd .= $Commands[$icom];
1377                 $Code .= "# $Commands[$icom]\n" if $doGenerate;
1378             }
1379             if( ! length( $cmd ) ){
1380                 Warn( "`y' command requires argument", $fl );
1381                 $error++;
1382                 next;
1383             }
1384             my $d = substr( $cmd, 0, 1 ); $cmd = substr( $cmd, 1 );
1385             if( $d eq '\\' ){
1386                 Warn( "`\\' not valid as delimiter in `y' command", $fl );
1387                 $error++;
1388                 next;
1389             }
1390             my $fr = stripTrans( $d, \$cmd );
1391             if( ! defined( $fr ) || ! length( $cmd ) ){
1392                 Warn( "malformed `y' command argument", $fl );
1393                 $error++;
1394                 next;
1395             }
1396             my $to = stripTrans( $d, \$cmd );
1397             if( ! defined( $to ) ){
1398                 Warn( "malformed `y' command argument", $fl );
1399                 $error++;
1400                 next;
1401             }
1402             if( length($fr) != length($to) ){
1403                 Warn( "string lengths in `y' command differ", $fl );
1404                 $error++;
1405                 next;
1406             }
1407             if( ! defined( $arg = makey( $fr, $to, $fl ) ) ){
1408                 $error++;
1409                 next;
1410             }
1411
1412         }
1413
1414         # $cmd must be now empty - exception is {
1415         if( $cmd !~ /^\s*$/ ){
1416             if( $key eq '{' ){
1417                 # dirty hack to process command on '{' line
1418                 $Commands[$icom--] = $cmd;
1419             } else {
1420                 Warn( "extra characters after command ($cmd)", $fl );
1421                 $error++;
1422                 next;
1423             }
1424         }
1425
1426         # Make Code
1427         #
1428         if( &{$tabref->[2]}( $addr1, $addr2, $negated,
1429                              $tabref->[3], $arg, $fl ) ){
1430             $error++;
1431         }
1432     }
1433
1434     while( @BlockStack ){
1435         my $bl = pop( @BlockStack );
1436         Warn( "start of unterminated `{'", $bl );
1437         $error++;
1438     }
1439
1440     for my $lab ( keys( %Label ) ){
1441         if( ! exists( $Label{$lab}{defined} ) ){
1442             for my $used ( @{$Label{$lab}{used}} ){
1443                 Warn( "undefined label `$lab'", $used );
1444                 $error++;
1445             }
1446         }
1447     }
1448
1449     exit( 1 ) if $error;
1450 }
1451
1452
1453 ##############
1454 #### MAIN ####
1455 ##############
1456
1457 sub usage(){
1458     print STDERR "Usage: sed [-an] command [file...]\n";
1459     print STDERR "           [-an] [-e command] [-f script-file] [file...]\n";
1460 }
1461
1462 ###################
1463 # Here we go again...
1464 #
1465 my $expr = 0;
1466 while( @ARGV && $ARGV[0] =~ /^-(.)(.*)$/ ){
1467     my $opt = $1;
1468     my $arg = $2;
1469     shift( @ARGV );
1470     if(      $opt eq 'e' ){
1471         if( length( $arg ) ){
1472             push( @Commands, split( "\n", $arg ) );
1473         } elsif( @ARGV ){
1474             push( @Commands, shift( @ARGV ) );
1475         } else {
1476             Warn( "option -e requires an argument" );
1477             usage();
1478             exit( 1 );
1479         }
1480         $expr++;
1481         $Defined{$#Commands} = " #$expr";
1482         next;
1483     }
1484     if( $opt eq 'f' ){
1485         my $path;
1486         if( length( $arg ) ){
1487             $path = $arg;
1488         } elsif( @ARGV ){
1489             $path = shift( @ARGV );
1490         } else {
1491             Warn( "option -f requires an argument" );
1492             usage();
1493             exit( 1 );
1494         }
1495         my $fst = $#Commands + 1;
1496         open( SCRIPT, "<$path" ) || die( "$0: $path: could not open ($!)\n" );
1497         my $cmd;
1498         while( defined( $cmd = <SCRIPT> ) ){
1499             chomp( $cmd );
1500             push( @Commands, $cmd );
1501         }
1502         close( SCRIPT );
1503         if( $#Commands >= $fst ){
1504             $Defined{$fst} = "$path";
1505         }
1506         next;
1507     }
1508     if( $opt eq '-' && $arg eq '' ){
1509         last;
1510     }
1511     if( $opt eq 'h' || $opt eq '?' ){
1512         usage();
1513         exit( 0 );
1514     }
1515     if( $opt eq 'n' ){
1516         $doAutoPrint = 0;
1517     } elsif( $opt eq 'a' ){
1518         $doOpenWrite = 0;
1519     } else {
1520         Warn( "illegal option `$opt'" );
1521         usage();
1522         exit( 1 );
1523     }
1524     if( length( $arg ) ){
1525         unshift( @ARGV, "-$arg" );
1526     }
1527 }
1528
1529 # A singleton command may be the 1st argument when there are no options.
1530 #
1531 if( @Commands == 0 ){
1532     if( @ARGV == 0 ){
1533         Warn( "no script command given" );
1534         usage();
1535         exit( 1 );
1536     }
1537     push( @Commands, split( "\n", shift( @ARGV ) ) );
1538     $Defined{0} = ' #1';
1539 }
1540
1541 print STDERR "Files: @ARGV\n" if $useDEBUG;
1542
1543 # generate leading code
1544 #
1545 $Func = <<'[TheEnd]';
1546
1547 # openARGV: open 1st input file
1548 #
1549 sub openARGV(){
1550     unshift( @ARGV, '-' ) unless @ARGV;
1551     my $file = shift( @ARGV );
1552     open( ARG, "<$file" )
1553     || die( "$0: can't open $file for reading ($!)\n" );
1554     $isEOF = 0;
1555 }
1556
1557 # getsARGV: Read another input line into argument (default: $_).
1558 #           Move on to next input file, and reset EOF flag $isEOF.
1559 sub getsARGV(;\$){
1560     my $argref = @_ ? shift() : \$_;
1561     while( $isEOF || ! defined( $$argref = <ARG> ) ){
1562         close( ARG );
1563         return 0 unless @ARGV;
1564         my $file = shift( @ARGV );
1565         open( ARG, "<$file" )
1566         || die( "$0: can't open $file for reading ($!)\n" );
1567         $isEOF = 0;
1568     }
1569     1;
1570 }
1571
1572 # eofARGV: end-of-file test
1573 #
1574 sub eofARGV(){
1575     return @ARGV == 0 && ( $isEOF = eof( ARG ) );
1576 }
1577
1578 # makeHandle: Generates another file handle for some file (given by its path)
1579 #             to be written due to a w command or an s command's w flag.
1580 sub makeHandle($){
1581     my( $path ) = @_;
1582     my $handle;
1583     if( ! exists( $wFiles{$path} ) || $wFiles{$path} eq '' ){
1584         $handle = $wFiles{$path} = gensym();
1585         if( $doOpenWrite ){
1586             if( ! open( $handle, ">$path" ) ){
1587                 die( "$0: can't open $path for writing: ($!)\n" );
1588             }
1589         }
1590     } else {
1591         $handle = $wFiles{$path};
1592     }
1593     return $handle;
1594 }
1595
1596 # printQ: Print queued output which is either a string or a reference
1597 #         to a pathname.
1598 sub printQ(){
1599     for my $q ( @Q ){
1600         if( ref( $q ) ){
1601             # flush open w files so that reading this file gets it all
1602             if( exists( $wFiles{$$q} ) && $wFiles{$$q} ne '' ){
1603                 open( $wFiles{$$q}, ">>$$q" );
1604             }
1605             # copy file to stdout: slow, but safe
1606             if( open( RF, "<$$q" ) ){
1607                 while( defined( my $line = <RF> ) ){
1608                     print $line;
1609                 }
1610                 close( RF );
1611             }
1612         } else {
1613             print $q;
1614         }
1615     }
1616     undef( @Q );
1617 }
1618
1619 [TheEnd]
1620
1621 # generate the sed loop
1622 #
1623 $Code .= <<'[TheEnd]';
1624 sub openARGV();
1625 sub getsARGV(;\$);
1626 sub eofARGV();
1627 sub printQ();
1628
1629 # Run: the sed loop reading input and applying the script
1630 #
1631 sub Run(){
1632     my( $h, $icnt, $s, $n );
1633     # hack (not unbreakable :-/) to avoid // matching an empty string
1634     my $z = "\000"; $z =~ /$z/;
1635     my $exitstatus = 0;
1636     # Initialize.
1637     openARGV();
1638     $Hold    = '';
1639     $CondReg = 0;
1640     $doPrint = $doAutoPrint;
1641 CYCLE:
1642     while( getsARGV() ){
1643         chomp();
1644         $CondReg = 0;   # cleared on t
1645 BOS:;
1646 [TheEnd]
1647
1648     # parse - avoid opening files when doing s2p
1649     #
1650     ( $svOpenWrite, $doOpenWrite ) = (  $doOpenWrite, $svOpenWrite )
1651       if $doGenerate;
1652     Parse();
1653     ( $svOpenWrite, $doOpenWrite ) = (  $doOpenWrite, $svOpenWrite )
1654       if $doGenerate;
1655
1656     # append trailing code
1657     #
1658     $Code .= <<'[TheEnd]';
1659 EOS:    if( $doPrint ){
1660             print $_, "\n";
1661         } else {
1662             $doPrint = $doAutoPrint;
1663         }
1664         printQ() if @Q;
1665     }
1666
1667     exit( $exitstatus );
1668 }
1669 [TheEnd]
1670
1671
1672 # append optional functions, prepend prototypes
1673 #
1674 my $Proto = "# prototypes\n";
1675 if( $GenKey{'l'} ){
1676     $Proto .= "sub _l();\n";
1677     $Func .= <<'[TheEnd]';
1678 # _l: l command processing
1679 #
1680 sub _l(){
1681     my $h = $_;
1682     my $mcpl = 70;
1683     # transform non printing chars into escape notation
1684     $h =~ s/\\/\\\\/g;
1685     if( $h =~ /[^[:print:]]/ ){
1686         $h =~ s/\a/\\a/g;
1687         $h =~ s/\f/\\f/g;
1688         $h =~ s/\n/\\n/g;
1689         $h =~ s/\t/\\t/g;
1690         $h =~ s/\r/\\r/g;
1691         $h =~ s/\e/\\e/g;
1692         $h =~ s/([^[:print:]])/sprintf("\\%03o", ord($1))/ge;
1693     }
1694     # split into lines of length $mcpl
1695     while( length( $h ) > $mcpl ){
1696         my $l = substr( $h, 0, $mcpl-1 );
1697         $h = substr( $h, $mcpl );
1698         # remove incomplete \-escape from end of line
1699         if( $l =~ s/(?<!\\)(\\[0-7]{0,2})$// ){
1700             $h = $1 . $h;
1701         }
1702         print $l, "\\\n";
1703     }
1704     print "$h\$\n";
1705 }
1706
1707 [TheEnd]
1708 }
1709
1710 if( $GenKey{'r'} ){
1711     $Proto .= "sub _r(\$);\n";
1712     $Func .= <<'[TheEnd]';
1713 # _r: r command processing: Save a reference to the pathname.
1714 #
1715 sub _r($){
1716     my $path = shift();
1717     push( @Q, \$path );
1718 }
1719
1720 [TheEnd]
1721 }
1722
1723 if( $GenKey{'t'} ){
1724     $Proto .= "sub _t();\n";
1725     $Func .= <<'[TheEnd]';
1726 # _t: t command - condition register test/reset
1727 #
1728 sub _t(){
1729     my $res = $CondReg;
1730     $CondReg = 0;
1731     $res;
1732 }
1733
1734 [TheEnd]
1735 }
1736
1737 if( $GenKey{'w'} ){
1738     $Proto .= "sub _w(\$);\n";
1739     $Func .= <<'[TheEnd]';
1740 # _w: w command and s command's w flag - write to file
1741 #
1742 sub _w($){
1743     my $path   = shift();
1744     my $handle = $wFiles{$path};
1745     if( ! $doOpenWrite && ! defined( fileno( $handle ) ) ){
1746         open( $handle, ">$path" )
1747         || die( "$0: $path: cannot open ($!)\n" );
1748     }
1749     print $handle $_, "\n";
1750 }
1751
1752 [TheEnd]
1753 }
1754
1755 $Code = $Proto . $Code;
1756
1757 # magic "#n" - same as -n option
1758 #
1759 $doAutoPrint = 0 if substr( $Commands[0], 0, 2 ) eq '#n';
1760
1761 # eval code - check for errors
1762 #
1763 print "Code:\n$Code$Func" if $useDEBUG;
1764 eval $Code . $Func;
1765 if( $@ ){
1766     print "Code:\n$Code$Func";
1767     die( "$0: internal error - generated incorrect Perl code: $@\n" );
1768 }
1769
1770 if( $doGenerate ){
1771
1772     # write full Perl program
1773     #
1774
1775     # bang line, declarations, prototypes
1776     print <<TheEnd;
1777 #!$perlpath -w
1778 eval 'exec $perlpath -S \$0 \${1+"\$@"}'
1779   if 0;
1780 \$0 =~ s/^.*?(\\w+)\[\\.\\w+\]*\$/\$1/;
1781
1782 use strict;
1783 use Symbol;
1784 use vars qw{ \$isEOF \$Hold \%wFiles \@Q \$CondReg
1785              \$doAutoPrint \$doOpenWrite \$doPrint };
1786 \$doAutoPrint = $doAutoPrint;
1787 \$doOpenWrite = $doOpenWrite;
1788 TheEnd
1789
1790     my $wf = "'" . join( "', '",  keys( %wFiles ) ) . "'";
1791     if( $wf ne "''" ){
1792         print <<TheEnd;
1793 sub makeHandle(\$);
1794 for my \$p ( $wf ){
1795    exit( 1 ) unless makeHandle( \$p );
1796 }
1797 TheEnd
1798    }
1799
1800    print $Code;
1801    print "Run();\n";
1802    print $Func;
1803    exit( 0 );
1804
1805 } else {
1806
1807     # execute: make handles (and optionally open) all w files; run!
1808     for my $p ( keys( %wFiles ) ){
1809         exit( 1 ) unless makeHandle( $p );
1810     }
1811     Run();
1812 }
1813
1814
1815 =head1 ENVIRONMENT
1816
1817 The environment variable C<PSEDEXTBRE> may be set to extend BREs.
1818 See L<"Additional Atoms">.
1819
1820 =head1 DIAGNOSTICS
1821
1822 =over 4
1823
1824 =item ambiguous translation for character `%s' in `y' command
1825
1826 The indicated character appears twice, with different translations.
1827
1828 =item `[' cannot be last in pattern
1829
1830 A `[' in a BRE indicates the beginning of a I<bracket expression>.
1831
1832 =item `\' cannot be last in pattern
1833
1834 A `\' in a BRE is used to make the subsequent character literal.
1835
1836 =item `\' cannot be last in substitution
1837
1838 A `\' in a subsitution string is used to make the subsequent character literal.
1839
1840 =item conflicting flags `%s'
1841
1842 In an B<s> command, either the `g' flag and an n-th occurrence flag, or
1843 multiple n-th occurrence flags are specified. Note that only the digits
1844 `1' through `9' are permitted.
1845
1846 =item duplicate label %s (first defined at %s)
1847
1848 =item excess address(es)
1849
1850 The command has more than the permitted number of addresses.
1851
1852 =item extra characters after command (%s)
1853
1854 =item illegal option `%s'
1855
1856 =item improper delimiter in s command
1857
1858 The BRE and substitution may not be delimited with `\' or newline.
1859
1860 =item invalid address after `,'
1861
1862 =item invalid backreference (%s)
1863
1864 The specified backreference number exceeds the number of backreferences
1865 in the BRE.
1866
1867 =item invalid repeat clause `\{%s\}'
1868
1869 The repeat clause does not contain a valid integer value, or pair of
1870 values.
1871
1872 =item malformed regex, 1st address
1873
1874 =item malformed regex, 2nd address
1875
1876 =item malformed regular expression
1877
1878 =item malformed substitution expression
1879
1880 =item malformed `y' command argument
1881
1882 The first or second string of a B<y> command  is syntactically incorrect.
1883
1884 =item maximum less than minimum in `\{%s\}'
1885
1886 =item no script command given
1887
1888 There must be at least one B<-e> or one B<-f> option specifying a
1889 script or script file.
1890
1891 =item `\' not valid as delimiter in `y' command
1892
1893 =item option -e requires an argument
1894
1895 =item option -f requires an argument
1896
1897 =item `s' command requires argument
1898
1899 =item start of unterminated `{'
1900
1901 =item string lengths in `y' command differ
1902
1903 The translation table strings in a B<y> command must have equal lengths.
1904
1905 =item undefined label `%s'
1906
1907 =item unexpected `}'
1908
1909 A B<}> command without a preceding B<{> command was encountered.
1910
1911 =item unexpected end of script
1912
1913 The end of the script was reached although a text line after a
1914 B<a>, B<c> or B<i> command indicated another line.
1915
1916 =item unknown command `%s'
1917
1918 =item unterminated `['
1919
1920 A BRE contains an unterminated bracket expression.
1921
1922 =item unterminated `\('
1923
1924 A BRE contains an unterminated backreference.
1925
1926 =item `\{' without closing `\}'
1927
1928 A BRE contains an unterminated bounds specification.
1929
1930 =item `\)' without preceding `\('
1931
1932 =item `y' command requires argument
1933
1934 =back
1935
1936 =head1 EXAMPLE
1937
1938 The basic material for the preceding section was generated by running
1939 the sed script
1940
1941    #no autoprint
1942    s/^.*Warn( *"\([^"]*\)".*$/\1/
1943    t process
1944    b
1945    :process
1946    s/$!/%s/g
1947    s/$[_[:alnum:]]\{1,\}/%s/g
1948    s/\\\\/\\/g
1949    s/^/=item /
1950    p
1951
1952 on the program's own text, and piping the output into C<sort -u>.
1953
1954
1955 =head1 SED SCRIPT TRANSLATION
1956
1957 If this program is invoked with the name F<s2p> it will act as a
1958 sed-to-Perl translator. After option processing (all other
1959 arguments are ignored), a Perl program is printed on standard
1960 output, which will process the input stream (as read from all
1961 arguments) in the way defined by the sed script and the option setting
1962 used for the translation.
1963
1964 =head1 SEE ALSO
1965
1966 perl(1), re_format(7)
1967
1968 =head1 BUGS
1969
1970 The B<l> command will show escape characters (ESC) as `C<\e>', but
1971 a vertical tab (VT) in octal.
1972
1973 Trailing spaces are truncated from labels in B<:>, B<t> and B<b> commands.
1974
1975 The meaning of an empty regular expression (`C<//>'), as defined by B<sed>,
1976 is "the last pattern used, at run time". This deviates from the Perl
1977 interpretation, which will re-use the "last last successfully executed
1978 regular expression". Since keeping track of pattern usage would create
1979 terribly cluttered code, and differences would only appear in obscure
1980 context (where other B<sed> implementations appear to deviate, too),
1981 the Perl semantics was adopted. Note that common usage of this feature,
1982 such as in C</abc/s//xyz/>, will work as expected.
1983
1984 Collating elements (of bracket expressions in BREs) are not implemented.
1985
1986 =head1 STANDARDS
1987
1988 This B<sed> implementation conforms to the IEEE Std1003.2-1992 ("POSIX.2")
1989 definition of B<sed>, and is compatible with the I<OpenBSD>
1990 implementation, except where otherwise noted (see L<"BUGS">).
1991
1992 =head1 AUTHOR
1993
1994 This Perl implementation of I<sed> was written by Wolfgang Laun,
1995 I<Wolfgang.Laun@alcatel.at>.
1996
1997 =head1 COPYRIGHT and LICENSE
1998
1999 This program is free and open software. You may use, modify,
2000 distribute, and sell this program (and any modified variants) in any
2001 way you wish, provided you do not restrict others from doing the same.
2002
2003 =cut
2004