Article 2108 of alt.sources: Xref: feenix.metronet.com alt.sources:2108 Newsgroups: alt.sources Path: feenix.metronet.com!news.utdallas.edu!wupost!howland.reston.ans.net!agate!headwall.Stanford.EDU!unixhub!cadhub.SLAC.Stanford.EDU!jfm From: jfm@cadhub.SLAC.Stanford.EDU (John F. McGowan) Subject: newscan - a Perl Network News scanner script Message-ID: Keywords: NNTP news perl scanner Internet newsgroups Sender: news@unixhub.SLAC.Stanford.EDU Organization: Stanford Linear Accelerator Center Date: Tue, 19 Oct 1993 05:06:24 GMT Lines: 1767 Archive-name: newscan Submitted-by: jfm@jupiter.slac.stanford.edu Below is the source code for newscan, a Perl script for searching selected Internet newsgroups for articles containing matches to Perl regular expressions. newscan can also veto articles containing other matches. The source contains an embedded manual page following the convention in Programming Perl by Randal Schwarz. The manpage is in nroff at the end of the source. In general, newscan should be run as background process since scanning an Internet newsgroup can be quite time consuming. newscan -h generates a help message as well. Please send comments, bug reports, and so forth to jfm@jupiter.slac.stanford.edu. - John McGowan ------------------------------------>CUT HERE<---------------------------- #!/usr/local/bin/perl 'di'; 'ig00'; # $Header: /u/ey/jfm/hunter/RCS/newscan,v 1.35 1993/10/19 04:47:20 jfm Exp jfm $ # # Name: newscan # Date: $Date: 1993/10/19 04:47:20 $ # Version: $Revision: 1.35 $ # Author: John F. McGowan ($Author: jfm $) # # Description: # # newscan is a utility to scan netnews for articles that # contain matches to regular expressions. # newscan can also exclude articles that contain # matches to selected regular expressions. # # Articles that contain matches are stored in a file in mailbox format. # This file may be read using mail readers such as mail, elm, etc. # # Search is controlled by a resource file .newscanrc (by # default) in the current directory. the default file may overridden # through the environment variable NEWSCAN, e.g. setenv NEWSCAN /me/.myrc. # newscan is intended to be run as a background job, e.g. newscan & # since it takes a while to scan selected newsgroups for articles of # interest. # # Revision Record: # $Log: newscan,v $ # Revision 1.35 1993/10/19 04:47:20 jfm # Delete SEE ALSO from manpage for submission to alt.sources # # Revision 1.34 1993/09/05 21:19:55 jfm # 1. Slightly improve the embedded manpage to include all options in # the synopsis and a section on options. Needs work on formatting. # # Revision 1.33 1993/07/20 00:53:53 jfm # 1. Improve template configuration file. # # Revision 1.32 1993/07/19 23:54:34 jfm # 1. small improvements in handling of leading ~ in file-specifications. Now only leading ~ is translated to home directory. Other ~ are left alone. # # Revision 1.31 1993/07/18 21:34:07 jfm # 1. Removed -d debug option # 2. Support for -e edit configuraiton file flag # 3. Support for -r read mailbox format file using # a standard mail reader. Will try to use elm first. # 4. Has support for different search criteria for different newsgroups. # # Revision 1.30 1993/07/17 20:46:02 jfm # 1. Add support for newscan -e to edit configuration file before execute search. # 2. Add support for newscan -r to invoke mail reader to read the file containing the found articles (probably not done yet). # # Revision 1.29 1993/07/17 19:14:07 jfm # 1. Revised comments. Now support different search criterion for different groups. # 2. Now supports REQUIRE /regexp/i which finds only articles containing # a match to /regexp/i. # # Revision 1.28 1993/07/16 19:49:35 jfm # 1. Attempt to add support for different search criterion for different groups or collections of groups. Appears to only search group(s?) specified by last SELECT line however. Needs bug fix. # # Revision 1.27 1993/07/16 00:38:24 jfm # Switch from Greenwich Meridian Time to Local Time in statistics. # # Revision 1.26 1993/07/15 04:53:39 jfm # 1. Add better formatting for the statistics output. Make more readable. # # Revision 1.25 1993/07/14 00:32:47 jfm # 1. added COLLECT STATISTICS ON (VETO|UNLESS) PATTERNS # 2. COLLECT STATISTICS ON (SEARCH|WHERE) PATTERNS now # # Revision 1.24 1993/07/14 00:15:17 jfm # 1. Add option to COLLECT STATISTICS ON SEARCH PATTERNS # # Revision 1.23 1993/07/13 02:40:06 jfm # 1. Spaces between two regular expressions in pairs statistics. # # Revision 1.22 1993/07/13 02:25:05 jfm # 1. Added correlation coefficients for pairs of regular expression. # # Revision 1.21 1993/07/12 23:05:47 jfm # 1. Now COLLECT STATISTICS ON PAIRS to count pairs of regular expressions # # Revision 1.20 1993/07/12 22:29:07 jfm # 1. Now appends to statistics file rather than overwrites. # 2. Does date properly in statistics. # # Revision 1.19 1993/07/12 20:23:44 jfm # 1. Added COLLECT STATISTICS IN # 2. Added COLLECT STATISTICS ON # # Revision 1.18 1993/07/06 15:32:08 jfm # Appears to work ok # # Revision 1.17 1993/07/06 04:21:31 jfm # 1. Added routines to sort and merges elements in range of excluded articles information. # # Revision 1.16 1993/07/05 00:34:23 jfm # 1. ~ in file specification for home directory (for MBOX) # 2. [119] port number optional # 3. period . allowed in file specification # 4. automatically generates group:excluded range line if not supplied by user and appends to end of file. # 5. Recognizes additional date formats # 6. .newscanrc is in user's home directory as specified by $HOME environment variable. # 7. Improve manpage. # # Revision 1.15 1993/07/04 00:29:27 jfm # 1. Add support for dotted decimal internet addresses. # # Revision 1.15 1993/07/04 00:29:27 jfm # 1. Add support for dotted decimal internet addresses. # # Revision 1.14 1993/07/03 23:57:26 jfm # 1. Fixed bug so that nntp port is optional. # 2. Added support for dates of format 1 Jul 1993 05:54 CST and 1 Jul 93 05:54 CST. # 3. Changed sample resource file in embedded manpage to NNTP nntphost, omitted optional port number since 119 is the default port specified in NNTP rfc. # # Revision 1.13 1993/07/03 00:25:34 jfm # 1. Added -h command line argument for brief help. Use embedded manpage for detailed help. # 2. Need to fix multiple copy of same article bug. # # Revision 1.12 1993/07/03 00:04:23 jfm # 1. Added command line argument to specify a configuration file. # 2. Some changes to embedded manpage. # # Revision 1.11 1993/07/02 22:21:39 jfm # 1. Added more to manpage. # # Revision 1.10 1993/07/02 22:09:11 jfm # Add some more to the built-in manpage. # # Revision 1.9 1993/07/02 21:58:57 jfm # *** empty log message *** # # Revision 1.8 1993/07/02 21:56:21 jfm # 1. Added default mailbox file newscanBox # 2. Added sample configuration file. # # Revision 1.7 1993/07/02 21:48:53 jfm # 1. removed debug -d # 2. Added more comments # 3. Removed socket, bind, and connect ok messages. # 4. Expanded built in manpage # # Revision 1.6 1993/06/30 19:53:23 jfm # 1. Add RCS $Header: /u/ey/jfm/hunter/RCS/newscan,v 1.35 1993/10/19 04:47:20 jfm Exp jfm $ to comments # # Revision 1.5 1993/06/26 21:05:45 jfm # 1. Fixed bug in GetDate which did not handle the date format Sat, 26 Jun 1993 17:38:51 GMT correctly. # # Revision 1.4 1993/06/12 22:48:54 jfm # 1. Change to using local($_) in subroutines instead of $save = $_. This should be more efficient and easier to maintain. # # Revision 1.3 1993/06/10 03:12:39 jfm # 1. Removed bug to allow search patterns of for /.../ # # Revision 1.2 1993/06/10 02:41:49 jfm # 1. Changed syntax to use UNLESS to veto articles and WHERE to specifying match to find. # # Revision 1.1 1993/06/09 00:50:06 jfm # Initial revision # # Revision 1.17 1993/06/07 00:00:39 jfm # 1. Treats blank lines in configuration file as comment lines # 2. Support for supergroups of groups to do search on using lines of form SUPERGROUP: blank delimited list of newsgroups to search. # # Revision 1.16 1993/06/06 20:02:53 jfm # 1. Changed data structure for search patterns to associative array. Key is the group name, value is packed list of search patterns using // as the field separator. # 2. minor cleanup of code # # Revision 1.15 1993/06/05 23:48:40 jfm # 1. Improved update range subroutine # 2. Added subroutine GetXRef to extract cross references from article header. The Xref: line specifies the article number in other groups if the article has been posted to multiple groups. scanner should now retrieve only one copy of an article even if posted to multiple groups (not fully tested yet!). # # Revision 1.14 1993/06/03 21:57:33 jfm # 1. Add copyright notice to manpage # # Revision 1.13 1993/06/03 21:40:54 jfm # 1. Add more comments # 2. Add veto subroutine to veto articles. More efficient veto logic. # 3. Add built in manpage # # Revision 1.12 1993/06/02 22:04:54 jfm # 1. Solved date format problem. Was with dates of form 1 Jun etc. Old format expected two digit date but 1-9 are allowed in dates. # 2. Now updates range by overwriting the configuration file. # # Revision 1.11 1993/06/02 20:57:28 jfm # 1. Added subroutine to return the day of the week (Mon, Tue, ...) for a date. Use this subroutine to generate the mailbox header line. # # Revision 1.10 1993/06/01 00:56:53 jfm # 1. Subroutine UpDateRange added to update excluded range for a group. Still need to use this to update configuration file. # 2. Configuration file now specified by NEWSCAN environment variable. Defaults to .newscanrc if NEWSCAN environment variable not defined. # # Revision 1.9 1993/05/31 19:24:45 jfm # 1. Now supports more date formats for conversion to mailbox format # 2. Loops over groups correctly now (used perl keys function) # 3. Handles unknown date formats better so still get article # # Revision 1.8 1993/05/29 21:32:35 jfm # 1. Now generates output in mailbox format that can be used with elm mail reader. # # Revision 1.7 1993/05/29 01:14:14 jfm # 1. Now searches body of text of articles according to search patterns stored in .scannerrc file # 2. Stores articles to file scanned. # 3. scanned is NOT in mailbox format. Appear to need first line of each article to be of form From (path) (date). See valid mailbox files. # 4. System seems to prefer linefeed ^J as end of line rather than the ^M ^J # CR-LF end of line used by NNTP protocol (but I am not certain of this) # # Revision 1.6 1993/05/28 20:08:37 jfm # 1. Now loops over groups and loops over articles in group # 2. Some bugs (not a debugged version) # 3. No pattern searching yet - add pattern searching next # # Revision 1.5 1993/05/25 19:46:13 jfm # 1. Added very simple query and response communication with the NNTP server. Appears to work so far. # # Revision 1.4 1993/05/25 18:56:07 jfm # 1. Added support for a line in configuration file giving internet address and port number of the NNTP server. # # Revision 1.3 1993/05/25 18:43:41 jfm # 1. Added debug flag to run in perl debugger # 2. Removed most debug print statements # 3. Believe it now parses the config file # 4. Need to add network connection to NNTP server # # Revision 1.2 1993/05/22 21:40:03 jfm # fix bug - add ; at end of second line of code # # Revision 1.1 1993/05/22 21:24:25 jfm # Initial revision # # arrays for weekday and date conversion require "getopts.pl"; &Getopts('c:r:he'); # -c takes argument configuration file specification # -e indicates edit configuration file before search # -r invokes mail reader # -h is help @cPath = ( "/bin", "/usr/local/bin", "/usr/ucb", "/usr/bin"); @readers = ( "elm", "Mail", "mailx" ); # standard mail readers if($opt_r) { if($ENV{'READER'}) { $myReader = $ENV{'READER'}; $foundReader = 1; # indicate that reader has been found } else { # select a reasonable default mail reader - prefers elm if exists $foundReader = 0; foreach $command (@readers) { if(!$foundReader) { foreach $path (@cPath) { if(-e "$path/$command" && -x "$path/$command") { $myReader = $path . "/" . $command; $foundReader = 1; } } # close loop over paths } # close if not foundReader } # $myReader = 'elm'; # use elm mail reader for now } if(!$foundReader) { print "newscan problem! Could not find a mail reader from built in list @readers in any of @cPath directories!\n Please set the READER environment variable to a reader on your system!\n"; die; } exec("$myReader -f $opt_r"); } if($opt_h) { print <<"EndOfHelp"; newscan -- a network news scanner newscan searches selected Internet network news groups for articles that match perl regular expressions. newscan implements a boolean query key-pattern full-text information retrieval system. In plain English, this means newscan scans the complete text of each article for matches to various combinations of perl regular expressions. Command Line Options: -c This flag selects the configuration file that tells newscan which newsgroups to search and what patterns to search for. If this option flag is not specified, newscan uses the default configuration file .newscanrc in the user's home directory or specified by the environment variable NEWSCAN. -e This flag indicates that the configuration file should be edited before doing the search. newscan will pop the user into the editor specified by the EDITOR environment variable. If no configuration file exists, newscan provides a template configuration file that the user should edit. -r This flag invokes a mail reader on the mailbox format file specified by the file-specification. This should be the file containing the articles found by newscan in a search. The user should set the environment variable READER to his or her favorite mail reader. Otherwise, newscan will select a mail reader, trying to use elm first if it exists. -h Output this help message. EndOfHelp exit; } @Months = ('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Dec'); %DaysInMonth = ( 'Jan', '31', 'Feb', '28', # 1992 is leap year with 29 days 'Mar', '31', 'Apr', '30', 'May', '31', 'Jun', '30', 'Jul', '31', 'Aug', '31', 'Sep', '30', 'Oct', '31', 'Nov', '30', 'Dec', '31', ); @DaysOfWeek = ('Mon','Tue','Wed','Thu','Fri','Sat','Sun'); %NewsArticle = (); # read the configuration file (default to .newscanrc) if($opt_c) { # configuration file is specified at command line. $configFile = $opt_c; $configFile =~ s/^\~/$ENV{'HOME'}/; } else { if($ENV{'NEWSCAN'}) { $configFile = $ENV{'NEWSCAN'}; } else # default configuration file in user's home directory { $configFile = $ENV{'HOME'} . '/.newscanrc'; } } # edit the configuration file before start if -e @editors = ( 'emacs', 'vi' ); if($ENV{'EDITOR'}) { $myeditor = $ENV{'EDITOR'}; } else { $foundEditor = 0; foreach $command (@editors) { foreach $path (@cPath) { if(-e "$path/$command" && -x "$path/$command") { $myeditor = $path . "/" . $command; $foundEditor = 1; } } last if $foundEditor; } die "Could not find an editor!\n" unless $foundEditor; # $myeditor = 'emacs'; # I am an emacs snob } if($opt_e) { if(-e $configFile) { # do nothing if configuration file already exists } else # file does not exist - provide a form { open(CONFIG,">$configFile"); print CONFIG <<"EndOfConfig"; # Template Configuration file for newscan Internet news scanner # # Note: Lines beginning with # are comments. Remove leading # and # edit line if you wish to activate line. # # specify the Internet address of the NNTP server for the system # the NNTP server is the machine where the Internet news is stored. NNTP # specify the mailbox format file for the found articles MBOX # specify the Internet newsgroups to be searched SELECT # retrieve a news article if it contains a match to regular expression #WHERE /perl regular expression/[i] # do not retrieve a news article if it contains a match to UNLESS regular expression #UNLESS /perl regular expression/[i] # a news article must contain a match to a REQUIRE regular expression to # be retrieved. #REQUIRE /perl regular expression/[i] # collect statistics on search in a file -- to aid in refining search criteria #COLLECT STATISTICS IN # collect statistics on frequency of articles that match a pattern #COLLECT STATISTICS ON /perl regular expression/i # collect statistics on search patterns specified by WHERE and REQUIRE lines #COLLECT STATISTICS ON SEARCH PATTERNS # collect statistics on unless patterns specified by UNLESS #COLLECT STATISTICS ON UNLESS PATTERNS # collect statistics on frequency of articles that match pairs of patterns #COLLECT STATISTICS ON PAIRS #: # EndOfConfig close(CONFIG); } system("$myeditor $configFile"); # edit the configuration file print "Do you wish to do this search (y/n):"; $ans = ; exit 0 if $ans =~ /[Nn]/; } open(CONFIG,"$configFile"); # open configuration file @config = ; # array @config now contains a copy of config file close(CONFIG); # $newgroup = 'NULL'; @groups = (); # null list of groups # parse the configuration file $mbox = "newscanBox"; # default mailbox file for found articles $port = 119; # default to 119 as NNTP server port for $i (0 .. $#config) { $_ = $config[$i]; chop; # removing the trailing newline ^J if(/^\s*NNTP\s+([a-zA-Z0-9\.]+)(\s+\d+)?\s*$/) { # line specifying internet address and port number of NNTP server $label = NNTP; $them = $1; if($2) { $two = $2; # note that parenthesis around $port is absolutely necessary ($port) = $two =~ /\s+(\d+)\b/; } } elsif(/^\s*MBOX\s+([\w\/\.\~]*)\s*$/) # name of file to store found articles { $mbox = $1; $mbox =~ s/^\~/$ENV{'HOME'}/; } elsif(/^#.*$/) # a comment line in configuration file begins with # { } elsif(/^\s*SELECT(\s+(\w+\.)+\w+)+\s*$/) { # expect space delimited list of newsgroups /^\s*SELECT\s*(.*)$/; @selectedGroups = split(/\s+/,$1); # create entry in range associative array for each group selected foreach $group (@selectedGroups) { $range{$group} = '' unless $range{$group}; # blank range } } elsif(/^\s*([\w\-]+\.)+[\w\-]+:.*$/) { # line specifying the range of newsgroup articles excluded from search # this line is optional. newscan will create this line after search # completed if it does not exist OR update the range if the line does exist # ($newgroup,$newrange) = split(/:/,$_,2); $range{$newgroup} = $newrange; } elsif(/^\s*WHERE\s*(\/.*[^\\]\/\w?)\s*$/) { # line specifying the search pattern for a newsgroup foreach $Group (@selectedGroups) { if($pattern{$Group}) { # \034 is the ascii field separator ^\ (I believe) $pattern{$Group} = join("\034",$pattern{$Group},$1); } else { $pattern{$Group} = $1; } } push(@theSearchPatterns,$1); } elsif(/^\s*REQUIRE\s+(\/.*[^\\]\/\w?)\s*$/) { foreach $Group (@selectedGroups) { if($required{$Group}) { $required{$Group} = join("\034",$required{$Group},$1); } else { $required{$Group} = $1; } } push(@theSearchPatterns,$1); } elsif(/^\s*UNLESS\s+(\/.*[^\\]\/\w?)\s*$/) # veto on this pattern { foreach $Group (@selectedGroups) { if($veto{$Group}) { $veto{$Group} = join("\034",$veto{$Group},$1); } else { $veto{$Group} = $1; } } push(@theVetoPatterns,$1); } elsif(/^\s*COLLECT\s+STATISTICS\s+IN\s+([\w\/\.\~]+)\s*$/) { $doCollect = 1; $statFile = $1; $statFile =~ s/^\~/$ENV{'HOME'}/; } elsif(/^\s*COLLECT\s+STATISTICS\s+ON\s+(\/.*[^\\]\/\w?)\s*$/) { $doCollect = 1; $statistics{$1} = 0; # initialize pattern histogram # counts number of articles # with a match } elsif(/^\s*COLLECT\s+STATISTICS\s+ON\s+PAIRS\s*$/) { $doCollect = 1; $doPairs = 1; %pairs = (); } elsif(/^\s*COLLECT\s+STATISTICS\s+ON\s+(SEARCH|WHERE)\s+PATTERNS\s*$/) { $doCollect = 1; $doSearchPatterns = 1; } elsif(/^\s*COLLECT\s+STATISTICS\s+ON\s+(VETO|UNLESS)\s+PATTERNS\s*$/) { $doCollect = 1; $doVetoPatterns = 1; } elsif(/^\s*$/) # skip blank line { } else { # line did not match any allowed pattern die "Abort! Syntax error in configuration file!\n Line: $i $_ \n"; } } # close loop over lines in configuration file &FixRange(*range); if($doCollect) { $statistics{'ALL'} = 0; # count of number of articles scanned $statistics{'FOUND'} = 0; # count of number of articles found } # add search patterns to patterns to collect statistics for # if requested if($doSearchPatterns) { # append search patterns to statistics foreach $pattern (@theSearchPatterns) { $statistics{$pattern} = 0; } } if($doVetoPatterns) { # append search patterns to statistics foreach $pattern (@theVetoPatterns) { $statistics{$pattern} = 0; } } # open mailbox file to store found newsarticles open(MBOX,">>$mbox"); # # connect to the NNTP server $port = 119 unless $port; $them = 'localhost' unless $them; $AF_INET = 2; $SOCK_STREAM = 1; $sockaddr = 'S n a4 x8'; chop($hostname = `hostname`); # translate protocol name to associated number ($name,$aliases,$proto) = getprotobyname('tcp'); # translates service (port) name to corresponding number ($name,$aliases,$port) = getservbyname($port,'tcp') unless $port =~ /^\d+$/; # translates network hostname to corresponding number ($name,$aliases,$type,$len,$thisaddr) = gethostbyname($hostname); # translate network hostname to corresponding number if($them =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/) { $thataddr = pack('C4',$1,$2,$3,$4); } elsif($them =~ /(\w+)(\.\w+)*/) { ($name,$aliases,$type,$len,$thataddr) = gethostbyname($them); } else { die "Fatal error: NNTP host not specified in proper format!\n"; } $this = pack($sockaddr, $AF_INET, 0, $thisaddr); $that = pack($sockaddr, $AF_INET, $port, $thataddr); # Make the socket a filehandle if(socket(S, $AF_INET, $SOCK_STREAM, $proto)) { # print "socket ok\n"; } else { die "Fatal Error: socket failed ", $1,"\n"; } # give the socket an address if(bind(S, $this)) { # print "bind ok\n"; } else { die "Fatal Error: bind to $this failed! ", $1, "\n"; } # Call up the server if(connect(S,$that)) { # print "Connect to $them ok\n"; } else { die "Fatal Error: Apparently, can't connect to $them tcp/ip port $port. Error: ", $1, "\n"; } # Set socket to be command buffered select(S); $| = 1; select(STDOUT); # loop over groups to search $_ = ; # read confirmation of connection message from NNTP server # loop over groups that are keys of range array # this allows different searches for different groups # %range associative array includes other groups such as cross posting groups foreach $group (keys %range) { # send group command to NNTP server print S "GROUP $group \n"; $_ = ; # read status reply from NNTP server ($status,$gn,$gfirst,$glast,$gname) = split(/ /); if($status == 411) { warn "Warning! Group $group does not exist! Skipping to next group!\n"; } elsif($status == 211) # loop over articles in the group { $i = $gfirst; $prevArticle = $gfirst; $do = 1; while($do) { # check if article is in not in excluded range (previously scanned articles) if(!&InRange($range{$group},$i)) { # check if article exists print S "STAT $i\n"; $_ = ; # read reply ($status,$article,$id,$rest) = split(/ /); if($status == 223) { # article retrieved print S "article\n"; $_ = ; ($status,$article,$id,$rest) = split(/ /); if($status == 220) { @text = (); $_ = ; # read first line of text while(!/^\.[^\.].*$/) # loop until encounter lone period at start of line (.^J ends text in NNTP) { chop; chop; # remove trailing CR LF (^M ^J) $line = $_ . "\n"; # add line feed LF ^J to end of line push(@text, $line); # add line to text list $_ = ; # read another line } &Collect(*text, *statistics, *pairs, *doPairs) if %statistics; $match = 0; # start with no match if(!&Veto(*text,*veto,*group)) { # look for match # deal with required patterns first $Required = 1; # start by assuming it matches all required patterns foreach $search (split("\034",$required{$group})) { $blatz = 0; $blatz = grep(eval($search),@text); if(!$blatz) { $Required = 0; # does not match pattern $search which is required. } } if($Required) { foreach $search (split("\034",$pattern{$group})) { $blatz = 0; $blatz = grep(eval($search),@text); if($blatz) { $match = 1; } } # close loop over search patterns } # close if Required } # clear the NewsArticle array %NewsArticle = (); # find any cross references if(!&GetXRef(*text,*NewsArticle)) { $NewsArticle{$group} = $article; } # NewsArticle is an associative array containing the group and article number # in group for the article ( an article may be posted to multiple groups # if match store article if($match) { $statistics{'FOUND'}++; &ToMailBox(*text); print MBOX @text; } # update the range --- only care about selected groups in @selectedGroups foreach $newsgroup (keys %range) { $ArticleInGroup = $NewsArticle{$newsgroup}; @theRange = split(',',$range{$newsgroup}); &UpDateRange(*theRange,*ArticleInGroup); $range{$newsgroup} = join(',',@theRange); } } else { warn "Warning! ARTICLE command returned unexpected status response: $status \n"; } } elsif($status == 423) { # 423 no such article number in this group warn "Warning! Article $i in Group $group does not exist!\n"; } elsif($status == 430) { # 430 no such article found warn "Warning! Article $i in Group $group not found!\n"; } else { die "Aborting! STAT $i in Group $group returned unexpected status response: $status.\n"; } # end if for result of STAT command } # end if !&InRange &NNTPNext; # go to next article in group } # end while($do) } else { die "Abort! NNTP GROUP command returned unexpected response: $status. \n"; } } # close loop over groups to search print S "quit\n"; # close the connection to the server close MBOX; # close the file of found news articles # update the configuration file &FixRange(*range); # clean up the range foreach $group (keys %range) { $notPresent{$group} = 'T'; } foreach $i (0 .. $#config) { $_ = $config[$i]; foreach $group (keys %range) { if(s/$group\s*:(.*)/$group:$range{$group}/) { $notPresent{$group} = 'F'; } } $config[$i] = $_; } # # append group range lines if don't exist # only do this for groups that have been selected # don't care about cross postings to groups that have not been # selected # foreach $group (keys %range) # keys of range are all groups to be searched { if($notPresent{$group} =~ /T/) { $line = "$group:$range{$group}\n"; push(@config,$line); } } open(CONFIG,">$configFile"); print CONFIG @config; close CONFIG; # save statistics if($statistics{'ALL'}) { open(STAT,">>$statFile"); # open to append to file print STAT "######\n"; @groups = keys %range; print STAT "Statistics on regular expression incidence in newsgroups: @groups \n"; ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime(time); $month = (Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec)[$mon]; $year = '19' . $year; print STAT "Collected $hour:$min:$sec (Local Time) $month $mday, $year \n"; print STAT "Regular Expression (Perl Syntax) : Number of Articles With a Match \n"; $length = 0; foreach $pattern (keys %statistics) { $curLength = length($pattern); $length = $curLength if $curLength > $length; } if($length + 2 < 60) { $length = $length + 2; } else { $length = 60; } foreach $pattern (keys %statistics) { printf STAT "%-${length}s: %d\n", $pattern, $statistics{$pattern}; } if($doPairs) { print STAT " Incidence of pairs of regular expressions \n"; print STAT " Only pairs that occur in at least one (1) article are reported. \n"; print STAT " : Number of Articles with Match : Correlation Coefficient \n"; $length = 0; foreach $key (keys %pairs) { $length = length($key) if length($key) > $length; } if($length < 58) { $length += 2; } else { $length = 60; } foreach $key (keys %pairs) { $Nall = $statistics{'ALL'}; $Npairs = $pairs{$key}; ($pOne,$pTwo) = split(/\034/,$key); $NOne = $statistics{$pOne}; $NTwo = $statistics{$pTwo}; $muOne = $NOne/$Nall; $muTwo = $NTwo/$Nall; $varOne = $muOne - ($muOne * $muOne); $varTwo = $muTwo - ($muTwo * $muTwo); if($varOne && $varTwo) { $corr = ( (($Npairs)/$Nall) - ($muOne * $muTwo))/(sqrt($varOne) * sqrt($varTwo)); } else { $corr = "UNDEFINED"; } ($index = $key) =~ s/\034/ /; printf STAT "%-${length}s : %-10d : %5.3f \n", $index, $pairs{$key}, $corr if $pairs{$key} && !($corr =~ "UNDEFINED"); } } close STAT; } ##### Support Subroutines Follow ##### sub Collect { local(*text, *statistics, *pairs, *doPairs) = @_; local(@ans); local($pattern); local(%FoundPattern) = () if $doPairs; local($pOne, $pTwo); local(%done) = () ; # array of found patterns foreach $pattern (keys %statistics) { @ans = (); # clear answer array @ans = grep(eval($pattern),@text) unless $pattern =~ 'ALL' || $pattern =~ 'FOUND'; if(@ans) { $FoundPattern{$pattern} = 1 if $doPairs; $statistics{$pattern}++; } } $statistics{'ALL'}++; # count all articles scanned if($doPairs) { foreach $pOne (keys %statistics) { $done{$pOne} = 1; foreach $pTwo (keys %statistics) { if(!$done{$pTwo}) { if($FoundPattern{$pOne} && $FoundPattern{$pTwo}) { if($pairs{$pOne,$pTwo}) { $pairs{$pOne,$pTwo}++; } else { $pairs{$pOne,$pTwo} = 1; } } # end if found pattern one and two } # end if pattern two not already checked } # end loop over pattern two } # end loop over pattern one } return; } sub NNTPNext { # group sending next command to NNTP server and code to parse response print S "next\n"; $_ = ; # retrieve reply to next ($status,$article,$id,$rest) = split(/ /); if($status == 223) { # 223 n a message $i = $article; } elsif($status == 421) { # 421 no next article in group $do = 0; } else { die "Abort! Unexpected response $status from NEXT command.\n"; } } sub InRange { local($range, $n) = @_; local(@ranges) = split(/,/,$range); local($answer) = 0; # return false unless in range local($i); # should use a more efficient search algorithm here now that ranges # are sorted in ascending order for $i (0 .. $#ranges) { $_ = $ranges[$i]; if(/^\s*(\d+)\s*$/) { $1 == $n ? ($answer = 1) : ($answer); } elsif(/^\s*(\d+)-(\d+)\s*$/) { if($n >= $1 && $n <= $2) { $answer = 1; return $answer; } } else { die "Abort in InRange! Syntax error in range: $_ \n"; } } # print "InRange: answer is $answer \n"; return $answer } sub ToMailBox { local(*text) = @_; local($path) = &GetPath(*text); local($date) = &GetDate(*text); local($header) = sprintf("%s %s %s\n","From",$path,$date); unshift(@text,$header); } sub GetPath { local(*text) = @_; local($line) = ''; local($path) = ''; foreach $line (@text) { $_ = $line; if(/^\s*[Pp]ath\s*:\s*([^\s]*)\s*$/) { $path = $1; } } if($path) { return $path; } else { warn "Warning! Could not find a Path line in article!\n"; } } sub GetDate { local(*text) = @_; local($line) = ''; local($date) = ''; local($date) = ''; foreach $line (@text) { $_ = $line; if(/^\s*[Dd]ate\s*:\s*(.*)$/) { $_ = $1; if(/(\w\w\w),\s+(\d\d?)\s+(\w\w\w)\s+(\d\d)\s+(\d?\d:\d\d:\d\d)\s+(\w\w\w)\s*/) { # dates of form: Tue, 18 May 93 11:24:33 GMT $wday = $1; $mday = $2; $month = $3; $year = '19' . $4; $time = $5; &FixTime(*time); $zone = $6; $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\w\w\w),\s+(\d\d?)\s+(\w\w\w)\s+(\d\d)\s+(\d?\d:\d\d:\d\d)\s+(-\d\d\d\d)\s*/) { # dates of form: Tue, 18 May 93 11:24:33 -0400 $wday = $1; $mday = $2; $month = $3; $year = '19' . $4; $time = $5; &FixTime(*time); $zone = 'GMT'; # kludge for this $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\w\w\w),\s+(\d\d?)\s+(\w\w\w)\s+(\d\d\d\d)\s+(\d?\d:\d\d:\d\d)\s+(-\d\d\d\d)\s*/) { # dates of form: Tue, 18 May 1993 11:24:33 -0400 $wday = $1; $mday = $2; $month = $3; $year = $4; $time = $5; &FixTime(*time); $zone = 'GMT'; # kludge for this $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\w\w\w),\s+(\d\d?)\s+(\w\w\w)\s+(\d\d\d\d)\s+(\d?\d:\d\d:\d\d)\s+(\w\w\w)\s*/) { # dates of form: Tue, 18 May 1993 11:24:33 GMT $wday = $1; $mday = $2; $month = $3; $year = $4; $time = $5; &FixTime(*time); $zone = $6; $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\d\d?)\s+(\w\w\w)\s+(\d\d\d\d)\s+(\d?\d:\d\d:\d\d)\s+(\w\w\w)\s*/) { # dates of form: 18 May 1993 11:24:33 GMT $mday = $1; $month = $2; $year = $3; $time = $4; &FixTime(*time); $zone = $5; $wday = &GetWeekDay($mday,$month,$year); $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\d\d?)\s+(\w\w\w)\s+(\d\d)\s+(\d?\d:\d\d:\d\d)\s+(\w\w\w)\s*/) { # dates of form 18 May 93 11:24:33 GMT $mday = $1; $month = $2; $year = '19' . $3; $time = $4; &FixTime(*time); $zone = $5; $wday = &GetWeekDay($mday,$month,$year); $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\d\d?)\s+(\w\w\w)\s+(\d\d\d\d)\s+(\d?\d:\d\d)\s+(\w\w\w)\s*/) { # dates of format 1 Jul 1993 05:54 CST $mday = $1; $month = $2; $year = $3; $time = $4 . ':00'; &FixTime(*time); $zone = $5; $wday = &GetWeekDay($mday,$month,$year); $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\d\d?)\s+(\w\w\w)\s+(\d\d)\s+(\d?\d:\d\d)\s+(\w\w\w)\s*/) { # dates of format 1 Jul 93 05:54 CST $mday = $1; $month = $2; $year = '19' . $3; $time = $4 . ':00'; &FixTime(*time); $zone = $5; $wday = &GetWeekDay($mday,$month,$year); $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\d\d?)\s+(\w\w\w)\s+(\d\d)\s+(\d?\d:\d\d:\d\d)\s+(-\d\d\d\d)\s*/) { # dates of form: 18 May 93 05:11:21 -0400 $mday = $1; $month = $2; $year = '19' . $3; $time = $4; &FixTime(*time); $zone = 'GMT'; # temporary kludge until i figure out what -0400 means $wday = &GetWeekDay($mday,$month,$year); $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\d\d?)\s+(\w\w\w)\s+(\d\d\d\d)\s+(\d?\d:\d\d:\d\d)\s+(-\d\d\d\d)\s*/) { # dates of form: 18 May 1993 05:11:21 -0400 $mday = $1; $month = $2; $year = $3; $time = $4; &FixTime(*time); $zone = 'GMT'; # temporary kludge until i figure out what -0400 means $wday = &GetWeekDay($mday,$month,$year); $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\d\d?)\s+(\w\w\w)\s+(\d\d)\s+(\d?\d:\d\d:\d\d)\s*/) { # dates of form: 18 May 93 05:11:21 $mday = $1; $month = $2; $year = '19' . $3; $time = $4; &FixTime(*time); $zone = 'GMT'; #kludge $wday = &GetWeekDay($mday,$month,$year); $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } elsif(/(\d\d?)\s+(\w\w\w)\s+(\d\d\d\d)\s+(\d?\d:\d\d:\d\d)\s*/) { # dates of form: 18 May 1993 05:11:21 $mday = $1; $month = $2; $year = $3; $time = $4; &FixTime(*time); $zone = 'GMT'; #kludge $wday = &GetWeekDay($mday,$month,$year); $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } else { warn "Warning! Format of date used in news article is not recognized by newscan: $_ \n"; warn "Warning! newscan is unable to reformat this date to the date format required for the mailbox format file!\n"; warn "Warning! A dummy date is used so that the mailbox format file with the stored articles will function with a mail reader!\n"; # use a dummy date below so it knows that I did find a Date line # but could not parse date format $wday = 'Mon'; # kludge $mday = 1; $month = 'Jan'; $year = '1800'; $time = '12:00:00'; # noon $zone = 'GMT'; $date = join(' ',$wday,$month,$mday,$time,$year,$zone); } # end parsing of date } # end if date line } if($date) { return $date; } else { warn "Warning! Could not find Date line in article!\n"; } } sub FixRange { local(*range) = @_; local($group,@theRange,@sortedRange); foreach $group (keys %range) { @theRange = split(',',$range{$group}); @sortedRange = sort SortElements @theRange; &CleanUpRange(*sortedRange); $range{$group} = join(',',@sortedRange); } return; } sub SortElements { if( ($minA,$maxA) = $a =~ /(\d+)\-(\d+)/) { } elsif( ($minA) = $a =~ /(\d+)/) { $maxA = $minA; } else { die "Error in range\n"; } if( ($minB,$maxB) = $b =~ /(\d+)\-(\d+)/) { } elsif( ($minB) = $b =~ /(\d+)/) { $maxB = $minB; } else { die "Error in range\n"; } if( $maxA > $maxB && $minA >= $minB) { return 1; } elsif( $maxB > $maxA && $minB >= $minA) { return -1; } else { return 0; } } sub CleanUpRange { local(*theRange) = @_; local($i)=0; local($next); local($element); local($newElement); local($minLo, $maxLo, $minHi, $maxHi); while($i < $#theRange) { if( ($minLo,$maxLo) = $theRange[$i] =~ /(\d+)\-(\d+)/) { } elsif( ($minLo) = $theRange[$i] =~ /(\d+)/) { $maxLo = $minLo; } else { warn "CleanUpRange Error! \n"; } if( ($minHi,$maxHi) = $theRange[$i+1] =~ /(\d+)\-(\d+)/) { } elsif( ($maxHi) = $theRange[$i+1] =~ /(\d+)/) { $minHi = $maxHi; } else { warn "CleanUpRange Error! \n"; } if($minHi <= ($maxLo + 1) && $minHi >= $minLo && $maxLo <= $maxHi) { # merge two elements in the range $newElement = "$minLo\-$maxHi"; splice(@theRange,$i,2,$newElement); } else { # don't merge elements -- move to next element in list $i++; # increment element in range } } # end while loop return; } sub UpDateRange { local(*theRange,*theArticle) = @_; local($i) = 0; local($lrange); # theRange is an array consisting of article numbers nn or nn-mm where nn and mm are article numbers foreach $lrange (@theRange) { local($_) = $lrange; if(/^(\d+)$/) { if($1 == $theArticle) { return 0; } elsif($1 == ($theArticle - 1) ) { local($newRange) = join('-',$1,$theArticle); splice(@theRange,$i,1,$newRange); return 1; # return 1 if updated range } elsif($1 == ($theArticle + 1) ) { local($newRange) = join('-',$theArticle,$1); splice(@theRange,$i,1,$newRange); return 1; # return 1 if updated range } else { } } elsif(/^(\d+)-(\d+)$/) { if($1 <= $theArticle && $2 >= $theArticle) { return 0; } elsif(($theArticle + 1) == $1 ) { $newRange = join('-',$theArticle,$2); splice(@theRange,$i,1,$newRange); return 1; # returns one if updated range } elsif(($theArticle - 1) == $2 ) { $newRange = join('-',$1,$theArticle); splice(@theRange,$i,1,$newRange); return 1; # returns one if updated range } else { } } else { } $i++; } # end loop over range list # if it gets here then article is not in excluded range and # is not one after an existing range push(@theRange,$theArticle); return 1; } sub GetWeekDay { local($mday,$month,$year) = @_; local($i) = 0; local($days) = 0; # Number of days since 1 Jan 1991 # 1 Jan 1991 is a Tue # count number of days in current year to date while($Months[$i] ne $month && ($i < 12)) { $days += $DaysInMonth{$Months[$i]}; if($Months[$i] eq 'Feb' && ( $year - 1988 ) % 4 == 0) { $days++; # add additional day for the leap year } $i++; } $days += $mday; # count number of days in years since 1990 (counts 1991) for(local($past) = 1991; $past < $year; $past++) { if(($past - 1988) % 4) # non-zero if not a leap year { $days += 365; } else # a leap year { $days += 366; } } local($weekday) = $days % 7; # at this point 0 is a Monday return $DaysOfWeek[$weekday]; } sub Veto { local(*text,*veto,*group) = @_; # veto article if matches a veto pattern local($search, $blatz); # declare local variables local(@vpatterns) = split("\034",$veto{$group}); foreach $search (@vpatterns) { $blatz = grep(eval($search),@text); if($blatz) { return 1; # veto this article } } return 0; # no veto } sub GetXRef { local(*text,*NewsArticle) = @_; # return array of news article numbers local(@fields); local($line,$field,$key,$value); local($iret) = 0; # did not find an Xref: line in text foreach $line (@text) { # parse Xref line local($_) = $line; chop; if(/^Xref:/) { @fields = split; shift(@fields); shift(@fields); foreach $field (@fields) { ($key,$value) = split(':',$field); $NewsArticle{$key} = $value; $iret = 1; } } # end if Xref line } # end loop over lines in article text return $iret; } sub FixTime { local(*time) = @_; local($_) = $time; if(/^\d\d:\d\d:\d\d$/) { # do nothing: time in correct format } elsif(/^\d:\d\d:\d\d$/) { $time = '0' . $time; } elsif(/^\d\d:\d\d$/) { $time = $time . ':00'; } elsif(/^\d:\d\d$/) { $time = '0' . $time . ':00'; } else { warn "Fixtime Warning! Do Not Recognize Time format of time: $time !\n"; } return; } ################################################## # Next few lines are legal in both perl and nroff .00; # finish .ig 'di \" finish diversion -- previous line must be blank .nr nl 0-1 \" fake up transition to first page again .nr % 0 \" start at page 1 ';__END__ #### From here on it's a standard manual page ### .TH NEWSCAN 1 "June 3, 1993" .AT 3 .SH NAME newscan \- scans Usenet news for articles matching regular expressions (uses perl regular expressions). Articles are saved in a mail folder, a file in mailbox format. .SH SYNOPSIS .B newscan [-c configuration-file] [-e] [-r folder-file-specification] [-h] .SH DESCRIPTION .I newscan searches selected newsgroups for articles matching patterns. Patterns are specified as regular expressions in a configuration or resource file. Also may specifify patterns to veto articles. If an article contains this pattern, the article will be ignored even if it contains a search pattern. The configuration file is specified with the command line argument -c configuration-file or in the environment variable NEWSCAN. The command line argument takes precedence over the NEWSCAN environment variable. If NEWSCAN is undefined, the configuration file defaults to .newscanrc in the user's home directory. Typically, newscan is run as a batch job: newscan & . .SH OPTIONS What You Want To Do Option Get help on .I newscan -h Edit Configuration File -e Override Default Configuration File -c configuration-file Invoke Mail Reader to Read a Folder -r folder .SH CONFIGURATION FILE newscan is controlled by a resource or configuration file containing command in a simple language that newscan understands. These commands should be all capitals. Configuration file commands: NNTP them.them.com [119] specifies the Internet address of the NNTP server. The first argument is the Internet address either as a Fully Qualified Domain Name or the dotted decimal format for the 32 bit Internet address. The optional second argument is the port number of the NNTP server; this should be 119 if NNTP specification is followed. Port 119 is reserved for NNTP. MBOX specifies the file where the found articles are stored. This file is a mail folder that may be read and manipulated by any mailer (Mail User Agent) can be any valid Unix file-specification (including path). MBOX interprets a leading tilde ~ as the user's home directory: e.g. MBOX ~/tmp/myfile. SELECT my.group his.group alt.group specifies the Internet newsgroups for which subsequent search specifications apply. newscan searches all newsgroups specified by SELECT lines. REQUIRE /regexp/[i] specifies a perl regular expression that must be found for a match to occur. This search criterion applies only to the newsgroups specified by the last preceding SELECT line. WHERE /regexp/[i] specifies the perl regular expression to be found. As in perl, the optional trailing /i tells newscan to ignore case. This search criterion applies only to the newsgroups specified by the last preceding SELECT line. UNLESS /regexp/[i] specifies the perl regular expression used to exclude an article. Even if the article contains a match to a WHERE regular expression it will be excluded (not found) if it contains a match to an UNLESS expression. As in perl, the optional trailing /i tells newscan to ignore case. COLLECT STATISTICS IN line tells newscan to collect statistics on the incidence of regular expressions in all articles scanned (not just articles that match). These statistics are printed in human readable form in the file . newscan appends the statistics information to . COLLECT STATISTICS ON /regexp/[i] line tells newscan to collect statisitcs on incidence of perl regular expression regexp. newscan counts the number of articles that contain at least one match to regexp. COLLECT STATISTICS ON {SEARCH|WHERE} PATTERNS tells newscan to collect statistics on all perl regular expressions specified by WHERE /regexp/[i] lines in the configuration file. COLLECT STATISTICS ON {VETO|UNLESS} PATTERNS tells newscan to collect statistics on all perl regular expressions specified by UNLESS /regexp/[i] lines in the configuration file. COLLECT STATISTICS ON PAIRS tells newscan to collect statistics on all pairs of perl regular expressions for which statistics are being collected. newscan counts the number of articles containing at least one match to both regular expressions. newscan also calculates a simple correlation coefficient between the two regular expressions in the pair. Note that if the correlation coefficient is 1.0, then the two regular expressions always occur together; this means one regular expression is a redundant (unneeded) search criterion. my.favorite.group:1-1100,1105-1110 is a line in the configuration file that lists ranges of articles in a group that are excluded from search. newscan updates this range after it finishes to avoid repeating a search. If this line is not provided by the user, newscan will generate the group range line and append it to the end of the configuration file. newscan generates the group range line for any other groups that a found article has been posted to. .SH SAMPLE RESOURCE FILE NNTP nntphost MBOX myBox SELECT misc.jobs.offered ba.jobs.offered WHERE /gui/i WHERE /motif/i WHERE /graphic/i UNLESS /From:.*Headhunter/i UNLESS /Subject:.*Recruit/i misc.jobs.offered:1-1000 ba.jobs.offered: .SH ENVIRONMENT .I NEWSCAN environment variable defines the configuration or resource file that controls the search. If NEWSCAN is not defined, then newscan defaults to the file .newscanrc in the user's home directory. NEWSCAN is superseded by the -c configuration-file command line argument. .I READER environment variable selects the mail reader used by newscan. If READER is not set, newscan will try to find and use a mail reader of its choosing. newscan will use Dave Taylor's elm if it exists. .I EDITOR environment variable selects the editor used by newscan. If EDITOR is not set, newscan will try to use first emacs and then vi. .SH FILES .I $HOME/.newscanrc is the default resource or configuration file specifying the search. This can be overridden from the command line (using -c configuration-file) or by setting the NEWSCAN environment variable. The command line argument supersedes the NEWSCAN environment variable. .I newscanBox is the default file where newscan saves found articles in mailbox format. This can be overridden by using the MBOX file-name command in the configuration file. .SH AUTHOR .I John F. McGowan e-mail: jfm@jupiter.slac.stanford.edu (or) jfm@btr.com .SH DIAGNOSTICS .SH BUGS A known nuisance is that newscan will find the same article more than once if the article has been manually posted to more than one newsgroup that newscan searches. If the article has been properly cross-posted to multiple newsgroups, newscan will find only one copy of the article. Please report any bugs to the author. .SH COPYRIGHTS (C) Copyright 1993 by John F. McGowan You may use or modify newscan as you like. However, you must credit John McGowan as the author of the original version of newscan and include the copyright notice above. .SH DISCLAIMER .I newscan is provided .I as is. There is no warranty, express or implied, that it will do what you want. There is no warranty that it will work reliably or without error. The author disclaims all responsibility or liability for any consequences of using newscan, including but not limited to losses of time or money. In other words, use at your own risk.