Perl issue (parsing text)

  1. Apr 17, 2005 #1
    I hope a few of you are avid perl programmers. There seems to be a surprising paucity of decent perl-centric boards out there, so I thought I'd try this one since so many smart people come here :-)

    I myself am new to perl but have plenty of experience in C++/Objective-C and other languages. I am finding perl very easy to learn and experiment with but am having the darndest time with this one module, Text::ParseWords. It's supposed to, well, parse a line of text into words. I thought it was working until I started checking its math. Here, run this perl script (I apologize for the lack of spaces or tabs, this old browser doesn't support CODE tags):

    ----------------------
    use Text::ParseWords;

    open(INFILE, "/someplace/test.txt") or die "Can't read!";
    #Contents of test.txt are following four lines:
    # here are four words
    #here are three
    #This word's invisible
    #And the end.

    $word = 0;
    $wc = 0;

    while (<INFILE>)
    {
    #@words = &shellwords($_);
    @words = &quotewords('\s+', 1, $_);
    $i = 0;

    foreach (@words)
    {
    if ($i == 1) # for each line's second word...
    {
    $word = $_;
    }
    $wc = $wc + 1;
    $i++;
    }
    print "2nd word is: '", $word, "' and this line makes ", $wc, " words so far\n";
    }

    print("Total of ", $wc, " words\n");

    close INFILE;
    ---------------------

    There, type or paste that in and run it. See the results? Notice that the total word count is right but the first line's word count is totally off and the third line isn't counted at all. The code that prints the 2nd word of each line is proof that the 3rd line is invisible. This reveals at least two bugs:

    - leading spaces on lines seem to get counted even though no logical word parser would work that way (it still doesn't make sense that line 1 has "six" words, you would expect "five")

    - apostrophes, aka "single-quotes" as far as computers are concerned, wreak havoc; that's the only way I can put it

    Unless I can get the word parsing module to handle leading spaces and apostrophes I won't be able to build this utility I'm working on. I know there must be a way!

    Btw, replace that call to quotewords() with the call to shellwords() that I commented out above it to see another possible way to handle it, which also fails miserably.

    Any help you guys can give would be much appreciated. And please don't reply with code in CODE tags, I might not be able to read it.
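
The behaviour described above can be reproduced in isolation. The following is a minimal sketch, assuming a reasonably recent Text::ParseWords: quote characters are treated as shell-style quoting, so an unbalanced apostrophe makes quotewords() give up on the line, and leading whitespace is read as a delimiter in front of an empty first field. `safe_words` is a hypothetical helper name (not part of the module) that trims the line and escapes quote characters before parsing.

```perl
use strict;
use warnings;
use Text::ParseWords;

# An unbalanced quote makes quotewords() return no fields for the line.
my @broken = quotewords('\s+', 1, "This word's invisible");
print "fields with raw apostrophe: ", scalar(@broken), "\n";

# Leading whitespace is a delimiter, so an empty field comes first.
my @padded = quotewords('\s+', 0, " here are four words");
print "fields with leading space: ", scalar(@padded), "\n";

# Hypothetical helper (an assumption, not module API):
# trim the line, then escape quotes so they are parsed as literal text.
sub safe_words {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;      # drop leading/trailing whitespace, newline included
    $line =~ s/(["'])/\\$1/g;     # backslash-escape both quote characters
    return quotewords('\s+', 0, $line);   # keep = 0 removes the backslashes again
}

my @fixed = safe_words(" This word's invisible\n");
print join('|', @fixed), "\n";
```

Note that shellwords() strips leading whitespace itself, so it should only need the quote escaping; any backslashes already present in the input would also be interpreted as escapes.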
     
  3. Apr 17, 2005 #2
    This is real hackish, but it seems to work. All I did was add two regex substitution commands to replace " with \" and ' with \'. :smile:

    There were a couple other typos in your code in the original post, too. I presume you have a correct version already. :cool:

    -----------------
    use Text::ParseWords;

    open(INFILE, "test.txt") or die "Can't read!";
    #Contents of test.txt are following four lines:
    ## here are four words
    ##here are three
    ##This word's invisible
    ##And the end.

    $word = 0;
    $wc = 0;

    while (<INFILE>)
    {
    s/\"/\\"/g; s/\'/\\'/g; #escape double and single quote characters
    @words = &shellwords($_);
    #@words = quotewords('\s+', 1, $_);
    $i = 0;

    foreach (@words)
    {
    if ($i == 1) # for each line's second word...
    {
    $word = $_;
    }
    $wc = $wc + 1;
    $i++;
    }
    print "2nd word is: '", $word, "' and this line makes ", $wc, " words so far\n";
    }

    print("Total of ", $wc, " words\n");

    close INFILE;
    -----------------
     
    Last edited: Apr 17, 2005
  4. Apr 17, 2005 #3

    chroot

    Staff Emeritus
    Science Advisor
    Gold Member

    This all seems incredibly convoluted.

    Why not just use this:

    #!/usr/bin/perl

    while (<>)
    {
    s/^\s+//; # remove leading
    s/\s+$//; # and trailing whitespace

    @words = split(/\s+/, $_); # split on runs of whitespace so repeated spaces don't yield empty words
    print scalar(@words) . "\n";
    }

    - Warren
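
As a variation on the same idea, Perl's split has a special case: when the pattern is the single-space string ' ', it skips leading whitespace and splits on runs of whitespace, so no separate trimming step is needed. The sketch below applies that to the four test lines from the first post; `words_in` is a hypothetical helper name, and the sample lines are held in an array instead of being read from test.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whitespace-separated words of one line.
sub words_in {
    my ($line) = @_;
    return split ' ', $line;   # the string ' ' selects awk-style splitting
}

# The four lines of test.txt from the first post.
my @sample = (
    " here are four words\n",
    "here are three\n",
    "This word's invisible\n",
    "And the end.\n",
);

my $total = 0;
for my $line (@sample) {
    my @words = words_in($line);
    $total += @words;                                   # array in numeric context = word count
    my $second = defined $words[1] ? $words[1] : '';    # second word, if the line has one
    print "2nd word is: '$second' and this line makes $total words so far\n";
}
print "Total of $total words\n";
```

Apostrophes are ordinary characters to split, so the third line is counted like any other.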
     
    Last edited: Apr 17, 2005
  5. Apr 19, 2005 #4
    Domo arigato!

    By golly, why *don't* I just do that? :)

    I had thought about removing single-quotes and leading whitespaces from the text, but didn't know how. Plus I figured I just wasn't using the ParseWords functions properly.

    Regarding my typos, I notice at least one that seems to be a glitch in the post. Something ate a few characters. I'm glad it didn't throw you guys.

    Thanks to both chroot and abhishek for your responses. One of you gave me a way to work around quote characters, and the other gave me a way to remove spaces, which were the two obstacles I was facing. Of course I hadn't told either of you what I was working on and whether it was okay to alter the source text by removing characters, so you each came up with your own solution.

    Incidentally, it's no secret project. I'm just writing a command to give the total size of -- and number of items in -- a directory, an obviously useful function which my bash shell strangely does not seem to offer. I figured writing it would be instructive in learning the CLI and Perl. The catch was that I was saving the contents of a dir (find [...] > output.txt) and analyzing it with a perl script to get the total file size and count, but as you saw, ParseWords wasn't handling a couple things properly.

    I can post it when it's finished if anyone has a use for it, but it's written specifically for Mac OS X, i.e., it handles .apps properly. The whole catch is that on a Mac an application is actually a folder full of (sometimes) thousands of resource files, so using Unix's find command returns all those files within the apps, which shouldn't count as files. So the whole project became surprisingly complicated, what with the finding and the grepping and the perling (word?).

    Anyway, maybe I'll post it at some point; not like you guys couldn't write such a command yourselves, of course, and do it better than me, but maybe someone else will find it useful.
     
  6. Apr 19, 2005 #5

    chroot

    Staff Emeritus
    Science Advisor
    Gold Member

    The bash shell is not in the business of listing files. Use the programs ls and du to perform those functions. Use `man ls` and `man du` to get more information about these programs.

    To count the number of files (not including directories): `ls -l | grep -v -c "^d"`

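    To count the number of files (including directories): `ls -l | wc -l`

    Note that `ls -l` prints a "total" line first, and that line is counted too, so subtract 1 from the result.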

    To get the total size of all the files in the current directory (not including subdirectories): `du -Ssh .`

    To get the total size of all the files in the current directory (including subdirectories): `du -sh .`

    This is an entertaining exercise, for sure, even though it is sorta reinventing the wheel.

    Here's a simple script I whipped up that will count files; you should be able to edit it pretty easily to handle the .apps folders as you'd prefer. (I didn't quite understand the behavior you're looking for, so I didn't attempt to code it.)

    #!/usr/bin/perl

    ($count, $size) = tally(shift || ".");

    if ($size > (1024 * 1024 * 1024))
    {
    $size = sprintf "%.2f GB", $size / (1024 * 1024 * 1024);
    }
    elsif ($size > (1024 * 1024))
    {
    $size = sprintf "%.2f MB", $size / (1024 * 1024);
    }
    elsif ($size > 1024)
    {
    $size = sprintf "%.2f kB", $size / 1024;
    }
    else
    {
    $size = $size . " B";
    }

    print "$count files, total size $size\n";

    sub tally
    {
    my $thing = shift;
    my ($count, $size, $subcount, $subsize, $entry) = (0, 0); # start count and size at 0 so an empty directory returns numbers

    if (-f $thing)
    {
    return (1, -s $thing);
    }
    elsif (-d $thing)
    {
    # Uncomment to count directories, too
    # $count++;

    opendir(DIR, $thing) || die "Can't open directory: $thing";
    my @contents = grep { !/^\.$/ && !/^\.\.$/ } readdir DIR; # Read all files, even hidden ones
    # my @contents = grep { !/^\./ } readdir DIR; # Read only files that are not hidden
    closedir DIR;

    foreach $entry (@contents)
    {
    ($subcount, $subsize) = tally("$thing/$entry");
    $count += $subcount;
    $size += $subsize;
    }
    }

    return ($count, $size);
    }

    - Warren
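
One possible way to handle the .app case mentioned earlier in the thread is sketched below using the core File::Find module instead of the hand-written recursion. It rests on assumptions the thread does not settle: a bundle is taken to be any directory whose name ends in ".app", each bundle is counted as one item, and the bundle's contents are neither counted nor sized. `tally_bundles` is a hypothetical name.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: count items under $root, treating each *.app
# directory as a single item and not descending into it.
sub tally_bundles {
    my ($root) = @_;
    my ($count, $size) = (0, 0);

    find(sub {
        if (-d $_ && /\.app\z/) {
            $count++;                   # the bundle counts as one item
            $File::Find::prune = 1;     # tell find() not to descend into it
        }
        elsif (-f $_) {
            $count++;
            $size += (-s _) || 0;       # reuse the stat from -f; empty files add 0
        }
    }, $root);

    return ($count, $size);
}

my ($count, $size) = tally_bundles(shift || ".");
print "$count items, $size bytes (bundle contents not included)\n";
```

If the bundle's size should be included in the total, the pruned branch could call the same routine on the bundle and add only the returned size.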
     
    Last edited: Apr 19, 2005