How can I fix the bugs in my Perl script using Text::ParseWords?

  • Thread starter: Gnophos
  • Tags: Text

Discussion Overview

The discussion revolves around troubleshooting a Perl script that utilizes the Text::ParseWords module for parsing text into words. Participants explore issues related to word counting, handling of leading spaces, and the treatment of apostrophes in the input text. The scope includes practical coding challenges and potential solutions.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Experimental/applied

Main Points Raised

  • One participant describes their experience with the Text::ParseWords module, noting that it incorrectly counts words due to leading spaces and apostrophes.
  • Another participant suggests a workaround involving regex substitutions to escape quote characters, claiming it resolves some issues in the original script.
  • A different participant proposes a simpler approach using basic Perl functions to remove leading and trailing whitespace and split the text into words, questioning the complexity of the original method.
  • The original poster acknowledges the suggestions and reflects on their initial assumptions about using ParseWords, expressing gratitude for the alternative solutions provided.
  • One participant critiques the original poster's approach, suggesting that standard shell commands could achieve the desired functionality without the need for a custom Perl script.
  • The same participant also shares a complete Perl script that counts files and totals their size, noting that it could be adapted for the original poster's needs.

Areas of Agreement / Disagreement

Participants express differing opinions on the best approach to solve the problem, with no consensus reached on a single solution. Some favor using Text::ParseWords with modifications, while others advocate for simpler alternatives or existing shell commands.

Contextual Notes

Participants note various typos and issues in the original script, indicating that some of the problems may stem from these errors. The discussion also highlights the complexity of handling specific text formats and the challenges of parsing in Perl.

Who May Find This Useful

Individuals interested in Perl programming, text parsing, and those facing similar challenges with word counting in scripts may find this discussion beneficial.

Gnophos
I hope a few of you are avid perl programmers. There seems to be a surprising paucity of decent perl-centric boards out there, so I thought I'd try this one since so many smart people come here :-)

I myself am new to perl but have plenty of experience in C++/Objective-C and other languages. I am finding perl very easy to learn and experiment with but am having the darndest time with this one module, Text::ParseWords. It's supposed to, well, parse a line of text into words. I thought it was working until I started checking its math. Here, run this perl script (I apologize for the lack of spaces or tabs, this old browser doesn't support CODE tags):

----------------------
use Text::ParseWords;

open(INFILE, "/someplace/test.txt") or die "Can't read!";
#Contents of test.txt are following four lines:
# here are four words
#here are three
#This word's invisible
#And the end.

$word = 0;
$wc = 0;

while (<INFILE>)
{
#@words = &shellwords($_);
@words = &quotewords('\s+', 1, $_);
$i = 0;

foreach (@words)
{
if ($i == 1) # for each line's second word...
{
$word = $_;
}
$wc = $wc + 1;
$i++;
}
print "2nd word is: '", $word, "' and this line makes ", $wc, " words so far\n";
}

print("Total of ", $wc, " words\n");

close INFILE;
---------------------

There, type or paste that in and run it. See the results? Notice that the total word count is right but the first line's word count is totally off and the third line isn't counted at all. The code that prints the 2nd word of each line is proof that the 3rd line is invisible. This reveals at least two bugs:

- leading spaces on lines seem to get counted even though no logical word parser would work that way (it still doesn't make sense that line 1 has "six" words, you would expect "five")

- apostrophes, aka "single-quotes" as far as computers are concerned, wreak havoc; that's the only way I can put it
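Both effects can be reproduced in isolation. The following is a minimal sketch, assuming a reasonably current Text::ParseWords; the helper name fields_for is made up for illustration and is not part of the module. With '\s+' as the delimiter, the text before the first delimiter is returned as a field even when it is empty, and a lone apostrophe is read as the start of a quoted string that never closes, in which case quotewords() appears to give up and return an empty list. A trailing newline may add one more (undefined) field on top of that, so it is probably worth calling chomp before parsing.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical helper (not part of Text::ParseWords): return the fields that
# quotewords() produces for one line, showing undefined fields as '<undef>'.
sub fields_for {
    my ($line) = @_;
    return map { defined $_ ? $_ : '<undef>' } quotewords('\s+', 1, $line);
}

# Leading whitespace: the empty text before the first delimiter is a field.
my @lead = fields_for(" here are four words");
print scalar(@lead), " fields: ", join('|', @lead), "\n";

# Unbalanced quote: the apostrophe opens a quoted string that never closes,
# so the parse fails and nothing is returned for the whole line.
my @apos = fields_for("This word's invisible");
print scalar(@apos), " fields\n";
```

On the two sample lines this should report five fields (the first one empty) and zero fields respectively.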

Unless I can get the word parsing module to handle leading spaces and apostrophes I won't be able to build this utility I'm working on. I know there must be a way!

Btw, replace that call to quotewords() with the call to shellwords() that I commented out above it to see another possible way to handle it, which also fails miserably.

Any help you guys can give would be much appreciated. And please don't reply with code in CODE tags, I might not be able to read it.
 
This is real hackish, but it seems to work. All I did was add two regex substitution commands to replace " with \" and ' with \'. :smile:

There were a couple other typos in your code in the original post, too. I presume you have a correct version already. :cool:

-----------------
use Text::ParseWords;

open(INFILE, "test.txt") or die "Can't read!";
#Contents of test.txt are following four lines:
## here are four words
##here are three
##This word's invisible
##And the end.

$word = 0;
$wc = 0;

while (<INFILE>)
{
s/\"/\\"/g; s/\'/\\'/g; #escape double and single quote characters
@words = &shellwords($_);
#@words = quotewords('\s+', 1, $_);
$i = 0;

foreach (@words)
{
if ($i == 1) # for each line's second word...
{
$word = $_;
}
$wc = $wc + 1;
$i++;
}
print "2nd word is: '", $word, "' and this line makes ", $wc, " words so far\n";
}

print("Total of ", $wc, " words\n");

close INFILE;
-----------------
 
This all seems incredibly convoluted.

Why not just use this:

#!/usr/bin/perl

while (<>)
{
s/^\s+//; # remove leading
s/\s+$//; # and trailing whitespace

@words = split(/\s+/, $_); # split on runs of whitespace, so double spaces don't create empty words
print scalar(@words) . "\n";
}

- Warren
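
A shorter variant of the same idea, offered as a sketch rather than a correction: split treats the pattern ' ' (a string holding a single space) as a special case that skips leading whitespace and splits on runs of whitespace, so the two trimming substitutions become optional. The helper name count_words is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated words on one line.
sub count_words {
    my ($line) = @_;
    # split ' ' skips leading whitespace and splits on runs of whitespace;
    # trailing empty fields are dropped by default, so the newline is harmless.
    my @words = split ' ', $line;
    return scalar @words;
}

print count_words(" here are four words\n"), "\n";   # prints 4
print count_words("This word's invisible\n"), "\n";  # prints 3
```

Apostrophes need no special handling here because split has no notion of quoting.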
 
Domo arigato!

By golly, why *don't* I just do that? :)

I had thought about removing single-quotes and leading whitespaces from the text, but didn't know how. Plus I figured I just wasn't using the ParseWords functions properly.

Regarding my typos, I notice at least one that seems to be a glitch in the post. Something ate a few characters. I'm glad it didn't throw you guys.

Thanks to both chroot and abhishek for your responses. One of you gave me a way to work around quote characters, and the other gave me a way to remove spaces, which were the two obstacles I was facing. Of course I hadn't told either of you what I was working on and whether it was okay to alter the source text by removing characters, so you each came up with your own solution.

Incidentally, it's no secret project. I'm just writing a command to give the total size of -- and number of items in -- a directory, an obviously useful function which my bash shell strangely does not seem to offer. I figured writing it would be instructive in learning the CLI and Perl. The catch was that I was saving the contents of a dir (find [...] > output.txt) and analyzing it with a perl script to get the total file size and count, but as you saw, ParseWords wasn't handling a couple things properly.

I can post it when it's finished if anyone has a use for it, but it's written specifically for Mac OS X, i.e., it handles .apps properly. The whole catch is that on a Mac an application is actually a folder full of (sometimes) thousands of resource files, so using Unix's find command returns all those files within the apps, which shouldn't count as files. So the whole project became surprisingly complicated, what with the finding and the grepping and the perling (word?).
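
The .app rule described above could be sketched directly in Perl with the core File::Find module, skipping the intermediate output.txt. This is only one reading of the requirement, and the assumptions are loud ones: a directory whose name ends in .app counts as a single item, the files inside it add to the byte total but not to the item count, and symlinks are skipped. The name tally_with_bundles is hypothetical.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper (not from the thread): returns (item count, total bytes)
# under $top. Assumption: a *.app directory counts as ONE item; files inside
# it add to the byte total but not to the item count.
sub tally_with_bundles {
    my ($top) = @_;
    my ($count, $size) = (0, 0);

    find(sub {
        return if -l $_;    # skip symlinks rather than follow them

        # Path relative to $top, so a ".app" in $top itself is ignored
        my $rel = substr($File::Find::name, length $top);
        my $inside_bundle = $rel =~ m{\.app/};

        if (-d $_) {
            # Count the bundle directory itself, once
            $count++ if $rel =~ m{\.app\z} && !$inside_bundle;
        }
        elsif (-f _) {
            $size += -s _;
            $count++ unless $inside_bundle;
        }
    }, $top);

    return ($count, $size);
}

# Run only when a directory is given on the command line
if (@ARGV) {
    my ($count, $size) = tally_with_bundles($ARGV[0]);
    print "$count items, $size bytes\n";
}
```

Ordinary directories are not counted as items in this sketch; only regular files and .app bundles are.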

Anyway, maybe I'll post it at some point; not like you guys couldn't write such a command yourselves, of course, and do it better than me, but maybe someone else will find it useful.
 
Gnophos said:
I'm just writing a command to give the total size of -- and number of items in -- a directory, an obviously useful function which my bash shell strangely does not seem to offer.
The bash shell is not in the business of listing files. Use the programs ls and du to perform those functions. Use `man ls` and `man du` to get more information about these programs.

To count the number of files (not including directories): `ls -l | tail -n +2 | grep -v -c "^d"`

To count the number of files (including directories): `ls | wc -l`

To get the total size of all the files in the current directory (not including subdirectories): `du -Ssh .`

To get the total size of all the files in the current directory (including subdirectories): `du -sh .`

Gnophos said:
I figured writing it would be instructive in learning the CLI and Perl.
This is an entertaining exercise, for sure, even though it is sort of reinventing the wheel.

Here's a simple script I whipped up that will count files; you should be able to edit it pretty easily to handle the .apps folders as you'd prefer. (I didn't quite understand the behavior you're looking for, so I didn't attempt to code it.)

#!/usr/bin/perl

($count, $size) = tally(shift || ".");

if ($size > (1024 * 1024 * 1024))
{
$size = sprintf "%.2f GB", $size / (1024 * 1024 * 1024);
}
elsif ($size > (1024 * 1024))
{
$size = sprintf "%.2f MB", $size / (1024 * 1024);
}
elsif ($size > 1024)
{
$size = sprintf "%.2f kB", $size / 1024;
}
else
{
$size = $size . " b";
}

print "$count files, total size $size\n";

sub tally
{
my $thing = shift;
my ($count, $size) = (0, 0); # start at zero so an empty directory reports 0 rather than undef
my ($subcount, $subsize, $entry);

if (-f $thing)
{
return (1, -s $thing);
}
elsif (-d $thing)
{
# Uncomment to count directories, too
# $count++;

opendir(DIR, $thing) || die "Can't open directory: $thing";
my @contents = grep { !/^\.$/ && !/^\.\.$/ } readdir DIR; # Read all files, even hidden ones
# my @contents = grep { !/^\./ } readdir DIR; # Read only files that are not hidden
closedir DIR;

foreach $entry (@contents)
{
($subcount, $subsize) = tally("$thing/$entry");
$count += $subcount;
$size += $subsize;
}
}

return ($count, $size);
}

- Warren
 