#!/usr/bin/perl
use strict;
use warnings;

my %word_list;        # this hash will have an entry for every distinct word
my $count = 0;        # count of distinct words
my $total_words = 0;  # total # of words used
my $file = shift or die "Usage: $0 <text file>\n"; # name of text file to parse
open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
while (<$fh>) { # for each line in the file
    my @words = split; # split each line into separate words
    foreach my $word (@words) { # for each word on that line
        $word =~ s/[.]//g; # strip any periods
        $word =~ s/"//g; # strip any quotes
        $word =~ s/,//g; # strip any commas
        $word =~ s/'//g; # strip any apostrophes
        next if $word eq ''; # skip tokens that were nothing but punctuation
        # unless ($word =~ /ly/) {next;} # remove the first # sign if you want to look for adverbs
        if ($word_list{$word}) { # if the word has already been seen increment the count
            $word_list{$word}++;
        }
        else {
            $word_list{$word} = 1; # else it's a new word, start with 1
        }
        $total_words++; # increment counter for total # of words
    }
} # end of the read loop; the report below runs once, after the whole file is read
close $fh;
print "\nWord usage in $file\n\n\n";
foreach my $key (sort sort_values (keys(%word_list))) { # get the keys, sort them by value, for each one
    print "$word_list{$key} \t $key\n";
    $count++; # increment counter for distinct # of words
}
print "\n\nTotal of $total_words words, $count distinct words used\n";

sub sort_values { # sort a hash by value, least-used words first
    $word_list{$a} <=> $word_list{$b};
}
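To try the counting logic without a text file, here is a minimal sketch that wraps the same idea in a subroutine and runs it on two sample sentences. The name count_words and the sample text are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes lines of text, returns a hash ref of word => count.
sub count_words {
    my @lines = @_;
    my %seen;
    foreach my $line (@lines) {
        foreach my $word (split ' ', $line) {
            $word =~ s/[.,"']//g;   # strip periods, commas, quotes and apostrophes
            next if $word eq '';    # skip tokens that were only punctuation
            $seen{$word}++;         # a missing entry counts up from zero
        }
    }
    return \%seen;
}

# Illustrative input, not taken from any real story.
my $counts = count_words("The cat sat.", "The dog sat, quietly.");

# Least-used words first; ties are broken alphabetically so the order is stable.
foreach my $word (sort { $counts->{$a} <=> $counts->{$b} or $a cmp $b } keys %$counts) {
    print "$counts->{$word} \t $word\n";
}
```

In this sample "The" and "sat" each come out with a count of 2. The comparison is case sensitive, so "The" and "the" would be counted as two different words.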
5 responses so far
1 Word Usage in SciFi Stories // Sep 6, 2008 at 5:24 pm
[...] words and I wondered how to find them. Word doesn’t offer a word count histogram, so I wrote one in perl. If you are lucky enough to use a Mac you are only about 3 minutes away from running your [...]
2 ellen // Feb 19, 2010 at 10:46 pm
btw, your lovely web page typeset all the quotes, which have to be changed into regular quote marks. Any programmer would know that (and you’re assuming enough knowledge here that anybody who’d be able to use it would know to fix it, I suppose). It took this non-programmer more like half an hour to get it working, but SO WORTH IT. Really. Still squeeing over here.
3 Sean // Feb 20, 2010 at 5:25 pm
Ellen, I’m glad to have helped, and I appreciate the comments about the quotes. I’ll work on a less confusing method.
4 Greg // May 8, 2010 at 1:51 am
not sure what the error means as I’m new to perl.
"Unrecognized character \x93; marked by <-- HERE after pen FILE, <-- HERE near column 12 at distinct_word_count.pl line 8."
Oh, I named the file that I created with your perl code "distinct_word_count.pl".
5 Sean // May 8, 2010 at 3:01 pm
Somebody else pointed out to me that the curly open and close quotes rendered in a browser will not properly convert to flat quotes for perl scripts. Try editing the quotes and replacing them with simple quotes. This won’t work in Word but should work in a text editor.
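If editing the quotes by hand is tedious, the replacement can probably be automated. The following is only a sketch: it assumes the pasted script was saved as UTF-8, and the subroutine name straighten_quotes and the file names in the usage comment are invented for this example. A file saved in a Windows encoding (which the \x93 in the error message above hints at) would need a different input encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace typographic quotes with the plain ASCII quotes Perl expects.
# Assumes $text holds decoded characters, not raw bytes.
sub straighten_quotes {
    my ($text) = @_;
    $text =~ s/[\x{201C}\x{201D}]/"/g;   # curly double quotes -> "
    $text =~ s/[\x{2018}\x{2019}]/'/g;   # curly single quotes and apostrophes -> '
    return $text;
}

# Usage (hypothetical file names):
#   perl fix_quotes.pl pasted_script.pl > distinct_word_count.pl
if (@ARGV) {
    my $path = shift @ARGV;
    open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!\n";
    binmode(STDOUT, ':encoding(UTF-8)');  # any other non-ASCII text passes through unchanged
    while (my $line = <$in>) {
        print straighten_quotes($line);
    }
    close $in;
}
```

Only the four curly quote characters are touched, so the rest of the script is left exactly as pasted.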