Space Time Stories

Space and Time Travel Stories. A Science Fiction Blog By Sean O’Brien

  • Word Usage in SciFi Stories

    September 6th, 2008 · 2 Comments

    As a professional programmer I sort text files and analyze their content. Every time I look at my own writing I have the habit of analyzing it the same way. Yesterday I realized the value of running a purely statistical analysis of my word usage.

    I have been listening to Richard Morgan’s trilogy Altered Carbon, Broken Angels, and Woken Furies. I noticed that Morgan really likes the word “shrugged”. Kovacs and others use this word 30+ times per book. It’s annoying, probably an indicator that he rushed the books past an overworked editor. It also might be an indication that words can be overused in audiobooks even though the work is great on paper.

    I assumed that I had a few overused words and I wondered how to find them. Word doesn’t offer a word count histogram, so I wrote one in Perl. If you are lucky enough to use a Mac you are only about 3 minutes away from running your own word count on any document you like. If you are on a PC you will probably have to download ActiveState Perl and get it running. This might take a while. I’ll look into building an exe file if there is sufficient interest.

    On a Mac you have to do 4 things:

    1) open the Terminal and go to your target directory

    2) save this Perl code in your target directory as a text file named wordcount.pl, then make it executable by typing chmod +x wordcount.pl

    3) save your document as a text file in the same directory, let’s say you name it doc.txt. This program will not read Word files or other formats, just text.

    4) type ./wordcount.pl doc.txt

    The lightning-fast result is a histogram analysis of your word usage. A 53,000 word document on a MacBook Pro runs in about 2 seconds. If you want to save the output for future review, type ./wordcount.pl doc.txt > results.txt and then open results.txt in your word processor.

    Each line of output starts with the number of times the word is used, followed by the word, like this:
    1176 of
    1255 and
    1268 a
    1441 to
    2502 the

    At the end is a summary: Total of 53955 words, 10352 distinct words used. My word count agreed exactly with Microsoft Word!! I would have bet money they would not be exactly the same. But at least it gives me confidence in my code.

    Your most heavily used words will of course be: the, to, a, and, etc. You will have to dig through the list to find the first word that is not common.
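    One way to skip the digging is to filter the counts against a list of common words. The following is a minimal sketch, not part of the original script: the %stop list is a hypothetical example you would extend yourself, and the counts are simply the figures quoted in this post, hard-coded so the snippet runs on its own. Unlike the script's output, it lists the most frequent words first.

```perl
use strict;
use warnings;

# Counts quoted in this post; a real run would build this hash from the document.
my %word_list = (
    'the' => 2502, 'to' => 1441, 'a' => 1268, 'and' => 1255, 'of' => 1176,
    'Phillip' => 122, 'only' => 98, 'nearly' => 53, 'really' => 30,
);

# Hypothetical stop list (an assumption): add any common words you want hidden.
my %stop = map { $_ => 1 } qw(the to a and of);

# Keep only words that are not in the stop list, most frequent first.
my @interesting = sort { $word_list{$b} <=> $word_list{$a} }
                  grep { !$stop{ lc $_ } } keys %word_list;

# Print in the same "count, then word" layout as the script.
printf "%d\t%s\n", $word_list{$_}, $_ for @interesting;
```

    With the figures above this prints Phillip, only, nearly, and really, in that order.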

    My first word to study is Phillip, the name of my protagonist, which I have used 122 times. This might be an indication I’m overusing his name, although it will take a careful reading of the book to decide when to drop the name.

    Suppose you only want to see the dreaded adverb. It’s trivial to look for words containing “ly”. You can modify my script by deleting the first # sign from this line, around line 21:

    # unless ($word =~ /ly/) {next;} # remove the first # sign if you want to look for adverbs
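    Note that the pattern /ly/ matches “ly” anywhere in a word, not just at the end. The sketch below compares it with the anchored pattern /ly$/, which keeps only words ending in “ly”. The sample words are hypothetical, chosen to show the difference, and neither pattern can tell a true adverb from a word like “family”, so the results still need a human eye.

```perl
use strict;
use warnings;

# Hypothetical sample words (an assumption), not taken from the novel.
my @words = qw(slowly flying only family really analyze shrugged);

# /ly/ matches "ly" anywhere in the word.
my @contains = grep { /ly/ } @words;

# /ly$/ matches only when the word ends in "ly".
my @ends = grep { /ly$/ } @words;

print "contain ly: @contains\n";
print "end in ly:  @ends\n";
```

    In the script itself, the equivalent change would be to write /ly$/ in place of /ly/ on the adverb line.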

    I learned that my novel contains 261 distinct words ending in ly, and I use them 800 times. I used “only” 98 times, “nearly” 53 times, and “really” 30 times. Awful, just awful. Instead of slowly reading each paragraph I can now target specific words in much faster editing sessions.

    Editors look for overuse of adverbs and specific words. Statistical analysis of your work is a tool you can use to get past these roadblocks. I look forward to your feedback.

    Tags: Reviews · commentary

    2 responses so far ↓

    • 1 Phil Weaver // Apr 20, 2009 at 11:51 am

      Thanks for the Perl script, Sean. I had a similar idea for my own book, and was glad to use it. I had a wider range of punctuation, though, and decided I wanted to convert everything to lower case before collecting the histogram. Here’s my modified version if you want it.

      Regards,
      -Phil

      use strict;
      use warnings;

      my %word_list;       # this hash will have an entry for every distinct word
      my $count = 0;       # count of distinct words
      my $total_words = 0; # total # of words used
      my $file = shift or die "Usage: $0 file.txt\n"; # name of text file to parse

      open(my $fh, '<', $file) or die "Cannot open $file: $!\n";

      while (<$fh>) { # for each line in the file

          my @words = split; # split each line into separate words

          foreach my $word (@words) { # for each word on that line

              $word =~ s/[.]//g; # strip any periods
              $word =~ s/"//g;   # strip any straight double quotes
              $word =~ s/“//g;   # strip any opening curly quotes
              $word =~ s/”//g;   # strip any closing curly quotes
              $word =~ s/,//g;   # strip any commas
              $word =~ s/'//g;   # strip any straight apostrophes
              $word =~ s/://g;   # strip any colons
              $word =~ s/’//g;   # strip any curly apostrophes and single quotes
              $word =~ s/;//g;   # strip any semicolons
              $word =~ s/!//g;   # strip any exclamation points
              $word =~ s/\?//g;  # strip any question marks

              $word = lc($word); # fold to lower case so "The" and "the" count together
              next if $word eq ''; # skip tokens that were nothing but punctuation
              # unless ($word =~ /ly/) {next;} # remove the first # sign if you want to look for adverbs

              if ($word_list{$word}) { # if the word has already been seen increment the count
                  $word_list{$word}++;
              }
              else {
                  $word_list{$word} = 1; # else it's a new word, start with 1
              }
              $total_words++; # increment counter for total # of words
          }
      }
      close($fh);

      print "\nWord usage in $file\n\n\n";

      foreach my $key (sort sort_values (keys(%word_list))) { # get the keys, sort them by value, for each one
          print "$word_list{$key} \t $key\n";
          $count++; # increment counter for distinct # of words
      }

      print "\n\nTotal of $total_words words, $count distinct words used\n";

      sub sort_values { # sort a hash by value, least used first
          $word_list{$a} <=> $word_list{$b};
      }

    • 2 Sean // Apr 20, 2009 at 10:40 pm

      Phil, outstanding contribution. Thanks very much.
