15-110 Fall 2010 Homework 7

Due Sunday, October 24, at 10pm (no late submissions accepted!)

Note: to do this homework, you should download Homework7.zip, then unzip it, edit the files, rezip, and submit that zipped file to Autolab.

Note: in addition, you should download hw7-novels.zip, and unzip it so that the folders Homework7 and hw7-novels are both in the same folder.  You will need this folder to run hw7.1 (matchNovels).  However:  Do not submit hw7-novels!  Only submit your code (the Python files in the Homework7 folder).

Read these instructions first!

·         Do not change any file names!  Each problem that includes a changed file name will be docked a 50% penalty.

·         Include your name, AndrewID, and section clearly on the top of each file in your assignment.

·         Place all your solutions in a single file ZIP file (extension.zip). Once again, start with Homework7.zip. No other file formats will be accepted. You need to start by downloading a ZIP file that contains a template (a file ending in .py) for each problem in this homework.

·         Show your work. Correct answers without supporting calculations will not receive full credit.

·         How to submit:

Your submission is not completed until you see that text!!!

Once again: Late submissions will be rejected.

Log

We want to gauge how much time you are spending in each homework and how you are using your resources. This week we are going to record your answers using an online survey.  An email with the details will be sent to you.


Problem 1:  matchNovels (30 points)

This problem involves matching novels to other novels written by the same author.  We will hypothesize (guess) that despite some obvious variability, the same author will tend to use many of the same words in all of his or her novels.  If this pattern is sufficiently strong, then we can use word frequencies as something of a fingerprint that can help  identify the author of a novel.  In this problem, we will test this hypothesis.

First, we will download some novels (in plaintext from Project Gutenberg) and place them in a directory.  Actually, we've already done this for you with some sample data, so you can just unzip hw7-novels.zip to get started.  In any case, whether you are using hw7-novels.zip or novels of your choosing, the directory should hold two novels from each author that we use.  Next, we will compare every novel in the directory to every other novel.  For each pair of novels, we will compute a match based on the word frequencies of those two novels.  Our hopes are that novels by the same author have a high match, and those by different authors have a low match.

We have provided you with much of the "test harness" or "plumbing" to make this work:  code that compares each pair of novels, extracts the words from each novel, and prints the results of all the tests.  You are not responsible for any of that code (though you may wish to peek at it, if you're curious).  Your task is to provide the missing pieces (at the top of the file) that do the "heavy lifting" to actually compare the text from two novels.

Specifically, you should write the following functions (and it is recommended to do so in the order given):

1.1) getWordCounts(listOfWords) 

This function takes a list of (all lowercase) words and non-destructively returns a dictionary mapping each word in the list to a count of the number of times that word occurs in the list.  As with all the functions described here, refer to the assertions in the provided test functions for more details.

1.2) getWordFreqs(wordCounts)
This function takes a dictionary of wordCounts (the return value from getWordCounts, in fact), and non-destructively returns a new dictionary that maps words to their normalized frequencies rather than their counts.  To obtain frequencies, one just divides each count by the sum of all the counts, so that the sum of all the frequencies would equal 1.  However, we want to normalize the frequencies so that the maximum frequency is 1.  To do so, rather than divide by the total count, we divide by the largest count.  For example, consider this dictionary of counts:
    dict([("dog", 3), ("cat", 2), ("gnu", 5)])
We see that "gnu" has the largest count (5).  To compute the normalized frequencies, we divide each count by 5.  So "dog" will be mapped to 0.6, "cat" to 0.4, and "gnu" to 1.0.  Note that this is a somewhat simplistic normalization, but good enough for our purposes.

1.3) getAllWords(wordFreqs1, wordFreqs2)

This function takes two dictionaries of word frequencies (both returned from getWordFreqs), and non-destructively returns a new list containing all the words that occur as keys in either or both of wordFreqs1 or wordFreqs2.  If a word occurs in both dictionaries, however, it should still be included only once in the resulting list.  The resulting list may be in any order.

1.4) getWordFreqsMatch(wordFreqs1, wordFreqs2)

This function takes two dictionaries of word frequencies (both returned from getWordFreqs) and non-destructively computes and returns the match, which is a score from 0.0 (worst) to 1.0 (best) indicating (we hope) the likelihood that the two novels with these respective word frequencies are written by the same author.  For this problem, you must use this (rather simplistic) matching algorithm:  for every word in either novel, if the normalized frequency of the word (as provided in the parameters) in both novels is less than 0.01, we will ignore the word.  But if the frequency of the word is at least 0.01 in at least one novel, then increase our score by this amount:
   1 - | freq1 - freq2 |0.25
where the bars signify absolute-value.  After doing this for every word in either novel, divide the final score by the total number of words that were not ignored (thus guaranteeing that our result is between 0.0 and 1.0).

Note #1:  you will probably want to use the getAllWords function you wrote above.

Note #2:  to do this problem, you should download hw7-novels.zip, and unzip it so that the folders Homework7 and hw7-novels are both in the same folder.  However:  Do not submit hw7-novels!  Only submit your code (the Python files in the Homework7 folder).

Note #3:  You can download many thousands of other novels from Project Gutenberg, and then in may different formats.  For our purposes, you should restrict yourself to UTF-8 plaintext files.  Then, rename the file in the format author-novel-pg#.txt (our code assumes the author's name precedes the first hyphen in the filename).

Note #4:  The function we've provided is not horrible -- it works correctly for the 6 novels in the hw7-novels directory, for example.  But it is not very good, either.  For bonus, you'll have a chance to improve it.  But here, just get the function working as described.


Problem 2:  sokoban (35 points)

This problem involves the game Sokoban.  First, read the top part (through Gamplay and Rules) of the Wikipedia page on Sokoban so we can agree on the rules.  Our version looks like this:
sokoban
In this game, the green dot represents the player, who can move around with the arrow keys.  The brown dots are balls which the player is trying to push to the goals, which are blue.  A gold dot represents a ball that has been pushed onto a goal.  The game has one complication:  a ball can only be pushed onto a blank cell (with a gray background).  Thus, for example, the player cannot push two balls at once.

You have been provided with a mostly-working version of the game of Sokoban.  Try it!  Run the program, play the game.  Then, get a sense of what the code is trying to do.  Then study the code to see how it is doing it.  And then...

With this description in mind, solve the problems listed in the header at the top of the file.  For part 1, this includes free-responses that should be included directly in the header (and not in some other file).

Note #1:  as written, there is no undo in the game, which can be frustrating (you'll fix this as part of the assignment).  For now, if you get stuck, just hit "r" to reset the game.

Note #2:  there are many fine Sokoban games available online, too.  Google them, and have fun!



Problem 3:  frogger (35 points)

This problem involves a simplified version of the game Frogger (we'll only implement the bottom half, and then with some further simplifications).  First, read the top part (through Gamplay and Rules) of the Wikipedia page on Frogger so we can agree on the rules.  Our version looks like this:
frogger
In this game, the green dot represents the frog, who can move around with the arrow keys.  The red and yellow dots represent the predators who like to eat frogs.  Red dots move right-to-left on even rows, and yellow dots move left-to-right on odd rows.  Both appear at the start of their rows at random intervals.  The frog starts in the middle of the last row, and tries to dodge the predators while crossing the field to the water (the blue strip in row 0 at the top), where all the yummy flies live.

You have been provided with a mostly-working version of the game of Frogger (well, the limited version as just described).  Try it!  Run the program, play the game.  Then, get a sense of what the code is trying to do.  Then study the code to see how it is doing it.  And then...

With this description in mind, solve the problems listed in the header at the top of the file.  For part 1, this includes free-responses that should be included directly in the header (and not in some other file).

Note:  as with Sokoban, there are many fine Frogger games available online, too.  Google them, and have fun!



Bonus/Optional:  


Problem 4: Bonus/Optional:  getImprovedWordFreqsMatch (4 points)

As noted in problem 1 above:  Note #4:  The function we've provided is not horrible -- it works correctly for the 6 novels in the hw7-novels directory, for example.  But it is not very good, either.  For bonus, you'll have a chance to improve it.  So now is your chance.  In the same file 1.matchNovels.py, write the function getImprovedWordFreqsMatch which uses your improved algorithm.

How can you judge how good your algorithm is?  We'll use a somewhat simplistic metric to compare our matching functions.  Check out how the function reportMatches computes the "Quality of metric".  That explains it.  Just make that number as high as possible!

To receive credit, you must not "hardcode" your answer.  Instead, you must use fairly general techniques that should work across a broad range of novels and authors.

Note:  Special extra bonus will be given to the submission that performs the best on a suite of novels that we will select somewhat at random.