15-110 Fall 2010 Homework 7
Due Sunday, October 24, at 10pm (no late submissions accepted!)
Note: to do this homework, you should download Homework7.zip,
then unzip it, edit the files, rezip, and submit that zipped file to Autolab.
Note: in addition, you should download hw7-novels.zip,
and unzip it so that the folders Homework7 and hw7-novels are both in
the same folder. You will need this folder to run hw7.1
(matchNovels). However: Do not submit hw7-novels! Only submit your code (the Python files in the Homework7 folder).
Read these instructions first!
· Do not change any file names! Each problem that includes a changed file name will be docked a 50% penalty.
· Include your name, AndrewID, and section clearly
on the top of each file in your assignment.
· Place all your solutions in a single file ZIP
file (extension.zip). Once again, start with Homework7.zip. No
other file formats will be accepted. You need to start by downloading a ZIP
file that contains a template (a file ending in .py) for each problem in this
homework.
· Show your work. Correct answers without
supporting calculations will not receive full credit.
· How to submit:
- Start at the course
web site (http://www.cs.cmu.edu/~110)
- Select the Autolab
link
- Select the Hw7 link
- Select the Submit link
- Click on the
"Choose File" button
- Find your Hw7 file
that you are submitting
- Click on the
"Upload" button
- If it worked, you will
see this text: Submission Accepted.
Your submission is not completed until you
see that text!!!
Once again: Late submissions will be rejected.
Log
We want to gauge how much time you are spending in each
homework and how you are using your resources. This week we are going to record
your answers using an online survey. An
email with the details will be sent to you.
Problem 1: matchNovels (30 points)
This problem involves matching novels to other novels written by the
same author. We will hypothesize (guess) that despite some
obvious variability, the same author will tend to use many of the same
words in all of his or her novels. If this pattern is
sufficiently strong, then we can use word frequencies as something of a
fingerprint that can help identify the author of a novel.
In this problem, we will test this hypothesis.
First,
we will download some novels (in plaintext from
Project Gutenberg)
and place them in a directory. Actually, we've already done this
for you with some sample data, so you can just unzip hw7-novels.zip to
get started. In any case, whether you are using hw7-novels.zip or
novels of your choosing, the directory should hold two novels from each
author that we use. Next, we will compare every novel in the
directory to every other novel. For each pair of novels, we will
compute a match based on the word frequencies of those two novels.
Our hopes are that novels by the same author have a high match,
and those by different authors have a low match.
We have
provided you with much of the "test harness" or "plumbing" to make this
work: code that compares each pair of novels, extracts the words
from each novel, and prints the results of all the tests. You are
not responsible for any of that code (though you may wish to peek at
it, if you're curious). Your task is to provide the missing
pieces (at the top of the file) that do the "heavy lifting" to actually
compare the text from two novels.
Specifically, you should write the following functions (and it is recommended to do so in the order given):
1.1) getWordCounts(listOfWords) This
function takes a list of (all lowercase) words and non-destructively
returns a dictionary mapping each word in the list to a count of the
number of times that word occurs in the list. As with all the
functions described here, refer to the assertions in the provided test
functions for more details.
1.2) getWordFreqs(wordCounts)This
function takes a dictionary of wordCounts (the return value from
getWordCounts, in fact), and non-destructively returns a new dictionary
that maps words to their
normalized frequencies
rather than their counts. To obtain frequencies, one
just divides each count by the sum of all the counts, so that the
sum of all the frequencies would equal 1. However, we want to
normalize the frequencies so that the maximum frequency is 1. To do so, rather than divide by the total count, we divide by
the largest count. For example, consider this dictionary of counts:
dict([("dog", 3), ("cat", 2), ("gnu", 5)])We
see that "gnu" has the largest count (5). To compute the
normalized frequencies, we divide each count by 5. So "dog" will
be mapped to 0.6, "cat" to 0.4, and "gnu" to 1.0. Note that this
is a somewhat simplistic normalization, but good enough for our
purposes.
1.3) getAllWords(wordFreqs1, wordFreqs2)This
function takes two dictionaries of word frequencies (both returned from
getWordFreqs), and non-destructively returns a new list containing all
the words that occur as keys in either or both of wordFreqs1 or
wordFreqs2. If a word occurs in both dictionaries, however, it
should still be included only once in the resulting list. The
resulting list may be in any order.
1.4) getWordFreqsMatch(wordFreqs1, wordFreqs2)This
function takes two dictionaries of word frequencies (both returned from
getWordFreqs) and non-destructively computes and returns the match,
which is a score from 0.0 (worst) to 1.0 (best) indicating (we hope)
the likelihood that the two novels with these respective word
frequencies are written by the same author. For this problem, you
must use this (rather simplistic) matching algorithm: for every
word in either novel, if
the normalized frequency of the word (as provided in the parameters) in
both
novels is less than 0.01, we will ignore the word. But if the
frequency of the word is at least 0.01 in at least one novel,
then increase our score by this amount:
1 - | freq1 - freq2 |
0.25where
the bars signify absolute-value. After doing this for every word
in either novel, divide the final score by the total number of words
that were not ignored (thus guaranteeing that our result is between 0.0
and 1.0).
Note #1: you will probably want to use the getAllWords function you wrote above.
Note #2: to do this problem, you should download
hw7-novels.zip, and unzip it so that the folders Homework7 and hw7-novels are both in the same folder. However:
Do not submit hw7-novels! Only submit your code (the Python files in the Homework7 folder).
Note #3: You can download many thousands of other novels from
Project Gutenberg,
and then in may different formats. For our purposes, you should
restrict yourself to UTF-8 plaintext files. Then, rename the file
in the format author-novel-pg#.txt (our code assumes the author's name
precedes the first hyphen in the filename).
Note #4: The
function we've provided is not horrible -- it works correctly for the 6
novels in the hw7-novels directory, for example. But it is not
very good, either. For bonus, you'll have a chance to improve it.
But here, just get the function working as described.
Problem 2: sokoban (35 points)
This problem involves the game Sokoban. First, read the top part (through Gamplay and Rules) of
the Wikipedia page on Sokoban
so we can agree on the rules. Our version looks like this:
In
this game, the green dot represents the player, who can move around
with the arrow keys. The brown dots are balls which the player is
trying to push to the goals, which are blue. A gold dot
represents a ball that has been pushed onto a goal. The game has
one complication: a ball can only be pushed onto a blank cell
(with a gray background). Thus, for example, the player cannot
push two balls at once.
You have been provided with a
mostly-working version of the game of Sokoban. Try it! Run
the program, play the game. Then, get a sense of what the code is
trying to do. Then study the code to see
how it is doing it. And then...
With
this description in mind, solve the problems listed in the header at
the top of the file. For part 1, this includes free-responses
that should be included directly in the header (and not in some other
file).
Note #1: as written, there is no undo in the game,
which can be frustrating (you'll fix this as part of the assignment).
For now, if you get stuck, just hit "r" to reset the game.
Note #2: there are many fine Sokoban games available online, too. Google them, and have fun!
Problem 3: frogger (35 points)
This problem involves a simplified version of the game Frogger (we'll
only implement the bottom half, and then with some further
simplifications). First, read the top part (through Gamplay and
Rules) of
the Wikipedia page on Frogger
so we can agree on the rules. Our version looks like this:
In
this game, the green dot represents the frog, who can move around
with the arrow keys. The red and yellow dots represent the
predators who like to eat frogs. Red dots move right-to-left on
even rows, and yellow dots move left-to-right on odd rows. Both
appear at the start of their rows at random intervals. The frog
starts in the middle of the last row, and tries to dodge the predators
while crossing the field to the water (the blue strip in row 0 at the
top), where all the yummy flies live.
You
have been provided with a mostly-working version of the game of Frogger
(well, the limited version as just described). Try it! Run
the program, play the game. Then, get a sense
of what the code is trying to do. Then study the code to see
how it is doing it. And then...
With
this description in mind, solve the problems listed in the header at
the top of the file. For part 1, this includes free-responses that
should be included directly in the header (and not in some other file).
Note: as with Sokoban, there are many fine Frogger games available online, too. Google them, and have fun!
Bonus/Optional:
Problem 4: Bonus/Optional: getImprovedWordFreqsMatch (4 points)
As
noted in problem 1 above: Note #4: The function we've
provided is not horrible -- it works
correctly for the 6 novels in the hw7-novels directory, for example.
But it is not very good, either. For bonus, you'll have a
chance to
improve it. So now is your chance. In the same file
1.matchNovels.py, write the function getImprovedWordFreqsMatch which
uses your improved algorithm.
How
can you judge how good your algorithm is? We'll use a somewhat
simplistic metric to compare our matching functions. Check out
how the function reportMatches computes the "Quality of metric".
That explains it. Just make that number as high as possible!
To
receive credit, you must not "hardcode" your answer. Instead,
you must use fairly general techniques that should work across a broad
range of novels and authors.
Note: Special extra bonus will be given to the submission that performs the best on a suite of novels that we will select somewhat at random.