15-110 Spring 2011 Homework 4
Due Sunday, 20-Feb, at 10pm

Same basic directions as hw3, so read those carefully (replacing "hw3" with "hw4", where appropriate, of course).

• Start by creating a folder named hw4.  Place hw4-solo.py and hw4-collab.py in that folder.  You will submit hw4.zip as usual, which will be a zipped version of the entire hw4 folder.
• This week's hw has a special emphasis on applications in the humanities and business.  Future hw's will focus on other disciplines.  Our aim is to provide at least two weeks of special emphasis for each college at CMU.
• Reminder:  portions marked SOLO must be solved entirely on your own (you may discuss with course staff, but not with each other or any outside sources).  See syllabus for details.
• There are no restrictions on the Python constructs you may now use (conditionals, loops, lists, etc.).

1. SOLO Problem-Solving [35 pts]
1. SOLO median [10 pts]
Background:  Many computing tasks in the humanities require basic statistical analyses.  Here we will compute one of the most basic statistical measures:  the median.

Your task:  write the function median (in the file hw4-solo.py).  This function takes a non-empty list of integers and returns the median of the list, also as an integer, without modifying the list (though you may modify a copy of the list).  The median of a sorted list of odd length is the middle element.  So the median of the list [3, 6, 20] is 6.  On the other hand, the median of a sorted list of even length is the average of the middle two elements.  So the median of the list [3, 6, 20, 21] is the average of 6 and 20, or 13.   Note that the median of the list [2,3] is not 2.5, but instead is 2, because we are using integer division.  Here are some test cases for you:

assert(median([2]) == 2)
assert(median([2,3]) == 2)
assert(median([2,3,42]) == 3)
assert(median([2,3,42,99]) == 22)
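
For reference, here is a minimal sketch of one possible approach (an illustration, not the only correct solution):

def median(a):
    values = sorted(a)       # sort a copy, so the original list is unmodified
    mid = len(values) // 2   # // is integer (floor) division
    if len(values) % 2 == 1:
        return values[mid]   # odd length:  the middle element
    else:
        # even length:  integer average of the middle two elements
        return (values[mid - 1] + values[mid]) // 2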

2. SOLO Automated Readability Index [25 pts]
Background:  there are various algorithms for estimating the approximate grade level of reading material.  The "Automated Readability Index", or ARI, is one such algorithm, specifically designed to be relatively easy to compute (see the Wikipedia page for details).  It takes a string of text and returns, as a float, the corresponding grade level of the text, where 13 is college-freshman-level.  Specifically, the ARI is computed as follows:

ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

Note #1:  characters only include letters or digits (ignoring spaces, punctuation, and other symbols).
Note #2:  when computing words, ignore all non-alphanumerics (that is, non-letters and non-digits) except spaces, question marks (?), exclamation points (!), and periods (.).  Then, a "word" is a block of adjacent letters or digits.
Note #3:  you can assume that sentences are always terminated by a question mark, exclamation point, or period.  However, you should treat a run of several duplicate sentence terminators as just one.  So, "This is one sentence!!!!!" is just 1 sentence.

Your task:  write the function automatedReadabilityIndex (in the file hw4-solo.py).  This function takes a string of text and returns the Automated Readability Index as described above.

To encourage you to use top-down design in this task, you must also write the following 3 helper functions:

charCount:  This function takes a string and returns the number of letters or digits in the string.

wordCount:  This function takes a string and returns the number of words in the string, as defined above.

sentenceCount:  This function takes a string and returns the number of sentences in the string, as defined above.

And to help you write these helper functions, you must also write the following helper function:

isLetterOrDigit:  This function takes a single character and returns True if it is an uppercase or lowercase letter or a digit, and False otherwise.

Hopefully, you will see that by writing these helper functions, the problem becomes easier.  In any case, you will find test cases for these functions in the starter code we provided, though you are welcome to add more test cases, as always.
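
To make this decomposition concrete, here is a minimal sketch of how the pieces might fit together, using the formula above.  The function names are the required ones, but the bodies are just one possible illustration (and the main function assumes the text contains at least one word and one sentence):

def isLetterOrDigit(c):
    # True exactly for a-z, A-Z, and 0-9
    return (("a" <= c <= "z") or ("A" <= c <= "Z") or ("0" <= c <= "9"))

def charCount(text):
    # count only letters and digits (Note #1)
    count = 0
    for c in text:
        if isLetterOrDigit(c):
            count += 1
    return count

def wordCount(text):
    # a word is a maximal block of adjacent letters/digits; non-alphanumerics
    # other than spaces and sentence terminators are ignored entirely (Note #2)
    count = 0
    inWord = False
    for c in text:
        if isLetterOrDigit(c):
            if not inWord:
                count += 1   # we just entered a new word
            inWord = True
        elif c in " ?!.":
            inWord = False
        # any other character neither starts nor ends a word
    return count

def sentenceCount(text):
    # a run of sentence terminators counts as just one sentence (Note #3)
    count = 0
    inRun = False
    for c in text:
        if c in "?!.":
            if not inRun:
                count += 1
            inRun = True
        else:
            inRun = False
    return count

def automatedReadabilityIndex(text):
    chars = charCount(text)
    words = wordCount(text)
    sentences = sentenceCount(text)
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43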
2. Collaborative Problem-Solving (with Conditionals and Loops)  [45 pts]
1. Top Unigrams  [20 pts]
Background:  first, skip ahead and do the reading on the Google ngram viewer (see the Readings section below).  That tool counts how often various words or word phrases occur in (a huge amount of) text.  Here, we will do something similar, though we will restrict ourselves to unigrams (that is, single words rather than word phrases).  Also, partly to help keep this homework manageable, but also to help you develop your skills in reasoning over code that you did not write, we have already solved the main part of this exercise (the topUnigrams function, described below).  You just need to write the helper functions.

The main function for this exercise is topUnigrams.  It takes an unsorted list of words (which it may not modify) and a non-negative integer n, and returns a list of the top n unigrams (or all of the unigrams, if there are fewer than n distinct words).  Each unigram in the result should be a list of two elements -- the word and its count (the number of times that word occurs in the list).  The unigrams should be sorted in decreasing order of count, with ties sorted by the word (using Python's default string comparison, so "Z" < "a").  To help you understand the problem, here are some test cases:

wordList = ["dog", "cow", "cat", "cow", "cat", "cat", "dog", "dog", "dog", "cow"]
assert(topUnigrams(wordList, 0) == [])
assert(topUnigrams(wordList, 1) == [["dog", 4]])
assert(topUnigrams(wordList, 2) == [["dog", 4], ["cat", 3]])
assert(topUnigrams(wordList, 3) == [["dog", 4], ["cat", 3], ["cow", 3]])
assert(topUnigrams(["ack"], 2) == [["ack", 1]])

Once again, we have already written topUnigrams for you!  So your first step is to understand our code.  Note that there are somewhat more efficient ways to solve this problem, but our approach is reasonably efficient and uses, in a straightforward manner, only techniques we have already covered.  Once you understand this function, you need to write its helper functions, as follows:

unigramOccursBefore [5 pts]:  This function takes two unigrams (that is, lists of length two where the first value is a word and the second value is a count of how many times that word occurs in some text) and returns True if the first unigram should occur before the second unigram in the resulting list from topUnigrams.  This is true if the first unigram's count is higher, or if the two have the same count and the first unigram's word is less than the second unigram's word according to Python's default string comparison (so "Z" < "a").  Here are some test cases for you (study them carefully to be sure you understand the problem!):

assert(unigramOccursBefore(["dog",5],["cat",8]) == False)
assert(unigramOccursBefore(["cat",8],["dog",5]) == True)
assert(unigramOccursBefore(["cat",8],["cat",8]) == False)
assert(unigramOccursBefore(["cat",8],["cat",4]) == True)
assert(unigramOccursBefore(["cat",4],["cat",8]) == False)

insertUnigram  [15 pts]:  This function takes three parameters -- a sorted list of unigrams, a new unigram, and a number n (the maximum number of unigrams) -- and destructively inserts the new unigram into the list as appropriate (it returns nothing, since it is destructive).  The function proceeds in two phases.  In the first phase, it adds the new unigram to the list, and there are two cases to consider.  If the list has fewer than n unigrams, just add the new unigram to the end of the list.  However, if the list already has n unigrams, then replace the last unigram in the list with the new unigram.  Now the new unigram is in the list, but not necessarily in the correct location (so the list may be unsorted).  Thus, in the second phase, the function sorts the unigram list.  This cannot be done by calling list.sort()!  Instead, do the following:  since the entire list except the last element is already sorted, just keep swapping the new unigram with the unigram to its left until it is in sorted order.  This can be done in one pass with a straightforward loop.
Here are some test cases for you (and again, study them carefully to be sure you understand the problem!):

unigrams = []
insertUnigram(unigrams, ["cow", 5], 3)
assert(unigrams == [["cow", 5]])
insertUnigram(unigrams, ["dog", 3], 3)
assert(unigrams == [["cow", 5], ["dog", 3]])
insertUnigram(unigrams, ["cat", 3], 3)
assert(unigrams == [["cow", 5], ["cat", 3], ["dog", 3]])
insertUnigram(unigrams, ["yak", 9], 3)
assert(unigrams == [["yak", 9], ["cow", 5], ["cat", 3]])
2. Basic Economic Data Analysis  [25 pts]
Background:  many problems in economics involve analyzing tables of data.  Later in the course, we may do some more complex analyses or data visualizations, but here we will limit ourselves to some basic data manipulation and analysis.  As for the data, we will consider this table of wages by industry and by year.  Here is an excerpt of the first several rows of the table:

Line                                                        1998       1999       2000       2001       2002       2003       2004       2005       2006       2007       2008       2009
   1  Wage and salary accruals                         4,180,916  4,465,176  4,827,698  4,952,202  4,997,306  5,154,598  5,410,691  5,705,982  6,070,143  6,415,473  6,554,044  6,279,120
   2    Domestic industries                            4,185,479  4,470,389  4,832,388  4,957,417  5,002,880  5,160,302  5,416,839  5,712,389  6,076,756  6,422,567  6,561,363  6,286,930
   3      Private industries                           3,484,241  3,736,700  4,052,677  4,135,533  4,129,767  4,246,977  4,464,031  4,720,900  5,041,605  5,333,540  5,417,383  5,113,359
   4        Agriculture, forestry, fishing, and hunting   24,232     24,820     25,910     26,622     26,796     26,754     28,639     29,659     32,609     34,802     35,874     36,148
   5          Farms                                        15,883     16,169     16,974     17,926     17,919     17,586     19,109     19,597     20,005     22,155     22,930     23,840
   6          Forestry, fishing, and related activities    8,349      8,651      8,937      8,697      8,876      9,168      9,530     10,062     12,604     12,647     12,944     12,308
   7        Mining                                         29,233     28,583     29,602     32,010     30,718     31,260     34,856     40,228     48,049     53,800     62,315     55,006
   8          Oil and gas extraction                       10,151     10,115     10,850     11,502     11,306     11,505     13,078     14,548     17,407     19,046     23,128     21,701
   9          Mining, except oil and gas                   11,091     10,961     10,583     10,779     10,397     10,248     10,964     11,974     13,048     13,461     14,272     13,319
  10          Support activities for mining                 7,992      7,507      8,169      9,729      9,015      9,507     10,814     13,707     17,594     21,293     24,915     19,986

As you can see, the rows represent different industries, the columns represent years, and the numeric values in the table are the wages and salaries (in millions of dollars) for each industry in each year.  You may notice that different industries are indented differently.  This is because the industry list is hierarchical:  each summary row represents an industry group whose value is the sum of the values of the rows indented below it.  We will say that a row contains a "single industry" if it is not a summary row (that is, if the following row is not indented by more than the row in question).

The first step in using a data source such as this is to "condition" the data -- to convert it to an easily usable form.  In this case, we have already done this for you.  The file hw4-collab.py contains the function getWageData, which will return the table above, but cleaned up for use in Python (and with the last 3 rows commented out, as they deal with non-domestic data, and for this problem we are limiting our analysis to domestic data).

Then, write the following two functions:

makeSubtable  [15 pts]:  To support interesting queries over subsets of the data, write the function makeSubtable.  This function takes four values -- a table formatted like the one above, the name of an industry group, a start year, and an end year -- and returns a new table, also formatted like the one above, but containing only the single industries in the given industry group, and containing only the years in the given range.  You may assume the industry group is in the table, as are both the start and end years.
Here are some test cases for you (and again, study them carefully to be sure you understand the problem!):

table = getWageData()
subtable = [ ["line","",  1999,  2000,  2001,  2002],
[8, "Oil and gas extraction", 10115, 10850, 11502, 11306],
[9, "Mining except oil and gas", 10961, 10583, 10779, 10397],
[10, "Support activities for mining",  7507,  8169,  9729,  9015] ]
assert(makeSubtable(table, "Mining", 1999, 2002) == subtable)
subtable = [["line", "", 1998, 1999],
[89, "Civilian", 87876, 90276],
[90, "Military", 54048, 55524],
[91, "Government enterprises", 36948, 37716],
[94, "Education", 257912, 272332],
[95, "Other", 229376, 241003],
[96, "Government enterprises", 35078, 36838]]
assert(makeSubtable(table, "Government", 1998, 1999) == subtable)

Hint:  you should use top-down design with some well-chosen helper functions.  For example, in our sample solution, we found it helpful to write a function that took a string and returned the number of leading spaces in it.  We also wrote a function that took a table and the name of an industry group and returned the index of the row of that industry group in that table.  And we found it handy to write a function that indicated whether or not a given row contained a single industry or an industry group.  We also wrote a few other helper functions, but you get the idea.  In any case, do not place all your code in one monolithic function!

Another hint:  if s is a string, then s.strip() will return a new string with the leading and trailing spaces removed.  Handy!
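
For example, the leading-spaces helper mentioned in the first hint might be as simple as this sketch (the name leadingSpaceCount is just a suggestion):

def leadingSpaceCount(s):
    # count the spaces before the first non-space character
    count = 0
    for c in s:
        if c != " ":
            break
        count += 1
    return count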

highestPaidSingleIndustry  [10 pts]:  Write the function highestPaidSingleIndustry, which takes a table and finds the largest salary in any year for any single industry in the table, and returns a list of length 2 containing that single industry's name and the year.
Here are some test cases for you (and once more, study them carefully to be sure you understand the problem!):

wageData = getWageData()
assert(highestPaidSingleIndustry(wageData) == ["Education", 2009])
subtable = [ ["line","",  1999,  2000,  2001,  2002],
[8, "Oil and gas extraction", 10115, 10850, 11502, 11306],
[9, "Mining except oil and gas", 10961, 10583, 10779, 10397],
[10, "Support activities for mining",  7507,  8169,  9729,  9015] ]
assert(highestPaidSingleIndustry(subtable) == ["Oil and gas extraction", 2001])
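
As a rough shape for the search, here is a sketch.  It assumes you have written an isSingleIndustry(table, row) helper along the lines suggested in the hints above -- that helper name, and the strip() on the industry name, are our assumptions, not requirements:

def highestPaidSingleIndustry(table):
    best = None  # will hold [industryName, year, salary] of the best cell so far
    for row in range(1, len(table)):             # row 0 is the header
        if isSingleIndustry(table, row):         # assumed helper (see hints above)
            for col in range(2, len(table[0])):  # cols 0 and 1 hold line and name
                salary = table[row][col]
                if (best is None) or (salary > best[2]):
                    # strip() removes the hierarchy indentation from the name
                    best = [table[row][1].strip(), table[0][col], salary]
    return [best[0], best[1]]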
3. Readings in Computational Thinking  [20 pts]
1. Google ngram Viewer  [10 pts]
Read this Wall Street Journal article on Google's ngram viewer.  Then, play around for a while with the actual Google ngram viewer.  Then, write a short paragraph explaining what the Google ngram viewer is and how it works, described so that a lay person could understand and appreciate it.  Then, using the tool, find some trends that you find interesting, and include a short paragraph explaining the trends you found and why you find them interesting.  We expect everyone to submit unique trends that they discovered on their own.
2. Computing, Digital Media and the Egyptian Revolution [10 pts]
Write a short paragraph explaining the role of computing, and especially digital media, in the recent (and even ongoing) Egyptian Revolution (or, to be precise, the events leading to the resignation of President Mubarak).  Find, read, and cite at least two relevant and generally respected sources to support this paragraph.  Next, write another short paragraph speculating on how computing trends might influence politics and governance across the world in the next 20 years.  We are looking for sound critical thinking and clear explications here.
4. Bonus/Optional:  justifyText   [3 pts]
Write the function justifyText (in the file hw4-collab.py).  This function takes two parameters -- a string and a positive integer lineWidth -- and returns a new string, with embedded newlines, such that each line except perhaps the last line is precisely lineWidth characters wide.  Do not hyphenate words or split words on hyphens.  Instead, include as many entire words as can fit on each line, and then include extra spaces between words (spread out evenly across the line) so the line exactly fits the lineWidth.  Do not add extra spaces to the last line, however.  Do not worry about strange cases, such as words that are longer than the lineWidth.  Make your own test function, too.
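
One plausible greedy approach:  fill each line with as many whole words as fit, then spread the leftover spaces across the gaps.  Here is a sketch (padLine is a hypothetical helper name, and strange cases are ignored, as allowed above):

def justifyText(text, lineWidth):
    words = text.split()
    lines = []
    line = []    # the words placed on the current line so far
    length = 0   # total letters on the current line (not counting spaces)
    for word in words:
        # after adding this word, the line needs at least len(line) spaces
        if (line != []) and (length + len(line) + len(word) > lineWidth):
            lines.append(padLine(line, length, lineWidth))
            line = []
            length = 0
        line.append(word)
        length += len(word)
    lines.append(" ".join(line))  # last line:  single spaces, no extra padding
    return "\n".join(lines)

def padLine(words, length, lineWidth):
    # spread (lineWidth - length) spaces as evenly as possible across the gaps
    gaps = len(words) - 1
    if gaps == 0:
        return words[0]  # a single word cannot be spread out
    spaces = lineWidth - length
    result = words[0]
    for i in range(gaps):
        gapSize = spaces // gaps
        if i < spaces % gaps:
            gapSize += 1  # earlier gaps absorb the leftover spaces
        result += (" " * gapSize) + words[i + 1]
    return result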
5. Bonus/Optional:  Psychology Experiment:  Recalling concrete, abstract, and nonsense words   [3 pts]
Here is a page of some interesting psychology experiments involving memory:
http://faculty.washington.edu/chudler/chmemory.html
If you scroll down far enough, you'll find a section titled "Concrete Words, Abstract Words and Just Plain Nonsense".  Read that section, then design a simple experiment using a Python program to attempt to confirm the hypothesis that it is easier to recall concrete words than abstract words, and easier to recall abstract words than nonsense words.  Then run the experiment.  Then, in your hw4.doc file, include a paragraph or two explaining your approach, including the raw data of your experiment, and then explaining whether the experiment confirmed or rejected the hypothesis.