15-112 Spring 2016 Homework 5
Due Sunday, 14-Feb, at 10pm

Read these instructions first!
  1. Better Big Oh [20 pts] [manually graded]
    In a triple-quoted string at the top of your file (just below your name), include solutions to this exercise. For each of the following functions:
    1. State in just a few words what it does in general.
    2. State and prove (in a few words) its worst-case big-oh runtime.
    3. Provide an equivalent Python function that is more than a constant-factor faster (so it's worst-case big-oh runtime is in a different function family). The better your solution's big-oh runtime, the more points you get!
    4. State and prove (in a few words) your better solution's worst-case big-oh runtime.
    You should assume that lists all contain at least 5 values, and only integer values. Also, if a function takes two lists, you should assume they are the same length N.
    import copy
    
    def slow1(a):
        (b, c) = (copy.copy(a), 0)
        while (b != [ ]):
            b.pop()
            c += 1
        return c
    
    def slow2(a):
        n = len(a)
        count = 0
        for i in range(n):
            for j in range(n):
                if (a[i] == a[j]):
                    count += 1
        return (count == n)
    
    def slow3(a, b):
        # assume a and b are the same length n
        n = len(a)
        assert(n == len(b))
        result = 0
        for c in b:
            if c not in a:
                result += 1
        return result 
    
    def slow4(a, b):
        # assume a and b are the same length n
        n = len(a)
        assert(n == len(b))
        result = abs(a[0] - b[0])
        for c in a:
            for d in b:
                delta = abs(c - d)
                if (delta > result):
                    result = delta
        return result
    
    def slow5(a, b):
        # Hint: this is a tricky one!  Even though it looks syntactically
        # almost identical to the previous problem, in fact the solution
        # is very different and more complicated.
        # You'll want to sort one of the lists,
        # and then use binary search over that sorted list (for each value in
        # the other list).  In fact, you should use bisect.bisect for this
        # (you can read about this function in the online Python documentation).
        # The bisect function returns the index i at which you would insert the
        # value to keep the list sorted (with a couple edge cases to consider, such
        # as if the value is less than or greater than all the values in the list, 
        # or if the value actually occurs in the list).
        # The rest is left to you...
        #
        # assume a and b are the same length n
        n = len(a)
        assert(n == len(b))
        result = abs(a[0] - b[0])
        for c in a:
            for d in b:
                delta = abs(c - d)
                if (delta < result):
                    result = delta
        return result
    

  2. invertDictionary(d) [20 pts] [autograded]
    Write the function invertDictionary(d) that takes a dictionary d that maps keys to values and returns a dictionary of its inverse, that maps the original values back to their keys. One complication: there can be duplicate values in the original dictionary. That is, there can be keys k1 and k2 such that (d[k1] == v) and (d[k2] == v) for the same value v. For this reason, we will in fact map values back to the set of keys that originally mapped to them. So, for example:
    assert(invertDictionary({1:2, 2:3, 3:4, 5:3}) == 
           {2:set([1]), 3:set([2,5]), 4:set([3])})
    
    Also, you may assume that the values in the original dictionary are all immutable, so that they are legal keys in the resulting inverted dictionary.

  3. friendsOfFriends(d) [20 pts] [autograded]
    Background: we can create a dictionary mapping people to sets of their friends. For example, we might say:
        d["fred"] = set(["wilma", "betty", "barney", "bam-bam"])
        d["wilma"] = set(["fred", "betty", "dino"])
    
    With this in mind, write the function friendsOfFriends(d) that takes such a dictionary mapping people to sets of friends and returns a new dictionary mapping all the same people to sets of their friends of friends. For example, since wilma is a friend of fred, and dino is a friend of wilma, dino is a friend-of-friend of fred. This set should exclude any direct friends, so even though betty is also a friend of wilma, betty does not count as a friend-of-friend of fred (since she is simply a friend of fred). Thus, in this example, if fof = friendsOfFriends(d), then fof["fred"] is a set containing just "dino" and fof["wilma"] is a set containing both "barney" and "bam-bam". Also, do not include anyone either in their own set of friends or their own set of friends-of-friends.

    Note: you may assume that everyone listed in any of the friend sets also is included as a key in the dictionary. Also, do not worry about friends-of-friends-of-friends.

  4. largestSumOfPairs(a) [10 pts] [autograded]
    Write the function largestSumOfPairs(a) that takes a list of integers, and returns the largest sum of any two elements in that list, or None if the list is of size 1 or smaller. So, largestSumOfPairs([8,4,2,8]) returns the largest of (8+4), (8+2), (8+8), (4+2), (4+8), and (2+8), or 16.

    The naive solution is to try every possible pair of numbers in the list. This runs in O(n**2) time and is much too slow.

  5. containsPythagoreanTriple(a) [10 pts] [autograded]
    Write the function containsPythagoreanTriple(a) that takes a list of positive integers and returns True if there are 3 values (a,b,c) anywhere in the list such that (a,b,c) form a Pythagorean Triple (where a**2 + b**2 == c**2). So [1,3,6,2,5,1,4] returns True because of (3,4,5).

    A naive solution would be to check every possible triple (a,b,c) in the list. That runs in O(n**3). You'll have to do better than that.

  6. mergeSortWithOneAuxList(a) [10 pts] [manually graded]
    Write the function mergeSortWithOneAuxList(a) that works just like mergeSort from the notes, only here you can only create a single aux list just one time, rather than once for each call to merge. To do this, you will need to create the aux list in the outer function (mergeSortWithOneAuxList) and then pass it as a parameter into the merge function. The rest is left to you. In a comment at the top of this function, include some timing measurements comparing this function to the original mergeSort, and a brief reflection on whether or not this change was worthwhile.

  7. 3-Way Mergesort [10 pts] [manually graded]
    First, write the function threeWayMergesort(L) that takes a list L and does a modified version of mergesort on L, where instead of combining two sorted sublists at a time, you should combine three sorted sublists at a time (so after one pass, you have sublists of size 3, and after two passes, sublists of size 9, and so on). You may not assume that the size of the list is a power of 3.

    Next, in a triple-quoted string just below your threeWayMergesort function, write a short proof that the function runs in O(nlogn) -- that is, in the same big-oh worst-case runtime as normal two-way mergesort. Then, write the function threeWayMergesortTiming() that empirically demonstrates that threeWayMergesort runs in the same big-oh as normal (two-way) mergesort (no more than a constant time faster or slower, even as n grows large). So all that extra work to write threeWayMergesort was for naught. Sigh. When you are done with this function, run it, get the timing data that proves your point, and include it with a very brief explanation at the end of that triple-quoted string just below threeWayMergesort.

  8. Bonus/Optional: linearRegression(a) [2.5 pts] [autograded]
    Write the function linearRegression(a) that takes a list of (x,y) points and finds the line of best fit through those points. Specifically, your function should return a triple of floats (a,b,r) such that y = ax+b is the line of best fit through the given points and r is the correlation coefficient as explained here (and yes, you must follow this exact approach). For example (taken from the text), linearRegression([(1,3), (2,5), (4,8)]) should return a triple of 3 values approximately equal to (1.6429, 1.5, 0.9972), indicating that the line of best fit is y = 1.6429x + 1.5, with a correlation coefficient of 0.9972 (so it's a really good fit, too). Note that the result is approximately equal to these values. Also note: the notes we refer you to do not discuss the meaning of the sign of r, so just return the absolute value |r|. And: you may ignore the case of a vertical line in your linearRegression code.

  9. Bonus/Optional: bogusDataFinder(csvData) [2.5 pts] [manually graded]
  10. Background: for this problem, we will consider the data in this file, which is a modified data file obtained from http://baseballguru.com/MLB2014.xls. Be sure to use our version of the file. This file contains real comma-separated batting data for 2014 Major League Baseball. Cool. We have deleted most of the lesser-known statistics. The columns that are included are all straightforward, so for those of you who do not know much about baseball, you could ask most anyone who does and they could easily explain their meaning. In fact, however, you can do this problem without ever knowing their meaning, so that's not required.

    The problem: we are going to take this file, delete some lines, and then shuffle the remaining lines (except the first line, with the column names, which will remain intact at the top). So far, no problem. But then we will deliberately change some data values. Your task is to identify which values we changed.

    Of course, there is a way to make this problem trivial: just copy the entire table into your code. Then you can just check each value against its original, and be certain to find the changes. And, of course, we won't allow this!

    More specifically: while you may include summary statistics derived from this data, you may not include any numbers directly from the data. Also, you may not include any lists of data in your code (or encoded variants, say in strings) with more than 100 total values. Finally, you may not have more than 300 lines total in your solution, including any such data. The point is: don't just record the data and uncleverly compare the original to the changed copy, but instead cleverly analyze the data to detect the changes.

    With this in mind, write the function bogusDataFinder(csvData) that takes a multiline string that contains the csv data as described above, and returns a list of tuples indicating the changed items as the tuples (playerID, columnTitle). The list should be sorted alphabetically first by playerID and then by columnTitle.

    So you know: At first, we'll make very large changes, to ease in their detection. But then the changes will get increasingly smaller, until nobody can detect them.

    Also: if we see especially remarkable solutions here, we may award some additional bonus as appropriate. We'll see. But don't do this for the points, do it for the joy of learning. Truly.

    Here is an example. Note that Bobby Abreu actually had 12 runs (which we edited to 120), and Jose Abreu actually had 556 atBats (which we edited to 5560). We did not make any edits to Zoilo Almonte.
    bogusData = """\
    playerID,nameFirst,nameLast,teamID,games,atBats,runs,hits,doubles,triples,homeRuns,runsBattedIn,stolenBases,caughtStealing,Walks,StrikeOuts,battingAverage
    abreujo02,Jose,Abreu,CHA,145,5560,80,176,35,2,36,107,3,1,51,131,0.317
    almonzo01,Zoilo,Almonte,NYA,13,36,2,5,0,0,1,3,1,0,0,14,0.139
    abreubo01,Bobby,Abreu,NYN,78,133,120,33,9,0,1,14,1,0,20,21,0.248"""
    
    assert(bogusDataFinder(bogusData) == [("abreubo01", "runs"),
                                          ("abreujo02, "atBats")
                                         ]
          )
    
    Also note that you may use 2d lists for this problem (and only this problem) if you wish, though that is not required.

    Have fun!