15-110: Principles of Computing - Week 11

Introduction to Data Analysis: Clustering

Definitions and vocabulary

A dataset is a collection of data points or instances. An instance is composed of one or more attributes or features. An attribute/feature can be numeric or nominal (e.g., male/female, low/medium/high, section, color, etc.).

Motivating problem: The T-shirt factory

You own a clothing factory. You know how to make a T-shirt given the height and weight of a customer. You want to standardize production on three sizes: small, medium, and large. How would you figure out the actual measurements of these three shirt sizes so that they best fit your customers?

Activity

  1. Data collection. We collected measurements of students' height and weight. We recorded the data in an OpenOffice or Excel spreadsheet with the columns Name, Gender, Height, Weight. Along the way, we took care of several issues:
  2. Data formatting and cleaning. We carefully checked for typing errors, and we eliminated missing values (blank rows). Reflection: what kind of issues may arise in a less controlled environment, such as a survey on the web? We chose which features (attributes) we needed for our analysis: we kept "name" as the instance identifier, and "height" and "weight" as the numeric features we will cluster our data on. We got rid of "gender" (we will make unisex T-shirts), but we may want to repeat the analysis later, splitting the dataset by gender and observing the differences. We saved the properly formatted dataset (name, height, weight) in plain-text CSV (comma-separated values) format, suitable for importing into Python.
  3. Data visualization. We imported the CSV file into Python using the code linked at the bottom of this page. We visualized the data with a scatterplot. We visually noticed a strong positive relationship between the height and weight of people. We also noticed the presence of outliers. Outliers are values that are very distant from the majority of the other data points. They can pose a problem for our analysis, because they can skew the results, and we may end up making T-shirts that are too small or too large for the rest of the population. Also, outliers make our data more difficult to visualize, because they "dominate" the scale of our scatterplot. We went back to the data file and removed the outliers.
  4. Clustering. Using Python, we ran an animation of the k-means clustering algorithm on our data, with k=3. We ran it several times, and noticed how cluster assignments might change on different runs of the algorithm. This can happen because k-means finds only a locally optimal solution, not necessarily a globally optimal one. To alleviate this problem, we can run the algorithm many times (e.g., 100 times) and keep the best solution we find.
  5. Comparing two solutions. To compare two clustering solutions and choose the best one, we can use a metric called withinss (within-cluster sum of squares). This number is the sum of the squared distances between each data point and the centroid of its cluster. The smaller the withinss, the more "compact" each cluster is. For the same dataset, having clusters with smaller withinss is generally better.
  6. Discussion questions. How do we interpret the results? Do the extracted clusters make sense? What are the suggested sizes for the T-shirts? What happens if we look for a different number of clusters, say k=2 or k=4? What happens if we use a different/larger/smaller dataset?
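The withinss metric described in step 5 can be sketched in a few lines of Python. The function name, and the representation of points and centroids as (x, y) tuples, are illustrative assumptions, not part of the course's kmeans.py code base:

```python
def withinss(points, assignments, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for (x, y), cluster in zip(points, assignments):
        cx, cy = centroids[cluster]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# Two hypothetical clusterings of the same four (made-up) points:
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9)]
good = withinss(points, [0, 0, 1, 1], [(1.1, 0.9), (5.05, 4.95)])
bad = withinss(points, [0, 1, 0, 1], [(3.0, 3.0), (3.15, 2.85)])
# The clustering that groups nearby points together has the smaller withinss.
```

Running k-means many times and keeping the solution with the smallest withinss is exactly the "run it 100 times and keep the best" strategy from step 4.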

The k-means algorithm

K-means is an iterative algorithm that repeatedly assigns data points to clusters, each identified by a special point called a centroid, until the cluster assignments stabilize.

  1. Initialization: choose k initial centroids arbitrarily (or randomly).
  2. Assign each data point to the centroid that is nearest to it.
  3. Update all the centroids. The new centroid of a cluster is the mean of all the points assigned to that cluster.
  4. Repeat steps 2 and 3 until the new centroids are the same as the previous ones.
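The four steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the course's kmeans.py: the helper names are invented, and the initialization simply picks k distinct data points at random as the starting centroids.

```python
import random

def distance_sq(p, q):
    """Squared Euclidean distance between two points of any dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k):
    # Step 1: choose k distinct data points as the initial centroids.
    centroids = random.sample(points, k)
    while True:
        # Step 2: assign each data point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance_sq(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: the new centroid of a cluster is the mean of its points
        # (if a cluster ends up empty, keep its old centroid).
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids
```

Because the starting centroids are random, different runs can converge to different locally optimal solutions, which is why the activity compared runs using withinss.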

You can download a complete example of k-means. You can also download some Python code that automatically solves examples like the previous one.

When we move from a dataset of 1-D points (instances with a single numeric feature) to a dataset of 2-D, 3-D, or higher-dimensional points, we can use the Euclidean distance as a measure of the distance between two points. For example, the Euclidean distance between the two 3-D points (x1, y1, z1) and (x2, y2, z2) is:

    d = sqrt((x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2)

Notice that since we only need to compare relative distances between points, we can simplify the math: drop the square root and use the sum of squares as our distance metric:

    ss = (x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2

Resources

  1. kmeans.py K-means code base in Python. Includes functions to read CSV files, visualize scatterplots, and perform k-means clustering with or without animation, once or multiple times.
  2. tshirts-G.ods and tshirts-G.csv: T-shirts dataset collected in section G.
  3. tshirts-H.ods and tshirts-H.csv: T-shirts dataset collected in section H.
  4. tshirts-I.ods and tshirts-I.csv: T-shirts dataset collected in section I.
  5. tshirts-J.ods and tshirts-J.csv: T-shirts dataset collected in section J.
  6. tshirts-G-nooutliers.csv: T-shirts dataset collected in section G, outliers removed.
  7. tshirts-H-nooutliers.csv: T-shirts dataset collected in section H, outliers removed.
  8. tshirts-I-nooutliers.csv: T-shirts dataset collected in section I, outliers removed.
  9. tshirts-J-nooutliers.csv: T-shirts dataset collected in section J, outliers removed.
  10. tshirts-GHIJ-nooutliers.csv: T-shirts dataset collected in sections G, H, I, and J, outliers removed.