I recently created a map of the cost of crime in Philadelphia’s neighborhoods:

[Figure: Crime clusters]

After calculating these values in R, I needed a digestible way to present the total cost of crime in each neighborhood.

I picked a choropleth map, but didn’t know how to cluster the values. I had two problems: how many clusters should I choose, and where should the breaks between clusters fall?

I discovered the Jenks natural breaks optimization, which nearly answers both questions. The Jenks method determines breaks by minimizing the variance within classes and maximizing the variance between classes [1].
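To make that objective concrete, here is a minimal Python sketch (not the R code used for the map) that scores a candidate set of breaks by its within-class sum of squared deviations. The function name `within_class_ss`, the "inclusive upper bound" convention for breaks, and the sample values are all assumptions made for illustration. Because the total variation of a dataset is fixed, shrinking the within-class part is the same as growing the between-class part.

```python
from statistics import mean


def within_class_ss(values, upper_bounds):
    """Sum of squared deviations of each value from its class mean.

    upper_bounds: ascending, inclusive upper bound of each class
    (a convention chosen for this sketch).
    """
    classes = [[] for _ in upper_bounds]
    for v in values:
        # Put the value in the first class whose upper bound covers it.
        for i, bound in enumerate(upper_bounds):
            if v <= bound:
                classes[i].append(v)
                break
        else:
            raise ValueError(f"{v} is above the last upper bound")

    total = 0.0
    for members in classes:
        if members:  # skip empty classes
            m = mean(members)
            total += sum((v - m) ** 2 for v in members)
    return total


# Made-up values with three obvious groups.
costs = [1, 2, 3, 10, 11, 12, 30, 31]
natural = within_class_ss(costs, [3, 12, 31])   # breaks at the gaps
awkward = within_class_ss(costs, [10, 12, 31])  # first break misplaced
print(natural, awkward)
```

With these sample values, the breaks that follow the gaps in the data score 4.5, while the misplaced break scores 51. The Jenks method searches for the breaks with the lowest such score.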

I say nearly because the Jenks method requires you to specify the number of breaks up front. Fortunately, you can calculate the fit and accuracy of your decision. As the number of breaks increases, the accuracy increases, but the legibility of your map can suffer.

R and the classInt package make it easy to see this relationship so you can test different quantities of breaks. Here’s a screenshot of me testing 4, 6, 8, 10, and 12 breaks:

[Figure: Testing Jenks natural breaks]

6, 8, and 10 breaks looked promising, so I tested them and ultimately selected 8 breaks.
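For readers without R at hand, the trade-off can be sketched in plain Python. This is not classInt's implementation: it is a small dynamic program that finds the breaks with the lowest within-class sum of squares for sorted one-dimensional data, followed by a goodness of variance fit (GVF) score, where 1.0 means a perfect fit. Treating GVF as the fit score, along with the function names and the sample values, is an assumption for illustration.

```python
def jenks_breaks(values, k):
    """Return (inclusive upper bound of each class, within-class SS)."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def slice_ss(i, j):
        # Sum of squared deviations from the mean for xs[i:j].
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[c][j]: best within-class SS for the first j values in c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):  # last class is xs[i:j]
                candidate = cost[c - 1][i] + slice_ss(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    split[c][j] = i

    # Walk back through the recorded split points to recover the bounds.
    bounds, j = [], n
    for c in range(k, 0, -1):
        bounds.append(xs[j - 1])
        j = split[c][j]
    bounds.reverse()
    return bounds, cost[k][n]


def goodness_of_variance_fit(values, k):
    """GVF = 1 - (within-class SS / total SS)."""
    bounds, within = jenks_breaks(values, k)
    m = sum(values) / len(values)
    total = sum((v - m) ** 2 for v in values)
    if total == 0:
        return bounds, 1.0  # all values identical: any grouping fits
    return bounds, 1 - within / total


# Made-up values; real input would be one cost per neighborhood.
costs = [1, 2, 3, 10, 11, 12, 30, 31]
for k in (2, 3, 4, 5):
    bounds, gvf = goodness_of_variance_fit(costs, k)
    print(k, bounds, round(gvf, 4))
```

The fit can only stay the same or improve as `k` grows, so the score alone always favors more breaks. The choice comes down to the point where an extra break stops buying much accuracy relative to the legibility it costs. The triple loop is fine for a few hundred neighborhoods but would be slow on large datasets.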

Mapbox Studio and Mapbox GL JS made it easy to style these different breaks and push the map to the web (I used this guide to learn how to do this).
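One way to carry breaks into a Mapbox style is a `step` expression on a layer's `fill-color`, which returns one color per class. The Python sketch below only builds that expression as JSON; it is not necessarily how this map was styled. The property name `cost`, the break values, and the colors are hypothetical placeholders.

```python
import json


def step_color_expression(property_name, lower_edges, colors):
    """Build a Mapbox GL 'step' expression for a choropleth.

    lower_edges: strictly ascending values where each class after the
    first begins; colors needs one more entry than lower_edges.
    """
    if len(colors) != len(lower_edges) + 1:
        raise ValueError("need exactly one more color than break")
    if any(a >= b for a, b in zip(lower_edges, lower_edges[1:])):
        raise ValueError("breaks must be strictly ascending")

    # The first color applies to values below the first break.
    expression = ["step", ["get", property_name], colors[0]]
    for edge, color in zip(lower_edges, colors[1:]):
        expression.extend([edge, color])
    return expression


# Hypothetical property name, breaks, and colors.
expr = step_color_expression(
    "cost", [1000, 5000], ["#fee8c8", "#fdbb84", "#e34a33"]
)
print(json.dumps(expr))
```

A `step` expression assigns a value equal to a break to the higher class, so breaks computed as inclusive upper bounds may need a small adjustment before they are used as lower edges here.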

If you’re interested in calculating the cost of crime for your city, check out the project on GitHub.

  1. This method breaks down when comparing different groups of data (like crime in Philly neighborhoods versus crime in DC neighborhoods), but since I was clustering univariate data, this worked well.