Introduction and Problem Statement
Singapore is a small country with an area of 725.7 sq km and a population of 5.6 million. Despite its small size, Singapore has a diversity of languages, religions, and cultures. It does not fit the traditional description of a nation; it is a society in transition, since Singaporeans do not all speak the same language, share the same religion, or have the same customs. This diversity in Singapore's population has given rise to neighborhoods that can be distinguished from each other by cuisine, culture, nationality, religion, and many other features.
I want to explore what the distribution of venues tells us about each neighborhood.
I am an Indian national residing in Singapore. I have stayed in places such as Clementi and Bukit Panjang.
- Are the locations I have stayed in similar to each other?
- I am looking to move to the east of the island because of a job change. Which areas there would match my preferences?
Description of Data and How to Use It
The Wikipedia page on Singapore's postal codes shows how the country is divided into postal districts, postal sectors, and locations.
We can scrape the postal code table from that page to get the list of locations in Singapore.
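As a rough illustration of the scraping step, the sketch below pulls the rows out of an HTML table using only the standard library. The class name `TableRowParser`, the helper `parse_table_rows`, and the sample HTML are assumptions made for this example; the live page's markup and column headings may differ, and fetching the page is left out so the example runs offline.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and normalise internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_table_rows(html):
    """Return all table rows found in `html` as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Illustrative sample in the shape "district | sectors | locations".
SAMPLE_HTML = """
<table>
  <tr><th>Postal district</th><th>Postal sector</th><th>General location</th></tr>
  <tr><td>05</td><td>11, 12, 13</td>
      <td>Pasir Panjang, Hong Leong Garden, Clementi New Town</td></tr>
</table>
"""

rows = parse_table_rows(SAMPLE_HTML)
header, body = rows[0], rows[1:]
```

When `lxml` or `beautifulsoup4` is installed, `pandas.read_html` is a shorter way to get the same table directly as a data-frame.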
Geopy library and Nominatim API
The Nominatim search API looks up a location from a textual description. For each location in the list obtained from Wikipedia, we can get geographical coordinates through the Geopy library. (https://nominatim.openstreetmap.org/)
The Foursquare API offers real-time access to Foursquare’s global database of rich venue data and user content. We construct a URL to send a request to the Foursquare API to explore venues around a geographical location and to get trending venues around it.
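A minimal sketch of how such a request URL could be assembled is shown below. It assumes the legacy Foursquare v2 "explore" endpoint and its `client_id`, `client_secret`, `v`, `ll`, `radius`, and `limit` query parameters; the function name `build_explore_url`, the default values, and the placeholder credentials are hypothetical, and the current Foursquare documentation should be checked before relying on this.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy v2 "explore" route.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      version="20180605", radius=500, limit=100):
    """Build an explore URL for venues around (lat, lng).

    `version`, `radius` (metres) and `limit` are illustrative defaults,
    not values taken from the original study.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Approximate coordinates for Clementi, used only as an illustration.
url = build_explore_url("YOUR_ID", "YOUR_SECRET", 1.3151, 103.7652)
```

Building the query string with `urlencode` keeps the credentials and coordinates properly escaped instead of relying on manual string formatting.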
The approach to this problem using data is as follows:
- Collect the Singapore city data from https://en.wikipedia.org/wiki/Postal_codes_in_Singapore
- Determine the coordinates of each location using the Geopy library and the Nominatim API
- Find all venues for each neighborhood using the Foursquare API
- Rank the venues at each location by venue description and frequency
- Visualize the neighborhoods using the folium library
- Cluster the neighborhoods using the k-means clustering algorithm
- Draw conclusions based on the clusters and venue data
In Singapore there are multiple postal districts, and within each postal district there are postal sectors. Each postal sector has associated locations. Here we extract the list of locations from the Wikipedia page.
I also added Latitude and Longitude columns, which we will fill in later. For now they hold None values.
Now we explode the Location column: we split each comma-separated Location value so that every location in Singapore gets its own row.
The Latitude and Longitude columns are still empty.
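The explode step can be sketched with pandas as follows. The two sample rows and the "Postal District" column name are made up for illustration; only the `Location`, `Latitude`, and `Longitude` column names come from the text.

```python
import pandas as pd

# Made-up sample rows: one comma-separated Location string per district.
df = pd.DataFrame({
    "Postal District": ["05", "16"],
    "Location": ["Pasir Panjang, Clementi New Town", "Bedok, Upper East Coast"],
})

# Placeholder coordinate columns, to be filled in by the geocoding step.
df["Latitude"] = None
df["Longitude"] = None

exploded = (
    df.assign(Location=df["Location"].str.split(","))      # string -> list
      .explode("Location")                                  # one row per item
      .assign(Location=lambda d: d["Location"].str.strip()) # drop stray spaces
      .reset_index(drop=True)
)
```

Each original row is repeated once per location, so the district and the empty coordinate columns are carried along unchanged.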
Now we use the Geopy library to get the latitude and longitude of every location in Singapore and fill our data-frame with the coordinates of each location.
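A minimal sketch of the geocoding step, assuming geopy's `Nominatim` geocoder. The helper names `make_nominatim_geocoder` and `fill_coordinates`, the user-agent string, and the ", Singapore" suffix are assumptions for this example. The geocoder is passed in as a plain callable so the filling logic can be tried without network access.

```python
def make_nominatim_geocoder(user_agent="sg-neighbourhood-explorer"):
    """Return a callable mapping an address string to (lat, lon) or None."""
    # Imported lazily so the rest of the sketch works without geopy installed.
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)

    def geocode(address):
        location = geolocator.geocode(address)
        if location is None:  # Nominatim found no match
            return None
        return location.latitude, location.longitude

    return geocode


def fill_coordinates(locations, geocode, suffix=", Singapore"):
    """Look up every location name and return rows with coordinates.

    Names that cannot be resolved keep None for both coordinates.
    """
    rows = []
    for name in locations:
        coords = geocode(name + suffix)
        lat, lon = coords if coords is not None else (None, None)
        rows.append({"Location": name, "Latitude": lat, "Longitude": lon})
    return rows


# Live usage (needs network access and geopy):
# rows = fill_coordinates(["Clementi", "Bedok"], make_nominatim_geocoder())
```

The public Nominatim service asks for at most one request per second, so for a list of around 70 locations it is worth wrapping the lookup in geopy's `RateLimiter` or adding a short pause between calls.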
Let's now create a map of Singapore with the neighborhood locations superimposed on top.
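The map could be built roughly as below, assuming folium's `Map` and `CircleMarker`. The row format (dictionaries with `Location`, `Latitude`, and `Longitude` keys), the helper names, and the styling values are assumptions for this sketch.

```python
def map_center(rows):
    """Mean latitude/longitude of the rows that have coordinates."""
    coords = [(r["Latitude"], r["Longitude"]) for r in rows
              if r["Latitude"] is not None and r["Longitude"] is not None]
    lat = sum(c[0] for c in coords) / len(coords)
    lon = sum(c[1] for c in coords) / len(coords)
    return lat, lon


def build_map(rows, zoom_start=11):
    """Return a folium map with one circle marker per located row."""
    import folium  # imported lazily; requires folium to be installed

    m = folium.Map(location=list(map_center(rows)), zoom_start=zoom_start)
    for r in rows:
        if r["Latitude"] is None or r["Longitude"] is None:
            continue  # skip locations that could not be geocoded
        folium.CircleMarker(
            location=[r["Latitude"], r["Longitude"]],
            radius=5,
            popup=r["Location"],
            color="blue",
            fill=True,
            fill_opacity=0.7,
        ).add_to(m)
    return m


# Usage (in a notebook the returned map renders inline):
# build_map(rows).save("singapore_locations.html")
```

Centering the map on the mean of the known coordinates avoids hard-coding a center for Singapore.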
The next step is to use the Foursquare API to extract information about venues near each of these locations.
Here is a sample of the extracted output:
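As a sketch of the extraction, the function below flattens one response into a row per venue. It assumes the JSON layout of the legacy v2 "explore" endpoint (`response` → `groups` → `items` → `venue`); the function name `extract_venues`, the output column names, and the sample payload are made up for illustration.

```python
def extract_venues(location_name, payload):
    """Flatten an explore-style response into one dict per venue."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    venues = []
    for item in items:
        venue = item["venue"]
        # A venue may come without categories; fall back to None.
        categories = venue.get("categories") or [{}]
        venues.append({
            "Location": location_name,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0].get("name"),
        })
    return venues


# Made-up payload in the assumed shape. A live call would look roughly like
# payload = requests.get(url).json() with the `requests` package.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Kopitiam",
                   "location": {"lat": 1.315, "lng": 103.765},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "Example Park",
                   "location": {"lat": 1.316, "lng": 103.766},
                   "categories": []}},
    ]}]}
}

venues = extract_venues("Clementi", sample_payload)
```

Calling this once per location and concatenating the results gives a single table of venues keyed by location.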
Number of venues found at each location
We can also find the top venue categories at each location. Here is a sample output of the top 5 venue categories by count near each location:
We can then put this output in a data-frame. I have taken the top 10 venue categories at each location to build the table below:
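One way to get from the venue list to a "top N" table is to one-hot encode the venue categories, average them per location, and sort each row. The sketch below assumes that approach; the sample rows, the column names, and the helper `top_categories` are made up, and N is 2 here only because the sample is tiny.

```python
import pandas as pd

# Made-up venue rows: one row per venue found near a location.
venues = pd.DataFrame({
    "Location": ["Clementi", "Clementi", "Clementi",
                 "Bedok", "Bedok", "Bedok"],
    "Venue Category": ["Coffee Shop", "Food Court", "Coffee Shop",
                       "Food Court", "Food Court", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Location", venues["Location"])

# Mean of the one-hot columns = share of each category per location.
freq = onehot.groupby("Location").mean()


def top_categories(freq_row, n):
    """Names of the n most frequent categories in one row of `freq`."""
    return freq_row.sort_values(ascending=False).head(n).index.tolist()


top = {loc: top_categories(freq.loc[loc], 2) for loc in freq.index}
```

The `freq` table, with one row per location and one column per category, can also serve directly as a numeric feature matrix for clustering.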
Now begins the clustering of the locations. We have around 70 locations, so I have assumed 7 clusters for Singapore, so that each cluster can accommodate 10 locations on average.
I am using the k-means clustering algorithm to cluster the locations. The feature set is the top 10 venues near each location.
Here is what the code looks like.
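A minimal sketch of this step, assuming scikit-learn's `KMeans` and a numeric feature matrix with one row per location (for example, per-category venue frequencies). The toy matrix and location names are made up, and the toy uses 2 clusters because it has only four rows; the study itself uses 7.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: each row is a location, each column a venue
# category frequency. Values are invented for illustration.
locations = ["A", "B", "C", "D"]
features = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# random_state fixes the initialisation so repeated runs agree;
# n_init=10 restarts k-means ten times and keeps the best result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Map each location name to its cluster label.
labels = dict(zip(locations, kmeans.labels_.tolist()))
```

The label numbers themselves are arbitrary; only which locations share a label is meaningful.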
Finally, let’s visualize the resulting clusters
The resulting clusters answer our questions and also unearth some interesting observations. Some of them follow:
- Bukit Panjang and Clementi, the places I have stayed in Singapore, are part of the same cluster 1 (aqua blue)
- If I have to look for places to stay after my next job change, the favorable locations are Simei, Punggol, Tampines, Bedok, and Pasir Ris
- Cluster 4 (red) is an interesting cluster. It is spread all across Singapore: it is found at the corners of the island as well as in the Central Business District. The likely reason is the presence of tourist attractions, which are natural attractions in the suburbs and museums in the central district
- Little India forms cluster 2 (green) on its own. It is a popular spot for the Indian community, with a plethora of shopping options and restaurants. It is unique in its own way, and the experience cannot be found anywhere else.
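The two opening questions can be asked of the clustered table along the following lines. The rows below are a small illustrative stand-in, not the study's output: the longitudes are rounded approximations, the cluster labels for Tampines and Bedok are assumed, and the 103.9 "east" threshold and helper names are hypothetical.

```python
# Illustrative rows only; a real run would use the full clustered table.
clustered = [
    {"Location": "Clementi", "Longitude": 103.765, "Cluster": 1},
    {"Location": "Bukit Panjang", "Longitude": 103.762, "Cluster": 1},
    {"Location": "Little India", "Longitude": 103.852, "Cluster": 2},
    {"Location": "Tampines", "Longitude": 103.945, "Cluster": 1},
    {"Location": "Bedok", "Longitude": 103.930, "Cluster": 1},
]


def cluster_of(rows, name):
    """Cluster label of the named location."""
    return next(r["Cluster"] for r in rows if r["Location"] == name)


def same_cluster(rows, a, b):
    """True when locations a and b were assigned the same cluster."""
    return cluster_of(rows, a) == cluster_of(rows, b)


def similar_locations_east_of(rows, reference, min_longitude):
    """Locations sharing the reference's cluster that lie east of a meridian."""
    target = cluster_of(rows, reference)
    return [r["Location"] for r in rows
            if r["Cluster"] == target
            and r["Longitude"] > min_longitude
            and r["Location"] != reference]


stayed_similar = same_cluster(clustered, "Clementi", "Bukit Panjang")
east_candidates = similar_locations_east_of(clustered, "Clementi", 103.9)
```

Filtering by cluster first and by longitude second mirrors the two questions: similarity to places already lived in, then restriction to the east of the island.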
The above methodology and analysis give an excellent approach to answering our initial questions. The study also uncovers some interesting observations that can be used to answer other important questions.
This analysis, when combined with ratings of restaurants, tourist attractions, or shopping experiences, can help pinpoint the exact venue of interest.
For example: where can I find the coffee I like when I move to a different location?
Overall, this case study has been very useful for gaining knowledge of data wrangling, data engineering, machine learning, visualization, and presentation.