Exploring the Relationship of Uber Pickups and Venues of New York City
1. Introduction
I explored New York City, segmented and clustered its neighborhoods based on their popular venue categories following the instructions of the course lab. The biggest clustered dataset showed that there were many coffee shops, cafes, restaurants of various cuisines, and bars and clubs. It also had health-related venues such as gyms, yoga studios, cycle studios, etc. The data clearly showed that the city is very diverse.
In a big city like New York City, it is very popular to take taxes or use ride sharing services like Uber and Lyft. The Uber Pickups in New York City is available on kaggle.com. I would like to further explore New York City combining the Foursquare’s location data with the Uber Pickups data to see if venue categories are related to Uber usage count. This may discover popular pickup locations and this finding may be helpful for the city to better understand how much traffic increase in certain areas by Uber.
2. Data Acquisition and Cleaning
2.1 Data Sources
The Uber Pickups in New York City dataset can be found on Kaggle.com [link]. This dataset contains 6 months of Uber pick up geo-locations (latitudes and longitudes) of year 2014. Location information along with venue details will be dynamically obtained by utilizing the Foursquare location APIs.
2.2 Data Cleaning
The Uber Pickups data is a zipped file (Uber-dataset.zip) and contains 6 CSV files in it. And each CSV file contains pick up data for each month from Apr through Sep 2014. The individual file has 4 attributes (columns): Date/Time, Lat, Lon, and Base. Latitude and longitude data are required to query nearest venues with the Foursquare APIs, but the other columns, Date/Time and Base, are not needed for the analysis. So, I dropped these two unnecessary columns from the dataset and combined all the data from the 6 CSV files and created a table as shown in <Figure 1>. In order to read CSV files in sequence with for loop, I renamed the file names with index number at the end: uber-raw-data-0.csv uber-raw-data-1.csv, …, uber-raw-data-5.
<Figure 1>
I created a table with all the location data and there were 4,534,327 rows. I then inserted the entire data into the Locations table in the SQLite database as shown in <Figure 2>. It was not possible to call the Foursquare API for all the locations in the table due to the daily limit, I randomly chose 8,000 locations among them. Utilizing the SQLite database, I was able to keep track of locations that had been used to get nearby venues by calling the search API (https://api.foursquare.com/v2/venues/search).
I collected venues within a radius of 75 meters from each coordinate (latitude and longitude) in the Locations table. I selected 75 meters after testing several shorter distances. The API did not return any nearby venues for many pick-up locations for shorter radiuses. 75 meters was big enough to find more than one nearby venue. Since the API also provides a distance between the pick-up location and the venue, I can later use only the first n venues listed in the order of distance. It required too many days to get nearby venues for all those 8,000 randomly chosen locations due to the daily limit of calling the search API. I spent several days and was able to get 20,156 venues for 4,228 locations. I then put all the collected venues’ information such as venue_id, name, latitude, longitude, category and a distance from the pick-up location into another table, Venues, in the database as shown in <Figure 3>.
<Figure 2>
<Figure 3>
As the next step, I originally picked only two closely located venues from each pick-up location in order to reduce duplicated venues that may be located in the overlapped areas within radiuses of multiple pick-up locations as shown in <Figure 4>. However, I realized that if I picked only two venues, it means that I can only use two categories as features for k-means clustering algorithm. So, I increased the number to five.
<Figure 4>
This was achieved through a few steps of data processing. First, I got all the venue data from the Venues table in the database and created a Pandas dataframe. As the second step, I sorted the dataset with these two columns, location_id and distance, and got the results as shown in <Figure 5>. Lastly, I grouped the dataset by the location_id column and picked only the top five rows by using head(5) method. The results are shown in <Figure 6>. Note that some locations have less than five nearby venues.
<Figure 5>
<Figure 6>
2.3 Feature Selection
Analyzing the venue data in the Venues table in the database, I found that there are 411 unique venue categories. I had already created a new dataset with only top 5 nearby venues for 4,228 locations among 8,000 randomly picked ones. Some of the locations have less than 5 nearby venues, but most of locations have more than 5 venues. Therefore, using the first top 5 venue categories for each location will be enough to run the k-means clustering algorithm to cluster them into 5 different clusters. As a preparation, I created a new dataset with columns for all the venue categories. And I take means of the frequency of occurrences of each category for every single location. Finally, I picked the first top 5 venue categories for each location as shown in <Figure 7>. This is the final dataset that will be used for clustering.
<Figure 7>
3. Data Analysis
3.1 k-clusters = 5
I ran the k-means clustering algorithm with the prepared dataset with k_clusters set to 5 and plotted all the clustered locations on the map as shown in <Figure 8>. As you can see, the clustered points were scattered all around Manhattan. They were not clustered with any shapes that were separated each other. Therefore, it was necessary to carefully examine the data for each cluster.
<Figure 8>
I count the frequency of all the categories for the 1st Most Common Venue for each cluster and created tables that show the top 20 popular venue categories for each cluster. With the sorted and counted categories in these tables, I could easily find what venue categories that each cluster represents. It was easy to anticipate seeing Bar, Hotel, Coffee shop, and Café. However, it was very interesting to find that Clothing Store and Juice Bar were among the top categories. Theater, Art Gallery and Park were listed as popular categories as well. The most interesting finding was that Italian Restaurant was the top in the Cluster 4 even though there were so many different types of restaurants around. Although I had expected Airport and Hotel to be ranked top, I was not able to see them in top 2 for all the clusters.
I collected all the top 2 categories and plotted them by their counts as a bar chart. I also calculated weights of those top categories in their own cluster and plotted them as another bar chart. They are shown in <Figure 9> and <Figure 10>, respectively.
<Figure 9>
<Figure 10>
3.2 k-cluster = 7
Through the first clustering results, I was able to find quite interesting top common categories and was curious to see more clusters. As the second analysis, I increased k_clusters value to 7. After completing clustering, I plotted the clustered locations on the map shown in <Figure 11>.
<Figure 11>
And I count the frequency of the 1st Most Common Venue categories for each cluster and the tables below show the top 20 popular venue categories for 7 clusters.
As for the first analysis with k_clusters=5, I plotted every top 2 frequent categories in each cluster by their counts and weight in <Figure 12> and <Figure 13>.
<Figure 12>
<Figure 13>
By increasing the number of clusters to 7, I could see new categories such as Airport Service, Deli / Bodega, American Restaurant, Steakhouse and Wine Shop.
4. Results
As shown in the map below, the Uber pick-up locations that were used to call the Foursquare API were very well distributed in the entire Manhattan area and I had not expected to see any dominant venue categories other than a few categories that I could easily guessed before running the k-means clustering algorithm and carefully examined the results. It was somewhat expected to see categories like Bar, Coffee Shop, Hotel and Theater; however, the other categories such as Italian Restaurant, Clothing Store, Art Gallery, and Wine Shop were very interesting findings.
<Figure 14>
5. Conclusion
New York city is very diverse, and a lot of people use car sharing services like Uber. Even though the pick-up locations seemed to be randomly scattered around without any patterns, it was very interesting to see there were very popular venue categories in the clustered locations. I could have found some different results if I could have collected venue information for more pick-up locations. Because of the daily limit of API calls, I was not able to collect more data within a relatively short period of time. However, I collected 20,156 different venues for 4,228 different pick-up locations, and it was still very interesting to see how all those venue categories were clustered into several distinctive clusters. I mainly used Python programming language and its packages such as Pandas, NumPy, matplotlib, folium and sqlite3. In addition, I used Excel, SQLite Studio, and Jupyter Notebook running on my local machine as tools. I realized that there were so many useful libraries and tools to manipulate, analyze and visualize the data. I utilized existing datasets like the Uber pick-ups and Foursquare location data, and successfully proved that there was some level of relationship between Uber pick-up locations and nearly venues by showing the top common venue categories near those locations.
6. Future Direction
The Uber pick-ups dataset has the Date/Time column which tells when an Uber passenger was picked up at that location. The Date information can provide what day was the passenger was picked up, and Time information can provide whether it was morning, afternoon, evening, or late night. The Date information can also provide whether it was a weekday or weekend. These can be another set of features that we can use for clustering algorithms. I guess that I would see categories like Night Club and Bar if I have used Weekday vs Weekend, and Day vs Night as features.