Description
We created an automated framework for analysis of large-scale geospatial data using Spark, Impala, and Python. We use this framework to join billions of daily device locations from mobile phone users to millions of points of interest. We discuss our project structure and workflow. Our work has broader application to the analysis of population movement through time and space.
Large-scale geospatial data presents a common analytical challenge. We receive billions of user location data points per day. We need to filter this data and map it to millions of points of interest. Our end use case is to provide clients with information about the attributes of users who visit their stores. This workflow, however, applies to the broader analytical problem of mapping geospatially located entities of interest to points of interest with known locations. We also present our approach to the common need for both standardized and flexible reporting: we have automated the standard analyses in our workflow, and we created an API that lets analysts quickly develop ad-hoc analyses in response to customer requests.
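To make the core join concrete, here is a minimal PySpark sketch of the kind of device-to-POI mapping we describe. The column names, HDFS paths, grid resolution, and the 100 m visit radius are illustrative assumptions, not our production values; the idea is simply to bucket coordinates into coarse cells so the join avoids a full cross product, then filter candidate pairs by haversine distance.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("poi-join").getOrCreate()

# Hypothetical inputs: one day of device pings and the POI catalog.
pings = spark.read.parquet("hdfs:///data/pings/2016-01-01")
pois = spark.read.parquet("hdfs:///data/pois")

def with_cell(df, lat, lon):
    # Bucket coordinates into ~0.01 degree cells so that candidate pairs
    # share a join key; a production job would also check neighbor cells.
    return df.withColumn("cell", F.concat_ws(",",
        F.round(F.col(lat), 2), F.round(F.col(lon), 2)))

candidates = with_cell(pings, "lat", "lon").join(
    with_cell(pois, "poi_lat", "poi_lon"), on="cell")

# Haversine distance in meters between a ping and a candidate POI.
R = 6371000.0
dist = 2 * R * F.asin(F.sqrt(
    F.pow(F.sin(F.radians(F.col("poi_lat") - F.col("lat")) / 2), 2)
    + F.cos(F.radians("lat")) * F.cos(F.radians("poi_lat"))
    * F.pow(F.sin(F.radians(F.col("poi_lon") - F.col("lon")) / 2), 2)))

# Keep pings within 100 m of a POI as "visits".
visits = candidates.where(dist < 100).select("device_id", "poi_id")
visits.write.parquet("hdfs:///data/visits/2016-01-01")
```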
Freely available tools in the Python ecosystem let us complete every task in our data analysis pipeline. At a high level, the architecture of our process is as follows: HDFS storage for large-scale geospatial data, Spark for geospatial joins, Cloudera Impala accessed via Ibis to query the resultant datasets, and the scientific Python stack for conducting analyses. We make the automated analyses available to end users via a web portal built with Flask and Celery. We created an API, distributed to analysts as a Python package, so that they can quickly perform custom analyses requested by clients. We document the API with Sphinx to make it easier to use.
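The query layer can be exercised from Python without hand-written SQL. The sketch below shows the style of Ibis expression an analyst might run against the joined tables; the host, database, and schema are placeholders we assume for illustration.

```python
import ibis

# Hypothetical connection parameters for the Impala cluster.
con = ibis.impala.connect(host="impala-host", port=21050, database="geo")

visits = con.table("visits")

# Count distinct devices seen at each point of interest; Ibis compiles
# this expression to Impala SQL and returns a pandas DataFrame.
expr = visits.group_by("poi_id").aggregate(
    visits.device_id.nunique().name("visitors"))
visitors_by_poi = expr.execute()
```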
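For the portal, the pattern we describe is Flask accepting a request and Celery running the long-lived analysis asynchronously. This is a minimal sketch of that pattern, not our actual portal code; the broker URL, endpoint, and task body are hypothetical.

```python
from celery import Celery
from flask import Flask, jsonify

app = Flask(__name__)
celery = Celery(__name__, broker="redis://localhost:6379/0")

@celery.task
def run_standard_report(poi_id):
    # Placeholder for an automated analysis over the joined tables.
    return {"poi_id": poi_id, "status": "complete"}

@app.route("/reports/<int:poi_id>", methods=["POST"])
def submit_report(poi_id):
    # Queue the job and return its id so the client can poll for results.
    task = run_standard_report.delay(poi_id)
    return jsonify({"task_id": task.id}), 202
```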