Managing Spatial Data in a Distributed Environment

ABSTRACT

Advances in distributed computing enable the pooling of resources located across the Internet to provide a scalable, and robust solution for many computational needs. Distributed key-value store systems like Google's BigTable and Amazon's Dynamo allow the indexing and retrieval of a large amount of data in parallel, while distributed computing frameworks like MapReduce and Pregel provide a fault-tolerant way to process a large amount of data using distributed computing resources. These distributed computing techniques are applied to the spatial database domain. Specifically, issues involved in storing and retrieving spatial data in a distributed environment, as well as, processing spatial queries in parallel using a distributed computing framework are investigated.

All of these methods rely on variants of hashing in order to obtain near constant time behavior in distributing the data and it is preferable that they are as close as possible to being distance-preserving. Specifically, spatial objects in proximity should have similar hash values. In particular, it is desirable to be able to estimate how far apart two objects are (within a given error bound) by just considering their hash values. Such hash functions enable performing an approximate range query using simple hash table lookup operations. Other issues involve the parallel processing of spatial queries. Some easy examples are the distance join query which finds pairs (p,q) of objects (from two different sets) where the distance between p and q is less than a given threshold, or computing the shortest paths from each node to every other node in a road network. More difficult are the spatial problems which can not be easily decomposed into multiple tasks running in parallel, e.g., the distance semi-join query, and network Voronoi diagram construction. This requires developing a generic method to traverse a graph or a tree in parallel to solve these query problems. Ideally, the method should require little or no communication between parallel tasks which will be accomplished by allowing the parallel tasks to produce redundant results which can then be pruned.

The tools that we develop will help improve the robustness and scalability for spatial data management. Results from our investigation on parallel query processing can be useful for query problems which require traversing a tree or a graph which are often spatially embedded. Having a method to traverse a graph or a tree in parallel that requires little or no communication enables us to process many types of spatial queries using distributed computing resources where communication can be very costly. Specifically, we expect our tools to enable spatial applications such as online mapping, computer aided design, online gaming and scientific simulations to handle terabytes of spatial data while it is impossible or inefficient to do with the current technologies. This is of utility to all organizations that process spatial data and attempts will be made to use it in some government agencies.

NSF Grant IIS-1320791, NSF I-Corps Team 708, previous NSF annual report (2015) is here

RESULTS

  1. S. Peng, J. Sankaranarayanan, and H. Samet. ASDO: Complex Analytical Road Queries Inside Any Database. Submitted for publication.

    Fast computation of the distance oracle ASDO (all store distance oracle) to achieve a high-throughput for shortest distance retrieval.

  2. S. Peng, J. Sankaranarayanan, and H. Samet. SPDO: High-Throughput Road Distance Computations on Spark using Distance Oracles. In 32nd IEEE International Conference on Data Engineering (ICDE), pages 1239-1250, Helsinki, Finland, May, 2016. [link]

    A framework called SPDO is presented which implements an extremely fast distributed algorithm for computing road network distance queries on Apache Spark. The approach extends our previous work of developing the distance oracle which has now been adapted to use Spark’s resilient distributed dataset (RDD).

  3. S. Ayhan and H. Samet. Aircraft trajectory prediction made easy with predictive analytics. In Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, San Francisco, August 2016.

  4. H. Li, S. Peng, and H. Samet. Streaming News Image Summarization. To appear in 23nd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, December, 2016.

    Describes work on summarization and visualization of streaming local news.

  5. S. Peng and H. Samet. Analytical Queries on Road Networks: An Experimental Evaluation of Two Architectures. In Proceedings of the 23nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, November, 2015. [link]

    A detailed evaluation of the two architectures for a variety of analytical query processing tasks such as region, KNN, distance matrix and trajectory queries is presented and the lessons learned are discussed.

  6. S. Peng, M. D. Adelfio, and H. Samet
    Viewing streaming spatially-referenced data at interactive rates.
    In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 409-412, Dallas, TX, November, 2014. [link]

    Deals with Dynamic map labeling which arises when the location associated with the label is constantly changing.

  7. J. Rao, J. Lin, and H. Samet
    Partitioning strategies for spatio-textual similarity join.
    In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, pages 40-49, Dallas, TX, November, 2014.[link]

    Evaluation of algorithms for computing a spatio-textual similarity join.

DEMO (works best with the Chrome browser)

A full version web-based demo can be found here, which is based on our previous year's demo. We participated in the NSF I-Corps program cohort of Mar. 2016 - May. 2016 to perform customer discovery for the possible commercialization of our distance oracle technology. During our customer discovery we found that shortest distance/path queries in real applications, such as package delivery and local awareness ads, occur primarily in a city-level region, i.e., very few of the shortest distance/path queries take place across the entire USA continent. Our experiments as in the following graph show that using geodesic distance (Euclidean distance) to approximate network distance can produce significant errors. The following graph tabulates the ratio of network distance to geodesic distance for all pairs of nvertices on three example city road networks. The ratio is called the Route Directness Indexand shows the distortion resulting from approximating network distance with geodesic distance. On the average, the value of this index for New York City (NYC) is 1.213, for the San Francisco Bay Area is 1.384, and for Salt Lake City (SLC) it is 1.475, where NYC is low because most of the streets are laid out on a grid. But even for NYC, use of the approximation finds that 40% of the distance queries will have an error of 20% or more.

The demo on this site has two components. The first part corresponds to the computation of the network distance using the distance oracle for a given error tolerance between two locations which are entered by positioning the mouse on the map at the desired locations and clicking on the buttons labeled "Enter Marker1" and "Enter Marker2". This year our computation is much more accurate than the calculation in the previous year's demo as here we only focused on a small region (e.g., the San Francisco Bay Area) and use a more accurate map built with OpenStreetMap. Building our distance oracles with OpenStreetMap data took a bit of time and trying to do it in a reasonable time was one of the challenges that we faced. Here the ASDO (i.e., the all store distance oracle) has been precomputed with an epsilon error tolerance value of 0.1. For comparison purposes, we also provide the true network distance computed using the CH method which is the state-of-the-art method on the same OpenStreetMap road network, as well as the Google Maps result on an accurate commercial road network. Our distances are off vis-a-vis the results of using Google Maps since we built the ASDO using the free OpenStreetMap road network while Google uses an accurate commercial road network which is not available to us. Access to data is an issue here. Building our distance oracles with OpenStreetMap data took a bit of time and trying to do it in a reasonable time was one of the challenges that we faced.

Notice that the distance computed by ASDO is very close in value to the true distance (the epsilon error tolerance value of 0.1 is a bound but in fact is much in excess of the actual error that we observed and thus it is quite reasonable to use it). Please also note that the Google computed network distance is only available if you use the Chrome browser; otherwise, it is missing. Additionally, we provide the SQL query that we executed against the oracle to obtain the network distance.

The second part of the demo consists of the computation of an origin-destination distance matrix and for a set of user-defined delivery locations and the computation of a route that visits all of these locations. The first step is to click on the "Clear" button to start afresh. The delivery locations are specified by repeatedly clicking on the "Add More Delivery!" button, while the origin-destination distance matrix is obtained by clicking the "Compute Distance Matrix" button. The route is obtained by clicking the "Solve Delivery" button. Users can also bypass clicking the "Compute Distance Matrix" button and compute the route directly in which case the time to do so is greater as it also includes the time needed to compute the origin-destination distance matrix. Notice that we also provide the actual times needed to do these calculations in the row labeled "Analytics" where we also provide the time needed with the alternative methods that are based on contraction hierarchies (termed the "CH" method"). The CH methods do not scale whereas ours do and this scaling process is the result of the research that we have been doing under this grant. This demo reveals a speedup factor of approximately 5.

Observe the speedup of ASDO over the true distance using the CH method which can be as high as 100X. Moreover, ASDO permits simultaneous computation of the distance and can handle as many as 65,000 distance lookups per second assuming one database server. Adding Y servers will increase the number of lookups per second by a factor of Y. On the other hand, the true distance is computed sequentially and cannot be parallelized on one machine although it can be parallelized over Y machines in which case it is sped up by a factor of Y but this is still a much slower throughput than what is achievable with ASDO.

In addition, as the size of the region is restricted to a city or several cities, we can easily load all ASDO results inside memory instead of a disk-based database. Our evaluations show that ASDO inside memory can achieve 7 million shortest distances per query using 8 threads on a regular Macbook Pro laptop. It is 100X faster than the ASDO inside a disk-based database. It can compute a (10K X 10K) distance matrix within 13 seconds, which is desirable by delivery companies such as USPS, UPS, and Fedex. We are exploring new applications that require a super high throughput, and developing solutions using our memory-based ASDO.

1st Location
Lat: Lon:
2nd Location
Lat: Lon:
ASDO Network Distance
Quad Depth
Euclidean Distance
Google Maps Distance
True Network Distance
SQL Query
ASDO Delivery API
Analytics Runtime

RELEVANT PUBLICATIONS

  1. J. Sankaranarayanan, H. Samet
    Query processing using distance oracles for spatial networks.
    IEEE Transactions on Knowledge and Data Engineering, 22(8):1158-1175, August 2010.[link]
    Best Papers of ICDE 2009 Special Issue

  2. J. Sankaranarayanan, H. Samet
    Distance oracles for spatial networks.
    In Proceedings of the 25th IEEE International Conference on Data Engineering, pages 652-663, Shanghai, China, April 2009.[link]
    (One of the Best Papers of ICDE 2009. Selected for publication in TKDE journal.)

  3. J. Sankaranarayanan, H. Samet, H. Alborzi
    Path oracles for spatial networks.
    PVLDB, 2(1):1210-1221, August 2009.[link]
    Also Proceedings of the 35th International Conference on Very Large Data Bases (VLDB)

  4. H. Samet, J.Sankaranarayanan, H. Alborzi
    Scalable network distance browsing in spatial databases.
    Computer Science Technical Report TR-4865, University of Maryland, College Park, MD, April 2007.
    In Proceedings of the ACM SIGMOD Conference, pages 43-54, Vancouver, Canada, June 2008.[link]
    Also see University of Maryland Computer Science Technical Report TR-4865, April 2007
    (2008 ACM SIGMOD Best Paper Award)

  5. J. Sankaranarayanan, H. Alborzi, H. Samet
    Enabling query processing on spatial networks.
    In Proceedings of the 22nd IEEE International Conference on Data Engineering, page 163, Atlanta, GA, April 2006.[link]

  6. J. Sankaranarayanan, H. Alborzi, H. Samet
    Distance join queries on spatial networks.
    In Proceedings of the 14th ACM International Symposium on Advances in Geographic Information Systems, pages 211-218, Arlington, VA, November 2006.[link]

  7. J. Sankaranarayanan, H. Alborzi, H. Samet
    Efficient query processing on spatial networks.
    In Proceedings of the 13th ACM International Symposium on Advances in Geographic Information Systems, pages 200-209, Bremen, Germany, November 2005.[link]

Date of last update: Aug 4th, 2016