Managing Spatial Data in a Distributed Environment

ABSTRACT

Advances in distributed computing enable the pooling of resources located across the Internet to provide a scalable, and robust solution for many computational needs. Distributed key-value store systems like Google's BigTable and Amazon's Dynamo allow the indexing and retrieval of a large amount of data in parallel, while distributed computing frameworks like MapReduce and Pregel provide a fault-tolerant way to process a large amount of data using distributed computing resources. These distributed computing techniques are applied to the spatial database domain. Specifically, issues involved in storing and retrieving spatial data in a distributed environment, as well as, processing spatial queries in parallel using a distributed computing framework are investigated.

All of these methods rely on variants of hashing in order to obtain near constant time behavior in distributing the data and it is preferable that they are as close as possible to being distance-preserving. Specifically, spatial objects in proximity should have similar hash values. In particular, it is desirable to be able to estimate how far apart two objects are (within a given error bound) by just considering their hash values. Such hash functions enable performing an approximate range query using simple hash table lookup operations. Other issues involve the parallel processing of spatial queries. Some easy examples are the distance join query which finds pairs (p,q) of objects (from two different sets) where the distance between p and q is less than a given threshold, or computing the shortest paths from each node to every other node in a road network. More difficult are the spatial problems which can not be easily decomposed into multiple tasks running in parallel, e.g., the distance semi-join query, and network Voronoi diagram construction. This requires developing a generic method to traverse a graph or a tree in parallel to solve these query problems. Ideally, the method should require little or no communication between parallel tasks which will be accomplished by allowing the parallel tasks to produce redundant results which can then be pruned.

The tools that we develop will help improve the robustness and scalability for spatial data management. Results from our investigation on parallel query processing can be useful for query problems which requires traversing a tree or a graph which are often spatially embedded. Having a method to traverse a graph or a tree in parallel that requires little or no communication enables us to process many types of spatial queries using distributed computing resources where communication can be very costly. Specifically, we expect our tools to enable spatial applications such as online mapping, computer aided design, online gaming and scientific simulations to handle terabytes of spatial data while it is impossible or inefficient to do with the current technologies. This is of utility to all organizations that process spatial data and attempts will be made to use it in some government agencies.

NSF Grant IIS-1320791.

RESULTS

  1. Peng, J. Sankaranarayanan, and H. Samet. An all-store oracle for fast, approximate shortest distances and paths in road networks. Submitted for publication.

    Fast computation of the distance oracle ASDO (all store distance oracle).

  2. S. Peng, M. D. Adelfio, H. Samet
    Viewing streaming spatially-referenced data at interactive rates.
    In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 409-412, Dallas, TX, November, 2014. [link]

    Dynamic map labeling which is the case when the location associated with the label is constantly changing.

  3. S. Peng and H. Samet. Analytical Queries on Road Networks: An Experimental Evaluation of Two Architectures. Submitted for publication.

    Represents the evaluation of a distributed solution to analytical queries on a road network database.

  4. J. Rao, J. Lin, H. Samet
    Partitioning strategies for spatio-textual similarity join.
    In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, Dallas, TX, November, 2014.[link]

    Evaluation of algorithms for computing a spatio-textual similarity join.

DEMO (works best with the Chrome browser)

Sample ASDO Web Demo: A full version web demo is can be found here To use this demo click on any two points within the map of the mainland of the USA and the oracle returns the network distance between the two points. This is achieved with a database lookup operation. The ASDO (i.e., the all store distance oracle) has been precomputed with an epsilon error tolerance value of 0.25. For comparison purposes, we provide the true network distance computed using our data as well as the Google computed network distance although our distances are erroneous since we built the ASDO using a free inaccurate road network while Google uses an accurate commercial road network. This problem will go away once we invest in a better road network input dataset for our oracle.

Notice that the distance computed by ASDO is very close to the true distance (the epsilon error tolerance value of 0.25 is much in excess of the real error and thus is quite reasonable). Please note that the Google computed network distance is only available if you use the Chrome browser; otherwise, it is missing. Additionally, we provide the SQL query that we executed against the oracle to provide the network distance.

Observe the speedup of ASDO over the true distance which can be as high as 1000X. Moreover, ASDO permits simultaneous computation of the distance and can handle as many as 65,000 distance lookups per second assuming one database server. Adding Y servers will increase the number of lookups per second by a factor of Y. On the other hand, the true distance is computed sequentially and cannot be parallelized on one machine although it can be parallelized over Y machines in which case it is sped up by a factor of Y but this is still a much slower throughput than what is achievable with ASDO.

1st Location
Lat: Lon:
2nd Location
Lat: Lon:
ASDO Network Distance
Quad Depth
Euclidean Distance
Google Maps Distance
True Network Distance
SQL Query

RELEVANT PUBLICATIONS

  1. J. Sankaranarayanan, H. Samet
    Query processing using distance oracles for spatial networks.
    IEEE Transactions on Knowledge and Data Engineering, 22(8):1158-1175, August 2010.[link]
    Best Papers of ICDE 2009 Special Issue

  2. J. Sankaranarayanan, H. Samet
    Distance oracles for spatial networks.
    In Proceedings of the 25th IEEE International Conference on Data Engineering, pages 652-663, Shanghai, China, April 2009.[link]
    (One of the Best Papers of ICDE 2009. Selected for publication in TKDE journal.)

  3. J. Sankaranarayanan, H. Samet, H. Alborzi
    Path oracles for spatial networks.
    PVLDB, 2(1):1210-1221, August 2009.[link]
    Also Proceedings of the 35th International Conference on Very Large Data Bases (VLDB)

  4. H. Samet, J.Sankaranarayanan, H. Alborzi
    Scalable network distance browsing in spatial databases.
    Computer Science Technical Report TR-4865, University of Maryland, College Park, MD, April 2007.
    In Proceedings of the ACM SIGMOD Conference, pages 43-54, Vancouver, Canada, June 2008.[link]
    Also see University of Maryland Computer Science Technical Report TR-4865, April 2007
    (2008 ACM SIGMOD Best Paper Award)

  5. J. Sankaranarayanan, H. Alborzi, H. Samet
    Enabling query processing on spatial networks.
    In Proceedings of the 22nd IEEE International Conference on Data Engineering, page 163, Atlanta, GA, April 2006.[link]

  6. J. Sankaranarayanan, H. Alborzi, H. Samet
    Distance join queries on spatial networks.
    In Proceedings of the 14th ACM International Symposium on Advances in Geographic Information Systems, pages 211-218, Arlington, VA, November 2006.[link]

  7. J. Sankaranarayanan, H. Alborzi, H. Samet
    Efficient query processing on spatial networks.
    In Proceedings of the 13th ACM International Symposium on Advances in Geographic Information Systems, pages 200-209, Bremen, Germany, November 2005.[link]

Date of last update: July 9, 2015