pyscan

Spatial Scan Statistics provide a rigorous framework in which to define anomalies. This family of algorithms scan a set of defined regions over two, or possibly more sets of objects and attempt to maximize a function where the sets of objects have the largest disagreement. For instance given a red and blue set of points in 2d we might try to find a disk where the fraction of red points is as large as possible in comparison with the fraction of blue points.

Usually the maximized function is the Kulldorf Scan Statistic, \(\phi(m, b) = m \log m / b + (1 - m) \log (1 - m) / (1 - b)\), which is derived by assuming that there is a measured set of data \(M\) that follows a Poisson distribution with respect to a baseline set of data \(B\) [Kul97]. There are other statistics as well derived by assuming Gamma, Bernoulli, or Gaussian distributions.

Most current software available for region based spatial scan statistics use exact algorithms which have scaling issues when applied to modern datasets. Pyscan provides a way to significantly increase the scale at which these methods can be run at while providing real guarantees on the accuracy and power of the anomalies found. The code is under active development and has been written about in a number of publications [MP18b][MP18a][MSZ+16]. Pyscan can scan regions defined by halfplanes, disks, and rectangles on point sets and trajectories and can be adapted to work in other related domains such as data sets aggregated to regions. All of these methods can be run at very large scales.

Our code is currently hosted on github at https://github.com/michaelmathen/pyscan.git.

Halfplane Scanning

Rectangle Scanning

Bibliography

[AMP+06]Deepak Agarwal, Andrew McGregor, Jeff M. Phillips, Suresh Venkatasubramanian, and Zhengyuan Zhu. Spatial scan statistics: approximations and performance study. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘06, 24–33. New York, NY, USA, 2006. ACM. URL: http://doi.acm.org/10.1145/1150402.1150410, doi:10.1145/1150402.1150410.
[APV06]Deepak Agarwal, Jeff M. Phillips, and Suresh Venkatasubramanian. The hunting of the bump: on maximizing statistical discrepancy. SODA, 2006.
[DGM95]David P. Dobkin, Dimitrios Gunopulos, and Wolfgang Maass. Computing the maximum bichromatic discrepancy, with applications to computer graphics and machine learning. NeuroCOLT Technical Report Series, March 1995.
[Kul97]Martin Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26:1481–1496, 1997.
[MP18a]Michael Matheny and Jeff M. Phillips. Computing approximate statistical discrepancy. ISAAC, 2018.
[MP18b]Michael Matheny and Jeff M. Phillips. Practical low-dimensional halfspace range space sampling. ESA, 2018.
[MSZ+16]Michael Matheny, Raghvendra Singh, Liang Zhang, Kaiqiang Wang, and Jeff M. Phillips. Scalable spatial scan statistics through sampling. In SIGSPATIAL. 2016.
[Ram72]Urs Ramer. An iterative procedure for the polygonal approximation of plane curves. Computer Graphics and Image Processing, 1(3):244 – 256, 1972. URL: http://www.sciencedirect.com/science/article/pii/S0146664X72800170, doi:https://doi.org/10.1016/S0146-664X(72)80017-0.