Scanning¶

pyscan.max_halfplane(net, mpts, bpts, disc_f)

Scans the set of halfplanes defined by the points in the net and maximizes the function disc_f with respect to the fraction of points contained in mpts and bpts. Function takes $$O(ns\log n)$$ where n is the net size and s is the size of the mpts and bts.

Parameters: net – List of Points mpts – List of WPoints bpts – List of WPoints disc_f – Discrepancy function to maximize over. Tuple of the found halfplane and the maximum disc_f value
pyscan.max_halfplane_fast(r, mpts, bpts, disc_f)

HERE BE DRAGONS!!! This method is still experimental and somewhat broken. This method uses a different method to scan the set of potential halfplanes with respect to the fraction of points contained in mpts and bpts. This code could potentially be much faster than the standard max_halfplane method.

Parameters: r – Number of regions asymptopically to consider. mpts – List of WPoints bpts – List of WPoints disc_f – Discrepancy function to maximize over. Tuple of the found halfplane and the maximum disc_f value
pyscan.max_halfspace(net, mpts, bpts, disc_f)

Scans the set of halfspaces defined by the points in the net and maximizes the function disc_f with respect to the fraction of points contained in mpts and bpts. Function takes $$O(n^2s\log n)$$ where n is the net size and s is the size of the mpts and bts.

Parameters: net – List of Point3s mpts – List of WPoint3s bpts – List of WPoint3s disc_f – Discrepancy function to maximize over. Tuple of the found halfspace and the maximum disc_f value
pyscan.max_disk(net, mpts, bpts, disc_f)

Scans the set of disks defined by the points in the net and maximizes the function disc_f with respect to the fraction of points contained in mpts and bpts. Function takes $$O(n^2s\log n)$$ where n is the net size and s is the size of the mpts and bts.

Parameters: net – List of Points mpts – List of WPoints bpts – List of WPoints disc_f – Discrepancy function to maximize over. Tuple of the found disk and the maximum disc_f value
pyscan.max_disk_scale(net, mpts, bpts, min_res, disc_f)

Scans the set of disks defined by the points in the net, between the min_res and 2 * min_res and maximizes the function disc_f with respect to the fraction of points contained in mpts and bpts. Internally we use the disk radius restriction to ignore far away points and prune out many potential disks. This can significantly decrease the amount of time the method takes to find the best region. In worst case this method will still take $$O(n^2s\log n)$$ where n is the net size and s is the size of the mpts and bts, but usually runtime will be much less.

Parameters: net – List of Points mpts – List of WPoints bpts – List of WPoints min_res – A float that corresponds to minimum radius to consider. disc_f – Discrepancy function to maximize over. Tuple of the found disk and the maximum disc_f value
pyscan.max_subgrid(grid, disc_f)

Finds the maximum subgrid exactly by enumerating all the subgrids and computing disc_f on each subgrid. Takes $$O(n^4)$$ where the grid is $$n \times n$$.

Parameters: grid – A Grid object disc_f – Discrepancy function to maximize over. The maximum subgrid found.
pyscan.max_subgrid_linear(grid, x, y)

Finds the maximum subgrid with respect to the linear function $$f(m, b) = xm + yb$$ exactly by using the Kadane algorithm. Takes $$O(n^3)$$ where the grid is $$n \times n$$.

Parameters: grid – A Grid object x – Double defines linear function. y – Double defines linear function. The maximum subgrid found.
pyscan.max_subgrid_linear_theory(grid, r, x, y)

Finds an approximate maximum subgrid with respect to the linear function $$f(m, b) = xm + yb$$ using the algorithm from [MP18a]. Takes $$O(n^2 + r^2 \log r)$$ where the grid is $$n \times n$$. :param grid: A Grid object :param r: Determines how frequently to approximate the subgrids. See [MP18a] for details. :param x: Double defines linear function. :param y: Double defines linear function. :rtype: The maximum subgrid found.

pyscan.max_subgrid_convex(grid, eps, disc_f)

Approximately finds the maximum subgrid of the grid with error eps by successively computing the maximum subgrid with respect to various linear function using the Kadane algorithm. Takes $$O(\frac{1}{\sqrt{\varepsilon}} n^3)$$ where the grid is $$n \times n$$.

Parameters: grid – A Grid object eps – The absolute error to incur. disc_f – Discrepancy function to maximize over. The maximum subgrid found.
pyscan.max_subgrid_convex_theory(grid, eps, disc_f)

Approximately finds the maximum subgrid of the grid with error eps by successively computing the maximum subgrid with respect to various linear function using the algorithm from [MP18a]. Takes $$O(\frac{1}{\sqrt{\varepsilon}} n^2 \log n)$$ where the grid is $$n \times n$$. This method is depreciated and will be replaced shortly.

Parameters: grid – A Grid object eps – The absolute error to incur. disc_f – Discrepancy function to maximize over. The maximum subgrid found.
pyscan.max_rectangle(mpts, bpts, eps, x, y)
pyscan.max_rectangle_heap(mpts, bpts, x, y)

Implements the algorithm from [APV06][DGM95] to scan the data set in $$s^2 \log s$$ time. We use a Treap based rotation scheme to keep the tree balanced.

Parameters: mpts – List of WPoints. bpts – List of WPoints. x – Linear parameter. y – Linear parameter. The maximum rectangle found.
pyscan.ham_tree_sample(pts, size)

Takes the list of pts and computes a ham tree sample [MP18b]. This method can shrink the size of the sample significantly while preserving the error with respect to halfplanes. This serves as a useful preprocessing step for speeding up halfplane scanning. In theory sample sizes go from $$O(\frac{1}{\varepsilon^2}$$ to $$O(\frac{1}{\varepsilon^{1.53..}}\log^{.766}\frac{1}{\varepsilon})$$ where $$\varepsilon$$ is the absolute fraction of misplaced points in a halfplane.

Parameters: pts – List of WPoints List of WPoints

Labeled Scanning¶

pyscan.max_halfplane_labeled(net, lmpts, lbpts, disc_f)

Scans the set of halfplanes defined by the points in the net and maximizes the function disc_f with respect to the labeled sets of points lmpts and lbpts. If two points with the same label are in the same region then the two points only contribute one of their weights to the region. This algorithm runs in time $$O(ns \log n)$$ where n is the net size and s is the size of the lmpts and lbpts.

Parameters: net – List of Points mpts – List of LPoints bpts – List of LPoints disc_f – Discrepancy function to maximize over. Tuple of the found halfplane and the maximum disc_f value
pyscan.max_halfspace_labeled(net, lmpts, lbpts, disc_f)

Scans the set of halfspaces defined by the points in the net and maximizes the function disc_f with respect to the labeled sets of points lmpts and lbpts. If two points with the same label are in the same region then the two points only contribute one of their weights to the region. This algorithm runs in time $$O(n^2s \log n)$$ where n is the net size and s is the size of the lmpts and lbpts.

Parameters: net – List of Point3s mpts – List of LPoint3s bpts – List of LPoint3s disc_f – Discrepancy function to maximize over. Tuple of the found halfspace and the maximum disc_f value
pyscan.max_disk_labeled(net, lmpts, lbpts, disc_f)

Scans the set of disks defined by the points in the net and maximizes the function disc_f with respect to the fraction of points contained in lmpts and lbpts. Points with identical labels are not double counted if they are contained in the same region. In worst case this method will still take $$O(n^2s\log n)$$ where n is the net size and s is the size of the lmpts and lbts.

Parameters: net – List of Points lmpts – List of LPoints lbpts – List of LPoints disc_f – Discrepancy function to maximize over. Tuple of the found disk and the maximum disc_f value
pyscan.max_disk_scale_labeled(net, lmpts, lbpts, compress, min_res, disc_f)

Scans the set of disks defined by the points in the net and maximizes the function disc_f with respect to the fraction of points contained in lmpts and lbpts. Only considers disks with radii between min_res and 2 * min_res. This method internally maps the disk scanning problem to halfplane scanning. Using this mapping we can compress sets of points with the same labels. This compression step can significantly improve the runtime and doesn’t incurr any extra error. In worst case this method will still take $$O(n^2s\log n)$$ where n is the net size and s is the size of the lmpts and lbts, but in practice this method should be much faster than this.

Parameters: net – List of Points lmpts – List of LPoints lbpts – List of LPoints compress – Turns the compression step on or off. min_res – Defines the radius range to consider. disc_f – Discrepancy function to maximize over. Tuple of the found disk and the maximum disc_f value
pyscan.max_disk_scale_labeled_alt(net, lmpts, lbpts, min_res, disc_f)

Scans the set of disks defined by the points in the net and maximizes the function disc_f with respect to the fraction of points contained in lmpts and lbpts. Only considers disks with radii between min_res and 2 * min_res. This method internally directly scans disks and lacks the compression step found in the previous algorihtm, so it is slighlty slower in practice then the previous method with compression turned off. In worst case this method will still take $$O(n^2s\log n)$$ where n is the net size and s is the size of the lmpts and lbts, but in practice this method should be much faster than this.

Parameters: net – List of Points lmpts – List of LPoints lbpts – List of LPoints min_res – Defines the radius range to consider. disc_f – Discrepancy function to maximize over. Tuple of the found disk and the maximum disc_f value py::def(“max_rect_labeled”, &pyscan::max_rect_labeled); py::def(“max_rect_labeled_scale”, pyscan::max_rect_labeled_scale);
pyscan.max_rect_labeled(r, max_width, lmpts, lbpts, disc_f)

Scans an approximate set of rectangles smaller than max_width in lmpts and lbpts by using a grid where each row and column of the grid can have at most $$1/r$$ fraction of the total weight.

Parameters: r – integer max_width – double lmpts – List of LPoints lbpts – List of LPoints disc_f – Discrepancy function to maximize over. Tuple of the found rectangle and the maximum disc_f value
pyscan.max_rect_labeled_scale(r, max_width, alpha, lmpts, lbpts, disc_f)

Creates a regular sparse grid of size $$1/alpha \times 1/alpha$$ and then considers subgrids of max_width size containing a large enough fraction of points to be worth scanning. On these small subgrids this method constructs a new grid such that each row and column of this grid can have at most $$1/r$$ fraction of the total weight and then scans ever subgrid of this new grid. This method is usually faster than the previous method, but introduces some spatial error $$\alpha$$.

Parameters: r – integer alpha – double max_width – double lmpts – List of LPoints lbpts – List of LPoints disc_f – Discrepancy function to maximize over. Tuple of the found rectangle and the maximum disc_f value

Kernelized Scanning¶

These methods use different scan statistics than the purely combinatorial versions.

pyscan.gaussian_kernel(bandwidth)

Constructs a gaussian kernel to be used with the smooth scanning function of form $$1/\sqrt{2 \pi} \exp(-u^2/(2 b^2))$$ where b is bandwidth.

Parameters: bandwidth – double (kernel bandwidth)
pyscan.Bernouli_K(kernel)

Constructs a discrepancy object that can be used to scan in the smooth scanning functions. The kernel parameter defines how distance is measured and how the bandwidth is determined.

Parameters: kernel – The kernel used to determine distance.
pyscan.max_annuli(net, mpts, bpts, radii, disc_f)

Scans the set of annuli with radii defined by radii and the points in the net. For each annuli we maximizes the function disc_f with respect to the mpts and bpts.

Parameters: net – List of Points mpts – List of WPoints bpts – List of WPoints min_res – A float that corresponds to minimum radius to consider. disc_f – KDisc object to maximize over. Tuple of a disk of radius equal to radii[-1] centered at the annuli and the maximum disc_f value
pyscan.max_annuli_scale(net, mpts, bpts, radii, disc_f)

Scans the set of annuli with radii defined by radii and the points in the net. For each annuli we maximizes the function disc_f with respect to the mpts and bpts. This version ignores far away points, so in theory can be much faster than the previous version.

Parameters: net – List of Points mpts – List of WPoints bpts – List of WPoints min_res – A float that corresponds to the radius to consider. disc_f – KDisc object to maximize over. Tuple of a disk of radius equal to radii[-1] centered at the annuli and the maximum disc_f value
pyscan.max_annuli_scale_multi(net, mpts, bpts, res_scales, max_radii, disc_f)

Scans the set of annuli with radii less than max_radii. The algorithm uses the res_scales to define levels in the annuli for approximating the kernel values. For each annuli we maximizes the function disc_f with respect to the mpts and bpts. This version ignores far away points and uses a more efficient method for scanning all the annuli. In theory the runtime of this method should be as good or even better than the previous algorithm, but in practice is probably slightly slower.

Parameters: net – List of Points mpts – List of WPoints bpts – List of WPoints res_scales – A list of floats that correspond to the relative sizes of the different annuli. max_radii – A float that corresponds to the maximum radius to consider. This will relate to the bandwidth of the kernel used. disc_f – KDisc object to maximize over. Tuple of a disk of radius equal to radii[-1] centered at the annuli and the maximum disc_f value