Below I create a code example where I plant a region containing 5% of the points and make this region anomalous. Inside of the region 80% of the points are from the red set while outside of the region 50% of the points are red. We would like to recover this planted region by using the max_halfplane scanning algorithm.
In [7]:
import pyscan
import matplotlib.pyplot as plt
import numpy as np
import random
def get_coord(i, lst):
return [pt[i] for pt in lst]
# l = (a, b, c) where ax + by + c = 0 => (-ax -c) / b = y
def f(mr, x):
l = mr.get_coords()
return -(x * l[0] + l[2]) / l[1]
def plot_plane(red, blue, max_region, eps):
n = np.rint(1 / eps)
s = np.rint(1 / (2 * eps * eps))
# create a region containing 5% of the points. Inside of this region points are more likely to be red.
#print(get_coord(1, red))
_, ax = plt.subplots(figsize=(16, 12))
ax.scatter(get_coord(0, red), get_coord(1, red), marker=".", c="r")
ax.scatter(get_coord(0, blue), get_coord(1, blue), marker=".", c="b")
net = pyscan.my_sample(red, n) + pyscan.my_sample(blue, n)
red_s = pyscan.my_sample(red, s)
blue_s = pyscan.my_sample(blue, s)
approx_region, _ = pyscan.max_halfplane(net, red_s, blue_s, pyscan.KULLDORF)
ax.scatter(get_coord(0, net), get_coord(1, net), marker="x", c="k")
ax.plot([0, 1], [f(max_region, 0), f(max_region, 1)], c="k", linewidth=4.0)
ax.plot([0, 1], [f(approx_region, 0), f(approx_region, 1)], c="g", linewidth=4.0)
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.axis('off')
plt.show()
We can plant a region that roughly cuts the point set in half and set two different halfs to be radically different. One side will have 90% percent red and the the other side 90% blue. The cutting region is in black.
In [8]:
pts = [pyscan.WPoint(1.0, random.random(), random.random(), 1.0) for i in range(400)]
red, blue, max_region = pyscan.plant_halfplane(pts, .5, .1, .9)
_, ax = plt.subplots(figsize=(16, 12))
ax.scatter(get_coord(0, red), get_coord(1, red), marker=".", c="r")
ax.scatter(get_coord(0, blue), get_coord(1, blue), marker=".", c="b")
ax.plot([0, 1], [f(max_region, 0), f(max_region, 1)], c="k", linewidth=4.0)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
plt.axis('off')
plt.show()
As we use a larger sample size we get better and better results. The algorithm will subsample the data based on an error parameter called \(\varepsilon\). There will be two samples uses in the algorithm one of size \(O(1/\varepsilon)\) and another of size \(O(1/\varepsilon^2)\). If we do not set \(\varepsilon\) to be small enough we will not recover the region. Below I steadily decrease \(\varepsilon\) and therefore increase the sample sizes. The small x markers define the regions that we scan. At first we will not recover the anomalous region.
In [11]:
plot_plane(red, blue, max_region, .2)
plot_plane(red, blue, max_region, .1)
In the above example we found the halfplane rather easily, but we can devise much more complicated situations where the region is much harder to recover.
In [14]:
# generate some random points
pts = [pyscan.WPoint(1.0, random.random(), random.random(), 1.0) for i in range(10000)]
red, blue, max_region = pyscan.plant_halfplane(pts, .05, .5, .8)
_, ax = plt.subplots(figsize=(18, 12))
ax.scatter(get_coord(0, red), get_coord(1, red), marker=".", c="r")
ax.scatter(get_coord(0, blue), get_coord(1, blue), marker=".", c="b")
ax.plot([0, 1], [f(max_region, 0), f(max_region, 1)], c="k", linewidth=4.0)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
plt.axis('off')
plt.show()
Now we are scanning a point set with 10000 points and the region is much smaller and less anomalous.
In [15]:
plot_plane(red, blue, max_region, .2)
plot_plane(red, blue, max_region, .1)
In [18]:
plot_plane(red, blue, max_region, .05)
Around .025 we get start getting very close to recovering the region and at .01 we almost perfectly recover it.
In [19]:
plot_plane(red, blue, max_region, .025)
plot_plane(red, blue, max_region, .01)
We can plant a different where 60% of the points are from the anomalous red set and leave the baseline set at 50/50. This makes the region much harder to recover.
In [22]:
red, blue, max_region = pyscan.plant_halfplane(pts, .05, .5, .6)
_, ax = plt.subplots(figsize=(16, 12))
plt.scatter(get_coord(0, red), get_coord(1, red), marker=".", c="r")
plt.scatter(get_coord(0, blue), get_coord(1, blue), marker=".", c="b")
plt.plot([0, 1], [f(max_region, 0), f(max_region, 1)], c="k", linewidth=4.0)
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.axis("off")
plt.show()
We now fail to recover the region with the earlier parameters that were able to recover the region.
In [28]:
plot_plane(red, blue, max_region, .025)
Increasing the error paramter even more allows us to again recover the region.
In [27]:
plot_plane(red, blue, max_region, .005)