diff --git a/_posts/2014-08-20-copyvio-detector.md b/_posts/2014-08-20-copyvio-detector.md
index 4eba400..a4e065b 100644
--- a/_posts/2014-08-20-copyvio-detector.md
+++ b/_posts/2014-08-20-copyvio-detector.md
@@ -15,10 +15,10 @@ available on [GitHub](//github.com/earwig/copyvios).
 One of the most important aspects of the detector is not fetching and parsing
 potential sources, but figuring out the likelihood that a given article is a
 violation of a given source. We call this number, a value between 0 and 1, the
-"confidence" of a violation. Values between 0 and 0.5 are considered to
-indicate no violation (green background in results page), between 0.5 and 0.75
-a "possible" violation (yellow background), and between 0.75 and 1 a
-"suspected" violation (red background).
+"confidence" of a violation. Values between 0 and 0.5 indicate no violation
+(green background in results page), between 0.5 and 0.75 a "possible" violation
+(yellow background), and between 0.75 and 1 a "suspected" violation (red
+background).
 
 To calculate the confidence of a violation, the copyvio detector uses the
 maximum value of two functions, one of which accounts for the size of the delta
@@ -28,19 +28,29 @@ chain (\\(\Delta\\)) in relation to the article chain
 chains are small, but not when \\(A\\) is significantly larger than
 \\(\Delta\\).
 
-The article–delta confidence function is simply
-\\(\frac{\Delta}{A}\\). Therefore, we have complete confidence of
-a violation (\\(C(A, \Delta)=1\\)) when the article and suspected
-source share all of their trigrams, half confidence
-(\\(C(A, \Delta)=0.5\\)) when the source shares half of the
-article's trigrams, and so on.
-
-The delta confidence function, \\(C_{\Delta}\\), is more
-complicated because it must determine a confidence value without having
-anything to compare \\(\Delta\\) to. A number of confidence values
-were derived experimentally, and the function was extrapolated from there such
-that \\(\lim_{Δ \to +\infty}C\_{\Delta}(\Delta) = 1\\). The
-reference points were \\(\\{(0, 0), (100, 0.5), (250, 0.75), (500, 0.9),
+The article–delta confidence function, \\(C_{A\Delta}\\), is
+piecewise-defined such that confidence increases at an exponential rate as
+\\(\frac{\Delta}{A}\\) increases, until the value of
+\\(C_{A\Delta}\\) reaches the "suspected" violation threshold, at
+which point confidence increases at a decreasing rate, with
+\\(\lim_{\frac{\Delta}{A} \to 1}C\_{A\Delta}(A, \Delta)=1\\)
+holding true. The exact coefficients used are shown below:
+
+$$C_{A\Delta}(A, \Delta)=\begin{cases} \ln\frac{1}{1-\frac{\Delta}{A}} &
+\frac{\Delta}{A} \le 0.52763 \\[0.5em]
+-0.8939(\frac{\Delta}{A})^2+1.8948\frac{\Delta}{A}-0.0009 &
+\frac{\Delta}{A} \gt 0.52763 \end{cases}$$
+
+A graph can be viewed [here](/static/article-delta_confidence_function.pdf),
+with the x-axis indicating \\(\frac{\Delta}{A}\\) and the y-axis
+indicating confidence. The background is colored red, yellow, and green when a
+violation is considered suspected, possible, or not present, respectively.
+
+The delta confidence function, \\(C_{\Delta}\\), is also
+piecewise-defined. A number of confidence values were derived experimentally,
+and the function was extrapolated from there such that
+\\(\lim_{Δ \to +\infty}C\_{\Delta}(\Delta)=1\\). The reference
+points were \\(\\{(0, 0), (100, 0.5), (250, 0.75), (500, 0.9),
 (1000, 0.95)\\}\\). The function is defined as follows:
 
 $$C_{\Delta}(\Delta)=\begin{cases} \frac{\Delta}{\Delta+100} & \Delta\leq
@@ -49,13 +59,13 @@ reference points were \\(\\{(0, 0), (100, 0.5), (250, 0.75), (500, 0.9),
 \frac{\Delta-50}{\Delta} & \Delta\gt500 \end{cases}$$
 
 A graph can be viewed [here](/static/delta_confidence_function.pdf), with the
-background colored red, yellow, and green when a violation is considered
-suspected, possible, or not present, respectively.
+x-axis indicating \\(\Delta\\). The background coloring is the
+same as before.
 
 Now that we have these two definitions, we can define the primary confidence
 function, \\(C\\), as follows:
 
-$$C(A, \Delta) = \max(\tfrac{\Delta}{A}, C_{\Delta}(\Delta))$$
+$$C(A, \Delta) = \max(C_{A\Delta}(A, \Delta), C_{\Delta}(\Delta))$$
 
 By feeding \\(A\\) and \\(\Delta\\) into \\(C\\), we get our final confidence
 value.
diff --git a/static/article-delta_confidence_function.pdf b/static/article-delta_confidence_function.pdf
new file mode 100644
index 0000000..1a16d1b
Binary files /dev/null and b/static/article-delta_confidence_function.pdf differ
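
Review note: to sanity-check the new piecewise definition, here is a minimal Python sketch of \\(C_{A\Delta}\\), the color thresholds, and the final \\(\max\\) step. It is an illustration under stated assumptions, not the detector's actual code. The function names are hypothetical. The empty-article guard, the clamping of the ratio to 1, and the choice to assign the 0.5 and 0.75 boundaries to the higher band are assumptions that the post does not specify. Because the diff elides the middle cases of \\(C_{\Delta}\\), its value is passed in as a plain number rather than reimplemented.

```python
import math

# Breakpoint and coefficients copied from the C_{A\Delta} definition in the diff.
BREAKPOINT = 0.52763


def article_delta_confidence(article_size, delta_size):
    """Sketch of C_{A Delta}(A, Delta) using the ratio Delta / A."""
    if article_size <= 0:
        return 0.0  # assumption: an empty article chain gives zero confidence
    # Assumption: the delta chain never exceeds the article chain, so clamp to 1.
    ratio = min(delta_size / article_size, 1.0)
    if ratio <= BREAKPOINT:
        # First case: ln(1 / (1 - ratio)); reaches about 0.75 at the breakpoint.
        return math.log(1 / (1 - ratio))
    # Second case: quadratic that evaluates to 1 when ratio == 1.
    return -0.8939 * ratio ** 2 + 1.8948 * ratio - 0.0009


def classify(confidence):
    """Map a confidence value to the labels used on the results page."""
    # Assumption: boundary values 0.5 and 0.75 fall into the higher band.
    if confidence >= 0.75:
        return "suspected"
    if confidence >= 0.5:
        return "possible"
    return "none"


def combined_confidence(article_size, delta_size, delta_confidence_value):
    """C(A, Delta): the larger of C_{A Delta} and an already-computed C_Delta."""
    return max(article_delta_confidence(article_size, delta_size),
               delta_confidence_value)


if __name__ == "__main__":
    # Delta = 600 falls in the visible Delta > 500 case: (Delta - 50) / Delta.
    c_delta = (600 - 50) / 600
    c = combined_confidence(1000, 600, c_delta)
    print(round(c, 4), classify(c))
```

Both cases of \\(C_{A\Delta}\\) evaluate to roughly 0.75 at the breakpoint, which matches the description that the function switches form at the "suspected" threshold.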