diff --git a/_posts/2014-08-20-copyvio-detector.md b/_posts/2014-08-20-copyvio-detector.md new file mode 100644 index 0000000..bc9f9e8 --- /dev/null +++ b/_posts/2014-08-20-copyvio-detector.md @@ -0,0 +1,61 @@ +--- +layout: post +title: Copyvio Detector +description: A technical writeup of some recent developments. +--- + +This is an in-progress technical writeup of some recent developments involving +the [copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for +Wikipedia articles that I maintain, located at +[tools.wmflabs.org/copyvios](//tools.wmflabs.org/copyvios). Its source code is +available on [GitHub](//github.com/earwig/copyvios). + +## Determining violation confidence + +One of the most important aspects of the detector is not fetching and parsing +potential sources, but figuring out the likelihood that a given article is a +violation of a given source. We call this number, a value between 0 and 1, the +"confidence" of a violation. Values between 0 and 0.5 are considered to +indicate no violation (green background in results page), between 0.5 and 0.75 +a "possible" violation (yellow background), and between 0.75 and 1 a +"suspected" violation (red background). + +To calculate the confidence of a violation, the copyvio detector uses the +maximum value of two functions, one of which accounts for the size of the delta +chain (\\(\Delta\\)) in relation to the article chain +(\\(A\\)), and the other of which accounts for just the size of +\\(\Delta\\). This ensures a high confidence value when both +chains are small, but not when \\(A\\) is significantly larger +than \\(\Delta\\). + +The article–delta confidence function is simply +\\(\frac{\Delta}{A}\\). Therefore, we have complete confidence of +a violation (\\(C(A, \Delta)=1\\)) when the article and suspected +source share all of their trigrams, half confidence +(\\(C(A, \Delta)=0.5\\)) when the source shares half of the +article's trigrams, and so on. + +The delta confidence function, \\(C_{\Delta}\\), is more +complicated because it must determine a confidence value without having +anything to compare \\(\Delta\\) to. A number of confidence values +were derived experimentally, and the function was extrapolated from there such +that \\(\lim_{Δ \to +\infty}C\_{\Delta}(\Delta) = 1\\). The +reference points were \\(\\{(0, 0), (100, 0.5), (250, 0.75), (500, 0.9), +(1000, 0.95)\\}\\). The function is defined as follows: + +
$$C_{\Delta}(\Delta)=\begin{cases} \frac{\Delta}{\Delta+100} & \Delta\leq +100 \\[0.5em] \frac{\Delta-25}{\Delta+50} & 100\lt \Delta\leq 250\; \\[0.5em] +\frac{10.5\Delta-750}{10\Delta} & 250\lt \Delta\leq 500\; \\[0.5em] +\frac{\Delta-50}{\Delta} & \Delta\gt500 \end{cases}$$
+ +A graph can be viewed [here](/static/Delta Confidence Function.pdf), with the +background colored red, yellow, and green when a violation is considered +suspected, possible, or not present, respectively. + +Now that we have these two definitions, we can define the primary confidence +function, \\(C\\), as follows: + +
$$C(A, \Delta) = \max(\tfrac{\Delta}{A}, C_{\Delta}(\Delta))$$
+ +By feeding \\(A\\) and \\(\Delta\\) into +\\(C\\), we get our final confidence value. diff --git a/static/Delta Confidence Function.pdf b/static/Delta Confidence Function.pdf new file mode 100644 index 0000000..c4150ff Binary files /dev/null and b/static/Delta Confidence Function.pdf differ