diff --git a/_posts/2014-08-20-copyvio-detector.md b/_posts/2014-08-20-copyvio-detector.md
new file mode 100644
index 0000000..bc9f9e8
--- /dev/null
+++ b/_posts/2014-08-20-copyvio-detector.md
@@ -0,0 +1,61 @@
+---
+layout: post
+title: Copyvio Detector
+description: A technical writeup of some recent developments.
+---
+
+This is an in-progress technical writeup of some recent developments involving
+the [copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for
+Wikipedia articles that I maintain, located at
+[tools.wmflabs.org/copyvios](//tools.wmflabs.org/copyvios). Its source code is
+available on [GitHub](//github.com/earwig/copyvios).
+
+## Determining violation confidence
+
+One of the most important aspects of the detector is not fetching and parsing
+potential sources, but figuring out the likelihood that a given article is a
+violation of a given source. We call this number, a value between 0 and 1, the
+"confidence" of a violation. Values between 0 and 0.5 are considered to
+indicate no violation (green background in results page), between 0.5 and 0.75
+a "possible" violation (yellow background), and between 0.75 and 1 a
+"suspected" violation (red background).
+
+To calculate the confidence of a violation, the copyvio detector uses the
+maximum value of two functions, one of which accounts for the size of the delta
+chain (\\(\Delta\\)) in relation to the article chain
+(\\(A\\)), and the other of which accounts for just the size of
+\\(\Delta\\). This ensures a high confidence value when both
+chains are small, but not when \\(A\\) is significantly larger
+than \\(\Delta\\).
+
+The article–delta confidence function is simply
+\\(\frac{\Delta}{A}\\). Therefore, we have complete confidence of
+a violation (\\(C(A, \Delta)=1\\)) when the article and suspected
+source share all of their trigrams, half confidence
+(\\(C(A, \Delta)=0.5\\)) when the source shares half of the
+article's trigrams, and so on.
+
+The delta confidence function, \\(C_{\Delta}\\), is more
+complicated because it must determine a confidence value without having
+anything to compare \\(\Delta\\) to. A number of confidence values
+were derived experimentally, and the function was extrapolated from there such
+that \\(\lim_{\Delta \to +\infty}C\_{\Delta}(\Delta) = 1\\). The
+reference points were \\(\\{(0, 0), (100, 0.5), (250, 0.75), (500, 0.9),
+(1000, 0.95)\\}\\). The function is defined as follows:
+