diff --git a/_posts/2014-08-20-copyvio-detector.md b/_posts/2014-08-20-copyvio-detector.md
index a98a5c9..0983f36 100644
--- a/_posts/2014-08-20-copyvio-detector.md
+++ b/_posts/2014-08-20-copyvio-detector.md
@@ -4,21 +4,50 @@ title: Copyvio Detector
 description: A technical writeup of some recent developments.
 ---
 
-This is an in-progress technical writeup of some recent developments involving
-the [copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for
+This is a technical writeup of some recent developments involving the
+[copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for
 Wikipedia articles that I maintain, located at
 [tools.wmflabs.org/copyvios](//tools.wmflabs.org/copyvios). Its source code is
 available on [GitHub](//github.com/earwig/copyvios).
 
+## Dealing with sources
+
+The central component of the detector is finding and parsing potential sources
+of copyright violations. These sources are obtained through two methods:
+investigating external links found in the article, and searching for article
+content elsewhere on the web using a search engine
+([Yahoo! BOSS](//developer.yahoo.com/boss/search/), paid for by the Wikimedia
+Foundation).
+
+To use the search engine, we must first break the article text up into plain
+text search queries, or "chunks". This involves some help from
+[mwparserfromhell](//github.com/earwig/mwparserfromhell), which strips
+non-text wikicode from the article, and the [Python Natural Language
+Toolkit](http://www.nltk.org/), which then splits the text into sentences, of
+which we select a few medium-sized ones to search for. mwparserfromhell is
+also used to extract the external links.
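The chunking step can be sketched roughly as follows. This is a minimal illustration, not the detector's actual code: the word-count bounds, query limit, and the regex sentence splitter (standing in for NLTK's `sent_tokenize` so the sketch has no external dependencies) are all assumptions.

```python
import re

def chunk_article(text, min_words=8, max_words=25, max_queries=5):
    """Split plain article text into sentences and keep a few
    medium-sized ones to use as search-engine queries."""
    # Crude sentence splitter standing in for nltk.tokenize.sent_tokenize.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Very short sentences make poor queries; very long ones may be
    # truncated by the search engine, so keep only medium-sized ones.
    medium = [s for s in sentences
              if min_words <= len(s.split()) <= max_words]
    # Sample evenly across the article rather than taking the first few,
    # so the queries cover the whole text.
    step = max(1, len(medium) // max_queries)
    return medium[::step][:max_queries]
```

The even sampling matters: violations are often confined to one section, so queries drawn only from the lead would miss them.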
+
+Sources are fetched and then parsed differently depending on the document type
+(HTML is handled by
+[beautifulsoup](http://www.crummy.com/software/BeautifulSoup/), PDFs by
+[pdfminer](http://www.unixuser.org/~euske/python/pdfminer/)), and normalized
+to a plain text form. We then create multiple
+[Markov chains](https://en.wikipedia.org/wiki/Markov_chain): the *article
+chain* is built from word trigrams in the article text, and a *source chain*
+is built from each source text. A *delta chain* is created for each source
+chain, representing its intersection with the article chain, by examining
+which nodes are shared.
+
+But how do we use these chains to decide whether a violation is present?
+
 ## Determining violation confidence
 
-One of the most important aspects of the detector is not fetching and parsing
-potential sources, but figuring out the likelihood that a given article is a
-violation of a given source. We call this number, a value between 0 and 1, the
-"confidence" of a violation. Values between 0 and 0.4 indicate no violation
-(green background in results page), between 0.4 and 0.75 a "possible" violation
-(yellow background), and between 0.75 and 1 a "suspected" violation (red
-background).
+One of the most nuanced aspects of the detector is figuring out the likelihood
+that a given article is a violation of a given source. We call this number, a
+value between 0 and 1, the "confidence" of a violation. Values between 0 and
+0.4 indicate no violation (green background on the results page), between 0.4
+and 0.75 a "possible" violation (yellow background), and between 0.75 and 1 a
+"suspected" violation (red background).
 
 To calculate the confidence of a violation, the copyvio detector uses the
 maximum value of two functions, one of which accounts for the size of the
 delta
@@ -36,7 +65,7 @@ which point confidence increases at a decreasing rate, with
 \\(\lim_{\frac{\Delta}{A} \to 1}C\_{A\Delta}(A, \Delta)=1\\) holding true.
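Stepping back, the chain-building step described above can be sketched as follows. This is an illustrative simplification, not the detector's actual implementation: it treats each chain as a bag of trigram counts, and the delta chain keeps only the nodes shared by the article chain and a source chain.

```python
from collections import defaultdict

def build_chain(text):
    """Build a simple Markov-style chain whose nodes are word trigrams,
    mapped to the number of times each trigram occurs in the text."""
    words = text.lower().split()
    chain = defaultdict(int)
    for i in range(len(words) - 2):
        chain[tuple(words[i:i + 3])] += 1
    return chain

def build_delta_chain(article_chain, source_chain):
    """Intersect the article chain with a source chain: keep only the
    shared nodes, counting overlapping occurrences."""
    return {node: min(count, source_chain[node])
            for node, count in article_chain.items()
            if node in source_chain}
```

The size of the delta chain relative to the article chain is then what the confidence functions below operate on: a delta nearly as large as the article chain means most of the article's trigrams also appear in the source.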
 The exact coefficients used are shown below:
 
-