---
layout: post
title: Copyvio Detector
tags: Wikipedia
description: A technical writeup of some recent developments
---

This is a technical writeup of some recent developments involving the [copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for Wikipedia articles that I maintain, located at [tools.wmflabs.org/copyvios](//tools.wmflabs.org/copyvios). Its source code is available on [GitHub](//github.com/earwig/copyvios).

## Dealing with sources

Of course, the central component of the detector is finding and parsing potential sources of copyright violations. These sources are obtained through two methods: investigating external links found in the article, and searching for article content elsewhere on the web using a search engine ([Yahoo! BOSS](//developer.yahoo.com/boss/search/), paid for by the Wikimedia Foundation).

To use the search engine, we must first break the article text up into plain text search queries, or "chunks". This involves some help from [mwparserfromhell](//github.com/earwig/mwparserfromhell), which strips non-text wikicode from the article, and the [Python Natural Language Toolkit](http://www.nltk.org/), which then splits the remaining text into sentences. From these we select a few medium-sized sentences to search for. mwparserfromhell is also used to extract the external links.

Sources are fetched and then parsed according to document type (HTML is handled by [beautifulsoup](http://www.crummy.com/software/BeautifulSoup/), PDFs by [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/)) and normalized to plain text. We then create multiple [Markov chains](https://en.wikipedia.org/wiki/Markov_chain): the *article chain* is built from word trigrams in the article text, and a *source chain* is built from each source text. A *delta chain* is created for each source chain, representing its intersection with the article chain, found by examining which nodes are shared.

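
To make the chain construction concrete, here is a minimal Python sketch. It is a simplification under stated assumptions, not the detector's actual code: text is tokenized by lowercasing and splitting on whitespace, each chain is stored as a multiset of word trigrams, and the delta chain keeps each shared trigram with the smaller of its two counts. The names `build_chain`, `delta_chain`, and `chain_size` are hypothetical.

```python
from collections import Counter


def build_chain(text, degree=3):
    """Count every run of `degree` consecutive words in `text`.

    Assumption: whitespace tokenization; a real tokenizer would also
    strip punctuation and markup.
    """
    words = text.lower().split()
    return Counter(
        tuple(words[i:i + degree]) for i in range(len(words) - degree + 1)
    )


def delta_chain(article_chain, source_chain):
    """Keep only the n-grams present in both chains (minimum of the counts)."""
    return article_chain & source_chain


def chain_size(chain):
    """Total number of n-gram occurrences stored in the chain."""
    return sum(chain.values())


article = build_chain("the quick brown fox jumps over the lazy dog")
source = build_chain("a quick brown fox jumps over a sleeping cat")
delta = delta_chain(article, source)

print(chain_size(article), chain_size(source), chain_size(delta))
```

The sizes of the article chain and the delta chain are what matter for the next step, so the sketch exposes them through a single helper.
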
But how do we use these chains to decide whether a violation is present?

## Determining violation confidence

One of the most nuanced aspects of the detector is figuring out the likelihood that a given article is a violation of a given source. We call this number, a value between 0 and 1, the "confidence" of a violation. Values between 0 and 0.4 indicate no violation (green background in the results page), between 0.4 and 0.75 a "possible" violation (yellow background), and between 0.75 and 1 a "suspected" violation (red background).

To calculate the confidence of a violation, the copyvio detector uses the maximum value of two functions, one of which accounts for the size of the delta chain (\\(\Delta\\)) in relation to the article chain (\\(A\\)), and the other of which accounts for just the size of \\(\Delta\\). This ensures a high confidence value when both chains are small, but not when \\(A\\) is significantly larger than \\(\Delta\\).

The article–delta confidence function, \\(C_{A\Delta}\\), is piecewise-defined such that confidence increases at an exponential rate as \\(\frac{\Delta}{A}\\) increases, until the value of \\(C_{A\Delta}\\) reaches the "suspected" violation threshold, at which point confidence increases at a decreasing rate, with \\(\lim_{\frac{\Delta}{A} \to 1}C\_{A\Delta}(A, \Delta)=1\\) holding true. The exact coefficients used are shown below: