From e87d3ada6ac724ae03567fd348b780ec1f4a8206 Mon Sep 17 00:00:00 2001
From: Ben Kurtovic
Date: Fri, 26 Sep 2014 20:04:14 -0500
Subject: [PATCH] Finish copyvio detector post.

---
 _posts/2014-08-20-copyvio-detector.md | 75 ++++++++++++++++++++++++++++++-----
 1 file changed, 65 insertions(+), 10 deletions(-)

diff --git a/_posts/2014-08-20-copyvio-detector.md b/_posts/2014-08-20-copyvio-detector.md
index a98a5c9..0983f36 100644
--- a/_posts/2014-08-20-copyvio-detector.md
+++ b/_posts/2014-08-20-copyvio-detector.md
@@ -4,21 +4,50 @@ title: Copyvio Detector
 description: A technical writeup of some recent developments.
 ---
 
-This is an in-progress technical writeup of some recent developments involving
-the [copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for
+This is a technical writeup of some recent developments involving the
+[copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for
 Wikipedia articles that I maintain, located at
 [tools.wmflabs.org/copyvios](//tools.wmflabs.org/copyvios). Its source code is
 available on [GitHub](//github.com/earwig/copyvios).
 
+## Dealing with sources
+
+Of course, the central component of the detector is finding and parsing
+potential sources of copyright violations. These sources are obtained through
+two methods: investigating external links found in the article, and searching
+for article content elsewhere on the web using a search engine
+([Yahoo! BOSS](//developer.yahoo.com/boss/search/), paid for by the Wikimedia
+Foundation).
+
+To use the search engine, we must first break the article text up into plain
+text search queries, or "chunks". This involves some help from
+[mwparserfromhell](//github.com/earwig/mwparserfromhell), which strips the
+non-text wikicode out of the article, and the [Python Natural Language
+Toolkit](http://www.nltk.org/), which then splits the remaining text into
+sentences; we select a few medium-sized ones to use as search queries.
+mwparserfromhell is also used to extract the article's external links.
+
+Sources are fetched and then parsed differently depending on their document
+type (HTML is handled by
+[beautifulsoup](http://www.crummy.com/software/BeautifulSoup/), while PDFs are
+handled by [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/)), and
+normalized to a plain text form. We then create multiple
+[Markov chains](https://en.wikipedia.org/wiki/Markov_chain) – the *article
+chain* is built from word trigrams in the article text, and a *source chain*
+is built from each source text. A *delta chain* is created for each source
+chain, representing its intersection with the article chain, found by
+examining which nodes are shared.
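+
+To make the chains concrete, here is a minimal Python sketch of the idea using
+plain trigram counts. The names and the notion of a chain's "size" are my own
+simplifications here; the detector's real chain implementation is more
+involved:
+
+```python
+# Toy illustration of the article, source, and delta chains.
+from collections import Counter
+
+def build_chain(text):
+    words = text.lower().split()
+    return Counter(zip(words, words[1:], words[2:]))  # word trigrams
+
+article_chain = build_chain("the quick brown fox jumps over the lazy dog")
+source_chain = build_chain("a quick brown fox jumps over a sleeping dog")
+
+# The delta chain keeps only the nodes that the two chains share.
+delta_chain = article_chain & source_chain
+
+# One simple notion of a chain's "size", used in the confidence math below.
+A, delta = sum(article_chain.values()), sum(delta_chain.values())
+print(A, delta)  # -> 7 3 for these toy sentences
+```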
+
+But how do we use these chains to decide whether a violation is present?
+
 ## Determining violation confidence
 
-One of the most important aspects of the detector is not fetching and parsing
-potential sources, but figuring out the likelihood that a given article is a
-violation of a given source. We call this number, a value between 0 and 1, the
-"confidence" of a violation. Values between 0 and 0.4 indicate no violation
-(green background in results page), between 0.4 and 0.75 a "possible" violation
-(yellow background), and between 0.75 and 1 a "suspected" violation (red
-background).
+One of the most nuanced aspects of the detector is figuring out the likelihood
+that a given article is a violation of a given source. We call this number, a
+value between 0 and 1, the "confidence" of a violation. Values between 0 and
+0.4 indicate no violation (green background on the results page), between 0.4
+and 0.75 a "possible" violation (yellow background), and between 0.75 and 1 a
+"suspected" violation (red background).
 
 To calculate the confidence of a violation, the copyvio detector uses the
 maximum value of two functions, one of which accounts for the size of the delta
@@ -36,7 +65,7 @@ which point confidence increases at a decreasing rate, with
 \\(\lim_{\frac{\Delta}{A} \to 1}C\_{A\Delta}(A, \Delta)=1\\) holding true. The
 exact coefficients used are shown below:
 
-$$C_{A\Delta}(A, \Delta)=\begin{cases} \ln\frac{1}{1-\frac{\Delta}{A}} &
\frac{\Delta}{A} \le 0.52763 \\[0.5em]
-0.8939(\frac{\Delta}{A})^2+1.8948\frac{\Delta}{A}-0.0009 & \frac{\Delta}{A}
\gt 0.52763 \end{cases}$$
+$$C_{A\Delta}(A, \Delta)=\begin{cases} -\ln(1-\frac{\Delta}{A}) &
\frac{\Delta}{A} \le 0.52763 \\[0.5em]
-0.8939(\frac{\Delta}{A})^2+1.8948\frac{\Delta}{A}-0.0009 & \frac{\Delta}{A}
\gt 0.52763 \end{cases}$$
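+
+For illustration, here is the same piecewise curve as a short Python function
+(a minimal sketch with my own names, not the detector's exact code):
+
+```python
+import math
+
+def delta_confidence(article_size, delta_size):
+    """Confidence based on the delta chain's size relative to the article's."""
+    if article_size == 0:
+        return 0.0  # guard against an empty article chain
+    ratio = float(delta_size) / article_size
+    if ratio <= 0.52763:
+        return -math.log(1 - ratio)
+    return -0.8939 * ratio ** 2 + 1.8948 * ratio - 0.0009
+
+# The detector's final confidence is the maximum of this and a second function.
+# With the toy A = 7 and delta = 3 from earlier:
+print(delta_confidence(7, 3))  # ~0.56, a "possible" violation
+```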
@@ -69,3 +98,29 @@ function, \\(C\\), as follows:
 
 By feeding \\(A\\) and \\(\Delta\\) into \\(C\\), we get our final confidence
 value.
+
+## Multithreaded worker model
+
+At a high level, the detector needs to handle a large number of simultaneous
+requests quickly, but without falling victim to denial-of-service attacks.
+Since the tool has to download many webpages in a short span of time, it is
+vulnerable to abuse if the same request is repeated many times without delay.
+Therefore, all requests made to the tool share the same set of persistent
+worker subprocesses, referred to as *global worker* mode. However, the
+underlying detection machinery in earwigbot also supports a *local worker*
+mode, which spawns individual workers for each copyvio check so that idle
+processes aren't kept running all the time.
+
+But how do these workers handle fetching URLs? The "safe" solution is to
+handle only one URL at a time per request, but this is too slow when
+twenty-five pages need to be checked in a few seconds – a single slow website
+will cause a huge delay. The detector's solution is to keep unprocessed URLs
+in site-specific queues, so that at any given point, only one worker is
+handling URLs for a particular domain. This way, no individual website is
+overloaded by simultaneous requests, but the copyvio check as a whole is
+completed quickly. (A simplified sketch of this scheme appears at the end of
+this post.)
+
+A few other features improve efficiency: copyvio check results are cached for
+a period of time so that the Foundation doesn't have to pay Yahoo! for the
+same information multiple times; and if a possible source is found to have a
+confidence value within the "suspected violation" range, yet-to-be-processed
+URLs are skipped and the check short-circuits.
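+
+For the curious, here is a heavily simplified sketch of that per-site queueing
+scheme, using threads in place of the tool's persistent worker subprocesses;
+the real workers also have to cope with URLs arriving over time and with the
+short-circuiting described above:
+
+```python
+from collections import defaultdict
+from queue import Empty, Queue
+from threading import Thread
+from urllib.parse import urlparse
+
+def check_urls(urls, fetch, num_workers=4):
+    """Fetch every URL, letting only one worker touch a given domain at once."""
+    # Group unprocessed URLs into site-specific queues, keyed by domain.
+    by_domain = defaultdict(list)
+    for url in urls:
+        by_domain[urlparse(url).netloc].append(url)
+
+    # Workers claim whole domains, so no site sees simultaneous requests.
+    domains = Queue()
+    for domain_urls in by_domain.values():
+        domains.put(domain_urls)
+
+    def worker():
+        while True:
+            try:
+                domain_urls = domains.get_nowait()
+            except Empty:
+                return
+            for url in domain_urls:
+                fetch(url)  # e.g. download and parse one page
+
+    threads = [Thread(target=worker) for _ in range(num_workers)]
+    for thread in threads:
+        thread.start()
+    for thread in threads:
+        thread.join()
+```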