From e87d3ada6ac724ae03567fd348b780ec1f4a8206 Mon Sep 17 00:00:00 2001
From: Ben Kurtovic
Date: Fri, 26 Sep 2014 20:04:14 -0500
Subject: [PATCH] Finish copyvio detector post.

---
 _posts/2014-08-20-copyvio-detector.md | 75 ++++++++++++++++++++++++++++++-----
 1 file changed, 65 insertions(+), 10 deletions(-)

diff --git a/_posts/2014-08-20-copyvio-detector.md b/_posts/2014-08-20-copyvio-detector.md
index a98a5c9..0983f36 100644
--- a/_posts/2014-08-20-copyvio-detector.md
+++ b/_posts/2014-08-20-copyvio-detector.md
@@ -4,21 +4,50 @@ title: Copyvio Detector
 description: A technical writeup of some recent developments.
 ---
 
-This is an in-progress technical writeup of some recent developments involving
-the [copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for
+This is a technical writeup of some recent developments involving the
+[copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for
 Wikipedia articles that I maintain, located at
 [tools.wmflabs.org/copyvios](//tools.wmflabs.org/copyvios). Its source code is
 available on [GitHub](//github.com/earwig/copyvios).
 
+## Dealing with sources
+
+Of course, the central component of the detector is finding and parsing
+potential sources of copyright violations. These sources are obtained through
+two methods: investigating external links found in the article, and searching
+for article content elsewhere on the web using a search engine
+([Yahoo! BOSS](//developer.yahoo.com/boss/search/), paid for by the Wikimedia
+Foundation).
+
+To use the search engine, we must first break the article text up into plain
+text search queries, or "chunks". This involves some help from
+[mwparserfromhell](//github.com/earwig/mwparserfromhell), which strips the
+non-text wikicode out of the article, and the [Python Natural Language
+Toolkit](http://www.nltk.org/), which then splits the remaining text into
+sentences; we select a few medium-sized ones to use as search queries.
+mwparserfromhell is also used to extract the article's external links.
+
+Sources are fetched and then parsed differently depending on their document
+type (HTML is handled by
+[beautifulsoup](http://www.crummy.com/software/BeautifulSoup/), while PDFs are
+handled by [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/)), and
+normalized to a plain text form. We then create multiple
+[Markov chains](https://en.wikipedia.org/wiki/Markov_chain) – the *article
+chain* is built from word trigrams in the article text, and a *source chain*
+is built from each source text. A *delta chain* is created for each source
+chain, representing its intersection with the article chain, found by
+examining which nodes are shared.
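+
+To make the chains concrete, here is a minimal Python sketch of the idea using
+plain trigram counts. The names and the notion of a chain's "size" are my own
+simplifications here; the detector's real chain implementation is more
+involved:
+
+```python
+# Toy illustration of the article, source, and delta chains.
+from collections import Counter
+
+def build_chain(text):
+    words = text.lower().split()
+    return Counter(zip(words, words[1:], words[2:]))  # word trigrams
+
+article_chain = build_chain("the quick brown fox jumps over the lazy dog")
+source_chain = build_chain("a quick brown fox jumps over a sleeping dog")
+
+# The delta chain keeps only the nodes that the two chains share.
+delta_chain = article_chain & source_chain
+
+# One simple notion of a chain's "size", used in the confidence math below.
+A, delta = sum(article_chain.values()), sum(delta_chain.values())
+print(A, delta)  # -> 7 3 for these toy sentences
+```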
+
+But how do we use these chains to decide whether a violation is present?
+
 ## Determining violation confidence
 
-One of the most important aspects of the detector is not fetching and parsing
-potential sources, but figuring out the likelihood that a given article is a
-violation of a given source. We call this number, a value between 0 and 1, the
-"confidence" of a violation. Values between 0 and 0.4 indicate no violation
-(green background in results page), between 0.4 and 0.75 a "possible" violation
-(yellow background), and between 0.75 and 1 a "suspected" violation (red
-background).
+One of the most nuanced aspects of the detector is figuring out the likelihood
+that a given article is a violation of a given source. We call this number, a
+value between 0 and 1, the "confidence" of a violation. Values between 0 and
+0.4 indicate no violation (green background on the results page), between 0.4
+and 0.75 a "possible" violation (yellow background), and between 0.75 and 1 a
+"suspected" violation (red background).
 
 To calculate the confidence of a violation, the copyvio detector uses the
 maximum value of two functions, one of which accounts for the size of the delta
@@ -36,7 +65,7 @@ which point confidence increases at a decreasing rate, with
 \\(\lim_{\frac{\Delta}{A} \to 1}C\_{A\Delta}(A, \Delta)=1\\) holding true. The
 exact coefficients used are shown below:
 
-$$C_{A\Delta}(A, \Delta)=\begin{cases} \ln\frac{1}{1-\frac{\Delta}{A}} &
\frac{\Delta}{A} \le 0.52763 \\[0.5em]
-0.8939(\frac{\Delta}{A})^2+1.8948\frac{\Delta}{A}-0.0009 & \frac{\Delta}{A}
\gt 0.52763 \end{cases}$$
+$$C_{A\Delta}(A, \Delta)=\begin{cases} -\ln(1-\frac{\Delta}{A}) &
\frac{\Delta}{A} \le 0.52763 \\[0.5em]
-0.8939(\frac{\Delta}{A})^2+1.8948\frac{\Delta}{A}-0.0009 & \frac{\Delta}{A}
\gt 0.52763 \end{cases}$$
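+
+For illustration, here is the same piecewise curve as a short Python function
+(a minimal sketch with my own names, not the detector's exact code):
+
+```python
+import math
+
+def delta_confidence(article_size, delta_size):
+    """Confidence based on the delta chain's size relative to the article's."""
+    if article_size == 0:
+        return 0.0  # guard against an empty article chain
+    ratio = float(delta_size) / article_size
+    if ratio <= 0.52763:
+        return -math.log(1 - ratio)
+    return -0.8939 * ratio ** 2 + 1.8948 * ratio - 0.0009
+
+# The detector's final confidence is the maximum of this and a second function.
+# With the toy A = 7 and delta = 3 from earlier:
+print(delta_confidence(7, 3))  # ~0.56, a "possible" violation
+```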
@@ -69,3 +98,29 @@ function, \\(C\\), as follows:
 
 By feeding \\(A\\) and \\(\Delta\\) into \\(C\\), we get our final confidence
 value.
+
+## Multithreaded worker model
+
+At a high level, the detector needs to handle a large number of simultaneous
+requests quickly, but without falling victim to denial-of-service attacks.
+Since the tool has to download many webpages in a short span of time, it is
+vulnerable to abuse if the same request is repeated many times without delay.
+Therefore, all requests made to the tool share the same set of persistent
+worker subprocesses, referred to as *global worker* mode. However, the
+underlying detection machinery in earwigbot also supports a *local worker*
+mode, which spawns individual workers for each copyvio check so that idle
+processes aren't kept running all the time.
+
+But how do these workers handle fetching URLs? The "safe" solution is to
+handle only one URL at a time per request, but this is too slow when
+twenty-five pages need to be checked in a few seconds – a single slow website
+will cause a huge delay. The detector's solution is to keep unprocessed URLs
+in site-specific queues, so that at any given point, only one worker is
+handling URLs for a particular domain. This way, no individual website is
+overloaded by simultaneous requests, but the copyvio check as a whole is
+completed quickly. (A simplified sketch of this scheme appears at the end of
+this post.)
+
+A few other features improve efficiency: copyvio check results are cached for
+a period of time so that the Foundation doesn't have to pay Yahoo! for the
+same information multiple times; and if a possible source is found to have a
+confidence value within the "suspected violation" range, yet-to-be-processed
+URLs are skipped and the check short-circuits.
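+
+For the curious, here is a heavily simplified sketch of that per-site queueing
+scheme, using threads in place of the tool's persistent worker subprocesses;
+the real workers also have to cope with URLs arriving over time and with the
+short-circuiting described above:
+
+```python
+from collections import defaultdict
+from queue import Empty, Queue
+from threading import Thread
+from urllib.parse import urlparse
+
+def check_urls(urls, fetch, num_workers=4):
+    """Fetch every URL, letting only one worker touch a given domain at once."""
+    # Group unprocessed URLs into site-specific queues, keyed by domain.
+    by_domain = defaultdict(list)
+    for url in urls:
+        by_domain[urlparse(url).netloc].append(url)
+
+    # Workers claim whole domains, so no site sees simultaneous requests.
+    domains = Queue()
+    for domain_urls in by_domain.values():
+        domains.put(domain_urls)
+
+    def worker():
+        while True:
+            try:
+                domain_urls = domains.get_nowait()
+            except Empty:
+                return
+            for url in domain_urls:
+                fetch(url)  # e.g. download and parse one page
+
+    threads = [Thread(target=worker) for _ in range(num_workers)]
+    for thread in threads:
+        thread.start()
+    for thread in threads:
+        thread.join()
+```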