|
|
@@ -4,21 +4,50 @@ title: Copyvio Detector |
|
|
|
description: A technical writeup of some recent developments.
---
|
|
|
|
|
|
|
This is a technical writeup of some recent developments involving the
[copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for
Wikipedia articles, which I maintain at
[tools.wmflabs.org/copyvios](//tools.wmflabs.org/copyvios). Its source code is
available on [GitHub](//github.com/earwig/copyvios).
|
|
|
|
|
|
|
## Dealing with sources
|
|
|
|
|
|
|
The central component of the detector is finding and parsing potential sources
of copyright violations. These sources are obtained through two methods:
investigating external links found in the article, and searching for article
content elsewhere on the web using a search engine
([Yahoo! BOSS](//developer.yahoo.com/boss/search/), paid for by the Wikimedia
Foundation).
|
|
|
|
|
|
|
To use the search engine, we must first break the article text up into plain
text search queries, or "chunks". This involves some help from
[mwparserfromhell](//github.com/earwig/mwparserfromhell), which is used to
strip out non-text wikicode from the article, and the
[Python Natural Language Toolkit](http://www.nltk.org/), which is then used to
split the stripped text into sentences, of which we select a few medium-sized
ones to search for. mwparserfromhell is also used to extract the external
links.
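
The chunk selection step could be sketched roughly as follows. This is only an illustration: a naive regex splitter stands in for NLTK's sentence tokenizer, the input is assumed to be text that has already had its wikicode stripped, and the function name, word-count bounds, and chunk limit are hypothetical rather than the detector's actual values.

```python
import re


def select_chunks(plain_text, min_words=8, max_words=20, max_chunks=3):
    """Pick a few medium-sized sentences to use as search queries.

    Assumptions (not taken from the detector's source): sentences are
    split with a naive regex instead of NLTK, and the word-count bounds
    and chunk limit are illustrative defaults.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", plain_text.strip())
    chunks = []
    for sentence in sentences:
        n_words = len(sentence.split())
        # Skip sentences that are too short or too long to be useful queries.
        if min_words <= n_words <= max_words:
            chunks.append(sentence)
        if len(chunks) == max_chunks:
            break
    return chunks
```

Very short sentences tend to match unrelated pages and very long ones tend to match nothing, which is presumably why medium-sized sentences are preferred.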
|
|
|
|
|
|
|
Sources are fetched and then parsed differently depending on the document type
(HTML is handled by
[beautifulsoup](http://www.crummy.com/software/BeautifulSoup/), PDFs are
handled by [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/)), and
normalized to a plain text form. We then create multiple
[Markov chains](https://en.wikipedia.org/wiki/Markov_chain): the *article
chain* is built from word trigrams from the article text, and a *source chain*
is built from each source text. A *delta chain* is created for each source
chain; it represents the intersection of that source chain and the article
chain, found by examining which nodes are shared.
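
To make the three kinds of chain concrete, here is a minimal sketch that models each chain as a multiset of word trigrams. The function names are hypothetical, and the real detector's chain representation (tokenization, sentinel nodes, how size is counted) may well differ.

```python
from collections import Counter


def build_chain(text):
    """Count word trigrams in text (a simplified stand-in for a chain)."""
    words = text.lower().split()
    # zip over three offset views of the word list yields each trigram.
    return Counter(zip(words, words[1:], words[2:]))


def delta_chain(article_chain, source_chain):
    """Keep only the trigrams present in both chains."""
    # Counter's & operator keeps shared keys with the minimum count.
    return article_chain & source_chain


def chain_size(chain):
    """Total number of trigram occurrences in the chain."""
    return sum(chain.values())
```

For example, the texts "the quick brown fox jumps over the lazy dog" and "a quick brown fox jumps high" share two trigrams, so their delta chain has size 2 in this model.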
|
|
|
|
|
|
|
But how do we use these chains to decide whether a violation is present?
|
|
|
|
|
|
|
## Determining violation confidence
|
|
|
|
|
|
|
One of the most nuanced aspects of the detector is figuring out the likelihood
that a given article is a violation of a given source. We call this number, a
value between 0 and 1, the "confidence" of a violation. Values between 0 and
0.4 indicate no violation (green background on the results page), between 0.4
and 0.75 a "possible" violation (yellow background), and between 0.75 and 1 a
"suspected" violation (red background).
|
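
The three confidence bands can be expressed as a small helper. The text does not say which band the boundary values 0.4 and 0.75 belong to, so assigning them to the higher band here is an assumption, as are the function name and the returned labels.

```python
def classify(confidence):
    """Map a confidence value in [0, 1] to a violation category.

    Assumption: values exactly equal to 0.4 or 0.75 fall into the
    higher band; the writeup does not specify the boundary behavior.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= 0.75:
        return "suspected"  # red background
    if confidence >= 0.4:
        return "possible"   # yellow background
    return "none"           # green background
```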
|
|
|
|
|
|
To calculate the confidence of a violation, the copyvio detector uses the
maximum value of two functions, one of which accounts for the size of the delta
|
|
@@ -36,7 +65,7 @@ which point confidence increases at a decreasing rate, with |
|
|
|
<span>\\(\lim_{\frac{\Delta}{A} \to 1}C\_{A\Delta}(A, \Delta)=1\\)</span>
holding true. The exact coefficients used are shown below:
|
|
|
|
|
|
|
<div>$$C_{A\Delta}(A, \Delta)=\begin{cases} -\ln(1-\frac{\Delta}{A}) &
\frac{\Delta}{A} \le 0.52763 \\[0.5em]
-0.8939(\frac{\Delta}{A})^2+1.8948\frac{\Delta}{A}-0.0009 &
\frac{\Delta}{A} \gt 0.52763 \end{cases}$$</div>
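
As a sanity check on the coefficients, the piecewise function can be transcribed directly into Python. This sketch assumes that A and Δ are the sizes of the article chain and the delta chain; the function name and the handling of an empty article are illustrative choices, and this covers only this one function, not the full confidence calculation.

```python
import math

THRESHOLD = 0.52763  # ratio at which the two pieces meet


def confidence_from_ratio(article_size, delta_size):
    """Evaluate the piecewise function of delta_size / article_size.

    Assumption: an empty article chain yields a confidence of 0.
    """
    if article_size <= 0:
        return 0.0
    ratio = delta_size / article_size
    if ratio <= THRESHOLD:
        # Logarithmic piece: grows at an increasing rate up to the threshold.
        return -math.log(1 - ratio)
    # Quadratic piece: grows at a decreasing rate and reaches 1 at ratio = 1.
    return -0.8939 * ratio ** 2 + 1.8948 * ratio - 0.0009
```

Evaluating both pieces at the threshold gives roughly 0.75 in each case, so the function is continuous there, and the quadratic piece evaluates to 1 when the ratio is 1, matching the stated limit.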
|
|
@@ -69,3 +98,29 @@ function, <span>\\(C\\)</span>, as follows: |
|
|
|
|
|
|
|
By feeding <span>\\(A\\)</span> and <span>\\(\Delta\\)</span> into
<span>\\(C\\)</span>, we get our final confidence value.
|
|
|
|
|
|
|
## Multithreaded worker model
|
|
|
|
|
|
|
At a high level, the detector needs to be able to rapidly handle a lot of
requests at the same time, but without falling victim to denial-of-service
attacks. Since the tool needs to download many webpages very quickly, it is
vulnerable to abuse if the same request is repeated many times without delay.
Therefore, all requests made to the tool share the same set of persistent
worker subprocesses, an arrangement referred to as *global worker* mode.
However, the underlying detection machinery in earwigbot also supports a
*local worker* mode, which spawns individual workers for each copyvio check so
that idle processes aren't kept running all the time.
|
|
|
|
|
|
|
But how do these workers handle fetching URLs? The "safe" solution is to only
handle one URL at a time per request, but this is too slow when twenty-five
pages need to be checked in a few seconds, since a single slow website will
cause a huge delay. The detector's solution is to keep unprocessed URLs in
site-specific queues, so that at any given point, only one worker is handling
URLs for a particular domain. This way, no individual website is overloaded by
simultaneous requests, but the copyvio check as a whole is completed quickly.
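
The site-specific queue idea can be illustrated with a short sketch. It is a simplification under stated assumptions: it uses a thread pool and submits one task per domain, whereas the tool itself uses persistent workers, and `group_by_domain`, `drain`, `check_urls`, and the `fetch` callback are hypothetical names.

```python
from collections import defaultdict, deque
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def group_by_domain(urls):
    """Place each URL into a queue keyed by its domain."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlparse(url).netloc.lower()].append(url)
    return queues


def drain(queue, fetch):
    """Process one domain's queue sequentially, one URL at a time."""
    results = []
    while queue:
        url = queue.popleft()
        results.append((url, fetch(url)))
    return results


def check_urls(urls, fetch, max_workers=4):
    """Fetch URLs in parallel across domains, serially within a domain."""
    queues = group_by_domain(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per domain, so a domain never sees concurrent requests.
        futures = [pool.submit(drain, q, fetch) for q in queues.values()]
        return [item for future in futures for item in future.result()]
```

Because each domain's queue is drained by exactly one task, a slow site delays only its own URLs while the other domains proceed in parallel.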
|
|
|
|
|
|
|
Other features improve efficiency: copyvio check results are cached for a
period of time so that the Foundation doesn't have to pay Yahoo! for the same
information multiple times; and if a possible source is found to have a
confidence value within the "suspected violation" range, yet-to-be-processed
URLs are skipped and the check short-circuits.
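
The short-circuit behavior might look something like the sketch below. It processes sources sequentially for clarity and leaves caching out entirely; the `score` callback, the function name, and the return shape are assumptions made for illustration.

```python
SUSPECTED_THRESHOLD = 0.75  # lower bound of the "suspected violation" range


def check_sources(urls, score):
    """Score sources in order, stopping once a suspected violation is found.

    `urls` is a list of source URLs and `score` is a hypothetical callback
    returning a confidence in [0, 1] for a single URL.
    Returns (best_url, best_confidence, skipped_urls).
    """
    best_url, best_conf = None, 0.0
    skipped = []
    for i, url in enumerate(urls):
        conf = score(url)
        if conf > best_conf:
            best_url, best_conf = url, conf
        if conf >= SUSPECTED_THRESHOLD:
            # Remaining URLs are never fetched or scored.
            skipped = list(urls[i + 1:])
            break
    return best_url, best_conf, skipped
```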