---
layout: post
title: Copyvio Detector
tags: Wikipedia
description: A technical writeup of some recent developments
---

This is a technical writeup of some recent developments involving the copyright violation detector for Wikipedia articles that I maintain, located at tools.wmflabs.org/copyvios. Its source code is available on GitHub.

## Dealing with sources

Of course, the central component of the detector is finding and parsing potential sources of copyright violations. These sources are obtained through two methods: investigating external links found in the article, and searching for article content elsewhere on the web using a search engine (Yahoo! BOSS, paid for by the Wikimedia Foundation).

To use the search engine, we must first break the article text up into plain text search queries, or “chunks”. This involves some help from mwparserfromhell, which is used to strip out non-text wikicode from the article, and the Python Natural Language Toolkit, which is then used to split this up into sentences, of which we select a few medium-sized ones to search for. mwparserfromhell is also used to extract the external links.
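As a rough sketch of that pipeline (the function names and the size/count cutoffs here are placeholders of mine, not the detector's actual tuning):

```python
import mwparserfromhell
from nltk.tokenize import sent_tokenize  # requires the NLTK "punkt" data

def make_queries(wikitext, max_queries=5, min_words=8, max_words=40):
    # Strip templates, tags, and other non-text wikicode, keeping display text
    text = mwparserfromhell.parse(wikitext).strip_code()

    # Split into sentences and keep a few medium-sized ones to search for
    sentences = sent_tokenize(text)
    medium = [s for s in sentences if min_words <= len(s.split()) <= max_words]
    return medium[:max_queries]

def external_links(wikitext):
    # Collect external-link targets from the article for direct comparison
    return [str(link.url)
            for link in mwparserfromhell.parse(wikitext).filter_external_links()]
```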

Sources are fetched and then parsed differently depending on the document type (HTML is handled by beautifulsoup, PDFs are handled by pdfminer), and normalized to a plain text form. We then create multiple Markov chains – the article chain is built from word trigrams from the article text, and a source chain is built from each source text. A delta chain is created for each source chain, representing the intersection of it and the article chain by examining which nodes are shared.
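A simplified sketch of the chain-building step, assuming a bare dictionary representation (earwigbot's actual chains carry more structure, such as sentinel start/end tokens):

```python
from collections import defaultdict

def build_chain(text):
    # Map each word trigram in the text to its number of occurrences
    words = text.lower().split()
    chain = defaultdict(int)
    for trigram in zip(words, words[1:], words[2:]):
        chain[trigram] += 1
    return chain

def delta_chain(article_chain, source_chain):
    # Keep only the trigrams (nodes) shared by both chains,
    # counted from the article's side
    return {t: n for t, n in article_chain.items() if t in source_chain}

def chain_size(chain):
    # The "size" used in the confidence functions below:
    # the total trigram count in the chain
    return sum(chain.values())
```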

But how do we use these chains to decide whether a violation is present?

## Determining violation confidence

One of the most nuanced aspects of the detector is figuring out the likelihood that a given article is a violation of a given source. We call this number, a value between 0 and 1, the “confidence” of a violation. Values between 0 and 0.4 indicate no violation (green background in results page), between 0.4 and 0.75 a “possible” violation (yellow background), and between 0.75 and 1 a “suspected” violation (red background).
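For illustration, those thresholds translate directly into code (a sketch; the function name and the handling of the exact boundary values are my own):

```python
def classify(confidence):
    # Thresholds from above; which side the exact boundaries
    # fall on is my assumption
    if confidence >= 0.75:
        return "suspected"  # red background
    if confidence >= 0.4:
        return "possible"   # yellow background
    return "none"           # green background
```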

To calculate the confidence of a violation, the copyvio detector uses the maximum value of two functions: one accounts for the size of the delta chain (\(\Delta\)) relative to the size of the article chain (\(A\)), and the other accounts for the size of \(\Delta\) alone. The ratio-based function yields high confidence when even a small article matches a source almost entirely, but not when \(A\) is significantly larger than \(\Delta\); the delta-only function handles that case by catching large absolute overlaps within long articles.

The article–delta confidence function, \(C_{A\Delta}\), is piecewise-defined such that confidence increases at an exponential rate as \(\frac{\Delta}{A}\) increases, until the value of \(C_{A\Delta}\) reaches the “suspected” violation threshold, at which point confidence increases at a decreasing rate, with \(\lim_{\frac{\Delta}{A} \to 1}C_{A\Delta}(A, \Delta)=1\) holding true. The exact coefficients used are shown below:

$$C_{A\Delta}(A, \Delta)=\begin{cases} -\ln(1-\frac{\Delta}{A}) & \frac{\Delta}{A} \le 0.52763 \\[0.5em] -0.8939(\frac{\Delta}{A})^2+1.8948\frac{\Delta}{A}-0.0009 & \frac{\Delta}{A} \gt 0.52763 \end{cases}$$

A graph can be viewed here, with the x-axis indicating \(\frac{\Delta}{A}\) and the y-axis indicating confidence. The background is colored red, yellow, and green when a violation is considered suspected, possible, or not present, respectively.
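Transcribed directly into Python (the function name and the zero-article guard are my additions):

```python
import math

def conf_article_delta(a, d):
    # C_{AΔ}: confidence from the ratio of delta-chain size
    # to article-chain size, using the coefficients above
    if a == 0:
        return 0.0  # degenerate case; my assumption, not from the post
    ratio = d / a
    if ratio <= 0.52763:
        return -math.log(1 - ratio)
    return -0.8939 * ratio ** 2 + 1.8948 * ratio - 0.0009
```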

The delta confidence function, \(C_{\Delta}\), is also piecewise-defined. A number of confidence values were derived experimentally, and the function was extrapolated from there such that \(\lim_{\Delta \to +\infty}C_{\Delta}(\Delta)=1\). The reference points were \(\{(0, 0), (100, 0.5), (250, 0.75), (500, 0.9), (1000, 0.95)\}\). The function is defined as follows:

$$C_{\Delta}(\Delta)=\begin{cases} \frac{\Delta}{\Delta+100} & \Delta\le 100 \\[0.5em] \frac{\Delta-25}{\Delta+50} & 100\lt \Delta\le 250 \\[0.5em] \frac{10.5\Delta-750}{10\Delta} & 250\lt \Delta\le 500 \\[0.5em] \frac{\Delta-50}{\Delta} & \Delta\gt 500 \end{cases}$$

A graph can be viewed here, with the x-axis indicating \(\Delta\). The background coloring is the same as before.
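Again transcribed directly into Python (function name mine); note that the pieces meet at the reference points, so the function is continuous:

```python
def conf_delta(d):
    # C_Δ: confidence from the absolute size of the delta chain
    if d <= 100:
        return d / (d + 100)
    if d <= 250:
        return (d - 25) / (d + 50)
    if d <= 500:
        return (10.5 * d - 750) / (10 * d)
    return (d - 50) / d
```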

Now that we have these two definitions, we can define the primary confidence function, \(C\), as follows:

$$C(A, \Delta) = \max(C_{A\Delta}(A, \Delta), C_{\Delta}(\Delta))$$

By feeding \(A\) and \(\Delta\) into \(C\), we get our final confidence value.
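Putting the pieces together, reusing the sketches above, with a worked example:

```python
def confidence(a, d):
    # C(A, Δ) = max(C_{AΔ}(A, Δ), C_Δ(Δ))
    return max(conf_article_delta(a, d), conf_delta(d))

# A 2000-trigram article sharing 600 trigrams with a source:
# C_{AΔ} = -ln(1 - 0.3) ≈ 0.357, but C_Δ = 550/600 ≈ 0.917,
# so the source is flagged as a suspected violation despite the low ratio.
print(confidence(2000, 600))  # ≈ 0.917
```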

## Multithreaded worker model

At a high level, the detector needs to be able to rapidly handle a lot of requests at the same time, but without falling victim to denial-of-service attacks. Since the tool needs to download many webpages very quickly, it is vulnerable to abuse if the same request is repeated many times without delay. Therefore, all requests made to the tool share the same set of persistent worker subprocesses, referred to as global worker mode. However, the underlying detection machinery in earwigbot also supports a local worker mode, which spawns individual workers for each copyvio check so that idle processes aren’t kept running all the time.

But how do these workers handle fetching URLs? The “safe” solution is to handle only one URL at a time per request, but this is too slow when twenty-five pages need to be checked in a few seconds: a single slow website would cause a huge delay. The detector’s solution is to keep unprocessed URLs in site-specific queues, so that at any given point, only one worker is handling URLs for a particular domain. This way, no individual website is overloaded by simultaneous requests, but the copyvio check as a whole is completed quickly.
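Here's a simplified sketch of that queueing idea (my own illustration, not earwigbot's actual implementation): URLs wait in per-domain queues, and a worker may only claim a domain that no other worker currently holds.

```python
import threading
from collections import defaultdict, deque
from urllib.parse import urlparse

class SiteQueues:
    def __init__(self):
        self._lock = threading.Lock()
        self._queues = defaultdict(deque)  # domain -> pending URLs
        self._active = set()               # domains a worker is fetching from

    def put(self, url):
        with self._lock:
            self._queues[urlparse(url).netloc].append(url)

    def get(self):
        # Claim a URL whose domain no other worker is currently handling
        with self._lock:
            for domain, queue in self._queues.items():
                if queue and domain not in self._active:
                    self._active.add(domain)
                    return queue.popleft()
            return None

    def done(self, url):
        # Release the domain so another queued URL for it can be claimed
        with self._lock:
            self._active.discard(urlparse(url).netloc)
```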

Two other features improve efficiency: copyvio check results are cached for a period of time so that the Foundation doesn’t have to pay Yahoo! for the same information multiple times, and if a possible source is found to have a confidence value within the “suspected violation” range, yet-to-be-processed URLs are skipped and the check short-circuits.
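The short-circuit behavior might look something like this sketch, reusing the helper functions from earlier (cache handling omitted; all names are mine):

```python
def check_sources(article_chain, source_texts, threshold=0.75):
    # Compare the article against each source in turn, stopping early
    # once a suspected violation is found
    best_conf, best_source = 0.0, None
    for source in source_texts:
        delta = delta_chain(article_chain, build_chain(source))
        conf = confidence(chain_size(article_chain), chain_size(delta))
        if conf > best_conf:
            best_conf, best_source = conf, source
        if conf >= threshold:
            break  # short-circuit: skip yet-to-be-processed URLs
    return best_source, best_conf
```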