Personal website https://benkurtovic.com/
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

2014-08-20-copyvio-detector.md 6.7 KiB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129
  1. ---
  2. layout: post
  3. title: Copyvio Detector
  4. tags: Wikipedia
  5. description: A technical writeup of some recent developments
  6. ---
  7. This is an technical writeup of some recent developments involving the
  8. [copyright violation](//en.wikipedia.org/wiki/WP:COPYVIO) detector for
  9. Wikipedia articles that I maintain, located at
  10. [tools.wmflabs.org/copyvios](//tools.wmflabs.org/copyvios). Its source code is
  11. available on [GitHub](//github.com/earwig/copyvios).
  12. ## Dealing with sources
  13. Of course, the central component of the detector is finding and parsing
  14. potential sources of copyright violations. These sources are obtained through
  15. two methods: investigating external links found in the article, and searching
  16. for article content elsewhere on the web using a search engine
  17. ([Yahoo! BOSS](//developer.yahoo.com/boss/search/), paid for by the Wikimedia
  18. Foundation).
  19. To use the search engine, we must first break the article text up into plain
  20. text search queries, or "chunks". This involves some help from
  21. [mwparserfromhell](//github.com/earwig/mwparserfromhell), which is used to
  22. strip out non-text wikicode from the article, and the [Python Natural Language
  23. Toolkit](http://www.nltk.org/), which is then used to split this up into
  24. sentences, of which we select a few medium-sized ones to search for.
  25. mwparserfromhell is also used to extract the external links.
  26. Sources are fetched and then parsed differently depending on the document type
  27. (HTML is handled by
  28. [beautifulsoup](http://www.crummy.com/software/BeautifulSoup/), PDFs are
  29. handled by [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/)), and
  30. normalized to a plain text form. We then create multiple
  31. [Markov chains](https://en.wikipedia.org/wiki/Markov_chain) – the *article
  32. chain* is built from word [5-grams](https://en.wikipedia.org/wiki/N-gram) from
  33. the article text, and a *source chain* is built from each source text. A *delta
  34. chain* is created for each source chain, representing the intersection of it
  35. and the article chain by examining which nodes are shared.
  36. But how do we use these chains to decide whether a violation is present?
  37. ## Determining violation confidence
  38. One of the most nuanced aspects of the detector is figuring out the likelihood
  39. that a given article is a violation of a given source. We call this number, a
  40. value between 0 and 1, the "confidence" of a violation. Values between 0 and
  41. 0.4 indicate no violation (green background in results page), between 0.4 and
  42. 0.75 a "possible" violation (yellow background), and between 0.75 and 1 a
  43. "suspected" violation (red background).
  44. To calculate the confidence of a violation, the copyvio detector uses the
  45. maximum value of two functions, one of which accounts for the size of the delta
  46. chain (<span>\\(\Delta\\)</span>) in relation to the article chain
  47. (<span>\\(A\\)</span>), and the other of which accounts for just the size of
  48. <span>\\(\Delta\\)</span>. This ensures a high confidence value when both
  49. chains are small, but not when <span>\\(A\\)</span> is significantly larger
  50. than <span>\\(\Delta\\)</span>.
  51. The article–delta confidence function, <span>\\(C_{A\Delta}\\)</span>, is
  52. piecewise-defined such that confidence increases at an exponential rate as
  53. <span>\\(\frac{\Delta}{A}\\)</span> increases, until the value of
  54. <span>\\(C_{A\Delta}\\)</span> reaches the "suspected" violation threshold, at
  55. which point confidence increases at a decreasing rate, with
  56. <span>\\(\lim_{\frac{\Delta}{A} \to 1}C\_{A\Delta}(A, \Delta)=1\\)</span>
  57. holding true. The exact coefficients used are shown below:
  58. <div>$$C_{A\Delta}(A, \Delta)=\begin{cases} -\ln(1-\frac{\Delta}{A}) &
  59. \frac{\Delta}{A} \le 0.52763 \\[0.5em]
  60. -0.8939(\frac{\Delta}{A})^2+1.8948\frac{\Delta}{A}-0.0009 &
  61. \frac{\Delta}{A} \gt 0.52763 \end{cases}$$</div>
  62. A graph can be viewed
  63. [here](/static/content/copyvio-detector/article-delta_confidence_function.pdf),
  64. with the x-axis indicating <span>\\(\frac{\Delta}{A}\\)</span> and the y-axis
  65. indicating confidence. The background is colored red, yellow, and green when a
  66. violation is considered suspected, possible, or not present, respectively.
  67. The delta confidence function, <span>\\(C_{\Delta}\\)</span>, is also
  68. piecewise-defined. A number of confidence values were derived experimentally,
  69. and the function was extrapolated from there such that
  70. <span>\\(\lim_{Δ \to +\infty}C\_{\Delta}(\Delta)=1\\)</span>. The reference
  71. points were <span>\\(\\{(0, 0), (100, 0.5), (250, 0.75), (500, 0.9),
  72. (1000, 0.95)\\}\\)</span>. The function is defined as follows:
  73. <div>$$C_{\Delta}(\Delta)=\begin{cases} \frac{\Delta}{\Delta+100} & \Delta\leq
  74. 100 \\[0.5em] \frac{\Delta-25}{\Delta+50} & 100\lt \Delta\leq 250\; \\[0.5em]
  75. \frac{10.5\Delta-750}{10\Delta} & 250\lt \Delta\leq 500\; \\[0.5em]
  76. \frac{\Delta-50}{\Delta} & \Delta\gt500 \end{cases}$$</div>
  77. A graph can be viewed
  78. [here](/static/content/copyvio-detector/delta_confidence_function.pdf), with
  79. the x-axis indicating <span>\\(\Delta\\)</span>. The background coloring is the
  80. same as before.
  81. Now that we have these two definitions, we can define the primary confidence
  82. function, <span>\\(C\\)</span>, as follows:
  83. <div>$$C(A, \Delta) = \max(C_{A\Delta}(A, \Delta), C_{\Delta}(\Delta))$$</div>
  84. By feeding <span>\\(A\\)</span> and <span>\\(\Delta\\)</span> into
  85. <span>\\(C\\)</span>, we get our final confidence value.
  86. ## Multithreaded worker model
  87. At a high level, the detector needs to be able to rapidly handle a lot of
  88. requests at the same time, but without falling victim to denial-of-service
  89. attacks. Since the tool needs to download many webpages very quickly, it is
  90. vulnerable to abuse if the same request is repeated many times without delay.
  91. Therefore, all requests made to the tool share the same set of persistent
  92. worker subprocesses, referred to as *global worker* mode. However, the
  93. underlying detection machinery in earwigbot also supports a *local worker*
  94. mode, which spawns individual workers for each copyvio check so that idle
  95. processes aren't kept running all the time.
  96. But how do these workers handle fetching URLs? The "safe" solution is to only
  97. handle one URL at a time per request, but this is too slow when twenty-five
  98. pages need to be checked in a few seconds – one single slow website will cause
  99. a huge delay. The detector's solution is to keep unprocessed URLs in
  100. site-specific queues, so that at any given point, only one worker is handling
  101. URLs for a particular domain. This way, no individual website is overloaded by
  102. simultaneous requests, but the copyvio check as a whole is completed quickly.
  103. Other features enable efficiency: copyvio check results are cached for a period
  104. of time so that the Foundation doesn't have to pay Yahoo! for the same
  105. information multiple times; and if a possible source is found to have a
  106. confidence value within the "suspected violation" range, yet-to-be-processed
  107. URLs are skipped and the check short-circuits.