A copyright violation detector running on Wikimedia Cloud Services https://tools.wmflabs.org/copyvios/
<%!
    # Module-level imports used by the template below.
    from json import dumps
    from flask import url_for
%>\
## Emit a line break followed by `size` indentation blocks.
<%def name="do_indent(size)">
    <br />
    % for i in range(size):
        <div class="indent"></div>
    % endfor
</%def>\
## Recursively render a JSON-compatible object as indented HTML.
<%def name="walk_json(obj, indent=0)">
    % if isinstance(obj, dict):
        {
        % for key in obj:
            ${do_indent(indent + 1)}
            "${key | h}": ${walk_json(obj[key], indent + 1)}${"," if not loop.last else ""}
        % endfor
        ${do_indent(indent)}
        }
    % elif isinstance(obj, (list, tuple, set)):
        [
        % for elem in obj:
            ${do_indent(indent + 1)}
            ${walk_json(elem, indent + 1)}${"," if not loop.last else ""}
        % endfor
        ${do_indent(indent)}
        ]
    % else:
        ${dumps(obj) | h}
    % endif
</%def>\
<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>API &ndash; Earwig's Copyvio Detector</title>
        <link rel="stylesheet" href="${request.script_root}${url_for('static', file='api.min.css')}" type="text/css" />
    </head>
    <body>
        % if help:
            <div id="help">
                <h1>Copyvio Detector API</h1>
                <p>This is the first version of the <a href="https://en.wikipedia.org/wiki/Application_programming_interface">API</a> for <a href="${request.script_root}/">Earwig's Copyvio Detector</a>. Please <a href="https://github.com/earwig/copyvios/issues">report any issues</a> you encounter.</p>
                <h2>Requests</h2>
                <p>The API responds to GET requests made to <span class="code">https://copyvios.toolforge.org/api.json</span>. Parameters are described in the tables below:</p>
                <table class="parameters">
                    <tr>
                        <th colspan="4">Always</th>
                    </tr>
                    <tr>
                        <th>Parameter</th>
                        <th>Values</th>
                        <th>Required?</th>
                        <th>Description</th>
                    </tr>
                    <tr>
                        <td>action</td>
                        <td><span class="code">compare</span>, <span class="code">search</span>, <span class="code">sites</span></td>
                        <td>Yes</td>
                        <td>The API will do URL comparisons in <span class="code">compare</span> mode, run full copyvio checks in <span class="code">search</span> mode, and list all known site languages and projects in <span class="code">sites</span> mode.</td>
                    </tr>
                    <tr>
                        <td>format</td>
                        <td><span class="code">json</span>, <span class="code">jsonfm</span></td>
                        <td>No&nbsp;(default:&nbsp;<span class="code">json</span>)</td>
                        <td>The default output format is <a href="https://www.json.org/">JSON</a>. <span class="code">jsonfm</span> mode produces the same output, but renders it as a formatted HTML document for debugging.</td>
                    </tr>
                    <tr>
                        <td>version</td>
                        <td>integer</td>
                        <td>No (default: <span class="code">1</span>)</td>
                        <td>Currently, the API only has one version. You can skip this parameter, but it is recommended to include it for forward compatibility.</td>
                    </tr>
                </table>
                <table class="parameters">
                    <tr>
                        <th colspan="4"><span class="code">compare</span> Mode</th>
                    </tr>
                    <tr>
                        <th>Parameter</th>
                        <th>Values</th>
                        <th>Required?</th>
                        <th>Description</th>
                    </tr>
                    <tr>
                        <td>project</td>
                        <td>string</td>
                        <td>Yes</td>
                        <td>The project code of the site the page lives on. Examples are <span class="code">wikipedia</span> and <span class="code">wiktionary</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
                    </tr>
                    <tr>
                        <td>lang</td>
                        <td>string</td>
                        <td>Yes</td>
                        <td>The language code of the site the page lives on. Examples are <span class="code">en</span> and <span class="code">de</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
                    </tr>
                    <tr>
                        <td>title</td>
                        <td>string</td>
                        <td>Yes&nbsp;(either&nbsp;<span class="code">title</span>&nbsp;or&nbsp;<span class="code">oldid</span>)</td>
                        <td>The title of the page or article to make a comparison against. Namespace must be included if the page isn't in the mainspace.</td>
                    </tr>
                    <tr>
                        <td>oldid</td>
                        <td>integer</td>
                        <td>Yes (either <span class="code">title</span> or <span class="code">oldid</span>)</td>
                        <td>The revision ID (also called oldid) of the page revision to make a comparison against. If both a title and oldid are given, the oldid will be used.</td>
                    </tr>
                    <tr>
                        <td>url</td>
                        <td>string</td>
                        <td>Yes</td>
                        <td>The URL of the suspected violation source that will be compared to the page.</td>
                    </tr>
                    <tr>
                        <td>detail</td>
                        <td>boolean</td>
                        <td>No (default: <span class="code">false</span>)</td>
                        <td>Whether to include the detailed HTML text comparison available in the regular interface. If not, only the confidence percentage is available.</td>
                    </tr>
                </table>
                <table class="parameters">
                    <tr>
                        <th colspan="4"><span class="code">search</span> Mode</th>
                    </tr>
                    <tr>
                        <th>Parameter</th>
                        <th>Values</th>
                        <th>Required?</th>
                        <th>Description</th>
                    </tr>
                    <tr>
                        <td>project</td>
                        <td>string</td>
                        <td>Yes</td>
                        <td>The project code of the site the page lives on. Examples are <span class="code">wikipedia</span> and <span class="code">wiktionary</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
                    </tr>
                    <tr>
                        <td>lang</td>
                        <td>string</td>
                        <td>Yes</td>
                        <td>The language code of the site the page lives on. Examples are <span class="code">en</span> and <span class="code">de</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
                    </tr>
                    <tr>
                        <td>title</td>
                        <td>string</td>
                        <td>Yes&nbsp;(either&nbsp;<span class="code">title</span>&nbsp;or&nbsp;<span class="code">oldid</span>)</td>
                        <td>The title of the page or article to make a check against. Namespace must be included if the page isn't in the mainspace.</td>
                    </tr>
                    <tr>
                        <td>oldid</td>
                        <td>integer</td>
                        <td>Yes (either <span class="code">title</span> or <span class="code">oldid</span>)</td>
                        <td>The revision ID (also called oldid) of the page revision to make a check against. If both a title and oldid are given, the oldid will be used.</td>
                    </tr>
                    <tr>
                        <td>use_engine</td>
                        <td>boolean</td>
                        <td>No (default: <span class="code">true</span>)</td>
                        <td>Whether to use a search engine (<a href="https://developers.google.com/custom-search/">Google</a>) as a source of URLs to compare against the page.</td>
                    </tr>
                    <tr>
                        <td>use_links</td>
                        <td>boolean</td>
                        <td>No (default: <span class="code">true</span>)</td>
                        <td>Whether to compare the page against external links found in its wikitext.</td>
                    </tr>
                    <tr>
                        <td>nocache</td>
                        <td>boolean</td>
                        <td>No (default: <span class="code">false</span>)</td>
                        <td>Whether to bypass search results cached from previous checks. It is recommended that you don't pass this option unless a user specifically asks for it.</td>
                    </tr>
                    <tr>
                        <td>noredirect</td>
                        <td>boolean</td>
                        <td>No (default: <span class="code">false</span>)</td>
                        <td>Whether to avoid following redirects if the given page is a redirect.</td>
                    </tr>
                    <tr>
                        <td>noskip</td>
                        <td>boolean</td>
                        <td>No (default: <span class="code">false</span>)</td>
                        <td>If a suspected source is found during a check to have a sufficiently high confidence value, the check will end prematurely, and other pending URLs will be skipped. Passing this option will prevent this behavior, resulting in complete (but more time-consuming) checks.</td>
                    </tr>
                </table>
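                <p>As a rough illustration of how the parameters in the tables above combine into a request, the following Python sketch builds a <span class="code">search</span> URL. The helper name <span class="code">build_search_url</span> is hypothetical, and encoding booleans as lowercase <span class="code">true</span>/<span class="code">false</span> is an assumption based on the defaults shown in the tables.</p>

```python
from urllib.parse import urlencode

# Endpoint as given in the Requests section.
API_URL = "https://copyvios.toolforge.org/api.json"


def build_search_url(project, lang, title=None, oldid=None, **options):
    """Return a GET URL for action=search (hypothetical helper, not part of the tool)."""
    if title is None and oldid is None:
        raise ValueError("either title or oldid is required")
    params = {"version": 1, "action": "search", "project": project, "lang": lang}
    if title is not None:
        params["title"] = title
    if oldid is not None:
        params["oldid"] = oldid
    # Optional flags such as use_engine, use_links, nocache, noredirect, noskip.
    for name, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase "true"/"false".
            value = "true" if value else "false"
        params[name] = value
    return API_URL + "?" + urlencode(params)
```

                <p>For instance, <span class="code">build_search_url("wikipedia", "en", title="User:EarwigBot/Copyvios/Tests/2")</span> yields a percent-encoded form of the request shown in the Example section.</p>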
                <h2>Responses</h2>
                <p>The JSON response object always contains a <span class="code">status</span> key, whose value is either <span class="code">ok</span> or <span class="code">error</span>. If an error has occurred, the response will look like this:</p>
                <pre>{
    "status": "error",
    "error": {
        "code": <span class="resp-dtype">string</span> <span class="resp-desc">error code</span>,
        "info": <span class="resp-dtype">string</span> <span class="resp-desc">human-readable description of error</span>
    }
}</pre>
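                <p>A client might turn the error object above into an exception. The following is a minimal Python sketch; the names <span class="code">CopyviosAPIError</span> and <span class="code">parse_response</span> are hypothetical and not part of the tool.</p>

```python
import json


class CopyviosAPIError(Exception):
    """Raised when a response has status "error" (hypothetical class name)."""

    def __init__(self, code, info):
        super().__init__(f"{code}: {info}")
        self.code = code
        self.info = info


def parse_response(body):
    """Decode a JSON response body; raise if the API reported an error."""
    data = json.loads(body)
    if data.get("status") == "error":
        error = data.get("error", {})
        raise CopyviosAPIError(error.get("code"), error.get("info"))
    return data
```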
                <p>Valid responses for <span class="code">action=compare</span> and <span class="code">action=search</span> are formatted like this:</p>
                <pre>{
    "status": "ok",
    "meta": {
        "time": <span class="resp-dtype">float</span> <span class="resp-desc">time to generate results, in seconds</span>,
        "queries": <span class="resp-dtype">int</span> <span class="resp-desc">number of search engine queries made</span>,
        "cached": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether these results are cached from an earlier search (always false in the case of action=compare)</span>,
        "redirected": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether a redirect was followed</span>,
        <span class="resp-cond">only if cached=true</span> "cache_time": <span class="resp-dtype">string</span> <span class="resp-desc">human-readable time of the original search that the results are cached from</span>
    },
    "page": {
        "title": <span class="resp-dtype">string</span> <span class="resp-desc">the normalized title of the page checked</span>,
        "url": <span class="resp-dtype">string</span> <span class="resp-desc">the full URL of the page checked</span>
    },
    <span class="resp-cond">only if redirected=true</span> "original_page": {
        "title": <span class="resp-dtype">string</span> <span class="resp-desc">the normalized title of the original page whose redirect was followed</span>,
        "url": <span class="resp-dtype">string</span> <span class="resp-desc">the full URL of the original page whose redirect was followed</span>
    },
    "best": {
        "url": <span class="resp-dtype">string</span> <span class="resp-desc">the URL of the best match found, or null if no matches were found</span>,
        "confidence": <span class="resp-dtype">float</span> <span class="resp-desc">the confidence of a violation in the best match, or 0.0 if no matches were found</span>,
        "violation": <span class="resp-dtype">string</span> <span class="resp-desc">one of "suspected", "possible", or "none"</span>
    },
    "sources": [
        {
            "url": <span class="resp-dtype">string</span> <span class="resp-desc">the URL of the source</span>,
            "confidence": <span class="resp-dtype">float</span> <span class="resp-desc">the confidence of a violation in the source</span>,
            "violation": <span class="resp-dtype">string</span> <span class="resp-desc">one of "suspected", "possible", or "none"</span>,
            "skipped": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether the source was skipped due to the check finishing early (see note about noskip above) or an exclusion</span>,
            "excluded": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether the source was skipped for being in the excluded URL list</span>
        },
        ...
    ],
    <span class="resp-cond">only if action=compare and detail=true</span> "detail": {
        "article": <span class="resp-dtype">string</span> <span class="resp-desc">article text, with shared passages marked with HTML</span>,
        "source": <span class="resp-dtype">string</span> <span class="resp-desc">source text, with shared passages marked with HTML</span>
    }
}</pre>
                <p>In the case of <span class="code">action=search</span>, <span class="code">sources</span> will contain one entry for each source checked (or skipped if the check ends early), sorted in order of confidence, with skipped and excluded sources at the bottom.</p>
                <p>In the case of <span class="code">action=compare</span>, <span class="code">best</span> will always contain information about the URL that was given, so <span class="code">response["best"]["url"]</span> will never be <span class="code">null</span>. Also, <span class="code">sources</span> will always contain one entry, with the same data as <span class="code">best</span>, since only one source is checked in comparison mode.</p>
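                <p>Because skipped and excluded sources appear in <span class="code">sources</span> alongside checked ones, a client will usually want to separate them. A small Python sketch, assuming <span class="code">response</span> is the decoded JSON object described above (both function names are hypothetical):</p>

```python
def checked_sources(response):
    """Return the sources that were actually compared (not skipped, not excluded)."""
    return [
        source
        for source in response.get("sources", [])
        if not source["skipped"] and not source["excluded"]
    ]


def suspected_urls(response):
    """Return URLs of checked sources whose violation level is "suspected"."""
    return [
        source["url"]
        for source in checked_sources(response)
        if source["violation"] == "suspected"
    ]
```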
                <p>Valid responses for <span class="code">action=sites</span> are formatted like this:</p>
                <pre>{
    "status": "ok",
    "langs": [
        [
            <span class="resp-dtype">string</span> <span class="resp-desc">language code</span>,
            <span class="resp-dtype">string</span> <span class="resp-desc">human-readable language name</span>
        ],
        ...
    ],
    "projects": [
        [
            <span class="resp-dtype">string</span> <span class="resp-desc">project code</span>,
            <span class="resp-dtype">string</span> <span class="resp-desc">human-readable project name</span>
        ],
        ...
    ]
}</pre>
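                <p>Since <span class="code">langs</span> and <span class="code">projects</span> are lists of two-element pairs, they convert directly into lookup tables. A brief Python sketch, with the hypothetical helper name <span class="code">site_lookup</span>:</p>

```python
def site_lookup(response):
    """Map language and project codes to their human-readable names."""
    # Each entry is a [code, name] pair, so dict() can consume the list as-is.
    langs = dict(response["langs"])
    projects = dict(response["projects"])
    return langs, projects
```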
                <h2>Etiquette</h2>
                <p>The tool uses the same workers to handle all requests, so making concurrent API calls is only going to slow you down. Most operations are not rate-limited, but full searches with <span class="code">use_engine=True</span> are globally limited to around a thousand per day. Be respectful!</p>
                <p>Aside from testing, you must set a reasonable <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent">user agent</a> that identifies your bot and gives some way to contact you. You may be blocked if using an improper user agent (for example, the default user agent set by your HTTP library), or if your bot makes requests too frequently.</p>
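                <p>One way to follow these guidelines from Python's standard library is sketched below: requests carry an identifying <span class="code">User-Agent</span> header and are issued one at a time rather than concurrently. The user agent string, the function names, and the one-second default pause are all assumptions chosen for illustration; the API does not prescribe them.</p>

```python
import json
import time
from urllib.request import Request, urlopen

# Hypothetical identification string: replace with your own bot name and contact.
USER_AGENT = "ExampleBot/0.1 (https://example.org/examplebot; examplebot@example.org)"


def make_request(url):
    """Build a request carrying an identifying User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls, delay=1.0, opener=urlopen):
    """Fetch URLs sequentially, pausing between calls (delay is an assumed value)."""
    results = []
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)
        with opener(make_request(url)) as response:
            results.append(json.loads(response.read().decode("utf-8")))
    return results
```

                <p>The <span class="code">opener</span> argument exists only so the function can be exercised without network access; by default it uses <span class="code">urlopen</span>.</p>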
                <h2>Example</h2>
                <p><a class="no-color" href="https://copyvios.toolforge.org/api.json?version=1&amp;action=search&amp;project=wikipedia&amp;lang=en&amp;title=User:EarwigBot/Copyvios/Tests/2"><span class="code">https://copyvios.toolforge.org/api.json?<span class="param-key">version</span>=<span class="param-val">1</span>&amp;<span class="param-key">action</span>=<span class="param-val">search</span>&amp;<span class="param-key">project</span>=<span class="param-val">wikipedia</span>&amp;<span class="param-key">lang</span>=<span class="param-val">en</span>&amp;<span class="param-key">title</span>=<span class="param-val">User:EarwigBot/Copyvios/Tests/2</span></span></a></p>
                <pre>{
    "status": "ok",
    "meta": {
        "time": 2.2474379539489746,
        "queries": 1,
        "cached": false,
        "redirected": false
    },
    "page": {
        "title": "User:EarwigBot/Copyvios/Tests/2",
        "url": "https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Tests/2"
    },
    "best": {
        "url": "http://www.whitehouse.gov/administration/president-obama/",
        "confidence": 0.9886608511242603,
        "violation": "suspected"
    },
    "sources": [
        {
            "url": "http://www.whitehouse.gov/administration/president-obama/",
            "confidence": 0.9886608511242603,
            "violation": "suspected",
            "skipped": false,
            "excluded": false
        },
        {
            "url": "http://maige2009.blogspot.com/2013/07/barack-h-obama-is-44th-president-of.html",
            "confidence": 0.9864798816568047,
            "violation": "suspected",
            "skipped": false,
            "excluded": false
        },
        {
            "url": "http://jeuxdemonstre-apkdownload.rhcloud.com/luo-people-of-kenya-and-tanzania---wikipedia--the-free",
            "confidence": 0.0,
            "violation": "none",
            "skipped": false,
            "excluded": false
        },
        {
            "url": "http://www.whitehouse.gov/about/presidents/barackobama",
            "confidence": 0.0,
            "violation": "none",
            "skipped": true,
            "excluded": false
        },
        {
            "url": "http://jeuxdemonstre-apkdownload.rhcloud.com/president-barack-obama---the-white-house",
            "confidence": 0.0,
            "violation": "none",
            "skipped": true,
            "excluded": false
        }
    ]
}
</pre>
            </div>
        % endif
        % if result:
            <div id="result">
                <p>You are using <span class="code">jsonfm</span> output mode, which renders JSON data as a formatted HTML document. This is intended for testing and debugging only.</p>
                <div class="json">
                    ${walk_json(result)}
                </div>
            </div>
        % endif
    </body>
</html>