A copyright violation detector running on Wikimedia Cloud Services https://tools.wmflabs.org/copyvios/

  1. <%!
  2. from json import dumps
  3. from flask import url_for
  4. %>\
  5. <%def name="do_indent(size)">
  6. <br />
  7. % for i in range(size):
  8. <div class="indent"></div>
  9. % endfor
  10. </%def>\
  11. <%def name="walk_json(obj, indent=0)">
  12. % if isinstance(obj, dict):
  13. {
  14. % for key in obj:
  15. ${do_indent(indent + 1)}
  16. "${key | h}": ${walk_json(obj[key], indent + 1)}${"," if not loop.last else ""}
  17. % endfor
  18. ${do_indent(indent)}
  19. }
  20. % elif isinstance(obj, (list, tuple, set)):
  21. [
  22. % for elem in obj:
  23. ${do_indent(indent + 1)}
  24. ${walk_json(elem, indent + 1)}${"," if not loop.last else ""}
  25. % endfor
  26. ${do_indent(indent)}
  27. ]
  28. % else:
  29. ${dumps(obj) | h}
  30. % endif
  31. </%def>\
  32. <!DOCTYPE html>
  33. <html lang="en">
  34. <head>
  35. <meta charset="utf-8">
  36. <title>API &ndash; Earwig's Copyvio Detector</title>
  37. <link rel="stylesheet" href="${request.script_root}${url_for('static', file='api.min.css')}" type="text/css" />
  38. </head>
  39. <body>
  40. % if help:
  41. <div id="help">
  42. <h1>Copyvio Detector API</h1>
  43. <p>This is the first version of the <a href="//en.wikipedia.org/wiki/Application_programming_interface">API</a> for <a href="${request.script_root}">Earwig's Copyvio Detector</a>. It works, but some bugs may remain, so please <a href="https://github.com/earwig/copyvios/issues">report any</a> that you find.</p>
  44. <h2>Requests</h2>
  45. <p>The API responds to GET requests made to <span class="code">https://tools.wmflabs.org/copyvios/api.json</span>. Parameters are described in the tables below:</p>
  46. <table class="parameters">
  47. <tr>
  48. <th colspan="4">Always</th>
  49. </tr>
  50. <tr>
  51. <th>Parameter</th>
  52. <th>Values</th>
  53. <th>Required?</th>
  54. <th>Description</th>
  55. </tr>
  56. <tr>
  57. <td>action</td>
  58. <td><span class="code">compare</span>, <span class="code">search</span>, <span class="code">sites</span></td>
  59. <td>Yes</td>
  60. <td>The API will do URL comparisons in <span class="code">compare</span> mode, run full copyvio checks in <span class="code">search</span> mode, and list all known site languages and projects in <span class="code">sites</span> mode.</td>
  61. </tr>
  62. <tr>
  63. <td>format</td>
  64. <td><span class="code">json</span>, <span class="code">jsonfm</span></td>
  65. <td>No&nbsp;(default:&nbsp;<span class="code">json</span>)</td>
  66. <td>The default output format is <a href="http://json.org/">JSON</a>. <span class="code">jsonfm</span> mode produces the same output, but renders it as a formatted HTML document for debugging.</td>
  67. </tr>
  68. <tr>
  69. <td>version</td>
  70. <td>integer</td>
  71. <td>No (default: <span class="code">1</span>)</td>
  72. <td>The API currently has only one version. This parameter can be omitted, but including it is recommended for forward compatibility.</td>
  73. </tr>
  74. </table>
  75. <table class="parameters">
  76. <tr>
  77. <th colspan="4"><span class="code">compare</span> Mode</th>
  78. </tr>
  79. <tr>
  80. <th>Parameter</th>
  81. <th>Values</th>
  82. <th>Required?</th>
  83. <th>Description</th>
  84. </tr>
  85. <tr>
  86. <td>project</td>
  87. <td>string</td>
  88. <td>Yes</td>
  89. <td>The project code of the site the page lives on. Examples are <span class="code">wikipedia</span> and <span class="code">wiktionary</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
  90. </tr>
  91. <tr>
  92. <td>lang</td>
  93. <td>string</td>
  94. <td>Yes</td>
  95. <td>The language code of the site the page lives on. Examples are <span class="code">en</span> and <span class="code">de</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
  96. </tr>
  97. <tr>
  98. <td>title</td>
  99. <td>string</td>
  100. <td>Yes&nbsp;(either&nbsp;<span class="code">title</span>&nbsp;or&nbsp;<span class="code">oldid</span>)</td>
  101. <td>The title of the page or article to make a comparison against. Namespace must be included if the page isn't in the mainspace.</td>
  102. </tr>
  103. <tr>
  104. <td>oldid</td>
  105. <td>integer</td>
  106. <td>Yes (either <span class="code">title</span> or <span class="code">oldid</span>)</td>
  107. <td>The revision ID (also called oldid) of the page revision to make a comparison against. If both a title and oldid are given, the oldid will be used.</td>
  108. </tr>
  109. <tr>
  110. <td>url</td>
  111. <td>string</td>
  112. <td>Yes</td>
  113. <td>The URL of the suspected violation source that will be compared to the page.</td>
  114. </tr>
  115. </table>
  116. <table class="parameters">
  117. <tr>
  118. <th colspan="4"><span class="code">search</span> Mode</th>
  119. </tr>
  120. <tr>
  121. <th>Parameter</th>
  122. <th>Values</th>
  123. <th>Required?</th>
  124. <th>Description</th>
  125. </tr>
  126. <tr>
  127. <td>project</td>
  128. <td>string</td>
  129. <td>Yes</td>
  130. <td>The project code of the site the page lives on. Examples are <span class="code">wikipedia</span> and <span class="code">wiktionary</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
  131. </tr>
  132. <tr>
  133. <td>lang</td>
  134. <td>string</td>
  135. <td>Yes</td>
  136. <td>The language code of the site the page lives on. Examples are <span class="code">en</span> and <span class="code">de</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
  137. </tr>
  138. <tr>
  139. <td>title</td>
  140. <td>string</td>
  141. <td>Yes&nbsp;(either&nbsp;<span class="code">title</span>&nbsp;or&nbsp;<span class="code">oldid</span>)</td>
  142. <td>The title of the page or article to make a check against. Namespace must be included if the page isn't in the mainspace.</td>
  143. </tr>
  144. <tr>
  145. <td>oldid</td>
  146. <td>integer</td>
  147. <td>Yes (either <span class="code">title</span> or <span class="code">oldid</span>)</td>
  148. <td>The revision ID (also called oldid) of the page revision to make a check against. If both a title and oldid are given, the oldid will be used.</td>
  149. </tr>
  150. <tr>
  151. <td>use_engine</td>
  152. <td>boolean</td>
  153. <td>No (default: <span class="code">true</span>)</td>
  154. <td>Whether to use a search engine (<a href="//developer.yahoo.com/boss/search/">Yahoo! BOSS</a>) as a source of URLs to compare against the page.</td>
  155. </tr>
  156. <tr>
  157. <td>use_links</td>
  158. <td>boolean</td>
  159. <td>No (default: <span class="code">true</span>)</td>
  160. <td>Whether to compare the page against external links found in its wikitext.</td>
  161. </tr>
  162. <tr>
  163. <td>nocache</td>
  164. <td>boolean</td>
  165. <td>No (default: <span class="code">false</span>)</td>
  166. <td>Whether to bypass search results cached from previous checks. Avoid passing this option unless a user specifically asks for it.</td>
  167. </tr>
  168. <tr>
  169. <td>noredirect</td>
  170. <td>boolean</td>
  171. <td>No (default: <span class="code">false</span>)</td>
  172. <td>Whether to avoid following redirects if the given page is a redirect.</td>
  173. </tr>
  174. <tr>
  175. <td>noskip</td>
  176. <td>boolean</td>
  177. <td>No (default: <span class="code">false</span>)</td>
  178. <td>If a source found during a check has a sufficiently high confidence value, the check ends early and any pending URLs are skipped. Passing this option disables that behavior, resulting in complete (but more time-consuming) checks.</td>
  179. </tr>
  180. </table>
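<p>As a rough illustration of the parameter tables above, the following Python sketch assembles a request URL for <span class="code">compare</span> or <span class="code">search</span> mode. The helper name <span class="code">build_api_url</span> and its argument checks are assumptions made for this example, not part of the tool; only the endpoint and parameter names come from the tables.</p>

```python
from urllib.parse import urlencode

# Endpoint as given in the "Requests" section above.
API_ENDPOINT = "https://tools.wmflabs.org/copyvios/api.json"


def build_api_url(action, project, lang, title=None, oldid=None, url=None):
    """Return a GET URL for a compare or search request (hypothetical helper)."""
    if action not in ("compare", "search"):
        raise ValueError("action must be 'compare' or 'search'")
    if title is None and oldid is None:
        raise ValueError("either title or oldid is required")
    if action == "compare" and url is None:
        raise ValueError("compare mode requires url")

    # version is optional, but the documentation recommends sending it.
    params = [("version", 1), ("action", action),
              ("project", project), ("lang", lang)]
    # Send only one of oldid/title; the API uses oldid when both are given.
    if oldid is not None:
        params.append(("oldid", oldid))
    else:
        params.append(("title", title))
    if action == "compare":
        params.append(("url", url))
    # urlencode percent-encodes characters such as ":" and "/" in values.
    return API_ENDPOINT + "?" + urlencode(params)
```

<p>The optional <span class="code">search</span> flags (<span class="code">use_engine</span>, <span class="code">nocache</span>, and so on) are left out here because the tables do not say how boolean values should be spelled in the query string.</p>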
  181. <h2>Responses</h2>
  182. <p>The JSON response object always contains a <span class="code">status</span> key, whose value is either <span class="code">ok</span> or <span class="code">error</span>. If an error has occurred, the response will look like this:</p>
  183. <pre>{
  184. "status": "error",
  185. "error": {
  186. "code": <span class="resp-dtype">string</span> <span class="resp-desc">error code</span>,
  187. "info": <span class="resp-dtype">string</span> <span class="resp-desc">human-readable description of error</span>
  188. }
  189. }</pre>
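<p>A client might turn the error object above into an exception. The sketch below is one possible way to do that in Python; <span class="code">CopyviosAPIError</span> and <span class="code">parse_response</span> are hypothetical names, and the code relies only on the <span class="code">status</span>, <span class="code">error.code</span>, and <span class="code">error.info</span> keys described above.</p>

```python
import json


class CopyviosAPIError(Exception):
    """Raised when the API reports status "error" (hypothetical class)."""

    def __init__(self, code, info):
        super().__init__("{}: {}".format(code, info))
        self.code = code
        self.info = info


def parse_response(body):
    """Decode a JSON response body and raise if it describes an error."""
    data = json.loads(body)
    if data.get("status") == "error":
        error = data.get("error", {})
        raise CopyviosAPIError(error.get("code"), error.get("info"))
    return data
```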
  190. <p>Valid responses for <span class="code">action=compare</span> and <span class="code">action=search</span> are formatted like this:</p>
  191. <pre>{
  192. "status": "ok",
  193. "meta": {
  194. "time": <span class="resp-dtype">float</span> <span class="resp-desc">time to generate results, in seconds</span>,
  195. "queries": <span class="resp-dtype">int</span> <span class="resp-desc">number of search engine queries made</span>,
  196. "cached": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether these results are cached from an earlier search (always false in the case of action=compare)</span>,
  197. "redirected": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether a redirect was followed</span>,
  198. <span class="resp-cond">only if cached=true</span> "cache_time": <span class="resp-dtype">string</span> <span class="resp-desc">human-readable time of the original search that the results are cached from</span>
  199. },
  200. "page": {
  201. "title": <span class="resp-dtype">string</span> <span class="resp-desc">the normalized title of the page checked</span>,
  202. "url": <span class="resp-dtype">string</span> <span class="resp-desc">the full URL of the page checked</span>
  203. },
  204. <span class="resp-cond">only if redirected=true</span> "original_page": {
  205. "title": <span class="resp-dtype">string</span> <span class="resp-desc">the normalized title of the original page whose redirect was followed</span>,
  206. "url": <span class="resp-dtype">string</span> <span class="resp-desc">the full URL of the original page whose redirect was followed</span>
  207. },
  208. "best": {
  209. "url": <span class="resp-dtype">string</span> <span class="resp-desc">the URL of the best match found, or null if no matches were found</span>,
  210. "confidence": <span class="resp-dtype">float</span> <span class="resp-desc">the confidence of a violation in the best match, or 0.0 if no matches were found</span>,
  211. "violation": <span class="resp-dtype">string</span> <span class="resp-desc">one of "suspected", "possible", or "none"</span>
  212. },
  213. "sources": [
  214. {
  215. "url": <span class="resp-dtype">string</span> <span class="resp-desc">the URL of the source</span>,
  216. "confidence": <span class="resp-dtype">float</span> <span class="resp-desc">the confidence of a violation in the source</span>,
  217. "violation": <span class="resp-dtype">string</span> <span class="resp-desc">one of "suspected", "possible", or "none"</span>,
  218. "skipped": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether the source was skipped due to the check finishing early (see note about noskip above) or an exclusion</span>,
  219. "excluded": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether the source was skipped for being in the excluded URL list</span>
  220. },
  221. ...
  222. ]
  223. }</pre>
  224. <p>In the case of <span class="code">action=search</span>, <span class="code">sources</span> will contain one entry for each source checked (or skipped if the check ends early), sorted in order of confidence, with skipped and excluded sources at the bottom.</p>
  225. <p>In the case of <span class="code">action=compare</span>, <span class="code">best</span> will always contain information about the URL that was given, so <span class="code">response["best"]["url"]</span> will never be <span class="code">null</span>. Also, <span class="code">sources</span> will always contain one entry, with the same data as <span class="code">best</span>, since only one source is checked in comparison mode.</p>
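<p>To show how the <span class="code">sources</span> list might be consumed, here is a small Python sketch that keeps only the sources that were actually compared and then picks out the flagged URLs. The function names are made up for this example; the keys (<span class="code">skipped</span>, <span class="code">excluded</span>, <span class="code">violation</span>, <span class="code">url</span>) are the ones documented above.</p>

```python
def compared_sources(response):
    """Return sources that were neither skipped nor excluded."""
    return [source for source in response.get("sources", [])
            if not source["skipped"] and not source["excluded"]]


def flagged_urls(response, levels=("suspected", "possible")):
    """Return URLs of compared sources whose violation level is in levels."""
    return [source["url"] for source in compared_sources(response)
            if source["violation"] in levels]
```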
  226. <p>Valid responses for <span class="code">action=sites</span> are formatted like this:</p>
  227. <pre>{
  228. "status": "ok",
  229. "langs": [
  230. [
  231. <span class="resp-dtype">string</span> <span class="resp-desc">language code</span>,
  232. <span class="resp-dtype">string</span> <span class="resp-desc">human-readable language name</span>
  233. ],
  234. ...
  235. ],
  236. "projects": [
  237. [
  238. <span class="resp-dtype">string</span> <span class="resp-desc">project code</span>,
  239. <span class="resp-dtype">string</span> <span class="resp-desc">human-readable project name</span>
  240. ],
  241. ...
  242. ]
  243. }</pre>
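<p>Because <span class="code">langs</span> and <span class="code">projects</span> are lists of two-element lists, a client may find it convenient to convert them to dictionaries keyed by code. A minimal Python sketch, with <span class="code">site_lookup</span> as a hypothetical helper name:</p>

```python
def site_lookup(response):
    """Map language and project codes to their human-readable names."""
    langs = {code: name for code, name in response["langs"]}
    projects = {code: name for code, name in response["projects"]}
    return langs, projects
```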
  244. <h2>Caveats</h2>
  245. <ul>
  246. <li>There is currently no way to get the contents of the article or suspected source, nor can you get the data behind the visual comparison available from the main tool. This may be changed in a future version if there is sufficient demand for it.</li>
  247. <li>Requests are typically not rate-limited, but the tool uses the same workers to handle all requests, so concurrent API calls will only slow you down. In general, you may make as many requests as you like, provided they are not concurrent and you wait a few seconds between them.</li>
  248. </ul>
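<p>The advice about avoiding concurrent calls could be followed with a loop like the Python sketch below. The name <span class="code">run_sequentially</span>, the injected <span class="code">fetch</span> callable, and the three-second default are assumptions for illustration; the caveat above only says "a few seconds".</p>

```python
import time


def run_sequentially(urls, fetch, delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL in turn, pausing between calls.

    fetch is any callable that performs one request and returns its result.
    delay is an assumed pause in seconds, not a documented limit.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first.
            sleep(delay)
        results.append(fetch(url))
    return results
```

<p>Passing <span class="code">fetch</span> and <span class="code">sleep</span> as arguments keeps the loop independent of any particular HTTP library and makes it easy to exercise without network access.</p>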
  249. <h2>Example</h2>
  250. <p><a class="no-color" href="https://tools.wmflabs.org/copyvios/api.json?version=1&amp;action=search&amp;project=wikipedia&amp;lang=en&amp;title=User:EarwigBot/Copyvios/Tests/2"><span class="code">https://tools.wmflabs.org/copyvios/api.json?<span class="param-key">version</span>=<span class="param-val">1</span>&amp;<span class="param-key">action</span>=<span class="param-val">search</span>&amp;<span class="param-key">project</span>=<span class="param-val">wikipedia</span>&amp;<span class="param-key">lang</span>=<span class="param-val">en</span>&amp;<span class="param-key">title</span>=<span class="param-val">User:EarwigBot/Copyvios/Tests/2</span></span></a></p>
  251. <pre>{
  252. "status": "ok",
  253. "meta": {
  254. "time": 2.2474379539489746,
  255. "queries": 1,
  256. "cached": false,
  257. "redirected": false
  258. },
  259. "page": {
  260. "title": "User:EarwigBot/Copyvios/Tests/2",
  261. "url": "https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Tests/2"
  262. },
  263. "best": {
  264. "url": "http://www.whitehouse.gov/administration/president-obama/",
  265. "confidence": 0.9886608511242603,
  266. "violation": "suspected"
  267. },
  268. "sources": [
  269. {
  270. "url": "http://www.whitehouse.gov/administration/president-obama/",
  271. "confidence": 0.9886608511242603,
  272. "violation": "suspected",
  273. "skipped": false,
  274. "excluded": false
  275. },
  276. {
  277. "url": "http://maige2009.blogspot.com/2013/07/barack-h-obama-is-44th-president-of.html",
  278. "confidence": 0.9864798816568047,
  279. "violation": "suspected",
  280. "skipped": false,
  281. "excluded": false
  282. },
  283. {
  284. "url": "http://jeuxdemonstre-apkdownload.rhcloud.com/luo-people-of-kenya-and-tanzania---wikipedia--the-free",
  285. "confidence": 0.0,
  286. "violation": "none",
  287. "skipped": false,
  288. "excluded": false
  289. },
  290. {
  291. "url": "http://www.whitehouse.gov/about/presidents/barackobama",
  292. "confidence": 0.0,
  293. "violation": "none",
  294. "skipped": true,
  295. "excluded": false
  296. },
  297. {
  298. "url": "http://jeuxdemonstre-apkdownload.rhcloud.com/president-barack-obama---the-white-house",
  299. "confidence": 0.0,
  300. "violation": "none",
  301. "skipped": true,
  302. "excluded": false
  303. }
  304. ]
  305. }
  306. </pre>
  307. </div>
  308. % endif
  309. % if result:
  310. <div id="result">
  311. <p>You are using <span class="code">jsonfm</span> output mode, which renders JSON data as a formatted HTML document. This is intended for testing and debugging only.</p>
  312. <div class="json">
  313. ${walk_json(result)}
  314. </div>
  315. </div>
  316. % endif
  317. </body>
  318. </html>