A copyright violation detector running on Wikimedia Cloud Services https://tools.wmflabs.org/copyvios/

  1. <%!
  2. from json import dumps
  3. from flask import url_for
  4. %>\
  5. <%def name="do_indent(size)">
  6. <br />
  7. % for _ in range(size):
  8. <div class="indent"></div>
  9. % endfor
  10. </%def>\
  11. <%def name="walk_json(obj, indent=0)">
  12. % if isinstance(obj, dict):
  13. {
  14. % for key in obj:
  15. ${do_indent(indent + 1)}
  16. "${key | h}": ${walk_json(obj[key], indent + 1)}${"," if not loop.last else ""}
  17. % endfor
  18. ${do_indent(indent)}
  19. }
  20. % elif isinstance(obj, (list, tuple, set)):
  21. [
  22. % for elem in obj:
  23. ${do_indent(indent + 1)}
  24. ${walk_json(elem, indent + 1)}${"," if not loop.last else ""}
  25. % endfor
  26. ${do_indent(indent)}
  27. ]
  28. % else:
  29. ${dumps(obj) | h}
  30. % endif
  31. </%def>\
  32. <!DOCTYPE html>
  33. <html lang="en">
  34. <head>
  35. <meta charset="utf-8">
  36. <title>API &ndash; Earwig's Copyvio Detector</title>
  37. <link rel="stylesheet" href="${request.script_root}${url_for('static', file='api.min.css')}" type="text/css" />
  38. </head>
  39. <body>
  40. % if help:
  41. <div id="help">
  42. <h1>Copyvio Detector API</h1>
  43. <p>This is the first version of the <a href="https://en.wikipedia.org/wiki/Application_programming_interface">API</a> for <a href="${request.script_root}/">Earwig's Copyvio Detector</a>. It works, but some bugs might still need to be ironed out, so please <a href="https://github.com/earwig/copyvios/issues">report any</a> if you see them.</p>
  44. <h2>Requests</h2>
  45. <p>The API responds to GET requests made to <span class="code">https://copyvios.toolforge.org/api.json</span>. Parameters are described in the tables below:</p>
  46. <table class="parameters">
  47. <tr>
  48. <th colspan="4">Always</th>
  49. </tr>
  50. <tr>
  51. <th>Parameter</th>
  52. <th>Values</th>
  53. <th>Required?</th>
  54. <th>Description</th>
  55. </tr>
  56. <tr>
  57. <td>action</td>
  58. <td><span class="code">compare</span>, <span class="code">search</span>, <span class="code">sites</span></td>
  59. <td>Yes</td>
  60. <td>The API will do URL comparisons in <span class="code">compare</span> mode, run full copyvio checks in <span class="code">search</span> mode, and list all known site languages and projects in <span class="code">sites</span> mode.</td>
  61. </tr>
  62. <tr>
  63. <td>format</td>
  64. <td><span class="code">json</span>, <span class="code">jsonfm</span></td>
  65. <td>No&nbsp;(default:&nbsp;<span class="code">json</span>)</td>
  66. <td>The default output format is <a href="https://www.json.org/">JSON</a>. <span class="code">jsonfm</span> mode produces the same output, but renders it as a formatted HTML document for debugging.</td>
  67. </tr>
  68. <tr>
  69. <td>version</td>
  70. <td>integer</td>
  71. <td>No (default: <span class="code">1</span>)</td>
  72. <td>Currently, the API only has one version. You can skip this parameter, but it is recommended to include it for forward compatibility.</td>
  73. </tr>
  74. </table>
  75. <table class="parameters">
  76. <tr>
  77. <th colspan="4"><span class="code">compare</span> Mode</th>
  78. </tr>
  79. <tr>
  80. <th>Parameter</th>
  81. <th>Values</th>
  82. <th>Required?</th>
  83. <th>Description</th>
  84. </tr>
  85. <tr>
  86. <td>project</td>
  87. <td>string</td>
  88. <td>Yes</td>
  89. <td>The project code of the site the page lives on. Examples are <span class="code">wikipedia</span> and <span class="code">wiktionary</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
  90. </tr>
  91. <tr>
  92. <td>lang</td>
  93. <td>string</td>
  94. <td>Yes</td>
  95. <td>The language code of the site the page lives on. Examples are <span class="code">en</span> and <span class="code">de</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
  96. </tr>
  97. <tr>
  98. <td>title</td>
  99. <td>string</td>
  100. <td>Yes&nbsp;(either&nbsp;<span class="code">title</span>&nbsp;or&nbsp;<span class="code">oldid</span>)</td>
  101. <td>The title of the page or article to make a comparison against. Namespace must be included if the page isn't in the mainspace.</td>
  102. </tr>
  103. <tr>
  104. <td>oldid</td>
  105. <td>integer</td>
  106. <td>Yes (either <span class="code">title</span> or <span class="code">oldid</span>)</td>
  107. <td>The revision ID (also called oldid) of the page revision to make a comparison against. If both a title and oldid are given, the oldid will be used.</td>
  108. </tr>
  109. <tr>
  110. <td>url</td>
  111. <td>string</td>
  112. <td>Yes</td>
  113. <td>The URL of the suspected violation source that will be compared to the page.</td>
  114. </tr>
  115. <tr>
  116. <td>detail</td>
  117. <td>boolean</td>
  118. <td>No (default: <span class="code">false</span>)</td>
  119. <td>Whether to include the detailed HTML text comparison available in the regular interface. If omitted or false, the response carries only the confidence values, without the marked-up text.</td>
  120. </tr>
  121. </table>
  122. <table class="parameters">
  123. <tr>
  124. <th colspan="4"><span class="code">search</span> Mode</th>
  125. </tr>
  126. <tr>
  127. <th>Parameter</th>
  128. <th>Values</th>
  129. <th>Required?</th>
  130. <th>Description</th>
  131. </tr>
  132. <tr>
  133. <td>project</td>
  134. <td>string</td>
  135. <td>Yes</td>
  136. <td>The project code of the site the page lives on. Examples are <span class="code">wikipedia</span> and <span class="code">wiktionary</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
  137. </tr>
  138. <tr>
  139. <td>lang</td>
  140. <td>string</td>
  141. <td>Yes</td>
  142. <td>The language code of the site the page lives on. Examples are <span class="code">en</span> and <span class="code">de</span>. A list of acceptable values can be retrieved using <span class="code">action=sites</span>.</td>
  143. </tr>
  144. <tr>
  145. <td>title</td>
  146. <td>string</td>
  147. <td>Yes&nbsp;(either&nbsp;<span class="code">title</span>&nbsp;or&nbsp;<span class="code">oldid</span>)</td>
  148. <td>The title of the page or article to make a check against. Namespace must be included if the page isn't in the mainspace.</td>
  149. </tr>
  150. <tr>
  151. <td>oldid</td>
  152. <td>integer</td>
  153. <td>Yes (either <span class="code">title</span> or <span class="code">oldid</span>)</td>
  154. <td>The revision ID (also called oldid) of the page revision to make a check against. If both a title and oldid are given, the oldid will be used.</td>
  155. </tr>
  156. <tr>
  157. <td>use_engine</td>
  158. <td>boolean</td>
  159. <td>No (default: <span class="code">true</span>)</td>
  160. <td>Whether to use a search engine (<a href="https://developers.google.com/custom-search/">Google</a>) as a source of URLs to compare against the page.</td>
  161. </tr>
  162. <tr>
  163. <td>use_links</td>
  164. <td>boolean</td>
  165. <td>No (default: <span class="code">true</span>)</td>
  166. <td>Whether to compare the page against external links found in its wikitext.</td>
  167. </tr>
  168. <tr>
  169. <td>nocache</td>
  170. <td>boolean</td>
  171. <td>No (default: <span class="code">false</span>)</td>
  172. <td>Whether to bypass search results cached from previous checks. It is recommended that you don't pass this option unless a user specifically asks for it.</td>
  173. </tr>
  174. <tr>
  175. <td>noredirect</td>
  176. <td>boolean</td>
  177. <td>No (default: <span class="code">false</span>)</td>
  178. <td>Whether to avoid following redirects if the given page is a redirect.</td>
  179. </tr>
  180. <tr>
  181. <td>noskip</td>
  182. <td>boolean</td>
  183. <td>No (default: <span class="code">false</span>)</td>
  184. <td>If a suspected source is found during a check to have a sufficiently high confidence value, the check will end prematurely, and other pending URLs will be skipped. Passing this option will prevent this behavior, resulting in complete (but more time-consuming) checks.</td>
  185. </tr>
  186. </table>
  187. <h2>Responses</h2>
  188. <p>The JSON response object always contains a <span class="code">status</span> key, whose value is either <span class="code">ok</span> or <span class="code">error</span>. If an error has occurred, the response will look like this:</p>
  189. <pre>{
  190. "status": "error",
  191. "error": {
  192. "code": <span class="resp-dtype">string</span> <span class="resp-desc">error code</span>,
  193. "info": <span class="resp-dtype">string</span> <span class="resp-desc">human-readable description of error</span>
  194. }
  195. }</pre>
  196. <p>Valid responses for <span class="code">action=compare</span> and <span class="code">action=search</span> are formatted like this:</p>
  197. <pre>{
  198. "status": "ok",
  199. "meta": {
  200. "time": <span class="resp-dtype">float</span> <span class="resp-desc">time to generate results, in seconds</span>,
  201. "queries": <span class="resp-dtype">int</span> <span class="resp-desc">number of search engine queries made</span>,
  202. "cached": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether these results are cached from an earlier search (always false in the case of action=compare)</span>,
  203. "redirected": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether a redirect was followed</span>,
  204. <span class="resp-cond">only if cached=true</span> "cache_time": <span class="resp-dtype">string</span> <span class="resp-desc">human-readable time of the original search that the results are cached from</span>
  205. },
  206. "page": {
  207. "title": <span class="resp-dtype">string</span> <span class="resp-desc">the normalized title of the page checked</span>,
  208. "url": <span class="resp-dtype">string</span> <span class="resp-desc">the full URL of the page checked</span>
  209. },
  210. <span class="resp-cond">only if redirected=true</span> "original_page": {
  211. "title": <span class="resp-dtype">string</span> <span class="resp-desc">the normalized title of the original page whose redirect was followed</span>,
  212. "url": <span class="resp-dtype">string</span> <span class="resp-desc">the full URL of the original page whose redirect was followed</span>
  213. },
  214. "best": {
  215. "url": <span class="resp-dtype">string</span> <span class="resp-desc">the URL of the best match found, or null if no matches were found</span>,
  216. "confidence": <span class="resp-dtype">float</span> <span class="resp-desc">the confidence of a violation in the best match, or 0.0 if no matches were found</span>,
  217. "violation": <span class="resp-dtype">string</span> <span class="resp-desc">one of "suspected", "possible", or "none"</span>
  218. },
  219. "sources": [
  220. {
  221. "url": <span class="resp-dtype">string</span> <span class="resp-desc">the URL of the source</span>,
  222. "confidence": <span class="resp-dtype">float</span> <span class="resp-desc">the confidence of a violation in the source</span>,
  223. "violation": <span class="resp-dtype">string</span> <span class="resp-desc">one of "suspected", "possible", or "none"</span>,
  224. "skipped": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether the source was skipped due to the check finishing early (see note about noskip above) or an exclusion</span>,
  225. "excluded": <span class="resp-dtype">boolean</span> <span class="resp-desc">whether the source was skipped for being in the excluded URL list</span>
  226. },
  227. ...
  228. ],
  229. <span class="resp-cond">only if action=compare and detail=true</span> "detail": {
  230. "article": <span class="resp-dtype">string</span> <span class="resp-desc">article text, with shared passages marked with HTML</span>,
  231. "source": <span class="resp-dtype">string</span> <span class="resp-desc">source text, with shared passages marked with HTML</span>
  232. }
  233. }</pre>
  234. <p>In the case of <span class="code">action=search</span>, <span class="code">sources</span> will contain one entry for each source checked (or skipped if the check ends early), sorted in descending order of confidence, with skipped and excluded sources at the bottom.</p>
  235. <p>In the case of <span class="code">action=compare</span>, <span class="code">best</span> will always contain information about the URL that was given, so <span class="code">response["best"]["url"]</span> will never be <span class="code">null</span>. Also, <span class="code">sources</span> will always contain one entry, with the same data as <span class="code">best</span>, since only one source is checked in comparison mode.</p>
  236. <p>Valid responses for <span class="code">action=sites</span> are formatted like this:</p>
  237. <pre>{
  238. "status": "ok",
  239. "langs": [
  240. [
  241. <span class="resp-dtype">string</span> <span class="resp-desc">language code</span>,
  242. <span class="resp-dtype">string</span> <span class="resp-desc">human-readable language name</span>
  243. ],
  244. ...
  245. ],
  246. "projects": [
  247. [
  248. <span class="resp-dtype">string</span> <span class="resp-desc">project code</span>,
  249. <span class="resp-dtype">string</span> <span class="resp-desc">human-readable project name</span>
  250. ],
  251. ...
  252. ]
  253. }</pre>
  254. <h2>Etiquette</h2>
  255. <p>The tool uses the same workers to handle all requests, so making concurrent API calls will only slow you down. Most operations are not rate-limited, but full searches with <span class="code">use_engine=true</span> are globally limited to around a thousand per day. Be respectful!</p>
  256. <h2>Example</h2>
  257. <p><a class="no-color" href="https://copyvios.toolforge.org/api.json?version=1&amp;action=search&amp;project=wikipedia&amp;lang=en&amp;title=User:EarwigBot/Copyvios/Tests/2"><span class="code">https://copyvios.toolforge.org/api.json?<span class="param-key">version</span>=<span class="param-val">1</span>&amp;<span class="param-key">action</span>=<span class="param-val">search</span>&amp;<span class="param-key">project</span>=<span class="param-val">wikipedia</span>&amp;<span class="param-key">lang</span>=<span class="param-val">en</span>&amp;<span class="param-key">title</span>=<span class="param-val">User:EarwigBot/Copyvios/Tests/2</span></span></a></p>
  258. <pre>{
  259. "status": "ok",
  260. "meta": {
  261. "time": 2.2474379539489746,
  262. "queries": 1,
  263. "cached": false,
  264. "redirected": false
  265. },
  266. "page": {
  267. "title": "User:EarwigBot/Copyvios/Tests/2",
  268. "url": "https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Tests/2"
  269. },
  270. "best": {
  271. "url": "http://www.whitehouse.gov/administration/president-obama/",
  272. "confidence": 0.9886608511242603,
  273. "violation": "suspected"
  274. },
  275. "sources": [
  276. {
  277. "url": "http://www.whitehouse.gov/administration/president-obama/",
  278. "confidence": 0.9886608511242603,
  279. "violation": "suspected",
  280. "skipped": false,
  281. "excluded": false
  282. },
  283. {
  284. "url": "http://maige2009.blogspot.com/2013/07/barack-h-obama-is-44th-president-of.html",
  285. "confidence": 0.9864798816568047,
  286. "violation": "suspected",
  287. "skipped": false,
  288. "excluded": false
  289. },
  290. {
  291. "url": "http://jeuxdemonstre-apkdownload.rhcloud.com/luo-people-of-kenya-and-tanzania---wikipedia--the-free",
  292. "confidence": 0.0,
  293. "violation": "none",
  294. "skipped": false,
  295. "excluded": false
  296. },
  297. {
  298. "url": "http://www.whitehouse.gov/about/presidents/barackobama",
  299. "confidence": 0.0,
  300. "violation": "none",
  301. "skipped": true,
  302. "excluded": false
  303. },
  304. {
  305. "url": "http://jeuxdemonstre-apkdownload.rhcloud.com/president-barack-obama---the-white-house",
  306. "confidence": 0.0,
  307. "violation": "none",
  308. "skipped": true,
  309. "excluded": false
  310. }
  311. ]
  312. }
  313. </pre>
  314. </div>
  315. % endif
  316. % if result:
  317. <div id="result">
  318. <p>You are using <span class="code">jsonfm</span> output mode, which renders JSON data as a formatted HTML document. This is intended for testing and debugging only.</p>
  319. <div class="json">
  320. ${walk_json(result)}
  321. </div>
  322. </div>
  323. % endif
  324. </body>
  325. </html>
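
The request and response formats documented in the template above can be exercised from a small client. The following is a minimal sketch rather than an official client: the endpoint and parameter names come from the help text, while `build_url`, `parse_response`, `search`, and `CopyviosError` are hypothetical helper names, and sending booleans as the lowercase strings `true`/`false` is an assumption based on the defaults shown in the parameter tables.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as given in the "Requests" section of the help text.
API_URL = "https://copyvios.toolforge.org/api.json"


class CopyviosError(Exception):
    """Raised when the API reports "status": "error" (hypothetical helper)."""


def build_url(action, **params):
    """Return the GET URL for one API call.

    Assumption: booleans are sent as lowercase "true"/"false".
    """
    query = {"version": 1, "action": action}
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[key] = value
    return API_URL + "?" + urlencode(query)


def parse_response(data):
    """Check the status key and return (best url, confidence, violation)."""
    if data.get("status") != "ok":
        error = data.get("error", {})
        raise CopyviosError("%s: %s" % (error.get("code"), error.get("info")))
    best = data["best"]
    return best["url"], best["confidence"], best["violation"]


def search(title, project="wikipedia", lang="en", **options):
    """Run one full check for a page title. Performs a network request."""
    url = build_url("search", project=project, lang=lang, title=title, **options)
    with urlopen(url) as resp:
        return parse_response(json.load(resp))
```

A call such as `search("User:EarwigBot/Copyvios/Tests/2")` would issue the same request as the example in the help text. In line with the Etiquette section, calls are best made one at a time rather than concurrently.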