<%!
    from json import dumps
    from flask import url_for
%>\
<%def name="do_indent(size)">
    % for i in xrange(size):
    % endfor
</%def>\
<%def name="walk_json(obj, indent=0)">
    % if isinstance(obj, type({})):
        {
        % for key in obj:
            ${do_indent(indent + 1)}
            "${key | h}": ${walk_json(obj[key], indent + 1)}${"," if not loop.last else ""}
        % endfor
        ${do_indent(indent)}
        }
    % elif isinstance(obj, (list, tuple, set)):
        [
        % for elem in obj:
            ${do_indent(indent + 1)}
            ${walk_json(elem, indent + 1)}${"," if not loop.last else ""}
        % endfor
        ${do_indent(indent)}
        ]
    % else:
        ${dumps(obj) | h}
    % endif
</%def>\
API – Earwig's Copyvio Detector
% if help:

Copyvio Detector API

This is the first version of the API for Earwig's Copyvio Detector. Please report any issues you encounter.

Requests

The API responds to GET requests made to https://copyvios.toolforge.org/api.json. Parameters are described in the tables below:

Always
| Parameter | Values | Required? | Description |
| --- | --- | --- | --- |
| action | compare, search, sites | Yes | The API will do URL comparisons in compare mode, run full copyvio checks in search mode, and list all known site languages and projects in sites mode. |
| format | json, jsonfm | No (default: json) | The default output format is JSON. jsonfm mode produces the same output, but renders it as a formatted HTML document for debugging. |
| version | integer | No (default: 1) | Currently, the API only has one version. You can skip this parameter, but it is recommended to include it for forward compatibility. |
compare Mode
| Parameter | Values | Required? | Description |
| --- | --- | --- | --- |
| project | string | Yes | The project code of the site the page lives on. Examples are wikipedia and wiktionary. A list of acceptable values can be retrieved using action=sites. |
| lang | string | Yes | The language code of the site the page lives on. Examples are en and de. A list of acceptable values can be retrieved using action=sites. |
| title | string | Yes (either title or oldid) | The title of the page or article to make a comparison against. Namespace must be included if the page isn't in the mainspace. |
| oldid | integer | Yes (either title or oldid) | The revision ID (also called oldid) of the page revision to make a comparison against. If both a title and oldid are given, the oldid will be used. |
| url | string | Yes | The URL of the suspected violation source that will be compared to the page. |
| detail | boolean | No (default: false) | Whether to include the detailed HTML text comparison available in the regular interface. If not, only the similarity percentage is available. |
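
As a rough illustration of the compare-mode parameters, the following Python sketch assembles a request URL. The helper name build_compare_url is hypothetical, and writing booleans as a lowercase "true" is an assumption rather than something this page specifies.

```python
from urllib.parse import urlencode

API_URL = "https://copyvios.toolforge.org/api.json"


def build_compare_url(project, lang, url, title=None, oldid=None, detail=False):
    """Return a compare-mode request URL (hypothetical helper)."""
    if title is None and oldid is None:
        raise ValueError("either title or oldid is required")
    params = {
        "version": 1,
        "action": "compare",
        "project": project,
        "lang": lang,
        "url": url,
    }
    # The API uses oldid when both are given, so only one of them is sent.
    if oldid is not None:
        params["oldid"] = int(oldid)
    else:
        params["title"] = title
    if detail:
        # Assumption: boolean values are written as lowercase "true".
        params["detail"] = "true"
    return API_URL + "?" + urlencode(params)
```

Sending only oldid when both values are available mirrors the documented precedence and keeps the request unambiguous.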
search Mode
| Parameter | Values | Required? | Description |
| --- | --- | --- | --- |
| project | string | Yes | The project code of the site the page lives on. Examples are wikipedia and wiktionary. A list of acceptable values can be retrieved using action=sites. |
| lang | string | Yes | The language code of the site the page lives on. Examples are en and de. A list of acceptable values can be retrieved using action=sites. |
| title | string | Yes (either title or oldid) | The title of the page or article to make a check against. Namespace must be included if the page isn't in the mainspace. |
| oldid | integer | Yes (either title or oldid) | The revision ID (also called oldid) of the page revision to make a check against. If both a title and oldid are given, the oldid will be used. |
| use_engine | boolean | No (default: true) | Whether to use a search engine (Google) as a source of URLs to compare against the page. |
| use_links | boolean | No (default: true) | Whether to compare the page against external links found in its wikitext. |
| nocache | boolean | No (default: false) | Whether to bypass search results cached from previous checks. It is recommended that you don't pass this option unless a user specifically asks for it. |
| noredirect | boolean | No (default: false) | Whether to avoid following redirects if the given page is a redirect. |
| noskip | boolean | No (default: false) | If a suspected source is found during a check to have a sufficiently high similarity value, the check will end prematurely, and other pending URLs will be skipped. Passing this option will prevent this behavior, resulting in complete (but more time-consuming) checks. |
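
The search-mode options could be handled along the following lines in Python. This is a sketch only: build_search_url and SEARCH_DEFAULTS are hypothetical names, and the "true"/"false" spelling of boolean values is an assumption. To sidestep that uncertainty where possible, the helper sends an option only when it differs from the documented default.

```python
from urllib.parse import urlencode

API_URL = "https://copyvios.toolforge.org/api.json"

# Defaults as listed in the search-mode parameter table.
SEARCH_DEFAULTS = {
    "use_engine": True,
    "use_links": True,
    "nocache": False,
    "noredirect": False,
    "noskip": False,
}


def build_search_url(project, lang, title, **flags):
    """Return a search-mode request URL (hypothetical helper)."""
    unknown = sorted(set(flags) - set(SEARCH_DEFAULTS))
    if unknown:
        raise ValueError(f"unknown options: {unknown}")
    params = {
        "version": 1,
        "action": "search",
        "project": project,
        "lang": lang,
        "title": title,
    }
    for name, value in flags.items():
        # Leave out anything that matches the default.
        if bool(value) != SEARCH_DEFAULTS[name]:
            # Assumption: booleans are spelled "true" / "false".
            params[name] = "true" if value else "false"
    return API_URL + "?" + urlencode(params)
```

Leaving nocache out unless it is explicitly requested also follows the recommendation in the table above.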

Responses

The JSON response object always contains a status key, whose value is either ok or error. If an error has occurred, the response will look like this:

{
    "status": "error",
    "error": {
        "code": string error code,
        "info": string human-readable description of error
    }
}
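
A client might turn the error shape above into an exception. The sketch below is one way to do that in Python; CopyvioAPIError and parse_response are hypothetical names, not part of the tool.

```python
import json


class CopyvioAPIError(Exception):
    """Raised when the API reports status "error" (hypothetical class)."""

    def __init__(self, code, info):
        super().__init__(f"{code}: {info}")
        self.code = code
        self.info = info


def parse_response(text):
    """Decode a JSON response body and raise if it describes an error."""
    data = json.loads(text)
    if data.get("status") == "error":
        err = data.get("error", {})
        raise CopyvioAPIError(err.get("code"), err.get("info"))
    return data
```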

Valid responses for action=compare and action=search are formatted like this:

{
    "status": "ok",
    "meta": {
        "time":       float time to generate results, in seconds,
        "queries":    int number of search engine queries made,
        "cached":     boolean whether these results are cached from an earlier search (always false in the case of action=compare),
        "redirected": boolean whether a redirect was followed,
        only if cached=true "cache_time": string human-readable time of the original search that the results are cached from
    },
    "page": {
        "title": string the normalized title of the page checked,
        "url":   string the full URL of the page checked
    },
    only if redirected=true "original_page": {
        "title": string the normalized title of the original page whose redirect was followed,
        "url":   string the full URL of the original page whose redirect was followed
    },
    "best": {
        "url":        string the URL of the best match found, or null if no matches were found,
        "confidence": float the similarity of the best match to the page checked as a ratio between 0.0 and 1.0, or 0.0 if no matches were found,
        "violation":  string one of "suspected", "possible", or "none"
    },
    "sources": [
        {
            "url":        string the URL of the source,
            "confidence": float the similarity of the source to the page checked as a ratio between 0.0 and 1.0,
            "violation":  string one of "suspected", "possible", or "none",
            "skipped":    boolean whether the source was skipped due to the check finishing early (see note about noskip above) or an exclusion,
            "excluded":   boolean whether the source was skipped for being in the excluded URL list
        },
        ...
    ],
    only if action=compare and detail=true "detail": {
        "article": string article text, with shared passages marked with HTML,
        "source":  string source text, with shared passages marked with HTML
    }
}

In the case of action=search, sources will contain one entry for each source checked (or skipped if the check ends early), sorted by similarity, with skipped and excluded sources at the bottom.

In the case of action=compare, best will always contain information about the URL that was given, so response["best"]["url"] will never be null. Also, sources will always contain one entry, with the same data as best, since only one source is checked in comparison mode.
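
To show how the fields described above fit together, here is a small Python sketch that reduces a decoded response to its best match and the sources that were actually compared. The function name summarize and the keys of the dictionary it returns are illustrative choices, not part of the API.

```python
def summarize(response):
    """Condense a decoded compare/search response (hypothetical helper)."""
    compared = [
        (src["url"], src["confidence"])
        for src in response["sources"]
        if not src["skipped"] and not src["excluded"]
    ]
    best = response["best"]
    return {
        "best_url": best["url"],  # can be None for action=search
        "verdict": best["violation"],
        "compared": compared,
        # Sources that were skipped or excluded, so never compared.
        "not_compared": len(response["sources"]) - len(compared),
    }
```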

Valid responses for action=sites are formatted like this:

{
    "status": "ok",
    "langs": [
        [
            string language code,
            string human-readable language name
        ],
        ...
    ],
    "projects": [
        [
            string project code,
            string human-readable project name
        ],
        ...
    ]
}
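
Because langs and projects arrive as lists of [code, name] pairs, a client will often want them as lookup tables. The Python sketch below does that conversion; the function names are hypothetical.

```python
def sites_to_dicts(response):
    """Map codes to human-readable names for a decoded action=sites response."""
    langs = {code: name for code, name in response["langs"]}
    projects = {code: name for code, name in response["projects"]}
    return langs, projects


def is_known_site(response, lang, project):
    """Check a lang/project pair before using it in a compare or search call."""
    langs, projects = sites_to_dicts(response)
    return lang in langs and project in projects
```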

Etiquette

The tool uses the same workers to handle all requests, so making concurrent API calls is only going to slow you down. Most operations are not rate-limited, but full searches with use_engine=True are globally limited to around a thousand per day. Be respectful!

Aside from testing, you must set a reasonable user agent that identifies your bot and gives some way to contact you. You may be blocked if you use an improper user agent (for example, the default user agent set by your HTTP library), or if your bot makes requests too frequently.
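
A minimal Python sketch of these two points, using only the standard library: every request carries an identifying user agent, and requests are sent one after another rather than concurrently. The user agent string is a placeholder to replace with your own bot name and contact details, and make_request and fetch_all are hypothetical names.

```python
import json
import urllib.request

# Placeholder: substitute your own bot name and contact information.
USER_AGENT = "ExampleCopyvioBot/0.1 (https://example.org/bot; bot-owner@example.org)"


def make_request(url):
    """Build a request that identifies the client instead of the library default."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls, opener=urllib.request.urlopen):
    """Fetch each URL in turn and return the decoded JSON responses."""
    results = []
    for url in urls:  # strictly sequential: one request at a time
        with opener(make_request(url)) as resp:
            results.append(json.load(resp))
    return results
```

The opener argument exists only so the function can be exercised without network access; by default it is urllib.request.urlopen.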

Example

https://copyvios.toolforge.org/api.json?version=1&action=search&project=wikipedia&lang=en&title=User:EarwigBot/Copyvios/Tests/2

{
    "status": "ok",
    "meta": {
        "time": 2.2474379539489746,
        "queries": 1,
        "cached": false,
        "redirected": false
    },
    "page": {
        "title": "User:EarwigBot/Copyvios/Tests/2",
        "url": "https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Tests/2"
    },
    "best": {
        "url": "http://www.whitehouse.gov/administration/president-obama/",
        "confidence": 0.9886608511242603,
        "violation": "suspected"
    },
    "sources": [
        {
            "url": "http://www.whitehouse.gov/administration/president-obama/",
            "confidence": 0.9886608511242603,
            "violation": "suspected",
            "skipped": false,
            "excluded": false
        },
        {
            "url": "http://maige2009.blogspot.com/2013/07/barack-h-obama-is-44th-president-of.html",
            "confidence": 0.9864798816568047,
            "violation": "suspected",
            "skipped": false,
            "excluded": false
        },
        {
            "url": "http://jeuxdemonstre-apkdownload.rhcloud.com/luo-people-of-kenya-and-tanzania---wikipedia--the-free",
            "confidence": 0.0,
            "violation": "none",
            "skipped": false,
            "excluded": false
        },
        {
            "url": "http://www.whitehouse.gov/about/presidents/barackobama",
            "confidence": 0.0,
            "violation": "none",
            "skipped": true,
            "excluded": false
        },
        {
            "url": "http://jeuxdemonstre-apkdownload.rhcloud.com/president-barack-obama---the-white-house",
            "confidence": 0.0,
            "violation": "none",
            "skipped": true,
            "excluded": false
        }
    ]
}
% endif
% if result:

You are using jsonfm output mode, which renders JSON data as a formatted HTML document. This is intended for testing and debugging only.

${walk_json(result)}
% endif