diff --git a/README.rst b/README.rst index 6fd3be5..6316ed9 100644 --- a/README.rst +++ b/README.rst @@ -115,36 +115,47 @@ Likewise, use ``unicode(code)`` in Python 2. Limitations ----------- -While the MediaWiki parser generates HTML, mwparserfromhell acts as an interface to -the source code. mwparserfromhell therefore is unaware of template definitions since -if it would substitute templates with their output you would no longer be working -with the source code. This has several implications: -* Start and end tags generated by templates aren't recognized e.g. ``<b>foobar{{bold-end}}``. +While the MediaWiki parser generates HTML and has access to the contents of +templates, among other things, mwparserfromhell acts as a direct interface to +the source code only. This has several implications: -* Templates adjacent to external links e.g. ``http://example.com{{foo}}`` are - considered part of the link. +* Syntax elements produced by a template transclusion cannot be detected. For + example, imagine a hypothetical page ``"Template:End-bold"`` that contained + the text ``</b>``. While MediaWiki would correctly understand that + ``<b>foobar{{end-bold}}`` translates to ``<b>foobar</b>``, mwparserfromhell + has no way of examining the contents of ``{{end-bold}}``. Instead, it would + treat the bold tag as unfinished, possibly extending further down the page. -* Crossed constructs like ``{{echo|''Hello}}, world!''`` are not supported, - the first node is treated as plain text. +* Templates adjacent to external links, as in ``http://example.com{{foo}}``, + are considered part of the link. In reality, this would depend on the + contents of the template. - The current workaround for cases where you are not interested in text - formatting is to pass ``skip_style_tags=True`` to ``mwparserfromhell.parse()``. - This treats ``''`` and ``'''`` like plain text. 
+* When different syntax elements cross over each other, as in + ``{{echo|''Hello}}, world!''``, the parser gets confused because this cannot + be represented by an ordinary syntax tree. Instead, the parser will treat the + first syntax construct as plain text. In this case, only the italic tag would + be properly parsed. - A future version of mwparserfromhell will include multiple parsing modes to get - around this restriction. + **Workaround:** Since this commonly occurs with text formatting and text + formatting is often not of interest to users, you may pass + *skip_style_tags=True* to ``mwparserfromhell.parse()``. This treats ``''`` + and ``'''`` as plain text. -Configuration unawareness -------------------------- + A future version of mwparserfromhell may include multiple parsing modes to + get around this restriction more sensibly. -* `word-ending links`_ are not supported since the linktrail rules are language-specific. +Additionally, the parser lacks awareness of certain wiki-specific settings: -* Localized namespace names aren't recognized, e.g. ``[[File:...]]`` - links are treated as regular wikilinks. +* `word-ending links`_ are not supported, since the linktrail rules are + language-specific. -* Anything that looks like an XML tag is parsed as a tag - since the available tags are extension-dependent. +* Localized namespace names aren't recognized, so file links (such as + ``[[File:...]]``) are treated as regular wikilinks. + +* Anything that looks like an XML tag is treated as a tag, even if it is not a + recognized tag name, since the list of valid tags depends on loaded MediaWiki + extensions. Integration ----------- diff --git a/docs/caveats.rst b/docs/caveats.rst deleted file mode 100644 index 927aa54..0000000 --- a/docs/caveats.rst +++ /dev/null @@ -1,17 +0,0 @@ -Caveats -======= - -An inherent limitation in wikicode prevents us from generating complete parse -trees in certain cases. 
For example, the string ``{{echo|''Hello}}, world!''`` -produces the valid output ``<i>Hello, world!</i>`` in MediaWiki, assuming -``{{echo}}`` is a template that returns its first parameter. But since -representing this in mwparserfromhell's node tree would be impossible, we -compromise by treating the first node (i.e., the template) as plain text, -parsing only the italics. - -The current workaround for cases where you are not interested in text -formatting is to pass *skip_style_tags=True* to :func:`mwparserfromhell.parse`. -This treats ``''`` and ``'''`` like plain text. - -A future version of mwparserfromhell will include multiple parsing modes to get -around this restriction. diff --git a/docs/index.rst b/docs/index.rst index 6593881..06dc2f9 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -40,7 +40,7 @@ Contents :maxdepth: 2 usage - caveats + limitations integration changelog API Reference diff --git a/docs/limitations.rst b/docs/limitations.rst new file mode 100644 index 0000000..7d5f7e7 --- /dev/null +++ b/docs/limitations.rst @@ -0,0 +1,45 @@ +Limitations +=========== + +While the MediaWiki parser generates HTML and has access to the contents of +templates, among other things, mwparserfromhell acts as a direct interface to +the source code only. This has several implications: + +* Syntax elements produced by a template transclusion cannot be detected. For + example, imagine a hypothetical page ``"Template:End-bold"`` that contained + the text ``</b>``. While MediaWiki would correctly understand that + ``<b>foobar{{end-bold}}`` translates to ``<b>foobar</b>``, mwparserfromhell + has no way of examining the contents of ``{{end-bold}}``. Instead, it would + treat the bold tag as unfinished, possibly extending further down the page. + +* Templates adjacent to external links, as in ``http://example.com{{foo}}``, + are considered part of the link. In reality, this would depend on the + contents of the template. 
+ +* When different syntax elements cross over each other, as in + ``{{echo|''Hello}}, world!''``, the parser gets confused because this cannot + be represented by an ordinary syntax tree. Instead, the parser will treat the + first syntax construct as plain text. In this case, only the italic tag would + be properly parsed. + + **Workaround:** Since this commonly occurs with text formatting and text + formatting is often not of interest to users, you may pass + *skip_style_tags=True* to ``mwparserfromhell.parse()``. This treats ``''`` + and ``'''`` as plain text. + + A future version of mwparserfromhell may include multiple parsing modes to + get around this restriction more sensibly. + +Additionally, the parser lacks awareness of certain wiki-specific settings: + +* `word-ending links`_ are not supported, since the linktrail rules are + language-specific. + +* Localized namespace names aren't recognized, so file links (such as + ``[[File:...]]``) are treated as regular wikilinks. + +* Anything that looks like an XML tag is treated as a tag, even if it is not a + recognized tag name, since the list of valid tags depends on loaded MediaWiki + extensions. + +.. _word-ending links: https://www.mediawiki.org/wiki/Help:Links#linktrail