From a25304dc444a769c1159ca736aa2bc5a1e68c06a Mon Sep 17 00:00:00 2001
From: Larivact
Date: Sun, 4 Jun 2017 11:45:15 +0200
Subject: [PATCH 1/4] partially rewrite Caveats, external link caveat

"inherent limitation in wikicode" sounds misleading it's about generating an AST instead of HTML.
---
 README.rst | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/README.rst b/README.rst
index b7d324c..86143c6 100644
--- a/README.rst
+++ b/README.rst
@@ -115,21 +115,24 @@ Likewise, use ``unicode(code)`` in Python 2.
 
 Caveats
 -------
+mwparserfromhell generates an abstract syntax tree instead of HTML.
+This has several implications:
 
-An inherent limitation in wikicode prevents us from generating complete parse
-trees in certain cases. For example, the string ``{{echo|''Hello}}, world!''``
-produces the valid output ``Hello, world!`` in MediaWiki, assuming
-``{{echo}}`` is a template that returns its first parameter. But since
-representing this in mwparserfromhell's node tree would be impossible, we
-compromise by treating the first node (i.e., the template) as plain text,
-parsing only the italics.
+* Crossed constructs like ``{{echo|''Hello}}, world!''`` are not supported,
+  since they cannot be represented in the node tree. We compromise by treating
+  the first node (i.e. the template) as plain text, parsing only the italics.
 
-The current workaround for cases where you are not interested in text
-formatting is to pass ``skip_style_tags=True`` to ``mwparserfromhell.parse()``.
-This treats ``''`` and ``'''`` like plain text.
+  The current workaround for cases where you are not interested in text
+  formatting is to pass ``skip_style_tags=True`` to ``mwparserfromhell.parse()``.
+  This treats ``''`` and ``'''`` like plain text.
 
-A future version of mwparserfromhell will include multiple parsing modes to get
-around this restriction.
+  A future version of mwparserfromhell will include multiple parsing modes to get
+  around this restriction.
+
+* Templates adjacent to external links e.g. ``http://example.com{{foo}}`` are
+  considered part of the link, since mwparserfromhell does not know the
+  definition of templates and even if it did the template could only be
+  partially part of the link which also couldn't be represented in the AST.
 
 Integration
 -----------

From 2d89f611be365e181d2fa3df2bfbab6fde2ab07c Mon Sep 17 00:00:00 2001
From: Larivact
Date: Sun, 4 Jun 2017 22:37:05 +0200
Subject: [PATCH 2/4] rewrite Caveats

>not supported, since they cannot be represented in the node tree.
It's not that they cannot be represented, it's that they would have to be evaluated.
---
 README.rst | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/README.rst b/README.rst
index 86143c6..5ac605a 100644
--- a/README.rst
+++ b/README.rst
@@ -115,12 +115,18 @@ Likewise, use ``unicode(code)`` in Python 2.
 
 Caveats
 -------
-mwparserfromhell generates an abstract syntax tree instead of HTML.
+While the MediaWiki parser generates HTML, mwparserfromhell acts as an interface to
+the source code. mwparserfromhell therefore is unaware of template definitions since
+if it would substitute templates with their output you could no longer change the templates.
 This has several implications:
 
-* Crossed constructs like ``{{echo|''Hello}}, world!''`` are not supported,
-  since they cannot be represented in the node tree. We compromise by treating
-  the first node (i.e. the template) as plain text, parsing only the italics.
+* Start and end tags generated by templates aren't recognized e.g. ``foobar{{bold-end}}``.
+
+* Templates adjacent to external links e.g. ``http://example.com{{foo}}`` are
+  considered part of the link.
+
+* Crossed constructs like ``{{echo|''Hello}}, world!''`` are not supported.
+  We compromise by treating the first node as plain text.
 
   The current workaround for cases where you are not interested in text
   formatting is to pass ``skip_style_tags=True`` to ``mwparserfromhell.parse()``.
@@ -129,11 +135,6 @@ This has several implications:
   A future version of mwparserfromhell will include multiple parsing modes to get
   around this restriction.
 
-* Templates adjacent to external links e.g. ``http://example.com{{foo}}`` are
-  considered part of the link, since mwparserfromhell does not know the
-  definition of templates and even if it did the template could only be
-  partially part of the link which also couldn't be represented in the AST.
-
 Integration
 -----------
 

From 4d4a25152e7f504f27e8deaa9dc60cbec1981ac1 Mon Sep 17 00:00:00 2001
From: Larivact
Date: Mon, 5 Jun 2017 07:38:06 +0200
Subject: [PATCH 3/4] Caveats -> Limitations, add Config unawareness

---
 README.rst | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/README.rst b/README.rst
index 5ac605a..00fbd0b 100644
--- a/README.rst
+++ b/README.rst
@@ -113,20 +113,20 @@ saving the page!) by calling ``str()`` on it::
 
 Likewise, use ``unicode(code)`` in Python 2.
 
-Caveats
--------
+Limitations
+-----------
 While the MediaWiki parser generates HTML, mwparserfromhell acts as an interface to
 the source code. mwparserfromhell therefore is unaware of template definitions since
-if it would substitute templates with their output you could no longer change the templates.
-This has several implications:
+if it would substitute templates with their output you would no longer be working
+with the source code. This has several implications:
 
 * Start and end tags generated by templates aren't recognized e.g. ``foobar{{bold-end}}``.
 
 * Templates adjacent to external links e.g. ``http://example.com{{foo}}`` are
   considered part of the link.
 
-* Crossed constructs like ``{{echo|''Hello}}, world!''`` are not supported.
-  We compromise by treating the first node as plain text.
+* Crossed constructs like ``{{echo|''Hello}}, world!''`` are not supported,
+  the first node is treated as plain text.
 
   The current workaround for cases where you are not interested in text
   formatting is to pass ``skip_style_tags=True`` to ``mwparserfromhell.parse()``.
@@ -135,6 +135,17 @@ This has several implications:
   A future version of mwparserfromhell will include multiple parsing modes to get
   around this restriction.
 
+Configuration unawareness
+-------------------------
+
+* `word-ending links`_ are not supported since the linktrail rules are language-specific.
+
+* Localized namespace names aren't recognized, e.g. ``[[File:...]]``
+  links are treated as regular wikilinks.
+
+* Anything that looks like an XML tag is parsed as a tag since,
+  the available tags are extension-dependent.
+
 Integration
 -----------
 
@@ -178,6 +189,7 @@ Python 3 code (via the API_)::
 .. _GitHub: https://github.com/earwig/mwparserfromhell
 .. _Python Package Index: http://pypi.python.org
 .. _get pip: http://pypi.python.org/pypi/pip
+.. _word-ending links: https://www.mediawiki.org/wiki/Help:Links#linktrail
 .. _EarwigBot: https://github.com/earwig/earwigbot
 .. _Pywikibot: https://www.mediawiki.org/wiki/Manual:Pywikibot
 .. _API: http://mediawiki.org/wiki/API

From 2e486f7544c607d0d4d966114f28c6ad651cca52 Mon Sep 17 00:00:00 2001
From: Larivact
Date: Mon, 5 Jun 2017 11:44:27 +0200
Subject: [PATCH 4/4] fix comma

---
 README.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.rst b/README.rst
index 00fbd0b..6fd3be5 100644
--- a/README.rst
+++ b/README.rst
@@ -143,8 +143,8 @@ Configuration unawareness
 * Localized namespace names aren't recognized, e.g. ``[[File:...]]``
   links are treated as regular wikilinks.
 
-* Anything that looks like an XML tag is parsed as a tag since,
-  the available tags are extension-dependent.
+* Anything that looks like an XML tag is parsed as a tag
+  since the available tags are extension-dependent.
 
 Integration
 -----------