diff --git a/CHANGELOG b/CHANGELOG
index 9772f8b..67214fa 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,4 +1,24 @@
-v0.1.1 (19da4d2144) to v0.2:
+v0.3 (released August 24, 2013):
+
+- Added complete support for HTML Tags, including forms like <ref>foo</ref>,
+  <ref name="bar"/>, and wiki-markup tags like bold ('''), italics (''), and
+  lists (*, #, ; and :).
+- Added support for ExternalLinks (http://example.com/ and
+  [http://example.com/ Example]).
+- Wikicode's filter methods are now passed 'recursive=True' by default instead
+  of False. This is a breaking change if you rely on any filter() methods being
+  non-recursive by default.
+- Added a matches() method to Wikicode for page/template name comparisons.
+- The 'obj' param of Wikicode.insert_before(), insert_after(), replace(), and
+  remove() now accepts other Wikicode objects and strings representing parts of
+  wikitext, instead of just nodes. These methods also make all possible
+  substitutions instead of just one.
+- Renamed Template.has_param() to has() for consistency with Template's other
+  methods; has_param() is now an alias.
+- The C tokenizer extension now works on Python 3 in addition to Python 2.7.
+- Various bugfixes, internal changes, and cleanup.
+
+v0.2 (released June 20, 2013):
 
 - The parser now fully supports Python 3 in addition to Python 2.7.
 - Added a C tokenizer extension that is significantly faster than its Python
@@ -24,10 +44,14 @@ v0.1.1 (19da4d2144) to v0.2:
 - Fixed some broken example code in the README; other copyedits.
 - Other bugfixes and code cleanup.
 
-v0.1 (ba94938fe8) to v0.1.1 (19da4d2144):
+v0.1.1 (released September 21, 2012):
 
 - Added support for Comments (<!-- foo -->) and Wikilinks ([[foo]]).
 - Added corresponding ifilter_links() and filter_links() methods to Wikicode.
 - Fixed a bug when parsing incomplete templates.
 - Fixed strip_code() to affect the contents of headings.
 - Various copyedits in documentation and comments.
+
+v0.1 (released August 23, 2012):
+
+- Initial release.
diff --git a/README.rst b/README.rst
index 77c01eb..b5fd912 100644
--- a/README.rst
+++ b/README.rst
@@ -9,7 +9,8 @@ mwparserfromhell
 that provides an easy-to-use and outrageously powerful parser for MediaWiki_
 wikicode. It supports Python 2 and Python 3.
 
-Developed by Earwig_ with help from `Σ`_.
+Developed by Earwig_ with help from `Σ`_. Full documentation is available on
+ReadTheDocs_.
 
 Installation
 ------------
@@ -18,7 +19,7 @@ The easiest way to install the parser is through the `Python Package Index`_,
 so you can install the latest release with ``pip install mwparserfromhell``
 (`get pip`_). Alternatively, get the latest development version::
 
-    git clone git://github.com/earwig/mwparserfromhell.git
+    git clone https://github.com/earwig/mwparserfromhell.git
     cd mwparserfromhell
     python setup.py install
 
@@ -59,13 +60,20 @@ For example::
     >>> print template.get("eggs").value
     spam
 
-Since every node you reach is also a ``Wikicode`` object, it's trivial to get
-nested templates::
+Since nodes can contain other nodes, getting nested templates is trivial::
+
+    >>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}"
+    >>> mwparserfromhell.parse(text).filter_templates()
+    ['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}']
+
+You can also pass ``recursive=False`` to ``filter_templates()`` and explore
+templates manually. This is possible because nodes can contain additional
+``Wikicode`` objects::
 
     >>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}")
-    >>> print code.filter_templates()
+    >>> print code.filter_templates(recursive=False)
     ['{{foo|this {{includes a|template}}}}']
-    >>> foo = code.filter_templates()[0]
+    >>> foo = code.filter_templates(recursive=False)[0]
     >>> print foo.get(1).value
     this {{includes a|template}}
     >>> print foo.get(1).value.filter_templates()[0]
@@ -73,21 +81,16 @@ nested templates::
     >>> print foo.get(1).value.filter_templates()[0].get(1).value
     template
 
-Additionally, you can include nested templates in ``filter_templates()`` by
-passing ``recursive=True``::
-
-    >>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}"
-    >>> mwparserfromhell.parse(text).filter_templates(recursive=True)
-    ['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}']
-
 Templates can be easily modified to add, remove, or alter params. ``Wikicode``
-can also be treated like a list with ``append()``, ``insert()``, ``remove()``,
-``replace()``, and more::
+objects can be treated like lists, with ``append()``, ``insert()``,
+``remove()``, ``replace()``, and more. They also have a ``matches()`` method
+for comparing page or template names, which takes care of capitalization and
+whitespace::
 
     >>> text = "{{cleanup}} '''Foo''' is a [[bar]]. {{uncategorized}}"
     >>> code = mwparserfromhell.parse(text)
     >>> for template in code.filter_templates():
-    ...     if template.name == "cleanup" and not template.has_param("date"):
+    ...     if template.name.matches("Cleanup") and not template.has("date"):
    ...         template.add("date", "July 2012")
     ...
     >>> print code
@@ -142,6 +145,7 @@ following code (via the API_)::
     return mwparserfromhell.parse(text)
 
 .. _MediaWiki: http://mediawiki.org
+.. _ReadTheDocs: http://mwparserfromhell.readthedocs.org
 .. _Earwig: http://en.wikipedia.org/wiki/User:The_Earwig
 .. _Σ: http://en.wikipedia.org/wiki/User:%CE%A3
 .. _Python Package Index: http://pypi.python.org
diff --git a/docs/api/mwparserfromhell.nodes.rst b/docs/api/mwparserfromhell.nodes.rst
index d1016f9..7043070 100644
--- a/docs/api/mwparserfromhell.nodes.rst
+++ b/docs/api/mwparserfromhell.nodes.rst
@@ -25,6 +25,14 @@ nodes Package
     :undoc-members:
     :show-inheritance:
 
+:mod:`external_link` Module
+---------------------------
+
+.. automodule:: mwparserfromhell.nodes.external_link
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
 :mod:`heading` Module
 ---------------------
 
@@ -46,6 +54,7 @@ nodes Package
 
 .. automodule:: mwparserfromhell.nodes.tag
     :members:
+    :undoc-members:
     :show-inheritance:
 
 :mod:`template` Module
diff --git a/docs/api/mwparserfromhell.rst b/docs/api/mwparserfromhell.rst
index 3ca09c9..0da522e 100644
--- a/docs/api/mwparserfromhell.rst
+++ b/docs/api/mwparserfromhell.rst
@@ -30,6 +30,12 @@ mwparserfromhell Package
     :members:
     :undoc-members:
 
+:mod:`definitions` Module
+-------------------------
+
+.. automodule:: mwparserfromhell.definitions
+    :members:
+
 :mod:`utils` Module
 -------------------
 
diff --git a/docs/changelog.rst b/docs/changelog.rst
index 0e8bbef..b6db9d9 100644
--- a/docs/changelog.rst
+++ b/docs/changelog.rst
@@ -1,10 +1,38 @@
 Changelog
 =========
 
+v0.3
+----
+
+`Released August 24, 2013 <https://github.com/earwig/mwparserfromhell/tree/v0.3>`_
+(`changes <https://github.com/earwig/mwparserfromhell/compare/v0.2...v0.3>`__):
+
+- Added complete support for HTML :py:class:`Tags <.Tag>`, including forms like
+  ``<ref>foo</ref>``, ``<ref name="bar"/>``, and wiki-markup tags like bold
+  (``'''``), italics (``''``), and lists (``*``, ``#``, ``;`` and ``:``).
+- Added support for :py:class:`.ExternalLink`\ s (``http://example.com/`` and
+  ``[http://example.com/ Example]``).
+- :py:class:`Wikicode's <.Wikicode>` :py:meth:`.filter` methods are now passed
+  *recursive=True* by default instead of *False*. **This is a breaking change
+  if you rely on any filter() methods being non-recursive by default.**
+- Added a :py:meth:`.matches` method to :py:class:`~.Wikicode` for
+  page/template name comparisons.
+- The *obj* param of :py:meth:`Wikicode.insert_before() <.insert_before>`,
+  :py:meth:`~.insert_after`, :py:meth:`~.Wikicode.replace`, and
+  :py:meth:`~.Wikicode.remove` now accepts :py:class:`~.Wikicode` objects and
+  strings representing parts of wikitext, instead of just nodes. These methods
+  also make all possible substitutions instead of just one.
+- Renamed :py:meth:`Template.has_param() <.has_param>` to
+  :py:meth:`~.Template.has` for consistency with :py:class:`~.Template`\ 's
+  other methods; :py:meth:`~.has_param` is now an alias.
+- The C tokenizer extension now works on Python 3 in addition to Python 2.7.
+- Various bugfixes, internal changes, and cleanup.
+
 v0.2
 ----
 
-19da4d2144_ to master_ (released June 20, 2013)
+`Released June 20, 2013 <https://github.com/earwig/mwparserfromhell/tree/v0.2>`_
+(`changes <https://github.com/earwig/mwparserfromhell/compare/v0.1.1...v0.2>`__):
 
 - The parser now fully supports Python 3 in addition to Python 2.7.
 - Added a C tokenizer extension that is significantly faster than its Python
@@ -38,7 +66,8 @@ v0.2
 v0.1.1
 ------
 
-ba94938fe8_ to 19da4d2144_ (released September 21, 2012)
+`Released September 21, 2012 <https://github.com/earwig/mwparserfromhell/tree/v0.1.1>`_
+(`changes <https://github.com/earwig/mwparserfromhell/compare/v0.1...v0.1.1>`__):
 
 - Added support for :py:class:`Comments <.Comment>` (``<!-- foo -->``) and
   :py:class:`Wikilinks <.Wikilink>` (``[[foo]]``).
@@ -51,8 +80,6 @@ ba94938fe8_ to 19da4d2144_ (released September 21, 2012)
 v0.1
 ----
 
-ba94938fe8_ (released August 23, 2012)
+`Released August 23, 2012 <https://github.com/earwig/mwparserfromhell/tree/v0.1>`_:
 
-.. _master: https://github.com/earwig/mwparserfromhell/tree/v0.2
-.. _19da4d2144: https://github.com/earwig/mwparserfromhell/tree/v0.1.1
-.. _ba94938fe8: https://github.com/earwig/mwparserfromhell/tree/v0.1
+- Initial release.
diff --git a/docs/index.rst b/docs/index.rst
index 4355b61..a6d2df3 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,15 +1,18 @@
-MWParserFromHell v0.2 Documentation
-===================================
+MWParserFromHell v\ |version| Documentation
+===========================================
 
 :py:mod:`mwparserfromhell` (the *MediaWiki Parser from Hell*) is a Python
 package that provides an easy-to-use and outrageously powerful parser for
 MediaWiki_ wikicode. It supports Python 2 and Python 3.
 
-Developed by Earwig_ with help from `Σ`_.
+Developed by Earwig_ with contributions from `Σ`_, Legoktm_, and others.
+Development occurs on GitHub_.
 
 .. _MediaWiki: http://mediawiki.org
 .. _Earwig: http://en.wikipedia.org/wiki/User:The_Earwig
 .. _Σ: http://en.wikipedia.org/wiki/User:%CE%A3
+.. _Legoktm: http://en.wikipedia.org/wiki/User:Legoktm
+.. _GitHub: https://github.com/earwig/mwparserfromhell
 
 Installation
 ------------
@@ -18,7 +21,7 @@ The easiest way to install the parser is through the `Python Package Index`_,
 so you can install the latest release with ``pip install mwparserfromhell``
 (`get pip`_). Alternatively, get the latest development version::
 
-    git clone git://github.com/earwig/mwparserfromhell.git
+    git clone https://github.com/earwig/mwparserfromhell.git
     cd mwparserfromhell
     python setup.py install
 
diff --git a/docs/usage.rst b/docs/usage.rst
index 2fd19af..974c670 100644
--- a/docs/usage.rst
+++ b/docs/usage.rst
@@ -27,13 +27,20 @@ some extra methods.
For example:: >>> print template.get("eggs").value spam -Since every node you reach is also a :py:class:`~.Wikicode` object, it's -trivial to get nested templates:: +Since nodes can contain other nodes, getting nested templates is trivial:: + + >>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}" + >>> mwparserfromhell.parse(text).filter_templates() + ['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}'] + +You can also pass *recursive=False* to :py:meth:`~.filter_templates` and +explore templates manually. This is possible because nodes can contain +additional :py:class:`~.Wikicode` objects:: >>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}") - >>> print code.filter_templates() + >>> print code.filter_templates(recursive=False) ['{{foo|this {{includes a|template}}}}'] - >>> foo = code.filter_templates()[0] + >>> foo = code.filter_templates(recursive=False)[0] >>> print foo.get(1).value this {{includes a|template}} >>> print foo.get(1).value.filter_templates()[0] @@ -41,22 +48,17 @@ trivial to get nested templates:: >>> print foo.get(1).value.filter_templates()[0].get(1).value template -Additionally, you can include nested templates in :py:meth:`~.filter_templates` -by passing *recursive=True*:: - - >>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}" - >>> mwparserfromhell.parse(text).filter_templates(recursive=True) - ['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}'] - Templates can be easily modified to add, remove, or alter params. -:py:class:`~.Wikicode` can also be treated like a list with +:py:class:`~.Wikicode` objects can be treated like lists, with :py:meth:`~.Wikicode.append`, :py:meth:`~.Wikicode.insert`, -:py:meth:`~.Wikicode.remove`, :py:meth:`~.Wikicode.replace`, and more:: +:py:meth:`~.Wikicode.remove`, :py:meth:`~.Wikicode.replace`, and more. They +also have a :py:meth:`~.Wikicode.matches` method for comparing page or template +names, which takes care of capitalization and whitespace:: >>> text = "{{cleanup}} '''Foo''' is a [[bar]]. {{uncategorized}}" >>> code = mwparserfromhell.parse(text) >>> for template in code.filter_templates(): - ... if template.name == "cleanup" and not template.has_param("date"): + ... if template.name.matches("Cleanup") and not template.has("date"): ... template.add("date", "July 2012") ... >>> print code diff --git a/mwparserfromhell/__init__.py b/mwparserfromhell/__init__.py index 5db2d4c..6a45a11 100644 --- a/mwparserfromhell/__init__.py +++ b/mwparserfromhell/__init__.py @@ -31,9 +31,10 @@ from __future__ import unicode_literals __author__ = "Ben Kurtovic" __copyright__ = "Copyright (C) 2012, 2013 Ben Kurtovic" __license__ = "MIT License" -__version__ = "0.2" +__version__ = "0.3" __email__ = "ben.kurtovic@verizon.net" -from . import compat, nodes, parser, smart_list, string_mixin, utils, wikicode +from . 
import (compat, definitions, nodes, parser, smart_list, string_mixin, + utils, wikicode) parse = utils.parse_anything diff --git a/mwparserfromhell/compat.py b/mwparserfromhell/compat.py old mode 100755 new mode 100644 index bb81513..864605c --- a/mwparserfromhell/compat.py +++ b/mwparserfromhell/compat.py @@ -15,14 +15,12 @@ py3k = sys.version_info[0] == 3 if py3k: bytes = bytes str = str - basestring = str maxsize = sys.maxsize import html.entities as htmlentities else: bytes = str str = unicode - basestring = basestring maxsize = sys.maxint import htmlentitydefs as htmlentities diff --git a/mwparserfromhell/definitions.py b/mwparserfromhell/definitions.py new file mode 100644 index 0000000..9449bcb --- /dev/null +++ b/mwparserfromhell/definitions.py @@ -0,0 +1,91 @@ +# -*- coding: utf-8 -*- +# +# Copyright (C) 2012-2013 Ben Kurtovic +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+ +"""Contains data about certain markup, like HTML tags and external links.""" + +from __future__ import unicode_literals + +__all__ = ["get_html_tag", "is_parsable", "is_visible", "is_single", + "is_single_only", "is_scheme"] + +URI_SCHEMES = { + # [mediawiki/core.git]/includes/DefaultSettings.php @ 374a0ad943 + "http": True, "https": True, "ftp": True, "ftps": True, "ssh": True, + "sftp": True, "irc": True, "ircs": True, "xmpp": False, "sip": False, + "sips": False, "gopher": True, "telnet": True, "nntp": True, + "worldwind": True, "mailto": False, "tel": False, "sms": False, + "news": False, "svn": True, "git": True, "mms": True, "bitcoin": False, + "magnet": False, "urn": False, "geo": False +} + +PARSER_BLACKLIST = [ + # enwiki extensions @ 2013-06-28 + "categorytree", "gallery", "hiero", "imagemap", "inputbox", "math", + "nowiki", "pre", "score", "section", "source", "syntaxhighlight", + "templatedata", "timeline" +] + +INVISIBLE_TAGS = [ + # enwiki extensions @ 2013-06-28 + "categorytree", "gallery", "imagemap", "inputbox", "math", "score", + "section", "templatedata", "timeline" +] + +# [mediawiki/core.git]/includes/Sanitizer.php @ 87a0aef762 +SINGLE_ONLY = ["br", "hr", "meta", "link", "img"] +SINGLE = SINGLE_ONLY + ["li", "dt", "dd"] + +MARKUP_TO_HTML = { + "#": "li", + "*": "li", + ";": "dt", + ":": "dd" +} + +def get_html_tag(markup): + """Return the HTML tag associated with the given wiki-markup.""" + return MARKUP_TO_HTML[markup] + +def is_parsable(tag): + """Return if the given *tag*'s contents should be passed to the parser.""" + return tag.lower() not in PARSER_BLACKLIST + +def is_visible(tag): + """Return whether or not the given *tag* contains visible text.""" + return tag.lower() not in INVISIBLE_TAGS + +def is_single(tag): + """Return whether or not the given *tag* can exist without a close tag.""" + return tag.lower() in SINGLE + +def is_single_only(tag): + """Return whether or not the given *tag* must exist without a close tag.""" + return tag.lower() in SINGLE_ONLY + +def is_scheme(scheme, slashes=True, reverse=False): + """Return whether *scheme* is valid for external links.""" + if reverse: # Convenience for C + scheme = scheme[::-1] + scheme = scheme.lower() + if slashes: + return scheme in URI_SCHEMES + return scheme in URI_SCHEMES and not URI_SCHEMES[scheme] diff --git a/mwparserfromhell/nodes/__init__.py b/mwparserfromhell/nodes/__init__.py index faaa0b2..ba97b3f 100644 --- a/mwparserfromhell/nodes/__init__.py +++ b/mwparserfromhell/nodes/__init__.py @@ -69,6 +69,7 @@ from . 
import extras from .text import Text from .argument import Argument from .comment import Comment +from .external_link import ExternalLink from .heading import Heading from .html_entity import HTMLEntity from .tag import Tag diff --git a/mwparserfromhell/nodes/external_link.py b/mwparserfromhell/nodes/external_link.py new file mode 100644 index 0000000..d74f6b3 --- /dev/null +++ b/mwparserfromhell/nodes/external_link.py @@ -0,0 +1,97 @@ +# -*- coding: utf-8 -*- +# +# Copyright (C) 2012-2013 Ben Kurtovic +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. + +from __future__ import unicode_literals + +from . import Node +from ..compat import str +from ..utils import parse_anything + +__all__ = ["ExternalLink"] + +class ExternalLink(Node): + """Represents an external link, like ``[http://example.com/ Example]``.""" + + def __init__(self, url, title=None, brackets=True): + super(ExternalLink, self).__init__() + self._url = url + self._title = title + self._brackets = brackets + + def __unicode__(self): + if self.brackets: + if self.title is not None: + return "[" + str(self.url) + " " + str(self.title) + "]" + return "[" + str(self.url) + "]" + return str(self.url) + + def __iternodes__(self, getter): + yield None, self + for child in getter(self.url): + yield self.url, child + if self.title is not None: + for child in getter(self.title): + yield self.title, child + + def __strip__(self, normalize, collapse): + if self.brackets: + if self.title: + return self.title.strip_code(normalize, collapse) + return None + return self.url.strip_code(normalize, collapse) + + def __showtree__(self, write, get, mark): + if self.brackets: + write("[") + get(self.url) + if self.title is not None: + get(self.title) + if self.brackets: + write("]") + + @property + def url(self): + """The URL of the link target, as a :py:class:`~.Wikicode` object.""" + return self._url + + @property + def title(self): + """The link title (if given), as a :py:class:`~.Wikicode` object.""" + return self._title + + @property + def brackets(self): + """Whether to enclose the URL in brackets or display it straight.""" + return self._brackets + + @url.setter + def url(self, value): + from ..parser import contexts + self._url = parse_anything(value, contexts.EXT_LINK_URI) + + @title.setter + def title(self, value): + self._title = None if value is None else parse_anything(value) + + @brackets.setter + def brackets(self, value): + self._brackets = bool(value) diff --git a/mwparserfromhell/nodes/extras/attribute.py 
b/mwparserfromhell/nodes/extras/attribute.py index ebb65ab..8f7f453 100644 --- a/mwparserfromhell/nodes/extras/attribute.py +++ b/mwparserfromhell/nodes/extras/attribute.py @@ -36,18 +36,34 @@ class Attribute(StringMixIn): whose value is ``"foo"``. """ - def __init__(self, name, value=None, quoted=True): + def __init__(self, name, value=None, quoted=True, pad_first=" ", + pad_before_eq="", pad_after_eq=""): super(Attribute, self).__init__() self._name = name self._value = value self._quoted = quoted + self._pad_first = pad_first + self._pad_before_eq = pad_before_eq + self._pad_after_eq = pad_after_eq def __unicode__(self): - if self.value: + result = self.pad_first + str(self.name) + self.pad_before_eq + if self.value is not None: + result += "=" + self.pad_after_eq if self.quoted: - return str(self.name) + '="' + str(self.value) + '"' - return str(self.name) + "=" + str(self.value) - return str(self.name) + return result + '"' + str(self.value) + '"' + return result + str(self.value) + return result + + def _set_padding(self, attr, value): + """Setter for the value of a padding attribute.""" + if not value: + setattr(self, attr, "") + else: + value = str(value) + if not value.isspace(): + raise ValueError("padding must be entirely whitespace") + setattr(self, attr, value) @property def name(self): @@ -64,14 +80,41 @@ class Attribute(StringMixIn): """Whether the attribute's value is quoted with double quotes.""" return self._quoted + @property + def pad_first(self): + """Spacing to insert right before the attribute.""" + return self._pad_first + + @property + def pad_before_eq(self): + """Spacing to insert right before the equal sign.""" + return self._pad_before_eq + + @property + def pad_after_eq(self): + """Spacing to insert right after the equal sign.""" + return self._pad_after_eq + @name.setter - def name(self, newval): - self._name = parse_anything(newval) + def name(self, value): + self._name = parse_anything(value) @value.setter def value(self, newval): - self._value = parse_anything(newval) + self._value = None if newval is None else parse_anything(newval) @quoted.setter - def quoted(self, newval): - self._quoted = bool(newval) + def quoted(self, value): + self._quoted = bool(value) + + @pad_first.setter + def pad_first(self, value): + self._set_padding("_pad_first", value) + + @pad_before_eq.setter + def pad_before_eq(self, value): + self._set_padding("_pad_before_eq", value) + + @pad_after_eq.setter + def pad_after_eq(self, value): + self._set_padding("_pad_after_eq", value) diff --git a/mwparserfromhell/nodes/tag.py b/mwparserfromhell/nodes/tag.py index eaf2b6e..06f43d0 100644 --- a/mwparserfromhell/nodes/tag.py +++ b/mwparserfromhell/nodes/tag.py @@ -22,8 +22,10 @@ from __future__ import unicode_literals -from . import Node, Text +from . 
import Node
+from .extras import Attribute
 from ..compat import str
+from ..definitions import is_visible
 from ..utils import parse_anything
 
 __all__ = ["Tag"]
@@ -31,146 +33,85 @@ __all__ = ["Tag"]
 class Tag(Node):
     """Represents an HTML-style tag in wikicode, like ``<ref>``."""
 
-    TAG_UNKNOWN = 0
-
-    # Basic HTML:
-    TAG_ITALIC = 1
-    TAG_BOLD = 2
-    TAG_UNDERLINE = 3
-    TAG_STRIKETHROUGH = 4
-    TAG_UNORDERED_LIST = 5
-    TAG_ORDERED_LIST = 6
-    TAG_DEF_TERM = 7
-    TAG_DEF_ITEM = 8
-    TAG_BLOCKQUOTE = 9
-    TAG_RULE = 10
-    TAG_BREAK = 11
-    TAG_ABBR = 12
-    TAG_PRE = 13
-    TAG_MONOSPACE = 14
-    TAG_CODE = 15
-    TAG_SPAN = 16
-    TAG_DIV = 17
-    TAG_FONT = 18
-    TAG_SMALL = 19
-    TAG_BIG = 20
-    TAG_CENTER = 21
-
-    # MediaWiki parser hooks:
-    TAG_REF = 101
-    TAG_GALLERY = 102
-    TAG_MATH = 103
-    TAG_NOWIKI = 104
-    TAG_NOINCLUDE = 105
-    TAG_INCLUDEONLY = 106
-    TAG_ONLYINCLUDE = 107
-
-    # Additional parser hooks:
-    TAG_SYNTAXHIGHLIGHT = 201
-    TAG_POEM = 202
-
-    # Lists of tags:
-    TAGS_INVISIBLE = set((TAG_REF, TAG_GALLERY, TAG_MATH, TAG_NOINCLUDE))
-    TAGS_VISIBLE = set(range(300)) - TAGS_INVISIBLE
-
-    def __init__(self, type_, tag, contents=None, attrs=None, showtag=True,
-                 self_closing=False, open_padding=0, close_padding=0):
+    def __init__(self, tag, contents=None, attrs=None, wiki_markup=None,
+                 self_closing=False, invalid=False, implicit=False, padding="",
+                 closing_tag=None):
         super(Tag, self).__init__()
-        self._type = type_
         self._tag = tag
-        self._contents = contents
-        if attrs:
-            self._attrs = attrs
+        if contents is None and not self_closing:
+            self._contents = parse_anything("")
         else:
-            self._attrs = []
-        self._showtag = showtag
+            self._contents = contents
+        self._attrs = attrs if attrs else []
+        self._wiki_markup = wiki_markup
         self._self_closing = self_closing
-        self._open_padding = open_padding
-        self._close_padding = close_padding
+        self._invalid = invalid
+        self._implicit = implicit
+        self._padding = padding
+        if closing_tag:
+            self._closing_tag = closing_tag
+        else:
+            self._closing_tag = tag
 
     def __unicode__(self):
-        if not self.showtag:
-            open_, close = self._translate()
+        if self.wiki_markup:
             if self.self_closing:
-                return open_
+                return self.wiki_markup
             else:
-                return open_ + str(self.contents) + close
+                return self.wiki_markup + str(self.contents) + self.wiki_markup
 
-        result = "<" + str(self.tag)
-        if self.attrs:
-            result += " " + " ".join([str(attr) for attr in self.attrs])
-        if self.self_closing:
-            result += " " * self.open_padding + "/>"
-        else:
-            result += " " * self.open_padding + ">" + str(self.contents)
-            result += "</" + str(self.tag) + " " * self.close_padding + ">"
+        result = ("</" if self.invalid else "<") + str(self.tag)
+        if self.attributes:
+            result += "".join([str(attr) for attr in self.attributes])
+        if self.self_closing:
+            result += self.padding + (">" if self.implicit else "/>")
+        else:
+            result += self.padding + ">" + str(self.contents)
+            result += "</" + str(self.closing_tag) + ">"
         return result
 
     def __iternodes__(self, getter):
         yield None, self
-        if self.showtag:
+        if not self.wiki_markup:
             for child in getter(self.tag):
                 yield self.tag, child
-        for attr in self.attrs:
+        for attr in self.attributes:
             for child in getter(attr.name):
                 yield attr.name, child
             if attr.value:
                 for child in getter(attr.value):
                     yield attr.value, child
-        for child in getter(self.contents):
-            yield self.contents, child
+        if self.contents:
+            for child in getter(self.contents):
+                yield self.contents, child
+        if not self.self_closing and not self.wiki_markup and self.closing_tag:
+            for child in getter(self.closing_tag):
+                yield self.closing_tag, child
 
     def __strip__(self, normalize, collapse):
-        if self.type in self.TAGS_VISIBLE:
+        if self.contents and is_visible(self.tag):
             return self.contents.strip_code(normalize, collapse)
         return None
 
     def __showtree__(self, write, get, mark):
-        tagnodes = self.tag.nodes
-        if (not self.attrs and len(tagnodes) == 1 and
-                isinstance(tagnodes[0], Text)):
-            write("<" + str(tagnodes[0]) + ">")
-        else:
-            write("<")
-            get(self.tag)
-            for attr in self.attrs:
-                get(attr.name)
-                if not attr.value:
-                    continue
-                write("    = ")
-                mark()
-                get(attr.value)
+        write("</" if self.invalid else "<")
+        get(self.tag)
+        for attr in self.attributes:
+            get(attr.name)
+            if not attr.value:
+                continue
+            write("    = ")
+            mark()
+            get(attr.value)
+        if self.self_closing:
+            write(">" if self.implicit else "/>")
+        else:
             write(">")
-        get(self.contents)
-        if len(tagnodes) == 1 and isinstance(tagnodes[0], Text):
-            write("</" + str(tagnodes[0]) + ">")
-        else:
-            write("</")
-            get(self.tag)
+            get(self.contents)
+            write("</")
+            get(self.closing_tag)
             write(">")
 
-    def _translate(self):
-        """If the HTML-style tag has a wikicode representation, return that.
-
-        For example, ``<b>Foo</b>`` can be represented as ``'''Foo'''``. This
-        returns a tuple of the character starting the sequence and the
-        character ending it.
-        """
-        translations = {
-            self.TAG_ITALIC: ("''", "''"),
-            self.TAG_BOLD: ("'''", "'''"),
-            self.TAG_UNORDERED_LIST: ("*", ""),
-            self.TAG_ORDERED_LIST: ("#", ""),
-            self.TAG_DEF_TERM: (";", ""),
-            self.TAG_DEF_ITEM: (":", ""),
-            self.TAG_RULE: ("----", ""),
-        }
-        return translations[self.type]
-
-    @property
-    def type(self):
-        """The tag type."""
-        return self._type
-
     @property
     def tag(self):
         """The tag itself, as a :py:class:`~.Wikicode` object."""
@@ -182,7 +123,7 @@ class Tag(Node):
         return self._contents
 
     @property
-    def attrs(self):
+    def attributes(self):
         """The list of attributes affecting the tag.
 
         Each attribute is an instance of :py:class:`~.Attribute`.
@@ -190,52 +131,142 @@ class Tag(Node):
         return self._attrs
 
     @property
-    def showtag(self):
-        """Whether to show the tag itself instead of a wikicode version."""
-        return self._showtag
+    def wiki_markup(self):
+        """The wikified version of a tag to show instead of HTML.
+
+        If set to a value, this will be displayed instead of the brackets.
+        For example, set to ``''`` to replace ``<i>`` or ``----`` to replace
+        ``<hr>``.
+        """
+        return self._wiki_markup
 
     @property
     def self_closing(self):
-        """Whether the tag is self-closing with no content."""
+        """Whether the tag is self-closing with no content (like ``<br/>``)."""
         return self._self_closing
 
     @property
-    def open_padding(self):
-        """How much spacing to insert before the first closing >."""
-        return self._open_padding
+    def invalid(self):
+        """Whether the tag starts with a slash after the opening bracket.
+
+        This makes the tag look like a lone close tag. It is technically
+        invalid and is only parsable Wikicode when the tag itself is
+        single-only, like ``<br>`` and ``<img>``. See
+        :py:func:`.definitions.is_single_only`.
+        """
+        return self._invalid
 
     @property
-    def close_padding(self):
-        """How much spacing to insert before the last closing >."""
-        return self._close_padding
+    def implicit(self):
+        """Whether the tag is implicitly self-closing, with no ending slash.
 
-    @type.setter
-    def type(self, value):
-        value = int(value)
-        if value not in self.TAGS_INVISIBLE | self.TAGS_VISIBLE:
-            raise ValueError(value)
-        self._type = value
+        This is only possible for specific "single" tags like ``<br>`` and
+        ``<li>``. See :py:func:`.definitions.is_single`. This field only has an
+        effect if :py:attr:`self_closing` is also ``True``.
+        """
+        return self._implicit
+
+    @property
+    def padding(self):
+        """Spacing to insert before the first closing ``>``."""
+        return self._padding
+
+    @property
+    def closing_tag(self):
+        """The closing tag, as a :py:class:`~.Wikicode` object.
+
+        This will usually equal :py:attr:`tag`, unless there is additional
+        spacing, comments, or the like.
+        """
+        return self._closing_tag
 
     @tag.setter
     def tag(self, value):
-        self._tag = parse_anything(value)
+        self._tag = self._closing_tag = parse_anything(value)
 
     @contents.setter
     def contents(self, value):
         self._contents = parse_anything(value)
 
-    @showtag.setter
-    def showtag(self, value):
-        self._showtag = bool(value)
+    @wiki_markup.setter
+    def wiki_markup(self, value):
+        self._wiki_markup = str(value) if value else None
 
     @self_closing.setter
     def self_closing(self, value):
         self._self_closing = bool(value)
 
-    @open_padding.setter
-    def open_padding(self, value):
-        self._open_padding = int(value)
+    @invalid.setter
+    def invalid(self, value):
+        self._invalid = bool(value)
+
+    @implicit.setter
+    def implicit(self, value):
+        self._implicit = bool(value)
 
-    @close_padding.setter
-    def close_padding(self, value):
-        self._close_padding = int(value)
+    @padding.setter
+    def padding(self, value):
+        if not value:
+            self._padding = ""
+        else:
+            value = str(value)
+            if not value.isspace():
+                raise ValueError("padding must be entirely whitespace")
+            self._padding = value
+
+    @closing_tag.setter
+    def closing_tag(self, value):
+        self._closing_tag = parse_anything(value)
+
+    def has(self, name):
+        """Return whether any attribute in the tag has the given *name*.
+
+        Note that a tag may have multiple attributes with the same name, but
+        only the last one is read by the MediaWiki parser.
+        """
+        for attr in self.attributes:
+            if attr.name == name.strip():
+                return True
+        return False
+
+    def get(self, name):
+        """Get the attribute with the given *name*.
+
+        The returned object is a :py:class:`~.Attribute` instance. Raises
+        :py:exc:`ValueError` if no attribute has this name. Since multiple
+        attributes can have the same name, we'll return the last match, since
+        all but the last are ignored by the MediaWiki parser.
+        """
+        for attr in reversed(self.attributes):
+            if attr.name == name.strip():
+                return attr
+        raise ValueError(name)
+
+    def add(self, name, value=None, quoted=True, pad_first=" ",
+            pad_before_eq="", pad_after_eq=""):
+        """Add an attribute with the given *name* and *value*.
+
+        *name* and *value* can be anything parsable by
+        :py:func:`.utils.parse_anything`; *value* can be omitted if the
+        attribute is valueless. *quoted* is a bool telling whether to wrap the
+        *value* in double quotes (this is recommended). *pad_first*,
+        *pad_before_eq*, and *pad_after_eq* are whitespace used as padding
+        before the name, before the equal sign (or after the name if no value),
+        and after the equal sign (ignored if no value), respectively.
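+
+        For example, ``add("class", "foo")`` with the default padding renders
+        as `` class="foo"`` inside the tag.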
+ """ + if value is not None: + value = parse_anything(value) + attr = Attribute(parse_anything(name), value, quoted) + attr.pad_first = pad_first + attr.pad_before_eq = pad_before_eq + attr.pad_after_eq = pad_after_eq + self.attributes.append(attr) + return attr + + def remove(self, name): + """Remove all attributes with the given *name*.""" + attrs = [attr for attr in self.attributes if attr.name == name.strip()] + if not attrs: + raise ValueError(name) + for attr in attrs: + self.attributes.remove(attr) diff --git a/mwparserfromhell/nodes/template.py b/mwparserfromhell/nodes/template.py index 6dfc4f0..a6b1665 100644 --- a/mwparserfromhell/nodes/template.py +++ b/mwparserfromhell/nodes/template.py @@ -26,7 +26,7 @@ import re from . import HTMLEntity, Node, Text from .extras import Parameter -from ..compat import basestring, str +from ..compat import str from ..utils import parse_anything __all__ = ["Template"] @@ -84,7 +84,7 @@ class Template(Node): replacement = str(HTMLEntity(value=ord(char))) for node in code.filter_text(recursive=False): if char in node: - code.replace(node, node.replace(char, replacement)) + code.replace(node, node.replace(char, replacement), False) def _blank_param_value(self, value): """Remove the content from *value* while keeping its whitespace. @@ -164,15 +164,15 @@ class Template(Node): def name(self, value): self._name = parse_anything(value) - def has_param(self, name, ignore_empty=True): + def has(self, name, ignore_empty=True): """Return ``True`` if any parameter in the template is named *name*. With *ignore_empty*, ``False`` will be returned even if the template contains a parameter with the name *name*, if the parameter's value is empty. Note that a template may have multiple parameters with the - same name. + same name, but only the last one is read by the MediaWiki parser. """ - name = name.strip() if isinstance(name, basestring) else str(name) + name = str(name).strip() for param in self.params: if param.name.strip() == name: if ignore_empty and not param.value.strip(): @@ -180,6 +180,9 @@ class Template(Node): return True return False + has_param = lambda self, *args, **kwargs: self.has(*args, **kwargs) + has_param.__doc__ = "Alias for :py:meth:`has`." + def get(self, name): """Get the parameter whose name is *name*. @@ -188,7 +191,7 @@ class Template(Node): parameters can have the same name, we'll return the last match, since the last parameter is the only one read by the MediaWiki parser. """ - name = name.strip() if isinstance(name, basestring) else str(name) + name = str(name).strip() for param in reversed(self.params): if param.name.strip() == name: return param @@ -226,7 +229,7 @@ class Template(Node): name, value = parse_anything(name), parse_anything(value) self._surface_escape(value, "|") - if self.has_param(name): + if self.has(name): self.remove(name, keep_field=True) existing = self.get(name) if showkey is not None: @@ -291,7 +294,7 @@ class Template(Node): the first instance if none have dependents, otherwise the one with dependents will be kept). """ - name = name.strip() if isinstance(name, basestring) else str(name) + name = str(name).strip() removed = False to_remove = [] for i, param in enumerate(self.params): diff --git a/mwparserfromhell/parser/__init__.py b/mwparserfromhell/parser/__init__.py index 1fb95b5..22c3dc2 100644 --- a/mwparserfromhell/parser/__init__.py +++ b/mwparserfromhell/parser/__init__.py @@ -46,16 +46,15 @@ class Parser(object): :py:class:`~.Node`\ s by the :py:class:`~.Builder`. 
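+
+    For example, ``Parser().parse(text)`` tokenizes *text* and returns the
+    resulting :py:class:`~.Wikicode` tree.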
""" - def __init__(self, text): - self.text = text + def __init__(self): if use_c and CTokenizer: self._tokenizer = CTokenizer() else: self._tokenizer = Tokenizer() self._builder = Builder() - def parse(self): - """Return a string as a parsed :py:class:`~.Wikicode` object tree.""" - tokens = self._tokenizer.tokenize(self.text) + def parse(self, text, context=0): + """Parse *text*, returning a :py:class:`~.Wikicode` object tree.""" + tokens = self._tokenizer.tokenize(text, context) code = self._builder.build(tokens) return code diff --git a/mwparserfromhell/parser/builder.py b/mwparserfromhell/parser/builder.py index 2cd7831..d31f450 100644 --- a/mwparserfromhell/parser/builder.py +++ b/mwparserfromhell/parser/builder.py @@ -24,8 +24,8 @@ from __future__ import unicode_literals from . import tokens from ..compat import str -from ..nodes import (Argument, Comment, Heading, HTMLEntity, Tag, Template, - Text, Wikilink) +from ..nodes import (Argument, Comment, ExternalLink, Heading, HTMLEntity, Tag, + Template, Text, Wikilink) from ..nodes.extras import Attribute, Parameter from ..smart_list import SmartList from ..wikicode import Wikicode @@ -83,7 +83,7 @@ class Builder(object): tokens.TemplateClose)): self._tokens.append(token) value = self._pop() - if not key: + if key is None: key = self._wrap([Text(str(default))]) return Parameter(key, value, showkey) else: @@ -142,6 +142,22 @@ class Builder(object): else: self._write(self._handle_token(token)) + def _handle_external_link(self, token): + """Handle when an external link is at the head of the tokens.""" + brackets, url = token.brackets, None + self._push() + while self._tokens: + token = self._tokens.pop() + if isinstance(token, tokens.ExternalLinkSeparator): + url = self._pop() + self._push() + elif isinstance(token, tokens.ExternalLinkClose): + if url is not None: + return ExternalLink(url, self._pop(), brackets) + return ExternalLink(self._pop(), brackets=brackets) + else: + self._write(self._handle_token(token)) + def _handle_entity(self): """Handle a case where an HTML entity is at the head of the tokens.""" token = self._tokens.pop() @@ -170,7 +186,7 @@ class Builder(object): self._write(self._handle_token(token)) def _handle_comment(self): - """Handle a case where a hidden comment is at the head of the tokens.""" + """Handle a case where an HTML comment is at the head of the tokens.""" self._push() while self._tokens: token = self._tokens.pop() @@ -180,7 +196,7 @@ class Builder(object): else: self._write(self._handle_token(token)) - def _handle_attribute(self): + def _handle_attribute(self, start): """Handle a case where a tag attribute is at the head of the tokens.""" name, quoted = None, False self._push() @@ -191,37 +207,46 @@ class Builder(object): self._push() elif isinstance(token, tokens.TagAttrQuote): quoted = True - elif isinstance(token, (tokens.TagAttrStart, - tokens.TagCloseOpen)): + elif isinstance(token, (tokens.TagAttrStart, tokens.TagCloseOpen, + tokens.TagCloseSelfclose)): self._tokens.append(token) - if name is not None: - return Attribute(name, self._pop(), quoted) - return Attribute(self._pop(), quoted=quoted) + if name: + value = self._pop() + else: + name, value = self._pop(), None + return Attribute(name, value, quoted, start.pad_first, + start.pad_before_eq, start.pad_after_eq) else: self._write(self._handle_token(token)) def _handle_tag(self, token): """Handle a case where a tag is at the head of the tokens.""" - type_, showtag = token.type, token.showtag - attrs = [] + close_tokens = 
(tokens.TagCloseSelfclose, tokens.TagCloseClose) + implicit, attrs, contents, closing_tag = False, [], None, None + wiki_markup, invalid = token.wiki_markup, token.invalid or False self._push() while self._tokens: token = self._tokens.pop() if isinstance(token, tokens.TagAttrStart): - attrs.append(self._handle_attribute()) + attrs.append(self._handle_attribute(token)) elif isinstance(token, tokens.TagCloseOpen): - open_pad = token.padding + padding = token.padding or "" tag = self._pop() self._push() - elif isinstance(token, tokens.TagCloseSelfclose): - tag = self._pop() - return Tag(type_, tag, attrs=attrs, showtag=showtag, - self_closing=True, open_padding=token.padding) elif isinstance(token, tokens.TagOpenClose): contents = self._pop() - elif isinstance(token, tokens.TagCloseClose): - return Tag(type_, tag, contents, attrs, showtag, False, - open_pad, token.padding) + self._push() + elif isinstance(token, close_tokens): + if isinstance(token, tokens.TagCloseSelfclose): + tag = self._pop() + self_closing = True + padding = token.padding or "" + implicit = token.implicit or False + else: + self_closing = False + closing_tag = self._pop() + return Tag(tag, contents, attrs, wiki_markup, self_closing, + invalid, implicit, padding, closing_tag) else: self._write(self._handle_token(token)) @@ -235,6 +260,8 @@ class Builder(object): return self._handle_argument() elif isinstance(token, tokens.WikilinkOpen): return self._handle_wikilink() + elif isinstance(token, tokens.ExternalLinkOpen): + return self._handle_external_link(token) elif isinstance(token, tokens.HTMLEntityStart): return self._handle_entity() elif isinstance(token, tokens.HeadingStart): diff --git a/mwparserfromhell/parser/contexts.py b/mwparserfromhell/parser/contexts.py index 896d137..33da8f7 100644 --- a/mwparserfromhell/parser/contexts.py +++ b/mwparserfromhell/parser/contexts.py @@ -51,6 +51,12 @@ Local (stack-specific) contexts: * :py:const:`WIKILINK_TITLE` * :py:const:`WIKILINK_TEXT` +* :py:const:`EXT_LINK` + + * :py:const:`EXT_LINK_URI` + * :py:const:`EXT_LINK_TITLE` + * :py:const:`EXT_LINK_BRACKETS` + * :py:const:`HEADING` * :py:const:`HEADING_LEVEL_1` @@ -60,7 +66,21 @@ Local (stack-specific) contexts: * :py:const:`HEADING_LEVEL_5` * :py:const:`HEADING_LEVEL_6` -* :py:const:`COMMENT` +* :py:const:`TAG` + + * :py:const:`TAG_OPEN` + * :py:const:`TAG_ATTR` + * :py:const:`TAG_BODY` + * :py:const:`TAG_CLOSE` + +* :py:const:`STYLE` + + * :py:const:`STYLE_ITALICS` + * :py:const:`STYLE_BOLD` + * :py:const:`STYLE_PASS_AGAIN` + * :py:const:`STYLE_SECOND_PASS` + +* :py:const:`DL_TERM` * :py:const:`SAFETY_CHECK` @@ -74,41 +94,76 @@ Local (stack-specific) contexts: Global contexts: * :py:const:`GL_HEADING` + +Aggregate contexts: + +* :py:const:`FAIL` +* :py:const:`UNSAFE` +* :py:const:`DOUBLE` +* :py:const:`INVALID_LINK` + """ # Local contexts: -TEMPLATE = 0b00000000000000000111 -TEMPLATE_NAME = 0b00000000000000000001 -TEMPLATE_PARAM_KEY = 0b00000000000000000010 -TEMPLATE_PARAM_VALUE = 0b00000000000000000100 - -ARGUMENT = 0b00000000000000011000 -ARGUMENT_NAME = 0b00000000000000001000 -ARGUMENT_DEFAULT = 0b00000000000000010000 - -WIKILINK = 0b00000000000001100000 -WIKILINK_TITLE = 0b00000000000000100000 -WIKILINK_TEXT = 0b00000000000001000000 - -HEADING = 0b00000001111110000000 -HEADING_LEVEL_1 = 0b00000000000010000000 -HEADING_LEVEL_2 = 0b00000000000100000000 -HEADING_LEVEL_3 = 0b00000000001000000000 -HEADING_LEVEL_4 = 0b00000000010000000000 -HEADING_LEVEL_5 = 0b00000000100000000000 -HEADING_LEVEL_6 = 0b00000001000000000000 - 
-COMMENT = 0b00000010000000000000 - -SAFETY_CHECK = 0b11111100000000000000 -HAS_TEXT = 0b00000100000000000000 -FAIL_ON_TEXT = 0b00001000000000000000 -FAIL_NEXT = 0b00010000000000000000 -FAIL_ON_LBRACE = 0b00100000000000000000 -FAIL_ON_RBRACE = 0b01000000000000000000 -FAIL_ON_EQUALS = 0b10000000000000000000 +TEMPLATE_NAME = 1 << 0 +TEMPLATE_PARAM_KEY = 1 << 1 +TEMPLATE_PARAM_VALUE = 1 << 2 +TEMPLATE = TEMPLATE_NAME + TEMPLATE_PARAM_KEY + TEMPLATE_PARAM_VALUE + +ARGUMENT_NAME = 1 << 3 +ARGUMENT_DEFAULT = 1 << 4 +ARGUMENT = ARGUMENT_NAME + ARGUMENT_DEFAULT + +WIKILINK_TITLE = 1 << 5 +WIKILINK_TEXT = 1 << 6 +WIKILINK = WIKILINK_TITLE + WIKILINK_TEXT + +EXT_LINK_URI = 1 << 7 +EXT_LINK_TITLE = 1 << 8 +EXT_LINK_BRACKETS = 1 << 9 +EXT_LINK = EXT_LINK_URI + EXT_LINK_TITLE + EXT_LINK_BRACKETS + +HEADING_LEVEL_1 = 1 << 10 +HEADING_LEVEL_2 = 1 << 11 +HEADING_LEVEL_3 = 1 << 12 +HEADING_LEVEL_4 = 1 << 13 +HEADING_LEVEL_5 = 1 << 14 +HEADING_LEVEL_6 = 1 << 15 +HEADING = (HEADING_LEVEL_1 + HEADING_LEVEL_2 + HEADING_LEVEL_3 + + HEADING_LEVEL_4 + HEADING_LEVEL_5 + HEADING_LEVEL_6) + +TAG_OPEN = 1 << 16 +TAG_ATTR = 1 << 17 +TAG_BODY = 1 << 18 +TAG_CLOSE = 1 << 19 +TAG = TAG_OPEN + TAG_ATTR + TAG_BODY + TAG_CLOSE + +STYLE_ITALICS = 1 << 20 +STYLE_BOLD = 1 << 21 +STYLE_PASS_AGAIN = 1 << 22 +STYLE_SECOND_PASS = 1 << 23 +STYLE = STYLE_ITALICS + STYLE_BOLD + STYLE_PASS_AGAIN + STYLE_SECOND_PASS + +DL_TERM = 1 << 24 + +HAS_TEXT = 1 << 25 +FAIL_ON_TEXT = 1 << 26 +FAIL_NEXT = 1 << 27 +FAIL_ON_LBRACE = 1 << 28 +FAIL_ON_RBRACE = 1 << 29 +FAIL_ON_EQUALS = 1 << 30 +SAFETY_CHECK = (HAS_TEXT + FAIL_ON_TEXT + FAIL_NEXT + FAIL_ON_LBRACE + + FAIL_ON_RBRACE + FAIL_ON_EQUALS) # Global contexts: -GL_HEADING = 0b1 +GL_HEADING = 1 << 0 + +# Aggregate contexts: + +FAIL = TEMPLATE + ARGUMENT + WIKILINK + EXT_LINK_TITLE + HEADING + TAG + STYLE +UNSAFE = (TEMPLATE_NAME + WIKILINK + EXT_LINK_TITLE + TEMPLATE_PARAM_KEY + + ARGUMENT_NAME + TAG_CLOSE) +DOUBLE = TEMPLATE_PARAM_KEY + TAG_CLOSE +INVALID_LINK = TEMPLATE_NAME + ARGUMENT_NAME + WIKILINK + EXT_LINK diff --git a/mwparserfromhell/parser/tokenizer.c b/mwparserfromhell/parser/tokenizer.c index df65d0e..c9527ab 100644 --- a/mwparserfromhell/parser/tokenizer.c +++ b/mwparserfromhell/parser/tokenizer.c @@ -24,28 +24,71 @@ SOFTWARE. #include "tokenizer.h" /* + Determine whether the given Py_UNICODE is a marker. +*/ +static int is_marker(Py_UNICODE this) +{ + int i; + + for (i = 0; i < NUM_MARKERS; i++) { + if (*MARKERS[i] == this) + return 1; + } + return 0; +} + +/* Given a context, return the heading level encoded within it. */ static int heading_level_from_context(int n) { int level; + n /= LC_HEADING_LEVEL_1; for (level = 1; n > 1; n >>= 1) level++; return level; } -static PyObject* -Tokenizer_new(PyTypeObject* type, PyObject* args, PyObject* kwds) +/* + Call the given function in definitions.py, using 'in1', 'in2', and 'in3' as + parameters, and return its output as a bool. +*/ +static int call_def_func(const char* funcname, PyObject* in1, PyObject* in2, + PyObject* in3) { - Tokenizer* self = (Tokenizer*) type->tp_alloc(type, 0); - return (PyObject*) self; + PyObject* func = PyObject_GetAttrString(definitions, funcname); + PyObject* result = PyObject_CallFunctionObjArgs(func, in1, in2, in3, NULL); + int ans = (result == Py_True) ? 1 : 0; + + Py_DECREF(func); + Py_DECREF(result); + return ans; +} + +/* + Sanitize the name of a tag so it can be compared with others for equality. 
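+    (In practice, this lowercases the tag's text and strips trailing
+    whitespace, matching the lowercase names listed in definitions.py.)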
+*/ +static PyObject* strip_tag_name(PyObject* token) +{ + PyObject *text, *rstripped, *lowered; + + text = PyObject_GetAttrString(token, "text"); + if (!text) + return NULL; + rstripped = PyObject_CallMethod(text, "rstrip", NULL); + Py_DECREF(text); + if (!rstripped) + return NULL; + lowered = PyObject_CallMethod(rstripped, "lower", NULL); + Py_DECREF(rstripped); + return lowered; } -static struct Textbuffer* -Textbuffer_new(void) +static Textbuffer* Textbuffer_new(void) { - struct Textbuffer* buffer = malloc(sizeof(struct Textbuffer)); + Textbuffer* buffer = malloc(sizeof(Textbuffer)); + if (!buffer) { PyErr_NoMemory(); return NULL; @@ -57,60 +100,151 @@ Textbuffer_new(void) PyErr_NoMemory(); return NULL; } - buffer->next = NULL; + buffer->prev = buffer->next = NULL; return buffer; } -static void -Tokenizer_dealloc(Tokenizer* self) +static void Textbuffer_dealloc(Textbuffer* self) { - struct Stack *this = self->topstack, *next; - Py_XDECREF(self->text); + Textbuffer* next; - while (this) { - Py_DECREF(this->stack); - Textbuffer_dealloc(this->textbuffer); - next = this->next; - free(this); - this = next; + while (self) { + free(self->data); + next = self->next; + free(self); + self = next; + } +} + +/* + Write a Unicode codepoint to the given textbuffer. +*/ +static int Textbuffer_write(Textbuffer** this, Py_UNICODE code) +{ + Textbuffer* self = *this; + + if (self->size == TEXTBUFFER_BLOCKSIZE) { + Textbuffer* new = Textbuffer_new(); + if (!new) + return -1; + new->next = self; + self->prev = new; + *this = self = new; + } + self->data[self->size++] = code; + return 0; +} + +/* + Return the contents of the textbuffer as a Python Unicode object. +*/ +static PyObject* Textbuffer_render(Textbuffer* self) +{ + PyObject *result = PyUnicode_FromUnicode(self->data, self->size); + PyObject *left, *concat; + + while (self->next) { + self = self->next; + left = PyUnicode_FromUnicode(self->data, self->size); + concat = PyUnicode_Concat(left, result); + Py_DECREF(left); + Py_DECREF(result); + result = concat; + } + return result; +} + +static TagData* TagData_new(void) +{ + TagData *self = malloc(sizeof(TagData)); + + #define ALLOC_BUFFER(name) \ + name = Textbuffer_new(); \ + if (!name) { \ + TagData_dealloc(self); \ + return NULL; \ + } + + if (!self) { + PyErr_NoMemory(); + return NULL; } - self->ob_type->tp_free((PyObject*) self); + self->context = TAG_NAME; + ALLOC_BUFFER(self->pad_first) + ALLOC_BUFFER(self->pad_before_eq) + ALLOC_BUFFER(self->pad_after_eq) + self->reset = 0; + return self; +} + +static void TagData_dealloc(TagData* self) +{ + #define DEALLOC_BUFFER(name) \ + if (name) \ + Textbuffer_dealloc(name); + + DEALLOC_BUFFER(self->pad_first); + DEALLOC_BUFFER(self->pad_before_eq); + DEALLOC_BUFFER(self->pad_after_eq); + free(self); +} + +static int TagData_reset_buffers(TagData* self) +{ + #define RESET_BUFFER(name) \ + Textbuffer_dealloc(name); \ + name = Textbuffer_new(); \ + if (!name) \ + return -1; + + RESET_BUFFER(self->pad_first) + RESET_BUFFER(self->pad_before_eq) + RESET_BUFFER(self->pad_after_eq) + return 0; +} + +static PyObject* +Tokenizer_new(PyTypeObject* type, PyObject* args, PyObject* kwds) +{ + Tokenizer* self = (Tokenizer*) type->tp_alloc(type, 0); + return (PyObject*) self; } -static void -Textbuffer_dealloc(struct Textbuffer* this) +static void Tokenizer_dealloc(Tokenizer* self) { - struct Textbuffer* next; + Stack *this = self->topstack, *next; + Py_XDECREF(self->text); + while (this) { - free(this->data); + Py_DECREF(this->stack); + 
Textbuffer_dealloc(this->textbuffer);
         next = this->next;
         free(this);
         this = next;
     }
+    Py_TYPE(self)->tp_free((PyObject*) self);
 }
 
-static int
-Tokenizer_init(Tokenizer* self, PyObject* args, PyObject* kwds)
+static int Tokenizer_init(Tokenizer* self, PyObject* args, PyObject* kwds)
 {
     static char* kwlist[] = {NULL};
+
     if (!PyArg_ParseTupleAndKeywords(args, kwds, "", kwlist))
         return -1;
     self->text = Py_None;
     Py_INCREF(Py_None);
     self->topstack = NULL;
-    self->head = 0;
-    self->length = 0;
-    self->global = 0;
+    self->head = self->length = self->global = self->depth = self->cycles = 0;
     return 0;
 }
 
 /*
     Add a new token stack, context, and textbuffer to the list.
 */
-static int
-Tokenizer_push(Tokenizer* self, int context)
+static int Tokenizer_push(Tokenizer* self, int context)
 {
-    struct Stack* top = malloc(sizeof(struct Stack));
+    Stack* top = malloc(sizeof(Stack));
+
     if (!top) {
         PyErr_NoMemory();
         return -1;
@@ -128,32 +262,13 @@ Tokenizer_push(Tokenizer* self, int context)
 }
 
 /*
-    Return the contents of the textbuffer as a Python Unicode object.
-*/
-static PyObject*
-Textbuffer_render(struct Textbuffer* self)
-{
-    PyObject *result = PyUnicode_FromUnicode(self->data, self->size);
-    PyObject *left, *concat;
-    while (self->next) {
-        self = self->next;
-        left = PyUnicode_FromUnicode(self->data, self->size);
-        concat = PyUnicode_Concat(left, result);
-        Py_DECREF(left);
-        Py_DECREF(result);
-        result = concat;
-    }
-    return result;
-}
-
-/*
     Push the textbuffer onto the stack as a Text node and clear it.
 */
-static int
-Tokenizer_push_textbuffer(Tokenizer* self)
+static int Tokenizer_push_textbuffer(Tokenizer* self)
 {
     PyObject *text, *kwargs, *token;
-    struct Textbuffer* buffer = self->topstack->textbuffer;
+    Textbuffer* buffer = self->topstack->textbuffer;
+
     if (buffer->size == 0 && !buffer->next)
         return 0;
     text = Textbuffer_render(buffer);
@@ -185,10 +300,10 @@ Tokenizer_push_textbuffer(Tokenizer* self)
 /*
     Pop and deallocate the top token stack/context/textbuffer.
 */
-static void
-Tokenizer_delete_top_of_stack(Tokenizer* self)
+static void Tokenizer_delete_top_of_stack(Tokenizer* self)
 {
-    struct Stack* top = self->topstack;
+    Stack* top = self->topstack;
+
     Py_DECREF(top->stack);
     Textbuffer_dealloc(top->textbuffer);
     self->topstack = top->next;
@@ -199,10 +314,10 @@ Tokenizer_delete_top_of_stack(Tokenizer* self)
 /*
     Pop the current stack/context/textbuffer, returning the stack.
 */
-static PyObject*
-Tokenizer_pop(Tokenizer* self)
+static PyObject* Tokenizer_pop(Tokenizer* self)
 {
     PyObject* stack;
+
     if (Tokenizer_push_textbuffer(self))
         return NULL;
     stack = self->topstack->stack;
@@ -215,11 +330,11 @@ Tokenizer_pop(Tokenizer* self)
     Pop the current stack/context/textbuffer, returning the stack. We will also
     replace the underlying stack's context with the current stack's.
 */
-static PyObject*
-Tokenizer_pop_keeping_context(Tokenizer* self)
+static PyObject* Tokenizer_pop_keeping_context(Tokenizer* self)
 {
     PyObject* stack;
     int context;
+
     if (Tokenizer_push_textbuffer(self))
         return NULL;
     stack = self->topstack->stack;
@@ -234,70 +349,133 @@ Tokenizer_pop_keeping_context(Tokenizer* self)
     Fail the current tokenization route. Discards the current
     stack/context/textbuffer and raises a BadRoute exception.
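+    Callers typically save self->head before attempting a route; on BAD_ROUTE
+    they restore it and call RESET_ROUTE() before trying another
+    interpretation of the markup.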
*/ -static void* -Tokenizer_fail_route(Tokenizer* self) +static void* Tokenizer_fail_route(Tokenizer* self) { + int context = self->topstack->context; PyObject* stack = Tokenizer_pop(self); + Py_XDECREF(stack); - FAIL_ROUTE(); + FAIL_ROUTE(context); return NULL; } /* - Write a token to the end of the current token stack. + Write a token to the current token stack. */ -static int -Tokenizer_write(Tokenizer* self, PyObject* token) +static int Tokenizer_emit_token(Tokenizer* self, PyObject* token, int first) { + PyObject* instance; + if (Tokenizer_push_textbuffer(self)) return -1; - if (PyList_Append(self->topstack->stack, token)) + instance = PyObject_CallObject(token, NULL); + if (!instance) return -1; + if (first ? PyList_Insert(self->topstack->stack, 0, instance) : + PyList_Append(self->topstack->stack, instance)) { + Py_DECREF(instance); + return -1; + } + Py_DECREF(instance); return 0; } /* - Write a token to the beginning of the current token stack. + Write a token to the current token stack, with kwargs. Steals a reference + to kwargs. */ -static int -Tokenizer_write_first(Tokenizer* self, PyObject* token) +static int Tokenizer_emit_token_kwargs(Tokenizer* self, PyObject* token, + PyObject* kwargs, int first) { - if (Tokenizer_push_textbuffer(self)) + PyObject* instance; + + if (Tokenizer_push_textbuffer(self)) { + Py_DECREF(kwargs); + return -1; + } + instance = PyObject_Call(token, NOARGS, kwargs); + if (!instance) { + Py_DECREF(kwargs); return -1; - if (PyList_Insert(self->topstack->stack, 0, token)) + } + if (first ? PyList_Insert(self->topstack->stack, 0, instance): + PyList_Append(self->topstack->stack, instance)) { + Py_DECREF(instance); + Py_DECREF(kwargs); return -1; + } + Py_DECREF(instance); + Py_DECREF(kwargs); return 0; } /* - Write text to the current textbuffer. + Write a Unicode codepoint to the current textbuffer. */ -static int -Tokenizer_write_text(Tokenizer* self, Py_UNICODE text) +static int Tokenizer_emit_char(Tokenizer* self, Py_UNICODE code) { - struct Textbuffer* buf = self->topstack->textbuffer; - if (buf->size == TEXTBUFFER_BLOCKSIZE) { - struct Textbuffer* new = Textbuffer_new(); - if (!new) + return Textbuffer_write(&(self->topstack->textbuffer), code); +} + +/* + Write a string of text to the current textbuffer. +*/ +static int Tokenizer_emit_text(Tokenizer* self, const char* text) +{ + int i = 0; + + while (text[i]) { + if (Tokenizer_emit_char(self, text[i])) return -1; - new->next = buf; - self->topstack->textbuffer = new; - buf = new; + i++; } - buf->data[buf->size] = text; - buf->size++; return 0; } /* - Write a series of tokens to the current stack at once. + Write the contents of another textbuffer to the current textbuffer, + deallocating it in the process. */ static int -Tokenizer_write_all(Tokenizer* self, PyObject* tokenlist) +Tokenizer_emit_textbuffer(Tokenizer* self, Textbuffer* buffer, int reverse) +{ + Textbuffer *original = buffer; + int i; + + if (reverse) { + do { + for (i = buffer->size - 1; i >= 0; i--) { + if (Tokenizer_emit_char(self, buffer->data[i])) { + Textbuffer_dealloc(original); + return -1; + } + } + } while ((buffer = buffer->next)); + } + else { + while (buffer->next) + buffer = buffer->next; + do { + for (i = 0; i < buffer->size; i++) { + if (Tokenizer_emit_char(self, buffer->data[i])) { + Textbuffer_dealloc(original); + return -1; + } + } + } while ((buffer = buffer->prev)); + } + Textbuffer_dealloc(original); + return 0; +} + +/* + Write a series of tokens to the current stack at once. 
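+    If the list begins with a Text token, it is merged with any buffered
+    text; otherwise the textbuffer is flushed first to preserve ordering.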
+*/ +static int Tokenizer_emit_all(Tokenizer* self, PyObject* tokenlist) { int pushed = 0; PyObject *stack, *token, *left, *right, *text; - struct Textbuffer* buffer; + Textbuffer* buffer; Py_ssize_t size; if (PyList_GET_SIZE(tokenlist) > 0) { @@ -351,23 +529,17 @@ Tokenizer_write_all(Tokenizer* self, PyObject* tokenlist) Pop the current stack, write text, and then write the stack. 'text' is a NULL-terminated array of chars. */ -static int -Tokenizer_write_text_then_stack(Tokenizer* self, const char* text) +static int Tokenizer_emit_text_then_stack(Tokenizer* self, const char* text) { PyObject* stack = Tokenizer_pop(self); - int i = 0; - while (1) { - if (!text[i]) - break; - if (Tokenizer_write_text(self, (Py_UNICODE) text[i])) { - Py_XDECREF(stack); - return -1; - } - i++; + + if (Tokenizer_emit_text(self, text)) { + Py_DECREF(stack); + return -1; } if (stack) { if (PyList_GET_SIZE(stack) > 0) { - if (Tokenizer_write_all(self, stack)) { + if (Tokenizer_emit_all(self, stack)) { Py_DECREF(stack); return -1; } @@ -381,10 +553,10 @@ Tokenizer_write_text_then_stack(Tokenizer* self, const char* text) /* Read the value at a relative point in the wikicode, forwards. */ -static PyObject* -Tokenizer_read(Tokenizer* self, Py_ssize_t delta) +static PyObject* Tokenizer_read(Tokenizer* self, Py_ssize_t delta) { Py_ssize_t index = self->head + delta; + if (index >= self->length) return EMPTY; return PyList_GET_ITEM(self->text, index); @@ -393,10 +565,10 @@ Tokenizer_read(Tokenizer* self, Py_ssize_t delta) /* Read the value at a relative point in the wikicode, backwards. */ -static PyObject* -Tokenizer_read_backwards(Tokenizer* self, Py_ssize_t delta) +static PyObject* Tokenizer_read_backwards(Tokenizer* self, Py_ssize_t delta) { Py_ssize_t index; + if (delta > self->head) return EMPTY; index = self->head - delta; @@ -404,10 +576,67 @@ Tokenizer_read_backwards(Tokenizer* self, Py_ssize_t delta) } /* + Parse a template at the head of the wikicode string. +*/ +static int Tokenizer_parse_template(Tokenizer* self) +{ + PyObject *template; + Py_ssize_t reset = self->head; + + template = Tokenizer_parse(self, LC_TEMPLATE_NAME, 1); + if (BAD_ROUTE) { + self->head = reset; + return 0; + } + if (!template) + return -1; + if (Tokenizer_emit_first(self, TemplateOpen)) { + Py_DECREF(template); + return -1; + } + if (Tokenizer_emit_all(self, template)) { + Py_DECREF(template); + return -1; + } + Py_DECREF(template); + if (Tokenizer_emit(self, TemplateClose)) + return -1; + return 0; +} + +/* + Parse an argument at the head of the wikicode string. +*/ +static int Tokenizer_parse_argument(Tokenizer* self) +{ + PyObject *argument; + Py_ssize_t reset = self->head; + + argument = Tokenizer_parse(self, LC_ARGUMENT_NAME, 1); + if (BAD_ROUTE) { + self->head = reset; + return 0; + } + if (!argument) + return -1; + if (Tokenizer_emit_first(self, ArgumentOpen)) { + Py_DECREF(argument); + return -1; + } + if (Tokenizer_emit_all(self, argument)) { + Py_DECREF(argument); + return -1; + } + Py_DECREF(argument); + if (Tokenizer_emit(self, ArgumentClose)) + return -1; + return 0; +} + +/* Parse a template or argument at the head of the wikicode string. 
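/*
   Editorial aside -- not part of the patch. The Tokenizer_emit(),
   Tokenizer_emit_first(), and Tokenizer_emit_kwargs() calls appearing from
   here on are presumably thin wrappers over Tokenizer_emit_token() and
   Tokenizer_emit_token_kwargs() defined above; a sketch, assuming they are
   simple macros:
*/
#define Tokenizer_emit(self, token) \
    Tokenizer_emit_token(self, token, 0)
#define Tokenizer_emit_first(self, token) \
    Tokenizer_emit_token(self, token, 1)
#define Tokenizer_emit_kwargs(self, token, kwargs) \
    Tokenizer_emit_token_kwargs(self, token, kwargs, 0)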
*/ -static int -Tokenizer_parse_template_or_argument(Tokenizer* self) +static int Tokenizer_parse_template_or_argument(Tokenizer* self) { unsigned int braces = 2, i; PyObject *tokenlist; @@ -421,17 +650,16 @@ Tokenizer_parse_template_or_argument(Tokenizer* self) return -1; while (braces) { if (braces == 1) { - if (Tokenizer_write_text_then_stack(self, "{")) + if (Tokenizer_emit_text_then_stack(self, "{")) return -1; return 0; } if (braces == 2) { if (Tokenizer_parse_template(self)) return -1; - if (BAD_ROUTE) { RESET_ROUTE(); - if (Tokenizer_write_text_then_stack(self, "{{")) + if (Tokenizer_emit_text_then_stack(self, "{{")) return -1; return 0; } @@ -448,7 +676,7 @@ Tokenizer_parse_template_or_argument(Tokenizer* self) RESET_ROUTE(); for (i = 0; i < braces; i++) text[i] = *"{"; text[braces] = *""; - if (Tokenizer_write_text_then_stack(self, text)) { + if (Tokenizer_emit_text_then_stack(self, text)) { Py_XDECREF(text); return -1; } @@ -466,134 +694,42 @@ Tokenizer_parse_template_or_argument(Tokenizer* self) tokenlist = Tokenizer_pop(self); if (!tokenlist) return -1; - if (Tokenizer_write_all(self, tokenlist)) { + if (Tokenizer_emit_all(self, tokenlist)) { Py_DECREF(tokenlist); return -1; } Py_DECREF(tokenlist); + if (self->topstack->context & LC_FAIL_NEXT) + self->topstack->context ^= LC_FAIL_NEXT; return 0; } /* - Parse a template at the head of the wikicode string. + Handle a template parameter at the head of the string. */ -static int -Tokenizer_parse_template(Tokenizer* self) +static int Tokenizer_handle_template_param(Tokenizer* self) { - PyObject *template, *token; - Py_ssize_t reset = self->head; + PyObject *stack; - template = Tokenizer_parse(self, LC_TEMPLATE_NAME); - if (BAD_ROUTE) { - self->head = reset; - return 0; + if (self->topstack->context & LC_TEMPLATE_NAME) + self->topstack->context ^= LC_TEMPLATE_NAME; + else if (self->topstack->context & LC_TEMPLATE_PARAM_VALUE) + self->topstack->context ^= LC_TEMPLATE_PARAM_VALUE; + if (self->topstack->context & LC_TEMPLATE_PARAM_KEY) { + stack = Tokenizer_pop_keeping_context(self); + if (!stack) + return -1; + if (Tokenizer_emit_all(self, stack)) { + Py_DECREF(stack); + return -1; + } + Py_DECREF(stack); } - if (!template) - return -1; - token = PyObject_CallObject(TemplateOpen, NULL); - if (!token) { - Py_DECREF(template); + else + self->topstack->context |= LC_TEMPLATE_PARAM_KEY; + if (Tokenizer_emit(self, TemplateParamSeparator)) return -1; - } - if (Tokenizer_write_first(self, token)) { - Py_DECREF(token); - Py_DECREF(template); - return -1; - } - Py_DECREF(token); - if (Tokenizer_write_all(self, template)) { - Py_DECREF(template); - return -1; - } - Py_DECREF(template); - token = PyObject_CallObject(TemplateClose, NULL); - if (!token) - return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); - return -1; - } - Py_DECREF(token); - return 0; -} - -/* - Parse an argument at the head of the wikicode string. 
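/*
   Editorial trace -- not part of the patch. In
   Tokenizer_parse_template_or_argument(), `braces` counts the run of "{"
   at the head: "{{foo}}" gives braces == 2 and is tried as a template,
   "{{{foo}}}" gives 3 and is tried as an argument, and any braces that
   end up matching neither are replayed as literal "{" text through
   Tokenizer_emit_text_then_stack().
*/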
-*/ -static int -Tokenizer_parse_argument(Tokenizer* self) -{ - PyObject *argument, *token; - Py_ssize_t reset = self->head; - - argument = Tokenizer_parse(self, LC_ARGUMENT_NAME); - if (BAD_ROUTE) { - self->head = reset; - return 0; - } - if (!argument) - return -1; - token = PyObject_CallObject(ArgumentOpen, NULL); - if (!token) { - Py_DECREF(argument); - return -1; - } - if (Tokenizer_write_first(self, token)) { - Py_DECREF(token); - Py_DECREF(argument); - return -1; - } - Py_DECREF(token); - if (Tokenizer_write_all(self, argument)) { - Py_DECREF(argument); - return -1; - } - Py_DECREF(argument); - token = PyObject_CallObject(ArgumentClose, NULL); - if (!token) - return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); - return -1; - } - Py_DECREF(token); - return 0; -} - -/* - Handle a template parameter at the head of the string. -*/ -static int -Tokenizer_handle_template_param(Tokenizer* self) -{ - PyObject *stack, *token; - - if (self->topstack->context & LC_TEMPLATE_NAME) - self->topstack->context ^= LC_TEMPLATE_NAME; - else if (self->topstack->context & LC_TEMPLATE_PARAM_VALUE) - self->topstack->context ^= LC_TEMPLATE_PARAM_VALUE; - if (self->topstack->context & LC_TEMPLATE_PARAM_KEY) { - stack = Tokenizer_pop_keeping_context(self); - if (!stack) - return -1; - if (Tokenizer_write_all(self, stack)) { - Py_DECREF(stack); - return -1; - } - Py_DECREF(stack); - } - else - self->topstack->context |= LC_TEMPLATE_PARAM_KEY; - - token = PyObject_CallObject(TemplateParamSeparator, NULL); - if (!token) - return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); - return -1; - } - Py_DECREF(token); - if (Tokenizer_push(self, self->topstack->context)) + if (Tokenizer_push(self, self->topstack->context)) return -1; return 0; } @@ -601,37 +737,29 @@ Tokenizer_handle_template_param(Tokenizer* self) /* Handle a template parameter's value at the head of the string. */ -static int -Tokenizer_handle_template_param_value(Tokenizer* self) +static int Tokenizer_handle_template_param_value(Tokenizer* self) { - PyObject *stack, *token; + PyObject *stack; stack = Tokenizer_pop_keeping_context(self); if (!stack) return -1; - if (Tokenizer_write_all(self, stack)) { + if (Tokenizer_emit_all(self, stack)) { Py_DECREF(stack); return -1; } Py_DECREF(stack); self->topstack->context ^= LC_TEMPLATE_PARAM_KEY; self->topstack->context |= LC_TEMPLATE_PARAM_VALUE; - token = PyObject_CallObject(TemplateParamEquals, NULL); - if (!token) + if (Tokenizer_emit(self, TemplateParamEquals)) return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); - return -1; - } - Py_DECREF(token); return 0; } /* Handle the end of a template at the head of the string. */ -static PyObject* -Tokenizer_handle_template_end(Tokenizer* self) +static PyObject* Tokenizer_handle_template_end(Tokenizer* self) { PyObject* stack; @@ -639,7 +767,7 @@ Tokenizer_handle_template_end(Tokenizer* self) stack = Tokenizer_pop_keeping_context(self); if (!stack) return NULL; - if (Tokenizer_write_all(self, stack)) { + if (Tokenizer_emit_all(self, stack)) { Py_DECREF(stack); return NULL; } @@ -653,30 +781,22 @@ Tokenizer_handle_template_end(Tokenizer* self) /* Handle the separator between an argument's name and default. 
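/*
   Editorial trace -- not part of the patch. For "{{foo|bar=baz}}" the
   handlers above cooperate as follows: parsing starts in LC_TEMPLATE_NAME;
   at "|", Tokenizer_handle_template_param() clears that flag, sets
   LC_TEMPLATE_PARAM_KEY, emits a TemplateParamSeparator token, and pushes
   a fresh stack for the key; at "=", Tokenizer_handle_template_param_value()
   merges the key stack back, swaps LC_TEMPLATE_PARAM_KEY for
   LC_TEMPLATE_PARAM_VALUE, and emits TemplateParamEquals; at "}}",
   Tokenizer_handle_template_end() pops the finished template stack.
*/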
*/ -static int -Tokenizer_handle_argument_separator(Tokenizer* self) +static int Tokenizer_handle_argument_separator(Tokenizer* self) { - PyObject* token; self->topstack->context ^= LC_ARGUMENT_NAME; self->topstack->context |= LC_ARGUMENT_DEFAULT; - token = PyObject_CallObject(ArgumentSeparator, NULL); - if (!token) - return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); + if (Tokenizer_emit(self, ArgumentSeparator)) return -1; - } - Py_DECREF(token); return 0; } /* Handle the end of an argument at the head of the string. */ -static PyObject* -Tokenizer_handle_argument_end(Tokenizer* self) +static PyObject* Tokenizer_handle_argument_end(Tokenizer* self) { PyObject* stack = Tokenizer_pop(self); + self->head += 2; return stack; } @@ -684,79 +804,55 @@ Tokenizer_handle_argument_end(Tokenizer* self) /* Parse an internal wikilink at the head of the wikicode string. */ -static int -Tokenizer_parse_wikilink(Tokenizer* self) +static int Tokenizer_parse_wikilink(Tokenizer* self) { Py_ssize_t reset; - PyObject *wikilink, *token; - int i; + PyObject *wikilink; self->head += 2; reset = self->head - 1; - wikilink = Tokenizer_parse(self, LC_WIKILINK_TITLE); + wikilink = Tokenizer_parse(self, LC_WIKILINK_TITLE, 1); if (BAD_ROUTE) { RESET_ROUTE(); self->head = reset; - for (i = 0; i < 2; i++) { - if (Tokenizer_write_text(self, *"[")) - return -1; - } + if (Tokenizer_emit_text(self, "[[")) + return -1; return 0; } if (!wikilink) return -1; - token = PyObject_CallObject(WikilinkOpen, NULL); - if (!token) { - Py_DECREF(wikilink); - return -1; - } - if (Tokenizer_write(self, token)) { - Py_DECREF(token); + if (Tokenizer_emit(self, WikilinkOpen)) { Py_DECREF(wikilink); return -1; } - Py_DECREF(token); - if (Tokenizer_write_all(self, wikilink)) { + if (Tokenizer_emit_all(self, wikilink)) { Py_DECREF(wikilink); return -1; } Py_DECREF(wikilink); - token = PyObject_CallObject(WikilinkClose, NULL); - if (!token) + if (Tokenizer_emit(self, WikilinkClose)) return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); - return -1; - } - Py_DECREF(token); + if (self->topstack->context & LC_FAIL_NEXT) + self->topstack->context ^= LC_FAIL_NEXT; return 0; } /* Handle the separator between a wikilink's title and its text. */ -static int -Tokenizer_handle_wikilink_separator(Tokenizer* self) +static int Tokenizer_handle_wikilink_separator(Tokenizer* self) { - PyObject* token; self->topstack->context ^= LC_WIKILINK_TITLE; self->topstack->context |= LC_WIKILINK_TEXT; - token = PyObject_CallObject(WikilinkSeparator, NULL); - if (!token) - return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); + if (Tokenizer_emit(self, WikilinkSeparator)) return -1; - } - Py_DECREF(token); return 0; } /* Handle the end of a wikilink at the head of the string. */ -static PyObject* -Tokenizer_handle_wikilink_end(Tokenizer* self) +static PyObject* Tokenizer_handle_wikilink_end(Tokenizer* self) { PyObject* stack = Tokenizer_pop(self); self->head += 1; @@ -764,139 +860,468 @@ Tokenizer_handle_wikilink_end(Tokenizer* self) } /* - Parse a section heading at the head of the wikicode string. + Parse the URI scheme of a bracket-enclosed external link. 
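/*
   Editorial aside -- not part of the patch. Tokenizer_parse_wikilink()
   above shows the backtracking idiom used by every recursive parse in this
   file: save the head, attempt the route, and on BAD_ROUTE rewind and emit
   the consumed markup as plain text. A condensed sketch (hypothetical
   function, same calls as the real code):
*/
static int Tokenizer_parse_something_sketch(Tokenizer* self)
{
    Py_ssize_t reset = self->head;    /* remember where we started */
    PyObject* result = Tokenizer_parse(self, LC_WIKILINK_TITLE, 1);

    if (BAD_ROUTE) {                  /* route failed: not a fatal error */
        RESET_ROUTE();
        self->head = reset;           /* rewind the head... */
        return Tokenizer_emit_text(self, "[[");  /* ...keep markup as text */
    }
    if (!result)                      /* NULL without BAD_ROUTE: real error */
        return -1;
    /* on success, emit an open token, `result`, and a close token (elided) */
    Py_DECREF(result);
    return 0;
}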
*/ -static int -Tokenizer_parse_heading(Tokenizer* self) +static int Tokenizer_parse_bracketed_uri_scheme(Tokenizer* self) { - Py_ssize_t reset = self->head; - int best = 1, i, context, diff; - HeadingData *heading; - PyObject *level, *kwargs, *token; + static const char* valid = "abcdefghijklmnopqrstuvwxyz0123456789+.-"; + Textbuffer* buffer; + PyObject* scheme; + Py_UNICODE this; + int slashes, i; - self->global |= GL_HEADING; - self->head += 1; - while (Tokenizer_READ(self, 0) == *"=") { - best++; - self->head++; + if (Tokenizer_push(self, LC_EXT_LINK_URI)) + return -1; + if (Tokenizer_READ(self, 0) == *"/" && Tokenizer_READ(self, 1) == *"/") { + if (Tokenizer_emit_text(self, "//")) + return -1; + self->head += 2; } - context = LC_HEADING_LEVEL_1 << (best > 5 ? 5 : best - 1); - heading = (HeadingData*) Tokenizer_parse(self, context); - if (BAD_ROUTE) { - RESET_ROUTE(); - self->head = reset + best - 1; - for (i = 0; i < best; i++) { - if (Tokenizer_write_text(self, *"=")) + else { + buffer = Textbuffer_new(); + if (!buffer) + return -1; + while ((this = Tokenizer_READ(self, 0)) != *"") { + i = 0; + while (1) { + if (!valid[i]) + goto end_of_loop; + if (this == valid[i]) + break; + i++; + } + Textbuffer_write(&buffer, this); + if (Tokenizer_emit_char(self, this)) { + Textbuffer_dealloc(buffer); return -1; + } + self->head++; } - self->global ^= GL_HEADING; - return 0; + end_of_loop: + if (this != *":") { + Textbuffer_dealloc(buffer); + Tokenizer_fail_route(self); + return 0; + } + if (Tokenizer_emit_char(self, *":")) { + Textbuffer_dealloc(buffer); + return -1; + } + self->head++; + slashes = (Tokenizer_READ(self, 0) == *"/" && + Tokenizer_READ(self, 1) == *"/"); + if (slashes) { + if (Tokenizer_emit_text(self, "//")) { + Textbuffer_dealloc(buffer); + return -1; + } + self->head += 2; + } + scheme = Textbuffer_render(buffer); + Textbuffer_dealloc(buffer); + if (!scheme) + return -1; + if (!IS_SCHEME(scheme, slashes, 0)) { + Py_DECREF(scheme); + Tokenizer_fail_route(self); + return 0; + } + Py_DECREF(scheme); } + return 0; +} - level = PyInt_FromSsize_t(heading->level); - if (!level) { - Py_DECREF(heading->title); - free(heading); - return -1; - } - kwargs = PyDict_New(); - if (!kwargs) { - Py_DECREF(level); - Py_DECREF(heading->title); - free(heading); - return -1; - } - PyDict_SetItemString(kwargs, "level", level); - Py_DECREF(level); - token = PyObject_Call(HeadingStart, NOARGS, kwargs); - Py_DECREF(kwargs); - if (!token) { - Py_DECREF(heading->title); - free(heading); - return -1; +/* + Parse the URI scheme of a free (no brackets) external link. 
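/*
   Editorial aside -- not part of the patch. The inner loop above that
   steps through `valid` is an open-coded character-set test; an equivalent
   helper using strchr() (a hypothetical refactoring, assuming ASCII
   schemes):
*/
#include <string.h>

static int is_scheme_char(Py_UNICODE this)
{
    static const char* valid = "abcdefghijklmnopqrstuvwxyz0123456789+.-";
    /* reject NUL explicitly: strchr(valid, 0) matches the terminator */
    return this != 0 && this < 128 && strchr(valid, (char) this) != NULL;
}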
+*/ +static int Tokenizer_parse_free_uri_scheme(Tokenizer* self) +{ + static const char* valid = "abcdefghijklmnopqrstuvwxyz0123456789+.-"; + Textbuffer *scheme_buffer = Textbuffer_new(), *temp_buffer; + PyObject *scheme; + Py_UNICODE chunk; + int slashes, i, j; + + if (!scheme_buffer) + return -1; + // We have to backtrack through the textbuffer looking for our scheme since + // it was just parsed as text: + temp_buffer = self->topstack->textbuffer; + while (temp_buffer) { + for (i = temp_buffer->size - 1; i >= 0; i--) { + chunk = temp_buffer->data[i]; + if (Py_UNICODE_ISSPACE(chunk) || is_marker(chunk)) + goto end_of_loop; + j = 0; + while (1) { + if (!valid[j]) { + Textbuffer_dealloc(scheme_buffer); + FAIL_ROUTE(0); + return 0; + } + if (chunk == valid[j]) + break; + j++; + } + Textbuffer_write(&scheme_buffer, chunk); + } + temp_buffer = temp_buffer->next; } - if (Tokenizer_write(self, token)) { - Py_DECREF(token); - Py_DECREF(heading->title); - free(heading); + end_of_loop: + scheme = Textbuffer_render(scheme_buffer); + if (!scheme) { + Textbuffer_dealloc(scheme_buffer); return -1; } - Py_DECREF(token); - if (heading->level < best) { - diff = best - heading->level; - for (i = 0; i < diff; i++) { - if (Tokenizer_write_text(self, *"=")) { - Py_DECREF(heading->title); - free(heading); - return -1; - } - } + slashes = (Tokenizer_READ(self, 0) == *"/" && + Tokenizer_READ(self, 1) == *"/"); + if (!IS_SCHEME(scheme, slashes, 1)) { + Py_DECREF(scheme); + Textbuffer_dealloc(scheme_buffer); + FAIL_ROUTE(0); + return 0; } - if (Tokenizer_write_all(self, heading->title)) { - Py_DECREF(heading->title); - free(heading); + Py_DECREF(scheme); + if (Tokenizer_push(self, LC_EXT_LINK_URI)) { + Textbuffer_dealloc(scheme_buffer); return -1; } - Py_DECREF(heading->title); - free(heading); - token = PyObject_CallObject(HeadingEnd, NULL); - if (!token) + if (Tokenizer_emit_textbuffer(self, scheme_buffer, 1)) return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); + if (Tokenizer_emit_char(self, *":")) return -1; + if (slashes) { + if (Tokenizer_emit_text(self, "//")) + return -1; + self->head += 2; } - Py_DECREF(token); - self->global ^= GL_HEADING; return 0; } /* - Handle the end of a section heading at the head of the string. + Handle text in a free external link, including trailing punctuation. */ -static HeadingData* -Tokenizer_handle_heading_end(Tokenizer* self) +static int +Tokenizer_handle_free_link_text(Tokenizer* self, int* parens, + Textbuffer** tail, Py_UNICODE this) { - Py_ssize_t reset = self->head, best; - int i, current, level, diff; - HeadingData *after, *heading; - PyObject *stack; + #define PUSH_TAIL_BUFFER(tail, error) \ + if ((tail)->size || (tail)->next) { \ + if (Tokenizer_emit_textbuffer(self, tail, 0)) \ + return error; \ + tail = Textbuffer_new(); \ + if (!(tail)) \ + return error; \ + } - self->head += 1; - best = 1; - while (Tokenizer_READ(self, 0) == *"=") { - best++; - self->head++; + if (this == *"(" && !(*parens)) { + *parens = 1; + PUSH_TAIL_BUFFER(*tail, -1) } - current = heading_level_from_context(self->topstack->context); - level = current > best ? (best > 6 ? 6 : best) : - (current > 6 ? 6 : current); - after = (HeadingData*) Tokenizer_parse(self, self->topstack->context); - if (BAD_ROUTE) { - RESET_ROUTE(); - if (level < best) { - diff = best - level; - for (i = 0; i < diff; i++) { - if (Tokenizer_write_text(self, *"=")) - return NULL; - } + else if (this == *"," || this == *";" || this == *"\\" || this == *"." || + this == *":" || this == *"!" || this == *"?" 
|| + (!(*parens) && this == *")")) + return Textbuffer_write(tail, this); + else + PUSH_TAIL_BUFFER(*tail, -1) + return Tokenizer_emit_char(self, this); +} + +/* + Really parse an external link. +*/ +static PyObject* +Tokenizer_really_parse_external_link(Tokenizer* self, int brackets, + Textbuffer** extra) +{ + Py_UNICODE this, next; + int parens = 0; + + if (brackets ? Tokenizer_parse_bracketed_uri_scheme(self) : + Tokenizer_parse_free_uri_scheme(self)) + return NULL; + if (BAD_ROUTE) + return NULL; + this = Tokenizer_READ(self, 0); + if (this == *"" || this == *"\n" || this == *" " || this == *"]") + return Tokenizer_fail_route(self); + if (!brackets && this == *"[") + return Tokenizer_fail_route(self); + while (1) { + this = Tokenizer_READ(self, 0); + next = Tokenizer_READ(self, 1); + if (this == *"" || this == *"\n") { + if (brackets) + return Tokenizer_fail_route(self); + self->head--; + return Tokenizer_pop(self); } - self->head = reset + best - 1; - } - else { - for (i = 0; i < best; i++) { - if (Tokenizer_write_text(self, *"=")) { - Py_DECREF(after->title); - free(after); + if (this == *"{" && next == *"{" && Tokenizer_CAN_RECURSE(self)) { + PUSH_TAIL_BUFFER(*extra, NULL) + if (Tokenizer_parse_template_or_argument(self)) return NULL; + } + else if (this == *"[") { + if (!brackets) { + self->head--; + return Tokenizer_pop(self); } + if (Tokenizer_emit_char(self, *"[")) + return NULL; } - if (Tokenizer_write_all(self, after->title)) { - Py_DECREF(after->title); - free(after); - return NULL; + else if (this == *"]") { + if (!brackets) + self->head--; + return Tokenizer_pop(self); } - Py_DECREF(after->title); + else if (this == *"&") { + PUSH_TAIL_BUFFER(*extra, NULL) + if (Tokenizer_parse_entity(self)) + return NULL; + } + else if (this == *" ") { + if (brackets) { + if (Tokenizer_emit(self, ExternalLinkSeparator)) + return NULL; + self->topstack->context ^= LC_EXT_LINK_URI; + self->topstack->context |= LC_EXT_LINK_TITLE; + self->head++; + return Tokenizer_parse(self, 0, 0); + } + if (Textbuffer_write(extra, *" ")) + return NULL; + return Tokenizer_pop(self); + } + else if (!brackets) { + if (Tokenizer_handle_free_link_text(self, &parens, extra, this)) + return NULL; + } + else { + if (Tokenizer_emit_char(self, this)) + return NULL; + } + self->head++; + } +} + +/* + Remove the URI scheme of a new external link from the textbuffer. +*/ +static int +Tokenizer_remove_uri_scheme_from_textbuffer(Tokenizer* self, PyObject* link) +{ + PyObject *text = PyObject_GetAttrString(PyList_GET_ITEM(link, 0), "text"), + *split, *scheme; + Py_ssize_t length; + Textbuffer* temp; + + if (!text) + return -1; + split = PyObject_CallMethod(text, "split", "si", ":", 1); + Py_DECREF(text); + if (!split) + return -1; + scheme = PyList_GET_ITEM(split, 0); + length = PyUnicode_GET_SIZE(scheme); + while (length) { + temp = self->topstack->textbuffer; + if (length <= temp->size) { + temp->size -= length; + break; + } + length -= temp->size; + self->topstack->textbuffer = temp->next; + free(temp->data); + free(temp); + } + Py_DECREF(split); + return 0; +} + +/* + Parse an external link at the head of the wikicode string. 
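/*
   Editorial trace -- not part of the patch. For input "x http://example.com",
   the characters "http" are consumed as ordinary text before any link is
   suspected; Tokenizer_parse_free_uri_scheme() therefore scans backwards
   through the pending textbuffer to recover the scheme, and once the link
   parses, Tokenizer_remove_uri_scheme_from_textbuffer() trims those four
   characters from the buffer so they are not emitted twice.
*/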
+*/ +static int Tokenizer_parse_external_link(Tokenizer* self, int brackets) +{ + #define INVALID_CONTEXT self->topstack->context & AGG_INVALID_LINK + #define NOT_A_LINK \ + if (!brackets && self->topstack->context & LC_DLTERM) \ + return Tokenizer_handle_dl_term(self); \ + return Tokenizer_emit_char(self, Tokenizer_READ(self, 0)) + + Py_ssize_t reset = self->head; + PyObject *link, *kwargs; + Textbuffer *extra = 0; + + if (INVALID_CONTEXT || !(Tokenizer_CAN_RECURSE(self))) { + NOT_A_LINK; + } + extra = Textbuffer_new(); + if (!extra) + return -1; + self->head++; + link = Tokenizer_really_parse_external_link(self, brackets, &extra); + if (BAD_ROUTE) { + RESET_ROUTE(); + self->head = reset; + Textbuffer_dealloc(extra); + NOT_A_LINK; + } + if (!link) { + Textbuffer_dealloc(extra); + return -1; + } + if (!brackets) { + if (Tokenizer_remove_uri_scheme_from_textbuffer(self, link)) { + Textbuffer_dealloc(extra); + Py_DECREF(link); + return -1; + } + } + kwargs = PyDict_New(); + if (!kwargs) { + Textbuffer_dealloc(extra); + Py_DECREF(link); + return -1; + } + PyDict_SetItemString(kwargs, "brackets", brackets ? Py_True : Py_False); + if (Tokenizer_emit_kwargs(self, ExternalLinkOpen, kwargs)) { + Textbuffer_dealloc(extra); + Py_DECREF(link); + return -1; + } + if (Tokenizer_emit_all(self, link)) { + Textbuffer_dealloc(extra); + Py_DECREF(link); + return -1; + } + Py_DECREF(link); + if (Tokenizer_emit(self, ExternalLinkClose)) { + Textbuffer_dealloc(extra); + return -1; + } + if (extra->size || extra->next) + return Tokenizer_emit_textbuffer(self, extra, 0); + Textbuffer_dealloc(extra); + return 0; +} + +/* + Parse a section heading at the head of the wikicode string. +*/ +static int Tokenizer_parse_heading(Tokenizer* self) +{ + Py_ssize_t reset = self->head; + int best = 1, i, context, diff; + HeadingData *heading; + PyObject *level, *kwargs; + + self->global |= GL_HEADING; + self->head += 1; + while (Tokenizer_READ(self, 0) == *"=") { + best++; + self->head++; + } + context = LC_HEADING_LEVEL_1 << (best > 5 ? 5 : best - 1); + heading = (HeadingData*) Tokenizer_parse(self, context, 1); + if (BAD_ROUTE) { + RESET_ROUTE(); + self->head = reset + best - 1; + for (i = 0; i < best; i++) { + if (Tokenizer_emit_char(self, *"=")) + return -1; + } + self->global ^= GL_HEADING; + return 0; + } + level = NEW_INT_FUNC(heading->level); + if (!level) { + Py_DECREF(heading->title); + free(heading); + return -1; + } + kwargs = PyDict_New(); + if (!kwargs) { + Py_DECREF(level); + Py_DECREF(heading->title); + free(heading); + return -1; + } + PyDict_SetItemString(kwargs, "level", level); + Py_DECREF(level); + if (Tokenizer_emit_kwargs(self, HeadingStart, kwargs)) { + Py_DECREF(heading->title); + free(heading); + return -1; + } + if (heading->level < best) { + diff = best - heading->level; + for (i = 0; i < diff; i++) { + if (Tokenizer_emit_char(self, *"=")) { + Py_DECREF(heading->title); + free(heading); + return -1; + } + } + } + if (Tokenizer_emit_all(self, heading->title)) { + Py_DECREF(heading->title); + free(heading); + return -1; + } + Py_DECREF(heading->title); + free(heading); + if (Tokenizer_emit(self, HeadingEnd)) + return -1; + self->global ^= GL_HEADING; + return 0; +} + +/* + Handle the end of a section heading at the head of the string. 
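/*
   Editorial aside -- not part of the patch. The level arithmetic in
   Tokenizer_parse_heading() above and Tokenizer_handle_heading_end() below
   reduces to "take the smaller of the opening and closing runs of '=',
   capped at six"; surplus '=' marks are re-emitted as literal text. An
   equivalent helper (hypothetical):
*/
static int clamp_heading_level(int current, int best)
{
    int level = (current > best) ? best : current;   /* smaller run wins */
    return (level > 6) ? 6 : level;                  /* MediaWiki max: h6 */
}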
+*/ +static HeadingData* Tokenizer_handle_heading_end(Tokenizer* self) +{ + Py_ssize_t reset = self->head, best; + int i, current, level, diff; + HeadingData *after, *heading; + PyObject *stack; + + self->head += 1; + best = 1; + while (Tokenizer_READ(self, 0) == *"=") { + best++; + self->head++; + } + current = heading_level_from_context(self->topstack->context); + level = current > best ? (best > 6 ? 6 : best) : + (current > 6 ? 6 : current); + after = (HeadingData*) Tokenizer_parse(self, self->topstack->context, 1); + if (BAD_ROUTE) { + RESET_ROUTE(); + if (level < best) { + diff = best - level; + for (i = 0; i < diff; i++) { + if (Tokenizer_emit_char(self, *"=")) + return NULL; + } + } + self->head = reset + best - 1; + } + else { + for (i = 0; i < best; i++) { + if (Tokenizer_emit_char(self, *"=")) { + Py_DECREF(after->title); + free(after); + return NULL; + } + } + if (Tokenizer_emit_all(self, after->title)) { + Py_DECREF(after->title); + free(after); + return NULL; + } + Py_DECREF(after->title); level = after->level; free(after); } @@ -916,10 +1341,9 @@ Tokenizer_handle_heading_end(Tokenizer* self) /* Actually parse an HTML entity and ensure that it is valid. */ -static int -Tokenizer_really_parse_entity(Tokenizer* self) +static int Tokenizer_really_parse_entity(Tokenizer* self) { - PyObject *token, *kwargs, *textobj; + PyObject *kwargs, *textobj; Py_UNICODE this; int numeric, hexadecimal, i, j, zeroes, test; char *valid, *text, *buffer, *def; @@ -930,14 +1354,8 @@ Tokenizer_really_parse_entity(Tokenizer* self) return 0; \ } - token = PyObject_CallObject(HTMLEntityStart, NULL); - if (!token) - return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); + if (Tokenizer_emit(self, HTMLEntityStart)) return -1; - } - Py_DECREF(token); self->head++; this = Tokenizer_READ(self, 0); if (this == *"") { @@ -946,14 +1364,8 @@ Tokenizer_really_parse_entity(Tokenizer* self) } if (this == *"#") { numeric = 1; - token = PyObject_CallObject(HTMLEntityNumeric, NULL); - if (!token) + if (Tokenizer_emit(self, HTMLEntityNumeric)) return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); - return -1; - } - Py_DECREF(token); self->head++; this = Tokenizer_READ(self, 0); if (this == *"") { @@ -966,15 +1378,8 @@ Tokenizer_really_parse_entity(Tokenizer* self) if (!kwargs) return -1; PyDict_SetItemString(kwargs, "char", Tokenizer_read(self, 0)); - token = PyObject_Call(HTMLEntityHex, NOARGS, kwargs); - Py_DECREF(kwargs); - if (!token) - return -1; - if (Tokenizer_write(self, token)) { - Py_DECREF(token); + if (Tokenizer_emit_kwargs(self, HTMLEntityHex, kwargs)) return -1; - } - Py_DECREF(token); self->head++; } else @@ -1007,7 +1412,7 @@ Tokenizer_really_parse_entity(Tokenizer* self) self->head++; continue; } - if (i >= 8) + if (i >= MAX_ENTITY_SIZE) FAIL_ROUTE_AND_EXIT() for (j = 0; j < NUM_MARKERS; j++) { if (this == *MARKERS[j]) @@ -1021,178 +1426,1020 @@ Tokenizer_really_parse_entity(Tokenizer* self) break; j++; } - text[i] = this; + text[i] = (char) this; + self->head++; + i++; + } + if (numeric) { + sscanf(text, (hexadecimal ? 
"%x" : "%d"), &test); + if (test < 1 || test > 0x10FFFF) + FAIL_ROUTE_AND_EXIT() + } + else { + i = 0; + while (1) { + def = entitydefs[i]; + if (!def) // We've reached the end of the defs without finding it + FAIL_ROUTE_AND_EXIT() + if (strcmp(text, def) == 0) + break; + i++; + } + } + if (zeroes) { + buffer = calloc(strlen(text) + zeroes + 1, sizeof(char)); + if (!buffer) { + free(text); + PyErr_NoMemory(); + return -1; + } + for (i = 0; i < zeroes; i++) + strcat(buffer, "0"); + strcat(buffer, text); + free(text); + text = buffer; + } + textobj = PyUnicode_FromString(text); + if (!textobj) { + free(text); + return -1; + } + free(text); + kwargs = PyDict_New(); + if (!kwargs) { + Py_DECREF(textobj); + return -1; + } + PyDict_SetItemString(kwargs, "text", textobj); + Py_DECREF(textobj); + if (Tokenizer_emit_kwargs(self, Text, kwargs)) + return -1; + if (Tokenizer_emit(self, HTMLEntityEnd)) + return -1; + return 0; +} + +/* + Parse an HTML entity at the head of the wikicode string. +*/ +static int Tokenizer_parse_entity(Tokenizer* self) +{ + Py_ssize_t reset = self->head; + PyObject *tokenlist; + + if (Tokenizer_push(self, 0)) + return -1; + if (Tokenizer_really_parse_entity(self)) + return -1; + if (BAD_ROUTE) { + RESET_ROUTE(); + self->head = reset; + if (Tokenizer_emit_char(self, *"&")) + return -1; + return 0; + } + tokenlist = Tokenizer_pop(self); + if (!tokenlist) + return -1; + if (Tokenizer_emit_all(self, tokenlist)) { + Py_DECREF(tokenlist); + return -1; + } + Py_DECREF(tokenlist); + return 0; +} + +/* + Parse an HTML comment at the head of the wikicode string. +*/ +static int Tokenizer_parse_comment(Tokenizer* self) +{ + Py_ssize_t reset = self->head + 3; + PyObject *comment; + Py_UNICODE this; + + self->head += 4; + if (Tokenizer_push(self, 0)) + return -1; + while (1) { + this = Tokenizer_READ(self, 0); + if (this == *"") { + comment = Tokenizer_pop(self); + Py_XDECREF(comment); + self->head = reset; + return Tokenizer_emit_text(self, "") + self.assertTrue(code1.matches("Cleanup")) + self.assertTrue(code1.matches("cleanup")) + self.assertTrue(code1.matches(" cleanup\n")) + self.assertFalse(code1.matches("CLEANup")) + self.assertFalse(code1.matches("Blah")) + self.assertTrue(code2.matches("stub")) + self.assertTrue(code2.matches("Stub")) + self.assertFalse(code2.matches("StuB")) def test_filter_family(self): """test the Wikicode.i?filter() family of functions""" @@ -219,11 +260,11 @@ class TestWikicode(TreeEqualityTestCase): code = parse("a{{b}}c[[d]]{{{e}}}{{f}}[[g]]") for func in (code.filter, ifilter(code)): - self.assertEqual(["a", "{{b}}", "c", "[[d]]", "{{{e}}}", "{{f}}", - "[[g]]"], func()) + self.assertEqual(["a", "{{b}}", "b", "c", "[[d]]", "d", "{{{e}}}", + "e", "{{f}}", "f", "[[g]]", "g"], func()) self.assertEqual(["{{{e}}}"], func(forcetype=Argument)) self.assertIs(code.get(4), func(forcetype=Argument)[0]) - self.assertEqual(["a", "c"], func(forcetype=Text)) + self.assertEqual(list("abcdefg"), func(forcetype=Text)) self.assertEqual([], func(forcetype=Heading)) self.assertRaises(TypeError, func, forcetype=True) @@ -235,11 +276,12 @@ class TestWikicode(TreeEqualityTestCase): self.assertEqual(["{{{e}}}"], get_filter("arguments")) self.assertIs(code.get(4), get_filter("arguments")[0]) self.assertEqual([], get_filter("comments")) + self.assertEqual([], get_filter("external_links")) self.assertEqual([], get_filter("headings")) self.assertEqual([], get_filter("html_entities")) self.assertEqual([], get_filter("tags")) self.assertEqual(["{{b}}", "{{f}}"], 
get_filter("templates")) - self.assertEqual(["a", "c"], get_filter("text")) + self.assertEqual(list("abcdefg"), get_filter("text")) self.assertEqual(["[[d]]", "[[g]]"], get_filter("wikilinks")) code2 = parse("{{a|{{b}}|{{c|d={{f}}{{h}}}}}}") @@ -252,13 +294,13 @@ class TestWikicode(TreeEqualityTestCase): code3 = parse("{{foobar}}{{FOO}}{{baz}}{{bz}}") for func in (code3.filter, ifilter(code3)): - self.assertEqual(["{{foobar}}", "{{FOO}}"], func(matches=r"foo")) + self.assertEqual(["{{foobar}}", "{{FOO}}"], func(recursive=False, matches=r"foo")) self.assertEqual(["{{foobar}}", "{{FOO}}"], - func(matches=r"^{{foo.*?}}")) + func(recursive=False, matches=r"^{{foo.*?}}")) self.assertEqual(["{{foobar}}"], - func(matches=r"^{{foo.*?}}", flags=re.UNICODE)) - self.assertEqual(["{{baz}}", "{{bz}}"], func(matches=r"^{{b.*?z")) - self.assertEqual(["{{baz}}"], func(matches=r"^{{b.+?z}}")) + func(recursive=False, matches=r"^{{foo.*?}}", flags=re.UNICODE)) + self.assertEqual(["{{baz}}", "{{bz}}"], func(recursive=False, matches=r"^{{b.*?z")) + self.assertEqual(["{{baz}}"], func(recursive=False, matches=r"^{{b.+?z}}")) self.assertEqual(["{{a|{{b}}|{{c|d={{f}}{{h}}}}}}"], code2.filter_templates(recursive=False)) diff --git a/tests/tokenizer/external_links.mwtest b/tests/tokenizer/external_links.mwtest new file mode 100644 index 0000000..af7a570 --- /dev/null +++ b/tests/tokenizer/external_links.mwtest @@ -0,0 +1,473 @@ +name: basic +label: basic external link +input: "http://example.com/" +output: [ExternalLinkOpen(brackets=False), Text(text="http://example.com/"), ExternalLinkClose()] + +--- + +name: basic_brackets +label: basic external link in brackets +input: "[http://example.com/]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.com/"), ExternalLinkClose()] + +--- + +name: brackets_space +label: basic external link in brackets, with a space after +input: "[http://example.com/ ]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.com/"), ExternalLinkSeparator(), ExternalLinkClose()] + +--- + +name: brackets_title +label: basic external link in brackets, with a title +input: "[http://example.com/ Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.com/"), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: brackets_multiword_title +label: basic external link in brackets, with a multi-word title +input: "[http://example.com/ Example Web Page]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.com/"), ExternalLinkSeparator(), Text(text="Example Web Page"), ExternalLinkClose()] + +--- + +name: brackets_adjacent +label: three adjacent bracket-enclosed external links +input: "[http://foo.com/ Foo][http://bar.com/ Bar]\n[http://baz.com/ Baz]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://foo.com/"), ExternalLinkSeparator(), Text(text="Foo"), ExternalLinkClose(), ExternalLinkOpen(brackets=True), Text(text="http://bar.com/"), ExternalLinkSeparator(), Text(text="Bar"), ExternalLinkClose(), Text(text="\n"), ExternalLinkOpen(brackets=True), Text(text="http://baz.com/"), ExternalLinkSeparator(), Text(text="Baz"), ExternalLinkClose()] + +--- + +name: brackets_newline_before +label: bracket-enclosed link with a newline before the title +input: "[http://example.com/ \nExample]" +output: [Text(text="["), ExternalLinkOpen(brackets=False), Text(text="http://example.com/"), ExternalLinkClose(), Text(text=" \nExample]")] + +--- + +name: brackets_newline_inside +label: bracket-enclosed link with 
a newline in the title +input: "[http://example.com/ Example \nWeb Page]" +output: [Text(text="["), ExternalLinkOpen(brackets=False), Text(text="http://example.com/"), ExternalLinkClose(), Text(text=" Example \nWeb Page]")] + +--- + +name: brackets_newline_after +label: bracket-enclosed link with a newline after the title +input: "[http://example.com/ Example\n]" +output: [Text(text="["), ExternalLinkOpen(brackets=False), Text(text="http://example.com/"), ExternalLinkClose(), Text(text=" Example\n]")] + +--- + +name: brackets_space_before +label: bracket-enclosed link with a space before the URL +input: "[ http://example.com Example]" +output: [Text(text="[ "), ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose(), Text(text=" Example]")] + +--- + +name: brackets_title_like_url +label: bracket-enclosed link with a title that looks like a URL +input: "[http://example.com http://example.com]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.com"), ExternalLinkSeparator(), Text(text="http://example.com"), ExternalLinkClose()] + +--- + +name: brackets_recursive +label: bracket-enclosed link with a bracket-enclosed link as the title +input: "[http://example.com [http://example.com]]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.com"), ExternalLinkSeparator(), Text(text="[http://example.com"), ExternalLinkClose(), Text(text="]")] + +--- + +name: period_after +label: a period after a free link that is excluded +input: "http://example.com." +output: [ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose(), Text(text=".")] + +--- + +name: colons_after +label: colons after a free link that are excluded +input: "http://example.com/foo:bar.:;baz!?," +output: [ExternalLinkOpen(brackets=False), Text(text="http://example.com/foo:bar.:;baz"), ExternalLinkClose(), Text(text="!?,")] + +--- + +name: close_paren_after_excluded +label: a closing parenthesis after a free link that is excluded +input: "http://example.)com)" +output: [ExternalLinkOpen(brackets=False), Text(text="http://example.)com"), ExternalLinkClose(), Text(text=")")] + +--- + +name: close_paren_after_included +label: a closing parenthesis after a free link that is included because of an opening parenthesis in the URL +input: "http://example.(com)" +output: [ExternalLinkOpen(brackets=False), Text(text="http://example.(com)"), ExternalLinkClose()] + +--- + +name: open_bracket_inside +label: an open bracket inside a free link that causes it to be ended abruptly +input: "http://foobar[baz.com" +output: [ExternalLinkOpen(brackets=False), Text(text="http://foobar"), ExternalLinkClose(), Text(text="[baz.com")] + +--- + +name: brackets_period_after +label: a period after a bracket-enclosed link that is included +input: "[http://example.com. 
Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.com."), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: brackets_colons_after +label: colons after a bracket-enclosed link that are included +input: "[http://example.com/foo:bar.:;baz!?, Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.com/foo:bar.:;baz!?,"), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: brackets_close_paren_after_included +label: a closing parenthesis after a bracket-enclosed link that is included +input: "[http://example.)com) Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.)com)"), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: brackets_close_paren_after_included_2 +label: a closing parenthesis after a bracket-enclosed link that is also included +input: "[http://example.(com) Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://example.(com)"), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: brackets_open_bracket_inside +label: an open bracket inside a bracket-enclosed link that is also included +input: "[http://foobar[baz.com Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="http://foobar[baz.com"), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: adjacent_space +label: two free links separated by a space +input: "http://example.com http://example.com" +output: [ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose(), Text(text=" "), ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose()] + +--- + +name: adjacent_newline +label: two free links separated by a newline +input: "http://example.com\nhttp://example.com" +output: [ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose(), Text(text="\n"), ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose()] + +--- + +name: adjacent_close_bracket +label: two free links separated by a close bracket +input: "http://example.com]http://example.com" +output: [ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose(), Text(text="]"), ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose()] + +--- + +name: html_entity_in_url +label: a HTML entity parsed correctly inside a free link +input: "http://exa mple.com/" +output: [ExternalLinkOpen(brackets=False), Text(text="http://exa"), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), Text(text="mple.com/"), ExternalLinkClose()] + +--- + +name: template_in_url +label: a template parsed correctly inside a free link +input: "http://exa{{template}}mple.com/" +output: [ExternalLinkOpen(brackets=False), Text(text="http://exa"), TemplateOpen(), Text(text="template"), TemplateClose(), Text(text="mple.com/"), ExternalLinkClose()] + +--- + +name: argument_in_url +label: an argument parsed correctly inside a free link +input: "http://exa{{{argument}}}mple.com/" +output: [ExternalLinkOpen(brackets=False), Text(text="http://exa"), ArgumentOpen(), Text(text="argument"), ArgumentClose(), Text(text="mple.com/"), ExternalLinkClose()] + +--- + +name: wikilink_in_url +label: a wikilink that destroys a free link +input: "http://exa[[wikilink]]mple.com/" +output: [ExternalLinkOpen(brackets=False), Text(text="http://exa"), ExternalLinkClose(), WikilinkOpen(), 
Text(text="wikilink"), WikilinkClose(), Text(text="mple.com/")] + +--- + +name: external_link_in_url +label: a bracketed link that destroys a free link +input: "http://exa[http://example.com/]mple.com/" +output: [ExternalLinkOpen(brackets=False), Text(text="http://exa"), ExternalLinkClose(), ExternalLinkOpen(brackets=True), Text(text="http://example.com/"), ExternalLinkClose(), Text(text="mple.com/")] + +--- + +name: spaces_padding +label: spaces padding a free link +input: " http://example.com " +output: [Text(text=" "), ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose(), Text(text=" ")] + +--- + +name: text_and_spaces_padding +label: text and spaces padding a free link +input: "x http://example.com x" +output: [Text(text="x "), ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose(), Text(text=" x")] + +--- + +name: template_before +label: a template before a free link +input: "{{foo}}http://example.com" +output: [TemplateOpen(), Text(text="foo"), TemplateClose(), ExternalLinkOpen(brackets=False), Text(text="http://example.com"), ExternalLinkClose()] + +--- + +name: spaces_padding_no_slashes +label: spaces padding a free link with no slashes after the colon +input: " mailto:example@example.com " +output: [Text(text=" "), ExternalLinkOpen(brackets=False), Text(text="mailto:example@example.com"), ExternalLinkClose(), Text(text=" ")] + +--- + +name: text_and_spaces_padding_no_slashes +label: text and spaces padding a free link with no slashes after the colon +input: "x mailto:example@example.com x" +output: [Text(text="x "), ExternalLinkOpen(brackets=False), Text(text="mailto:example@example.com"), ExternalLinkClose(), Text(text=" x")] + +--- + +name: template_before_no_slashes +label: a template before a free link with no slashes after the colon +input: "{{foo}}mailto:example@example.com" +output: [TemplateOpen(), Text(text="foo"), TemplateClose(), ExternalLinkOpen(brackets=False), Text(text="mailto:example@example.com"), ExternalLinkClose()] + +--- + +name: no_slashes +label: a free link with no slashes after the colon +input: "mailto:example@example.com" +output: [ExternalLinkOpen(brackets=False), Text(text="mailto:example@example.com"), ExternalLinkClose()] + +--- + +name: slashes_optional +label: a free link using a scheme that doesn't need slashes, but has them anyway +input: "mailto://example@example.com" +output: [ExternalLinkOpen(brackets=False), Text(text="mailto://example@example.com"), ExternalLinkClose()] + +--- + +name: short +label: a very short free link +input: "mailto://abc" +output: [ExternalLinkOpen(brackets=False), Text(text="mailto://abc"), ExternalLinkClose()] + +--- + +name: slashes_missing +label: slashes missing from a free link with a scheme that requires them +input: "http:example@example.com" +output: [Text(text="http:example@example.com")] + +--- + +name: no_scheme_but_slashes +label: no scheme in a free link, but slashes (protocol-relative free links are not supported) +input: "//example.com" +output: [Text(text="//example.com")] + +--- + +name: no_scheme_but_colon +label: no scheme in a free link, but a colon +input: " :example.com" +output: [Text(text=" :example.com")] + +--- + +name: no_scheme_but_colon_and_slashes +label: no scheme in a free link, but a colon and slashes +input: " ://example.com" +output: [Text(text=" ://example.com")] + +--- + +name: fake_scheme_no_slashes +label: a nonexistent scheme in a free link, without slashes +input: "fake:example.com" +output: 
[Text(text="fake:example.com")] + +--- + +name: fake_scheme_slashes +label: a nonexistent scheme in a free link, with slashes +input: "fake://example.com" +output: [Text(text="fake://example.com")] + +--- + +name: fake_scheme_brackets_no_slashes +label: a nonexistent scheme in a bracketed link, without slashes +input: "[fake:example.com]" +output: [Text(text="[fake:example.com]")] + +--- + +name: fake_scheme_brackets_slashes +label: #=a nonexistent scheme in a bracketed link, with slashes +input: "[fake://example.com]" +output: [Text(text="[fake://example.com]")] + +--- + +name: interrupted_scheme +label: an otherwise valid scheme with something in the middle of it, in a free link +input: "ht?tp://example.com" +output: [Text(text="ht?tp://example.com")] + +--- + +name: interrupted_scheme_brackets +label: an otherwise valid scheme with something in the middle of it, in a bracketed link +input: "[ht?tp://example.com]" +output: [Text(text="[ht?tp://example.com]")] + +--- + +name: no_slashes_brackets +label: no slashes after the colon in a bracketed link +input: "[mailto:example@example.com Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="mailto:example@example.com"), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: space_before_no_slashes_brackets +label: a space before a bracketed link with no slashes after the colon +input: "[ mailto:example@example.com Example]" +output: [Text(text="[ "), ExternalLinkOpen(brackets=False), Text(text="mailto:example@example.com"), ExternalLinkClose(), Text(text=" Example]")] + +--- + +name: slashes_optional_brackets +label: a bracketed link using a scheme that doesn't need slashes, but has them anyway +input: "[mailto://example@example.com Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="mailto://example@example.com"), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: short_brackets +label: a very short link in brackets +input: "[mailto://abc Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="mailto://abc"), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: slashes_missing_brackets +label: slashes missing from a scheme that requires them in a bracketed link +input: "[http:example@example.com Example]" +output: [Text(text="[http:example@example.com Example]")] + +--- + +name: protcol_relative +label: a protocol-relative link (in brackets) +input: "[//example.com Example]" +output: [ExternalLinkOpen(brackets=True), Text(text="//example.com"), ExternalLinkSeparator(), Text(text="Example"), ExternalLinkClose()] + +--- + +name: scheme_missing_but_colon_brackets +label: scheme missing from a bracketed link, but with a colon +input: "[:example.com Example]" +output: [Text(text="[:example.com Example]")] + +--- + +name: scheme_missing_but_colon_slashes_brackets +label: scheme missing from a bracketed link, but with a colon and slashes +input: "[://example.com Example]" +output: [Text(text="[://example.com Example]")] + +--- + +name: unclosed_protocol_relative +label: an unclosed protocol-relative bracketed link +input: "[//example.com" +output: [Text(text="[//example.com")] + +--- + +name: space_before_protcol_relative +label: a space before a protocol-relative bracketed link +input: "[ //example.com]" +output: [Text(text="[ //example.com]")] + +--- + +name: unclosed_just_scheme +label: an unclosed bracketed link, ending after the scheme +input: "[http" +output: [Text(text="[http")] + +--- + +name: unclosed_scheme_colon 
+label: an unclosed bracketed link, ending after the colon +input: "[http:" +output: [Text(text="[http:")] + +--- + +name: unclosed_scheme_colon_slashes +label: an unclosed bracketed link, ending after the slashes +input: "[http://" +output: [Text(text="[http://")] + +--- + +name: incomplete_bracket +label: just an open bracket +input: "[" +output: [Text(text="[")] + +--- + +name: incomplete_scheme_colon +label: a free link with just a scheme and a colon +input: "http:" +output: [Text(text="http:")] + +--- + +name: incomplete_scheme_colon_slashes +label: a free link with just a scheme, colon, and slashes +input: "http://" +output: [Text(text="http://")] + +--- + +name: brackets_scheme_but_no_url +label: brackets around a scheme and a colon +input: "[mailto:]" +output: [Text(text="[mailto:]")] + +--- + +name: brackets_scheme_slashes_but_no_url +label: brackets around a scheme, colon, and slashes +input: "[http://]" +output: [Text(text="[http://]")] + +--- + +name: brackets_scheme_title_but_no_url +label: brackets around a scheme, colon, and slashes, with a title +input: "[http:// Example]" +output: [Text(text="[http:// Example]")] diff --git a/tests/tokenizer/html_entities.mwtest b/tests/tokenizer/html_entities.mwtest index 625dd60..53bedbd 100644 --- a/tests/tokenizer/html_entities.mwtest +++ b/tests/tokenizer/html_entities.mwtest @@ -117,6 +117,20 @@ output: [Text(text="&;")] --- +name: invalid_partial_amp_pound +label: invalid entities: just an ampersand, pound sign +input: "&#" +output: [Text(text="&#")] + +--- + +name: invalid_partial_amp_pound_x +label: invalid entities: just an ampersand, pound sign, x +input: "&#x" +output: [Text(text="&#x")] + +--- + name: invalid_partial_amp_pound_semicolon label: invalid entities: an ampersand, pound sign, and semicolon input: "&#;" diff --git a/tests/tokenizer/integration.mwtest b/tests/tokenizer/integration.mwtest index d3cb419..083b12c 100644 --- a/tests/tokenizer/integration.mwtest +++ b/tests/tokenizer/integration.mwtest @@ -12,6 +12,13 @@ output: [TemplateOpen(), ArgumentOpen(), ArgumentOpen(), Text(text="foo"), Argum --- +name: link_in_template_name +label: a wikilink inside a template name, which breaks the template +input: "{{foo[[bar]]}}" +output: [Text(text="{{foo"), WikilinkOpen(), Text(text="bar"), WikilinkClose(), Text(text="}}")] + +--- + name: rich_heading label: a heading with templates/wikilinks in it input: "== Head{{ing}} [[with]] {{{funky|{{stuf}}}}} ==" @@ -33,6 +40,13 @@ output: [Text(text="&n"), CommentStart(), Text(text="foo"), CommentEnd(), Text(t --- +name: rich_tags +label: a HTML tag with tons of other things in it +input: "{{dubious claim}}[[Source]]" +output: [TemplateOpen(), Text(text="dubious claim"), TemplateClose(), TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), TemplateOpen(), Text(text="abc"), TemplateClose(), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="foo"), TagAttrEquals(), TagAttrQuote(), Text(text="bar "), TemplateOpen(), Text(text="baz"), TemplateClose(), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="abc"), TagAttrEquals(), TemplateOpen(), Text(text="de"), TemplateClose(), Text(text="f"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="ghi"), TagAttrEquals(), Text(text="j"), TemplateOpen(), Text(text="k"), TemplateClose(), TemplateOpen(), Text(text="l"), TemplateClose(), TagAttrStart(pad_first=" \n ", pad_before_eq=" ", pad_after_eq=" 
"), Text(text="mno"), TagAttrEquals(), TagAttrQuote(), TemplateOpen(), Text(text="p"), TemplateClose(), Text(text=" "), WikilinkOpen(), Text(text="q"), WikilinkClose(), Text(text=" "), TemplateOpen(), Text(text="r"), TemplateClose(), TagCloseOpen(padding=""), WikilinkOpen(), Text(text="Source"), WikilinkClose(), TagOpenClose(), Text(text="ref"), TagCloseClose()] + +--- + name: wildcard label: a wildcard assortment of various things input: "{{{{{{{{foo}}bar|baz=biz}}buzz}}usr|{{bin}}}}" @@ -44,3 +58,17 @@ name: wildcard_redux label: an even wilder assortment of various things input: "{{a|b|{{c|[[d]]{{{e}}}}}}}[[f|{{{g}}}]]{{i|j= }}" output: [TemplateOpen(), Text(text="a"), TemplateParamSeparator(), Text(text="b"), TemplateParamSeparator(), TemplateOpen(), Text(text="c"), TemplateParamSeparator(), WikilinkOpen(), Text(text="d"), WikilinkClose(), ArgumentOpen(), Text(text="e"), ArgumentClose(), TemplateClose(), TemplateClose(), WikilinkOpen(), Text(text="f"), WikilinkSeparator(), ArgumentOpen(), Text(text="g"), ArgumentClose(), CommentStart(), Text(text="h"), CommentEnd(), WikilinkClose(), TemplateOpen(), Text(text="i"), TemplateParamSeparator(), Text(text="j"), TemplateParamEquals(), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), TemplateClose()] + +--- + +name: link_inside_dl +label: an external link inside a def list, such that the external link is parsed +input: ";;;mailto:example" +output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), ExternalLinkOpen(brackets=False), Text(text="mailto:example"), ExternalLinkClose()] + +--- + +name: link_inside_dl_2 +label: an external link inside a def list, such that the external link is not parsed +input: ";;;malito:example" +output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="malito"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="example")] diff --git a/tests/tokenizer/tags.mwtest b/tests/tokenizer/tags.mwtest new file mode 100644 index 0000000..a0d7f18 --- /dev/null +++ b/tests/tokenizer/tags.mwtest @@ -0,0 +1,578 @@ +name: basic +label: a basic tag with an open and close +input: "" +output: [TagOpenOpen(), Text(text="ref"), TagCloseOpen(padding=""), TagOpenClose(), Text(text="ref"), TagCloseClose()] + +--- + +name: basic_selfclosing +label: a basic self-closing tag +input: "" +output: [TagOpenOpen(), Text(text="ref"), TagCloseSelfclose(padding="")] + +--- + +name: content +label: a tag with some content in the middle +input: "this is a reference" +output: [TagOpenOpen(), Text(text="ref"), TagCloseOpen(padding=""), Text(text="this is a reference"), TagOpenClose(), Text(text="ref"), TagCloseClose()] + +--- + +name: padded_open +label: a tag with some padding in the open tag +input: "" +output: [TagOpenOpen(), Text(text="ref"), TagCloseOpen(padding=" "), TagOpenClose(), Text(text="ref"), TagCloseClose()] + +--- + +name: padded_close +label: a tag with some padding in the close tag +input: "" +output: [TagOpenOpen(), Text(text="ref"), TagCloseOpen(padding=""), TagOpenClose(), Text(text="ref "), TagCloseClose()] + +--- + +name: padded_selfclosing +label: a self-closing tag with padding +input: "" +output: [TagOpenOpen(), Text(text="ref"), TagCloseSelfclose(padding=" ")] + +--- + +name: 
attribute
+label: a tag with a single attribute
+input: "<ref name></ref>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagCloseOpen(padding=""), TagOpenClose(), Text(text="ref"), TagCloseClose()]
+
+---
+
+name: attribute_value
+label: a tag with a single attribute with a value
+input: "<ref name=foo></ref>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), Text(text="foo"), TagCloseOpen(padding=""), TagOpenClose(), Text(text="ref"), TagCloseClose()]
+
+---
+
+name: attribute_quoted
+label: a tag with a single quoted attribute
+input: "<ref name="foo bar"></ref>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), TagAttrQuote(), Text(text="foo bar"), TagCloseOpen(padding=""), TagOpenClose(), Text(text="ref"), TagCloseClose()]
+
+---
+
+name: attribute_hyphen
+label: a tag with a single attribute, containing a hyphen
+input: "<ref name=foo-bar></ref>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), Text(text="foo-bar"), TagCloseOpen(padding=""), TagOpenClose(), Text(text="ref"), TagCloseClose()]
+
+---
+
+name: attribute_quoted_hyphen
+label: a tag with a single quoted attribute, containing a hyphen
+input: "<ref name="foo-bar"></ref>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), TagAttrQuote(), Text(text="foo-bar"), TagCloseOpen(padding=""), TagOpenClose(), Text(text="ref"), TagCloseClose()]
+
+---
+
+name: attribute_selfclosing
+label: a self-closing tag with a single attribute
+input: "<ref name/>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagCloseSelfclose(padding="")]
+
+---
+
+name: attribute_selfclosing_value
+label: a self-closing tag with a single attribute with a value
+input: "<ref name=foo/>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), Text(text="foo"), TagCloseSelfclose(padding="")]
+
+---
+
+name: attribute_selfclosing_value_quoted
+label: a self-closing tag with a single quoted attribute
+input: "<ref name="foo"/>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), TagAttrQuote(), Text(text="foo"), TagCloseSelfclose(padding="")]
+
+---
+
+name: nested_tag
+label: a tag nested within the attributes of another
+input: "<ref name=<span style="color: red;">foo</span>>citation</ref>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), TagOpenOpen(), Text(text="span"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="style"), TagAttrEquals(), TagAttrQuote(), Text(text="color: red;"), TagCloseOpen(padding=""), Text(text="foo"), TagOpenClose(), Text(text="span"), TagCloseClose(), TagCloseOpen(padding=""), Text(text="citation"), TagOpenClose(), Text(text="ref"), TagCloseClose()]
+
+---
+
+name: nested_tag_quoted
+label: a tag nested within the attributes of another, quoted
+input: "<ref name="<span style="color: red;">foo</span>">citation</ref>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), TagAttrQuote(), TagOpenOpen(), Text(text="span"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="style"), TagAttrEquals(), TagAttrQuote(), Text(text="color: red;"), TagCloseOpen(padding=""), Text(text="foo"), TagOpenClose(), Text(text="span"), TagCloseClose(), TagCloseOpen(padding=""), Text(text="citation"), TagOpenClose(), Text(text="ref"), TagCloseClose()]
+
+---
+
+name: nested_troll_tag
+label: a bogus tag that appears to be nested within the attributes of another
+input: "<ref <ref>>citation</ref>"
+output: [Text(text="<ref <ref>>citation</ref>")]
+
+---
+
+name: nested_troll_tag_quoted
+label: a bogus tag that appears to be nested within the attributes of another, quoted
+input: "<ref name="<ref>">citation</ref>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="name"), TagAttrEquals(), TagAttrQuote(), Text(text="<ref>"), TagCloseOpen(padding=""), Text(text="citation"), TagOpenClose(), Text(text="ref"), TagCloseClose()]
+
+---
+
+name: invalid_space_begin_open
+label: invalid tag: a space at the beginning of the open tag
+input: "< ref>test</ref>"
+output: [Text(text="< ref>test</ref>")]
+
+---
+
+name: invalid_space_begin_close
+label: invalid tag: a space at the beginning of the close tag
+input: "<ref>test</ ref>"
+output: [Text(text="<ref>test</ ref>")]
+
+---
+
+name: valid_space_end
+label: valid tag: spaces at the ends of both the open and close tags
+input: "<ref >test</ref >"
+output: [TagOpenOpen(), Text(text="ref"), TagCloseOpen(padding=" "), Text(text="test"), TagOpenClose(), Text(text="ref "), TagCloseClose()]
+
+---
+
+name: invalid_template_ends
+label: invalid tag: a template at the ends of both the open and close tags
+input: "<ref {{foo}}>test</ref {{foo}}>"
+output: [Text(text="<ref {{foo}}>test</ref {{foo}}>")]
+
+---
+
+name: invalid_template_end_close
+label: invalid tag: a template at the end of the close tag
+input: "<ref>test</ref {{foo}}>"
+output: [Text(text="<ref>test</ref {{foo}}>")]
+
+---
+
+name: valid_template_end_open
+label: valid tag: a template at the end of the open tag
+input: "<ref {{foo}}>test</ref>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), TemplateOpen(), Text(text="foo"), TemplateClose(), TagCloseOpen(padding=""), Text(text="test"), TagOpenClose(), Text(text="ref"), TagCloseClose()]
+
+---
+
+name: valid_template_end_open_space_end_close
+label: valid tag: a template at the end of the open tag; whitespace at the end of the close tag
+input: "<ref {{foo}}>test</ref\n>"
+output: [TagOpenOpen(), Text(text="ref"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), TemplateOpen(), Text(text="foo"), TemplateClose(), TagCloseOpen(padding=""), Text(text="test"), TagOpenClose(), Text(text="ref\n"), TagCloseClose()]
+
+---
+
+name: invalid_template_end_open_nospace
+label: invalid tag: a template at the end of the open tag, without spacing
+input: "<ref{{foo}}>test</ref>"
+output: [Text(text="<ref"), TemplateOpen(), Text(text="foo"), TemplateClose(), Text(text=">test</ref>")]
+
+---
+
+name: invalid_template_start_close
+label: invalid tag: a template at the beginning of the close tag
+input: "<ref>test</{{foo}}ref>"
+output: [Text(text="<ref>test</"), TemplateOpen(), Text(text="foo"), TemplateClose(), Text(text="ref>")]
+
+---
+
+name: invalid_template_start_open
+label: invalid tag: a template at the beginning of the open tag
+input: "<{{foo}}ref>test</ref>"
+output: [Text(text="<"), TemplateOpen(), Text(text="foo"), TemplateClose(), Text(text="ref>test</ref>")]
+
+---
+
+name: unclosed_quote
+label: a quoted attribute that is never closed
+input: "<span style="foo"bar>stuff</span>"
+output: [TagOpenOpen(), Text(text="span"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="style"), TagAttrEquals(), Text(text="\"foo\"bar"), TagCloseOpen(padding=""), Text(text="stuff"), TagOpenClose(), Text(text="span"), TagCloseClose()]
+
+---
+
+name: fake_quote_complex
+label: a fake quoted attribute, with spaces and templates and links
+input: "<span style="foo {{bar}}\n[[baz]]"buzz >stuff</span>"
+output: [TagOpenOpen(), Text(text="span"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="style"), TagAttrEquals(), Text(text="\"foo"), TagAttrStart(pad_first=" ", pad_before_eq="\n", pad_after_eq=""), TemplateOpen(), Text(text="bar"), TemplateClose(), TagAttrStart(pad_first="", pad_before_eq=" ", pad_after_eq=""), WikilinkOpen(), Text(text="baz"), WikilinkClose(), Text(text="\"buzz"), TagCloseOpen(padding=""), Text(text="stuff"), TagOpenClose(), Text(text="span"), TagCloseClose()]
WikilinkOpen(), Text(text="baz"), WikilinkClose(), Text(text="\"buzz"), TagCloseOpen(padding=""), Text(text="stuff"), TagOpenClose(), Text(text="span"), TagCloseClose()] + +--- + +name: incomplete_lbracket +label: incomplete tags: just a left bracket +input: "<" +output: [Text(text="<")] + +--- + +name: incomplete_lbracket_junk +label: incomplete tags: just a left bracket, surrounded by stuff +input: "foo" +output: [Text(text="junk ")] + +--- + +name: incomplete_open_unnamed_attr +label: incomplete tags: an open tag, unnamed attribute +input: "junk " +output: [Text(text="junk ")] + +--- + +name: incomplete_open_attr_equals +label: incomplete tags: an open tag, attribute, equal sign +input: "junk " +output: [Text(text="junk ")] + +--- + +name: incomplete_open_attr +label: incomplete tags: an open tag, attribute with a key/value +input: "junk " +output: [Text(text="junk ")] + +--- + +name: incomplete_open_attr_quoted +label: incomplete tags: an open tag, attribute with a key/value, quoted +input: "junk " +output: [Text(text="junk ")] + +--- + +name: incomplete_open_text +label: incomplete tags: an open tag, text +input: "junk foo" +output: [Text(text="junk foo")] + +--- + +name: incomplete_open_attr_text +label: incomplete tags: an open tag, attribute with a key/value, text +input: "junk bar" +output: [Text(text="junk bar")] + +--- + +name: incomplete_open_text_lbracket +label: incomplete tags: an open tag, text, left open bracket +input: "junk bar<" +output: [Text(text="junk bar<")] + +--- + +name: incomplete_open_text_lbracket_slash +label: incomplete tags: an open tag, text, left bracket, slash +input: "junk barbarbarbar" +output: [Text(text="junk bar")] + +--- + +name: incomplete_unclosed_close +label: incomplete tags: an unclosed close tag +input: "junk " +output: [Text(text="junk ")] + +--- + +name: incomplete_no_tag_name_open +label: incomplete tags: no tag name within brackets; just an open +input: "junk <>" +output: [Text(text="junk <>")] + +--- + +name: incomplete_no_tag_name_selfclosing +label: incomplete tags: no tag name within brackets; self-closing +input: "junk < />" +output: [Text(text="junk < />")] + +--- + +name: incomplete_no_tag_name_open_close +label: incomplete tags: no tag name within brackets; open and close +input: "junk <>" +output: [Text(text="junk <>")] + +--- + +name: backslash_premature_before +label: a backslash before a quote before a space +input: "blah" +output: [TagOpenOpen(), Text(text="foo"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attribute"), TagAttrEquals(), TagAttrQuote(), Text(text="this is\\\" quoted"), TagCloseOpen(padding=""), Text(text="blah"), TagOpenClose(), Text(text="foo"), TagCloseClose()] + +--- + +name: backslash_premature_after +label: a backslash before a quote after a space +input: "blah" +output: [TagOpenOpen(), Text(text="foo"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attribute"), TagAttrEquals(), TagAttrQuote(), Text(text="this is \\\"quoted"), TagCloseOpen(padding=""), Text(text="blah"), TagOpenClose(), Text(text="foo"), TagCloseClose()] + +--- + +name: backslash_premature_middle +label: a backslash before a quote in the middle of a word +input: "blah" +output: [TagOpenOpen(), Text(text="foo"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attribute"), TagAttrEquals(), TagAttrQuote(), Text(text="this i\\\"s quoted"), TagCloseOpen(padding=""), Text(text="blah"), TagOpenClose(), Text(text="foo"), TagCloseClose()] + +--- + +name: 
backslash_adjacent +label: escaped quotes next to unescaped quotes +input: "blah" +output: [TagOpenOpen(), Text(text="foo"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attribute"), TagAttrEquals(), TagAttrQuote(), Text(text="\\\"this is quoted\\\""), TagCloseOpen(padding=""), Text(text="blah"), TagOpenClose(), Text(text="foo"), TagCloseClose()] + +--- + +name: backslash_endquote +label: backslashes before the end quote, causing the attribute to become unquoted +input: "blah" +output: [TagOpenOpen(), Text(text="foo"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attribute"), TagAttrEquals(), Text(text="\"this_is"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="quoted\\\""), TagCloseOpen(padding=""), Text(text="blah"), TagOpenClose(), Text(text="foo"), TagCloseClose()] + +--- + +name: backslash_double +label: two adjacent backslashes, which do *not* affect the quote +input: "blah" +output: [TagOpenOpen(), Text(text="foo"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attribute"), TagAttrEquals(), TagAttrQuote(), Text(text="this is\\\\"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="quoted\""), TagCloseOpen(padding=""), Text(text="blah"), TagOpenClose(), Text(text="foo"), TagCloseClose()] + +--- + +name: backslash_triple +label: three adjacent backslashes, which do *not* affect the quote +input: "blah" +output: [TagOpenOpen(), Text(text="foo"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attribute"), TagAttrEquals(), TagAttrQuote(), Text(text="this is\\\\\\"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="quoted\""), TagCloseOpen(padding=""), Text(text="blah"), TagOpenClose(), Text(text="foo"), TagCloseClose()] + +--- + +name: backslash_unaffecting +label: backslashes near quotes, but not immediately adjacent, thus having no effect +input: "blah" +output: [TagOpenOpen(), Text(text="foo"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attribute"), TagAttrEquals(), TagAttrQuote(), Text(text="\\quote\\d"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="also"), TagAttrEquals(), Text(text="\"quote\\d\\\""), TagCloseOpen(padding=""), Text(text="blah"), TagOpenClose(), Text(text="foo"), TagCloseClose()] + +--- + +name: unparsable +label: a tag that should not be put through the normal parser +input: "{{t1}}{{t2}}{{t3}}" +output: [TemplateOpen(), Text(text="t1"), TemplateClose(), TagOpenOpen(), Text(text="nowiki"), TagCloseOpen(padding=""), Text(text="{{t2}}"), TagOpenClose(), Text(text="nowiki"), TagCloseClose(), TemplateOpen(), Text(text="t3"), TemplateClose()] + +--- + +name: unparsable_complex +label: a tag that should not be put through the normal parser; lots of stuff inside +input: "{{t1}}
    {{t2}}\n==Heading==\nThis is some text with a [[page|link]].
    {{t3}}" +output: [TemplateOpen(), Text(text="t1"), TemplateClose(), TagOpenOpen(), Text(text="pre"), TagCloseOpen(padding=""), Text(text="{{t2}}\n==Heading==\nThis is some text with a [[page|link]]."), TagOpenClose(), Text(text="pre"), TagCloseClose(), TemplateOpen(), Text(text="t3"), TemplateClose()] + +--- + +name: unparsable_attributed +label: a tag that should not be put through the normal parser; parsed attributes +input: "{{t1}}{{t2}}{{t3}}" +output: [TemplateOpen(), Text(text=u't1'), TemplateClose(), TagOpenOpen(), Text(text="nowiki"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attr"), TagAttrEquals(), Text(text="val"), TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), Text(text="attr2"), TagAttrEquals(), TagAttrQuote(), TemplateOpen(), Text(text="val2"), TemplateClose(), TagCloseOpen(padding=""), Text(text="{{t2}}"), TagOpenClose(), Text(text="nowiki"), TagCloseClose(), TemplateOpen(), Text(text="t3"), TemplateClose()] + +--- + +name: unparsable_incomplete +label: a tag that should not be put through the normal parser; incomplete +input: "{{t1}}{{t2}}{{t3}}" +output: [TemplateOpen(), Text(text="t1"), TemplateClose(), Text(text=""), TemplateOpen(), Text(text="t2"), TemplateClose(), TemplateOpen(), Text(text="t3"), TemplateClose()] + +--- + +name: unparsable_entity +label: a HTML entity inside unparsable text is still parsed +input: "{{t1}}{{t2}} {{t3}}{{t4}}" +output: [TemplateOpen(), Text(text="t1"), TemplateClose(), TagOpenOpen(), Text(text="nowiki"), TagCloseOpen(padding=""), Text(text="{{t2}}"), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), Text(text="{{t3}}"), TagOpenClose(), Text(text="nowiki"), TagCloseClose(), TemplateOpen(), Text(text="t4"), TemplateClose()] + +--- + +name: unparsable_entity_incomplete +label: an incomplete HTML entity inside unparsable text +input: "&" +output: [TagOpenOpen(), Text(text="nowiki"), TagCloseOpen(padding=""), Text(text="&"), TagOpenClose(), Text(text="nowiki"), TagCloseClose()] + +--- + +name: unparsable_entity_incomplete_2 +label: an incomplete HTML entity inside unparsable text +input: "&" +output: [Text(text="&")] + +--- + +name: single_open_close +label: a tag that supports being single; both an open and a close tag +input: "foo
+output: [Text(text="foo"), TagOpenOpen(), Text(text="li"), TagCloseOpen(padding=""), Text(text="bar"), TemplateOpen(), Text(text="baz"), TemplateClose(), TagOpenClose(), Text(text="li"), TagCloseClose()]
+
+---
+
+name: single_open
+label: a tag that supports being single; just an open tag
+input: "foo<li>bar{{baz}}"
+output: [Text(text="foo"), TagOpenOpen(), Text(text="li"), TagCloseSelfclose(padding="", implicit=True), Text(text="bar"), TemplateOpen(), Text(text="baz"), TemplateClose()]
+
+---
+
+name: single_selfclose
+label: a tag that supports being single; a self-closing tag
+input: "foo<li/>bar{{baz}}"
+output: [Text(text="foo"), TagOpenOpen(), Text(text="li"), TagCloseSelfclose(padding=""), Text(text="bar"), TemplateOpen(), Text(text="baz"), TemplateClose()]
+
+---
+
+name: single_close
+label: a tag that supports being single; just a close tag
+input: "foo</li>bar{{baz}}"
+output: [Text(text="foo</li>bar"), TemplateOpen(), Text(text="baz"), TemplateClose()]
+
+---
+
+name: single_only_open_close
+label: a tag that can only be single; both an open and a close tag
+input: "foo<br>bar{{baz}}</br>"
+output: [Text(text="foo"), TagOpenOpen(), Text(text="br"), TagCloseSelfclose(padding="", implicit=True), Text(text="bar"), TemplateOpen(), Text(text="baz"), TemplateClose(), TagOpenOpen(invalid=True), Text(text="br"), TagCloseSelfclose(padding="", implicit=True)]
+
+---
+
+name: single_only_open
+label: a tag that can only be single; just an open tag
+input: "foo<br>bar{{baz}}"
+output: [Text(text="foo"), TagOpenOpen(), Text(text="br"), TagCloseSelfclose(padding="", implicit=True), Text(text="bar"), TemplateOpen(), Text(text="baz"), TemplateClose()]
+
+---
+
+name: single_only_selfclose
+label: a tag that can only be single; a self-closing tag
+input: "foo<br/>bar{{baz}}"
+output: [Text(text="foo"), TagOpenOpen(), Text(text="br"), TagCloseSelfclose(padding=""), Text(text="bar"), TemplateOpen(), Text(text="baz"), TemplateClose()]
+
+---
+
+name: single_only_close
+label: a tag that can only be single; just a close tag
+input: "foo</br>bar{{baz}}"
+output: [Text(text="foo"), TagOpenOpen(invalid=True), Text(text="br"), TagCloseSelfclose(padding="", implicit=True), Text(text="bar"), TemplateOpen(), Text(text="baz"), TemplateClose()]
+
+---
+
+name: single_only_double
+label: a tag that can only be single; a tag with slashes at the beginning and end
+input: "foo</br/>bar{{baz}}"
+output: [Text(text="foo"), TagOpenOpen(invalid=True), Text(text="br"), TagCloseSelfclose(padding=""), Text(text="bar"), TemplateOpen(), Text(text="baz"), TemplateClose()]
+
+---
+
+name: single_only_close_attribute
+label: a tag that can only be single; presented as a close tag with an attribute
+input: "</br id="break">"
+output: [TagOpenOpen(invalid=True), Text(text="br"), TagAttrStart(pad_first=" ", pad_after_eq="", pad_before_eq=""), Text(text="id"), TagAttrEquals(), TagAttrQuote(), Text(text="break"), TagCloseSelfclose(padding="", implicit=True)]
+
+---
+
+name: capitalization
+label: caps should be ignored within tag names
+input: "<NoWiKi>{{test}}</nOwIkI>"
+output: [TagOpenOpen(), Text(text="NoWiKi"), TagCloseOpen(padding=""), Text(text="{{test}}"), TagOpenClose(), Text(text="nOwIkI"), TagCloseClose()]
diff --git a/tests/tokenizer/tags_wikimarkup.mwtest b/tests/tokenizer/tags_wikimarkup.mwtest
new file mode 100644
index 0000000..feff9c5
--- /dev/null
+++ b/tests/tokenizer/tags_wikimarkup.mwtest
@@ -0,0 +1,523 @@
+name: basic_italics
+label: basic italic text
+input: "''text''"
+output: [TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="text"), TagOpenClose(), Text(text="i"), TagCloseClose()]
+
+---
+
+name: basic_bold
+label: basic bold text
+input: "'''text'''"
+output: [TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="text"), TagOpenClose(), Text(text="b"), TagCloseClose()]
+
+---
+
+name: basic_ul
+label: basic unordered list
+input: "*text"
+output: [TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="text")]
+
+---
+
+name: basic_ol
+label: basic ordered list
+input: "#text"
+output: [TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="text")]
+
+---
+
+name: basic_dt
+label: basic description term
+input: ";text"
+output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="text")]
+
+---
+
+name: basic_dd
+label: basic description item
+input: ":text"
+output: [TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="text")]
+
+---
+
+name: basic_hr
+label: basic horizontal rule
+input: "----"
+output: [TagOpenOpen(wiki_markup="----"), Text(text="hr"), TagCloseSelfclose()]
+
+---
+
+name: complex_italics
+label: italics with a lot in them
+input: "''this is a&nbsp;test of [[Italic text|italics]] with {{plenty|of|stuff}}''"
+output: [TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="this is a"), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), Text(text="test of "), WikilinkOpen(), Text(text="Italic text"), WikilinkSeparator(), Text(text="italics"), WikilinkClose(), Text(text=" with "), TemplateOpen(), Text(text="plenty"), TemplateParamSeparator(), Text(text="of"), TemplateParamSeparator(), Text(text="stuff"), TemplateClose(), TagOpenClose(), Text(text="i"), TagCloseClose()]
+
+---
+
+name: multiline_italics
+label: italics spanning multiple lines
+input: "foo\nbar''testing\ntext\nspanning\n\n\n\n\nmultiple\nlines''foo\n\nbar"
+output: [Text(text="foo\nbar"), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="testing\ntext\nspanning\n\n\n\n\nmultiple\nlines"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text="foo\n\nbar")]
+
+---
+
+name: unending_italics
+label: italics without an ending tag
+input: "''unending formatting!"
+output: [Text(text="''unending formatting!")] + +--- + +name: misleading_italics_end +label: italics with something that looks like an end but isn't +input: "''this is 'not' the en'd'''" +output: [Text(text="''this is 'not' the en'd'"), TagOpenOpen(), Text(text="nowiki"), TagCloseOpen(padding=""), Text(text="''"), TagOpenClose(), Text(text="nowiki"), TagCloseClose()] +] + +--- + +name: italics_start_outside_end_inside +label: italics that start outside a link and end inside it +input: "''foo[[bar|baz'']]spam" +output: [Text(text="''foo"), WikilinkOpen(), Text(text="bar"), WikilinkSeparator(), Text(text="baz''"), WikilinkClose(), Text(text="spam")] + +--- + +name: italics_start_inside_end_outside +label: italics that start inside a link and end outside it +input: "[[foo|''bar]]baz''spam" +output: [Text(text="[[foo|"), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="bar]]baz"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text="spam")] + +--- + +name: complex_bold +label: bold with a lot in it +input: "'''this is a test of [[Bold text|bold]] with {{plenty|of|stuff}}'''" +output: [TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="this is a"), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), Text(text="test of "), WikilinkOpen(), Text(text="Bold text"), WikilinkSeparator(), Text(text="bold"), WikilinkClose(), Text(text=" with "), TemplateOpen(), Text(text="plenty"), TemplateParamSeparator(), Text(text="of"), TemplateParamSeparator(), Text(text="stuff"), TemplateClose(), TagOpenClose(), Text(text="b"), TagCloseClose()] + +--- + +name: multiline_bold +label: bold spanning mulitple lines +input: "foo\nbar'''testing\ntext\nspanning\n\n\n\n\nmultiple\nlines'''foo\n\nbar" +output: [Text(text="foo\nbar"), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="testing\ntext\nspanning\n\n\n\n\nmultiple\nlines"), TagOpenClose(), Text(text="b"), TagCloseClose(), Text(text="foo\n\nbar")] + +--- + +name: unending_bold +label: bold without an ending tag +input: "'''unending formatting!" +output: [Text(text="'''unending formatting!")] + +--- + +name: misleading_bold_end +label: bold with something that looks like an end but isn't +input: "'''this is 'not' the en''d''''" +output: [Text(text="'"), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="this is 'not' the en"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text="d'"), TagOpenOpen(), Text(text="nowiki"), TagCloseOpen(padding=""), Text(text="'''"), TagOpenClose(), Text(text="nowiki"), TagCloseClose()] + +--- + +name: bold_start_outside_end_inside +label: bold that start outside a link and end inside it +input: "'''foo[[bar|baz''']]spam" +output: [Text(text="'''foo"), WikilinkOpen(), Text(text="bar"), WikilinkSeparator(), Text(text="baz'''"), WikilinkClose(), Text(text="spam")] + +--- + +name: bold_start_inside_end_outside +label: bold that start inside a link and end outside it +input: "[[foo|'''bar]]baz'''spam" +output: [Text(text="[[foo|"), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="bar]]baz"), TagOpenClose(), Text(text="b"), TagCloseClose(), Text(text="spam")] + +--- + +name: bold_and_italics +label: bold and italics together +input: "this is '''''bold and italic text'''''!" 
+output: [Text(text="this is "), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="bold and italic text"), TagOpenClose(), Text(text="b"), TagCloseClose(), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text="!")] + +--- + +name: both_then_bold +label: text that starts bold/italic, then is just bold +input: "'''''both''bold'''" +output: [TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="both"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text="bold"), TagOpenClose(), Text(text="b"), TagCloseClose()] + +--- + +name: both_then_italics +label: text that starts bold/italic, then is just italic +input: "'''''both'''italics''" +output: [TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="both"), TagOpenClose(), Text(text="b"), TagCloseClose(), Text(text="italics"), TagOpenClose(), Text(text="i"), TagCloseClose()] + +--- + +name: bold_then_both +label: text that starts just bold, then is bold/italic +input: "'''bold''both'''''" +output: [TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="bold"), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="both"), TagOpenClose(), Text(text="i"), TagCloseClose(), TagOpenClose(), Text(text="b"), TagCloseClose()] + +--- + +name: italics_then_both +label: text that starts just italic, then is bold/italic +input: "''italics'''both'''''" +output: [TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="italics"), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="both"), TagOpenClose(), Text(text="b"), TagCloseClose(), TagOpenClose(), Text(text="i"), TagCloseClose()] + +--- + +name: italics_then_bold +label: text that starts italic, then is bold +input: "none''italics'''''bold'''none" +output: [Text(text="none"), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="italics"), TagOpenClose(), Text(text="i"), TagCloseClose(), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="bold"), TagOpenClose(), Text(text="b"), TagCloseClose(), Text(text="none")] + +--- + +name: bold_then_italics +label: text that starts bold, then is italic +input: "none'''bold'''''italics''none" +output: [Text(text="none"), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="bold"), TagOpenClose(), Text(text="b"), TagCloseClose(), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="italics"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text="none")] + +--- + +name: five_three +label: five ticks to open, three to close (bold) +input: "'''''foobar'''" +output: [Text(text="''"), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="foobar"), TagOpenClose(), Text(text="b"), TagCloseClose()] + +--- + +name: five_two +label: five ticks to open, two to close (bold) +input: "'''''foobar''" +output: [Text(text="'''"), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="foobar"), TagOpenClose(), Text(text="i"), TagCloseClose()] + +--- + +name: four +label: four ticks +input: "foo ''''bar'''' baz" +output: [Text(text="foo '"), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="bar'"), TagOpenClose(), Text(text="b"), TagCloseClose(), Text(text=" baz")] + +--- + +name: four_two 
+label: four ticks to open, two to close
+input: "foo ''''bar'' baz"
+output: [Text(text="foo ''"), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="bar"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text=" baz")]
+
+---
+
+name: two_three
+label: two ticks to open, three to close
+input: "foo ''bar''' baz"
+output: [Text(text="foo "), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="bar'"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text=" baz")]
+
+---
+
+name: two_four
+label: two ticks to open, four to close
+input: "foo ''bar'''' baz"
+output: [Text(text="foo "), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="bar''"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text=" baz")]
+
+---
+
+name: two_three_two
+label: two ticks to open, three to close, two afterwards
+input: "foo ''bar''' baz''"
+output: [Text(text="foo "), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), Text(text="bar''' baz"), TagOpenClose(), Text(text="i"), TagCloseClose()]
+
+---
+
+name: two_four_four
+label: two ticks to open, four to close, four afterwards
+input: "foo ''bar'''' baz''''"
+output: [Text(text="foo ''bar'"), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text=" baz'"), TagOpenClose(), Text(text="b"), TagCloseClose()]
+
+---
+
+name: seven
+label: seven ticks
+input: "'''''''seven'''''''"
+output: [Text(text="''"), TagOpenOpen(wiki_markup="''"), Text(text="i"), TagCloseOpen(), TagOpenOpen(wiki_markup="'''"), Text(text="b"), TagCloseOpen(), Text(text="seven''"), TagOpenClose(), Text(text="b"), TagCloseClose(), TagOpenClose(), Text(text="i"), TagCloseClose()]
+
+---
+
+name: complex_ul
+label: ul with a lot in it
+input: "* this is a&nbsp;test of an [[Unordered list|ul]] with {{plenty|of|stuff}}"
+output: [TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text=" this is a"), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), Text(text="test of an "), WikilinkOpen(), Text(text="Unordered list"), WikilinkSeparator(), Text(text="ul"), WikilinkClose(), Text(text=" with "), TemplateOpen(), Text(text="plenty"), TemplateParamSeparator(), Text(text="of"), TemplateParamSeparator(), Text(text="stuff"), TemplateClose()]
+
+---
+
+name: ul_multiline_template
+label: ul with a template that spans multiple lines
+input: "* this has a template with a {{line|\nbreak}}\nthis is not part of the list"
+output: [TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text=" this has a template with a "), TemplateOpen(), Text(text="line"), TemplateParamSeparator(), Text(text="\nbreak"), TemplateClose(), Text(text="\nthis is not part of the list")]
+
+---
+
+name: ul_adjacent
+label: multiple adjacent uls
+input: "a\n*b\n*c\nd\n*e\nf"
+output: [Text(text="a\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="c\nd\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="e\nf")]
+
+---
+
+name: ul_depths
+label: multiple adjacent uls, with differing depths
+input: "*a\n**b\n***c\n********d\n**e\nf\n***g"
+output: [TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="a\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="c\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="d\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="e\nf\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="g")]
TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="c\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="d\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="e\nf\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="g")] + +--- + +name: ul_space_before +label: uls with space before them +input: "foo *bar\n *baz\n*buzz" +output: [Text(text="foo *bar\n *baz\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="buzz")] + +--- + +name: ul_interruption +label: high-depth ul with something blocking it +input: "**f*oobar" +output: [TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="f*oobar")] + +--- + +name: complex_ol +label: ol with a lot in it +input: "# this is a test of an [[Ordered list|ol]] with {{plenty|of|stuff}}" +output: [TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text=" this is a"), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), Text(text="test of an "), WikilinkOpen(), Text(text="Ordered list"), WikilinkSeparator(), Text(text="ol"), WikilinkClose(), Text(text=" with "), TemplateOpen(), Text(text="plenty"), TemplateParamSeparator(), Text(text="of"), TemplateParamSeparator(), Text(text="stuff"), TemplateClose()] + +--- + +name: ol_multiline_template +label: ol with a template that spans moltiple lines +input: "# this has a template with a {{line|\nbreak}}\nthis is not part of the list" +output: [TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text=" this has a template with a "), TemplateOpen(), Text(text="line"), TemplateParamSeparator(), Text(text="\nbreak"), TemplateClose(), Text(text="\nthis is not part of the list")] + +--- + +name: ol_adjacent +label: moltiple adjacent ols +input: "a\n#b\n#c\nd\n#e\nf" +output: [Text(text="a\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="c\nd\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="e\nf")] + +--- + +name: ol_depths +label: moltiple adjacent ols, with differing depths +input: "#a\n##b\n###c\n########d\n##e\nf\n###g" +output: [TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="a\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), 
TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="c\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="d\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="e\nf\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="g")] + +--- + +name: ol_space_before +label: ols with space before them +input: "foo #bar\n #baz\n#buzz" +output: [Text(text="foo #bar\n #baz\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="buzz")] + +--- + +name: ol_interruption +label: high-depth ol with something blocking it +input: "##f#oobar" +output: [TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="f#oobar")] + +--- + +name: ul_ol_mix +label: a mix of adjacent uls and ols +input: "*a\n*#b\n*##c\n*##*#*#*d\n*#e\nf\n##*g" +output: [TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="a\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="c\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="d\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="e\nf\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="g")] + +--- + +name: complex_dt +label: dt with a lot in it +input: "; this is a test of an [[description term|dt]] with {{plenty|of|stuff}}" +output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text=" this is a"), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), Text(text="test of an "), WikilinkOpen(), Text(text="description term"), WikilinkSeparator(), Text(text="dt"), WikilinkClose(), Text(text=" with "), TemplateOpen(), Text(text="plenty"), TemplateParamSeparator(), 
Text(text="of"), TemplateParamSeparator(), Text(text="stuff"), TemplateClose()] + +--- + +name: dt_multiline_template +label: dt with a template that spans mdttiple lines +input: "; this has a template with a {{line|\nbreak}}\nthis is not part of the list" +output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text=" this has a template with a "), TemplateOpen(), Text(text="line"), TemplateParamSeparator(), Text(text="\nbreak"), TemplateClose(), Text(text="\nthis is not part of the list")] + +--- + +name: dt_adjacent +label: mdttiple adjacent dts +input: "a\n;b\n;c\nd\n;e\nf" +output: [Text(text="a\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="c\nd\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="e\nf")] + +--- + +name: dt_depths +label: mdttiple adjacent dts, with differing depths +input: ";a\n;;b\n;;;c\n;;;;;;;;d\n;;e\nf\n;;;g" +output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="a\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="c\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="d\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="e\nf\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="g")] + +--- + +name: dt_space_before +label: dts with space before them +input: "foo ;bar\n ;baz\n;buzz" +output: [Text(text="foo ;bar\n ;baz\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="buzz")] + +--- + +name: dt_interruption +label: high-depth dt with something blocking it +input: ";;f;oobar" +output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="f;oobar")] + +--- + +name: complex_dd +label: dd with a lot in it +input: ": this is a test of an [[description item|dd]] with {{plenty|of|stuff}}" +output: [TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text=" this is a"), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), Text(text="test of an "), WikilinkOpen(), Text(text="description item"), WikilinkSeparator(), Text(text="dd"), WikilinkClose(), Text(text=" with "), TemplateOpen(), Text(text="plenty"), TemplateParamSeparator(), Text(text="of"), TemplateParamSeparator(), Text(text="stuff"), TemplateClose()] + +--- + +name: 
+
+---
+
+name: dd_multiline_template
+label: dd with a template that spans multiple lines
+input: ": this has a template with a {{line|\nbreak}}\nthis is not part of the list"
+output: [TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text=" this has a template with a "), TemplateOpen(), Text(text="line"), TemplateParamSeparator(), Text(text="\nbreak"), TemplateClose(), Text(text="\nthis is not part of the list")]
+
+---
+
+name: dd_adjacent
+label: multiple adjacent dds
+input: "a\n:b\n:c\nd\n:e\nf"
+output: [Text(text="a\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="c\nd\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="e\nf")]
+
+---
+
+name: dd_depths
+label: multiple adjacent dds, with differing depths
+input: ":a\n::b\n:::c\n::::::::d\n::e\nf\n:::g"
+output: [TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="a\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="c\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="d\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="e\nf\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="g")]
+
+---
+
+name: dd_space_before
+label: dds with space before them
+input: "foo :bar\n :baz\n:buzz"
+output: [Text(text="foo :bar\n :baz\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="buzz")]
+
+---
+
+name: dd_interruption
+label: high-depth dd with something blocking it
+input: "::f:oobar"
+output: [TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="f:oobar")]
+
+---
+
+name: dt_dd_mix
+label: a mix of adjacent dts and dds
+input: ";a\n;:b\n;::c\n;::;:;:;d\n;:e\nf\n::;g"
+output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="a\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="b\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="c\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"),
TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="d\n"), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="e\nf\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="g")] + +--- + +name: dt_dd_mix2 +label: the correct usage of a dt/dd unit, as in a dl +input: ";foo:bar:baz" +output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="foo"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="bar:baz")] + +--- + +name: dt_dd_mix3 +label: another example of correct (but strange) dt/dd usage +input: ":;;::foo:bar:baz" +output: [TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="foo"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="bar:baz")] + +--- + +name: ul_ol_dt_dd_mix +label: an assortment of uls, ols, dds, and dts +input: ";:#*foo\n:#*;foo\n#*;:foo\n*;:#foo" +output: [TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), Text(text="foo\n"), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), Text(text="foo\n"), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), Text(text="foo\n"), TagOpenOpen(wiki_markup="*"), Text(text="li"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=";"), Text(text="dt"), TagCloseSelfclose(), TagOpenOpen(wiki_markup=":"), Text(text="dd"), TagCloseSelfclose(), TagOpenOpen(wiki_markup="#"), Text(text="li"), TagCloseSelfclose(), Text(text="foo")] + +--- + +name: hr_text_before +label: text before an otherwise-valid hr +input: "foo----" +output: [Text(text="foo----")] + +--- + +name: hr_text_after +label: text after a valid hr +input: "----bar" +output: [TagOpenOpen(wiki_markup="----"), Text(text="hr"), TagCloseSelfclose(), Text(text="bar")] + +--- + +name: hr_text_before_after +label: text at both ends of an otherwise-valid hr +input: "foo----bar" +output: [Text(text="foo----bar")] + +--- + +name: hr_newlines +label: newlines surrounding a valid hr +input: "foo\n----\nbar" +output: [Text(text="foo\n"), 
TagOpenOpen(wiki_markup="----"), Text(text="hr"), TagCloseSelfclose(), Text(text="\nbar")] + +--- + +name: hr_adjacent +label: two adjacent hrs +input: "----\n----" +output: [TagOpenOpen(wiki_markup="----"), Text(text="hr"), TagCloseSelfclose(), Text(text="\n"), TagOpenOpen(wiki_markup="----"), Text(text="hr"), TagCloseSelfclose()] + +--- + +name: hr_adjacent_space +label: two adjacent hrs, with a space before the second one, making it invalid +input: "----\n ----" +output: [TagOpenOpen(wiki_markup="----"), Text(text="hr"), TagCloseSelfclose(), Text(text="\n ----")] + +--- + +name: hr_short +label: an invalid three-hyphen-long hr +input: "---" +output: [Text(text="---")] + +--- + +name: hr_long +label: a very long, valid hr +input: "------------------------------------------" +output: [TagOpenOpen(wiki_markup="------------------------------------------"), Text(text="hr"), TagCloseSelfclose()] + +--- + +name: hr_interruption_short +label: a hr that is interrupted, making it invalid +input: "---x-" +output: [Text(text="---x-")] + +--- + +name: hr_interruption_long +label: a hr that is interrupted, but the first part remains valid because it is long enough +input: "----x--" +output: [TagOpenOpen(wiki_markup="----"), Text(text="hr"), TagCloseSelfclose(), Text(text="x--")] + +--- + +name: nowiki_cancel +label: a nowiki tag before a list causes it to not be parsed +input: "* Unordered list" +output: [TagOpenOpen(), Text(text="nowiki"), TagCloseSelfclose(padding=" "), Text(text="* Unordered list")] diff --git a/tests/tokenizer/text.mwtest b/tests/tokenizer/text.mwtest index 77d5f50..040c677 100644 --- a/tests/tokenizer/text.mwtest +++ b/tests/tokenizer/text.mwtest @@ -23,3 +23,10 @@ name: unicode2 label: additional unicode check for non-BMP codepoints input: "𐌲𐌿𐍄𐌰𐍂𐌰𐌶𐌳𐌰" output: [Text(text="𐌲𐌿𐍄𐌰𐍂𐌰𐌶𐌳𐌰")] + +--- + +name: large +label: a lot of text, requiring multiple textbuffer blocks in the C tokenizer +input: 
"ZWfsZYcZyhGbkDYJiguJuuhsNyHGFkFhnjkbLJyXIygTHqcXdhsDkEOTSIKYlBiohLIkiXxvyebUyCGvvBcYqFdtcftGmaAanKXEIyYSEKlTfEEbdGhdePVwVImOyKiHSzAEuGyEVRIKPZaNjQsYqpqARIQfvAklFtQyTJVGlLwjJIxYkiqmHBmdOvTyNqJRbMvouoqXRyOhYDwowtkcZGSOcyzVxibQdnzhDYbrgbatUrlOMRvFSzmLWHRihtXnddwYadPgFWUOxAzAgddJVDXHerawdkrRuWaEXfuwQSkQUmLEJUmrgXDVlXCpciaisfuOUjBldElygamkkXbewzLucKRnAEBimIIotXeslRRhnqQjrypnLQvvdCsKFWPVTZaHvzJMFEahDHWcCbyXgxFvknWjhVfiLSDuFhGoFxqSvhjnnRZLmCMhmWeOgSoanDEInKTWHnbpKyUlabLppITDFFxyWKAnUYJQIcmYnrvMmzmtYvsbCYbebgAhMFVVFAKUSvlkLFYluDpbpBaNFWyfXTaOdSBrfiHDTWGBTUCXMqVvRCIMrEjWpQaGsABkioGnveQWqBTDdRQlxQiUipwfyqAocMddXqdvTHhEwjEzMkOSWVPjJvDtClhYwpvRztPmRKCSpGIpXQqrYtTLmShFdpKtOxGtGOZYIdyUGPjdmyvhJTQMtgYJWUUZnecRjBfQXsyWQWikyONySLzLEqRFqcJYdRNFcGwWZtfZasfFWcvdsHRXoqKlKYihRAOJdrPBDdxksXFwKceQVncmFXfUfBsNgjKzoObVExSnRnjegeEhqxXzPmFcuiasViAFeaXrAxXhSfSyCILkKYpjxNeKynUmdcGAbwRwRnlAFbOSCafmzXddiNpLCFTHBELvArdXFpKUGpSHRekhrMedMRNkQzmSyFKjVwiWwCvbNWjgxJRzYeRxHiCCRMXktmKBxbxGZvOpvZIJOwvGIxcBLzsMFlDqAMLtScdsJtrbIUAvKfcdChXGnBzIxGxXMgxJhayrziaCswdpjJJJhkaYnGhHXqZwOzHFdhhUIEtfjERdLaSPRTDDMHpQtonNaIgXUYhjdbnnKppfMBxgNSOOXJAPtFjfAKnrRDrumZBpNhxMstqjTGBViRkDqbTdXYUirsedifGYzZpQkvdNhtFTOPgsYXYCwZHLcSLSfwfpQKtWfZuRUUryHJsbVsAOQcIJdSKKlOvCeEjUQNRPHKXuBJUjPuaAJJxcDMqyaufqfVwUmHLdjeYZzSiiGLHOTCInpVAalbXXTMLugLiwFiyPSuSFiyJUKVrWjbZAHaJtZnQmnvorRrxdPKThqXzNgTjszQiCoMczRnwGYJMERUWGXFyrSbAqsHmLwLlnJOJoXNsjVehQjVOpQOQJAZWwFZBlgyVIplzLTlFwumPgBLYrUIAJAcmvHPGfHfWQguCjfTYzxYfbohaLFAPwxFRrNuCdCzLlEbuhyYjCmuDBTJDMCdLpNRVqEALjnPSaBPsKWRCKNGwEMFpiEWbYZRwaMopjoUuBUvMpvyLfsPKDrfQLiFOQIWPtLIMoijUEUYfhykHrSKbTtrvjwIzHdWZDVwLIpNkloCqpzIsErxxKAFuFEjikWNYChqYqVslXMtoSWzNhbMuxYbzLfJIcPGoUeGPkGyPQNhDyrjgdKekzftFrRPTuyLYqCArkDcWHTrjPQHfoThBNnTQyMwLEWxEnBXLtzJmFVLGEPrdbEwlXpgYfnVnWoNXgPQKKyiXifpvrmJATzQOzYwFhliiYxlbnsEPKbHYUfJLrwYPfSUwTIHiEvBFMrEtVmqJobfcwsiiEudTIiAnrtuywgKLOiMYbEIOAOJdOXqroPjWnQQcTNxFvkIEIsuHLyhSqSphuSmlvknzydQEnebOreeZwOouXYKlObAkaWHhOdTFLoMCHOWrVKeXjcniaxtgCziKEqWOZUWHJQpcDJzYnnduDZrmxgjZroBRwoPBUTJMYipsgJwbTSlvMyXXdAmiEWGMiQxhGvHGPLOKeTxNaLnFVbWpiYIVyqN" +output: 
[Text(text="ZWfsZYcZyhGbkDYJiguJuuhsNyHGFkFhnjkbLJyXIygTHqcXdhsDkEOTSIKYlBiohLIkiXxvyebUyCGvvBcYqFdtcftGmaAanKXEIyYSEKlTfEEbdGhdePVwVImOyKiHSzAEuGyEVRIKPZaNjQsYqpqARIQfvAklFtQyTJVGlLwjJIxYkiqmHBmdOvTyNqJRbMvouoqXRyOhYDwowtkcZGSOcyzVxibQdnzhDYbrgbatUrlOMRvFSzmLWHRihtXnddwYadPgFWUOxAzAgddJVDXHerawdkrRuWaEXfuwQSkQUmLEJUmrgXDVlXCpciaisfuOUjBldElygamkkXbewzLucKRnAEBimIIotXeslRRhnqQjrypnLQvvdCsKFWPVTZaHvzJMFEahDHWcCbyXgxFvknWjhVfiLSDuFhGoFxqSvhjnnRZLmCMhmWeOgSoanDEInKTWHnbpKyUlabLppITDFFxyWKAnUYJQIcmYnrvMmzmtYvsbCYbebgAhMFVVFAKUSvlkLFYluDpbpBaNFWyfXTaOdSBrfiHDTWGBTUCXMqVvRCIMrEjWpQaGsABkioGnveQWqBTDdRQlxQiUipwfyqAocMddXqdvTHhEwjEzMkOSWVPjJvDtClhYwpvRztPmRKCSpGIpXQqrYtTLmShFdpKtOxGtGOZYIdyUGPjdmyvhJTQMtgYJWUUZnecRjBfQXsyWQWikyONySLzLEqRFqcJYdRNFcGwWZtfZasfFWcvdsHRXoqKlKYihRAOJdrPBDdxksXFwKceQVncmFXfUfBsNgjKzoObVExSnRnjegeEhqxXzPmFcuiasViAFeaXrAxXhSfSyCILkKYpjxNeKynUmdcGAbwRwRnlAFbOSCafmzXddiNpLCFTHBELvArdXFpKUGpSHRekhrMedMRNkQzmSyFKjVwiWwCvbNWjgxJRzYeRxHiCCRMXktmKBxbxGZvOpvZIJOwvGIxcBLzsMFlDqAMLtScdsJtrbIUAvKfcdChXGnBzIxGxXMgxJhayrziaCswdpjJJJhkaYnGhHXqZwOzHFdhhUIEtfjERdLaSPRTDDMHpQtonNaIgXUYhjdbnnKppfMBxgNSOOXJAPtFjfAKnrRDrumZBpNhxMstqjTGBViRkDqbTdXYUirsedifGYzZpQkvdNhtFTOPgsYXYCwZHLcSLSfwfpQKtWfZuRUUryHJsbVsAOQcIJdSKKlOvCeEjUQNRPHKXuBJUjPuaAJJxcDMqyaufqfVwUmHLdjeYZzSiiGLHOTCInpVAalbXXTMLugLiwFiyPSuSFiyJUKVrWjbZAHaJtZnQmnvorRrxdPKThqXzNgTjszQiCoMczRnwGYJMERUWGXFyrSbAqsHmLwLlnJOJoXNsjVehQjVOpQOQJAZWwFZBlgyVIplzLTlFwumPgBLYrUIAJAcmvHPGfHfWQguCjfTYzxYfbohaLFAPwxFRrNuCdCzLlEbuhyYjCmuDBTJDMCdLpNRVqEALjnPSaBPsKWRCKNGwEMFpiEWbYZRwaMopjoUuBUvMpvyLfsPKDrfQLiFOQIWPtLIMoijUEUYfhykHrSKbTtrvjwIzHdWZDVwLIpNkloCqpzIsErxxKAFuFEjikWNYChqYqVslXMtoSWzNhbMuxYbzLfJIcPGoUeGPkGyPQNhDyrjgdKekzftFrRPTuyLYqCArkDcWHTrjPQHfoThBNnTQyMwLEWxEnBXLtzJmFVLGEPrdbEwlXpgYfnVnWoNXgPQKKyiXifpvrmJATzQOzYwFhliiYxlbnsEPKbHYUfJLrwYPfSUwTIHiEvBFMrEtVmqJobfcwsiiEudTIiAnrtuywgKLOiMYbEIOAOJdOXqroPjWnQQcTNxFvkIEIsuHLyhSqSphuSmlvknzydQEnebOreeZwOouXYKlObAkaWHhOdTFLoMCHOWrVKeXjcniaxtgCziKEqWOZUWHJQpcDJzYnnduDZrmxgjZroBRwoPBUTJMYipsgJwbTSlvMyXXdAmiEWGMiQxhGvHGPLOKeTxNaLnFVbWpiYIVyqN")] diff --git a/tests/tokenizer/wikilinks.mwtest b/tests/tokenizer/wikilinks.mwtest index 0682ef1..8eb381a 100644 --- a/tests/tokenizer/wikilinks.mwtest +++ b/tests/tokenizer/wikilinks.mwtest @@ -40,17 +40,17 @@ output: [WikilinkOpen(), Text(text="foo"), WikilinkSeparator(), Text(text="bar|b --- -name: nested -label: a wikilink nested within the value of another -input: "[[foo|[[bar]]]]" -output: [WikilinkOpen(), Text(text="foo"), WikilinkSeparator(), WikilinkOpen(), Text(text="bar"), WikilinkClose(), WikilinkClose()] +name: newline_text +label: a newline in the middle of the text +input: "[[foo|foo\nbar]]" +output: [WikilinkOpen(), Text(text="foo"), WikilinkSeparator(), Text(text="foo\nbar"), WikilinkClose()] --- -name: nested_with_text -label: a wikilink nested within the value of another, separated by other data -input: "[[foo|a[[b]]c]]" -output: [WikilinkOpen(), Text(text="foo"), WikilinkSeparator(), Text(text="a"), WikilinkOpen(), Text(text="b"), WikilinkClose(), Text(text="c"), WikilinkClose()] +name: bracket_text +label: a left bracket in the middle of the text +input: "[[foo|bar[baz]]" +output: [WikilinkOpen(), Text(text="foo"), WikilinkSeparator(), Text(text="bar[baz"), WikilinkClose()] --- @@ -96,13 +96,34 @@ output: [Text(text="[[foo"), WikilinkOpen(), Text(text="bar"), WikilinkClose(), --- -name: invalid_nested_text +name: invalid_nested_padding label: invalid wikilink: trying to nest in the wrong context, with a text param input: "[[foo[[bar]]|baz]]" 
output: [Text(text="[[foo"), WikilinkOpen(), Text(text="bar"), WikilinkClose(), Text(text="|baz]]")] --- +name: invalid_nested_text +label: invalid wikilink: a wikilink nested within the value of another +input: "[[foo|[[bar]]" +output: [Text(text="[[foo|"), WikilinkOpen(), Text(text="bar"), WikilinkClose()] + +--- + +name: invalid_nested_text_2 +label: invalid wikilink: a wikilink nested within the value of another, two pairs of closing brackets +input: "[[foo|[[bar]]]]" +output: [Text(text="[[foo|"), WikilinkOpen(), Text(text="bar"), WikilinkClose(), Text(text="]]")] + +--- + +name: invalid_nested_text_padding +label: invalid wikilink: a wikilink nested within the value of another, separated by other data +input: "[[foo|a[[b]]c]]" +output: [Text(text="[[foo|a"), WikilinkOpen(), Text(text="b"), WikilinkClose(), Text(text="c]]")] + +--- + name: incomplete_open_only label: incomplete wikilinks: just an open input: "[["