diff --git a/.coveragerc b/.coveragerc new file mode 100644 index 0000000..48a64ce --- /dev/null +++ b/.coveragerc @@ -0,0 +1,9 @@ +[report] +exclude_lines = + pragma: no cover + raise NotImplementedError() +partial_branches = + pragma: no branch + if py3k: + if not py3k: + if py26: diff --git a/.gitignore b/.gitignore index 4068716..3da2db3 100644 --- a/.gitignore +++ b/.gitignore @@ -1,10 +1,14 @@ *.pyc +*.pyd *.so *.dll *.egg *.egg-info +.coverage .DS_Store __pycache__ build dist docs/_build +scripts/*.log +htmlcov/ diff --git a/.travis.yml b/.travis.yml index 31090f2..07dab97 100644 --- a/.travis.yml +++ b/.travis.yml @@ -1,8 +1,18 @@ language: python python: - - "2.6" - - "2.7" - - "3.2" - - "3.3" -install: python setup.py build -script: python setup.py test -q + - 2.6 + - 2.7 + - 3.2 + - 3.3 + - 3.4 +install: + - pip install coveralls + - python setup.py build +script: + - coverage run --source=mwparserfromhell setup.py -q test +after_success: + - coveralls +env: + matrix: + - WITHOUT_EXTENSION=0 + - WITHOUT_EXTENSION=1 diff --git a/CHANGELOG b/CHANGELOG index 9faf6b7..0ab103a 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -1,3 +1,41 @@ +v0.4 (unreleased): + +- The parser now falls back on pure Python mode if C extensions cannot be + built. This fixes an issue that prevented some Windows users from installing + the parser. +- Added support for parsing wikicode tables (patches by David Winegar). +- Added a script to test for memory leaks in scripts/memtest.py. +- Added a script to do releases in scripts/release.sh. +- skip_style_tags can now be passed to mwparserfromhell.parse() (previously, + only Parser().parse() allowed it). +- The 'recursive' argument to Wikicode's filter methods now accepts a third + option, RECURSE_OTHERS, which recurses over all children except instances of + 'forcetype' (for example, `code.filter_templates(code.RECURSE_OTHERS)` + returns all un-nested templates). +- The parser now understands HTML tag attributes quoted with single quotes. 
+ When setting a tag attribute's value, quotes will be added if necessary. As + part of this, Attribute's 'quoted' attribute has been changed to 'quotes', + and is now either a string or None. +- Calling Template.remove() with a Parameter object that is not part of the + template now raises ValueError instead of doing nothing. +- Parameters with non-integer keys can no longer be created with + 'showkey=False', nor have the value of this attribute be set to False later. +- _ListProxy.destroy() has been changed to _ListProxy.detach(), and now works + in a more useful way. +- If something goes wrong while parsing, ParserError will now be raised. + Previously, the parser would produce an unclear BadRoute exception or allow + an incorrect node tree to be built. +- Fixed parser bugs involving: + - nested tags; + - comments in template names; + - tags inside of <nowiki> tags. +- Added tests to ensure that parsed trees convert back to wikicode without + unintentional modifications. +- Added support for a NOWEB environment variable, which disables a unit test + that makes a web call. +- Test coverage has been improved, and some minor related bugs have been fixed. +- Updated and fixed some documentation. + v0.3.3 (released April 22, 2014): - Added support for Python 2.6 and 3.4. diff --git a/LICENSE b/LICENSE index 327905b..92f5e42 100644 --- a/LICENSE +++ b/LICENSE @@ -1,4 +1,4 @@ -Copyright (C) 2012-2014 Ben Kurtovic +Copyright (C) 2012-2015 Ben Kurtovic Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/MANIFEST.in b/MANIFEST.in new file mode 100644 index 0000000..27e8a54 --- /dev/null +++ b/MANIFEST.in @@ -0,0 +1,2 @@ +include LICENSE CHANGELOG +recursive-include tests *.py *.mwtest diff --git a/README.rst b/README.rst index 5b4cfe1..45c7286 100644 --- a/README.rst +++ b/README.rst @@ -1,10 +1,14 @@ mwparserfromhell ================ -.. 
image:: https://travis-ci.org/earwig/mwparserfromhell.png?branch=develop +.. image:: https://img.shields.io/travis/earwig/mwparserfromhell/develop.svg :alt: Build Status :target: http://travis-ci.org/earwig/mwparserfromhell +.. image:: https://img.shields.io/coveralls/earwig/mwparserfromhell/develop.svg + :alt: Coverage Status + :target: https://coveralls.io/r/earwig/mwparserfromhell + **mwparserfromhell** (the *MediaWiki Parser from Hell*) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki_ wikicode. It supports Python 2 and Python 3. @@ -15,19 +19,17 @@ Full documentation is available on ReadTheDocs_. Development occurs on GitHub_. Installation ------------ -The easiest way to install the parser is through the `Python Package Index`_, -so you can install the latest release with ``pip install mwparserfromhell`` -(`get pip`_). Alternatively, get the latest development version:: +The easiest way to install the parser is through the `Python Package Index`_; +you can install the latest release with ``pip install mwparserfromhell`` +(`get pip`_). On Windows, make sure you have the latest version of pip +installed by running ``pip install --upgrade pip``. + +Alternatively, get the latest development version:: git clone https://github.com/earwig/mwparserfromhell.git cd mwparserfromhell python setup.py install -If you get ``error: Unable to find vcvarsall.bat`` while installing, this is -because Windows can't find the compiler for C extensions. Consult this -`StackOverflow question`_ for help. You can also set ``ext_modules`` in -``setup.py`` to an empty list to prevent the extension from building. - You can run the comprehensive unit testing suite with ``python setup.py test -q``. 
@@ -40,24 +42,24 @@ Normal usage is rather straightforward (where ``text`` is page text):: >>> wikicode = mwparserfromhell.parse(text) ``wikicode`` is a ``mwparserfromhell.Wikicode`` object, which acts like an -ordinary ``unicode`` object (or ``str`` in Python 3) with some extra methods. +ordinary ``str`` object (or ``unicode`` in Python 2) with some extra methods. For example:: >>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?" >>> wikicode = mwparserfromhell.parse(text) - >>> print wikicode + >>> print(wikicode) I has a template! {{foo|bar|baz|eggs=spam}} See it? >>> templates = wikicode.filter_templates() - >>> print templates + >>> print(templates) ['{{foo|bar|baz|eggs=spam}}'] >>> template = templates[0] - >>> print template.name + >>> print(template.name) foo - >>> print template.params + >>> print(template.params) ['bar', 'baz', 'eggs=spam'] - >>> print template.get(1).value + >>> print(template.get(1).value) bar - >>> print template.get("eggs").value + >>> print(template.get("eggs").value) spam Since nodes can contain other nodes, getting nested templates is trivial:: @@ -71,14 +73,14 @@ templates manually. This is possible because nodes can contain additional ``Wikicode`` objects:: >>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}") - >>> print code.filter_templates(recursive=False) + >>> print(code.filter_templates(recursive=False)) ['{{foo|this {{includes a|template}}}}'] >>> foo = code.filter_templates(recursive=False)[0] - >>> print foo.get(1).value + >>> print(foo.get(1).value) this {{includes a|template}} - >>> print foo.get(1).value.filter_templates()[0] + >>> print(foo.get(1).value.filter_templates()[0]) {{includes a|template}} - >>> print foo.get(1).value.filter_templates()[0].get(1).value + >>> print(foo.get(1).value.filter_templates()[0].get(1).value) template Templates can be easily modified to add, remove, or alter params. ``Wikicode`` @@ -93,24 +95,24 @@ whitespace:: ... 
if template.name.matches("Cleanup") and not template.has("date"): ... template.add("date", "July 2012") ... - >>> print code + >>> print(code) {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{uncategorized}} >>> code.replace("{{uncategorized}}", "{{bar-stub}}") - >>> print code + >>> print(code) {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}} - >>> print code.filter_templates() + >>> print(code.filter_templates()) ['{{cleanup|date=July 2012}}', '{{bar-stub}}'] -You can then convert ``code`` back into a regular ``unicode`` object (for -saving the page!) by calling ``unicode()`` on it:: +You can then convert ``code`` back into a regular ``str`` object (for +saving the page!) by calling ``str()`` on it:: - >>> text = unicode(code) - >>> print text + >>> text = str(code) + >>> print(text) {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}} >>> text == code True -Likewise, use ``str(code)`` in Python 3. +Likewise, use ``unicode(code)`` in Python 2. Integration ----------- @@ -119,27 +121,30 @@ Integration ``Page`` objects have a ``parse`` method that essentially calls ``mwparserfromhell.parse()`` on ``page.get()``. 
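A note on the API-based snippet in this README: in Python 3, ``dict.values()`` returns a view object that cannot be indexed directly, so the single page record in ``res["query"]["pages"]`` should be pulled out with ``list(...)`` (or ``next(iter(...))``). A self-contained sketch using an invented response of the same shape (the page ID and wikitext are made up for illustration):

```python
import json

# Hypothetical API response, shaped like MediaWiki's query output.
raw = json.dumps({
    "query": {
        "pages": {
            "12345": {"revisions": [{"*": "{{foo|bar|baz|eggs=spam}}"}]}
        }
    }
})

res = json.loads(raw)
# res["query"]["pages"].values()[0] works only in Python 2;
# in Python 3, materialize the view before indexing:
text = list(res["query"]["pages"].values())[0]["revisions"][0]["*"]
print(text)  # → {{foo|bar|baz|eggs=spam}}
```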
-If you're using Pywikipedia_, your code might look like this:: +If you're using Pywikibot_, your code might look like this:: import mwparserfromhell - import wikipedia as pywikibot + import pywikibot + def parse(title): - site = pywikibot.getSite() + site = pywikibot.Site() page = pywikibot.Page(site, title) text = page.get() return mwparserfromhell.parse(text) -If you're not using a library, you can parse templates in any page using the -following code (via the API_):: +If you're not using a library, you can parse any page using the following code +(via the API_):: import json - import urllib + from urllib.parse import urlencode + from urllib.request import urlopen import mwparserfromhell API_URL = "http://en.wikipedia.org/w/api.php" + def parse(title): data = {"action": "query", "prop": "revisions", "rvlimit": 1, "rvprop": "content", "format": "json", "titles": title} - raw = urllib.urlopen(API_URL, urllib.urlencode(data)).read() + raw = urlopen(API_URL, urlencode(data).encode()).read() res = json.loads(raw) text = list(res["query"]["pages"].values())[0]["revisions"][0]["*"] return mwparserfromhell.parse(text) @@ -154,5 +159,5 @@ following code (via the API_):: .. _StackOverflow question: http://stackoverflow.com/questions/2817869/error-unable-to-find-vcvarsall-bat .. _get pip: http://pypi.python.org/pypi/pip .. _EarwigBot: https://github.com/earwig/earwigbot -.. _Pywikipedia: https://www.mediawiki.org/wiki/Manual:Pywikipediabot +.. _Pywikibot: https://www.mediawiki.org/wiki/Manual:Pywikibot .. _API: http://mediawiki.org/wiki/API diff --git a/docs/api/mwparserfromhell.nodes.rst b/docs/api/mwparserfromhell.nodes.rst index 7043070..2cbaa1c 100644 --- a/docs/api/mwparserfromhell.nodes.rst +++ b/docs/api/mwparserfromhell.nodes.rst @@ -87,4 +87,3 @@ Subpackages .. 
toctree:: mwparserfromhell.nodes.extras - diff --git a/docs/api/mwparserfromhell.rst b/docs/api/mwparserfromhell.rst index 0da522e..63af111 100644 --- a/docs/api/mwparserfromhell.rst +++ b/docs/api/mwparserfromhell.rst @@ -15,6 +15,12 @@ mwparserfromhell Package :members: :undoc-members: +:mod:`definitions` Module +------------------------- + +.. automodule:: mwparserfromhell.definitions + :members: + :mod:`smart_list` Module ------------------------ @@ -30,12 +36,6 @@ mwparserfromhell Package :members: :undoc-members: -:mod:`definitions` Module -------------------------- - -.. automodule:: mwparserfromhell.definitions - :members: - :mod:`utils` Module ------------------- diff --git a/docs/changelog.rst b/docs/changelog.rst index 9efc022..9811b5c 100644 --- a/docs/changelog.rst +++ b/docs/changelog.rst @@ -1,6 +1,54 @@ Changelog ========= +v0.4 +---- + +Unreleased +(`changes `__): + +- The parser now falls back on pure Python mode if C extensions cannot be + built. This fixes an issue that prevented some Windows users from installing + the parser. +- Added support for parsing wikicode tables (patches by David Winegar). +- Added a script to test for memory leaks in :file:`scripts/memtest.py`. +- Added a script to do releases in :file:`scripts/release.sh`. +- *skip_style_tags* can now be passed to :func:`mwparserfromhell.parse() + <.parse_anything>` (previously, only :meth:`.Parser.parse` allowed it). +- The *recursive* argument to :class:`Wikicode's <.Wikicode>` :meth:`.filter` + methods now accepts a third option, ``RECURSE_OTHERS``, which recurses over + all children except instances of *forcetype* (for example, + ``code.filter_templates(code.RECURSE_OTHERS)`` returns all un-nested + templates). +- The parser now understands HTML tag attributes quoted with single quotes. + When setting a tag attribute's value, quotes will be added if necessary. 
As + part of this, :class:`.Attribute`\ 's :attr:`~.Attribute.quoted` attribute + has been changed to :attr:`~.Attribute.quotes`, and is now either a string or + ``None``. +- Calling :meth:`.Template.remove` with a :class:`.Parameter` object that is + not part of the template now raises :exc:`ValueError` instead of doing + nothing. +- :class:`.Parameter`\ s with non-integer keys can no longer be created with + *showkey=False*, nor have the value of this attribute be set to *False* + later. +- :meth:`._ListProxy.destroy` has been changed to :meth:`._ListProxy.detach`, + and now works in a more useful way. +- If something goes wrong while parsing, :exc:`.ParserError` will now be + raised. Previously, the parser would produce an unclear :exc:`.BadRoute` + exception or allow an incorrect node tree to be built. +- Fixed parser bugs involving: + + - nested tags; + - comments in template names; + - tags inside of ``<nowiki>`` tags. + +- Added tests to ensure that parsed trees convert back to wikicode without + unintentional modifications. +- Added support for a :envvar:`NOWEB` environment variable, which disables a + unit test that makes a web call. +- Test coverage has been improved, and some minor related bugs have been fixed. +- Updated and fixed some documentation. + v0.3.3 ------ @@ -8,22 +56,21 @@ v0.3.3 (`changes `__): - Added support for Python 2.6 and 3.4. -- :py:meth:`.Template.has` is now passed *ignore_empty=False* by default +- :meth:`.Template.has` is now passed *ignore_empty=False* by default instead of *True*. This fixes a bug when adding parameters to templates with empty fields, **and is a breaking change if you rely on the default behavior.** -- The *matches* argument of :py:class:`Wikicode's <.Wikicode>` - :py:meth:`.filter` methods now accepts a function (taking one argument, a - :py:class:`.Node`, and returning a bool) in addition to a regex. 
-- Re-added *flat* argument to :py:meth:`.Wikicode.get_sections`, fixed the - order in which it returns sections, and made it faster. -- :py:meth:`.Wikicode.matches` now accepts a tuple or list of - strings/:py:class:`.Wikicode` objects instead of just a single string or - :py:class:`.Wikicode`. +- The *matches* argument of :class:`Wikicode's <.Wikicode>` :meth:`.filter` + methods now accepts a function (taking one argument, a :class:`.Node`, and + returning a bool) in addition to a regex. +- Re-added *flat* argument to :meth:`.Wikicode.get_sections`, fixed the order + in which it returns sections, and made it faster. +- :meth:`.Wikicode.matches` now accepts a tuple or list of + strings/:class:`.Wikicode` objects instead of just a single string or + :class:`.Wikicode`. - Given the frequency of issues with the (admittedly insufficient) tag parser, - there's a temporary *skip_style_tags* argument to - :py:meth:`~.Parser.parse` that ignores ``''`` and ``'''`` until these issues - are corrected. + there's a temporary *skip_style_tags* argument to :meth:`~.Parser.parse` that + ignores ``''`` and ``'''`` until these issues are corrected. - Fixed a parser bug involving nested wikilinks and external links. - C code cleanup and speed improvements. @@ -34,9 +81,9 @@ v0.3.2 (`changes `__): - Added support for Python 3.2 (along with current support for 3.3 and 2.7). -- Renamed :py:meth:`.Template.remove`\ 's first argument from *name* to - *param*, which now accepts :py:class:`.Parameter` objects in addition to - parameter name strings. +- Renamed :meth:`.Template.remove`\ 's first argument from *name* to *param*, + which now accepts :class:`.Parameter` objects in addition to parameter name + strings. 
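The *matches* change noted in the v0.3.3 entries above (filter methods accepting either a regex or a one-argument predicate) can be illustrated without the library itself. This is a plain-Python sketch of the idea, not the library's actual code; the ``node_matches`` helper and the sample node strings are invented:

```python
import re

nodes = ["{{cleanup}}", "{{Infobox person}}", "[[bar]]"]

def node_matches(node, test):
    # A callable is applied directly; anything else is treated as a regex,
    # mirroring the dual behavior described in the changelog entry.
    if callable(test):
        return bool(test(node))
    return re.search(test, node) is not None

templates = [n for n in nodes if node_matches(n, r"^\{\{")]
people = [n for n in nodes if node_matches(n, lambda n: "person" in n)]
print(templates)  # → ['{{cleanup}}', '{{Infobox person}}']
print(people)     # → ['{{Infobox person}}']
```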
v0.3.1 ------ @@ -53,24 +100,24 @@ v0.3 `Released August 24, 2013 `_ (`changes `__): -- Added complete support for HTML :py:class:`Tags <.Tag>`, including forms like +- Added complete support for HTML :class:`Tags <.Tag>`, including forms like ``foo``, ````, and wiki-markup tags like bold (``'''``), italics (``''``), and lists (``*``, ``#``, ``;`` and ``:``). -- Added support for :py:class:`.ExternalLink`\ s (``http://example.com/`` and +- Added support for :class:`.ExternalLink`\ s (``http://example.com/`` and ``[http://example.com/ Example]``). -- :py:class:`Wikicode's <.Wikicode>` :py:meth:`.filter` methods are now passed +- :class:`Wikicode's <.Wikicode>` :meth:`.filter` methods are now passed *recursive=True* by default instead of *False*. **This is a breaking change if you rely on any filter() methods being non-recursive by default.** -- Added a :py:meth:`.matches` method to :py:class:`~.Wikicode` for - page/template name comparisons. -- The *obj* param of :py:meth:`Wikicode.insert_before() <.insert_before>`, - :py:meth:`~.insert_after`, :py:meth:`~.Wikicode.replace`, and - :py:meth:`~.Wikicode.remove` now accepts :py:class:`~.Wikicode` objects and - strings representing parts of wikitext, instead of just nodes. These methods - also make all possible substitutions instead of just one. -- Renamed :py:meth:`Template.has_param() <.has_param>` to - :py:meth:`~.Template.has` for consistency with :py:class:`~.Template`\ 's - other methods; :py:meth:`~.has_param` is now an alias. +- Added a :meth:`.matches` method to :class:`.Wikicode` for page/template name + comparisons. +- The *obj* param of :meth:`.Wikicode.insert_before`, :meth:`.insert_after`, + :meth:`~.Wikicode.replace`, and :meth:`~.Wikicode.remove` now accepts + :class:`.Wikicode` objects and strings representing parts of wikitext, + instead of just nodes. These methods also make all possible substitutions + instead of just one. 
+- Renamed :meth:`.Template.has_param` to :meth:`~.Template.has` for consistency + with :class:`.Template`\ 's other methods; :meth:`.has_param` is now an + alias. - The C tokenizer extension now works on Python 3 in addition to Python 2.7. - Various bugfixes, internal changes, and cleanup. @@ -83,29 +130,27 @@ v0.2 - The parser now fully supports Python 3 in addition to Python 2.7. - Added a C tokenizer extension that is significantly faster than its Python equivalent. It is enabled by default (if available) and can be toggled by - setting :py:attr:`mwparserfromhell.parser.use_c` to a boolean value. + setting :attr:`mwparserfromhell.parser.use_c` to a boolean value. - Added a complete set of unit tests covering parsing and wikicode manipulation. -- Renamed :py:meth:`.filter_links` to :py:meth:`.filter_wikilinks` (applies to - :py:meth:`.ifilter` as well). -- Added filter methods for :py:class:`Arguments <.Argument>`, - :py:class:`Comments <.Comment>`, :py:class:`Headings <.Heading>`, and - :py:class:`HTMLEntities <.HTMLEntity>`. -- Added *before* param to :py:meth:`Template.add() <.Template.add>`; renamed - *force_nonconformity* to *preserve_spacing*. -- Added *include_lead* param to :py:meth:`Wikicode.get_sections() - <.get_sections>`. -- Removed *flat* param from :py:meth:`.get_sections`. -- Removed *force_no_field* param from :py:meth:`Template.remove() - <.Template.remove>`. +- Renamed :meth:`.filter_links` to :meth:`.filter_wikilinks` (applies to + :meth:`.ifilter` as well). +- Added filter methods for :class:`Arguments <.Argument>`, + :class:`Comments <.Comment>`, :class:`Headings <.Heading>`, and + :class:`HTMLEntities <.HTMLEntity>`. +- Added *before* param to :meth:`.Template.add`; renamed *force_nonconformity* + to *preserve_spacing*. +- Added *include_lead* param to :meth:`.Wikicode.get_sections`. +- Removed *flat* param from :meth:`.get_sections`. +- Removed *force_no_field* param from :meth:`.Template.remove`. - Added support for Travis CI. 
- Added note about Windows build issue in the README. - The tokenizer will limit itself to a realistic recursion depth to prevent errors and unreasonably long parse times. - Fixed how some nodes' attribute setters handle input. - Fixed multiple bugs in the tokenizer's handling of invalid markup. -- Fixed bugs in the implementation of :py:class:`.SmartList` and - :py:class:`.StringMixIn`. +- Fixed bugs in the implementation of :class:`.SmartList` and + :class:`.StringMixIn`. - Fixed some broken example code in the README; other copyedits. - Other bugfixes and code cleanup. @@ -115,12 +160,12 @@ v0.1.1 `Released September 21, 2012 `_ (`changes `__): -- Added support for :py:class:`Comments <.Comment>` (````) and - :py:class:`Wikilinks <.Wikilink>` (``[[foo]]``). -- Added corresponding :py:meth:`.ifilter_links` and :py:meth:`.filter_links` - methods to :py:class:`.Wikicode`. +- Added support for :class:`Comments <.Comment>` (````) and + :class:`Wikilinks <.Wikilink>` (``[[foo]]``). +- Added corresponding :meth:`.ifilter_links` and :meth:`.filter_links` methods + to :class:`.Wikicode`. - Fixed a bug when parsing incomplete templates. -- Fixed :py:meth:`.strip_code` to affect the contents of headings. +- Fixed :meth:`.strip_code` to affect the contents of headings. - Various copyedits in documentation and comments. v0.1 diff --git a/docs/conf.py b/docs/conf.py index dd1d6e1..3f82ea7 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -42,7 +42,7 @@ master_doc = 'index' # General information about the project. 
project = u'mwparserfromhell' -copyright = u'2012, 2013, 2014 Ben Kurtovic' +copyright = u'2012, 2013, 2014, 2015 Ben Kurtovic' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the diff --git a/docs/index.rst b/docs/index.rst index a6d2df3..988f5e7 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,9 +1,9 @@ MWParserFromHell v\ |version| Documentation =========================================== -:py:mod:`mwparserfromhell` (the *MediaWiki Parser from Hell*) is a Python -package that provides an easy-to-use and outrageously powerful parser for -MediaWiki_ wikicode. It supports Python 2 and Python 3. +:mod:`mwparserfromhell` (the *MediaWiki Parser from Hell*) is a Python package +that provides an easy-to-use and outrageously powerful parser for MediaWiki_ +wikicode. It supports Python 2 and Python 3. Developed by Earwig_ with contributions from `Σ`_, Legoktm_, and others. Development occurs on GitHub_. diff --git a/docs/integration.rst b/docs/integration.rst index 78810b8..bbd00bb 100644 --- a/docs/integration.rst +++ b/docs/integration.rst @@ -1,35 +1,40 @@ Integration =========== -:py:mod:`mwparserfromhell` is used by and originally developed for EarwigBot_; -:py:class:`~earwigbot.wiki.page.Page` objects have a -:py:meth:`~earwigbot.wiki.page.Page.parse` method that essentially calls -:py:func:`mwparserfromhell.parse() ` on -:py:meth:`~earwigbot.wiki.page.Page.get`. +:mod:`mwparserfromhell` is used by and originally developed for EarwigBot_; +:class:`~earwigbot.wiki.page.Page` objects have a +:meth:`~earwigbot.wiki.page.Page.parse` method that essentially calls +:func:`mwparserfromhell.parse() ` on +:meth:`~earwigbot.wiki.page.Page.get`. 
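In the Python 3 version of the API snippet that follows, ``urlopen`` requires its POST data as ``bytes``, which is why the ``str`` produced by ``urlencode`` is followed by ``.encode()``. That step in isolation (the query parameters follow the MediaWiki API as in the snippet; no request is actually made here, and the title is a placeholder):

```python
from urllib.parse import urlencode

data = {"action": "query", "prop": "revisions", "rvlimit": 1,
        "rvprop": "content", "format": "json", "titles": "Test"}
# str → bytes, as urlopen's data argument requires in Python 3:
payload = urlencode(data).encode()

print(type(payload).__name__)  # → bytes
```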
-If you're using Pywikipedia_, your code might look like this:: +If you're using Pywikibot_, your code might look like this:: import mwparserfromhell - import wikipedia as pywikibot + import pywikibot + def parse(title): - site = pywikibot.getSite() + site = pywikibot.Site() page = pywikibot.Page(site, title) text = page.get() return mwparserfromhell.parse(text) -If you're not using a library, you can parse templates in any page using the -following code (via the API_):: +If you're not using a library, you can parse any page using the following code +(via the API_):: import json - import urllib + from urllib.parse import urlencode + from urllib.request import urlopen import mwparserfromhell API_URL = "http://en.wikipedia.org/w/api.php" + def parse(title): - raw = urllib.urlopen(API_URL, data).read() + data = {"action": "query", "prop": "revisions", "rvlimit": 1, + "rvprop": "content", "format": "json", "titles": title} + raw = urlopen(API_URL, urlencode(data).encode()).read() res = json.loads(raw) text = list(res["query"]["pages"].values())[0]["revisions"][0]["*"] return mwparserfromhell.parse(text) .. _EarwigBot: https://github.com/earwig/earwigbot -.. _Pywikipedia: https://www.mediawiki.org/wiki/Manual:Pywikipediabot +.. _Pywikibot: https://www.mediawiki.org/wiki/Manual:Pywikibot .. _API: http://mediawiki.org/wiki/API diff --git a/docs/usage.rst b/docs/usage.rst index 974c670..ee667fd 100644 --- a/docs/usage.rst +++ b/docs/usage.rst @@ -6,25 +6,25 @@ Normal usage is rather straightforward (where ``text`` is page text):: >>> import mwparserfromhell >>> wikicode = mwparserfromhell.parse(text) -``wikicode`` is a :py:class:`mwparserfromhell.Wikicode <.Wikicode>` object, -which acts like an ordinary ``unicode`` object (or ``str`` in Python 3) with -some extra methods. For example:: +``wikicode`` is a :class:`mwparserfromhell.Wikicode <.Wikicode>` object, which +acts like an ordinary ``str`` object (or ``unicode`` in Python 2) with some +extra methods. 
For example:: >>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?" >>> wikicode = mwparserfromhell.parse(text) - >>> print wikicode + >>> print(wikicode) I has a template! {{foo|bar|baz|eggs=spam}} See it? >>> templates = wikicode.filter_templates() - >>> print templates + >>> print(templates) ['{{foo|bar|baz|eggs=spam}}'] >>> template = templates[0] - >>> print template.name + >>> print(template.name) foo - >>> print template.params + >>> print(template.params) ['bar', 'baz', 'eggs=spam'] - >>> print template.get(1).value + >>> print(template.get(1).value) bar - >>> print template.get("eggs").value + >>> print(template.get("eggs").value) spam Since nodes can contain other nodes, getting nested templates is trivial:: @@ -33,27 +33,27 @@ Since nodes can contain other nodes, getting nested templates is trivial:: >>> mwparserfromhell.parse(text).filter_templates() ['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}'] -You can also pass *recursive=False* to :py:meth:`~.filter_templates` and -explore templates manually. This is possible because nodes can contain -additional :py:class:`~.Wikicode` objects:: +You can also pass *recursive=False* to :meth:`.filter_templates` and explore +templates manually. 
This is possible because nodes can contain additional +:class:`.Wikicode` objects:: >>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}") - >>> print code.filter_templates(recursive=False) + >>> print(code.filter_templates(recursive=False)) ['{{foo|this {{includes a|template}}}}'] >>> foo = code.filter_templates(recursive=False)[0] - >>> print foo.get(1).value + >>> print(foo.get(1).value) this {{includes a|template}} - >>> print foo.get(1).value.filter_templates()[0] + >>> print(foo.get(1).value.filter_templates()[0]) {{includes a|template}} - >>> print foo.get(1).value.filter_templates()[0].get(1).value + >>> print(foo.get(1).value.filter_templates()[0].get(1).value) template Templates can be easily modified to add, remove, or alter params. -:py:class:`~.Wikicode` objects can be treated like lists, with -:py:meth:`~.Wikicode.append`, :py:meth:`~.Wikicode.insert`, -:py:meth:`~.Wikicode.remove`, :py:meth:`~.Wikicode.replace`, and more. They -also have a :py:meth:`~.Wikicode.matches` method for comparing page or template -names, which takes care of capitalization and whitespace:: +:class:`.Wikicode` objects can be treated like lists, with +:meth:`~.Wikicode.append`, :meth:`~.Wikicode.insert`, +:meth:`~.Wikicode.remove`, :meth:`~.Wikicode.replace`, and more. They also have +a :meth:`~.Wikicode.matches` method for comparing page or template names, which +takes care of capitalization and whitespace:: >>> text = "{{cleanup}} '''Foo''' is a [[bar]]. {{uncategorized}}" >>> code = mwparserfromhell.parse(text) @@ -61,24 +61,24 @@ names, which takes care of capitalization and whitespace:: ... if template.name.matches("Cleanup") and not template.has("date"): ... template.add("date", "July 2012") ... - >>> print code + >>> print(code) {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{uncategorized}} >>> code.replace("{{uncategorized}}", "{{bar-stub}}") - >>> print code + >>> print(code) {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. 
{{bar-stub}} - >>> print code.filter_templates() + >>> print(code.filter_templates()) ['{{cleanup|date=July 2012}}', '{{bar-stub}}'] -You can then convert ``code`` back into a regular :py:class:`unicode` object -(for saving the page!) by calling :py:func:`unicode` on it:: +You can then convert ``code`` back into a regular :class:`str` object (for +saving the page!) by calling :func:`str` on it:: - >>> text = unicode(code) - >>> print text + >>> text = str(code) + >>> print(text) {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}} >>> text == code True -(Likewise, use :py:func:`str(code) ` in Python 3.) +(Likewise, use :func:`unicode(code) ` in Python 2.) -For more tips, check out :py:class:`Wikicode's full method list <.Wikicode>` -and the :py:mod:`list of Nodes <.nodes>`. +For more tips, check out :class:`Wikicode's full method list <.Wikicode>` and +the :mod:`list of Nodes <.nodes>`. diff --git a/mwparserfromhell/__init__.py b/mwparserfromhell/__init__.py index 469e9a6..94b6e03 100644 --- a/mwparserfromhell/__init__.py +++ b/mwparserfromhell/__init__.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -26,12 +26,10 @@ Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for `MediaWiki `_ wikicode. """ -from __future__ import unicode_literals - __author__ = "Ben Kurtovic" -__copyright__ = "Copyright (C) 2012, 2013, 2014 Ben Kurtovic" +__copyright__ = "Copyright (C) 2012, 2013, 2014, 2015 Ben Kurtovic" __license__ = "MIT License" -__version__ = "0.3.3" +__version__ = "0.4" __email__ = "ben.kurtovic@gmail.com" from . 
import (compat, definitions, nodes, parser, smart_list, string_mixin, diff --git a/mwparserfromhell/compat.py b/mwparserfromhell/compat.py index 4384ace..590a271 100644 --- a/mwparserfromhell/compat.py +++ b/mwparserfromhell/compat.py @@ -2,10 +2,10 @@ """ Implements support for both Python 2 and Python 3 by defining common types in -terms of their Python 2/3 variants. For example, :py:class:`str` is set to -:py:class:`unicode` on Python 2 but :py:class:`str` on Python 3; likewise, -:py:class:`bytes` is :py:class:`str` on 2 but :py:class:`bytes` on 3. These -types are meant to be imported directly from within the parser's modules. +terms of their Python 2/3 variants. For example, :class:`str` is set to +:class:`unicode` on Python 2 but :class:`str` on Python 3; likewise, +:class:`bytes` is :class:`str` on 2 but :class:`bytes` on 3. These types are +meant to be imported directly from within the parser's modules. """ import sys diff --git a/mwparserfromhell/definitions.py b/mwparserfromhell/definitions.py index 6020ad1..e0ba16b 100644 --- a/mwparserfromhell/definitions.py +++ b/mwparserfromhell/definitions.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -52,7 +52,7 @@ INVISIBLE_TAGS = [ # [mediawiki/core.git]/includes/Sanitizer.php @ 87a0aef762 SINGLE_ONLY = ["br", "hr", "meta", "link", "img"] -SINGLE = SINGLE_ONLY + ["li", "dt", "dd"] +SINGLE = SINGLE_ONLY + ["li", "dt", "dd", "th", "td", "tr"] MARKUP_TO_HTML = { "#": "li", diff --git a/mwparserfromhell/nodes/__init__.py b/mwparserfromhell/nodes/__init__.py index 223cc67..d0258ca 100644 --- a/mwparserfromhell/nodes/__init__.py +++ b/mwparserfromhell/nodes/__init__.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben 
Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -21,12 +21,12 @@ # SOFTWARE. """ -This package contains :py:class:`~.Wikicode` "nodes", which represent a single -unit of wikitext, such as a Template, an HTML tag, a Heading, or plain text. -The node "tree" is far from flat, as most types can contain additional -:py:class:`~.Wikicode` types within them - and with that, more nodes. For -example, the name of a :py:class:`~.Template` is a :py:class:`~.Wikicode` -object that can contain text or more templates. +This package contains :class:`.Wikicode` "nodes", which represent a single unit +of wikitext, such as a Template, an HTML tag, a Heading, or plain text. The +node "tree" is far from flat, as most types can contain additional +:class:`.Wikicode` types within them - and with that, more nodes. For example, +the name of a :class:`.Template` is a :class:`.Wikicode` object that can +contain text or more templates. """ from __future__ import unicode_literals @@ -40,23 +40,23 @@ __all__ = ["Node", "Text", "Argument", "Heading", "HTMLEntity", "Tag", class Node(StringMixIn): """Represents the base Node type, demonstrating the methods to override. - :py:meth:`__unicode__` must be overridden. It should return a ``unicode`` - or (``str`` in py3k) representation of the node. If the node contains - :py:class:`~.Wikicode` objects inside of it, :py:meth:`__children__` - should be a generator that iterates over them. If the node is printable - (shown when the page is rendered), :py:meth:`__strip__` should return its + :meth:`__unicode__` must be overridden. It should return a ``unicode`` or + (``str`` in py3k) representation of the node. If the node contains + :class:`.Wikicode` objects inside of it, :meth:`__children__` should be a + generator that iterates over them. 
If the node is printable + (shown when the page is rendered), :meth:`__strip__` should return its printable version, stripping out any formatting marks. It does not have to return a string, but something that can be converted to a string with - ``str()``. Finally, :py:meth:`__showtree__` can be overridden to build a + ``str()``. Finally, :meth:`__showtree__` can be overridden to build a nice tree representation of the node, if desired, for - :py:meth:`~.Wikicode.get_tree`. + :meth:`~.Wikicode.get_tree`. """ def __unicode__(self): raise NotImplementedError() def __children__(self): - return # Funny generator-that-yields-nothing syntax - yield + return + yield # pragma: no cover (this is a generator that yields nothing) def __strip__(self, normalize, collapse): return None diff --git a/mwparserfromhell/nodes/argument.py b/mwparserfromhell/nodes/argument.py index a595dfb..39c33ae 100644 --- a/mwparserfromhell/nodes/argument.py +++ b/mwparserfromhell/nodes/argument.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/mwparserfromhell/nodes/comment.py b/mwparserfromhell/nodes/comment.py index fcfd946..3e82be7 100644 --- a/mwparserfromhell/nodes/comment.py +++ b/mwparserfromhell/nodes/comment.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/mwparserfromhell/nodes/external_link.py b/mwparserfromhell/nodes/external_link.py index d13376e..a07e985 100644 --- a/mwparserfromhell/nodes/external_link.py +++ b/mwparserfromhell/nodes/external_link.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# 
Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -67,12 +67,12 @@ class ExternalLink(Node): @property def url(self): - """The URL of the link target, as a :py:class:`~.Wikicode` object.""" + """The URL of the link target, as a :class:`.Wikicode` object.""" return self._url @property def title(self): - """The link title (if given), as a :py:class:`~.Wikicode` object.""" + """The link title (if given), as a :class:`.Wikicode` object.""" return self._title @property diff --git a/mwparserfromhell/nodes/extras/__init__.py b/mwparserfromhell/nodes/extras/__init__.py index a131269..854fa45 100644 --- a/mwparserfromhell/nodes/extras/__init__.py +++ b/mwparserfromhell/nodes/extras/__init__.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -21,9 +21,8 @@ # SOFTWARE. """ -This package contains objects used by -:py:class:`~.Node`\ s, but are not nodes themselves. This includes the -parameters of Templates or the attributes of HTML tags. +This package contains objects used by :class:`.Node`\ s, but that are not nodes +themselves. This includes template parameters and HTML tag attributes. 
""" from .attribute import Attribute diff --git a/mwparserfromhell/nodes/extras/attribute.py b/mwparserfromhell/nodes/extras/attribute.py index 4b7c668..7c7dd56 100644 --- a/mwparserfromhell/nodes/extras/attribute.py +++ b/mwparserfromhell/nodes/extras/attribute.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -31,17 +31,19 @@ __all__ = ["Attribute"] class Attribute(StringMixIn): """Represents an attribute of an HTML tag. - This is used by :py:class:`~.Tag` objects. For example, the tag + This is used by :class:`.Tag` objects. For example, the tag ```` contains an Attribute whose name is ``"name"`` and whose value is ``"foo"``. """ - def __init__(self, name, value=None, quoted=True, pad_first=" ", - pad_before_eq="", pad_after_eq=""): + def __init__(self, name, value=None, quotes='"', pad_first=" ", + pad_before_eq="", pad_after_eq="", check_quotes=True): super(Attribute, self).__init__() + if check_quotes and not quotes and self._value_needs_quotes(value): + raise ValueError("given value {0!r} requires quotes".format(value)) self._name = name self._value = value - self._quoted = quoted + self._quotes = quotes self._pad_first = pad_first self._pad_before_eq = pad_before_eq self._pad_after_eq = pad_after_eq @@ -50,11 +52,18 @@ class Attribute(StringMixIn): result = self.pad_first + str(self.name) + self.pad_before_eq if self.value is not None: result += "=" + self.pad_after_eq - if self.quoted: - return result + '"' + str(self.value) + '"' + if self.quotes: + return result + self.quotes + str(self.value) + self.quotes return result + str(self.value) return result + @staticmethod + def _value_needs_quotes(val): + """Return the preferred quotes for the given value, or None.""" + if val and any(char.isspace() for char in val): + 
return ('"' in val and "'" in val) or ("'" if '"' in val else '"') + return None + def _set_padding(self, attr, value): """Setter for the value of a padding attribute.""" if not value: @@ -65,20 +74,28 @@ class Attribute(StringMixIn): raise ValueError("padding must be entirely whitespace") setattr(self, attr, value) + @staticmethod + def coerce_quotes(quotes): + """Coerce a quote type into an acceptable value, or raise an error.""" + orig, quotes = quotes, str(quotes) if quotes else None + if quotes not in [None, '"', "'"]: + raise ValueError("{0!r} is not a valid quote type".format(orig)) + return quotes + @property def name(self): - """The name of the attribute as a :py:class:`~.Wikicode` object.""" + """The name of the attribute as a :class:`.Wikicode` object.""" return self._name @property def value(self): - """The value of the attribute as a :py:class:`~.Wikicode` object.""" + """The value of the attribute as a :class:`.Wikicode` object.""" return self._value @property - def quoted(self): - """Whether the attribute's value is quoted with double quotes.""" - return self._quoted + def quotes(self): + """How to enclose the attribute value. 
``"``, ``'``, or ``None``.""" + return self._quotes @property def pad_first(self): @@ -101,11 +118,21 @@ class Attribute(StringMixIn): @value.setter def value(self, newval): - self._value = None if newval is None else parse_anything(newval) - - @quoted.setter - def quoted(self, value): - self._quoted = bool(value) + if newval is None: + self._value = None + else: + code = parse_anything(newval) + quotes = self._value_needs_quotes(code) + if quotes in ['"', "'"] or (quotes is True and not self.quotes): + self._quotes = quotes + self._value = code + + @quotes.setter + def quotes(self, value): + value = self.coerce_quotes(value) + if not value and self._value_needs_quotes(self.value): + raise ValueError("attribute value requires quotes") + self._quotes = value @pad_first.setter def pad_first(self, value): diff --git a/mwparserfromhell/nodes/extras/parameter.py b/mwparserfromhell/nodes/extras/parameter.py index e273af9..48f610c 100644 --- a/mwparserfromhell/nodes/extras/parameter.py +++ b/mwparserfromhell/nodes/extras/parameter.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -21,6 +21,7 @@ # SOFTWARE. 
from __future__ import unicode_literals +import re from ...compat import str from ...string_mixin import StringMixIn @@ -39,6 +40,8 @@ class Parameter(StringMixIn): def __init__(self, name, value, showkey=True): super(Parameter, self).__init__() + if not showkey and not self.can_hide_key(name): + raise ValueError("key {0!r} cannot be hidden".format(name)) self._name = name self._value = value self._showkey = showkey @@ -48,14 +51,19 @@ class Parameter(StringMixIn): return str(self.name) + "=" + str(self.value) return str(self.value) + @staticmethod + def can_hide_key(key): + """Return whether or not the given key can be hidden.""" + return re.match(r"[1-9][0-9]*$", str(key).strip()) + @property def name(self): - """The name of the parameter as a :py:class:`~.Wikicode` object.""" + """The name of the parameter as a :class:`.Wikicode` object.""" return self._name @property def value(self): - """The value of the parameter as a :py:class:`~.Wikicode` object.""" + """The value of the parameter as a :class:`.Wikicode` object.""" return self._value @property @@ -73,4 +81,7 @@ class Parameter(StringMixIn): @showkey.setter def showkey(self, newval): - self._showkey = bool(newval) + newval = bool(newval) + if not newval and not self.can_hide_key(self.name): + raise ValueError("parameter key cannot be hidden") + self._showkey = newval diff --git a/mwparserfromhell/nodes/heading.py b/mwparserfromhell/nodes/heading.py index 47c23a8..0db56f3 100644 --- a/mwparserfromhell/nodes/heading.py +++ b/mwparserfromhell/nodes/heading.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -52,7 +52,7 @@ class Heading(Node): @property def title(self): - """The title of the heading, as a :py:class:`~.Wikicode` object.""" + """The title of the heading, as a 
:class:`.Wikicode` object.""" return self._title @property diff --git a/mwparserfromhell/nodes/html_entity.py b/mwparserfromhell/nodes/html_entity.py index c75cb99..e7f1bbc 100644 --- a/mwparserfromhell/nodes/html_entity.py +++ b/mwparserfromhell/nodes/html_entity.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -77,17 +77,17 @@ class HTMLEntity(Node): # Test whether we're on the wide or narrow Python build. Check # the length of a non-BMP code point # (U+1F64A, SPEAK-NO-EVIL MONKEY): - if len("\U0001F64A") == 2: - # Ensure this is within the range we can encode: - if value > 0x10FFFF: - raise ValueError("unichr() arg not in range(0x110000)") - code = value - 0x10000 - if value < 0: # Invalid code point - raise - lead = 0xD800 + (code >> 10) - trail = 0xDC00 + (code % (1 << 10)) - return unichr(lead) + unichr(trail) - raise + if len("\U0001F64A") == 1: # pragma: no cover + raise + # Ensure this is within the range we can encode: + if value > 0x10FFFF: + raise ValueError("unichr() arg not in range(0x110000)") + code = value - 0x10000 + if value < 0: # Invalid code point + raise + lead = 0xD800 + (code >> 10) + trail = 0xDC00 + (code % (1 << 10)) + return unichr(lead) + unichr(trail) @property def value(self): diff --git a/mwparserfromhell/nodes/tag.py b/mwparserfromhell/nodes/tag.py index 661304e..cf3b4a5 100644 --- a/mwparserfromhell/nodes/tag.py +++ b/mwparserfromhell/nodes/tag.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -35,7 +35,8 @@ class Tag(Node): def __init__(self, tag, 
contents=None, attrs=None, wiki_markup=None, self_closing=False, invalid=False, implicit=False, padding="", - closing_tag=None): + closing_tag=None, wiki_style_separator=None, + closing_wiki_markup=None): super(Tag, self).__init__() self._tag = tag if contents is None and not self_closing: @@ -52,13 +53,28 @@ class Tag(Node): self._closing_tag = closing_tag else: self._closing_tag = tag + self._wiki_style_separator = wiki_style_separator + if closing_wiki_markup is not None: + self._closing_wiki_markup = closing_wiki_markup + elif wiki_markup and not self_closing: + self._closing_wiki_markup = wiki_markup + else: + self._closing_wiki_markup = None def __unicode__(self): if self.wiki_markup: + if self.attributes: + attrs = "".join([str(attr) for attr in self.attributes]) + else: + attrs = "" + padding = self.padding or "" + separator = self.wiki_style_separator or "" + close = self.closing_wiki_markup or "" if self.self_closing: - return self.wiki_markup + return self.wiki_markup + attrs + padding + separator else: - return self.wiki_markup + str(self.contents) + self.wiki_markup + return self.wiki_markup + attrs + padding + separator + \ + str(self.contents) + close result = ("`` and ````. See - :py:func:`.definitions.is_single_only`. + :func:`.definitions.is_single_only`. """ return self._invalid @@ -155,8 +171,8 @@ class Tag(Node): """Whether the tag is implicitly self-closing, with no ending slash. This is only possible for specific "single" tags like ``
<br>`` and - ``
<li>``. See :py:func:`.definitions.is_single`. This field only has an - effect if :py:attr:`self_closing` is also ``True``. + ``
<li>``. See :func:`.definitions.is_single`. This field only has an + effect if :attr:`self_closing` is also ``True``. + """ return self._implicit @@ -167,13 +183,34 @@ class Tag(Node): @property def closing_tag(self): - """The closing tag, as a :py:class:`~.Wikicode` object. - This will usually equal :py:attr:`tag`, unless there is additional + """The closing tag, as a :class:`.Wikicode` object. + This will usually equal :attr:`tag`, unless there is additional spacing, comments, or the like. """ return self._closing_tag + @property + def wiki_style_separator(self): + """The separator between the padding and content in a wiki markup tag. + + Essentially the wiki equivalent of the TagCloseOpen. + """ + return self._wiki_style_separator + + @property + def closing_wiki_markup(self): + """The wikified version of the closing tag to show instead of HTML. + + If set to a value, this will be displayed instead of the close tag + brackets. If :attr:`self_closing` is ``True``, then this is not + displayed. If :attr:`wiki_markup` is set and this has not been set, this + is set to the value of :attr:`wiki_markup`. If this has been set and + :attr:`wiki_markup` is set to a ``False`` value, this is set to + ``None``.
+ """ + return self._closing_wiki_markup + @tag.setter def tag(self, value): self._tag = self._closing_tag = parse_anything(value) @@ -185,6 +222,8 @@ class Tag(Node): @wiki_markup.setter def wiki_markup(self, value): self._wiki_markup = str(value) if value else None + if not value or not self.closing_wiki_markup: + self._closing_wiki_markup = self._wiki_markup @self_closing.setter def self_closing(self, value): @@ -212,6 +251,14 @@ class Tag(Node): def closing_tag(self, value): self._closing_tag = parse_anything(value) + @wiki_style_separator.setter + def wiki_style_separator(self, value): + self._wiki_style_separator = str(value) if value else None + + @closing_wiki_markup.setter + def closing_wiki_markup(self, value): + self._closing_wiki_markup = str(value) if value else None + def has(self, name): """Return whether any attribute in the tag has the given *name*. @@ -226,8 +273,8 @@ class Tag(Node): def get(self, name): """Get the attribute with the given *name*. - The returned object is a :py:class:`~.Attribute` instance. Raises - :py:exc:`ValueError` if no attribute has this name. Since multiple + The returned object is a :class:`.Attribute` instance. Raises + :exc:`ValueError` if no attribute has this name. Since multiple attributes can have the same name, we'll return the last match, since all but the last are ignored by the MediaWiki parser. """ @@ -236,21 +283,24 @@ class Tag(Node): return attr raise ValueError(name) - def add(self, name, value=None, quoted=True, pad_first=" ", + def add(self, name, value=None, quotes='"', pad_first=" ", pad_before_eq="", pad_after_eq=""): """Add an attribute with the given *name* and *value*. - *name* and *value* can be anything parasable by - :py:func:`.utils.parse_anything`; *value* can be omitted if the - attribute is valueless. *quoted* is a bool telling whether to wrap the - *value* in double quotes (this is recommended). 
*pad_first*, - *pad_before_eq*, and *pad_after_eq* are whitespace used as padding - before the name, before the equal sign (or after the name if no value), - and after the equal sign (ignored if no value), respectively. + *name* and *value* can be anything parsable by + :func:`.utils.parse_anything`; *value* can be omitted if the attribute + is valueless. If *quotes* is not ``None``, it should be a string + (either ``"`` or ``'``) that *value* will be wrapped in (this is + recommended). ``None`` is only legal if *value* contains no spacing. + + *pad_first*, *pad_before_eq*, and *pad_after_eq* are whitespace used as + padding before the name, before the equal sign (or after the name if no + value), and after the equal sign (ignored if no value), respectively. """ if value is not None: value = parse_anything(value) - attr = Attribute(parse_anything(name), value, quoted) + quotes = Attribute.coerce_quotes(quotes) + attr = Attribute(parse_anything(name), value, quotes) attr.pad_first = pad_first attr.pad_before_eq = pad_before_eq attr.pad_after_eq = pad_after_eq diff --git a/mwparserfromhell/nodes/template.py b/mwparserfromhell/nodes/template.py index d1a0b0e..7cbeb7d 100644 --- a/mwparserfromhell/nodes/template.py +++ b/mwparserfromhell/nodes/template.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -95,7 +95,7 @@ class Template(Node): def _select_theory(self, theories): """Return the most likely spacing convention given different options. - Given a dictionary of convention options as keys and their occurance as + Given a dictionary of convention options as keys and their occurrence as values, return the convention that occurs the most, or ``None`` if there is no clear preferred style. 
""" @@ -110,8 +110,8 @@ class Template(Node): """Try to determine the whitespace conventions for parameters. This will examine the existing parameters and use - :py:meth:`_select_theory` to determine if there are any preferred - styles for how much whitespace to put before or after the value. + :meth:`_select_theory` to determine if there are any preferred styles + for how much whitespace to put before or after the value. """ before_theories = defaultdict(lambda: 0) after_theories = defaultdict(lambda: 0) @@ -155,10 +155,11 @@ class Template(Node): else: self.params.pop(i) return + raise ValueError(needle) @property def name(self): - """The name of the template, as a :py:class:`~.Wikicode` object.""" + """The name of the template, as a :class:`.Wikicode` object.""" return self._name @property @@ -188,13 +189,13 @@ class Template(Node): has_param = lambda self, name, ignore_empty=False: \ self.has(name, ignore_empty) - has_param.__doc__ = "Alias for :py:meth:`has`." + has_param.__doc__ = "Alias for :meth:`has`." def get(self, name): """Get the parameter whose name is *name*. - The returned object is a :py:class:`~.Parameter` instance. Raises - :py:exc:`ValueError` if no parameter has this name. Since multiple + The returned object is a :class:`.Parameter` instance. Raises + :exc:`ValueError` if no parameter has this name. Since multiple parameters can have the same name, we'll return the last match, since the last parameter is the only one read by the MediaWiki parser. """ @@ -208,9 +209,9 @@ class Template(Node): preserve_spacing=True): """Add a parameter to the template with a given *name* and *value*. - *name* and *value* can be anything parasable by - :py:func:`.utils.parse_anything`; pipes and equal signs are - automatically escaped from *value* when appropriate. + *name* and *value* can be anything parsable by + :func:`.utils.parse_anything`; pipes and equal signs are automatically + escaped from *value* when appropriate. 
If *showkey* is given, this will determine whether or not to show the parameter's name (e.g., ``{{foo|bar}}``'s parameter has a name of @@ -220,13 +221,13 @@ class Template(Node): If *name* is already a parameter in the template, we'll replace its value while keeping the same whitespace around it. We will also try to guess the dominant spacing convention when adding a new parameter using - :py:meth:`_get_spacing_conventions`. + :meth:`_get_spacing_conventions`. - If *before* is given (either a :py:class:`~.Parameter` object or a - name), then we will place the parameter immediately before this one. + If *before* is given (either a :class:`.Parameter` object or a name), + then we will place the parameter immediately before this one. Otherwise, it will be added at the end. If *before* is a name and exists multiple times in the template, we will place it before the last - occurance. If *before* is not in the template, :py:exc:`ValueError` is + occurrence. If *before* is not in the template, :exc:`ValueError` is raised. The argument is ignored if the new parameter already exists. If *preserve_spacing* is ``False``, we will avoid preserving spacing @@ -254,21 +255,19 @@ class Template(Node): return existing if showkey is None: - try: + if Parameter.can_hide_key(name): int_name = int(str(name)) - except ValueError: - showkey = True - else: int_keys = set() for param in self.params: if not param.showkey: - if re.match(r"[1-9][0-9]*$", param.name.strip()): - int_keys.add(int(str(param.name))) + int_keys.add(int(str(param.name))) expected = min(set(range(1, len(int_keys) + 2)) - int_keys) if expected == int_name: showkey = False else: showkey = True + else: + showkey = True if not showkey: self._surface_escape(value, "=") @@ -290,9 +289,9 @@ class Template(Node): def remove(self, param, keep_field=False): """Remove a parameter from the template, identified by *param*. 
- If *param* is a :py:class:`.Parameter` object, it will be matched - exactly, otherwise it will be treated like the *name* argument to - :py:meth:`has` and :py:meth:`get`. + If *param* is a :class:`.Parameter` object, it will be matched exactly, + otherwise it will be treated like the *name* argument to :meth:`has` + and :meth:`get`. If *keep_field* is ``True``, we will keep the parameter's name, but blank its value. Otherwise, we will remove the parameter completely @@ -301,7 +300,7 @@ class Template(Node): we expected, so ``{{foo||baz}}`` will be produced instead). If the parameter shows up multiple times in the template and *param* is - not a :py:class:`.Parameter` object, we will remove all instances of it + not a :class:`.Parameter` object, we will remove all instances of it (and keep only one if *keep_field* is ``True`` - the first instance if none have dependents, otherwise the one with dependents will be kept). """ diff --git a/mwparserfromhell/nodes/text.py b/mwparserfromhell/nodes/text.py index 55c714e..e793c1f 100644 --- a/mwparserfromhell/nodes/text.py +++ b/mwparserfromhell/nodes/text.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/mwparserfromhell/nodes/wikilink.py b/mwparserfromhell/nodes/wikilink.py index 4640f34..88eaacc 100644 --- a/mwparserfromhell/nodes/wikilink.py +++ b/mwparserfromhell/nodes/wikilink.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -62,12 +62,12 @@ class Wikilink(Node): @property def title(self): - """The title of the linked page, as a 
:py:class:`~.Wikicode` object.""" + """The title of the linked page, as a :class:`.Wikicode` object.""" return self._title @property def text(self): - """The text to display (if any), as a :py:class:`~.Wikicode` object.""" + """The text to display (if any), as a :class:`.Wikicode` object.""" return self._text @title.setter diff --git a/mwparserfromhell/parser/__init__.py b/mwparserfromhell/parser/__init__.py index 093e501..ae13c76 100644 --- a/mwparserfromhell/parser/__init__.py +++ b/mwparserfromhell/parser/__init__.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -22,10 +22,23 @@ """ This package contains the actual wikicode parser, split up into two main -modules: the :py:mod:`~.tokenizer` and the :py:mod:`~.builder`. This module -joins them together under one interface. +modules: the :mod:`.tokenizer` and the :mod:`.builder`. This module joins them +together into one interface. """ +class ParserError(Exception): + """Exception raised when an internal error occurs while parsing. + + This does not mean that the wikicode was invalid, because invalid markup + should still be parsed correctly. This means that the parser caught itself + with an impossible internal state and is bailing out before other problems + can happen. Its appearance indicates a bug. + """ + def __init__(self, extra): + msg = "This is a bug and should be reported. Info: {0}.".format(extra) + super(ParserError, self).__init__(msg) + + from .builder import Builder from .tokenizer import Tokenizer try: @@ -35,15 +48,22 @@ except ImportError: CTokenizer = None use_c = False -__all__ = ["use_c", "Parser"] +__all__ = ["use_c", "Parser", "ParserError"] class Parser(object): """Represents a parser for wikicode. 
Actual parsing is a two-step process: first, the text is split up into a - series of tokens by the :py:class:`~.Tokenizer`, and then the tokens are - converted into trees of :py:class:`~.Wikicode` objects and - :py:class:`~.Node`\ s by the :py:class:`~.Builder`. + series of tokens by the :class:`.Tokenizer`, and then the tokens are + converted into trees of :class:`.Wikicode` objects and :class:`.Node`\ s by + the :class:`.Builder`. + + Instances of this class or its dependents (:class:`.Tokenizer` and + :class:`.Builder`) should not be shared between threads. :meth:`parse` can + be called multiple times as long as it is not done concurrently. In + general, there is no need to do this because parsing should be done through + :func:`mwparserfromhell.parse`, which creates a new :class:`.Parser` object + as necessary. """ def __init__(self): @@ -54,10 +74,20 @@ class Parser(object): self._builder = Builder() def parse(self, text, context=0, skip_style_tags=False): - """Parse *text*, returning a :py:class:`~.Wikicode` object tree. + """Parse *text*, returning a :class:`.Wikicode` object tree. + + If given, *context* will be passed as a starting context to the parser. + This is helpful when this function is used inside node attribute + setters. For example, :class:`.ExternalLink`\ 's + :attr:`~.ExternalLink.url` setter sets *context* to + :mod:`contexts.EXT_LINK_URI <.contexts>` to prevent the URL itself + from becoming an :class:`.ExternalLink`. If *skip_style_tags* is ``True``, then ``''`` and ``'''`` will not be - parsed, but instead be treated as plain text. + parsed, but instead will be treated as plain text. + + If there is an internal error while parsing, :exc:`.ParserError` will + be raised. 
""" tokens = self._tokenizer.tokenize(text, context, skip_style_tags) code = self._builder.build(tokens) diff --git a/mwparserfromhell/parser/builder.py b/mwparserfromhell/parser/builder.py index 5f8ce45..ad29f4d 100644 --- a/mwparserfromhell/parser/builder.py +++ b/mwparserfromhell/parser/builder.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -22,7 +22,7 @@ from __future__ import unicode_literals -from . import tokens +from . import tokens, ParserError from ..compat import str from ..nodes import (Argument, Comment, ExternalLink, Heading, HTMLEntity, Tag, Template, Text, Wikilink) @@ -32,34 +32,42 @@ from ..wikicode import Wikicode __all__ = ["Builder"] +_HANDLERS = { + tokens.Text: lambda self, token: Text(token.text) +} + +def _add_handler(token_type): + """Create a decorator that adds a handler function to the lookup table.""" + def decorator(func): + """Add a handler function to the lookup table.""" + _HANDLERS[token_type] = func + return func + return decorator + + class Builder(object): - """Combines a sequence of tokens into a tree of ``Wikicode`` objects. + """Builds a tree of nodes out of a sequence of tokens. - To use, pass a list of :py:class:`~.Token`\ s to the :py:meth:`build` - method. The list will be exhausted as it is parsed and a - :py:class:`~.Wikicode` object will be returned. + To use, pass a list of :class:`.Token`\ s to the :meth:`build` method. The + list will be exhausted as it is parsed and a :class:`.Wikicode` object + containing the node tree will be returned. 
""" def __init__(self): self._tokens = [] self._stacks = [] - def _wrap(self, nodes): - """Properly wrap a list of nodes in a ``Wikicode`` object.""" - return Wikicode(SmartList(nodes)) - def _push(self): """Push a new node list onto the stack.""" self._stacks.append([]) - def _pop(self, wrap=True): + def _pop(self): """Pop the current node list off of the stack. - If *wrap* is ``True``, we will call :py:meth:`_wrap` on the list. + The raw node list is wrapped in a :class:`.SmartList` and then in a + :class:`.Wikicode` object. """ - if wrap: - return self._wrap(self._stacks.pop()) - return self._stacks.pop() + return Wikicode(SmartList(self._stacks.pop())) def _write(self, item): """Append a node to the current node list.""" @@ -84,12 +92,14 @@ class Builder(object): self._tokens.append(token) value = self._pop() if key is None: - key = self._wrap([Text(str(default))]) + key = Wikicode(SmartList([Text(str(default))])) return Parameter(key, value, showkey) else: self._write(self._handle_token(token)) + raise ParserError("_handle_parameter() missed a close token") - def _handle_template(self): + @_add_handler(tokens.TemplateOpen) + def _handle_template(self, token): """Handle a case where a template is at the head of the tokens.""" params = [] default = 1 @@ -109,8 +119,10 @@ class Builder(object): return Template(name, params) else: self._write(self._handle_token(token)) + raise ParserError("_handle_template() missed a close token") - def _handle_argument(self): + @_add_handler(tokens.ArgumentOpen) + def _handle_argument(self, token): """Handle a case where an argument is at the head of the tokens.""" name = None self._push() @@ -125,8 +137,10 @@ class Builder(object): return Argument(self._pop()) else: self._write(self._handle_token(token)) + raise ParserError("_handle_argument() missed a close token") - def _handle_wikilink(self): + @_add_handler(tokens.WikilinkOpen) + def _handle_wikilink(self, token): """Handle a case where a wikilink is at the head of the 
tokens.""" title = None self._push() @@ -141,7 +155,9 @@ class Builder(object): return Wikilink(self._pop()) else: self._write(self._handle_token(token)) + raise ParserError("_handle_wikilink() missed a close token") + @_add_handler(tokens.ExternalLinkOpen) def _handle_external_link(self, token): """Handle when an external link is at the head of the tokens.""" brackets, url = token.brackets, None @@ -157,8 +173,10 @@ class Builder(object): return ExternalLink(self._pop(), brackets=brackets) else: self._write(self._handle_token(token)) + raise ParserError("_handle_external_link() missed a close token") - def _handle_entity(self): + @_add_handler(tokens.HTMLEntityStart) + def _handle_entity(self, token): """Handle a case where an HTML entity is at the head of the tokens.""" token = self._tokens.pop() if isinstance(token, tokens.HTMLEntityNumeric): @@ -173,6 +191,7 @@ class Builder(object): self._tokens.pop() # Remove HTMLEntityEnd return HTMLEntity(token.text, named=True, hexadecimal=False) + @_add_handler(tokens.HeadingStart) def _handle_heading(self, token): """Handle a case where a heading is at the head of the tokens.""" level = token.level @@ -184,8 +203,10 @@ class Builder(object): return Heading(title, level) else: self._write(self._handle_token(token)) + raise ParserError("_handle_heading() missed a close token") - def _handle_comment(self): + @_add_handler(tokens.CommentStart) + def _handle_comment(self, token): """Handle a case where an HTML comment is at the head of the tokens.""" self._push() while self._tokens: @@ -195,10 +216,11 @@ class Builder(object): return Comment(contents) else: self._write(self._handle_token(token)) + raise ParserError("_handle_comment() missed a close token") def _handle_attribute(self, start): """Handle a case where a tag attribute is at the head of the tokens.""" - name, quoted = None, False + name = quotes = None self._push() while self._tokens: token = self._tokens.pop() @@ -206,7 +228,7 @@ class Builder(object): name = 
self._pop() self._push() elif isinstance(token, tokens.TagAttrQuote): - quoted = True + quotes = token.char elif isinstance(token, (tokens.TagAttrStart, tokens.TagCloseOpen, tokens.TagCloseSelfclose)): self._tokens.append(token) @@ -214,30 +236,37 @@ class Builder(object): value = self._pop() else: name, value = self._pop(), None - return Attribute(name, value, quoted, start.pad_first, - start.pad_before_eq, start.pad_after_eq) + return Attribute(name, value, quotes, start.pad_first, + start.pad_before_eq, start.pad_after_eq, + check_quotes=False) else: self._write(self._handle_token(token)) + raise ParserError("_handle_attribute() missed a close token") + @_add_handler(tokens.TagOpenOpen) def _handle_tag(self, token): """Handle a case where a tag is at the head of the tokens.""" close_tokens = (tokens.TagCloseSelfclose, tokens.TagCloseClose) implicit, attrs, contents, closing_tag = False, [], None, None wiki_markup, invalid = token.wiki_markup, token.invalid or False + wiki_style_separator, closing_wiki_markup = None, wiki_markup self._push() while self._tokens: token = self._tokens.pop() if isinstance(token, tokens.TagAttrStart): attrs.append(self._handle_attribute(token)) elif isinstance(token, tokens.TagCloseOpen): + wiki_style_separator = token.wiki_markup padding = token.padding or "" tag = self._pop() self._push() elif isinstance(token, tokens.TagOpenClose): + closing_wiki_markup = token.wiki_markup contents = self._pop() self._push() elif isinstance(token, close_tokens): if isinstance(token, tokens.TagCloseSelfclose): + closing_wiki_markup = token.wiki_markup tag = self._pop() self_closing = True padding = token.padding or "" @@ -246,30 +275,19 @@ class Builder(object): self_closing = False closing_tag = self._pop() return Tag(tag, contents, attrs, wiki_markup, self_closing, - invalid, implicit, padding, closing_tag) + invalid, implicit, padding, closing_tag, + wiki_style_separator, closing_wiki_markup) else: self._write(self._handle_token(token)) + raise 
ParserError("_handle_tag() missed a close token") def _handle_token(self, token): """Handle a single token.""" - if isinstance(token, tokens.Text): - return Text(token.text) - elif isinstance(token, tokens.TemplateOpen): - return self._handle_template() - elif isinstance(token, tokens.ArgumentOpen): - return self._handle_argument() - elif isinstance(token, tokens.WikilinkOpen): - return self._handle_wikilink() - elif isinstance(token, tokens.ExternalLinkOpen): - return self._handle_external_link(token) - elif isinstance(token, tokens.HTMLEntityStart): - return self._handle_entity() - elif isinstance(token, tokens.HeadingStart): - return self._handle_heading(token) - elif isinstance(token, tokens.CommentStart): - return self._handle_comment() - elif isinstance(token, tokens.TagOpenOpen): - return self._handle_tag(token) + try: + return _HANDLERS[type(token)](self, token) + except KeyError: + err = "_handle_token() got unexpected {0}" + raise ParserError(err.format(type(token).__name__)) def build(self, tokenlist): """Build a Wikicode object from a list tokens and return it.""" @@ -280,3 +298,6 @@ class Builder(object): node = self._handle_token(self._tokens.pop()) self._write(node) return self._pop() + + +del _add_handler diff --git a/mwparserfromhell/parser/contexts.py b/mwparserfromhell/parser/contexts.py index 28023b5..e98d8f7 100644 --- a/mwparserfromhell/parser/contexts.py +++ b/mwparserfromhell/parser/contexts.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -35,72 +35,81 @@ will cover ``BAR == 0b10`` and ``BAZ == 0b01``). 
Local (stack-specific) contexts: -* :py:const:`TEMPLATE` +* :const:`TEMPLATE` - * :py:const:`TEMPLATE_NAME` - * :py:const:`TEMPLATE_PARAM_KEY` - * :py:const:`TEMPLATE_PARAM_VALUE` + * :const:`TEMPLATE_NAME` + * :const:`TEMPLATE_PARAM_KEY` + * :const:`TEMPLATE_PARAM_VALUE` -* :py:const:`ARGUMENT` +* :const:`ARGUMENT` - * :py:const:`ARGUMENT_NAME` - * :py:const:`ARGUMENT_DEFAULT` + * :const:`ARGUMENT_NAME` + * :const:`ARGUMENT_DEFAULT` -* :py:const:`WIKILINK` +* :const:`WIKILINK` - * :py:const:`WIKILINK_TITLE` - * :py:const:`WIKILINK_TEXT` + * :const:`WIKILINK_TITLE` + * :const:`WIKILINK_TEXT` -* :py:const:`EXT_LINK` +* :const:`EXT_LINK` - * :py:const:`EXT_LINK_URI` - * :py:const:`EXT_LINK_TITLE` + * :const:`EXT_LINK_URI` + * :const:`EXT_LINK_TITLE` -* :py:const:`HEADING` +* :const:`HEADING` - * :py:const:`HEADING_LEVEL_1` - * :py:const:`HEADING_LEVEL_2` - * :py:const:`HEADING_LEVEL_3` - * :py:const:`HEADING_LEVEL_4` - * :py:const:`HEADING_LEVEL_5` - * :py:const:`HEADING_LEVEL_6` + * :const:`HEADING_LEVEL_1` + * :const:`HEADING_LEVEL_2` + * :const:`HEADING_LEVEL_3` + * :const:`HEADING_LEVEL_4` + * :const:`HEADING_LEVEL_5` + * :const:`HEADING_LEVEL_6` -* :py:const:`TAG` +* :const:`TAG` - * :py:const:`TAG_OPEN` - * :py:const:`TAG_ATTR` - * :py:const:`TAG_BODY` - * :py:const:`TAG_CLOSE` + * :const:`TAG_OPEN` + * :const:`TAG_ATTR` + * :const:`TAG_BODY` + * :const:`TAG_CLOSE` -* :py:const:`STYLE` +* :const:`STYLE` - * :py:const:`STYLE_ITALICS` - * :py:const:`STYLE_BOLD` - * :py:const:`STYLE_PASS_AGAIN` - * :py:const:`STYLE_SECOND_PASS` + * :const:`STYLE_ITALICS` + * :const:`STYLE_BOLD` + * :const:`STYLE_PASS_AGAIN` + * :const:`STYLE_SECOND_PASS` -* :py:const:`DL_TERM` +* :const:`DL_TERM` -* :py:const:`SAFETY_CHECK` +* :const:`SAFETY_CHECK` - * :py:const:`HAS_TEXT` - * :py:const:`FAIL_ON_TEXT` - * :py:const:`FAIL_NEXT` - * :py:const:`FAIL_ON_LBRACE` - * :py:const:`FAIL_ON_RBRACE` - * :py:const:`FAIL_ON_EQUALS` + * :const:`HAS_TEXT` + * :const:`FAIL_ON_TEXT` + * 
:const:`FAIL_NEXT` + * :const:`FAIL_ON_LBRACE` + * :const:`FAIL_ON_RBRACE` + * :const:`FAIL_ON_EQUALS` + +* :const:`TABLE` + + * :const:`TABLE_OPEN` + * :const:`TABLE_CELL_OPEN` + * :const:`TABLE_CELL_STYLE` + * :const:`TABLE_TD_LINE` + * :const:`TABLE_TH_LINE` + * :const:`TABLE_CELL_LINE_CONTEXTS` Global contexts: -* :py:const:`GL_HEADING` +* :const:`GL_HEADING` Aggregate contexts: -* :py:const:`FAIL` -* :py:const:`UNSAFE` -* :py:const:`DOUBLE` -* :py:const:`NO_WIKILINKS` -* :py:const:`NO_EXT_LINKS` +* :const:`FAIL` +* :const:`UNSAFE` +* :const:`DOUBLE` +* :const:`NO_WIKILINKS` +* :const:`NO_EXT_LINKS` """ @@ -155,15 +164,26 @@ FAIL_ON_EQUALS = 1 << 29 SAFETY_CHECK = (HAS_TEXT + FAIL_ON_TEXT + FAIL_NEXT + FAIL_ON_LBRACE + FAIL_ON_RBRACE + FAIL_ON_EQUALS) +TABLE_OPEN = 1 << 30 +TABLE_CELL_OPEN = 1 << 31 +TABLE_CELL_STYLE = 1 << 32 +TABLE_ROW_OPEN = 1 << 33 +TABLE_TD_LINE = 1 << 34 +TABLE_TH_LINE = 1 << 35 +TABLE_CELL_LINE_CONTEXTS = TABLE_TD_LINE + TABLE_TH_LINE + TABLE_CELL_STYLE +TABLE = (TABLE_OPEN + TABLE_CELL_OPEN + TABLE_CELL_STYLE + TABLE_ROW_OPEN + + TABLE_TD_LINE + TABLE_TH_LINE) + # Global contexts: GL_HEADING = 1 << 0 # Aggregate contexts: -FAIL = TEMPLATE + ARGUMENT + WIKILINK + EXT_LINK_TITLE + HEADING + TAG + STYLE +FAIL = (TEMPLATE + ARGUMENT + WIKILINK + EXT_LINK_TITLE + HEADING + TAG + + STYLE + TABLE) UNSAFE = (TEMPLATE_NAME + WIKILINK_TITLE + EXT_LINK_TITLE + TEMPLATE_PARAM_KEY + ARGUMENT_NAME + TAG_CLOSE) -DOUBLE = TEMPLATE_PARAM_KEY + TAG_CLOSE +DOUBLE = TEMPLATE_PARAM_KEY + TAG_CLOSE + TABLE_ROW_OPEN NO_WIKILINKS = TEMPLATE_NAME + ARGUMENT_NAME + WIKILINK_TITLE + EXT_LINK_URI NO_EXT_LINKS = TEMPLATE_NAME + ARGUMENT_NAME + WIKILINK_TITLE + EXT_LINK diff --git a/mwparserfromhell/parser/tokenizer.c b/mwparserfromhell/parser/tokenizer.c index de58e72..c125021 100644 --- a/mwparserfromhell/parser/tokenizer.c +++ b/mwparserfromhell/parser/tokenizer.c @@ -1,6 +1,6 @@ /* Tokenizer for MWParserFromHell -Copyright (C) 2012-2014 Ben Kurtovic +Copyright 
(C) 2012-2015 Ben Kurtovic Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in @@ -69,15 +69,19 @@ static int call_def_func(const char* funcname, PyObject* in1, PyObject* in2, /* Sanitize the name of a tag so it can be compared with others for equality. */ -static PyObject* strip_tag_name(PyObject* token) +static PyObject* strip_tag_name(PyObject* token, int take_attr) { PyObject *text, *rstripped, *lowered; - text = PyObject_GetAttrString(token, "text"); - if (!text) - return NULL; - rstripped = PyObject_CallMethod(text, "rstrip", NULL); - Py_DECREF(text); + if (take_attr) { + text = PyObject_GetAttrString(token, "text"); + if (!text) + return NULL; + rstripped = PyObject_CallMethod(text, "rstrip", NULL); + Py_DECREF(text); + } + else + rstripped = PyObject_CallMethod(token, "rstrip", NULL); if (!rstripped) return NULL; lowered = PyObject_CallMethod(rstripped, "lower", NULL); @@ -173,7 +177,7 @@ static TagData* TagData_new(void) ALLOC_BUFFER(self->pad_first) ALLOC_BUFFER(self->pad_before_eq) ALLOC_BUFFER(self->pad_after_eq) - self->reset = 0; + self->quoter = self->reset = 0; return self; } @@ -241,7 +245,7 @@ static int Tokenizer_init(Tokenizer* self, PyObject* args, PyObject* kwds) /* Add a new token stack, context, and textbuffer to the list. */ -static int Tokenizer_push(Tokenizer* self, int context) +static int Tokenizer_push(Tokenizer* self, uint64_t context) { Stack* top = malloc(sizeof(Stack)); @@ -333,7 +337,7 @@ static PyObject* Tokenizer_pop(Tokenizer* self) static PyObject* Tokenizer_pop_keeping_context(Tokenizer* self) { PyObject* stack; - int context; + uint64_t context; if (Tokenizer_push_textbuffer(self)) return NULL; @@ -347,11 +351,11 @@ static PyObject* Tokenizer_pop_keeping_context(Tokenizer* self) /* Fail the current tokenization route. Discards the current - stack/context/textbuffer and raises a BadRoute exception. 
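`strip_tag_name()` above gains a `take_attr` flag so it can normalize either a token (reading its `text` attribute) or an already-rendered string. A hedged Python equivalent of the normalization, with a hypothetical stand-in token class:

```python
# Sketch of strip_tag_name(): strip trailing whitespace and lowercase, so
# "<Ref >...</REF>" is treated as a matching open/close pair. take_attr
# mirrors the new C parameter choosing between a token's .text attribute
# and a plain string.
def strip_tag_name(token, take_attr=True):
    text = token.text if take_attr else token
    return text.rstrip().lower()

class FakeText(object):  # hypothetical stand-in for a Text token
    def __init__(self, text):
        self.text = text
```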
+ stack/context/textbuffer and sets the BAD_ROUTE flag. */ static void* Tokenizer_fail_route(Tokenizer* self) { - int context = self->topstack->context; + uint64_t context = self->topstack->context; PyObject* stack = Tokenizer_pop(self); Py_XDECREF(stack); @@ -676,11 +680,8 @@ static int Tokenizer_parse_template_or_argument(Tokenizer* self) RESET_ROUTE(); for (i = 0; i < braces; i++) text[i] = '{'; text[braces] = '\0'; - if (Tokenizer_emit_text_then_stack(self, text)) { - Py_XDECREF(text); + if (Tokenizer_emit_text_then_stack(self, text)) return -1; - } - Py_XDECREF(text); return 0; } else @@ -832,8 +833,6 @@ static int Tokenizer_parse_wikilink(Tokenizer* self) Py_DECREF(wikilink); if (Tokenizer_emit(self, WikilinkClose)) return -1; - if (self->topstack->context & LC_FAIL_NEXT) - self->topstack->context ^= LC_FAIL_NEXT; return 0; } @@ -1036,7 +1035,7 @@ Tokenizer_is_free_link(Tokenizer* self, Py_UNICODE this, Py_UNICODE next) { // Built from Tokenizer_parse()'s end sentinels: Py_UNICODE after = Tokenizer_READ(self, 2); - int ctx = self->topstack->context; + uint64_t ctx = self->topstack->context; return (!this || this == '\n' || this == '[' || this == ']' || this == '<' || this == '>' || (this == '\'' && next == '\'') || @@ -1555,6 +1554,12 @@ static int Tokenizer_parse_comment(Tokenizer* self) return -1; Py_DECREF(comment); self->head += 2; + if (self->topstack->context & LC_FAIL_NEXT) { + /* _verify_safe() sets this flag while parsing a template name + when it encounters what might be a comment -- we must unset + it to let _verify_safe() know it was correct: */ + self->topstack->context ^= LC_FAIL_NEXT; + } return 0; } if (Tokenizer_emit_char(self, this)) @@ -1568,10 +1573,18 @@ static int Tokenizer_parse_comment(Tokenizer* self) */ static int Tokenizer_push_tag_buffer(Tokenizer* self, TagData* data) { - PyObject *tokens, *kwargs, *pad_first, *pad_before_eq, *pad_after_eq; + PyObject *tokens, *kwargs, *tmp, *pad_first, *pad_before_eq, *pad_after_eq; if 
(data->context & TAG_QUOTED) { - if (Tokenizer_emit_first(self, TagAttrQuote)) + kwargs = PyDict_New(); + if (!kwargs) + return -1; + tmp = PyUnicode_FromUnicode(&data->quoter, 1); + if (!tmp) + return -1; + PyDict_SetItemString(kwargs, "char", tmp); + Py_DECREF(tmp); + if (Tokenizer_emit_first_kwargs(self, TagAttrQuote, kwargs)) return -1; tokens = Tokenizer_pop(self); if (!tokens) @@ -1617,9 +1630,9 @@ static int Tokenizer_push_tag_buffer(Tokenizer* self, TagData* data) static int Tokenizer_handle_tag_space(Tokenizer* self, TagData* data, Py_UNICODE text) { - int ctx = data->context; - int end_of_value = (ctx & TAG_ATTR_VALUE && - !(ctx & (TAG_QUOTED | TAG_NOTE_QUOTE))); + uint64_t ctx = data->context; + uint64_t end_of_value = (ctx & TAG_ATTR_VALUE && + !(ctx & (TAG_QUOTED | TAG_NOTE_QUOTE))); if (end_of_value || (ctx & TAG_QUOTED && ctx & TAG_NOTE_SPACE)) { if (Tokenizer_push_tag_buffer(self, data)) @@ -1718,21 +1731,22 @@ Tokenizer_handle_tag_data(Tokenizer* self, TagData* data, Py_UNICODE chunk) return -1; } } - else if (data->context & TAG_ATTR_VALUE) { + else { // data->context & TAG_ATTR_VALUE assured escaped = (Tokenizer_READ_BACKWARDS(self, 1) == '\\' && Tokenizer_READ_BACKWARDS(self, 2) != '\\'); if (data->context & TAG_NOTE_QUOTE) { data->context ^= TAG_NOTE_QUOTE; - if (chunk == '"' && !escaped) { + if ((chunk == '"' || chunk == '\'') && !escaped) { data->context |= TAG_QUOTED; + data->quoter = chunk; + data->reset = self->head; if (Tokenizer_push(self, self->topstack->context)) return -1; - data->reset = self->head; return 0; } } else if (data->context & TAG_QUOTED) { - if (chunk == '"' && !escaped) { + if (chunk == data->quoter && !escaped) { data->context |= TAG_NOTE_SPACE; return 0; } @@ -1802,8 +1816,9 @@ static PyObject* Tokenizer_handle_tag_close_close(Tokenizer* self) valid = 0; break; case 1: { - so = strip_tag_name(first); - sc = strip_tag_name(PyList_GET_ITEM(self->topstack->stack, 1)); + so = strip_tag_name(first, 1); + sc = 
strip_tag_name( + PyList_GET_ITEM(self->topstack->stack, 1), 1); if (so && sc) { if (PyUnicode_Compare(so, sc)) valid = 0; @@ -1838,7 +1853,11 @@ static PyObject* Tokenizer_handle_tag_close_close(Tokenizer* self) */ static PyObject* Tokenizer_handle_blacklisted_tag(Tokenizer* self) { + Textbuffer* buffer; + PyObject *buf_tmp, *end_tag, *start_tag; Py_UNICODE this, next; + Py_ssize_t reset; + int cmp; while (1) { this = Tokenizer_READ(self, 0); @@ -1846,10 +1865,48 @@ static PyObject* Tokenizer_handle_blacklisted_tag(Tokenizer* self) if (!this) return Tokenizer_fail_route(self); else if (this == '<' && next == '/') { - if (Tokenizer_handle_tag_open_close(self)) + self->head += 2; + reset = self->head - 1; + buffer = Textbuffer_new(); + if (!buffer) return NULL; - self->head++; - return Tokenizer_parse(self, 0, 0); + while ((this = Tokenizer_READ(self, 0))) { + if (this == '>') { + buf_tmp = Textbuffer_render(buffer); + if (!buf_tmp) + return NULL; + end_tag = strip_tag_name(buf_tmp, 0); + Py_DECREF(buf_tmp); + if (!end_tag) + return NULL; + start_tag = strip_tag_name( + PyList_GET_ITEM(self->topstack->stack, 1), 1); + if (!start_tag) + return NULL; + cmp = PyUnicode_Compare(start_tag, end_tag); + Py_DECREF(end_tag); + Py_DECREF(start_tag); + if (cmp) + goto no_matching_end; + if (Tokenizer_emit(self, TagOpenClose)) + return NULL; + if (Tokenizer_emit_textbuffer(self, buffer, 0)) + return NULL; + if (Tokenizer_emit(self, TagCloseClose)) + return NULL; + return Tokenizer_pop(self); + } + if (!this || this == '\n') { + no_matching_end: + Textbuffer_dealloc(buffer); + self->head = reset; + if (Tokenizer_emit_text(self, "</")) + return NULL; + break; + } + Textbuffer_write(&buffer, this); + self->head++; + } } else if (this == '&') { if (Tokenizer_parse_entity(self)) @@ -1896,18 +1953,26 @@ static PyObject* Tokenizer_handle_single_tag_end(Tokenizer* self) { PyObject *token = 0, *padding, *kwargs; Py_ssize_t len, index; - int is_instance; + int depth = 1, is_instance; len = PyList_GET_SIZE(self->topstack->stack); - for (index = 0; index < len;
index++) { + for (index = 2; index < len; index++) { token = PyList_GET_ITEM(self->topstack->stack, index); - is_instance = PyObject_IsInstance(token, TagCloseOpen); + is_instance = PyObject_IsInstance(token, TagOpenOpen); if (is_instance == -1) return NULL; else if (is_instance == 1) - break; + depth++; + is_instance = PyObject_IsInstance(token, TagCloseOpen); + if (is_instance == -1) + return NULL; + else if (is_instance == 1) { + depth--; + if (depth == 0) + break; + } } - if (!token) + if (!token || depth > 0) return NULL; padding = PyObject_GetAttrString(token, "padding"); if (!padding) @@ -2132,7 +2197,7 @@ static int Tokenizer_emit_style_tag(Tokenizer* self, const char* tag, static int Tokenizer_parse_italics(Tokenizer* self) { Py_ssize_t reset = self->head; - int context; + uint64_t context; PyObject *stack; stack = Tokenizer_parse(self, LC_STYLE_ITALICS, 1); @@ -2252,7 +2317,7 @@ static int Tokenizer_parse_italics_and_bold(Tokenizer* self) */ static PyObject* Tokenizer_parse_style(Tokenizer* self) { - int context = self->topstack->context, ticks = 2, i; + uint64_t context = self->topstack->context, ticks = 2, i; self->head += 2; while (Tokenizer_READ(self, 0) == '\'') { @@ -2405,9 +2470,363 @@ static int Tokenizer_handle_dl_term(Tokenizer* self) } /* + Emit a table tag. 
+*/ +static int +Tokenizer_emit_table_tag(Tokenizer* self, const char* open_open_markup, + const char* tag, PyObject* style, PyObject* padding, + const char* close_open_markup, PyObject* contents, + const char* open_close_markup) +{ + PyObject *open_open_kwargs, *open_open_markup_unicode, *close_open_kwargs, + *close_open_markup_unicode, *open_close_kwargs, + *open_close_markup_unicode; + + open_open_kwargs = PyDict_New(); + if (!open_open_kwargs) + goto fail_decref_all; + open_open_markup_unicode = PyUnicode_FromString(open_open_markup); + if (!open_open_markup_unicode) { + Py_DECREF(open_open_kwargs); + goto fail_decref_all; + } + PyDict_SetItemString(open_open_kwargs, "wiki_markup", + open_open_markup_unicode); + Py_DECREF(open_open_markup_unicode); + if (Tokenizer_emit_kwargs(self, TagOpenOpen, open_open_kwargs)) + goto fail_decref_all; + if (Tokenizer_emit_text(self, tag)) + goto fail_decref_all; + + if (style) { + if (Tokenizer_emit_all(self, style)) + goto fail_decref_all; + Py_DECREF(style); + } + + close_open_kwargs = PyDict_New(); + if (!close_open_kwargs) + goto fail_decref_padding_contents; + if (close_open_markup && strlen(close_open_markup) != 0) { + close_open_markup_unicode = PyUnicode_FromString(close_open_markup); + if (!close_open_markup_unicode) { + Py_DECREF(close_open_kwargs); + goto fail_decref_padding_contents; + } + PyDict_SetItemString(close_open_kwargs, "wiki_markup", + close_open_markup_unicode); + Py_DECREF(close_open_markup_unicode); + } + PyDict_SetItemString(close_open_kwargs, "padding", padding); + Py_DECREF(padding); + if (Tokenizer_emit_kwargs(self, TagCloseOpen, close_open_kwargs)) + goto fail_decref_contents; + + if (contents) { + if (Tokenizer_emit_all(self, contents)) + goto fail_decref_contents; + Py_DECREF(contents); + } + + open_close_kwargs = PyDict_New(); + if (!open_close_kwargs) + return -1; + open_close_markup_unicode = PyUnicode_FromString(open_close_markup); + if (!open_close_markup_unicode) { + 
Py_DECREF(open_close_kwargs); + return -1; + } + PyDict_SetItemString(open_close_kwargs, "wiki_markup", + open_close_markup_unicode); + Py_DECREF(open_close_markup_unicode); + if (Tokenizer_emit_kwargs(self, TagOpenClose, open_close_kwargs)) + return -1; + if (Tokenizer_emit_text(self, tag)) + return -1; + if (Tokenizer_emit(self, TagCloseClose)) + return -1; + return 0; + + fail_decref_all: + Py_XDECREF(style); + fail_decref_padding_contents: + Py_DECREF(padding); + fail_decref_contents: + Py_DECREF(contents); + return -1; +} + +/* + Handle style attributes for a table until an ending token. +*/ +static PyObject* Tokenizer_handle_table_style(Tokenizer* self, char end_token) +{ + TagData *data = TagData_new(); + PyObject *padding, *trash; + Py_UNICODE this; + int can_exit; + + if (!data) + return NULL; + data->context = TAG_ATTR_READY; + + while (1) { + this = Tokenizer_READ(self, 0); + can_exit = (!(data->context & TAG_QUOTED) || data->context & TAG_NOTE_SPACE); + if (this == end_token && can_exit) { + if (data->context & (TAG_ATTR_NAME | TAG_ATTR_VALUE)) { + if (Tokenizer_push_tag_buffer(self, data)) { + TagData_dealloc(data); + return NULL; + } + } + if (Py_UNICODE_ISSPACE(this)) + Textbuffer_write(&(data->pad_first), this); + padding = Textbuffer_render(data->pad_first); + TagData_dealloc(data); + if (!padding) + return NULL; + return padding; + } + else if (!this || this == end_token) { + if (self->topstack->context & LC_TAG_ATTR) { + if (data->context & TAG_QUOTED) { + // Unclosed attribute quote: reset, don't die + data->context = TAG_ATTR_VALUE; + trash = Tokenizer_pop(self); + Py_XDECREF(trash); + self->head = data->reset; + continue; + } + trash = Tokenizer_pop(self); + Py_XDECREF(trash); + } + TagData_dealloc(data); + return Tokenizer_fail_route(self); + } + else { + if (Tokenizer_handle_tag_data(self, data, this) || BAD_ROUTE) { + TagData_dealloc(data); + return NULL; + } + } + self->head++; + } +} + +/* + Parse a wikicode table by starting with the 
first line. +*/ +static int Tokenizer_parse_table(Tokenizer* self) +{ + Py_ssize_t reset = self->head + 1; + PyObject *style, *padding; + PyObject *table = NULL; + self->head += 2; + + if(Tokenizer_push(self, LC_TABLE_OPEN)) + return -1; + padding = Tokenizer_handle_table_style(self, '\n'); + if (BAD_ROUTE) { + RESET_ROUTE(); + self->head = reset; + if (Tokenizer_emit_text(self, "{|")) + return -1; + return 0; + } + if (!padding) + return -1; + style = Tokenizer_pop(self); + if (!style) { + Py_DECREF(padding); + return -1; + } + + self->head++; + table = Tokenizer_parse(self, LC_TABLE_OPEN, 1); + if (BAD_ROUTE) { + RESET_ROUTE(); + Py_DECREF(padding); + Py_DECREF(style); + self->head = reset; + if (Tokenizer_emit_text(self, "{|")) + return -1; + return 0; + } + if (!table) { + Py_DECREF(padding); + Py_DECREF(style); + return -1; + } + + if (Tokenizer_emit_table_tag(self, "{|", "table", style, padding, NULL, + table, "|}")) + return -1; + // Offset displacement done by _parse() + self->head--; + return 0; +} + +/* + Parse as style until end of the line, then continue. 
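`Tokenizer_emit_table_tag()` above serializes a table into the same open/close tag token stream used for HTML tags, with `wiki_markup` recording the wikicode syntax (`{|`, `|-`, `|`). A rough Python model of the emitted sequence; the dict-based tokens are illustrative stand-ins for the real token classes, not the parser's actual data structures:

```python
# Model of the token stream built by Tokenizer_emit_table_tag(): an opening
# tag carrying wiki_markup, the HTML tag name, optional style tokens, the
# padding, the contents, then the closing tag. Dicts stand in for Tokens.
def emit_table_tag(open_markup, tag, style, padding, contents, close_markup):
    tokens = [{"type": "TagOpenOpen", "wiki_markup": open_markup},
              {"type": "Text", "text": tag}]
    tokens.extend(style)
    tokens.append({"type": "TagCloseOpen", "padding": padding})
    tokens.extend(contents)
    tokens.append({"type": "TagOpenClose", "wiki_markup": close_markup})
    tokens.append({"type": "Text", "text": tag})
    tokens.append({"type": "TagCloseClose"})
    return tokens

stream = emit_table_tag("{|", "table", [], "\n",
                        [{"type": "Text", "text": "row tokens here"}], "|}")
```

The builder's `_handle_tag()` then reassembles this stream into a single `Tag` node whose `wiki_markup` and `closing_wiki_markup` preserve the original wikicode spelling.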
+*/ +static int Tokenizer_handle_table_row(Tokenizer* self) +{ + PyObject *padding, *style, *row, *trash; + self->head += 2; + + if (!Tokenizer_CAN_RECURSE(self)) { + if (Tokenizer_emit_text(self, "|-")) + return -1; + self->head -= 1; + return 0; + } + + if(Tokenizer_push(self, LC_TABLE_OPEN | LC_TABLE_ROW_OPEN)) + return -1; + padding = Tokenizer_handle_table_style(self, '\n'); + if (BAD_ROUTE) { + trash = Tokenizer_pop(self); + Py_XDECREF(trash); + return 0; + } + if (!padding) + return -1; + style = Tokenizer_pop(self); + if (!style) { + Py_DECREF(padding); + return -1; + } + + // Don't parse the style separator + self->head++; + row = Tokenizer_parse(self, LC_TABLE_OPEN | LC_TABLE_ROW_OPEN, 1); + if (!row) { + Py_DECREF(padding); + Py_DECREF(style); + return -1; + } + + if (Tokenizer_emit_table_tag(self, "|-", "tr", style, padding, NULL, row, "")) + return -1; + // Offset displacement done by _parse() + self->head--; + return 0; +} + +/* + Parse as normal syntax unless we hit a style marker, then parse style + as HTML attributes and the remainder as normal syntax. 
+*/ +static int +Tokenizer_handle_table_cell(Tokenizer* self, const char *markup, + const char *tag, uint64_t line_context) +{ + uint64_t old_context = self->topstack->context; + uint64_t cell_context; + Py_ssize_t reset; + PyObject *padding, *cell, *style = NULL; + const char *close_open_markup = NULL; + + self->head += strlen(markup); + reset = self->head; + + if (!Tokenizer_CAN_RECURSE(self)) { + if (Tokenizer_emit_text(self, markup)) + return -1; + self->head--; + return 0; + } + + cell = Tokenizer_parse(self, LC_TABLE_OPEN | LC_TABLE_CELL_OPEN | + LC_TABLE_CELL_STYLE | line_context, 1); + if (!cell) + return -1; + cell_context = self->topstack->context; + self->topstack->context = old_context; + + if (cell_context & LC_TABLE_CELL_STYLE) { + Py_DECREF(cell); + self->head = reset; + if(Tokenizer_push(self, LC_TABLE_OPEN | LC_TABLE_CELL_OPEN | + line_context)) + return -1; + padding = Tokenizer_handle_table_style(self, '|'); + if (!padding) + return -1; + style = Tokenizer_pop(self); + if (!style) { + Py_DECREF(padding); + return -1; + } + // Don't parse the style separator + self->head++; + cell = Tokenizer_parse(self, LC_TABLE_OPEN | LC_TABLE_CELL_OPEN | + line_context, 1); + if (!cell) { + Py_DECREF(padding); + Py_DECREF(style); + return -1; + } + cell_context = self->topstack->context; + self->topstack->context = old_context; + } + else { + padding = PyUnicode_FromString(""); + if (!padding) { + Py_DECREF(cell); + return -1; + } + } + + if (style) { + close_open_markup = "|"; + } + if (Tokenizer_emit_table_tag(self, markup, tag, style, padding, + close_open_markup, cell, "")) + return -1; + // Keep header/cell line contexts + self->topstack->context |= cell_context & (LC_TABLE_TH_LINE | LC_TABLE_TD_LINE); + // Offset displacement done by parse() + self->head--; + return 0; +} + +/* + Returns the context, stack, and whether to reset the cell for style + in a tuple. 
+*/ +static PyObject* +Tokenizer_handle_table_cell_end(Tokenizer* self, int reset_for_style) +{ + if (reset_for_style) + self->topstack->context |= LC_TABLE_CELL_STYLE; + else + self->topstack->context &= ~LC_TABLE_CELL_STYLE; + return Tokenizer_pop_keeping_context(self); +} + +/* + Return the stack in order to handle the table row end. +*/ +static PyObject* Tokenizer_handle_table_row_end(Tokenizer* self) +{ + return Tokenizer_pop(self); +} + +/* + Return the stack in order to handle the table end. +*/ +static PyObject* Tokenizer_handle_table_end(Tokenizer* self) +{ + self->head += 2; + return Tokenizer_pop(self); +} + +/* Handle the end of the stream of wikitext. */ -static PyObject* Tokenizer_handle_end(Tokenizer* self, int context) +static PyObject* Tokenizer_handle_end(Tokenizer* self, uint64_t context) { PyObject *token, *text, *trash; int single; @@ -2423,9 +2842,16 @@ static PyObject* Tokenizer_handle_end(Tokenizer* self, int context) if (single) return Tokenizer_handle_single_tag_end(self); } - else if (context & AGG_DOUBLE) { - trash = Tokenizer_pop(self); - Py_XDECREF(trash); + else { + if (context & LC_TABLE_CELL_OPEN) { + trash = Tokenizer_pop(self); + Py_XDECREF(trash); + context = self->topstack->context; + } + if (context & AGG_DOUBLE) { + trash = Tokenizer_pop(self); + Py_XDECREF(trash); + } } return Tokenizer_fail_route(self); } @@ -2436,7 +2862,8 @@ static PyObject* Tokenizer_handle_end(Tokenizer* self, int context) Make sure we are not trying to write an invalid character. Return 0 if everything is safe, or -1 if the route must be failed. 
*/ -static int Tokenizer_verify_safe(Tokenizer* self, int context, Py_UNICODE data) +static int +Tokenizer_verify_safe(Tokenizer* self, uint64_t context, Py_UNICODE data) { if (context & LC_FAIL_NEXT) return -1; @@ -2463,8 +2890,13 @@ static int Tokenizer_verify_safe(Tokenizer* self, int context, Py_UNICODE data) return 0; if (context & LC_HAS_TEXT) { if (context & LC_FAIL_ON_TEXT) { - if (!Py_UNICODE_ISSPACE(data)) + if (!Py_UNICODE_ISSPACE(data)) { + if (data == '<' && Tokenizer_READ(self, 1) == '!') { + self->topstack->context |= LC_FAIL_NEXT; + return 0; + } return -1; + } } else { if (data == '\n') @@ -2481,8 +2913,8 @@ static int Tokenizer_verify_safe(Tokenizer* self, int context, Py_UNICODE data) } } else if (context & LC_FAIL_ON_LBRACE) { - if (data == '{' || (Tokenizer_READ(self, -1) == '{' && - Tokenizer_READ(self, -2) == '{')) { + if (data == '{' || (Tokenizer_READ_BACKWARDS(self, 1) == '{' && + Tokenizer_READ_BACKWARDS(self, 2) == '{')) { if (context & LC_TEMPLATE) self->topstack->context |= LC_FAIL_ON_EQUALS; else @@ -2493,10 +2925,7 @@ static int Tokenizer_verify_safe(Tokenizer* self, int context, Py_UNICODE data) } else if (context & LC_FAIL_ON_RBRACE) { if (data == '}') { - if (context & LC_TEMPLATE) - self->topstack->context |= LC_FAIL_ON_EQUALS; - else - self->topstack->context |= LC_FAIL_NEXT; + self->topstack->context |= LC_FAIL_NEXT; return 0; } self->topstack->context ^= LC_FAIL_ON_RBRACE; @@ -2510,12 +2939,30 @@ static int Tokenizer_verify_safe(Tokenizer* self, int context, Py_UNICODE data) } /* + Returns whether the current head has leading whitespace. + TODO: treat comments and templates as whitespace, allow fail on non-newline spaces. 
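The helper described in the comment above decides whether a marker such as `{|` or `|-` sits at the start of a line, possibly after indentation. A Python sketch of the same backwards scan, assuming `text` and `head` mirror the tokenizer's fields:

```python
# Backwards scan from the current head: reaching the start of input or a
# newline means only whitespace precedes the head (leading whitespace);
# any other non-space character means the marker is mid-line.
def has_leading_whitespace(text, head):
    offset = 1
    while True:
        pos = head - offset
        if pos < 0 or text[pos] == "\n":
            return True
        if not text[pos].isspace():
            return False
        offset += 1
```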
+*/ +static int Tokenizer_has_leading_whitespace(Tokenizer* self) +{ + int offset = 1; + Py_UNICODE current_character; + while (1) { + current_character = Tokenizer_READ_BACKWARDS(self, offset); + if (!current_character || current_character == '\n') + return 1; + else if (!Py_UNICODE_ISSPACE(current_character)) + return 0; + offset++; + } +} + +/* Parse the wikicode string, using context for when to stop. If push is true, we will push a new context, otherwise we won't and context will be ignored. */ -static PyObject* Tokenizer_parse(Tokenizer* self, int context, int push) +static PyObject* Tokenizer_parse(Tokenizer* self, uint64_t context, int push) { - int this_context; + uint64_t this_context; Py_UNICODE this, next, next_next, last; PyObject* temp; @@ -2644,22 +3091,99 @@ static PyObject* Tokenizer_parse(Tokenizer* self, int context, int push) if (temp != Py_None) return temp; } - else if (!last || last == '\n') { - if (this == '#' || this == '*' || this == ';' || this == ':') { - if (Tokenizer_handle_list(self)) + else if ((!last || last == '\n') && (this == '#' || this == '*' || this == ';' || this == ':')) { + if (Tokenizer_handle_list(self)) + return NULL; + } + else if ((!last || last == '\n') && (this == '-' && this == next && + this == Tokenizer_READ(self, 2) && + this == Tokenizer_READ(self, 3))) { + if (Tokenizer_handle_hr(self)) + return NULL; + } + else if ((this == '\n' || this == ':') && this_context & LC_DLTERM) { + if (Tokenizer_handle_dl_term(self)) + return NULL; + // Kill potential table contexts + if (this == '\n') + self->topstack->context &= ~LC_TABLE_CELL_LINE_CONTEXTS; + } + + // Start of table parsing + else if (this == '{' && next == '|' && Tokenizer_has_leading_whitespace(self)) { + if (Tokenizer_CAN_RECURSE(self)) { + if (Tokenizer_parse_table(self)) return NULL; } - else if (this == '-' && this == next && - this == Tokenizer_READ(self, 2) && - this == Tokenizer_READ(self, 3)) { - if (Tokenizer_handle_hr(self)) + else if 
(Tokenizer_emit_char(self, this) || Tokenizer_emit_char(self, next)) + return NULL; + else + self->head++; + } + else if (this_context & LC_TABLE_OPEN) { + if (this == '|' && next == '|' && this_context & LC_TABLE_TD_LINE) { + if (this_context & LC_TABLE_CELL_OPEN) + return Tokenizer_handle_table_cell_end(self, 0); + else if (Tokenizer_handle_table_cell(self, "||", "td", LC_TABLE_TD_LINE)) + return NULL; + } + else if (this == '|' && next == '|' && this_context & LC_TABLE_TH_LINE) { + if (this_context & LC_TABLE_CELL_OPEN) + return Tokenizer_handle_table_cell_end(self, 0); + else if (Tokenizer_handle_table_cell(self, "||", "th", LC_TABLE_TH_LINE)) + return NULL; + } + else if (this == '!' && next == '!' && this_context & LC_TABLE_TH_LINE) { + if (this_context & LC_TABLE_CELL_OPEN) + return Tokenizer_handle_table_cell_end(self, 0); + else if (Tokenizer_handle_table_cell(self, "!!", "th", LC_TABLE_TH_LINE)) + return NULL; + } + else if (this == '|' && this_context & LC_TABLE_CELL_STYLE) { + return Tokenizer_handle_table_cell_end(self, 1); + } + // On newline, clear out cell line contexts + else if (this == '\n' && this_context & LC_TABLE_CELL_LINE_CONTEXTS) { + self->topstack->context &= ~LC_TABLE_CELL_LINE_CONTEXTS; + if (Tokenizer_emit_char(self, this)) + return NULL; + } + else if (Tokenizer_has_leading_whitespace(self)) { + if (this == '|' && next == '}') { + if (this_context & LC_TABLE_CELL_OPEN) + return Tokenizer_handle_table_cell_end(self, 0); + if (this_context & LC_TABLE_ROW_OPEN) + return Tokenizer_handle_table_row_end(self); + else + return Tokenizer_handle_table_end(self); + } + else if (this == '|' && next == '-') { + if (this_context & LC_TABLE_CELL_OPEN) + return Tokenizer_handle_table_cell_end(self, 0); + if (this_context & LC_TABLE_ROW_OPEN) + return Tokenizer_handle_table_row_end(self); + else if (Tokenizer_handle_table_row(self)) + return NULL; + } + else if (this == '|') { + if (this_context & LC_TABLE_CELL_OPEN) + return 
Tokenizer_handle_table_cell_end(self, 0); + else if (Tokenizer_handle_table_cell(self, "|", "td", LC_TABLE_TD_LINE)) + return NULL; + } + else if (this == '!') { + if (this_context & LC_TABLE_CELL_OPEN) + return Tokenizer_handle_table_cell_end(self, 0); + else if (Tokenizer_handle_table_cell(self, "!", "th", LC_TABLE_TH_LINE)) + return NULL; + } + else if (Tokenizer_emit_char(self, this)) return NULL; } else if (Tokenizer_emit_char(self, this)) return NULL; - } - else if ((this == '\n' || this == ':') && this_context & LC_DLTERM) { - if (Tokenizer_handle_dl_term(self)) + // Raise BadRoute to table start + if (BAD_ROUTE) return NULL; } else if (Tokenizer_emit_char(self, this)) @@ -2673,8 +3197,9 @@ static PyObject* Tokenizer_parse(Tokenizer* self, int context, int push) */ static PyObject* Tokenizer_tokenize(Tokenizer* self, PyObject* args) { - PyObject *text, *temp; - int context = 0, skip_style_tags = 0; + PyObject *text, *temp, *tokens; + uint64_t context = 0; + int skip_style_tags = 0; if (PyArg_ParseTuple(args, "U|ii", &text, &context, &skip_style_tags)) { Py_XDECREF(self->text); @@ -2696,13 +3221,32 @@ static PyObject* Tokenizer_tokenize(Tokenizer* self, PyObject* args) Py_XDECREF(temp); self->text = text; } + self->head = self->global = self->depth = self->cycles = 0; self->length = PyList_GET_SIZE(self->text); self->skip_style_tags = skip_style_tags; - return Tokenizer_parse(self, context, 1); + tokens = Tokenizer_parse(self, context, 1); + + if ((!tokens && !PyErr_Occurred()) || self->topstack) { + if (!ParserError) { + if (load_exceptions()) + return NULL; + } + if (BAD_ROUTE) { + RESET_ROUTE(); + PyErr_SetString(ParserError, "C tokenizer exited with BAD_ROUTE"); + } + else if (self->topstack) + PyErr_SetString(ParserError, + "C tokenizer exited with non-empty token stack"); + else + PyErr_SetString(ParserError, "C tokenizer exited unexpectedly"); + return NULL; + } + return tokens; } -static int load_entitydefs(void) +static int load_entities(void) { 
PyObject *tempmod, *defmap, *deflist; unsigned numdefs, i; @@ -2806,7 +3350,7 @@ static int load_tokens(void) return 0; } -static int load_definitions(void) +static int load_defs(void) { PyObject *tempmod, *globals = PyEval_GetGlobals(), @@ -2827,6 +3371,29 @@ static int load_definitions(void) return 0; } +static int load_exceptions(void) +{ + PyObject *tempmod, *parsermod, + *globals = PyEval_GetGlobals(), + *locals = PyEval_GetLocals(), + *fromlist = PyList_New(1), + *modname = IMPORT_NAME_FUNC("parser"); + char *name = "mwparserfromhell"; + + if (!fromlist || !modname) + return -1; + PyList_SET_ITEM(fromlist, 0, modname); + tempmod = PyImport_ImportModuleLevel(name, globals, locals, fromlist, 0); + Py_DECREF(fromlist); + if (!tempmod) + return -1; + parsermod = PyObject_GetAttrString(tempmod, "parser"); + Py_DECREF(tempmod); + ParserError = PyObject_GetAttrString(parsermod, "ParserError"); + Py_DECREF(parsermod); + return 0; +} + PyMODINIT_FUNC INIT_FUNC_NAME(void) { PyObject *module; @@ -2843,9 +3410,7 @@ PyMODINIT_FUNC INIT_FUNC_NAME(void) PyDict_SetItemString(TokenizerType.tp_dict, "USES_C", Py_True); EMPTY = PyUnicode_FromString(""); NOARGS = PyTuple_New(0); - if (!EMPTY || !NOARGS) - INIT_ERROR; - if (load_entitydefs() || load_tokens() || load_definitions()) + if (!EMPTY || !NOARGS || load_entities() || load_tokens() || load_defs()) INIT_ERROR; #ifdef IS_PY3K return module; diff --git a/mwparserfromhell/parser/tokenizer.h b/mwparserfromhell/parser/tokenizer.h index 032480d..842e65d 100644 --- a/mwparserfromhell/parser/tokenizer.h +++ b/mwparserfromhell/parser/tokenizer.h @@ -1,6 +1,6 @@ /* Tokenizer Header File for MWParserFromHell -Copyright (C) 2012-2014 Ben Kurtovic +Copyright (C) 2012-2015 Ben Kurtovic Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in @@ -29,6 +29,7 @@ SOFTWARE. 
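The exit checks added to `Tokenizer_tokenize` above convert every inconsistent exit state (a leaked `BAD_ROUTE`, a leftover stack frame, or a silent NULL return) into a single `ParserError`, matching the CHANGELOG entry. A minimal Python sketch of the same decision table; the function name and arguments here are illustrative, not part of the library's API:

```python
class ParserError(Exception):
    """Raised when the tokenizer exits in an inconsistent state."""

def check_tokenizer_exit(tokens, bad_route, stack_depth):
    # Mirrors the C exit checks: any leftover failure state becomes one
    # descriptive exception instead of an unclear BadRoute or a silently
    # incorrect token list.
    if tokens is None and bad_route:
        raise ParserError("C tokenizer exited with BAD_ROUTE")
    if stack_depth > 0:
        raise ParserError("C tokenizer exited with non-empty token stack")
    if tokens is None:
        raise ParserError("C tokenizer exited unexpectedly")
    return tokens
```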
#include <Python.h> #include <math.h> #include <structmember.h> +#include <stdint.h> #if PY_MAJOR_VERSION >= 3 #define IS_PY3K @@ -43,16 +44,17 @@ SOFTWARE. static const char MARKERS[] = { '{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':', '/', - '-', '\n', '\0'}; + '-', '!', '\n', '\0'}; -#define NUM_MARKERS 18 +#define NUM_MARKERS 19 #define TEXTBUFFER_BLOCKSIZE 1024 #define MAX_DEPTH 40 #define MAX_CYCLES 100000 #define MAX_BRACES 255 #define MAX_ENTITY_SIZE 8 -static int route_state = 0, route_context = 0; +static int route_state = 0; +static uint64_t route_context = 0; #define BAD_ROUTE route_state #define BAD_ROUTE_CONTEXT route_context #define FAIL_ROUTE(context) route_state = 1; route_context = context @@ -62,6 +64,7 @@ static char** entitydefs; static PyObject* EMPTY; static PyObject* NOARGS; +static PyObject* ParserError; static PyObject* definitions; @@ -108,52 +111,61 @@ static PyObject* TagCloseClose; /* Local contexts: */ -#define LC_TEMPLATE 0x00000007 -#define LC_TEMPLATE_NAME 0x00000001 -#define LC_TEMPLATE_PARAM_KEY 0x00000002 -#define LC_TEMPLATE_PARAM_VALUE 0x00000004 - -#define LC_ARGUMENT 0x00000018 -#define LC_ARGUMENT_NAME 0x00000008 -#define LC_ARGUMENT_DEFAULT 0x00000010 - -#define LC_WIKILINK 0x00000060 -#define LC_WIKILINK_TITLE 0x00000020 -#define LC_WIKILINK_TEXT 0x00000040 - -#define LC_EXT_LINK 0x00000180 -#define LC_EXT_LINK_URI 0x00000080 -#define LC_EXT_LINK_TITLE 0x00000100 - -#define LC_HEADING 0x00007E00 -#define LC_HEADING_LEVEL_1 0x00000200 -#define LC_HEADING_LEVEL_2 0x00000400 -#define LC_HEADING_LEVEL_3 0x00000800 -#define LC_HEADING_LEVEL_4 0x00001000 -#define LC_HEADING_LEVEL_5 0x00002000 -#define LC_HEADING_LEVEL_6 0x00004000 - -#define LC_TAG 0x00078000 -#define LC_TAG_OPEN 0x00008000 -#define LC_TAG_ATTR 0x00010000 -#define LC_TAG_BODY 0x00020000 -#define LC_TAG_CLOSE 0x00040000 - -#define LC_STYLE 0x00780000 -#define LC_STYLE_ITALICS 0x00080000 -#define LC_STYLE_BOLD 0x00100000 -#define LC_STYLE_PASS_AGAIN 0x00200000 -#define 
LC_STYLE_SECOND_PASS 0x00400000 - -#define LC_DLTERM 0x00800000 - -#define LC_SAFETY_CHECK 0x3F000000 -#define LC_HAS_TEXT 0x01000000 -#define LC_FAIL_ON_TEXT 0x02000000 -#define LC_FAIL_NEXT 0x04000000 -#define LC_FAIL_ON_LBRACE 0x08000000 -#define LC_FAIL_ON_RBRACE 0x10000000 -#define LC_FAIL_ON_EQUALS 0x20000000 +#define LC_TEMPLATE 0x0000000000000007 +#define LC_TEMPLATE_NAME 0x0000000000000001 +#define LC_TEMPLATE_PARAM_KEY 0x0000000000000002 +#define LC_TEMPLATE_PARAM_VALUE 0x0000000000000004 + +#define LC_ARGUMENT 0x0000000000000018 +#define LC_ARGUMENT_NAME 0x0000000000000008 +#define LC_ARGUMENT_DEFAULT 0x0000000000000010 + +#define LC_WIKILINK 0x0000000000000060 +#define LC_WIKILINK_TITLE 0x0000000000000020 +#define LC_WIKILINK_TEXT 0x0000000000000040 + +#define LC_EXT_LINK 0x0000000000000180 +#define LC_EXT_LINK_URI 0x0000000000000080 +#define LC_EXT_LINK_TITLE 0x0000000000000100 + +#define LC_HEADING 0x0000000000007E00 +#define LC_HEADING_LEVEL_1 0x0000000000000200 +#define LC_HEADING_LEVEL_2 0x0000000000000400 +#define LC_HEADING_LEVEL_3 0x0000000000000800 +#define LC_HEADING_LEVEL_4 0x0000000000001000 +#define LC_HEADING_LEVEL_5 0x0000000000002000 +#define LC_HEADING_LEVEL_6 0x0000000000004000 + +#define LC_TAG 0x0000000000078000 +#define LC_TAG_OPEN 0x0000000000008000 +#define LC_TAG_ATTR 0x0000000000010000 +#define LC_TAG_BODY 0x0000000000020000 +#define LC_TAG_CLOSE 0x0000000000040000 + +#define LC_STYLE 0x0000000000780000 +#define LC_STYLE_ITALICS 0x0000000000080000 +#define LC_STYLE_BOLD 0x0000000000100000 +#define LC_STYLE_PASS_AGAIN 0x0000000000200000 +#define LC_STYLE_SECOND_PASS 0x0000000000400000 + +#define LC_DLTERM 0x0000000000800000 + +#define LC_SAFETY_CHECK 0x000000003F000000 +#define LC_HAS_TEXT 0x0000000001000000 +#define LC_FAIL_ON_TEXT 0x0000000002000000 +#define LC_FAIL_NEXT 0x0000000004000000 +#define LC_FAIL_ON_LBRACE 0x0000000008000000 +#define LC_FAIL_ON_RBRACE 0x0000000010000000 +#define LC_FAIL_ON_EQUALS 0x0000000020000000 + 
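The local-context constants are rewritten here as 64-bit literals because the table contexts this patch introduces occupy bits 30 through 35, which no longer fit in the `int` previously used for contexts; this is why `context` and `route_context` become `uint64_t` throughout. A quick check, with values copied from this header:

```python
# Values copied from tokenizer.h: the table flags push the context word
# past 32 bits, which is why 'int' contexts became 'uint64_t'.
LC_FAIL_ON_EQUALS = 0x0000000020000000  # highest pre-table flag (bit 29)
LC_TABLE_OPEN     = 0x0000000040000000  # first table flag (bit 30)
LC_TABLE_TH_LINE  = 0x0000000800000000  # last table flag (bit 35)

assert LC_FAIL_ON_EQUALS < 2 ** 31   # old flags fit a signed 32-bit int
assert LC_TABLE_TH_LINE >= 2 ** 32   # new flags overflow 32 bits entirely
assert LC_TABLE_OPEN & LC_FAIL_ON_EQUALS == 0  # flags remain disjoint
```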
+#define LC_TABLE 0x0000000FC0000000 +#define LC_TABLE_CELL_LINE_CONTEXTS 0x0000000D00000000 +#define LC_TABLE_OPEN 0x0000000040000000 +#define LC_TABLE_CELL_OPEN 0x0000000080000000 +#define LC_TABLE_CELL_STYLE 0x0000000100000000 +#define LC_TABLE_ROW_OPEN 0x0000000200000000 +#define LC_TABLE_TD_LINE 0x0000000400000000 +#define LC_TABLE_TH_LINE 0x0000000800000000 /* Global contexts: */ @@ -161,9 +173,9 @@ static PyObject* TagCloseClose; /* Aggregate contexts: */ -#define AGG_FAIL (LC_TEMPLATE | LC_ARGUMENT | LC_WIKILINK | LC_EXT_LINK_TITLE | LC_HEADING | LC_TAG | LC_STYLE) +#define AGG_FAIL (LC_TEMPLATE | LC_ARGUMENT | LC_WIKILINK | LC_EXT_LINK_TITLE | LC_HEADING | LC_TAG | LC_STYLE | LC_TABLE_OPEN) #define AGG_UNSAFE (LC_TEMPLATE_NAME | LC_WIKILINK_TITLE | LC_EXT_LINK_TITLE | LC_TEMPLATE_PARAM_KEY | LC_ARGUMENT_NAME) -#define AGG_DOUBLE (LC_TEMPLATE_PARAM_KEY | LC_TAG_CLOSE) +#define AGG_DOUBLE (LC_TEMPLATE_PARAM_KEY | LC_TAG_CLOSE | LC_TABLE_ROW_OPEN) #define AGG_NO_WIKILINKS (LC_TEMPLATE_NAME | LC_ARGUMENT_NAME | LC_WIKILINK_TITLE | LC_EXT_LINK_URI) #define AGG_NO_EXT_LINKS (LC_TEMPLATE_NAME | LC_ARGUMENT_NAME | LC_WIKILINK_TITLE | LC_EXT_LINK) @@ -190,7 +202,7 @@ struct Textbuffer { struct Stack { PyObject* stack; - int context; + uint64_t context; struct Textbuffer* textbuffer; struct Stack* next; }; @@ -201,10 +213,11 @@ typedef struct { } HeadingData; typedef struct { - int context; + uint64_t context; struct Textbuffer* pad_first; struct Textbuffer* pad_before_eq; struct Textbuffer* pad_after_eq; + Py_UNICODE quoter; Py_ssize_t reset; } TagData; @@ -265,9 +278,11 @@ static int Tokenizer_parse_entity(Tokenizer*); static int Tokenizer_parse_comment(Tokenizer*); static int Tokenizer_handle_dl_term(Tokenizer*); static int Tokenizer_parse_tag(Tokenizer*); -static PyObject* Tokenizer_parse(Tokenizer*, int, int); +static PyObject* Tokenizer_parse(Tokenizer*, uint64_t, int); static PyObject* Tokenizer_tokenize(Tokenizer*, PyObject*); +static int 
load_exceptions(void); + /* Macros for Python 2/3 compatibility: */ diff --git a/mwparserfromhell/parser/tokenizer.py b/mwparserfromhell/parser/tokenizer.py index 29a7e25..36c83e1 100644 --- a/mwparserfromhell/parser/tokenizer.py +++ b/mwparserfromhell/parser/tokenizer.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -24,7 +24,7 @@ from __future__ import unicode_literals from math import log import re -from . import contexts, tokens +from . import contexts, tokens, ParserError from ..compat import htmlentities, range from ..definitions import (get_html_tag, is_parsable, is_single, is_single_only, is_scheme) @@ -53,6 +53,7 @@ class _TagOpenData(object): def __init__(self): self.context = self.CX_NAME self.padding_buffer = {"first": "", "before_eq": "", "after_eq": ""} + self.quoter = None self.reset = 0 @@ -62,11 +63,11 @@ class Tokenizer(object): START = object() END = object() MARKERS = ["{", "}", "[", "]", "<", ">", "|", "=", "&", "'", "#", "*", ";", - ":", "/", "-", "\n", START, END] + ":", "/", "-", "!", "\n", START, END] MAX_DEPTH = 40 MAX_CYCLES = 100000 regex = re.compile(r"([{}\[\]<>|=&'#*;:/\\\"\-!\n])", flags=re.IGNORECASE) - tag_splitter = re.compile(r"([\s\"\\]+)") + tag_splitter = re.compile(r"([\s\"\'\\]+)") def __init__(self): self._text = None @@ -112,7 +113,7 @@ class Tokenizer(object): self._textbuffer = [] def _pop(self, keep_context=False): - """Pop the current stack/context/textbuffer, returing the stack. + """Pop the current stack/context/textbuffer, returning the stack. If *keep_context* is ``True``, then we will replace the underlying stack's context with the current stack's. @@ -134,7 +135,7 @@ class Tokenizer(object): """Fail the current tokenization route. 
Discards the current stack/context/textbuffer and raises - :py:exc:`~.BadRoute`. + :exc:`.BadRoute`. """ context = self._context self._pop() @@ -172,14 +173,14 @@ class Tokenizer(object): def _read(self, delta=0, wrap=False, strict=False): """Read the value at a relative point in the wikicode. - The value is read from :py:attr:`self._head <_head>` plus the value of + The value is read from :attr:`self._head <_head>` plus the value of *delta* (which can be negative). If *wrap* is ``False``, we will not allow attempts to read from the end of the string if ``self._head + delta`` is negative. If *strict* is ``True``, the route will be failed - (with :py:meth:`_fail_route`) if we try to read from past the end of - the string; otherwise, :py:attr:`self.END ` is returned. If we try - to read from before the start of the string, :py:attr:`self.START - ` is returned. + (with :meth:`_fail_route`) if we try to read from past the end of the + string; otherwise, :attr:`self.END ` is returned. If we try to + read from before the start of the string, :attr:`self.START ` is + returned. 
""" index = self._head + delta if index < 0 and (not wrap or abs(index) > len(self._text)): @@ -255,7 +256,7 @@ class Tokenizer(object): self._context ^= contexts.TEMPLATE_NAME elif self._context & contexts.TEMPLATE_PARAM_VALUE: self._context ^= contexts.TEMPLATE_PARAM_VALUE - elif self._context & contexts.TEMPLATE_PARAM_KEY: + else: self._emit_all(self._pop(keep_context=True)) self._context |= contexts.TEMPLATE_PARAM_KEY self._emit(tokens.TemplateParamSeparator()) @@ -296,8 +297,6 @@ class Tokenizer(object): self._head = reset self._emit_text("[[") else: - if self._context & contexts.FAIL_NEXT: - self._context ^= contexts.FAIL_NEXT self._emit(tokens.WikilinkOpen()) self._emit_all(wikilink) self._emit(tokens.WikilinkClose()) @@ -370,9 +369,11 @@ class Tokenizer(object): if "(" in this and ")" in punct: punct = punct[:-1] # ')' is not longer valid punctuation if this.endswith(punct): - for i in reversed(range(-len(this), 0)): - if i == -len(this) or this[i - 1] not in punct: + for i in range(len(this) - 1, 0, -1): + if this[i - 1] not in punct: break + else: + i = 0 stripped = this[:i] if stripped and tail: self._emit_text(tail) @@ -607,6 +608,11 @@ class Tokenizer(object): self._emit(tokens.CommentEnd()) self._emit_all(self._pop()) self._head += 2 + if self._context & contexts.FAIL_NEXT: + # _verify_safe() sets this flag while parsing a template + # name when it encounters what might be a comment -- we + # must unset it to let _verify_safe() know it was correct: + self._context ^= contexts.FAIL_NEXT return self._emit_text(this) self._head += 1 @@ -614,7 +620,7 @@ class Tokenizer(object): def _push_tag_buffer(self, data): """Write a pending tag attribute from *data* to the stack.""" if data.context & data.CX_QUOTED: - self._emit_first(tokens.TagAttrQuote()) + self._emit_first(tokens.TagAttrQuote(char=data.quoter)) self._emit_all(self._pop()) buf = data.padding_buffer self._emit_first(tokens.TagAttrStart(pad_first=buf["first"], @@ -687,17 +693,18 @@ class 
Tokenizer(object): self._push_tag_buffer(data) data.context = data.CX_ATTR_NAME self._push(contexts.TAG_ATTR) - elif data.context & data.CX_ATTR_VALUE: + else: # data.context & data.CX_ATTR_VALUE assured escaped = self._read(-1) == "\\" and self._read(-2) != "\\" if data.context & data.CX_NOTE_QUOTE: data.context ^= data.CX_NOTE_QUOTE - if chunk == '"' and not escaped: + if chunk in "'\"" and not escaped: data.context |= data.CX_QUOTED - self._push(self._context) + data.quoter = chunk data.reset = self._head + self._push(self._context) continue elif data.context & data.CX_QUOTED: - if chunk == '"' and not escaped: + if chunk == data.quoter and not escaped: data.context |= data.CX_NOTE_SPACE continue self._handle_tag_text(chunk) @@ -728,14 +735,22 @@ class Tokenizer(object): def _handle_blacklisted_tag(self): """Handle the body of an HTML tag that is parser-blacklisted.""" + strip = lambda text: text.rstrip().lower() while True: this, next = self._read(), self._read(1) if this is self.END: self._fail_route() elif this == "<" and next == "/": - self._handle_tag_open_close() - self._head += 1 - return self._parse(push=False) + self._head += 3 + if self._read() != ">" or (strip(self._read(-1)) != + strip(self._stack[1].text)): + self._head -= 1 + self._emit_text("</") +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -24,8 +24,8 @@ This module contains the token definitions that are used as an intermediate parsing data type - they are stored in a flat list, with each token being identified by its type and optional attributes. The token list is generated in -a syntactically valid form by the :py:class:`~.Tokenizer`, and then converted -into the :py:class`~.Wikicode` tree by the :py:class:`~.Builder`. 
+a syntactically valid form by the :class:`.Tokenizer`, and then converted into +the :class:`.Wikicode` tree by the :class:`.Builder`. """ from __future__ import unicode_literals @@ -34,7 +34,7 @@ from ..compat import py3k, str __all__ = ["Token"] -class Token (dict): +class Token(dict): """A token stores the semantic meaning of a unit of wikicode.""" def __repr__(self): @@ -100,7 +100,7 @@ CommentEnd = make("CommentEnd") # --> TagOpenOpen = make("TagOpenOpen") # < TagAttrStart = make("TagAttrStart") TagAttrEquals = make("TagAttrEquals") # = -TagAttrQuote = make("TagAttrQuote") # " +TagAttrQuote = make("TagAttrQuote") # ", ' TagCloseOpen = make("TagCloseOpen") # > TagCloseSelfclose = make("TagCloseSelfclose") # /> TagOpenClose = make("TagOpenClose") # </ +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -21,8 +21,8 @@ # SOFTWARE. """ -This module contains the :py:class:`~.SmartList` type, as well as its -:py:class:`~._ListProxy` child, which together implement a list whose sublists +This module contains the :class:`.SmartList` type, as well as its +:class:`._ListProxy` child, which together implement a list whose sublists reflect changes made to the main list, and vice-versa. """ @@ -35,12 +35,13 @@ __all__ = ["SmartList"] def inheritdoc(method): """Set __doc__ of *method* to __doc__ of *method* in its parent class. - Since this is used on :py:class:`~.SmartList`, the "parent class" used is + Since this is used on :class:`.SmartList`, the "parent class" used is ``list``. This function can be used as a decorator. 
""" method.__doc__ = getattr(list, method.__name__).__doc__ return method + class _SliceNormalizerMixIn(object): """MixIn that provides a private method to normalize slices.""" @@ -64,9 +65,9 @@ class SmartList(_SliceNormalizerMixIn, list): list (such as the addition, removal, or replacement of elements) will be reflected in the sublist, or vice-versa, to the greatest degree possible. This is implemented by having sublists - instances of the - :py:class:`~._ListProxy` type - dynamically determine their elements by - storing their slice info and retrieving that slice from the parent. Methods - that change the size of the list also change the slice info. For example:: + :class:`._ListProxy` type - dynamically determine their elements by storing + their slice info and retrieving that slice from the parent. Methods that + change the size of the list also change the slice info. For example:: >>> parent = SmartList([0, 1, 2, 3]) >>> parent @@ -83,7 +84,9 @@ class SmartList(_SliceNormalizerMixIn, list): The parent needs to keep a list of its children in order to update them, which prevents them from being garbage-collected. If you are keeping the parent around for a while but creating many children, it is advisable to - call :py:meth:`~._ListProxy.destroy` when you're finished with them. + call :meth:`._ListProxy.detach` when you're finished with them. Certain + parent methods, like :meth:`reverse` and :meth:`sort`, will do this + automatically. 
""" def __init__(self, iterable=None): @@ -151,10 +154,10 @@ class SmartList(_SliceNormalizerMixIn, list): self.extend(other) return self - def _release_children(self): - copy = list(self) - for child in self._children: - child._parent = copy + def _detach_children(self): + children = [val[0] for val in self._children.values()] + for child in children: + child.detach() @inheritdoc def append(self, item): @@ -184,13 +187,13 @@ class SmartList(_SliceNormalizerMixIn, list): @inheritdoc def reverse(self): - self._release_children() + self._detach_children() super(SmartList, self).reverse() if py3k: @inheritdoc def sort(self, key=None, reverse=None): - self._release_children() + self._detach_children() kwargs = {} if key is not None: kwargs["key"] = key @@ -200,7 +203,7 @@ class SmartList(_SliceNormalizerMixIn, list): else: @inheritdoc def sort(self, cmp=None, key=None, reverse=None): - self._release_children() + self._detach_children() kwargs = {} if cmp is not None: kwargs["cmp"] = cmp @@ -214,15 +217,16 @@ class SmartList(_SliceNormalizerMixIn, list): class _ListProxy(_SliceNormalizerMixIn, list): """Implement the ``list`` interface by getting elements from a parent. - This is created by a :py:class:`~.SmartList` object when slicing. It does - not actually store the list at any time; instead, whenever the list is - needed, it builds it dynamically using the :py:meth:`_render` method. + This is created by a :class:`.SmartList` object when slicing. It does not + actually store the list at any time; instead, whenever the list is needed, + it builds it dynamically using the :meth:`_render` method. 
""" def __init__(self, parent, sliceinfo): super(_ListProxy, self).__init__() self._parent = parent self._sliceinfo = sliceinfo + self._detached = False def __repr__(self): return repr(self._render()) @@ -452,9 +456,17 @@ class _ListProxy(_SliceNormalizerMixIn, list): item.sort(**kwargs) self._parent[self._start:self._stop:self._step] = item - def destroy(self): - """Make the parent forget this child. The child will no longer work.""" - self._parent._children.pop(id(self)) + def detach(self): + """Detach the child so it operates like a normal list. + + This allows children to be properly garbage-collected if their parent + is being kept around for a long time. This method has no effect if the + child is already detached. + """ + if not self._detached: + self._parent._children.pop(id(self)) + self._parent = list(self._parent) + self._detached = True del inheritdoc diff --git a/mwparserfromhell/string_mixin.py b/mwparserfromhell/string_mixin.py index fe41d6d..01809a7 100644 --- a/mwparserfromhell/string_mixin.py +++ b/mwparserfromhell/string_mixin.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -21,7 +21,7 @@ # SOFTWARE. """ -This module contains the :py:class:`~.StringMixIn` type, which implements the +This module contains the :class:`.StringMixIn` type, which implements the interface for the ``unicode`` type (``str`` on py3k) in a dynamic manner. """ @@ -35,7 +35,7 @@ __all__ = ["StringMixIn"] def inheritdoc(method): """Set __doc__ of *method* to __doc__ of *method* in its parent class. - Since this is used on :py:class:`~.StringMixIn`, the "parent class" used is + Since this is used on :class:`.StringMixIn`, the "parent class" used is ``str``. This function can be used as a decorator. 
""" method.__doc__ = getattr(str, method.__name__).__doc__ @@ -44,11 +44,10 @@ def inheritdoc(method): class StringMixIn(object): """Implement the interface for ``unicode``/``str`` in a dynamic manner. - To use this class, inherit from it and override the :py:meth:`__unicode__` + To use this class, inherit from it and override the :meth:`__unicode__` method (same on py3k) to return the string representation of the object. - The various string methods will operate on the value of - :py:meth:`__unicode__` instead of the immutable ``self`` like the regular - ``str`` type. + The various string methods will operate on the value of :meth:`__unicode__` + instead of the immutable ``self`` like the regular ``str`` type. """ if py3k: diff --git a/mwparserfromhell/utils.py b/mwparserfromhell/utils.py index 486170d..28823fc 100644 --- a/mwparserfromhell/utils.py +++ b/mwparserfromhell/utils.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -33,23 +33,19 @@ from .smart_list import SmartList __all__ = ["parse_anything"] -def parse_anything(value, context=0): - """Return a :py:class:`~.Wikicode` for *value*, allowing multiple types. +def parse_anything(value, context=0, skip_style_tags=False): + """Return a :class:`.Wikicode` for *value*, allowing multiple types. - This differs from :py:meth:`.Parser.parse` in that we accept more than just - a string to be parsed. Unicode objects (strings in py3k), strings (bytes in - py3k), integers (converted to strings), ``None``, existing - :py:class:`~.Node` or :py:class:`~.Wikicode` objects, as well as an - iterable of these types, are supported. 
This is used to parse input - on-the-fly by various methods of :py:class:`~.Wikicode` and others like - :py:class:`~.Template`, such as :py:meth:`wikicode.insert() - <.Wikicode.insert>` or setting :py:meth:`template.name <.Template.name>`. + This differs from :meth:`.Parser.parse` in that we accept more than just a + string to be parsed. Unicode objects (strings in py3k), strings (bytes in + py3k), integers (converted to strings), ``None``, existing :class:`.Node` + or :class:`.Wikicode` objects, as well as an iterable of these types, are + supported. This is used to parse input on-the-fly by various methods of + :class:`.Wikicode` and others like :class:`.Template`, such as + :meth:`wikicode.insert() <.Wikicode.insert>` or setting + :meth:`template.name <.Template.name>`. - If given, *context* will be passed as a starting context to the parser. - This is helpful when this function is used inside node attribute setters. - For example, :py:class:`~.ExternalLink`\ 's :py:attr:`~.ExternalLink.url` - setter sets *context* to :py:mod:`contexts.EXT_LINK_URI <.contexts>` to - prevent the URL itself from becoming an :py:class:`~.ExternalLink`. + Additional arguments are passed directly to :meth:`.Parser.parse`. 
""" from .parser import Parser from .wikicode import Wikicode @@ -59,18 +55,18 @@ def parse_anything(value, context=0): elif isinstance(value, Node): return Wikicode(SmartList([value])) elif isinstance(value, str): - return Parser().parse(value, context) + return Parser().parse(value, context, skip_style_tags) elif isinstance(value, bytes): - return Parser().parse(value.decode("utf8"), context) + return Parser().parse(value.decode("utf8"), context, skip_style_tags) elif isinstance(value, int): - return Parser().parse(str(value), context) + return Parser().parse(str(value), context, skip_style_tags) elif value is None: return Wikicode(SmartList()) try: nodelist = SmartList() for item in value: - nodelist += parse_anything(item, context).nodes + nodelist += parse_anything(item, context, skip_style_tags).nodes + return Wikicode(nodelist) except TypeError: error = "Needs string, Node, Wikicode, int, None, or iterable of these, but got {0}: {1}" raise ValueError(error.format(type(value).__name__, value)) - return Wikicode(nodelist) diff --git a/mwparserfromhell/wikicode.py b/mwparserfromhell/wikicode.py index 44515a6..c623971 100644 --- a/mwparserfromhell/wikicode.py +++ b/mwparserfromhell/wikicode.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -39,11 +39,12 @@ class Wikicode(StringMixIn): Additionally, it contains methods that can be used to extract data from or modify the nodes, implemented in an interface similar to a list. For - example, :py:meth:`index` can get the index of a node in the list, and - :py:meth:`insert` can add a new node at that index. The :py:meth:`filter() + example, :meth:`index` can get the index of a node in the list, and + :meth:`insert` can add a new node at that index. 
The :meth:`filter() ` series of functions is very useful for extracting and iterating over, for example, all of the templates in the object. """ + RECURSE_OTHERS = 2 def __init__(self, nodes): super(Wikicode, self).__init__() @@ -53,12 +54,15 @@ class Wikicode(StringMixIn): return "".join([str(node) for node in self.nodes]) @staticmethod - def _get_children(node, contexts=False, parent=None): - """Iterate over all child :py:class:`.Node`\ s of a given *node*.""" + def _get_children(node, contexts=False, restrict=None, parent=None): + """Iterate over all child :class:`.Node`\ s of a given *node*.""" yield (parent, node) if contexts else node + if restrict and isinstance(node, restrict): + return for code in node.__children__(): for child in code.nodes: - for result in Wikicode._get_children(child, contexts, code): + sub = Wikicode._get_children(child, contexts, restrict, code) + for result in sub: yield result @staticmethod @@ -70,7 +74,7 @@ class Wikicode(StringMixIn): @staticmethod def _build_matcher(matches, flags): - """Helper for :py:meth:`_indexed_ifilter` and others. + """Helper for :meth:`_indexed_ifilter` and others. If *matches* is a function, return it. If it's a regex, return a wrapper around it that can be called with a node to do a search. If @@ -79,22 +83,23 @@ class Wikicode(StringMixIn): if matches: if callable(matches): return matches - return lambda obj: re.search(matches, str(obj), flags) # r + return lambda obj: re.search(matches, str(obj), flags) return lambda obj: True def _indexed_ifilter(self, recursive=True, matches=None, flags=FLAGS, forcetype=None): """Iterate over nodes and their corresponding indices in the node list. - The arguments are interpreted as for :py:meth:`ifilter`. For each tuple + The arguments are interpreted as for :meth:`ifilter`. For each tuple ``(i, node)`` yielded by this method, ``self.index(node) == i``. Note that if *recursive* is ``True``, ``self.nodes[i]`` might not be the node itself, but will still contain it. 
""" match = self._build_matcher(matches, flags) if recursive: + restrict = forcetype if recursive == self.RECURSE_OTHERS else None def getter(i, node): - for ch in self._get_children(node): + for ch in self._get_children(node, restrict=restrict): yield (i, ch) inodes = chain(*(getter(i, n) for i, n in enumerate(self.nodes))) else: @@ -106,17 +111,17 @@ class Wikicode(StringMixIn): def _do_strong_search(self, obj, recursive=True): """Search for the specific element *obj* within the node list. - *obj* can be either a :py:class:`.Node` or a :py:class:`.Wikicode` - object. If found, we return a tuple (*context*, *index*) where - *context* is the :py:class:`.Wikicode` that contains *obj* and *index* - is its index there, as a :py:class:`slice`. Note that if *recursive* is - ``False``, *context* will always be ``self`` (since we only look for - *obj* among immediate descendants), but if *recursive* is ``True``, - then it could be any :py:class:`.Wikicode` contained by a node within - ``self``. If *obj* is not found, :py:exc:`ValueError` is raised. + *obj* can be either a :class:`.Node` or a :class:`.Wikicode` object. If + found, we return a tuple (*context*, *index*) where *context* is the + :class:`.Wikicode` that contains *obj* and *index* is its index there, + as a :class:`slice`. Note that if *recursive* is ``False``, *context* + will always be ``self`` (since we only look for *obj* among immediate + descendants), but if *recursive* is ``True``, then it could be any + :class:`.Wikicode` contained by a node within ``self``. If *obj* is not + found, :exc:`ValueError` is raised. 
""" - mkslice = lambda i: slice(i, i + 1) if isinstance(obj, Node): + mkslice = lambda i: slice(i, i + 1) if not recursive: return self, mkslice(self.index(obj)) for i, node in enumerate(self.nodes): @@ -125,26 +130,25 @@ class Wikicode(StringMixIn): if not context: context = self return context, mkslice(context.index(child)) - else: - context, ind = self._do_strong_search(obj.get(0), recursive) - for i in range(1, len(obj.nodes)): - if obj.get(i) is not context.get(ind.start + i): - break - else: - return context, slice(ind.start, ind.start + len(obj.nodes)) - raise ValueError(obj) + raise ValueError(obj) + + context, ind = self._do_strong_search(obj.get(0), recursive) + for i in range(1, len(obj.nodes)): + if obj.get(i) is not context.get(ind.start + i): + raise ValueError(obj) + return context, slice(ind.start, ind.start + len(obj.nodes)) def _do_weak_search(self, obj, recursive): """Search for an element that looks like *obj* within the node list. - This follows the same rules as :py:meth:`_do_strong_search` with some + This follows the same rules as :meth:`_do_strong_search` with some differences. *obj* is treated as a string that might represent any - :py:class:`.Node`, :py:class:`.Wikicode`, or combination of the two - present in the node list. Thus, matching is weak (using string - comparisons) rather than strong (using ``is``). Because multiple nodes - can match *obj*, the result is a list of tuples instead of just one - (however, :py:exc:`ValueError` is still raised if nothing is found). - Individual matches will never overlap. + :class:`.Node`, :class:`.Wikicode`, or combination of the two present + in the node list. Thus, matching is weak (using string comparisons) + rather than strong (using ``is``). Because multiple nodes can match + *obj*, the result is a list of tuples instead of just one (however, + :exc:`ValueError` is still raised if nothing is found). Individual + matches will never overlap. 
The tuples contain a new first element, *exact*, which is ``True`` if we were able to match *obj* exactly to one or more adjacent nodes, or @@ -208,24 +212,24 @@ class Wikicode(StringMixIn): def _build_filter_methods(cls, **meths): """Given Node types, build the corresponding i?filter shortcuts. - The should be given as keys storing the method's base name paired - with values storing the corresponding :py:class:`~.Node` type. For - example, the dict may contain the pair ``("templates", Template)``, - which will produce the methods :py:meth:`ifilter_templates` and - :py:meth:`filter_templates`, which are shortcuts for - :py:meth:`ifilter(forcetype=Template) ` and - :py:meth:`filter(forcetype=Template) `, respectively. These + They should be given as keys storing the method's base name paired with + values storing the corresponding :class:`.Node` type. For example, the + dict may contain the pair ``("templates", Template)``, which will + produce the methods :meth:`ifilter_templates` and + :meth:`filter_templates`, which are shortcuts for + :meth:`ifilter(forcetype=Template) ` and + :meth:`filter(forcetype=Template) `, respectively. These shortcuts are added to the class itself, with an appropriate docstring. """ doc = """Iterate over {0}. - This is equivalent to :py:meth:`{1}` with *forcetype* set to - :py:class:`~{2.__module__}.{2.__name__}`. + This is equivalent to :meth:`{1}` with *forcetype* set to + :class:`~{2.__module__}.{2.__name__}`.
""" - make_ifilter = lambda ftype: (lambda self, **kw: - self.ifilter(forcetype=ftype, **kw)) - make_filter = lambda ftype: (lambda self, **kw: - self.filter(forcetype=ftype, **kw)) + make_ifilter = lambda ftype: (lambda self, *a, **kw: + self.ifilter(forcetype=ftype, *a, **kw)) + make_filter = lambda ftype: (lambda self, *a, **kw: + self.filter(forcetype=ftype, *a, **kw)) for name, ftype in (meths.items() if py3k else meths.iteritems()): ifilter = make_ifilter(ftype) filter = make_filter(ftype) @@ -236,10 +240,10 @@ class Wikicode(StringMixIn): @property def nodes(self): - """A list of :py:class:`~.Node` objects. + """A list of :class:`.Node` objects. - This is the internal data actually stored within a - :py:class:`~.Wikicode` object. + This is the internal data actually stored within a :class:`.Wikicode` + object. """ return self._nodes @@ -256,11 +260,10 @@ class Wikicode(StringMixIn): def set(self, index, value): """Set the ``Node`` at *index* to *value*. - Raises :py:exc:`IndexError` if *index* is out of range, or - :py:exc:`ValueError` if *value* cannot be coerced into one - :py:class:`~.Node`. To insert multiple nodes at an index, use - :py:meth:`get` with either :py:meth:`remove` and :py:meth:`insert` or - :py:meth:`replace`. + Raises :exc:`IndexError` if *index* is out of range, or + :exc:`ValueError` if *value* cannot be coerced into one :class:`.Node`. + To insert multiple nodes at an index, use :meth:`get` with either + :meth:`remove` and :meth:`insert` or :meth:`replace`. """ nodes = parse_anything(value).nodes if len(nodes) > 1: @@ -275,7 +278,7 @@ class Wikicode(StringMixIn): def index(self, obj, recursive=False): """Return the index of *obj* in the list of nodes. - Raises :py:exc:`ValueError` if *obj* is not found. If *recursive* is + Raises :exc:`ValueError` if *obj* is not found. If *recursive* is ``True``, we will look in all nodes of ours and their descendants, and return the index of our direct descendant node within *our* list of nodes. 
Otherwise, the lookup is done only on direct descendants. @@ -294,9 +297,8 @@ class Wikicode(StringMixIn): def insert(self, index, value): """Insert *value* at *index* in the list of nodes. - *value* can be anything parasable by :py:func:`.parse_anything`, which - includes strings or other :py:class:`~.Wikicode` or :py:class:`~.Node` - objects. + *value* can be anything parsable by :func:`.parse_anything`, which + includes strings or other :class:`.Wikicode` or :class:`.Node` objects. """ nodes = parse_anything(value).nodes for node in reversed(nodes): @@ -305,15 +307,14 @@ class Wikicode(StringMixIn): def insert_before(self, obj, value, recursive=True): """Insert *value* immediately before *obj*. - *obj* can be either a string, a :py:class:`~.Node`, or another - :py:class:`~.Wikicode` object (as created by :py:meth:`get_sections`, - for example). If *obj* is a string, we will operate on all instances - of that string within the code, otherwise only on the specific instance - given. *value* can be anything parasable by :py:func:`.parse_anything`. - If *recursive* is ``True``, we will try to find *obj* within our child - nodes even if it is not a direct descendant of this - :py:class:`~.Wikicode` object. If *obj* is not found, - :py:exc:`ValueError` is raised. + *obj* can be either a string, a :class:`.Node`, or another + :class:`.Wikicode` object (as created by :meth:`get_sections`, for + example). If *obj* is a string, we will operate on all instances of + that string within the code, otherwise only on the specific instance + given. *value* can be anything parsable by :func:`.parse_anything`. If + *recursive* is ``True``, we will try to find *obj* within our child + nodes even if it is not a direct descendant of this :class:`.Wikicode` + object. If *obj* is not found, :exc:`ValueError` is raised. 
""" if isinstance(obj, (Node, Wikicode)): context, index = self._do_strong_search(obj, recursive) @@ -329,15 +330,14 @@ class Wikicode(StringMixIn): def insert_after(self, obj, value, recursive=True): """Insert *value* immediately after *obj*. - *obj* can be either a string, a :py:class:`~.Node`, or another - :py:class:`~.Wikicode` object (as created by :py:meth:`get_sections`, - for example). If *obj* is a string, we will operate on all instances - of that string within the code, otherwise only on the specific instance - given. *value* can be anything parasable by :py:func:`.parse_anything`. - If *recursive* is ``True``, we will try to find *obj* within our child - nodes even if it is not a direct descendant of this - :py:class:`~.Wikicode` object. If *obj* is not found, - :py:exc:`ValueError` is raised. + *obj* can be either a string, a :class:`.Node`, or another + :class:`.Wikicode` object (as created by :meth:`get_sections`, for + example). If *obj* is a string, we will operate on all instances of + that string within the code, otherwise only on the specific instance + given. *value* can be anything parsable by :func:`.parse_anything`. If + *recursive* is ``True``, we will try to find *obj* within our child + nodes even if it is not a direct descendant of this :class:`.Wikicode` + object. If *obj* is not found, :exc:`ValueError` is raised. """ if isinstance(obj, (Node, Wikicode)): context, index = self._do_strong_search(obj, recursive) @@ -353,15 +353,14 @@ class Wikicode(StringMixIn): def replace(self, obj, value, recursive=True): """Replace *obj* with *value*. - *obj* can be either a string, a :py:class:`~.Node`, or another - :py:class:`~.Wikicode` object (as created by :py:meth:`get_sections`, - for example). If *obj* is a string, we will operate on all instances - of that string within the code, otherwise only on the specific instance - given. *value* can be anything parasable by :py:func:`.parse_anything`. 
+ *obj* can be either a string, a :class:`.Node`, or another + :class:`.Wikicode` object (as created by :meth:`get_sections`, for + example). If *obj* is a string, we will operate on all instances of + that string within the code, otherwise only on the specific instance + given. *value* can be anything parsable by :func:`.parse_anything`. If *recursive* is ``True``, we will try to find *obj* within our child - nodes even if it is not a direct descendant of this - :py:class:`~.Wikicode` object. If *obj* is not found, - :py:exc:`ValueError` is raised. + nodes even if it is not a direct descendant of this :class:`.Wikicode` + object. If *obj* is not found, :exc:`ValueError` is raised. """ if isinstance(obj, (Node, Wikicode)): context, index = self._do_strong_search(obj, recursive) @@ -380,7 +379,7 @@ class Wikicode(StringMixIn): def append(self, value): """Insert *value* at the end of the list of nodes. - *value* can be anything parasable by :py:func:`.parse_anything`. + *value* can be anything parsable by :func:`.parse_anything`. """ nodes = parse_anything(value).nodes for node in nodes: @@ -389,14 +388,14 @@ class Wikicode(StringMixIn): def remove(self, obj, recursive=True): """Remove *obj* from the list of nodes. - *obj* can be either a string, a :py:class:`~.Node`, or another - :py:class:`~.Wikicode` object (as created by :py:meth:`get_sections`, - for example). If *obj* is a string, we will operate on all instances - of that string within the code, otherwise only on the specific instance + *obj* can be either a string, a :class:`.Node`, or another + :class:`.Wikicode` object (as created by :meth:`get_sections`, for + example). If *obj* is a string, we will operate on all instances of + that string within the code, otherwise only on the specific instance given. If *recursive* is ``True``, we will try to find *obj* within our child nodes even if it is not a direct descendant of this - :py:class:`~.Wikicode` object. 
If *obj* is not found, - :py:exc:`ValueError` is raised. + :class:`.Wikicode` object. If *obj* is not found, :exc:`ValueError` is + raised. """ if isinstance(obj, (Node, Wikicode)): context, index = self._do_strong_search(obj, recursive) @@ -413,10 +412,10 @@ class Wikicode(StringMixIn): def matches(self, other): """Do a loose equivalency test suitable for comparing page names. - *other* can be any string-like object, including - :py:class:`~.Wikicode`, or a tuple of these. This operation is - symmetric; both sides are adjusted. Specifically, whitespace and markup - is stripped and the first letter's case is normalized. Typical usage is + *other* can be any string-like object, including :class:`.Wikicode`, or + a tuple of these. This operation is symmetric; both sides are adjusted. + Specifically, whitespace and markup is stripped and the first letter's + case is normalized. Typical usage is ``if template.name.matches("stub"): ...``. """ cmp = lambda a, b: (a[0].upper() + a[1:] == b[0].upper() + b[1:] @@ -435,35 +434,44 @@ class Wikicode(StringMixIn): forcetype=None): """Iterate over nodes in our list matching certain conditions. - If *recursive* is ``True``, we will iterate over our children and all - of their descendants, otherwise just our immediate children. If - *forcetype* is given, only nodes that are instances of this type are - yielded. *matches* can be used to further restrict the nodes, either as - a function (taking a single :py:class:`.Node` and returning a boolean) - or a regular expression (matched against the node's string - representation with :py:func:`re.search`). If *matches* is a regex, the - flags passed to :py:func:`re.search` are :py:const:`re.IGNORECASE`, - :py:const:`re.DOTALL`, and :py:const:`re.UNICODE`, but custom flags can - be specified by passing *flags*. + If *forcetype* is given, only nodes that are instances of this type (or + tuple of types) are yielded. 
Setting *recursive* to ``True`` will + iterate over all children and their descendants. ``RECURSE_OTHERS`` + will only iterate over children that are not the instances of + *forcetype*. ``False`` will only iterate over immediate children. + + ``RECURSE_OTHERS`` can be used to iterate over all un-nested templates, + even if they are inside of HTML tags, like so: + + >>> code = mwparserfromhell.parse("{{foo}}{{foo|{{bar}}}}") + >>> code.filter_templates(code.RECURSE_OTHERS) + ["{{foo}}", "{{foo|{{bar}}}}"] + + *matches* can be used to further restrict the nodes, either as a + function (taking a single :class:`.Node` and returning a boolean) or a + regular expression (matched against the node's string representation + with :func:`re.search`). If *matches* is a regex, the flags passed to + :func:`re.search` are :const:`re.IGNORECASE`, :const:`re.DOTALL`, and + :const:`re.UNICODE`, but custom flags can be specified by passing + *flags*. """ - return (node for i, node in - self._indexed_ifilter(recursive, matches, flags, forcetype)) + gen = self._indexed_ifilter(recursive, matches, flags, forcetype) + return (node for i, node in gen) - def filter(self, recursive=True, matches=None, flags=FLAGS, - forcetype=None): + def filter(self, *args, **kwargs): """Return a list of nodes within our list matching certain conditions. - This is equivalent to calling :py:func:`list` on :py:meth:`ifilter`. + This is equivalent to calling :func:`list` on :meth:`ifilter`. """ - return list(self.ifilter(recursive, matches, flags, forcetype)) + return list(self.ifilter(*args, **kwargs)) def get_sections(self, levels=None, matches=None, flags=FLAGS, flat=False, include_lead=None, include_headings=True): """Return a list of sections within the page. - Sections are returned as :py:class:`~.Wikicode` objects with a shared - node list (implemented using :py:class:`~.SmartList`) so that changes - to sections are reflected in the parent Wikicode object. 
+ Sections are returned as :class:`.Wikicode` objects with a shared node + list (implemented using :class:`.SmartList`) so that changes to + sections are reflected in the parent Wikicode object. Each section contains all of its subsections, unless *flat* is ``True``. If *levels* is given, it should be an iterable of integers; @@ -471,14 +479,13 @@ class Wikicode(StringMixIn): *matches* is given, it should be either a function or a regex; only sections whose headings match it (without the surrounding equal signs) will be included. *flags* can be used to override the default regex - flags (see :py:meth:`ifilter`) if a regex *matches* is used. + flags (see :meth:`ifilter`) if a regex *matches* is used. If *include_lead* is ``True``, the first, lead section (without a heading) will be included in the list; ``False`` will not include it; the default will include it only if no specific *levels* were given. If *include_headings* is ``True``, the section's beginning - :py:class:`~.Heading` object will be included; otherwise, this is - skipped. + :class:`.Heading` object will be included; otherwise, this is skipped. """ title_matcher = self._build_matcher(matches, flags) matcher = lambda heading: (title_matcher(heading.title) and @@ -527,7 +534,7 @@ class Wikicode(StringMixIn): """Return a rendered string without unprintable code such as templates. The way a node is stripped is handled by the - :py:meth:`~.Node.__strip__` method of :py:class:`~.Node` objects, which + :meth:`~.Node.__strip__` method of :class:`.Node` objects, which generally return a subset of their nodes or ``None``. For example, templates and tags are removed completely, links are stripped to just their display part, headings are stripped to just their title. If @@ -555,12 +562,12 @@ class Wikicode(StringMixIn): """Return a hierarchical tree representation of the object. The representation is a string that makes the most sense when printed.
It is - built by calling :py:meth:`_get_tree` on the - :py:class:`~.Wikicode` object and its children recursively. The end - result may look something like the following:: + built by calling :meth:`_get_tree` on the :class:`.Wikicode` object and + its children recursively. The end result may look something like the + following:: >>> text = "Lorem ipsum {{foo|bar|{{baz}}|spam=eggs}}" - >>> print mwparserfromhell.parse(text).get_tree() + >>> print(mwparserfromhell.parse(text).get_tree()) Lorem ipsum {{ foo diff --git a/scripts/memtest.py b/scripts/memtest.py new file mode 100644 index 0000000..824d992 --- /dev/null +++ b/scripts/memtest.py @@ -0,0 +1,170 @@ +# -*- coding: utf-8 -*- +# +# Copyright (C) 2012-2015 Ben Kurtovic +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. + +""" +Tests for memory leaks in the CTokenizer. Python 2 and 3 compatible. + +This appears to work mostly fine under Linux, but gives an absurd number of +false positives on OS X. 
I'm not sure why. Running the tests multiple times +yields different results (tests don't always leak, and the amount they leak by +varies). Increasing the number of loops results in a smaller bytes/loop value, +too, indicating the increase in memory usage might be due to something else. +Actual memory leaks typically leak very large amounts of memory (megabytes) +and scale with the number of loops. +""" + +from __future__ import unicode_literals, print_function +from locale import LC_ALL, setlocale +from multiprocessing import Process, Pipe +from os import listdir, path +import sys + +import psutil + +from mwparserfromhell.compat import py3k +from mwparserfromhell.parser._tokenizer import CTokenizer + +if sys.version_info[0] == 2: + range = xrange + +LOOPS = 10000 + +class Color(object): + GRAY = "\x1b[30;1m" + GREEN = "\x1b[92m" + YELLOW = "\x1b[93m" + RESET = "\x1b[0m" + + +class MemoryTest(object): + """Manages a memory test.""" + + def __init__(self): + self._tests = [] + self._load() + + def _parse_file(self, name, text): + tests = text.split("\n---\n") + counter = 1 + digits = len(str(len(tests))) + for test in tests: + data = {"name": None, "label": None, "input": None, "output": None} + for line in test.strip().splitlines(): + if line.startswith("name:"): + data["name"] = line[len("name:"):].strip() + elif line.startswith("label:"): + data["label"] = line[len("label:"):].strip() + elif line.startswith("input:"): + raw = line[len("input:"):].strip() + if raw[0] == '"' and raw[-1] == '"': + raw = raw[1:-1] + raw = raw.encode("raw_unicode_escape") + data["input"] = raw.decode("unicode_escape") + number = str(counter).zfill(digits) + fname = "test_{0}{1}_{2}".format(name, number, data["name"]) + self._tests.append((fname, data["input"])) + counter += 1 + + def _load(self): + def load_file(filename): + with open(filename, "rU") as fp: + text = fp.read() + if not py3k: + text = text.decode("utf8") + name = path.split(filename)[1][:0-len(extension)] + 
self._parse_file(name, text) + + root = path.split(path.dirname(path.abspath(__file__)))[0] + directory = path.join(root, "tests", "tokenizer") + extension = ".mwtest" + if len(sys.argv) > 2 and sys.argv[1] == "--use": + for name in sys.argv[2:]: + load_file(path.join(directory, name + extension)) + sys.argv = [sys.argv[0]] # So unittest doesn't try to load these + else: + for filename in listdir(directory): + if not filename.endswith(extension): + continue + load_file(path.join(directory, filename)) + + @staticmethod + def _print_results(info1, info2): + r1, r2 = info1.rss, info2.rss + buff = 8192 + if r2 - buff > r1: + d = r2 - r1 + p = float(d) / r1 + bpt = d // LOOPS + tmpl = "{0}LEAKING{1}: {2:n} bytes, {3:.2%} inc ({4:n} bytes/loop)" + sys.stdout.write(tmpl.format(Color.YELLOW, Color.RESET, d, p, bpt)) + else: + sys.stdout.write("{0}OK{1}".format(Color.GREEN, Color.RESET)) + + def run(self): + """Run the memory test suite.""" + width = 1 + for (name, _) in self._tests: + if len(name) > width: + width = len(name) + + tmpl = "{0}[{1:03}/{2}]{3} {4}: " + for i, (name, text) in enumerate(self._tests, 1): + sys.stdout.write(tmpl.format(Color.GRAY, i, len(self._tests), + Color.RESET, name.ljust(width))) + sys.stdout.flush() + parent, child = Pipe() + p = Process(target=_runner, args=(text, child)) + p.start() + try: + proc = psutil.Process(p.pid) + parent.recv() + parent.send("OK") + parent.recv() + info1 = proc.get_memory_info() + sys.stdout.flush() + parent.send("OK") + parent.recv() + info2 = proc.get_memory_info() + self._print_results(info1, info2) + sys.stdout.flush() + parent.send("OK") + finally: + proc.kill() + print() + + +def _runner(text, child): + r1, r2 = range(250), range(LOOPS) + for i in r1: + CTokenizer().tokenize(text) + child.send("OK") + child.recv() + child.send("OK") + child.recv() + for i in r2: + CTokenizer().tokenize(text) + child.send("OK") + child.recv() + +if __name__ == "__main__": + setlocale(LC_ALL, "") + MemoryTest().run() diff 
--git a/scripts/release.sh b/scripts/release.sh new file mode 100755 index 0000000..4becf1a --- /dev/null +++ b/scripts/release.sh @@ -0,0 +1,165 @@ +#! /usr/bin/env bash + +if [[ -z "$1" ]]; then + echo "usage: $0 1.2.3" + exit 1 +fi + +VERSION=$1 +SCRIPT_DIR=$(dirname "$0") +RELEASE_DATE=$(date +"%B %d, %Y") + +check_git() { + if [[ -n "$(git status --porcelain --untracked-files=no)" ]]; then + echo "Aborting: dirty working directory." + exit 1 + fi + if [[ "$(git rev-parse --abbrev-ref HEAD)" != "develop" ]]; then + echo "Aborting: not on develop." + exit 1 + fi + echo -n "Are you absolutely ready to release? [yN] " + read confirm + if [[ ${confirm,,} != "y" ]]; then + exit 1 + fi +} + +update_version() { + echo -n "Updating mwparserfromhell.__version__..." + sed -e 's/__version__ = .*/__version__ = "'$VERSION'"/' -i "" mwparserfromhell/__init__.py + echo " done." +} + +update_changelog() { + filename="CHANGELOG" + echo -n "Updating $filename..." + sed -e '1s/.*/v'$VERSION' (released '$RELEASE_DATE'):/' -i "" $filename + echo " done." +} + +update_docs_changelog() { + filename="docs/changelog.rst" + echo -n "Updating $filename..." + dashes=$(seq 1 $(expr ${#VERSION} + 1) | sed 's/.*/-/' | tr -d '\n') + previous_lineno=$(expr $(grep -n -e "^---" $filename | sed '2q;d' | cut -d ':' -f 1) - 1) + previous_version=$(sed $previous_lineno'q;d' $filename) + sed \ + -e '4s/.*/v'$VERSION \ + -e '5s/.*/'$dashes \ + -e '7s/.*/`Released '$RELEASE_DATE' `_/' \ + -e '8s/.*/(`changes `__):/' \ + -i "" $filename + echo " done." +} + +do_git_stuff() { + echo -n "Git: committing, tagging, and merging release..." + git commit -qam "release/$VERSION" + git tag v$VERSION -s -m "version $VERSION" + git checkout -q master + git merge -q --no-ff develop -m "Merge branch 'develop'" + echo -n " pushing..." + git push -q --tags origin master + git checkout -q develop + git push -q origin develop + echo " done." 
+} + +upload_to_pypi() { + # TODO: check whether these commands give output + echo -n "PyPI: uploading source tarball and docs..." + python setup.py register sdist upload -s + python setup.py upload_docs + echo " done." +} + +windows_build() { + echo "PyPI: building/uploading Windows binaries..." + echo "*** Run in Windows: ./scripts/win_build.py" + echo "*** Press enter when done." + read +} + +post_release() { + echo + echo "*** Release completed." + echo "*** Update: https://github.com/earwig/mwparserfromhell/releases/tag/v$VERSION" + echo "*** Verify: https://pypi.python.org/pypi/mwparserfromhell" + echo "*** Verify: https://mwparserfromhell.readthedocs.org" + echo "*** Press enter to sanity-check the release." + read +} + +test_release() { + echo + echo "Checking mwparserfromhell v$VERSION..." + echo -n "Creating a virtualenv..." + virtdir="mwparser-test-env" + virtualenv -q $virtdir + cd $virtdir + source bin/activate + echo " done." + echo -n "Installing mwparserfromhell with pip..." + pip -q install mwparserfromhell + echo " done." + echo -n "Checking version..." + reported_version=$(python -c 'print __import__("mwparserfromhell").__version__') + if [[ "$reported_version" != "$VERSION" ]]; then + echo " error." + echo "*** ERROR: mwparserfromhell is reporting its version as $reported_version, not $VERSION!" + deactivate + cd .. + rm -rf $virtdir + exit 1 + else + echo " done." + fi + pip -q uninstall -y mwparserfromhell + echo -n "Downloading mwparserfromhell source tarball and GPG signature..." + curl -sL "https://pypi.python.org/packages/source/m/mwparserfromhell/mwparserfromhell-$VERSION.tar.gz" -o "mwparserfromhell.tar.gz" + curl -sL "https://pypi.python.org/packages/source/m/mwparserfromhell/mwparserfromhell-$VERSION.tar.gz.asc" -o "mwparserfromhell.tar.gz.asc" + echo " done." + echo "Verifying tarball..." + gpg --verify mwparserfromhell.tar.gz.asc + if [[ "$?" != "0" ]]; then + echo "*** ERROR: GPG signature verification failed!" + deactivate + cd .. 
+ rm -rf $virtdir + exit 1 + fi + tar -xf mwparserfromhell.tar.gz + rm mwparserfromhell.tar.gz mwparserfromhell.tar.gz.asc + cd mwparserfromhell-$VERSION + echo "Running unit tests..." + python setup.py -q test + if [[ "$?" != "0" ]]; then + echo "*** ERROR: Unit tests failed!" + deactivate + cd ../.. + rm -rf $virtdir + exit 1 + fi + echo -n "Everything looks good. Cleaning up..." + deactivate + cd ../.. + rm -rf $virtdir + echo " done." +} + +echo "Preparing mwparserfromhell v$VERSION..." +cd "$SCRIPT_DIR/.." + +check_git +update_version +update_changelog +update_docs_changelog +do_git_stuff +upload_to_pypi +windows_build +post_release +test_release + +echo "All done." +exit 0 diff --git a/scripts/win_build.py b/scripts/win_build.py new file mode 100644 index 0000000..2d51909 --- /dev/null +++ b/scripts/win_build.py @@ -0,0 +1,58 @@ +# Build requirements: +# +# Python 2.6-3.2: Visual C++ Express Edition 2008: +# http://go.microsoft.com/?linkid=7729279 +# +# Python 3.3+: Visual C++ Express Edition 2010: +# http://go.microsoft.com/?linkid=9709949 +# +# x64 builds: Microsoft Windows SDK for Windows 7 and .NET Framework 3.5 SP1: +# http://www.microsoft.com/en-us/download/details.aspx?id=3138 +# +# Python interpreter, 2.6, 2.7, 3.2-3.4: +# https://www.python.org/downloads/ +# +# Pip, setuptools, wheel: +# https://bootstrap.pypa.io/get-pip.py +# and run *for each* Python version: +# c:\pythonXX\python get-pip.py +# c:\pythonXX\scripts\pip install wheel +# +# Afterwards, run this script with any of the python interpreters (2.7 suggested) + +from __future__ import print_function +import os +from subprocess import call, STDOUT + +ENVIRONMENTS = ["26", "27", "32", "33", "34"] + +def run(pyver, cmds): + cmd = [r"C:\Python%s\Python.exe" % pyver, "setup.py"] + cmds + print(" ".join(cmd), end=" ") + + with open("%s%s.log" % (cmds[0], pyver), "w") as logfile: + retval = call(cmd, stdout=logfile, stderr=STDOUT, cwd="..") + if not retval: + print("[OK]") + else: + print("[FAILED 
(%i)]" % retval) + return retval + +def main(): + path = os.path.split(__file__)[0] + if path: + os.chdir(path) + + print("Building Windows wheels for Python %s:" % ", ".join(ENVIRONMENTS)) + for pyver in ENVIRONMENTS: + print() + try: + os.unlink("mwparserfromhell/parser/_tokenizer.pyd") + except OSError: + pass + + if run(pyver, ["test"]) == 0: + run(pyver, ["bdist_wheel", "upload"]) # TODO: add "-s" to GPG sign + +if __name__ == "__main__": + main() diff --git a/setup.py b/setup.py index 5a45902..5c21db5 100644 --- a/setup.py +++ b/setup.py @@ -1,7 +1,7 @@ #! /usr/bin/env python # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -21,24 +21,85 @@ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. +import os import sys if (sys.version_info[0] == 2 and sys.version_info[1] < 6) or \ (sys.version_info[0] == 3 and sys.version_info[1] < 2): - raise Exception('mwparserfromhell needs Python 2.6+ or 3.2+') + raise Exception("mwparserfromhell needs Python 2.6+ or 3.2+") + +if sys.version_info >= (3, 0): + basestring = (str, ) from setuptools import setup, find_packages, Extension from mwparserfromhell import __version__ from mwparserfromhell.compat import py26, py3k -with open("README.rst") as fp: +with open("README.rst", **{'encoding':'utf-8'} if py3k else {}) as fp: long_docs = fp.read() tokenizer = Extension("mwparserfromhell.parser._tokenizer", - sources = ["mwparserfromhell/parser/tokenizer.c"]) + sources=["mwparserfromhell/parser/tokenizer.c"], + depends=["mwparserfromhell/parser/tokenizer.h"]) + +use_extension = True + +# Allow env var WITHOUT_EXTENSION and args --with[out]-extension +if '--without-extension' in sys.argv: + use_extension = False +elif
os.environ.get('WITHOUT_EXTENSION', '0') == '1': + use_extension = False + +# Remove the command line arguments as they aren't understood by +# setuptools/distutils +sys.argv = [arg for arg in sys.argv + if not arg.startswith('--with') + and not arg.endswith('-extension')] + + +def optional_compile_setup(func=setup, use_ext=use_extension, + *args, **kwargs): + """ + Wrap setup to allow optional compilation of extensions. + + Falls back to pure python mode (no extensions) + if compilation of extensions fails. + """ + extensions = kwargs.get('ext_modules', None) + + if use_ext and extensions: + try: + func(*args, **kwargs) + return + except SystemExit as e: + assert(e.args) + if e.args[0] is False: + raise + elif isinstance(e.args[0], basestring): + if e.args[0].startswith('usage: '): + raise + else: + # Fall back to pure python mode + print('setup with extension failed: %s' % repr(e)) + pass + except Exception as e: + print('setup with extension failed: %s' % repr(e)) + + if extensions: + if use_ext: + print('Falling back to pure python mode.') + else: + print('Using pure python mode.') + + del kwargs['ext_modules'] + + func(*args, **kwargs) + -setup( +optional_compile_setup( name = "mwparserfromhell", packages = find_packages(exclude=("tests",)), ext_modules = [tokenizer], diff --git a/tests/_test_tokenizer.py b/tests/_test_tokenizer.py index 7487241..1cbbc3d 100644 --- a/tests/_test_tokenizer.py +++ b/tests/_test_tokenizer.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -21,11 +21,13 @@ # SOFTWARE.
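The optional_compile_setup() wrapper above is the core of the new build fallback: try setup() with the C extension first, and retry without it if compilation fails. A minimal sketch of the same pattern follows; the names (build_with_fallback, fake_setup) are illustrative stand-ins, not the actual setuptools API.

```python
def build_with_fallback(builder, use_ext=True, **kwargs):
    """Call builder(**kwargs); if building with C extensions fails,
    retry the same call without 'ext_modules' (pure Python mode)."""
    if use_ext and kwargs.get("ext_modules"):
        try:
            return builder(**kwargs)
        except (SystemExit, Exception) as exc:
            # distutils reports compiler errors by raising SystemExit
            print("setup with extension failed: %r" % (exc,))
    if kwargs.pop("ext_modules", None) and use_ext:
        print("Falling back to pure python mode.")
    return builder(**kwargs)


def fake_setup(**kwargs):
    # Stand-in for setuptools.setup(): pretend every extension build
    # fails, as it would on a machine without a C compiler.
    if kwargs.get("ext_modules"):
        raise SystemExit("error: Unable to find vcvarsall.bat")
    return "pure-python build"
```

The real wrapper is more careful than this sketch: it re-raises when the SystemExit looks like a usage error rather than a compiler failure, so bad command lines still abort.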
from __future__ import print_function, unicode_literals +import codecs from os import listdir, path import sys -from mwparserfromhell.compat import py3k +from mwparserfromhell.compat import py3k, str from mwparserfromhell.parser import tokens +from mwparserfromhell.parser.builder import Builder class _TestParseError(Exception): """Raised internally when a test could not be parsed.""" @@ -49,8 +51,12 @@ class TokenizerTestCase(object): *label* for the method's docstring. """ def inner(self): - expected = data["output"] - actual = self.tokenizer().tokenize(data["input"]) + if hasattr(self, "roundtrip"): + expected = data["input"] + actual = str(Builder().build(data["output"][:])) + else: + expected = data["output"] + actual = self.tokenizer().tokenize(data["input"]) self.assertEqual(expected, actual) if not py3k: inner.__name__ = funcname.encode("utf8") @@ -109,10 +115,8 @@ class TokenizerTestCase(object): def build(cls): """Load and install all tests from the 'tokenizer' directory.""" def load_file(filename): - with open(filename, "rU") as fp: + with codecs.open(filename, "rU", encoding="utf8") as fp: text = fp.read() - if not py3k: - text = text.decode("utf8") name = path.split(filename)[1][:0-len(extension)] cls._load_tests(filename, name, text) diff --git a/tests/_test_tree_equality.py b/tests/_test_tree_equality.py index 10d491e..086f113 100644 --- a/tests/_test_tree_equality.py +++ b/tests/_test_tree_equality.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -98,7 +98,7 @@ class TreeEqualityTestCase(TestCase): self.assertWikicodeEqual(exp_attr.name, act_attr.name) if exp_attr.value is not None: self.assertWikicodeEqual(exp_attr.value, act_attr.value) - self.assertIs(exp_attr.quoted, act_attr.quoted) + 
self.assertEqual(exp_attr.quotes, act_attr.quotes) self.assertEqual(exp_attr.pad_first, act_attr.pad_first) self.assertEqual(exp_attr.pad_before_eq, act_attr.pad_before_eq) self.assertEqual(exp_attr.pad_after_eq, act_attr.pad_after_eq) diff --git a/tests/test_argument.py b/tests/test_argument.py index 3539ec4..70d8006 100644 --- a/tests/test_argument.py +++ b/tests/test_argument.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_attribute.py b/tests/test_attribute.py index 50eed74..b3e325d 100644 --- a/tests/test_attribute.py +++ b/tests/test_attribute.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -42,12 +42,14 @@ class TestAttribute(TreeEqualityTestCase): self.assertEqual(" foo", str(node)) node2 = Attribute(wraptext("foo"), wraptext("bar")) self.assertEqual(' foo="bar"', str(node2)) - node3 = Attribute(wraptext("a"), wraptext("b"), True, "", " ", " ") + node3 = Attribute(wraptext("a"), wraptext("b"), '"', "", " ", " ") self.assertEqual('a = "b"', str(node3)) - node3 = Attribute(wraptext("a"), wraptext("b"), False, "", " ", " ") - self.assertEqual("a = b", str(node3)) - node4 = Attribute(wraptext("a"), wrap([]), False, " ", "", " ") - self.assertEqual(" a= ", str(node4)) + node4 = Attribute(wraptext("a"), wraptext("b"), "'", "", " ", " ") + self.assertEqual("a = 'b'", str(node4)) + node5 = Attribute(wraptext("a"), wraptext("b"), None, "", " ", " ") + self.assertEqual("a = b", str(node5)) + node6 = Attribute(wraptext("a"), wrap([]), None, " ", "", " ") + self.assertEqual(" 
a= ", str(node6)) def test_name(self): """test getter/setter for the name attribute""" @@ -66,17 +68,35 @@ class TestAttribute(TreeEqualityTestCase): self.assertWikicodeEqual(wrap([Template(wraptext("bar"))]), node.value) node.value = None self.assertIs(None, node.value) + node2 = Attribute(wraptext("id"), wraptext("foo"), None) + node2.value = "foo bar baz" + self.assertWikicodeEqual(wraptext("foo bar baz"), node2.value) + self.assertEqual('"', node2.quotes) + node2.value = 'foo "bar" baz' + self.assertWikicodeEqual(wraptext('foo "bar" baz'), node2.value) + self.assertEqual("'", node2.quotes) + node2.value = "foo 'bar' baz" + self.assertWikicodeEqual(wraptext("foo 'bar' baz"), node2.value) + self.assertEqual('"', node2.quotes) + node2.value = "fo\"o 'bar' b\"az" + self.assertWikicodeEqual(wraptext("fo\"o 'bar' b\"az"), node2.value) + self.assertEqual('"', node2.quotes) - def test_quoted(self): - """test getter/setter for the quoted attribute""" - node1 = Attribute(wraptext("id"), wraptext("foo"), False) + def test_quotes(self): + """test getter/setter for the quotes attribute""" + node1 = Attribute(wraptext("id"), wraptext("foo"), None) node2 = Attribute(wraptext("id"), wraptext("bar")) - self.assertFalse(node1.quoted) - self.assertTrue(node2.quoted) - node1.quoted = True - node2.quoted = "" - self.assertTrue(node1.quoted) - self.assertFalse(node2.quoted) + node3 = Attribute(wraptext("id"), wraptext("foo bar baz")) + self.assertIs(None, node1.quotes) + self.assertEqual('"', node2.quotes) + node1.quotes = "'" + node2.quotes = None + self.assertEqual("'", node1.quotes) + self.assertIs(None, node2.quotes) + self.assertRaises(ValueError, setattr, node1, "quotes", "foobar") + self.assertRaises(ValueError, setattr, node3, "quotes", None) + self.assertRaises(ValueError, Attribute, wraptext("id"), + wraptext("foo bar baz"), None) def test_padding(self): """test getter/setter for the padding attributes""" diff --git a/tests/test_builder.py b/tests/test_builder.py index 
c8fdca3..9af4f21 100644 --- a/tests/test_builder.py +++ b/tests/test_builder.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -27,10 +27,11 @@ try: except ImportError: import unittest +from mwparserfromhell.compat import py3k from mwparserfromhell.nodes import (Argument, Comment, ExternalLink, Heading, HTMLEntity, Tag, Template, Text, Wikilink) from mwparserfromhell.nodes.extras import Attribute, Parameter -from mwparserfromhell.parser import tokens +from mwparserfromhell.parser import tokens, ParserError from mwparserfromhell.parser.builder import Builder from ._test_tree_equality import TreeEqualityTestCase, wrap, wraptext @@ -269,7 +270,7 @@ class TestBuilder(TreeEqualityTestCase): tokens.TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), tokens.Text(text="name"), tokens.TagAttrEquals(), - tokens.TagAttrQuote(), tokens.Text(text="abc"), + tokens.TagAttrQuote(char='"'), tokens.Text(text="abc"), tokens.TagCloseSelfclose(padding=" ")], wrap([Tag(wraptext("ref"), attrs=[Attribute(wraptext("name"), wraptext("abc"))], @@ -297,7 +298,7 @@ class TestBuilder(TreeEqualityTestCase): wrap([Tag(wraptext("br"), self_closing=True, invalid=True)])), # [[Source]] + # mno = '{{p}} [[q]] {{r}}'>[[Source]] ([tokens.TagOpenOpen(), tokens.Text(text="ref"), tokens.TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), @@ -307,7 +308,7 @@ class TestBuilder(TreeEqualityTestCase): tokens.TagAttrStart(pad_first=" ", pad_before_eq="", pad_after_eq=""), tokens.Text(text="foo"), tokens.TagAttrEquals(), - tokens.TagAttrQuote(), tokens.Text(text="bar "), + tokens.TagAttrQuote(char='"'), tokens.Text(text="bar "), tokens.TemplateOpen(), tokens.Text(text="baz"), tokens.TemplateClose(), tokens.TagAttrStart(pad_first=" ", pad_before_eq="", 
@@ -325,7 +326,7 @@ class TestBuilder(TreeEqualityTestCase): tokens.TagAttrStart(pad_first=" \n ", pad_before_eq=" ", pad_after_eq=" "), tokens.Text(text="mno"), tokens.TagAttrEquals(), - tokens.TagAttrQuote(), tokens.TemplateOpen(), + tokens.TagAttrQuote(char="'"), tokens.TemplateOpen(), tokens.Text(text="p"), tokens.TemplateClose(), tokens.Text(text=" "), tokens.WikilinkOpen(), tokens.Text(text="q"), tokens.WikilinkClose(), @@ -337,17 +338,17 @@ class TestBuilder(TreeEqualityTestCase): tokens.TagCloseClose()], wrap([Tag(wraptext("ref"), wrap([Wikilink(wraptext("Source"))]), [ Attribute(wraptext("name"), - wrap([Template(wraptext("abc"))]), False), + wrap([Template(wraptext("abc"))]), None), Attribute(wraptext("foo"), wrap([Text("bar "), Template(wraptext("baz"))]), pad_first=" "), Attribute(wraptext("abc"), wrap([Template(wraptext("de")), - Text("f")]), False), + Text("f")]), None), Attribute(wraptext("ghi"), wrap([Text("j"), Template(wraptext("k")), - Template(wraptext("l"))]), False), + Template(wraptext("l"))]), None), Attribute(wraptext("mno"), wrap([Template(wraptext("p")), Text(" "), Wikilink(wraptext("q")), Text(" "), - Template(wraptext("r"))]), True, " \n ", " ", + Template(wraptext("r"))]), "'", " \n ", " ", " ")])])), # "''italic text''" @@ -420,5 +421,22 @@ class TestBuilder(TreeEqualityTestCase): named=True)]))])]) self.assertWikicodeEqual(valid, self.builder.build(test)) + def test_parser_errors(self): + """test whether ParserError gets thrown for bad input""" + missing_closes = [ + [tokens.TemplateOpen(), tokens.TemplateParamSeparator()], + [tokens.TemplateOpen()], [tokens.ArgumentOpen()], + [tokens.WikilinkOpen()], [tokens.ExternalLinkOpen()], + [tokens.HeadingStart()], [tokens.CommentStart()], + [tokens.TagOpenOpen(), tokens.TagAttrStart()], + [tokens.TagOpenOpen()] + ] + + func = self.assertRaisesRegex if py3k else self.assertRaisesRegexp + msg = r"_handle_token\(\) got unexpected TemplateClose" + func(ParserError, msg, self.builder.build, 
[tokens.TemplateClose()]) + for test in missing_closes: + self.assertRaises(ParserError, self.builder.build, test) + if __name__ == "__main__": unittest.main(verbosity=2) diff --git a/tests/test_comment.py b/tests/test_comment.py index cac8719..ad13f4a 100644 --- a/tests/test_comment.py +++ b/tests/test_comment.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_ctokenizer.py b/tests/test_ctokenizer.py index 52427e3..0d37485 100644 --- a/tests/test_ctokenizer.py +++ b/tests/test_ctokenizer.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_docs.py b/tests/test_docs.py index c873f0e..d50e90e 100644 --- a/tests/test_docs.py +++ b/tests/test_docs.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -22,6 +22,7 @@ from __future__ import print_function, unicode_literals import json +import os try: import unittest2 as unittest @@ -111,6 +112,7 @@ class TestDocs(unittest.TestCase): self.assertPrint(text, res) self.assertEqual(text, code) + @unittest.skipIf("NOWEB" in os.environ, "web test disabled by environ var") def test_readme_5(self): """test a block of example code in the README; includes a web call""" url1 = "http://en.wikipedia.org/w/api.php" diff --git a/tests/test_external_link.py b/tests/test_external_link.py index 
c81470e..5137247 100644 --- a/tests/test_external_link.py +++ b/tests/test_external_link.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_heading.py b/tests/test_heading.py index 7c7a7ee..effc03b 100644 --- a/tests/test_heading.py +++ b/tests/test_heading.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_html_entity.py b/tests/test_html_entity.py index eb6f606..a13fd71 100644 --- a/tests/test_html_entity.py +++ b/tests/test_html_entity.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -108,6 +108,7 @@ class TestHTMLEntity(TreeEqualityTestCase): self.assertRaises(ValueError, setattr, node3, "value", -1) self.assertRaises(ValueError, setattr, node1, "value", 110000) self.assertRaises(ValueError, setattr, node1, "value", "1114112") + self.assertRaises(ValueError, setattr, node1, "value", "12FFFF") def test_named(self): """test getter/setter for the named attribute""" @@ -163,10 +164,14 @@ class TestHTMLEntity(TreeEqualityTestCase): node2 = HTMLEntity("107") node3 = HTMLEntity("e9") node4 = HTMLEntity("1f648") + node5 = HTMLEntity("-2") + node6 = HTMLEntity("110000", named=False, hexadecimal=True) self.assertEqual("\xa0", node1.normalize()) self.assertEqual("k", node2.normalize()) self.assertEqual("é", node3.normalize()) 
self.assertEqual("\U0001F648", node4.normalize()) + self.assertRaises(ValueError, node5.normalize) + self.assertRaises(ValueError, node6.normalize) if __name__ == "__main__": unittest.main(verbosity=2) diff --git a/tests/test_parameter.py b/tests/test_parameter.py index ee52b59..71b298c 100644 --- a/tests/test_parameter.py +++ b/tests/test_parameter.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -71,9 +71,10 @@ class TestParameter(TreeEqualityTestCase): self.assertFalse(node1.showkey) self.assertTrue(node2.showkey) node1.showkey = True - node2.showkey = "" self.assertTrue(node1.showkey) - self.assertFalse(node2.showkey) + node1.showkey = "" + self.assertFalse(node1.showkey) + self.assertRaises(ValueError, setattr, node2, "showkey", False) if __name__ == "__main__": unittest.main(verbosity=2) diff --git a/tests/test_parser.py b/tests/test_parser.py index 955f455..6885c37 100644 --- a/tests/test_parser.py +++ b/tests/test_parser.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_pytokenizer.py b/tests/test_pytokenizer.py index 40e2caf..f009c14 100644 --- a/tests/test_pytokenizer.py +++ b/tests/test_pytokenizer.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_roundtripping.py b/tests/test_roundtripping.py new file mode 
100644 index 0000000..5c64535 --- /dev/null +++ b/tests/test_roundtripping.py @@ -0,0 +1,41 @@ +# -*- coding: utf-8 -*- +# +# Copyright (C) 2012-2015 Ben Kurtovic +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
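The TestRoundtripping class below reuses the tokenizer test cases but checks the inverse property: building a tree from the expected tokens and stringifying it must reproduce the input wikitext exactly. A toy illustration of that invariant, using a trivial stand-in tokenizer and builder rather than mwparserfromhell's real ones:

```python
import re

def tokenize(text):
    # Lossless split into "{{" / "}}" markers and runs of other text,
    # mimicking how the real tokenizer preserves every input character.
    return re.findall(r"\{\{|\}\}|[^{}]+|[{}]", text)

def build(tokens):
    # The real Builder assembles a node tree; joining a lossless token
    # stream is enough to demonstrate the roundtrip invariant.
    return "".join(tokens)

def roundtrips(text):
    return build(tokenize(text)) == text
```

Because neither stage discards characters, roundtrips() holds for any input, which is the property the new tests assert for the real parser.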
+ +from __future__ import unicode_literals + +try: + import unittest2 as unittest +except ImportError: + import unittest + +from ._test_tokenizer import TokenizerTestCase + +class TestRoundtripping(TokenizerTestCase, unittest.TestCase): + """Test cases for roundtripping tokens back to wikitext.""" + + @classmethod + def setUpClass(cls): + cls.roundtrip = True + + +if __name__ == "__main__": + unittest.main(verbosity=2) diff --git a/tests/test_smart_list.py b/tests/test_smart_list.py index b739d62..a7106e4 100644 --- a/tests/test_smart_list.py +++ b/tests/test_smart_list.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -88,6 +88,10 @@ class TestSmartList(unittest.TestCase): self.assertEqual([0, 1, 2, 3, 4, 5, 6], list2) self.assertRaises(ValueError, assign, list2, 0, 5, 2, [100, 102, 104, 106]) + with self.assertRaises(IndexError): + list2[7] = "foo" + with self.assertRaises(IndexError): + list2[-8] = "foo" del list2[2] self.assertEqual([0, 1, 3, 4, 5, 6], list2) @@ -271,6 +275,13 @@ class TestSmartList(unittest.TestCase): list3.sort(key=lambda i: i[1], reverse=True) self.assertEqual([("b", 8), ("a", 5), ("c", 3), ("d", 2)], list3) + def _dispatch_test_for_children(self, meth): + """Run a test method on various different types of children.""" + meth(lambda L: SmartList(list(L))[:]) + meth(lambda L: SmartList([999] + list(L))[1:]) + meth(lambda L: SmartList(list(L) + [999])[:-1]) + meth(lambda L: SmartList([101, 102] + list(L) + [201, 202])[2:-2]) + def test_docs(self): """make sure the methods of SmartList/_ListProxy have docstrings""" methods = ["append", "count", "extend", "index", "insert", "pop", @@ -300,8 +311,8 @@ class TestSmartList(unittest.TestCase): """make sure SmartList's add/radd/iadd work""" 
self._test_add_radd_iadd(SmartList) - def test_parent_unaffected_magics(self): - """sanity checks against SmartList features that were not modified""" + def test_parent_other_magics(self): + """make sure SmartList's other magically implemented features work""" self._test_other_magic_methods(SmartList) def test_parent_methods(self): @@ -310,41 +321,29 @@ class TestSmartList(unittest.TestCase): def test_child_get_set_del(self): """make sure _ListProxy's getitem/setitem/delitem work""" - self._test_get_set_del_item(lambda L: SmartList(list(L))[:]) - self._test_get_set_del_item(lambda L: SmartList([999] + list(L))[1:]) - self._test_get_set_del_item(lambda L: SmartList(list(L) + [999])[:-1]) - builder = lambda L: SmartList([101, 102] + list(L) + [201, 202])[2:-2] - self._test_get_set_del_item(builder) + self._dispatch_test_for_children(self._test_get_set_del_item) def test_child_add(self): """make sure _ListProxy's add/radd/iadd work""" - self._test_add_radd_iadd(lambda L: SmartList(list(L))[:]) - self._test_add_radd_iadd(lambda L: SmartList([999] + list(L))[1:]) - self._test_add_radd_iadd(lambda L: SmartList(list(L) + [999])[:-1]) - builder = lambda L: SmartList([101, 102] + list(L) + [201, 202])[2:-2] - self._test_add_radd_iadd(builder) + self._dispatch_test_for_children(self._test_add_radd_iadd) def test_child_other_magics(self): """make sure _ListProxy's other magically implemented features work""" - self._test_other_magic_methods(lambda L: SmartList(list(L))[:]) - self._test_other_magic_methods(lambda L: SmartList([999] + list(L))[1:]) - self._test_other_magic_methods(lambda L: SmartList(list(L) + [999])[:-1]) - builder = lambda L: SmartList([101, 102] + list(L) + [201, 202])[2:-2] - self._test_other_magic_methods(builder) + self._dispatch_test_for_children(self._test_other_magic_methods) def test_child_methods(self): """make sure _ListProxy's non-magic methods work, like append()""" - self._test_list_methods(lambda L: SmartList(list(L))[:]) - 
self._test_list_methods(lambda L: SmartList([999] + list(L))[1:]) - self._test_list_methods(lambda L: SmartList(list(L) + [999])[:-1]) - builder = lambda L: SmartList([101, 102] + list(L) + [201, 202])[2:-2] - self._test_list_methods(builder) + self._dispatch_test_for_children(self._test_list_methods) def test_influence(self): """make sure changes are propagated from parents to children""" parent = SmartList([0, 1, 2, 3, 4, 5]) child1 = parent[2:] child2 = parent[2:5] + self.assertEqual([0, 1, 2, 3, 4, 5], parent) + self.assertEqual([2, 3, 4, 5], child1) + self.assertEqual([2, 3, 4], child2) + self.assertEqual(2, len(parent._children)) parent.append(6) child1.append(7) @@ -390,5 +389,28 @@ class TestSmartList(unittest.TestCase): self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1) self.assertEqual([4, 3, 2, 1.9, 1.8], child2) + child1.detach() + self.assertEqual([1, 4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], parent) + self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1) + self.assertEqual([4, 3, 2, 1.9, 1.8], child2) + self.assertEqual(1, len(parent._children)) + + parent.remove(1.9) + parent.remove(1.8) + self.assertEqual([1, 4, 3, 2, 5, 6, 7, 8, 8.1, 8.2], parent) + self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1) + self.assertEqual([4, 3, 2], child2) + + parent.reverse() + self.assertEqual([8.2, 8.1, 8, 7, 6, 5, 2, 3, 4, 1], parent) + self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1) + self.assertEqual([4, 3, 2], child2) + self.assertEqual(0, len(parent._children)) + + child2.detach() + self.assertEqual([8.2, 8.1, 8, 7, 6, 5, 2, 3, 4, 1], parent) + self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1) + self.assertEqual([4, 3, 2], child2) + if __name__ == "__main__": unittest.main(verbosity=2) diff --git a/tests/test_string_mixin.py b/tests/test_string_mixin.py index bc44f55..09e2e63 100644 --- a/tests/test_string_mixin.py +++ b/tests/test_string_mixin.py @@ -1,6 +1,6 @@ # -*- 
coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_tag.py b/tests/test_tag.py index 111511a..0f0040a 100644 --- a/tests/test_tag.py +++ b/tests/test_tag.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -33,9 +33,10 @@ from mwparserfromhell.nodes.extras import Attribute from ._test_tree_equality import TreeEqualityTestCase, wrap, wraptext agen = lambda name, value: Attribute(wraptext(name), wraptext(value)) -agennq = lambda name, value: Attribute(wraptext(name), wraptext(value), False) -agenp = lambda name, v, a, b, c: Attribute(wraptext(name), v, True, a, b, c) -agenpnv = lambda name, a, b, c: Attribute(wraptext(name), None, True, a, b, c) +agennv = lambda name: Attribute(wraptext(name)) +agennq = lambda name, value: Attribute(wraptext(name), wraptext(value), None) +agenp = lambda name, v, a, b, c: Attribute(wraptext(name), v, '"', a, b, c) +agenpnv = lambda name, a, b, c: Attribute(wraptext(name), None, '"', a, b, c) class TestTag(TreeEqualityTestCase): """Test cases for the Tag node.""" @@ -74,10 +75,10 @@ node1 = Tag(wraptext("ref"), wraptext("foobar")) # '''bold text''' node2 = Tag(wraptext("b"), wraptext("bold text"), wiki_markup="'''") - # <img id="foo" class="bar" /> + # <img id="foo" class="bar" selected /> node3 = Tag(wraptext("img"), - attrs=[Attribute(wraptext("id"), wraptext("foo")), - Attribute(wraptext("class"), wraptext("bar"))], + attrs=[agen("id", "foo"), agen("class", "bar"), + agennv("selected")], self_closing=True, padding=" ") gen1 = node1.__children__() @@ -89,6 +90,7 @@ 
self.assertEqual(node3.attributes[0].value, next(gen3)) self.assertEqual(node3.attributes[1].name, next(gen3)) self.assertEqual(node3.attributes[1].value, next(gen3)) + self.assertEqual(node3.attributes[2].name, next(gen3)) self.assertEqual(node1.contents, next(gen1)) self.assertEqual(node2.contents, next(gen2)) self.assertEqual(node1.closing_tag, next(gen1)) @@ -113,7 +115,8 @@ getter, marker = object(), object() get = lambda code: output.append((getter, code)) mark = lambda: output.append(marker) - node1 = Tag(wraptext("ref"), wraptext("text"), [agen("name", "foo")]) + node1 = Tag(wraptext("ref"), wraptext("text"), + [agen("name", "foo"), agennv("selected")]) node2 = Tag(wraptext("br"), self_closing=True, padding=" ") node3 = Tag(wraptext("br"), self_closing=True, invalid=True, implicit=True, padding=" ") @@ -122,9 +125,10 @@ node3.__showtree__(output.append, get, mark) valid = [ "<", (getter, node1.tag), (getter, node1.attributes[0].name), - " = ", marker, (getter, node1.attributes[0].value), ">", - (getter, node1.contents), "", - "<", (getter, node2.tag), "/>", ""] + " = ", marker, (getter, node1.attributes[0].value), + (getter, node1.attributes[1].name), ">", (getter, node1.contents), + "", "<", (getter, node2.tag), + "/>", ""] self.assertEqual(valid, output) def test_tag(self): @@ -222,6 +226,38 @@ self.assertWikicodeEqual(parsed, node.closing_tag) self.assertEqual("foobar", node) + def test_wiki_style_separator(self): + """test getter/setter for wiki_style_separator attribute""" + node = Tag(wraptext("table"), wraptext("\n")) + self.assertIs(None, node.wiki_style_separator) + node.wiki_style_separator = "|" + self.assertEqual("|", node.wiki_style_separator) + node.wiki_markup = "{" + self.assertEqual("{|\n{", node) + node2 = Tag(wraptext("table"), wraptext("\n"), wiki_style_separator="|") + self.assertEqual("|", node2.wiki_style_separator) + + def 
test_closing_wiki_markup(self): + """test getter/setter for closing_wiki_markup attribute""" + node = Tag(wraptext("table"), wraptext("\n")) + self.assertIs(None, node.closing_wiki_markup) + node.wiki_markup = "{|" + self.assertEqual("{|", node.closing_wiki_markup) + node.closing_wiki_markup = "|}" + self.assertEqual("|}", node.closing_wiki_markup) + self.assertEqual("{|\n|}", node) + node.wiki_markup = "!!" + self.assertEqual("|}", node.closing_wiki_markup) + self.assertEqual("!!\n|}", node) + node.wiki_markup = False + self.assertFalse(node.closing_wiki_markup) + self.assertEqual("<table>\n</table>", node) + node2 = Tag(wraptext("table"), wraptext("\n"), + attrs=[agen("id", "foo")], wiki_markup="{|", + closing_wiki_markup="|}") + self.assertEqual("|}", node2.closing_wiki_markup) + self.assertEqual('{| id="foo"\n|}', node2) + def test_has(self): """test Tag.has()""" node = Tag(wraptext("ref"), wraptext("cite"), [agen("name", "foo")]) @@ -272,28 +308,33 @@ """test Tag.add()""" node = Tag(wraptext("ref"), wraptext("cite")) node.add("name", "value") - node.add("name", "value", quoted=False) + node.add("name", "value", quotes=None) + node.add("name", "value", quotes="'") node.add("name") node.add(1, False) node.add("style", "{{foobar}}") - node.add("name", "value", True, "\n", " ", " ") + node.add("name", "value", '"', "\n", " ", " ") attr1 = ' name="value"' attr2 = " name=value" - attr3 = " name" - attr4 = ' 1="False"' - attr5 = ' style="{{foobar}}"' - attr6 = '\nname = "value"' + attr3 = " name='value'" + attr4 = " name" + attr5 = ' 1="False"' + attr6 = ' style="{{foobar}}"' + attr7 = '\nname = "value"' self.assertEqual(attr1, node.attributes[0]) self.assertEqual(attr2, node.attributes[1]) self.assertEqual(attr3, node.attributes[2]) self.assertEqual(attr4, node.attributes[3]) self.assertEqual(attr5, node.attributes[4]) self.assertEqual(attr6, node.attributes[5]) - self.assertEqual(attr6, node.get("name")) + self.assertEqual(attr7, node.attributes[6]) + self.assertEqual(attr7, node.get("name")) self.assertWikicodeEqual(wrap([Template(wraptext("foobar"))]), - node.attributes[4].value) + node.attributes[5].value) - self.assertEqual("".join(("<ref", attr1, attr2, attr3, attr4, attr5, - attr6, ">cite</ref>")), node) + self.assertEqual("".join(("<ref", attr1, attr2, attr3, attr4, attr5, + attr6, attr7, ">cite</ref>")), node) + self.assertRaises(ValueError, node.add, "name", "foo", quotes="bar") + self.assertRaises(ValueError, node.add, "name", "a bc d", quotes=None) def test_remove(self): """test Tag.remove()""" diff --git a/tests/test_template.py b/tests/test_template.py index 584b02f..7ba3f64 100644 --- a/tests/test_template.py +++ b/tests/test_template.py @@ 
-1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -130,6 +130,8 @@ class TestTemplate(TreeEqualityTestCase): self.assertTrue(node4.has("b", False)) self.assertTrue(node3.has("b", True)) self.assertFalse(node4.has("b", True)) + self.assertFalse(node1.has_param("foobar", False)) + self.assertTrue(node2.has_param(1, False)) def test_get(self): """test Template.get()""" @@ -176,52 +178,41 @@ class TestTemplate(TreeEqualityTestCase): pgens("b ", " c\n"), pgens("\nd ", " e"), pgens("\nf ", "g ")]) node16 = Template(wraptext("a"), [ pgens("\nb ", " c"), pgens("\nd ", " e"), pgens("\nf ", " g")]) - node17 = Template(wraptext("a"), [ - pgens("\nb ", " c"), pgens("\nd ", " e"), pgens("\nf ", " g")]) - node18 = Template(wraptext("a\n"), [ - pgens("b ", "c\n"), pgens("d ", " e"), pgens("f ", "g\n"), - pgens("h ", " i\n")]) - node19 = Template(wraptext("a"), [ - pgens("b ", " c\n"), pgens("\nd ", " e"), pgens("\nf ", "g ")]) - node20 = Template(wraptext("a"), [ - pgens("\nb ", " c"), pgens("\nd ", " e"), pgens("\nf ", " g")]) - node21 = Template(wraptext("a"), [pgenh("1", "b")]) - node22 = Template(wraptext("a"), [pgenh("1", "b")]) - node23 = Template(wraptext("a"), [pgenh("1", "b")]) - node24 = Template(wraptext("a"), [pgenh("1", "b"), pgenh("2", "c"), + node17 = Template(wraptext("a"), [pgenh("1", "b")]) + node18 = Template(wraptext("a"), [pgenh("1", "b")]) + node19 = Template(wraptext("a"), [pgenh("1", "b")]) + node20 = Template(wraptext("a"), [pgenh("1", "b"), pgenh("2", "c"), pgenh("3", "d"), pgenh("4", "e")]) - node25 = Template(wraptext("a"), [pgenh("1", "b"), pgenh("2", "c"), + node21 = Template(wraptext("a"), [pgenh("1", "b"), pgenh("2", "c"), pgens("4", "d"), pgens("5", "e")]) - node26 = Template(wraptext("a"), [pgenh("1", "b"), pgenh("2", 
"c"), + node22 = Template(wraptext("a"), [pgenh("1", "b"), pgenh("2", "c"), pgens("4", "d"), pgens("5", "e")]) + node23 = Template(wraptext("a"), [pgenh("1", "b")]) + node24 = Template(wraptext("a"), [pgenh("1", "b")]) + node25 = Template(wraptext("a"), [pgens("b", "c")]) + node26 = Template(wraptext("a"), [pgenh("1", "b")]) node27 = Template(wraptext("a"), [pgenh("1", "b")]) - node28 = Template(wraptext("a"), [pgenh("1", "b")]) - node29 = Template(wraptext("a"), [pgens("b", "c")]) - node30 = Template(wraptext("a"), [pgenh("1", "b")]) - node31 = Template(wraptext("a"), [pgenh("1", "b")]) - node32 = Template(wraptext("a"), [pgens("1", "b")]) - node33 = Template(wraptext("a"), [ + node28 = Template(wraptext("a"), [pgens("1", "b")]) + node29 = Template(wraptext("a"), [ pgens("\nb ", " c"), pgens("\nd ", " e"), pgens("\nf ", " g")]) - node34 = Template(wraptext("a\n"), [ + node30 = Template(wraptext("a\n"), [ pgens("b ", "c\n"), pgens("d ", " e"), pgens("f ", "g\n"), pgens("h ", " i\n")]) - node35 = Template(wraptext("a"), [ + node31 = Template(wraptext("a"), [ pgens("b ", " c\n"), pgens("\nd ", " e"), pgens("\nf ", "g ")]) - node36 = Template(wraptext("a"), [ + node32 = Template(wraptext("a"), [ pgens("\nb ", " c "), pgens("\nd ", " e "), pgens("\nf ", " g ")]) - node37 = Template(wraptext("a"), [pgens("b", "c"), pgens("d", "e"), - pgens("b", "f"), pgens("b", "h"), - pgens("i", "j")]) - node37 = Template(wraptext("a"), [pgens("b", "c"), pgens("d", "e"), + node33 = Template(wraptext("a"), [pgens("b", "c"), pgens("d", "e"), pgens("b", "f"), pgens("b", "h"), pgens("i", "j")]) - node38 = Template(wraptext("a"), [pgens("1", "b"), pgens("x", "y"), + node34 = Template(wraptext("a"), [pgens("1", "b"), pgens("x", "y"), pgens("1", "c"), pgens("2", "d")]) - node39 = Template(wraptext("a"), [pgens("1", "b"), pgens("x", "y"), + node35 = Template(wraptext("a"), [pgens("1", "b"), pgens("x", "y"), pgenh("1", "c"), pgenh("2", "d")]) - node40 = Template(wraptext("a"), [pgens("b", "c"), 
pgens("d", "e"), + node36 = Template(wraptext("a"), [pgens("b", "c"), pgens("d", "e"), pgens("f", "g")]) - node41 = Template(wraptext("a"), [pgenh("1", "")]) + node37 = Template(wraptext("a"), [pgenh("1", "")]) + node38 = Template(wraptext("abc")) node1.add("e", "f", showkey=True) node2.add(2, "g", showkey=False) @@ -241,31 +232,29 @@ class TestTemplate(TreeEqualityTestCase): node14.add("j", "k", showkey=True) node15.add("h", "i", showkey=True) node16.add("h", "i", showkey=True, preserve_spacing=False) - node17.add("h", "i", showkey=False) - node18.add("j", "k", showkey=False) - node19.add("h", "i", showkey=False) - node20.add("h", "i", showkey=False, preserve_spacing=False) - node21.add("2", "c") - node22.add("3", "c") - node23.add("c", "d") - node24.add("5", "f") - node25.add("3", "f") - node26.add("6", "f") - node27.add("c", "foo=bar") - node28.add("2", "foo=bar") - node29.add("b", "d") - node30.add("1", "foo=bar") - node31.add("1", "foo=bar", showkey=True) - node32.add("1", "foo=bar", showkey=False) - node33.add("d", "foo") - node34.add("f", "foo") - node35.add("f", "foo") - node36.add("d", "foo", preserve_spacing=False) - node37.add("b", "k") - node38.add("1", "e") - node39.add("1", "e") - node40.add("d", "h", before="b") - node41.add(1, "b") + node17.add("2", "c") + node18.add("3", "c") + node19.add("c", "d") + node20.add("5", "f") + node21.add("3", "f") + node22.add("6", "f") + node23.add("c", "foo=bar") + node24.add("2", "foo=bar") + node25.add("b", "d") + node26.add("1", "foo=bar") + node27.add("1", "foo=bar", showkey=True) + node28.add("1", "foo=bar", showkey=False) + node29.add("d", "foo") + node30.add("f", "foo") + node31.add("f", "foo") + node32.add("d", "foo", preserve_spacing=False) + node33.add("b", "k") + node34.add("1", "e") + node35.add("1", "e") + node36.add("d", "h", before="b") + node37.add(1, "b") + node38.add("1", "foo") + self.assertRaises(ValueError, node38.add, "z", "bar", showkey=False) self.assertEqual("{{a|b=c|d|e=f}}", node1) 
self.assertEqual("{{a|b=c|d|g}}", node2) @@ -285,34 +274,31 @@ class TestTemplate(TreeEqualityTestCase): self.assertEqual("{{a\n|b =c\n|d = e|f =g\n|h = i\n|j =k\n}}", node14) self.assertEqual("{{a|b = c\n|\nd = e|\nf =g |h =i}}", node15) self.assertEqual("{{a|\nb = c|\nd = e|\nf = g|h=i}}", node16) - self.assertEqual("{{a|\nb = c|\nd = e|\nf = g| i}}", node17) - self.assertEqual("{{a\n|b =c\n|d = e|f =g\n|h = i\n|k\n}}", node18) - self.assertEqual("{{a|b = c\n|\nd = e|\nf =g |i}}", node19) - self.assertEqual("{{a|\nb = c|\nd = e|\nf = g|i}}", node20) - self.assertEqual("{{a|b|c}}", node21) - self.assertEqual("{{a|b|3=c}}", node22) - self.assertEqual("{{a|b|c=d}}", node23) - self.assertEqual("{{a|b|c|d|e|f}}", node24) - self.assertEqual("{{a|b|c|4=d|5=e|f}}", node25) - self.assertEqual("{{a|b|c|4=d|5=e|6=f}}", node26) - self.assertEqual("{{a|b|c=foo=bar}}", node27) - self.assertEqual("{{a|b|foo=bar}}", node28) - self.assertIsInstance(node28.params[1].value.get(1), HTMLEntity) - self.assertEqual("{{a|b=d}}", node29) - self.assertEqual("{{a|foo=bar}}", node30) - self.assertIsInstance(node30.params[0].value.get(1), HTMLEntity) - self.assertEqual("{{a|1=foo=bar}}", node31) - self.assertEqual("{{a|foo=bar}}", node32) - self.assertIsInstance(node32.params[0].value.get(1), HTMLEntity) - self.assertEqual("{{a|\nb = c|\nd = foo|\nf = g}}", node33) - self.assertEqual("{{a\n|b =c\n|d = e|f =foo\n|h = i\n}}", node34) - self.assertEqual("{{a|b = c\n|\nd = e|\nf =foo }}", node35) - self.assertEqual("{{a|\nb = c |\nd =foo|\nf = g }}", node36) - self.assertEqual("{{a|b=k|d=e|i=j}}", node37) - self.assertEqual("{{a|1=e|x=y|2=d}}", node38) - self.assertEqual("{{a|x=y|e|d}}", node39) - self.assertEqual("{{a|b=c|d=h|f=g}}", node40) - self.assertEqual("{{a|b}}", node41) + self.assertEqual("{{a|b|c}}", node17) + self.assertEqual("{{a|b|3=c}}", node18) + self.assertEqual("{{a|b|c=d}}", node19) + self.assertEqual("{{a|b|c|d|e|f}}", node20) + self.assertEqual("{{a|b|c|4=d|5=e|f}}", node21) 
+ self.assertEqual("{{a|b|c|4=d|5=e|6=f}}", node22) + self.assertEqual("{{a|b|c=foo=bar}}", node23) + self.assertEqual("{{a|b|foo=bar}}", node24) + self.assertIsInstance(node24.params[1].value.get(1), HTMLEntity) + self.assertEqual("{{a|b=d}}", node25) + self.assertEqual("{{a|foo=bar}}", node26) + self.assertIsInstance(node26.params[0].value.get(1), HTMLEntity) + self.assertEqual("{{a|1=foo=bar}}", node27) + self.assertEqual("{{a|foo=bar}}", node28) + self.assertIsInstance(node28.params[0].value.get(1), HTMLEntity) + self.assertEqual("{{a|\nb = c|\nd = foo|\nf = g}}", node29) + self.assertEqual("{{a\n|b =c\n|d = e|f =foo\n|h = i\n}}", node30) + self.assertEqual("{{a|b = c\n|\nd = e|\nf =foo }}", node31) + self.assertEqual("{{a|\nb = c |\nd =foo|\nf = g }}", node32) + self.assertEqual("{{a|b=k|d=e|i=j}}", node33) + self.assertEqual("{{a|1=e|x=y|2=d}}", node34) + self.assertEqual("{{a|x=y|e|d}}", node35) + self.assertEqual("{{a|b=c|d=h|f=g}}", node36) + self.assertEqual("{{a|b}}", node37) + self.assertEqual("{{abc|foo}}", node38) def test_remove(self): """test Template.remove()""" @@ -373,6 +359,8 @@ class TestTemplate(TreeEqualityTestCase): node26 = Template(wraptext("foo"), [ pgens("a", "b"), pgens("c", "d"), pgens("e", "f"), pgens("a", "b"), pgens("a", "b")]) + node27 = Template(wraptext("foo"), [pgenh("1", "bar")]) + node28 = Template(wraptext("foo"), [pgenh("1", "bar")]) node2.remove("1") node2.remove("abc") @@ -430,6 +418,7 @@ class TestTemplate(TreeEqualityTestCase): self.assertEqual("{{foo|a=|c=d|e=f|a=b|a=b}}", node24) self.assertEqual("{{foo|a=b|c=d|e=f|a=b}}", node25) self.assertEqual("{{foo|a=b|c=d|e=f|a=|a=b}}", node26) + self.assertRaises(ValueError, node27.remove, node28.get(1)) if __name__ == "__main__": unittest.main(verbosity=2) diff --git a/tests/test_text.py b/tests/test_text.py index ee2e5c7..9093824 100644 --- a/tests/test_text.py +++ b/tests/test_text.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# 
Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_tokens.py b/tests/test_tokens.py index 3efce86..98f9a56 100644 --- a/tests/test_tokens.py +++ b/tests/test_tokens.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_utils.py b/tests/test_utils.py index ddcc078..a9d4119 100644 --- a/tests/test_utils.py +++ b/tests/test_utils.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/test_wikicode.py b/tests/test_wikicode.py index 9ff5949..d97830c 100644 --- a/tests/test_wikicode.py +++ b/tests/test_wikicode.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal @@ -188,6 +188,13 @@ class TestWikicode(TreeEqualityTestCase): func("is {{some", "cd", recursive=True) self.assertEqual(expected[5], code6) + code7 = parse("{{foo}}{{bar}}{{baz}}{{foo}}{{baz}}") + func = partial(meth, code7) + obj = wrap([code7.get(0), code7.get(2)]) + self.assertRaises(ValueError, func, obj, "{{lol}}") + func("{{foo}}{{baz}}", "{{lol}}") + self.assertEqual(expected[6], code7) + def test_insert_before(self): """test Wikicode.insert_before()""" meth = lambda code, *args, **kw: code.insert_before(*args, 
**kw) @@ -197,7 +204,8 @@ class TestWikicode(TreeEqualityTestCase): "{{a|x{{b}}|{{c|d=y{{f}}}}}}", "{{a}}w{{b}}{{c}}x{{d}}{{e}}{{f}}{{g}}{{h}}yz{{i}}{{j}}", "{{a|x{{b}}{{c}}|{{f|{{g}}=y{{h}}{{i}}}}}}", - "here cdis {{some abtext and a {{template}}}}"] + "here cdis {{some abtext and a {{template}}}}", + "{{foo}}{{bar}}{{baz}}{{lol}}{{foo}}{{baz}}"] self._test_search(meth, expected) def test_insert_after(self): @@ -209,7 +217,8 @@ class TestWikicode(TreeEqualityTestCase): "{{a|{{b}}x|{{c|d={{f}}y}}}}", "{{a}}{{b}}{{c}}w{{d}}{{e}}x{{f}}{{g}}{{h}}{{i}}{{j}}yz", "{{a|{{b}}{{c}}x|{{f|{{g}}={{h}}{{i}}y}}}}", - "here is {{somecd text andab a {{template}}}}"] + "here is {{somecd text andab a {{template}}}}", + "{{foo}}{{bar}}{{baz}}{{foo}}{{baz}}{{lol}}"] self._test_search(meth, expected) def test_replace(self): @@ -218,7 +227,7 @@ class TestWikicode(TreeEqualityTestCase): expected = [ "{{a}}xz[[y]]{{e}}", "dcdffe", "{{a|x|{{c|d=y}}}}", "{{a}}wx{{f}}{{g}}z", "{{a|x|{{f|{{g}}=y}}}}", - "here cd ab a {{template}}}}"] + "here cd ab a {{template}}}}", "{{foo}}{{bar}}{{baz}}{{lol}}"] self._test_search(meth, expected) def test_append(self): @@ -238,8 +247,8 @@ class TestWikicode(TreeEqualityTestCase): meth = lambda code, obj, value, **kw: code.remove(obj, **kw) expected = [ "{{a}}{{c}}", "", "{{a||{{c|d=}}}}", "{{a}}{{f}}", - "{{a||{{f|{{g}}=}}}}", "here a {{template}}}}" - ] + "{{a||{{f|{{g}}=}}}}", "here a {{template}}}}", + "{{foo}}{{bar}}{{baz}}"] self._test_search(meth, expected) def test_matches(self): @@ -319,11 +328,14 @@ class TestWikicode(TreeEqualityTestCase): self.assertEqual(["{{baz}}", "{{bz}}"], func(matches=r"^{{b.*?z")) self.assertEqual(["{{baz}}"], func(matches=r"^{{b.+?z}}")) - self.assertEqual(["{{a|{{b}}|{{c|d={{f}}{{h}}}}}}"], - code2.filter_templates(recursive=False)) - self.assertEqual(["{{a|{{b}}|{{c|d={{f}}{{h}}}}}}", "{{b}}", - "{{c|d={{f}}{{h}}}}", "{{f}}", "{{h}}"], - code2.filter_templates(recursive=True)) + exp_rec = 
["{{a|{{b}}|{{c|d={{f}}{{h}}}}}}", "{{b}}", + "{{c|d={{f}}{{h}}}}", "{{f}}", "{{h}}"] + exp_unrec = ["{{a|{{b}}|{{c|d={{f}}{{h}}}}}}"] + self.assertEqual(exp_rec, code2.filter_templates()) + self.assertEqual(exp_unrec, code2.filter_templates(recursive=False)) + self.assertEqual(exp_rec, code2.filter_templates(recursive=True)) + self.assertEqual(exp_rec, code2.filter_templates(True)) + self.assertEqual(exp_unrec, code2.filter_templates(False)) self.assertEqual(["{{foobar}}"], code3.filter_templates( matches=lambda node: node.name.matches("Foobar"))) @@ -332,9 +344,15 @@ class TestWikicode(TreeEqualityTestCase): self.assertEqual([], code3.filter_tags(matches=r"^{{b.*?z")) self.assertEqual([], code3.filter_tags(matches=r"^{{b.*?z", flags=0)) - self.assertRaises(TypeError, code.filter_templates, 100) self.assertRaises(TypeError, code.filter_templates, a=42) self.assertRaises(TypeError, code.filter_templates, forcetype=Template) + self.assertRaises(TypeError, code.filter_templates, 1, 0, 0, Template) + + code4 = parse("{{foo}}{{foo|{{bar}}}}") + actual1 = code4.filter_templates(recursive=code4.RECURSE_OTHERS) + actual2 = code4.filter_templates(code4.RECURSE_OTHERS) + self.assertEqual(["{{foo}}", "{{foo|{{bar}}}}"], actual1) + self.assertEqual(["{{foo}}", "{{foo|{{bar}}}}"], actual2) def test_get_sections(self): """test Wikicode.get_sections()""" diff --git a/tests/test_wikilink.py b/tests/test_wikilink.py index 1bdc907..e95cd84 100644 --- a/tests/test_wikilink.py +++ b/tests/test_wikilink.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- # -# Copyright (C) 2012-2014 Ben Kurtovic +# Copyright (C) 2012-2015 Ben Kurtovic # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal diff --git a/tests/tokenizer/comments.mwtest b/tests/tokenizer/comments.mwtest index ea2e89f..4bf82a9 100644 --- a/tests/tokenizer/comments.mwtest +++ b/tests/tokenizer/comments.mwtest @@ -37,3 +37,10 @@ 
name: incomplete_partial_close label: a comment that doesn't close, with a partial close input: "bingo" +output: [ExternalLinkOpen(brackets=False), Text(text="http://example.com/foo."), TemplateOpen(), Text(text="bar"), TemplateClose(), Text(text="baz.&biz;"), CommentStart(), Text(text="hello"), CommentEnd(), Text(text="bingo"), ExternalLinkClose()] + +--- + +name: newline_and_comment_in_template_name +label: a template name containing a newline followed by a comment +input: "{{foobar\n<!-- comment -->}}" +output: [TemplateOpen(), Text(text="foobar\n"), CommentStart(), Text(text=" comment "), CommentEnd(), TemplateClose()] + +--- + +name: newline_and_comment_in_template_name_2 +label: a template name containing a newline followed by a comment +input: "{{foobar\n<!-- comment -->|key=value}}" +output: [TemplateOpen(), Text(text="foobar\n"), CommentStart(), Text(text=" comment "), CommentEnd(), TemplateParamSeparator(), Text(text="key"), TemplateParamEquals(), Text(text="value"), TemplateClose()] + +--- + +name: newline_and_comment_in_template_name_3 +label: a template name containing a newline followed by a comment +input: "{{foobar\n<!-- comment -->\n|key=value}}" +output: [TemplateOpen(), Text(text="foobar\n"), CommentStart(), Text(text=" comment "), CommentEnd(), Text(text="\n"), TemplateParamSeparator(), Text(text="key"), TemplateParamEquals(), Text(text="value"), TemplateClose()] + +--- + +name: newline_and_comment_in_template_name_4 +label: a template name containing a newline followed by a comment +input: "{{foobar\n<!-- comment -->invalid|key=value}}" +output: [Text(text="{{foobar\n"), CommentStart(), Text(text=" comment "), CommentEnd(), Text(text="invalid|key=value}}")] + +--- + +name: newline_and_comment_in_template_name_5 +label: a template name containing a newline followed by a comment +input: "{{foobar\n<!-- comment -->\ninvalid|key=value}}" +output: [Text(text="{{foobar\n"), CommentStart(), Text(text=" comment "), CommentEnd(), Text(text="\ninvalid|key=value}}")] + +--- + +name: newline_and_comment_in_template_name_6 +label: a
template name containing a newline followed by a comment +input: "{{foobar\n