@@ -1,3 +1,25 @@
v0.5 (released June 23, 2017):
- Added Wikicode.contains() to determine whether a Node or Wikicode object is
  contained within another Wikicode object.
- Added Wikicode.get_ancestors() and Wikicode.get_parent() to find all
  ancestors and the direct parent of a Node, respectively.
- Fixed a long-standing performance issue with deeply nested, invalid syntax
  (issue #42). The parser should be much faster on certain complex pages. The
  "max cycle" restriction has also been removed, so some situations where
  templates at the end of a page were being skipped are now resolved.
- Made Template.remove(keep_field=True) behave more reasonably when the
  parameter is already empty.
- Added the keep_template_params argument to Wikicode.strip_code(). If True,
  then template parameters will be preserved in the output.
- Wikicode objects can now be pickled properly (fixed infinite recursion error
  on incompletely-constructed StringMixIn subclasses).
- Fixed Wikicode.matches()'s behavior on iterables besides lists and tuples.
- Fixed len() sometimes raising ValueError on empty node lists.
- Fixed a rare parsing bug involving self-closing tags inside the attributes of
  unpaired tags.
- Fixed release script after changes to PyPI.
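
For illustration, a rough sketch of how the new traversal helpers fit together
(the wikitext and the commented results are examples of ours, not part of the
release notes):

    import mwparserfromhell

    code = mwparserfromhell.parse("{{foo|bar {{baz}}}}")
    outer, inner = code.filter_templates()  # {{foo|bar {{baz}}}}, {{baz}}
    code.contains(inner)                    # True
    code.get_parent(inner)                  # the outer {{foo}} template node
    code.get_ancestors(inner)               # [outer] -- shallowest ancestor first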

v0.4.4 (released December 30, 2016):
- Added support for Python 3.6.
@@ -1,4 +1,4 @@
Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com>
Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -113,23 +113,49 @@ saving the page!) by calling ``str()`` on it::

Likewise, use ``unicode(code)`` in Python 2.

Caveats
Limitations
-----------

While the MediaWiki parser generates HTML and has access to the contents of
templates, among other things, mwparserfromhell acts as a direct interface to
the source code only. This has several implications:

* Syntax elements produced by a template transclusion cannot be detected. For
  example, imagine a hypothetical page ``"Template:End-bold"`` that contained
  the text ``</b>``. While MediaWiki would correctly understand that
  ``<b>foobar{{end-bold}}`` translates to ``<b>foobar</b>``, mwparserfromhell
  has no way of examining the contents of ``{{end-bold}}``. Instead, it would
  treat the bold tag as unfinished, possibly extending further down the page.

* Templates adjacent to external links, as in ``http://example.com{{foo}}``,
  are considered part of the link. In reality, this would depend on the
  contents of the template.

* When different syntax elements cross over each other, as in
  ``{{echo|''Hello}}, world!''``, the parser gets confused because this cannot
  be represented by an ordinary syntax tree. Instead, the parser will treat the
  first syntax construct as plain text. In this case, only the italic tag would
  be properly parsed.

  **Workaround:** Since this commonly occurs with text formatting and text
  formatting is often not of interest to users, you may pass
  *skip_style_tags=True* to ``mwparserfromhell.parse()``. This treats ``''``
  and ``'''`` as plain text.

  A future version of mwparserfromhell may include multiple parsing modes to
  get around this restriction more sensibly.

Additionally, the parser lacks awareness of certain wiki-specific settings:

An inherent limitation in wikicode prevents us from generating complete parse
trees in certain cases. For example, the string ``{{echo|''Hello}}, world!''``
produces the valid output ``<i>Hello, world!</i>`` in MediaWiki, assuming
``{{echo}}`` is a template that returns its first parameter. But since
representing this in mwparserfromhell's node tree would be impossible, we
compromise by treating the first node (i.e., the template) as plain text,
parsing only the italics.

* `Word-ending links`_ are not supported, since the linktrail rules are
  language-specific.

The current workaround for cases where you are not interested in text
formatting is to pass ``skip_style_tags=True`` to ``mwparserfromhell.parse()``.
This treats ``''`` and ``'''`` like plain text.

* Localized namespace names aren't recognized, so file links (such as
  ``[[File:...]]``) are treated as regular wikilinks.

A future version of mwparserfromhell will include multiple parsing modes to get
around this restriction.

* Anything that looks like an XML tag is treated as a tag, even if it is not a
  recognized tag name, since the list of valid tags depends on loaded MediaWiki
  extensions.
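
As a rough illustration of the *skip_style_tags* workaround described above
(this wikitext and the commented results are examples of ours, and the node
representations are indicative only)::

    import mwparserfromhell

    text = "{{echo|''Hello}}, world!''"
    mwparserfromhell.parse(text).filter_templates()
    # -> []  (the template is treated as plain text; only the italics parse)
    mwparserfromhell.parse(text, skip_style_tags=True).filter_templates()
    # -> one Template node for {{echo|''Hello}}  (the quotes stay as plain text)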

Integration
-----------

@@ -174,6 +200,7 @@ Python 3 code (via the API_)::

.. _GitHub: https://github.com/earwig/mwparserfromhell
.. _Python Package Index: http://pypi.python.org
.. _get pip: http://pypi.python.org/pypi/pip
.. _Word-ending links: https://www.mediawiki.org/wiki/Help:Links#linktrail
.. _EarwigBot: https://github.com/earwig/earwigbot
.. _Pywikibot: https://www.mediawiki.org/wiki/Manual:Pywikibot
.. _API: http://mediawiki.org/wiki/API
@@ -1,6 +1,6 @@
# This config file is used by appveyor.com to build Windows release binaries

version: 0.4.4-b{build}
version: 0.5-b{build}

branches:
  only:

@@ -52,6 +52,14 @@ environment:
      PYTHON_VERSION: "3.5"
      PYTHON_ARCH: "64"

    - PYTHON: "C:\\Python36"
      PYTHON_VERSION: "3.6"
      PYTHON_ARCH: "32"

    - PYTHON: "C:\\Python36-x64"
      PYTHON_VERSION: "3.6"
      PYTHON_ARCH: "64"

install:
  - "%PIP% install --disable-pip-version-check --user --upgrade pip"
  - "%PIP% install wheel twine"
@@ -1,17 +0,0 @@
Caveats
=======

An inherent limitation in wikicode prevents us from generating complete parse
trees in certain cases. For example, the string ``{{echo|''Hello}}, world!''``
produces the valid output ``<i>Hello, world!</i>`` in MediaWiki, assuming
``{{echo}}`` is a template that returns its first parameter. But since
representing this in mwparserfromhell's node tree would be impossible, we
compromise by treating the first node (i.e., the template) as plain text,
parsing only the italics.

The current workaround for cases where you are not interested in text
formatting is to pass *skip_style_tags=True* to :func:`mwparserfromhell.parse`.
This treats ``''`` and ``'''`` like plain text.

A future version of mwparserfromhell will include multiple parsing modes to get
around this restriction.
@@ -1,6 +1,36 @@
Changelog
=========

v0.5
----

`Released June 23, 2017 <https://github.com/earwig/mwparserfromhell/tree/v0.5>`_
(`changes <https://github.com/earwig/mwparserfromhell/compare/v0.4.4...v0.5>`__):

- Added :meth:`.Wikicode.contains` to determine whether a :class:`.Node` or
  :class:`.Wikicode` object is contained within another :class:`.Wikicode`
  object.
- Added :meth:`.Wikicode.get_ancestors` and :meth:`.Wikicode.get_parent` to
  find all ancestors and the direct parent of a :class:`.Node`, respectively.
- Fixed a long-standing performance issue with deeply nested, invalid syntax
  (`issue #42 <https://github.com/earwig/mwparserfromhell/issues/42>`_). The
  parser should be much faster on certain complex pages. The "max cycle"
  restriction has also been removed, so some situations where templates at the
  end of a page were being skipped are now resolved.
- Made :meth:`Template.remove(keep_field=True) <.Template.remove>` behave more
  reasonably when the parameter is already empty.
- Added the *keep_template_params* argument to :meth:`.Wikicode.strip_code`.
  If *True*, then template parameters will be preserved in the output.
- :class:`.Wikicode` objects can now be pickled properly (fixed infinite
  recursion error on incompletely-constructed :class:`.StringMixIn`
  subclasses).
- Fixed :meth:`.Wikicode.matches`\ 's behavior on iterables besides lists and
  tuples.
- Fixed ``len()`` sometimes raising ``ValueError`` on empty node lists.
- Fixed a rare parsing bug involving self-closing tags inside the attributes of
  unpaired tags.
- Fixed release script after changes to PyPI.
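
As a usage sketch of the new *keep_template_params* option (the template and
the exact whitespace of the results are illustrative, not part of the
changelog)::

    import mwparserfromhell

    code = mwparserfromhell.parse("{{convert|100|km|mi}} from here")
    code.strip_code()                           # roughly " from here"
    code.strip_code(keep_template_params=True)  # roughly "100 km mi from here"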

v0.4.4
------

@@ -31,7 +61,7 @@ v0.4.3

v0.4.2
------

`Released July 30, 2015 <https://github.com/earwig/mwparserfromhell/tree/v0.4.2>`_
`Released July 30, 2015 <https://github.com/earwig/mwparserfromhell/tree/v0.4.2>`__
(`changes <https://github.com/earwig/mwparserfromhell/compare/v0.4.1...v0.4.2>`__):

- Fixed setup script not including header files in releases.

@@ -40,7 +70,7 @@ v0.4.2

v0.4.1
------

`Released July 30, 2015 <https://github.com/earwig/mwparserfromhell/tree/v0.4.1>`_
`Released July 30, 2015 <https://github.com/earwig/mwparserfromhell/tree/v0.4.1>`__
(`changes <https://github.com/earwig/mwparserfromhell/compare/v0.4...v0.4.1>`__):

- The process for building Windows binaries has been fixed, and these should be
@@ -42,7 +42,7 @@ master_doc = 'index'

# General information about the project.
project = u'mwparserfromhell'
copyright = u'2012, 2013, 2014, 2015, 2016 Ben Kurtovic'
copyright = u'2012, 2013, 2014, 2015, 2016, 2017 Ben Kurtovic'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the

@@ -40,7 +40,7 @@ Contents
   :maxdepth: 2

   usage
   caveats
   limitations
   integration
   changelog
   API Reference <api/modules>
@@ -0,0 +1,45 @@
Limitations
===========

While the MediaWiki parser generates HTML and has access to the contents of
templates, among other things, mwparserfromhell acts as a direct interface to
the source code only. This has several implications:

* Syntax elements produced by a template transclusion cannot be detected. For
  example, imagine a hypothetical page ``"Template:End-bold"`` that contained
  the text ``</b>``. While MediaWiki would correctly understand that
  ``<b>foobar{{end-bold}}`` translates to ``<b>foobar</b>``, mwparserfromhell
  has no way of examining the contents of ``{{end-bold}}``. Instead, it would
  treat the bold tag as unfinished, possibly extending further down the page.

* Templates adjacent to external links, as in ``http://example.com{{foo}}``,
  are considered part of the link. In reality, this would depend on the
  contents of the template.

* When different syntax elements cross over each other, as in
  ``{{echo|''Hello}}, world!''``, the parser gets confused because this cannot
  be represented by an ordinary syntax tree. Instead, the parser will treat the
  first syntax construct as plain text. In this case, only the italic tag would
  be properly parsed.

  **Workaround:** Since this commonly occurs with text formatting and text
  formatting is often not of interest to users, you may pass
  *skip_style_tags=True* to ``mwparserfromhell.parse()``. This treats ``''``
  and ``'''`` as plain text.

  A future version of mwparserfromhell may include multiple parsing modes to
  get around this restriction more sensibly.

Additionally, the parser lacks awareness of certain wiki-specific settings:

* `Word-ending links`_ are not supported, since the linktrail rules are
  language-specific.

* Localized namespace names aren't recognized, so file links (such as
  ``[[File:...]]``) are treated as regular wikilinks.

* Anything that looks like an XML tag is treated as a tag, even if it is not a
  recognized tag name, since the list of valid tags depends on loaded MediaWiki
  extensions.

.. _Word-ending links: https://www.mediawiki.org/wiki/Help:Links#linktrail
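
For example, a quick sketch of the namespace and tag caveats above (the
wikitext and the commented results are examples of ours; node representations
are approximate)::

    import mwparserfromhell

    code = mwparserfromhell.parse("[[File:Example.png|thumb]] <myext>hi</myext>")
    code.filter_wikilinks()  # the file link comes back as an ordinary Wikilink
    code.filter_tags()       # <myext> is parsed as a Tag despite being unknown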
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
#
# Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com>
# Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal

@@ -29,7 +29,7 @@ outrageously powerful parser for `MediaWiki <http://mediawiki.org>`_ wikicode.

__author__ = "Ben Kurtovic"
__copyright__ = "Copyright (C) 2012, 2013, 2014, 2015, 2016 Ben Kurtovic"
__license__ = "MIT License"
__version__ = "0.4.4"
__version__ = "0.5"
__email__ = "ben.kurtovic@gmail.com"

from . import (compat, definitions, nodes, parser, smart_list, string_mixin,
@@ -58,7 +58,7 @@ class Node(StringMixIn):
        return
        yield  # pragma: no cover (this is a generator that yields nothing)

    def __strip__(self, normalize, collapse):
    def __strip__(self, **kwargs):
        return None

    def __showtree__(self, write, get, mark):
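
Stripping options now travel from ``Wikicode.strip_code()`` down to each node as
keyword arguments. A minimal, hypothetical sketch of the new contract (the
``Stamp`` class below is ours, not part of the library)::

    from mwparserfromhell.nodes import Node

    class Stamp(Node):
        """Hypothetical custom node, only to illustrate the __strip__ hook."""

        def __unicode__(self):
            return "<stamp/>"

        def __strip__(self, **kwargs):
            # strip_code() forwards options such as normalize, collapse, and
            # keep_template_params; unrecognized ones can simply be ignored.
            return "[stamp]" if kwargs.get("normalize") else None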
@@ -47,9 +47,9 @@ class Argument(Node):
        if self.default is not None:
            yield self.default

    def __strip__(self, normalize, collapse):
    def __strip__(self, **kwargs):
        if self.default is not None:
            return self.default.strip_code(normalize, collapse)
            return self.default.strip_code(**kwargs)
        return None

    def __showtree__(self, write, get, mark):

@@ -49,12 +49,12 @@ class ExternalLink(Node):
        if self.title is not None:
            yield self.title

    def __strip__(self, normalize, collapse):
    def __strip__(self, **kwargs):
        if self.brackets:
            if self.title:
                return self.title.strip_code(normalize, collapse)
                return self.title.strip_code(**kwargs)
            return None
        return self.url.strip_code(normalize, collapse)
        return self.url.strip_code(**kwargs)

    def __showtree__(self, write, get, mark):
        if self.brackets:
@@ -42,8 +42,8 @@ class Heading(Node):
    def __children__(self):
        yield self.title

    def __strip__(self, normalize, collapse):
        return self.title.strip_code(normalize, collapse)
    def __strip__(self, **kwargs):
        return self.title.strip_code(**kwargs)

    def __showtree__(self, write, get, mark):
        write("=" * self.level)

@@ -58,8 +58,8 @@ class HTMLEntity(Node):
            return "&#{0}{1};".format(self.hex_char, self.value)
        return "&#{0};".format(self.value)

    def __strip__(self, normalize, collapse):
        if normalize:
    def __strip__(self, **kwargs):
        if kwargs.get("normalize"):
            return self.normalize()
        return self

@@ -98,9 +98,9 @@ class Tag(Node):
        if not self.self_closing and not self.wiki_markup and self.closing_tag:
            yield self.closing_tag

    def __strip__(self, normalize, collapse):
    def __strip__(self, **kwargs):
        if self.contents and is_visible(self.tag):
            return self.contents.strip_code(normalize, collapse)
            return self.contents.strip_code(**kwargs)
        return None

    def __showtree__(self, write, get, mark):
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
#
# Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com>
# Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
@@ -58,6 +58,12 @@ class Template(Node):
            yield param.name
            yield param.value

    def __strip__(self, **kwargs):
        if kwargs.get("keep_template_params"):
            parts = [param.value.strip_code(**kwargs) for param in self.params]
            return " ".join(part for part in parts if part)
        return None

    def __showtree__(self, write, get, mark):
        write("{{")
        get(self.name)
@@ -70,7 +76,8 @@ class Template(Node):
            get(param.value)
        write("}}")

    def _surface_escape(self, code, char):
    @staticmethod
    def _surface_escape(code, char):
        """Return *code* with *char* escaped as an HTML entity.

        The main use of this is to escape pipes (``|``) or equal signs (``=``)

@@ -82,7 +89,8 @@ class Template(Node):
            if char in node:
                code.replace(node, node.replace(char, replacement), False)

    def _select_theory(self, theories):
    @staticmethod
    def _select_theory(theories):
        """Return the most likely spacing convention given different options.

        Given a dictionary of convention options as keys and their occurrence
@@ -96,6 +104,22 @@ class Template(Node):
        if confidence >= 0.75:
            return tuple(theories.keys())[values.index(best)]

    @staticmethod
    def _blank_param_value(value):
        """Remove the content from *value* while keeping its whitespace.

        Replace *value*\ 's nodes with two text nodes, the first containing
        whitespace from before its content and the second containing whitespace
        from after its content.
        """
        sval = str(value)
        if sval.isspace():
            before, after = "", sval
        else:
            match = re.search(r"^(\s*).*?(\s*)$", sval, FLAGS)
            before, after = match.group(1), match.group(2)
        value.nodes = [Text(before), Text(after)]

    def _get_spacing_conventions(self, use_names):
        """Try to determine the whitespace conventions for parameters.
@@ -112,6 +136,11 @@ class Template(Node):
                component = str(param.value)
            match = re.search(r"^(\s*).*?(\s*)$", component, FLAGS)
            before, after = match.group(1), match.group(2)
            if not use_names and component.isspace() and "\n" in before:
                # If the value is empty, we expect newlines in the whitespace
                # to be after the content, not before it:
                before, after = before.split("\n", 1)
                after = "\n" + after
            before_theories[before] += 1
            after_theories[after] += 1
@@ -119,16 +148,6 @@ class Template(Node):
        after = self._select_theory(after_theories)
        return before, after

    def _blank_param_value(self, value):
        """Remove the content from *value* while keeping its whitespace.

        Replace *value*\ 's nodes with two text nodes, the first containing
        whitespace from before its content and the second containing whitespace
        from after its content.
        """
        match = re.search(r"^(\s*).*?(\s*)$", str(value), FLAGS)
        value.nodes = [Text(match.group(1)), Text(match.group(2))]

    def _fix_dependendent_params(self, i):
        """Unhide keys if necessary after removing the param at index *i*."""
        if not self.params[i].showkey:
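
A short usage sketch tying these helpers to the ``remove(keep_field=True)`` fix
noted in the changelog (the infobox wikitext is an example of ours, and the
surviving whitespace depends on the template's spacing conventions)::

    import mwparserfromhell

    code = mwparserfromhell.parse("{{infobox\n| name  = Foo\n| image = Bar.png\n}}")
    tpl = code.filter_templates()[0]
    tpl.remove("image", keep_field=True)  # blank the value but keep "| image ="
    tpl.remove("name")                    # drop the parameter entirely
    print(code)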
@@ -37,7 +37,7 @@ class Text(Node):
    def __unicode__(self):
        return self.value

    def __strip__(self, normalize, collapse):
    def __strip__(self, **kwargs):
        return self

    def __showtree__(self, write, get, mark):

@@ -46,10 +46,10 @@ class Wikilink(Node):
        if self.text is not None:
            yield self.text

    def __strip__(self, normalize, collapse):
    def __strip__(self, **kwargs):
        if self.text is not None:
            return self.text.strip_code(normalize, collapse)
        return self.title.strip_code(normalize, collapse)
            return self.text.strip_code(**kwargs)
        return self.title.strip_code(**kwargs)

    def __showtree__(self, write, get, mark):
        write("[[")
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
#
# Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com>
# Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal

@@ -100,6 +100,8 @@ Local (stack-specific) contexts:
    * :const:`TABLE_TH_LINE`
    * :const:`TABLE_CELL_LINE_CONTEXTS`

* :const:`HTML_ENTITY`

Global contexts:

* :const:`GL_HEADING`

@@ -176,6 +178,8 @@ TABLE_CELL_LINE_CONTEXTS = TABLE_TD_LINE + TABLE_TH_LINE + TABLE_CELL_STYLE
TABLE = (TABLE_OPEN + TABLE_CELL_OPEN + TABLE_CELL_STYLE + TABLE_ROW_OPEN +
         TABLE_TD_LINE + TABLE_TH_LINE)

HTML_ENTITY = 1 << 37

# Global contexts:

GL_HEADING = 1 << 0
@@ -0,0 +1,795 @@ | |||||
/* | |||||
* avl_tree.c - intrusive, nonrecursive AVL tree data structure (self-balancing | |||||
* binary search tree), implementation file | |||||
* | |||||
* Written in 2014-2016 by Eric Biggers <ebiggers3@gmail.com> | |||||
* Slight changes for compatibility by Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
* | |||||
* To the extent possible under law, the author(s) have dedicated all copyright | |||||
* and related and neighboring rights to this software to the public domain | |||||
* worldwide via the Creative Commons Zero 1.0 Universal Public Domain | |||||
* Dedication (the "CC0"). | |||||
* | |||||
* This software is distributed in the hope that it will be useful, but WITHOUT | |||||
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS | |||||
* FOR A PARTICULAR PURPOSE. See the CC0 for more details. | |||||
* | |||||
* You should have received a copy of the CC0 along with this software; if not | |||||
* see <http://creativecommons.org/publicdomain/zero/1.0/>. | |||||
*/ | |||||
#define false 0 | |||||
#define true 1 | |||||
typedef int bool; | |||||
#include "avl_tree.h" | |||||
/* Returns the left child (sign < 0) or the right child (sign > 0) of the | |||||
* specified AVL tree node. | |||||
* Note: for all calls of this, 'sign' is constant at compilation time, | |||||
* so the compiler can remove the conditional. */ | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_get_child(const struct avl_tree_node *parent, int sign) | |||||
{ | |||||
if (sign < 0) | |||||
return parent->left; | |||||
else | |||||
return parent->right; | |||||
} | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_tree_first_or_last_in_order(const struct avl_tree_node *root, int sign) | |||||
{ | |||||
const struct avl_tree_node *first = root; | |||||
if (first) | |||||
while (avl_get_child(first, +sign)) | |||||
first = avl_get_child(first, +sign); | |||||
return (struct avl_tree_node *)first; | |||||
} | |||||
/* Starts an in-order traversal of the tree: returns the least-valued node, or | |||||
* NULL if the tree is empty. */ | |||||
struct avl_tree_node * | |||||
avl_tree_first_in_order(const struct avl_tree_node *root) | |||||
{ | |||||
return avl_tree_first_or_last_in_order(root, -1); | |||||
} | |||||
/* Starts a *reverse* in-order traversal of the tree: returns the | |||||
* greatest-valued node, or NULL if the tree is empty. */ | |||||
struct avl_tree_node * | |||||
avl_tree_last_in_order(const struct avl_tree_node *root) | |||||
{ | |||||
return avl_tree_first_or_last_in_order(root, 1); | |||||
} | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_tree_next_or_prev_in_order(const struct avl_tree_node *node, int sign) | |||||
{ | |||||
const struct avl_tree_node *next; | |||||
if (avl_get_child(node, +sign)) | |||||
for (next = avl_get_child(node, +sign); | |||||
avl_get_child(next, -sign); | |||||
next = avl_get_child(next, -sign)) | |||||
; | |||||
else | |||||
for (next = avl_get_parent(node); | |||||
next && node == avl_get_child(next, +sign); | |||||
node = next, next = avl_get_parent(next)) | |||||
; | |||||
return (struct avl_tree_node *)next; | |||||
} | |||||
/* Continues an in-order traversal of the tree: returns the next-greatest-valued | |||||
* node, or NULL if there is none. */ | |||||
struct avl_tree_node * | |||||
avl_tree_next_in_order(const struct avl_tree_node *node) | |||||
{ | |||||
return avl_tree_next_or_prev_in_order(node, 1); | |||||
} | |||||
/* Continues a *reverse* in-order traversal of the tree: returns the | |||||
* previous-greatest-valued node, or NULL if there is none. */ | |||||
struct avl_tree_node * | |||||
avl_tree_prev_in_order(const struct avl_tree_node *node) | |||||
{ | |||||
return avl_tree_next_or_prev_in_order(node, -1); | |||||
} | |||||
/* Starts a postorder traversal of the tree. */ | |||||
struct avl_tree_node * | |||||
avl_tree_first_in_postorder(const struct avl_tree_node *root) | |||||
{ | |||||
const struct avl_tree_node *first = root; | |||||
if (first) | |||||
while (first->left || first->right) | |||||
first = first->left ? first->left : first->right; | |||||
return (struct avl_tree_node *)first; | |||||
} | |||||
/* Continues a postorder traversal of the tree. @prev will not be dereferenced as
 * it's allowed that its memory has been freed; @prev_parent must be its saved
 * parent node. Returns NULL if there are no more nodes (i.e. @prev was the
 * root of the tree). */
struct avl_tree_node * | |||||
avl_tree_next_in_postorder(const struct avl_tree_node *prev, | |||||
const struct avl_tree_node *prev_parent) | |||||
{ | |||||
const struct avl_tree_node *next = prev_parent; | |||||
if (next && prev == next->left && next->right) | |||||
for (next = next->right; | |||||
next->left || next->right; | |||||
next = next->left ? next->left : next->right) | |||||
; | |||||
return (struct avl_tree_node *)next; | |||||
} | |||||
/* Sets the left child (sign < 0) or the right child (sign > 0) of the | |||||
* specified AVL tree node. | |||||
* Note: for all calls of this, 'sign' is constant at compilation time, | |||||
* so the compiler can remove the conditional. */ | |||||
static AVL_INLINE void | |||||
avl_set_child(struct avl_tree_node *parent, int sign, | |||||
struct avl_tree_node *child) | |||||
{ | |||||
if (sign < 0) | |||||
parent->left = child; | |||||
else | |||||
parent->right = child; | |||||
} | |||||
/* Sets the parent and balance factor of the specified AVL tree node. */ | |||||
static AVL_INLINE void | |||||
avl_set_parent_balance(struct avl_tree_node *node, struct avl_tree_node *parent, | |||||
int balance_factor) | |||||
{ | |||||
node->parent_balance = (uintptr_t)parent | (balance_factor + 1); | |||||
} | |||||
/* Sets the parent of the specified AVL tree node. */ | |||||
static AVL_INLINE void | |||||
avl_set_parent(struct avl_tree_node *node, struct avl_tree_node *parent) | |||||
{ | |||||
node->parent_balance = (uintptr_t)parent | (node->parent_balance & 3); | |||||
} | |||||
/* Returns the balance factor of the specified AVL tree node --- that is, the | |||||
* height of its right subtree minus the height of its left subtree. */ | |||||
static AVL_INLINE int | |||||
avl_get_balance_factor(const struct avl_tree_node *node) | |||||
{ | |||||
return (int)(node->parent_balance & 3) - 1; | |||||
} | |||||
/* Adds @amount to the balance factor of the specified AVL tree node. | |||||
* The caller must ensure this still results in a valid balance factor | |||||
* (-1, 0, or 1). */ | |||||
static AVL_INLINE void | |||||
avl_adjust_balance_factor(struct avl_tree_node *node, int amount) | |||||
{ | |||||
node->parent_balance += amount; | |||||
} | |||||
static AVL_INLINE void | |||||
avl_replace_child(struct avl_tree_node **root_ptr, | |||||
struct avl_tree_node *parent, | |||||
struct avl_tree_node *old_child, | |||||
struct avl_tree_node *new_child) | |||||
{ | |||||
if (parent) { | |||||
if (old_child == parent->left) | |||||
parent->left = new_child; | |||||
else | |||||
parent->right = new_child; | |||||
} else { | |||||
*root_ptr = new_child; | |||||
} | |||||
} | |||||
/* | |||||
* Template for performing a single rotation --- | |||||
* | |||||
* sign > 0: Rotate clockwise (right) rooted at A: | |||||
* | |||||
* P? P? | |||||
* | | | |||||
* A B | |||||
* / \ / \ | |||||
* B C? => D? A | |||||
* / \ / \ | |||||
* D? E? E? C? | |||||
* | |||||
* (nodes marked with ? may not exist) | |||||
* | |||||
* sign < 0: Rotate counterclockwise (left) rooted at A: | |||||
* | |||||
* P? P? | |||||
* | | | |||||
* A B | |||||
* / \ / \ | |||||
* C? B => A D? | |||||
* / \ / \ | |||||
* E? D? C? E? | |||||
* | |||||
* This updates pointers but not balance factors! | |||||
*/ | |||||
static AVL_INLINE void | |||||
avl_rotate(struct avl_tree_node ** const root_ptr, | |||||
struct avl_tree_node * const A, const int sign) | |||||
{ | |||||
struct avl_tree_node * const B = avl_get_child(A, -sign); | |||||
struct avl_tree_node * const E = avl_get_child(B, +sign); | |||||
struct avl_tree_node * const P = avl_get_parent(A); | |||||
avl_set_child(A, -sign, E); | |||||
avl_set_parent(A, B); | |||||
avl_set_child(B, +sign, A); | |||||
avl_set_parent(B, P); | |||||
if (E) | |||||
avl_set_parent(E, A); | |||||
avl_replace_child(root_ptr, P, A, B); | |||||
} | |||||
/* | |||||
* Template for performing a double rotation --- | |||||
* | |||||
* sign > 0: Rotate counterclockwise (left) rooted at B, then | |||||
* clockwise (right) rooted at A: | |||||
* | |||||
* P? P? P? | |||||
* | | | | |||||
* A A E | |||||
* / \ / \ / \ | |||||
* B C? => E C? => B A | |||||
* / \ / \ / \ / \ | |||||
* D? E B G? D? F?G? C? | |||||
* / \ / \ | |||||
* F? G? D? F? | |||||
* | |||||
* (nodes marked with ? may not exist) | |||||
* | |||||
* sign < 0: Rotate clockwise (right) rooted at B, then | |||||
* counterclockwise (left) rooted at A: | |||||
* | |||||
* P? P? P? | |||||
* | | | | |||||
* A A E | |||||
* / \ / \ / \ | |||||
* C? B => C? E => A B | |||||
* / \ / \ / \ / \ | |||||
* E D? G? B C? G?F? D? | |||||
* / \ / \ | |||||
* G? F? F? D? | |||||
* | |||||
* Returns a pointer to E and updates balance factors. Except for those | |||||
* two things, this function is equivalent to: | |||||
* avl_rotate(root_ptr, B, -sign); | |||||
* avl_rotate(root_ptr, A, +sign); | |||||
* | |||||
* See comment in avl_handle_subtree_growth() for explanation of balance | |||||
* factor updates. | |||||
*/ | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_do_double_rotate(struct avl_tree_node ** const root_ptr, | |||||
struct avl_tree_node * const B, | |||||
struct avl_tree_node * const A, const int sign) | |||||
{ | |||||
struct avl_tree_node * const E = avl_get_child(B, +sign); | |||||
struct avl_tree_node * const F = avl_get_child(E, -sign); | |||||
struct avl_tree_node * const G = avl_get_child(E, +sign); | |||||
struct avl_tree_node * const P = avl_get_parent(A); | |||||
const int e = avl_get_balance_factor(E); | |||||
avl_set_child(A, -sign, G); | |||||
avl_set_parent_balance(A, E, ((sign * e >= 0) ? 0 : -e)); | |||||
avl_set_child(B, +sign, F); | |||||
avl_set_parent_balance(B, E, ((sign * e <= 0) ? 0 : -e)); | |||||
avl_set_child(E, +sign, A); | |||||
avl_set_child(E, -sign, B); | |||||
avl_set_parent_balance(E, P, 0); | |||||
if (G) | |||||
avl_set_parent(G, A); | |||||
if (F) | |||||
avl_set_parent(F, B); | |||||
avl_replace_child(root_ptr, P, A, E); | |||||
return E; | |||||
} | |||||
/* | |||||
* This function handles the growth of a subtree due to an insertion. | |||||
* | |||||
* @root_ptr | |||||
* Location of the tree's root pointer. | |||||
* | |||||
* @node | |||||
* A subtree that has increased in height by 1 due to an insertion. | |||||
* | |||||
* @parent | |||||
* Parent of @node; must not be NULL. | |||||
* | |||||
* @sign | |||||
* -1 if @node is the left child of @parent; | |||||
* +1 if @node is the right child of @parent. | |||||
* | |||||
* This function will adjust @parent's balance factor, then do a (single | |||||
* or double) rotation if necessary. The return value will be %true if | |||||
* the full AVL tree is now adequately balanced, or %false if the subtree | |||||
* rooted at @parent is now adequately balanced but has increased in | |||||
* height by 1, so the caller should continue up the tree. | |||||
* | |||||
* Note that if %false is returned, no rotation will have been done. | |||||
* Indeed, a single node insertion cannot require that more than one | |||||
* (single or double) rotation be done. | |||||
*/ | |||||
static AVL_INLINE bool | |||||
avl_handle_subtree_growth(struct avl_tree_node ** const root_ptr, | |||||
struct avl_tree_node * const node, | |||||
struct avl_tree_node * const parent, | |||||
const int sign) | |||||
{ | |||||
int old_balance_factor, new_balance_factor; | |||||
old_balance_factor = avl_get_balance_factor(parent); | |||||
if (old_balance_factor == 0) { | |||||
avl_adjust_balance_factor(parent, sign); | |||||
/* @parent is still sufficiently balanced (-1 or +1 | |||||
* balance factor), but must have increased in height. | |||||
* Continue up the tree. */ | |||||
return false; | |||||
} | |||||
new_balance_factor = old_balance_factor + sign; | |||||
if (new_balance_factor == 0) { | |||||
avl_adjust_balance_factor(parent, sign); | |||||
/* @parent is now perfectly balanced (0 balance factor). | |||||
* It cannot have increased in height, so there is | |||||
* nothing more to do. */ | |||||
return true; | |||||
} | |||||
/* @parent is too left-heavy (new_balance_factor == -2) or | |||||
* too right-heavy (new_balance_factor == +2). */ | |||||
/* Test whether @node is left-heavy (-1 balance factor) or | |||||
* right-heavy (+1 balance factor). | |||||
* Note that it cannot be perfectly balanced (0 balance factor) | |||||
* because here we are under the invariant that @node has | |||||
* increased in height due to the insertion. */ | |||||
if (sign * avl_get_balance_factor(node) > 0) { | |||||
/* @node (B below) is heavy in the same direction @parent | |||||
* (A below) is heavy. | |||||
* | |||||
* @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |||||
* The comment, diagram, and equations below assume sign < 0. | |||||
* The other case is symmetric! | |||||
* @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |||||
* | |||||
* Do a clockwise rotation rooted at @parent (A below): | |||||
* | |||||
* A B | |||||
* / \ / \ | |||||
* B C? => D A | |||||
* / \ / \ / \ | |||||
* D E? F? G?E? C? | |||||
* / \ | |||||
* F? G? | |||||
* | |||||
* Before the rotation: | |||||
* balance(A) = -2 | |||||
* balance(B) = -1 | |||||
* Let x = height(C). Then: | |||||
* height(B) = x + 2 | |||||
* height(D) = x + 1 | |||||
* height(E) = x | |||||
* max(height(F), height(G)) = x. | |||||
* | |||||
* After the rotation: | |||||
* height(D) = max(height(F), height(G)) + 1 | |||||
* = x + 1 | |||||
* height(A) = max(height(E), height(C)) + 1 | |||||
* = max(x, x) + 1 = x + 1 | |||||
* balance(B) = 0 | |||||
* balance(A) = 0 | |||||
*/ | |||||
avl_rotate(root_ptr, parent, -sign); | |||||
/* Equivalent to setting @parent's balance factor to 0. */ | |||||
avl_adjust_balance_factor(parent, -sign); /* A */ | |||||
/* Equivalent to setting @node's balance factor to 0. */ | |||||
avl_adjust_balance_factor(node, -sign); /* B */ | |||||
} else { | |||||
/* @node (B below) is heavy in the direction opposite | |||||
* from the direction @parent (A below) is heavy. | |||||
* | |||||
* @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |||||
* The comment, diagram, and equations below assume sign < 0. | |||||
* The other case is symmetric! | |||||
* @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |||||
* | |||||
* Do a counterclockwise rotation rooted at @node (B below),
* then a clockwise rotation rooted at @parent (A below): | |||||
* | |||||
* A A E | |||||
* / \ / \ / \ | |||||
* B C? => E C? => B A | |||||
* / \ / \ / \ / \ | |||||
* D? E B G? D? F?G? C? | |||||
* / \ / \ | |||||
* F? G? D? F? | |||||
* | |||||
* Before the rotation: | |||||
* balance(A) = -2 | |||||
* balance(B) = +1 | |||||
* Let x = height(C). Then: | |||||
* height(B) = x + 2 | |||||
* height(E) = x + 1 | |||||
* height(D) = x | |||||
* max(height(F), height(G)) = x | |||||
* | |||||
* After both rotations: | |||||
* height(A) = max(height(G), height(C)) + 1 | |||||
* = x + 1 | |||||
* balance(A) = balance(E{orig}) >= 0 ? 0 : -balance(E{orig}) | |||||
* height(B) = max(height(D), height(F)) + 1 | |||||
* = x + 1 | |||||
* balance(B) = balance(E{orig}) <= 0 ? 0 : -balance(E{orig})
* | |||||
* height(E) = x + 2 | |||||
* balance(E) = 0 | |||||
*/ | |||||
avl_do_double_rotate(root_ptr, node, parent, -sign); | |||||
} | |||||
/* Height after rotation is unchanged; nothing more to do. */ | |||||
return true; | |||||
} | |||||
/* Rebalance the tree after insertion of the specified node. */ | |||||
void | |||||
avl_tree_rebalance_after_insert(struct avl_tree_node **root_ptr, | |||||
struct avl_tree_node *inserted) | |||||
{ | |||||
struct avl_tree_node *node, *parent; | |||||
bool done; | |||||
inserted->left = NULL; | |||||
inserted->right = NULL; | |||||
node = inserted; | |||||
/* Adjust balance factor of new node's parent. | |||||
* No rotation will need to be done at this level. */ | |||||
parent = avl_get_parent(node); | |||||
if (!parent) | |||||
return; | |||||
if (node == parent->left) | |||||
avl_adjust_balance_factor(parent, -1); | |||||
else | |||||
avl_adjust_balance_factor(parent, +1); | |||||
if (avl_get_balance_factor(parent) == 0) | |||||
/* @parent did not change in height. Nothing more to do. */ | |||||
return; | |||||
/* The subtree rooted at @parent increased in height by 1. */ | |||||
do { | |||||
/* Adjust balance factor of next ancestor. */ | |||||
node = parent; | |||||
parent = avl_get_parent(node); | |||||
if (!parent) | |||||
return; | |||||
/* The subtree rooted at @node has increased in height by 1. */ | |||||
if (node == parent->left) | |||||
done = avl_handle_subtree_growth(root_ptr, node, | |||||
parent, -1); | |||||
else | |||||
done = avl_handle_subtree_growth(root_ptr, node, | |||||
parent, +1); | |||||
} while (!done); | |||||
} | |||||
/* | |||||
* This function handles the shrinkage of a subtree due to a deletion. | |||||
* | |||||
* @root_ptr | |||||
* Location of the tree's root pointer. | |||||
* | |||||
* @parent | |||||
* A node in the tree, exactly one of whose subtrees has decreased | |||||
* in height by 1 due to a deletion. (This includes the case where | |||||
* one of the child pointers has become NULL, since we can consider | |||||
* the "NULL" subtree to have a height of 0.) | |||||
* | |||||
* @sign | |||||
* +1 if the left subtree of @parent has decreased in height by 1; | |||||
* -1 if the right subtree of @parent has decreased in height by 1. | |||||
* | |||||
* @left_deleted_ret | |||||
* If the return value is not NULL, this will be set to %true if the | |||||
* left subtree of the returned node has decreased in height by 1, | |||||
* or %false if the right subtree of the returned node has decreased | |||||
* in height by 1. | |||||
* | |||||
* This function will adjust @parent's balance factor, then do a (single | |||||
* or double) rotation if necessary. The return value will be NULL if | |||||
* the full AVL tree is now adequately balanced, or a pointer to the | |||||
* parent of @parent if @parent is now adequately balanced but has | |||||
* decreased in height by 1. Also in the latter case, *left_deleted_ret | |||||
* will be set. | |||||
*/ | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_handle_subtree_shrink(struct avl_tree_node ** const root_ptr, | |||||
struct avl_tree_node *parent, | |||||
const int sign, | |||||
bool * const left_deleted_ret) | |||||
{ | |||||
struct avl_tree_node *node; | |||||
int old_balance_factor, new_balance_factor; | |||||
old_balance_factor = avl_get_balance_factor(parent); | |||||
if (old_balance_factor == 0) { | |||||
/* Prior to the deletion, the subtree rooted at | |||||
* @parent was perfectly balanced. It's now | |||||
* unbalanced by 1, but that's okay and its height | |||||
* hasn't changed. Nothing more to do. */ | |||||
avl_adjust_balance_factor(parent, sign); | |||||
return NULL; | |||||
} | |||||
new_balance_factor = old_balance_factor + sign; | |||||
if (new_balance_factor == 0) { | |||||
/* The subtree rooted at @parent is now perfectly | |||||
* balanced, whereas before the deletion it was | |||||
* unbalanced by 1. Its height must have decreased | |||||
* by 1. No rotation is needed at this location, | |||||
* but continue up the tree. */ | |||||
avl_adjust_balance_factor(parent, sign); | |||||
node = parent; | |||||
} else { | |||||
/* @parent is too left-heavy (new_balance_factor == -2) or | |||||
* too right-heavy (new_balance_factor == +2). */ | |||||
node = avl_get_child(parent, sign); | |||||
/* The rotations below are similar to those done during | |||||
* insertion (see avl_handle_subtree_growth()), so full | |||||
* comments are not provided. The only new case is the | |||||
* one where @node has a balance factor of 0, and that is | |||||
* commented. */ | |||||
if (sign * avl_get_balance_factor(node) >= 0) { | |||||
avl_rotate(root_ptr, parent, -sign); | |||||
if (avl_get_balance_factor(node) == 0) { | |||||
/* | |||||
* @node (B below) is perfectly balanced. | |||||
* | |||||
* @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |||||
* The comment, diagram, and equations | |||||
* below assume sign < 0. The other case | |||||
* is symmetric! | |||||
* @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |||||
* | |||||
* Do a clockwise rotation rooted at | |||||
* @parent (A below): | |||||
* | |||||
* A B | |||||
* / \ / \ | |||||
* B C? => D A | |||||
* / \ / \ / \ | |||||
* D E F? G?E C? | |||||
* / \ | |||||
* F? G? | |||||
* | |||||
* Before the rotation: | |||||
* balance(A) = -2 | |||||
* balance(B) = 0 | |||||
* Let x = height(C). Then: | |||||
* height(B) = x + 2 | |||||
* height(D) = x + 1 | |||||
* height(E) = x + 1 | |||||
* max(height(F), height(G)) = x. | |||||
* | |||||
* After the rotation: | |||||
* height(D) = max(height(F), height(G)) + 1 | |||||
* = x + 1 | |||||
* height(A) = max(height(E), height(C)) + 1 | |||||
* = max(x + 1, x) + 1 = x + 2 | |||||
* balance(A) = -1 | |||||
* balance(B) = +1 | |||||
*/ | |||||
/* A: -2 => -1 (sign < 0) | |||||
* or +2 => +1 (sign > 0) | |||||
* No change needed --- that's the same as | |||||
* old_balance_factor. */ | |||||
/* B: 0 => +1 (sign < 0) | |||||
* or 0 => -1 (sign > 0) */ | |||||
avl_adjust_balance_factor(node, -sign); | |||||
/* Height is unchanged; nothing more to do. */ | |||||
return NULL; | |||||
} else { | |||||
avl_adjust_balance_factor(parent, -sign); | |||||
avl_adjust_balance_factor(node, -sign); | |||||
} | |||||
} else { | |||||
node = avl_do_double_rotate(root_ptr, node, | |||||
parent, -sign); | |||||
} | |||||
} | |||||
parent = avl_get_parent(node); | |||||
if (parent) | |||||
*left_deleted_ret = (node == parent->left); | |||||
return parent; | |||||
} | |||||
/* Swaps node X, which must have 2 children, with its in-order successor, then | |||||
* unlinks node X. Returns the parent of X just before unlinking, without its | |||||
* balance factor having been updated to account for the unlink. */ | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_tree_swap_with_successor(struct avl_tree_node **root_ptr, | |||||
struct avl_tree_node *X, | |||||
bool *left_deleted_ret) | |||||
{ | |||||
struct avl_tree_node *Y, *ret; | |||||
Y = X->right; | |||||
if (!Y->left) { | |||||
/* | |||||
* P? P? P? | |||||
* | | | | |||||
* X Y Y | |||||
* / \ / \ / \ | |||||
* A Y => A X => A B? | |||||
* / \ / \ | |||||
* (0) B? (0) B? | |||||
* | |||||
* [ X unlinked, Y returned ] | |||||
*/ | |||||
ret = Y; | |||||
*left_deleted_ret = false; | |||||
} else { | |||||
struct avl_tree_node *Q; | |||||
do { | |||||
Q = Y; | |||||
Y = Y->left; | |||||
} while (Y->left); | |||||
/* | |||||
* P? P? P? | |||||
* | | | | |||||
* X Y Y | |||||
* / \ / \ / \ | |||||
* A ... => A ... => A ... | |||||
* | | | | |||||
* Q Q Q | |||||
* / / / | |||||
* Y X B? | |||||
* / \ / \ | |||||
* (0) B? (0) B? | |||||
* | |||||
* | |||||
* [ X unlinked, Q returned ] | |||||
*/ | |||||
Q->left = Y->right; | |||||
if (Q->left) | |||||
avl_set_parent(Q->left, Q); | |||||
Y->right = X->right; | |||||
avl_set_parent(X->right, Y); | |||||
ret = Q; | |||||
*left_deleted_ret = true; | |||||
} | |||||
Y->left = X->left; | |||||
avl_set_parent(X->left, Y); | |||||
Y->parent_balance = X->parent_balance; | |||||
avl_replace_child(root_ptr, avl_get_parent(X), X, Y); | |||||
return ret; | |||||
} | |||||
/* | |||||
* Removes an item from the specified AVL tree. | |||||
* | |||||
* @root_ptr | |||||
* Location of the AVL tree's root pointer. Indirection is needed | |||||
* because the root node may change if the tree needed to be rebalanced | |||||
* because of the deletion or if @node was the root node. | |||||
* | |||||
* @node | |||||
* Pointer to the `struct avl_tree_node' embedded in the item to | |||||
* remove from the tree. | |||||
* | |||||
* Note: This function *only* removes the node and rebalances the tree. | |||||
* It does not free any memory, nor does it do the equivalent of | |||||
* avl_tree_node_set_unlinked(). | |||||
*/ | |||||
void | |||||
avl_tree_remove(struct avl_tree_node **root_ptr, struct avl_tree_node *node) | |||||
{ | |||||
struct avl_tree_node *parent; | |||||
bool left_deleted = false; | |||||
if (node->left && node->right) { | |||||
/* @node is fully internal, with two children. Swap it | |||||
* with its in-order successor (which must exist in the | |||||
* right subtree of @node and can have, at most, a right | |||||
* child), then unlink @node. */ | |||||
parent = avl_tree_swap_with_successor(root_ptr, node, | |||||
&left_deleted); | |||||
/* @parent is now the parent of what was @node's in-order | |||||
* successor. It cannot be NULL, since @node itself was | |||||
* an ancestor of its in-order successor. | |||||
* @left_deleted has been set to %true if @node's | |||||
* in-order successor was the left child of @parent, | |||||
* otherwise %false. */ | |||||
} else { | |||||
struct avl_tree_node *child; | |||||
/* @node is missing at least one child. Unlink it. Set | |||||
* @parent to @node's parent, and set @left_deleted to | |||||
* reflect which child of @parent @node was. Or, if | |||||
* @node was the root node, simply update the root node | |||||
* and return. */ | |||||
child = node->left ? node->left : node->right; | |||||
parent = avl_get_parent(node); | |||||
if (parent) { | |||||
if (node == parent->left) { | |||||
parent->left = child; | |||||
left_deleted = true; | |||||
} else { | |||||
parent->right = child; | |||||
left_deleted = false; | |||||
} | |||||
if (child) | |||||
avl_set_parent(child, parent); | |||||
} else { | |||||
if (child) | |||||
avl_set_parent(child, parent); | |||||
*root_ptr = child; | |||||
return; | |||||
} | |||||
} | |||||
/* Rebalance the tree. */ | |||||
do { | |||||
if (left_deleted) | |||||
parent = avl_handle_subtree_shrink(root_ptr, parent, | |||||
+1, &left_deleted); | |||||
else | |||||
parent = avl_handle_subtree_shrink(root_ptr, parent, | |||||
-1, &left_deleted); | |||||
} while (parent); | |||||
} |
@@ -0,0 +1,363 @@ | |||||
/* | |||||
* avl_tree.h - intrusive, nonrecursive AVL tree data structure (self-balancing | |||||
* binary search tree), header file | |||||
* | |||||
* Written in 2014-2016 by Eric Biggers <ebiggers3@gmail.com> | |||||
* Slight changes for compatibility by Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
* | |||||
* To the extent possible under law, the author(s) have dedicated all copyright | |||||
* and related and neighboring rights to this software to the public domain | |||||
* worldwide via the Creative Commons Zero 1.0 Universal Public Domain | |||||
* Dedication (the "CC0"). | |||||
* | |||||
* This software is distributed in the hope that it will be useful, but WITHOUT | |||||
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS | |||||
* FOR A PARTICULAR PURPOSE. See the CC0 for more details. | |||||
* | |||||
* You should have received a copy of the CC0 along with this software; if not | |||||
* see <http://creativecommons.org/publicdomain/zero/1.0/>. | |||||
*/ | |||||
#ifndef _AVL_TREE_H_ | |||||
#define _AVL_TREE_H_ | |||||
#include <stddef.h> | |||||
#if !defined(_MSC_VER) || (_MSC_VER >= 1600) | |||||
#include <stdint.h> | |||||
#endif | |||||
#ifdef __GNUC__ | |||||
# define AVL_INLINE inline __attribute__((always_inline)) | |||||
#elif defined(_MSC_VER) && (_MSC_VER < 1900) | |||||
# define AVL_INLINE __inline | |||||
#else | |||||
# define AVL_INLINE inline | |||||
#endif | |||||
/* Node in an AVL tree. Embed this in some other data structure. */ | |||||
struct avl_tree_node { | |||||
/* Pointer to left child or NULL */ | |||||
struct avl_tree_node *left; | |||||
/* Pointer to right child or NULL */ | |||||
struct avl_tree_node *right; | |||||
/* Pointer to parent combined with the balance factor. This saves 4 or | |||||
* 8 bytes of memory depending on the CPU architecture. | |||||
* | |||||
* Low 2 bits: One greater than the balance factor of this subtree, | |||||
* which is equal to height(right) - height(left). The mapping is: | |||||
* | |||||
* 00 => -1 | |||||
* 01 => 0 | |||||
* 10 => +1 | |||||
* 11 => undefined | |||||
* | |||||
* The rest of the bits are the pointer to the parent node. It must be | |||||
* 4-byte aligned, and it will be NULL if this is the root node and | |||||
* therefore has no parent. */ | |||||
uintptr_t parent_balance; | |||||
}; | |||||
/* Cast an AVL tree node to the containing data structure. */ | |||||
#define avl_tree_entry(entry, type, member) \ | |||||
((type*) ((char *)(entry) - offsetof(type, member))) | |||||
/* Returns a pointer to the parent of the specified AVL tree node, or NULL if it | |||||
* is already the root of the tree. */ | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_get_parent(const struct avl_tree_node *node) | |||||
{ | |||||
return (struct avl_tree_node *)(node->parent_balance & ~3); | |||||
} | |||||
/* Marks the specified AVL tree node as unlinked from any tree. */ | |||||
static AVL_INLINE void | |||||
avl_tree_node_set_unlinked(struct avl_tree_node *node) | |||||
{ | |||||
node->parent_balance = (uintptr_t)node; | |||||
} | |||||
/* Returns true iff the specified AVL tree node has been marked with | |||||
* avl_tree_node_set_unlinked() and has not subsequently been inserted into a | |||||
* tree. */ | |||||
static AVL_INLINE int | |||||
avl_tree_node_is_unlinked(const struct avl_tree_node *node) | |||||
{ | |||||
return node->parent_balance == (uintptr_t)node; | |||||
} | |||||
/* (Internal use only) */ | |||||
extern void | |||||
avl_tree_rebalance_after_insert(struct avl_tree_node **root_ptr, | |||||
struct avl_tree_node *inserted); | |||||
/* | |||||
* Looks up an item in the specified AVL tree. | |||||
* | |||||
* @root | |||||
* Pointer to the root of the AVL tree. (This can be NULL --- that just | |||||
* means the tree is empty.) | |||||
* | |||||
* @cmp_ctx | |||||
* First argument to pass to the comparison callback. This generally | |||||
* should be a pointer to an object equal to the one being searched for. | |||||
* | |||||
* @cmp | |||||
* Comparison callback. Must return < 0, 0, or > 0 if the first argument | |||||
* is less than, equal to, or greater than the second argument, | |||||
* respectively. The first argument will be @cmp_ctx and the second | |||||
* argument will be a pointer to the AVL tree node of an item in the tree. | |||||
* | |||||
* Returns a pointer to the AVL tree node of the resulting item, or NULL if the | |||||
* item was not found. | |||||
* | |||||
* Example: | |||||
* | |||||
* struct int_wrapper { | |||||
* int data; | |||||
* struct avl_tree_node index_node; | |||||
* }; | |||||
* | |||||
* static int _avl_cmp_int_to_node(const void *intptr, | |||||
* const struct avl_tree_node *nodeptr) | |||||
* { | |||||
* int n1 = *(const int *)intptr; | |||||
* int n2 = avl_tree_entry(nodeptr, struct int_wrapper, index_node)->data; | |||||
* if (n1 < n2) | |||||
* return -1; | |||||
* else if (n1 > n2) | |||||
* return 1; | |||||
* else | |||||
* return 0; | |||||
* } | |||||
* | |||||
* bool contains_int(struct avl_tree_node *root, int n) | |||||
* { | |||||
* struct avl_tree_node *result; | |||||
* | |||||
* result = avl_tree_lookup(root, &n, _avl_cmp_int_to_node); | |||||
* return result ? true : false; | |||||
* } | |||||
*/ | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_tree_lookup(const struct avl_tree_node *root, | |||||
const void *cmp_ctx, | |||||
int (*cmp)(const void *, const struct avl_tree_node *)) | |||||
{ | |||||
const struct avl_tree_node *cur = root; | |||||
while (cur) { | |||||
int res = (*cmp)(cmp_ctx, cur); | |||||
if (res < 0) | |||||
cur = cur->left; | |||||
else if (res > 0) | |||||
cur = cur->right; | |||||
else | |||||
break; | |||||
} | |||||
return (struct avl_tree_node*)cur; | |||||
} | |||||
/* Same as avl_tree_lookup(), but uses a more specific type for the comparison | |||||
* function. Specifically, with this function the item being searched for is | |||||
* expected to be in the same format as those already in the tree, with an | |||||
* embedded 'struct avl_tree_node'. */ | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_tree_lookup_node(const struct avl_tree_node *root, | |||||
const struct avl_tree_node *node, | |||||
int (*cmp)(const struct avl_tree_node *, | |||||
const struct avl_tree_node *)) | |||||
{ | |||||
const struct avl_tree_node *cur = root; | |||||
while (cur) { | |||||
int res = (*cmp)(node, cur); | |||||
if (res < 0) | |||||
cur = cur->left; | |||||
else if (res > 0) | |||||
cur = cur->right; | |||||
else | |||||
break; | |||||
} | |||||
return (struct avl_tree_node*)cur; | |||||
} | |||||
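/* Example for avl_tree_lookup_node() (sketch; reuses the hypothetical
 * 'struct int_wrapper' from the avl_tree_lookup() example above and a
 * node-to-node comparator like the _avl_cmp_ints() shown below):
 *
 *     bool contains_int_node(struct avl_tree_node *root, int n)
 *     {
 *         struct int_wrapper key;
 *         key.data = n;
 *         return avl_tree_lookup_node(root, &key.index_node,
 *                                     _avl_cmp_ints) != NULL;
 *     }
 */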
/* | |||||
* Inserts an item into the specified AVL tree. | |||||
* | |||||
* @root_ptr | |||||
* Location of the AVL tree's root pointer. Indirection is needed because | |||||
* the root node may change as a result of rotations caused by the | |||||
* insertion. Initialize *root_ptr to NULL for an empty tree. | |||||
* | |||||
* @item | |||||
* Pointer to the `struct avl_tree_node' embedded in the item to insert. | |||||
* No members in it need be pre-initialized, although members in the | |||||
* containing structure should be pre-initialized so that @cmp can use them | |||||
* in comparisons. | |||||
* | |||||
* @cmp | |||||
* Comparison callback. Must return < 0, 0, or > 0 if the first argument | |||||
* is less than, equal to, or greater than the second argument, | |||||
* respectively. The first argument will be @item and the second | |||||
* argument will be a pointer to an AVL tree node embedded in some | |||||
* previously-inserted item to which @item is being compared. | |||||
* | |||||
* If no item in the tree is comparatively equal (via @cmp) to @item, inserts | |||||
* @item and returns NULL. Otherwise does nothing and returns a pointer to the | |||||
* AVL tree node embedded in the previously-inserted item which compared equal | |||||
* to @item. | |||||
* | |||||
* Example: | |||||
* | |||||
* struct int_wrapper { | |||||
* int data; | |||||
* struct avl_tree_node index_node; | |||||
* }; | |||||
* | |||||
* #define GET_DATA(i) avl_tree_entry((i), struct int_wrapper, index_node)->data | |||||
* | |||||
* static int _avl_cmp_ints(const struct avl_tree_node *node1, | |||||
* const struct avl_tree_node *node2) | |||||
* { | |||||
* int n1 = GET_DATA(node1); | |||||
* int n2 = GET_DATA(node2); | |||||
* if (n1 < n2) | |||||
* return -1; | |||||
* else if (n1 > n2) | |||||
* return 1; | |||||
* else | |||||
* return 0; | |||||
* } | |||||
* | |||||
* bool insert_int(struct avl_tree_node **root_ptr, int data) | |||||
* { | |||||
* struct int_wrapper *i = malloc(sizeof(struct int_wrapper)); | |||||
* i->data = data; | |||||
* if (avl_tree_insert(root_ptr, &i->index_node, _avl_cmp_ints)) { | |||||
* // Duplicate. | |||||
* free(i); | |||||
* return false; | |||||
* } | |||||
* return true; | |||||
* } | |||||
*/ | |||||
static AVL_INLINE struct avl_tree_node * | |||||
avl_tree_insert(struct avl_tree_node **root_ptr, | |||||
struct avl_tree_node *item, | |||||
int (*cmp)(const struct avl_tree_node *, | |||||
const struct avl_tree_node *)) | |||||
{ | |||||
struct avl_tree_node **cur_ptr = root_ptr, *cur = NULL; | |||||
int res; | |||||
while (*cur_ptr) { | |||||
cur = *cur_ptr; | |||||
res = (*cmp)(item, cur); | |||||
if (res < 0) | |||||
cur_ptr = &cur->left; | |||||
else if (res > 0) | |||||
cur_ptr = &cur->right; | |||||
else | |||||
return cur; | |||||
} | |||||
*cur_ptr = item; | |||||
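/* Link the new leaf to its parent and set its low two bits to 01, i.e. a
 * balance factor of 0 per the encoding described in struct avl_tree_node. */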
item->parent_balance = (uintptr_t)cur | 1; | |||||
avl_tree_rebalance_after_insert(root_ptr, item); | |||||
return NULL; | |||||
} | |||||
/* Removes an item from the specified AVL tree. | |||||
* See implementation for details. */ | |||||
extern void | |||||
avl_tree_remove(struct avl_tree_node **root_ptr, struct avl_tree_node *node); | |||||
/* Nonrecursive AVL tree traversal functions */ | |||||
extern struct avl_tree_node * | |||||
avl_tree_first_in_order(const struct avl_tree_node *root); | |||||
extern struct avl_tree_node * | |||||
avl_tree_last_in_order(const struct avl_tree_node *root); | |||||
extern struct avl_tree_node * | |||||
avl_tree_next_in_order(const struct avl_tree_node *node); | |||||
extern struct avl_tree_node * | |||||
avl_tree_prev_in_order(const struct avl_tree_node *node); | |||||
extern struct avl_tree_node * | |||||
avl_tree_first_in_postorder(const struct avl_tree_node *root); | |||||
extern struct avl_tree_node * | |||||
avl_tree_next_in_postorder(const struct avl_tree_node *prev, | |||||
const struct avl_tree_node *prev_parent); | |||||
/* | |||||
* Iterate through the nodes in an AVL tree in sorted order. | |||||
* You may not modify the tree during the iteration. | |||||
* | |||||
* @child_struct | |||||
* Variable that will receive a pointer to each struct inserted into the | |||||
* tree. | |||||
* @root | |||||
* Root of the AVL tree. | |||||
* @struct_name | |||||
* Type of *child_struct. | |||||
* @struct_member | |||||
* Member of @struct_name type that is the AVL tree node. | |||||
* | |||||
* Example: | |||||
* | |||||
* struct int_wrapper { | |||||
* int data; | |||||
* struct avl_tree_node index_node; | |||||
* }; | |||||
* | |||||
* void print_ints(struct avl_tree_node *root) | |||||
* { | |||||
* struct int_wrapper *i; | |||||
* | |||||
* avl_tree_for_each_in_order(i, root, struct int_wrapper, index_node) | |||||
* printf("%d\n", i->data); | |||||
* } | |||||
*/ | |||||
#define avl_tree_for_each_in_order(child_struct, root, \ | |||||
struct_name, struct_member) \ | |||||
for (struct avl_tree_node *_cur = \ | |||||
avl_tree_first_in_order(root); \ | |||||
_cur && ((child_struct) = \ | |||||
avl_tree_entry(_cur, struct_name, \ | |||||
struct_member), 1); \ | |||||
_cur = avl_tree_next_in_order(_cur)) | |||||
/* | |||||
* Like avl_tree_for_each_in_order(), but uses the reverse order. | |||||
*/ | |||||
#define avl_tree_for_each_in_reverse_order(child_struct, root, \ | |||||
struct_name, struct_member) \ | |||||
for (struct avl_tree_node *_cur = \ | |||||
avl_tree_last_in_order(root); \ | |||||
_cur && ((child_struct) = \ | |||||
avl_tree_entry(_cur, struct_name, \ | |||||
struct_member), 1); \ | |||||
_cur = avl_tree_prev_in_order(_cur)) | |||||
/* | |||||
* Like avl_tree_for_each_in_order(), but iterates through the nodes in | |||||
* postorder, so the current node may be deleted or freed. | |||||
*/ | |||||
#define avl_tree_for_each_in_postorder(child_struct, root, \ | |||||
struct_name, struct_member) \ | |||||
for (struct avl_tree_node *_cur = \ | |||||
avl_tree_first_in_postorder(root), *_parent; \ | |||||
_cur && ((child_struct) = \ | |||||
avl_tree_entry(_cur, struct_name, \ | |||||
struct_member), 1) \ | |||||
&& (_parent = avl_get_parent(_cur), 1); \ | |||||
_cur = avl_tree_next_in_postorder(_cur, _parent)) | |||||
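/*
 * Example (sketch, following the print_ints() example above): freeing every
 * item in a tree, which is safe with the postorder variant because each node
 * is visited only after both of its children:
 *
 *     void free_ints(struct avl_tree_node *root)
 *     {
 *         struct int_wrapper *i;
 *
 *         avl_tree_for_each_in_postorder(i, root, struct int_wrapper, index_node)
 *             free(i);
 *     }
 */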
#endif /* _AVL_TREE_H_ */ |
@@ -1,5 +1,5 @@ | |||||
/* | /* | ||||
Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Permission is hereby granted, free of charge, to any person obtaining a copy of | Permission is hereby granted, free of charge, to any person obtaining a copy of | ||||
this software and associated documentation files (the "Software"), to deal in | this software and associated documentation files (the "Software"), to deal in | ||||
@@ -30,6 +30,8 @@ SOFTWARE. | |||||
#include <structmember.h> | #include <structmember.h> | ||||
#include <bytesobject.h> | #include <bytesobject.h> | ||||
#include "avl_tree.h" | |||||
/* Compatibility macros */ | /* Compatibility macros */ | ||||
#if PY_MAJOR_VERSION >= 3 | #if PY_MAJOR_VERSION >= 3 | ||||
@@ -92,10 +94,16 @@ typedef struct { | |||||
#endif | #endif | ||||
} Textbuffer; | } Textbuffer; | ||||
typedef struct { | |||||
Py_ssize_t head; | |||||
uint64_t context; | |||||
} StackIdent; | |||||
struct Stack { | struct Stack { | ||||
PyObject* stack; | PyObject* stack; | ||||
uint64_t context; | uint64_t context; | ||||
Textbuffer* textbuffer; | Textbuffer* textbuffer; | ||||
StackIdent ident; | |||||
struct Stack* next; | struct Stack* next; | ||||
}; | }; | ||||
typedef struct Stack Stack; | typedef struct Stack Stack; | ||||
@@ -111,6 +119,13 @@ typedef struct { | |||||
#endif | #endif | ||||
} TokenizerInput; | } TokenizerInput; | ||||
typedef struct avl_tree_node avl_tree; | |||||
typedef struct { | |||||
StackIdent id; | |||||
struct avl_tree_node node; | |||||
} route_tree_node; | |||||
typedef struct { | typedef struct { | ||||
PyObject_HEAD | PyObject_HEAD | ||||
TokenizerInput text; /* text to tokenize */ | TokenizerInput text; /* text to tokenize */ | ||||
@@ -118,8 +133,8 @@ typedef struct { | |||||
Py_ssize_t head; /* current position in text */ | Py_ssize_t head; /* current position in text */ | ||||
int global; /* global context */ | int global; /* global context */ | ||||
int depth; /* stack recursion depth */ | int depth; /* stack recursion depth */ | ||||
int cycles; /* total number of stack recursions */ | |||||
int route_state; /* whether a BadRoute has been triggered */ | int route_state; /* whether a BadRoute has been triggered */ | ||||
uint64_t route_context; /* context when the last BadRoute was triggered */ | uint64_t route_context; /* context when the last BadRoute was triggered */ | ||||
avl_tree* bad_routes; /* stack idents for routes known to fail */ | |||||
int skip_style_tags; /* temp fix for the sometimes broken tag parser */ | int skip_style_tags; /* temp fix for the sometimes broken tag parser */ | ||||
} Tokenizer; | } Tokenizer; |
@@ -1,5 +1,5 @@ | |||||
/* | /* | ||||
Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Permission is hereby granted, free of charge, to any person obtaining a copy of | Permission is hereby granted, free of charge, to any person obtaining a copy of | ||||
this software and associated documentation files (the "Software"), to deal in | this software and associated documentation files (the "Software"), to deal in | ||||
@@ -81,6 +81,8 @@ SOFTWARE. | |||||
#define LC_TABLE_TD_LINE 0x0000000800000000 | #define LC_TABLE_TD_LINE 0x0000000800000000 | ||||
#define LC_TABLE_TH_LINE 0x0000001000000000 | #define LC_TABLE_TH_LINE 0x0000001000000000 | ||||
#define LC_HTML_ENTITY 0x0000002000000000 | |||||
/* Global contexts */ | /* Global contexts */ | ||||
#define GL_HEADING 0x1 | #define GL_HEADING 0x1 | ||||
@@ -1,5 +1,5 @@ | |||||
/* | /* | ||||
Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Permission is hereby granted, free of charge, to any person obtaining a copy of | Permission is hereby granted, free of charge, to any person obtaining a copy of | ||||
this software and associated documentation files (the "Software"), to deal in | this software and associated documentation files (the "Software"), to deal in | ||||
@@ -445,6 +445,8 @@ static int Tokenizer_parse_bracketed_uri_scheme(Tokenizer* self) | |||||
Unicode this; | Unicode this; | ||||
int slashes, i; | int slashes, i; | ||||
if (Tokenizer_check_route(self, LC_EXT_LINK_URI) < 0) | |||||
return 0; | |||||
if (Tokenizer_push(self, LC_EXT_LINK_URI)) | if (Tokenizer_push(self, LC_EXT_LINK_URI)) | ||||
return -1; | return -1; | ||||
if (Tokenizer_read(self, 0) == '/' && Tokenizer_read(self, 1) == '/') { | if (Tokenizer_read(self, 0) == '/' && Tokenizer_read(self, 1) == '/') { | ||||
@@ -461,7 +463,7 @@ static int Tokenizer_parse_bracketed_uri_scheme(Tokenizer* self) | |||||
while (1) { | while (1) { | ||||
if (!valid[i]) | if (!valid[i]) | ||||
goto end_of_loop; | goto end_of_loop; | ||||
if (this == valid[i]) | |||||
if (this == (Unicode) valid[i]) | |||||
break; | break; | ||||
i++; | i++; | ||||
} | } | ||||
@@ -517,6 +519,7 @@ static int Tokenizer_parse_free_uri_scheme(Tokenizer* self) | |||||
Unicode chunk; | Unicode chunk; | ||||
Py_ssize_t i; | Py_ssize_t i; | ||||
int slashes, j; | int slashes, j; | ||||
uint64_t new_context; | |||||
if (!scheme_buffer) | if (!scheme_buffer) | ||||
return -1; | return -1; | ||||
@@ -533,7 +536,7 @@ static int Tokenizer_parse_free_uri_scheme(Tokenizer* self) | |||||
FAIL_ROUTE(0); | FAIL_ROUTE(0); | ||||
return 0; | return 0; | ||||
} | } | ||||
} while (chunk != valid[j++]); | |||||
} while (chunk != (Unicode) valid[j++]); | |||||
Textbuffer_write(scheme_buffer, chunk); | Textbuffer_write(scheme_buffer, chunk); | ||||
} | } | ||||
end_of_loop: | end_of_loop: | ||||
@@ -552,7 +555,12 @@ static int Tokenizer_parse_free_uri_scheme(Tokenizer* self) | |||||
return 0; | return 0; | ||||
} | } | ||||
Py_DECREF(scheme); | Py_DECREF(scheme); | ||||
if (Tokenizer_push(self, self->topstack->context | LC_EXT_LINK_URI)) { | |||||
new_context = self->topstack->context | LC_EXT_LINK_URI; | |||||
if (Tokenizer_check_route(self, new_context) < 0) { | |||||
Textbuffer_dealloc(scheme_buffer); | |||||
return 0; | |||||
} | |||||
if (Tokenizer_push(self, new_context)) { | |||||
Textbuffer_dealloc(scheme_buffer); | Textbuffer_dealloc(scheme_buffer); | ||||
return -1; | return -1; | ||||
} | } | ||||
@@ -1000,7 +1008,7 @@ static int Tokenizer_really_parse_entity(Tokenizer* self) | |||||
while (1) { | while (1) { | ||||
if (!valid[j]) | if (!valid[j]) | ||||
FAIL_ROUTE_AND_EXIT() | FAIL_ROUTE_AND_EXIT() | ||||
if (this == valid[j]) | |||||
if (this == (Unicode) valid[j]) | |||||
break; | break; | ||||
j++; | j++; | ||||
} | } | ||||
@@ -1065,11 +1073,14 @@ static int Tokenizer_parse_entity(Tokenizer* self) | |||||
Py_ssize_t reset = self->head; | Py_ssize_t reset = self->head; | ||||
PyObject *tokenlist; | PyObject *tokenlist; | ||||
if (Tokenizer_push(self, 0)) | |||||
if (Tokenizer_check_route(self, LC_HTML_ENTITY) < 0) | |||||
goto on_bad_route; | |||||
if (Tokenizer_push(self, LC_HTML_ENTITY)) | |||||
return -1; | return -1; | ||||
if (Tokenizer_really_parse_entity(self)) | if (Tokenizer_really_parse_entity(self)) | ||||
return -1; | return -1; | ||||
if (BAD_ROUTE) { | if (BAD_ROUTE) { | ||||
on_bad_route: | |||||
RESET_ROUTE(); | RESET_ROUTE(); | ||||
self->head = reset; | self->head = reset; | ||||
if (Tokenizer_emit_char(self, '&')) | if (Tokenizer_emit_char(self, '&')) | ||||
@@ -1537,6 +1548,14 @@ static PyObject* Tokenizer_handle_single_tag_end(Tokenizer* self) | |||||
if (depth == 0) | if (depth == 0) | ||||
break; | break; | ||||
} | } | ||||
is_instance = PyObject_IsInstance(token, TagCloseSelfclose); | |||||
if (is_instance == -1) | |||||
return NULL; | |||||
else if (is_instance == 1) { | |||||
depth--; | |||||
if (depth == 0) // Should never happen | |||||
return NULL; | |||||
} | |||||
} | } | ||||
if (!token || depth > 0) | if (!token || depth > 0) | ||||
return NULL; | return NULL; | ||||
@@ -1574,6 +1593,8 @@ static PyObject* Tokenizer_really_parse_tag(Tokenizer* self) | |||||
if (!data) | if (!data) | ||||
return NULL; | return NULL; | ||||
if (Tokenizer_check_route(self, LC_TAG_OPEN) < 0) | |||||
return NULL; | |||||
if (Tokenizer_push(self, LC_TAG_OPEN)) { | if (Tokenizer_push(self, LC_TAG_OPEN)) { | ||||
TagData_dealloc(data); | TagData_dealloc(data); | ||||
return NULL; | return NULL; | ||||
@@ -2191,14 +2212,18 @@ static PyObject* Tokenizer_handle_table_style(Tokenizer* self, Unicode end_token | |||||
static int Tokenizer_parse_table(Tokenizer* self) | static int Tokenizer_parse_table(Tokenizer* self) | ||||
{ | { | ||||
Py_ssize_t reset = self->head; | Py_ssize_t reset = self->head; | ||||
PyObject *style, *padding; | |||||
PyObject *style, *padding, *trash; | |||||
PyObject *table = NULL; | PyObject *table = NULL; | ||||
StackIdent restore_point; | |||||
self->head += 2; | self->head += 2; | ||||
if(Tokenizer_push(self, LC_TABLE_OPEN)) | |||||
if (Tokenizer_check_route(self, LC_TABLE_OPEN) < 0) | |||||
goto on_bad_route; | |||||
if (Tokenizer_push(self, LC_TABLE_OPEN)) | |||||
return -1; | return -1; | ||||
padding = Tokenizer_handle_table_style(self, '\n'); | padding = Tokenizer_handle_table_style(self, '\n'); | ||||
if (BAD_ROUTE) { | if (BAD_ROUTE) { | ||||
on_bad_route: | |||||
RESET_ROUTE(); | RESET_ROUTE(); | ||||
self->head = reset; | self->head = reset; | ||||
if (Tokenizer_emit_char(self, '{')) | if (Tokenizer_emit_char(self, '{')) | ||||
@@ -2214,11 +2239,16 @@ static int Tokenizer_parse_table(Tokenizer* self) | |||||
} | } | ||||
self->head++; | self->head++; | ||||
restore_point = self->topstack->ident; | |||||
table = Tokenizer_parse(self, LC_TABLE_OPEN, 1); | table = Tokenizer_parse(self, LC_TABLE_OPEN, 1); | ||||
if (BAD_ROUTE) { | if (BAD_ROUTE) { | ||||
RESET_ROUTE(); | RESET_ROUTE(); | ||||
Py_DECREF(padding); | Py_DECREF(padding); | ||||
Py_DECREF(style); | Py_DECREF(style); | ||||
while (!Tokenizer_IS_CURRENT_STACK(self, restore_point)) { | |||||
trash = Tokenizer_pop(self); | |||||
Py_XDECREF(trash); | |||||
} | |||||
self->head = reset; | self->head = reset; | ||||
if (Tokenizer_emit_char(self, '{')) | if (Tokenizer_emit_char(self, '{')) | ||||
return -1; | return -1; | ||||
@@ -2243,7 +2273,7 @@ static int Tokenizer_parse_table(Tokenizer* self) | |||||
*/ | */ | ||||
static int Tokenizer_handle_table_row(Tokenizer* self) | static int Tokenizer_handle_table_row(Tokenizer* self) | ||||
{ | { | ||||
PyObject *padding, *style, *row, *trash; | |||||
PyObject *padding, *style, *row; | |||||
self->head += 2; | self->head += 2; | ||||
if (!Tokenizer_CAN_RECURSE(self)) { | if (!Tokenizer_CAN_RECURSE(self)) { | ||||
@@ -2253,14 +2283,13 @@ static int Tokenizer_handle_table_row(Tokenizer* self) | |||||
return 0; | return 0; | ||||
} | } | ||||
if(Tokenizer_push(self, LC_TABLE_OPEN | LC_TABLE_ROW_OPEN)) | |||||
if (Tokenizer_check_route(self, LC_TABLE_OPEN | LC_TABLE_ROW_OPEN) < 0) | |||||
return 0; | |||||
if (Tokenizer_push(self, LC_TABLE_OPEN | LC_TABLE_ROW_OPEN)) | |||||
return -1; | return -1; | ||||
padding = Tokenizer_handle_table_style(self, '\n'); | padding = Tokenizer_handle_table_style(self, '\n'); | ||||
if (BAD_ROUTE) { | |||||
trash = Tokenizer_pop(self); | |||||
Py_XDECREF(trash); | |||||
if (BAD_ROUTE) | |||||
return 0; | return 0; | ||||
} | |||||
if (!padding) | if (!padding) | ||||
return -1; | return -1; | ||||
style = Tokenizer_pop(self); | style = Tokenizer_pop(self); | ||||
@@ -2319,8 +2348,8 @@ Tokenizer_handle_table_cell(Tokenizer* self, const char *markup, | |||||
if (cell_context & LC_TABLE_CELL_STYLE) { | if (cell_context & LC_TABLE_CELL_STYLE) { | ||||
Py_DECREF(cell); | Py_DECREF(cell); | ||||
self->head = reset; | self->head = reset; | ||||
if(Tokenizer_push(self, LC_TABLE_OPEN | LC_TABLE_CELL_OPEN | | |||||
line_context)) | |||||
if (Tokenizer_push(self, LC_TABLE_OPEN | LC_TABLE_CELL_OPEN | | |||||
line_context)) | |||||
return -1; | return -1; | ||||
padding = Tokenizer_handle_table_style(self, '|'); | padding = Tokenizer_handle_table_style(self, '|'); | ||||
if (!padding) | if (!padding) | ||||
@@ -2541,6 +2570,8 @@ PyObject* Tokenizer_parse(Tokenizer* self, uint64_t context, int push) | |||||
PyObject* temp; | PyObject* temp; | ||||
if (push) { | if (push) { | ||||
if (Tokenizer_check_route(self, context) < 0) | |||||
return NULL; | |||||
if (Tokenizer_push(self, context)) | if (Tokenizer_push(self, context)) | ||||
return NULL; | return NULL; | ||||
} | } | ||||
@@ -1,5 +1,5 @@ | |||||
/* | /* | ||||
Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Permission is hereby granted, free of charge, to any person obtaining a copy of | Permission is hereby granted, free of charge, to any person obtaining a copy of | ||||
this software and associated documentation files (the "Software"), to deal in | this software and associated documentation files (the "Software"), to deal in | ||||
@@ -40,10 +40,11 @@ int Tokenizer_push(Tokenizer* self, uint64_t context) | |||||
top->textbuffer = Textbuffer_new(&self->text); | top->textbuffer = Textbuffer_new(&self->text); | ||||
if (!top->textbuffer) | if (!top->textbuffer) | ||||
return -1; | return -1; | ||||
top->ident.head = self->head; | |||||
top->ident.context = context; | |||||
top->next = self->topstack; | top->next = self->topstack; | ||||
self->topstack = top; | self->topstack = top; | ||||
self->depth++; | self->depth++; | ||||
self->cycles++; | |||||
return 0; | return 0; | ||||
} | } | ||||
@@ -130,20 +131,88 @@ PyObject* Tokenizer_pop_keeping_context(Tokenizer* self) | |||||
} | } | ||||
/* | /* | ||||
Compare two route_tree_nodes that are in their avl_tree_node forms. | |||||
*/ | |||||
static int compare_nodes( | |||||
const struct avl_tree_node* na, const struct avl_tree_node* nb) | |||||
{ | |||||
route_tree_node *a = avl_tree_entry(na, route_tree_node, node); | |||||
route_tree_node *b = avl_tree_entry(nb, route_tree_node, node); | |||||
if (a->id.head < b->id.head) | |||||
return -1; | |||||
if (a->id.head > b->id.head) | |||||
return 1; | |||||
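/* Branchless three-way comparison of the contexts: yields -1, 0, or +1
   without risking overflow from subtracting two uint64_t values. */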
return (a->id.context > b->id.context) - (a->id.context < b->id.context); | |||||
} | |||||
/* | |||||
Fail the current tokenization route. Discards the current | Fail the current tokenization route. Discards the current | ||||
stack/context/textbuffer and sets the BAD_ROUTE flag. | |||||
stack/context/textbuffer and sets the BAD_ROUTE flag. Also records the | |||||
ident of the failed stack so future parsing attempts down this route can be | |||||
stopped early. | |||||
*/ | */ | ||||
void* Tokenizer_fail_route(Tokenizer* self) | void* Tokenizer_fail_route(Tokenizer* self) | ||||
{ | { | ||||
uint64_t context = self->topstack->context; | uint64_t context = self->topstack->context; | ||||
PyObject* stack = Tokenizer_pop(self); | |||||
PyObject* stack; | |||||
route_tree_node *node = malloc(sizeof(route_tree_node)); | |||||
if (node) { | |||||
node->id = self->topstack->ident; | |||||
if (avl_tree_insert(&self->bad_routes, &node->node, compare_nodes)) | |||||
free(node); | |||||
} | |||||
stack = Tokenizer_pop(self); | |||||
Py_XDECREF(stack); | Py_XDECREF(stack); | ||||
FAIL_ROUTE(context); | FAIL_ROUTE(context); | ||||
return NULL; | return NULL; | ||||
} | } | ||||
/* | /* | ||||
Check if pushing a new route here with the given context would definitely | |||||
fail, based on a previous call to Tokenizer_fail_route() with the same | |||||
stack. | |||||
Return 0 if safe and -1 if unsafe. The BAD_ROUTE flag will be set in the | |||||
latter case. | |||||
Calling this function is optional; it exists purely as an optimization.
(The Python tokenizer checks every route on push, but doing the same in the
C tokenizer would add too much overhead, since it would require checking for
a bad route after every call to Tokenizer_push.)
*/ | |||||
int Tokenizer_check_route(Tokenizer* self, uint64_t context) | |||||
{ | |||||
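/* Build a throwaway search key without allocating a route_tree_node: since
   route_tree_node lays out its StackIdent first and its avl_tree_node
   immediately after it, a pointer just past a stack-allocated StackIdent is
   where the embedded node would sit, so avl_tree_entry() inside
   compare_nodes() recovers this ident. */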
StackIdent ident = {self->head, context}; | |||||
struct avl_tree_node *node = (struct avl_tree_node*) (&ident + 1); | |||||
if (avl_tree_lookup_node(self->bad_routes, node, compare_nodes)) { | |||||
FAIL_ROUTE(context); | |||||
return -1; | |||||
} | |||||
return 0; | |||||
} | |||||
/* | |||||
Free the tokenizer's bad route cache tree. Intended to be called by the | |||||
main tokenizer function after parsing is finished. | |||||
*/ | |||||
void Tokenizer_free_bad_route_tree(Tokenizer *self) | |||||
{ | |||||
struct avl_tree_node *cur = avl_tree_first_in_postorder(self->bad_routes); | |||||
struct avl_tree_node *parent; | |||||
while (cur) { | |||||
route_tree_node *node = avl_tree_entry(cur, route_tree_node, node); | |||||
parent = avl_get_parent(cur); | |||||
free(node); | |||||
cur = avl_tree_next_in_postorder(cur, parent); | |||||
} | |||||
self->bad_routes = NULL; | |||||
} | |||||
/* | |||||
Write a token to the current token stack. | Write a token to the current token stack. | ||||
*/ | */ | ||||
int Tokenizer_emit_token(Tokenizer* self, PyObject* token, int first) | int Tokenizer_emit_token(Tokenizer* self, PyObject* token, int first) | ||||
@@ -1,5 +1,5 @@ | |||||
/* | /* | ||||
Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Permission is hereby granted, free of charge, to any person obtaining a copy of | Permission is hereby granted, free of charge, to any person obtaining a copy of | ||||
this software and associated documentation files (the "Software"), to deal in | this software and associated documentation files (the "Software"), to deal in | ||||
@@ -32,6 +32,8 @@ void Tokenizer_delete_top_of_stack(Tokenizer*); | |||||
PyObject* Tokenizer_pop(Tokenizer*); | PyObject* Tokenizer_pop(Tokenizer*); | ||||
PyObject* Tokenizer_pop_keeping_context(Tokenizer*); | PyObject* Tokenizer_pop_keeping_context(Tokenizer*); | ||||
void* Tokenizer_fail_route(Tokenizer*); | void* Tokenizer_fail_route(Tokenizer*); | ||||
int Tokenizer_check_route(Tokenizer*, uint64_t); | |||||
void Tokenizer_free_bad_route_tree(Tokenizer*); | |||||
int Tokenizer_emit_token(Tokenizer*, PyObject*, int); | int Tokenizer_emit_token(Tokenizer*, PyObject*, int); | ||||
int Tokenizer_emit_token_kwargs(Tokenizer*, PyObject*, PyObject*, int); | int Tokenizer_emit_token_kwargs(Tokenizer*, PyObject*, PyObject*, int); | ||||
@@ -47,10 +49,11 @@ Unicode Tokenizer_read_backwards(Tokenizer*, Py_ssize_t); | |||||
/* Macros */ | /* Macros */ | ||||
#define MAX_DEPTH 40 | #define MAX_DEPTH 40 | ||||
#define MAX_CYCLES 100000 | |||||
#define Tokenizer_CAN_RECURSE(self) \ | #define Tokenizer_CAN_RECURSE(self) \ | ||||
(self->depth < MAX_DEPTH && self->cycles < MAX_CYCLES) | |||||
(self->depth < MAX_DEPTH) | |||||
#define Tokenizer_IS_CURRENT_STACK(self, id) \ | |||||
(self->topstack->ident.head == (id).head && \ | |||||
self->topstack->ident.context == (id).context) | |||||
#define Tokenizer_emit(self, token) \ | #define Tokenizer_emit(self, token) \ | ||||
Tokenizer_emit_token(self, token, 0) | Tokenizer_emit_token(self, token, 0) | ||||
@@ -1,5 +1,5 @@ | |||||
/* | /* | ||||
Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
Permission is hereby granted, free of charge, to any person obtaining a copy of | Permission is hereby granted, free of charge, to any person obtaining a copy of | ||||
this software and associated documentation files (the "Software"), to deal in | this software and associated documentation files (the "Software"), to deal in | ||||
@@ -22,6 +22,7 @@ SOFTWARE. | |||||
#include "tokenizer.h" | #include "tokenizer.h" | ||||
#include "tok_parse.h" | #include "tok_parse.h" | ||||
#include "tok_support.h" | |||||
#include "tokens.h" | #include "tokens.h" | ||||
/* Globals */ | /* Globals */ | ||||
@@ -103,8 +104,9 @@ static int Tokenizer_init(Tokenizer* self, PyObject* args, PyObject* kwds) | |||||
return -1; | return -1; | ||||
init_tokenizer_text(&self->text); | init_tokenizer_text(&self->text); | ||||
self->topstack = NULL; | self->topstack = NULL; | ||||
self->head = self->global = self->depth = self->cycles = 0; | |||||
self->head = self->global = self->depth = 0; | |||||
self->route_context = self->route_state = 0; | self->route_context = self->route_state = 0; | ||||
self->bad_routes = NULL; | |||||
self->skip_style_tags = 0; | self->skip_style_tags = 0; | ||||
return 0; | return 0; | ||||
} | } | ||||
@@ -158,10 +160,14 @@ static PyObject* Tokenizer_tokenize(Tokenizer* self, PyObject* args) | |||||
return NULL; | return NULL; | ||||
} | } | ||||
self->head = self->global = self->depth = self->cycles = 0; | |||||
self->head = self->global = self->depth = 0; | |||||
self->skip_style_tags = skip_style_tags; | self->skip_style_tags = skip_style_tags; | ||||
self->bad_routes = NULL; | |||||
tokens = Tokenizer_parse(self, context, 1); | tokens = Tokenizer_parse(self, context, 1); | ||||
Tokenizer_free_bad_route_tree(self); | |||||
if (!tokens || self->topstack) { | if (!tokens || self->topstack) { | ||||
Py_XDECREF(tokens); | Py_XDECREF(tokens); | ||||
if (PyErr_Occurred()) | if (PyErr_Occurred()) | ||||
@@ -1,6 +1,6 @@ | |||||
# -*- coding: utf-8 -*- | # -*- coding: utf-8 -*- | ||||
# | # | ||||
# Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
# Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
# | # | ||||
# Permission is hereby granted, free of charge, to any person obtaining a copy | # Permission is hereby granted, free of charge, to any person obtaining a copy | ||||
# of this software and associated documentation files (the "Software"), to deal | # of this software and associated documentation files (the "Software"), to deal | ||||
@@ -65,7 +65,6 @@ class Tokenizer(object): | |||||
MARKERS = ["{", "}", "[", "]", "<", ">", "|", "=", "&", "'", "#", "*", ";", | MARKERS = ["{", "}", "[", "]", "<", ">", "|", "=", "&", "'", "#", "*", ";", | ||||
":", "/", "-", "!", "\n", START, END] | ":", "/", "-", "!", "\n", START, END] | ||||
MAX_DEPTH = 40 | MAX_DEPTH = 40 | ||||
MAX_CYCLES = 100000 | |||||
regex = re.compile(r"([{}\[\]<>|=&'#*;:/\\\"\-!\n])", flags=re.IGNORECASE) | regex = re.compile(r"([{}\[\]<>|=&'#*;:/\\\"\-!\n])", flags=re.IGNORECASE) | ||||
tag_splitter = re.compile(r"([\s\"\'\\]+)") | tag_splitter = re.compile(r"([\s\"\'\\]+)") | ||||
@@ -75,7 +74,8 @@ class Tokenizer(object): | |||||
self._stacks = [] | self._stacks = [] | ||||
self._global = 0 | self._global = 0 | ||||
self._depth = 0 | self._depth = 0 | ||||
self._cycles = 0 | |||||
self._bad_routes = set() | |||||
self._skip_style_tags = False | |||||
@property | @property | ||||
def _stack(self): | def _stack(self): | ||||
@@ -100,11 +100,24 @@ class Tokenizer(object): | |||||
def _textbuffer(self, value): | def _textbuffer(self, value): | ||||
self._stacks[-1][2] = value | self._stacks[-1][2] = value | ||||
@property | |||||
def _stack_ident(self): | |||||
"""An identifier for the current stack. | |||||
This is based on the starting head position and context. Stacks with | |||||
the same identifier are always parsed in the same way. This can be used | |||||
to cache intermediate parsing info. | |||||
""" | |||||
return self._stacks[-1][3] | |||||
def _push(self, context=0): | def _push(self, context=0): | ||||
"""Add a new token stack, context, and textbuffer to the list.""" | """Add a new token stack, context, and textbuffer to the list.""" | ||||
self._stacks.append([[], context, []]) | |||||
new_ident = (self._head, context) | |||||
if new_ident in self._bad_routes: | |||||
raise BadRoute(context) | |||||
self._stacks.append([[], context, [], new_ident]) | |||||
self._depth += 1 | self._depth += 1 | ||||
self._cycles += 1 | |||||
def _push_textbuffer(self): | def _push_textbuffer(self): | ||||
"""Push the textbuffer onto the stack as a Text node and clear it.""" | """Push the textbuffer onto the stack as a Text node and clear it.""" | ||||
@@ -129,7 +142,7 @@ class Tokenizer(object): | |||||
def _can_recurse(self): | def _can_recurse(self): | ||||
"""Return whether or not our max recursion depth has been exceeded.""" | """Return whether or not our max recursion depth has been exceeded.""" | ||||
return self._depth < self.MAX_DEPTH and self._cycles < self.MAX_CYCLES | |||||
return self._depth < self.MAX_DEPTH | |||||
def _fail_route(self): | def _fail_route(self): | ||||
"""Fail the current tokenization route. | """Fail the current tokenization route. | ||||
@@ -138,6 +151,7 @@ class Tokenizer(object): | |||||
:exc:`.BadRoute`. | :exc:`.BadRoute`. | ||||
""" | """ | ||||
context = self._context | context = self._context | ||||
self._bad_routes.add(self._stack_ident) | |||||
self._pop() | self._pop() | ||||
raise BadRoute(context) | raise BadRoute(context) | ||||
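The route cache introduced above boils down to memoizing failures by (head, context). A minimal standalone sketch of the idea (names are illustrative, not the actual Tokenizer API):

class BadRoute(Exception):
    """Raised when a parsing route turns out to be invalid."""

_bad_routes = set()

def attempt_route(head, context, parse_func):
    """Parse starting at (head, context), remembering failed attempts so the
    same doomed route is never retried from the same position."""
    ident = (head, context)
    if ident in _bad_routes:
        raise BadRoute(context)
    try:
        return parse_func(head, context)
    except BadRoute:
        _bad_routes.add(ident)
        raise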
@@ -609,8 +623,8 @@ class Tokenizer(object): | |||||
def _parse_entity(self): | def _parse_entity(self): | ||||
"""Parse an HTML entity at the head of the wikicode string.""" | """Parse an HTML entity at the head of the wikicode string.""" | ||||
reset = self._head | reset = self._head | ||||
self._push() | |||||
try: | try: | ||||
self._push(contexts.HTML_ENTITY) | |||||
self._really_parse_entity() | self._really_parse_entity() | ||||
except BadRoute: | except BadRoute: | ||||
self._head = reset | self._head = reset | ||||
@@ -650,8 +664,9 @@ class Tokenizer(object): | |||||
self._emit_first(tokens.TagAttrQuote(char=data.quoter)) | self._emit_first(tokens.TagAttrQuote(char=data.quoter)) | ||||
self._emit_all(self._pop()) | self._emit_all(self._pop()) | ||||
buf = data.padding_buffer | buf = data.padding_buffer | ||||
self._emit_first(tokens.TagAttrStart(pad_first=buf["first"], | |||||
pad_before_eq=buf["before_eq"], pad_after_eq=buf["after_eq"])) | |||||
self._emit_first(tokens.TagAttrStart( | |||||
pad_first=buf["first"], pad_before_eq=buf["before_eq"], | |||||
pad_after_eq=buf["after_eq"])) | |||||
self._emit_all(self._pop()) | self._emit_all(self._pop()) | ||||
for key in data.padding_buffer: | for key in data.padding_buffer: | ||||
data.padding_buffer[key] = "" | data.padding_buffer[key] = "" | ||||
@@ -804,6 +819,12 @@ class Tokenizer(object): | |||||
depth -= 1 | depth -= 1 | ||||
if depth == 0: | if depth == 0: | ||||
break | break | ||||
elif isinstance(token, tokens.TagCloseSelfclose): | |||||
depth -= 1 | |||||
if depth == 0: # pragma: no cover (untestable/exceptional) | |||||
raise ParserError( | |||||
"_handle_single_tag_end() got an unexpected " | |||||
"TagCloseSelfclose") | |||||
else: # pragma: no cover (untestable/exceptional case) | else: # pragma: no cover (untestable/exceptional case) | ||||
raise ParserError("_handle_single_tag_end() missed a TagCloseOpen") | raise ParserError("_handle_single_tag_end() missed a TagCloseOpen") | ||||
padding = stack[index].padding | padding = stack[index].padding | ||||
@@ -1076,8 +1097,8 @@ class Tokenizer(object): | |||||
"""Parse a wikicode table by starting with the first line.""" | """Parse a wikicode table by starting with the first line.""" | ||||
reset = self._head | reset = self._head | ||||
self._head += 2 | self._head += 2 | ||||
self._push(contexts.TABLE_OPEN) | |||||
try: | try: | ||||
self._push(contexts.TABLE_OPEN) | |||||
padding = self._handle_table_style("\n") | padding = self._handle_table_style("\n") | ||||
except BadRoute: | except BadRoute: | ||||
self._head = reset | self._head = reset | ||||
@@ -1086,9 +1107,12 @@ class Tokenizer(object): | |||||
style = self._pop() | style = self._pop() | ||||
self._head += 1 | self._head += 1 | ||||
restore_point = self._stack_ident | |||||
try: | try: | ||||
table = self._parse(contexts.TABLE_OPEN) | table = self._parse(contexts.TABLE_OPEN) | ||||
except BadRoute: | except BadRoute: | ||||
while self._stack_ident != restore_point: | |||||
self._pop() | |||||
self._head = reset | self._head = reset | ||||
self._emit_text("{") | self._emit_text("{") | ||||
return | return | ||||
@@ -1106,11 +1130,7 @@ class Tokenizer(object): | |||||
return | return | ||||
self._push(contexts.TABLE_OPEN | contexts.TABLE_ROW_OPEN) | self._push(contexts.TABLE_OPEN | contexts.TABLE_ROW_OPEN) | ||||
try: | |||||
padding = self._handle_table_style("\n") | |||||
except BadRoute: | |||||
self._pop() | |||||
raise | |||||
padding = self._handle_table_style("\n") | |||||
style = self._pop() | style = self._pop() | ||||
# Don't parse the style separator: | # Don't parse the style separator: | ||||
@@ -1348,7 +1368,8 @@ class Tokenizer(object): | |||||
# Kill potential table contexts | # Kill potential table contexts | ||||
self._context &= ~contexts.TABLE_CELL_LINE_CONTEXTS | self._context &= ~contexts.TABLE_CELL_LINE_CONTEXTS | ||||
# Start of table parsing | # Start of table parsing | ||||
elif this == "{" and next == "|" and (self._read(-1) in ("\n", self.START) or | |||||
elif this == "{" and next == "|" and ( | |||||
self._read(-1) in ("\n", self.START) or | |||||
(self._read(-2) in ("\n", self.START) and self._read(-1).isspace())): | (self._read(-2) in ("\n", self.START) and self._read(-1).isspace())): | ||||
if self._can_recurse(): | if self._can_recurse(): | ||||
self._parse_table() | self._parse_table() | ||||
@@ -1374,7 +1395,7 @@ class Tokenizer(object): | |||||
self._context &= ~contexts.TABLE_CELL_LINE_CONTEXTS | self._context &= ~contexts.TABLE_CELL_LINE_CONTEXTS | ||||
self._emit_text(this) | self._emit_text(this) | ||||
elif (self._read(-1) in ("\n", self.START) or | elif (self._read(-1) in ("\n", self.START) or | ||||
(self._read(-2) in ("\n", self.START) and self._read(-1).isspace())): | |||||
(self._read(-2) in ("\n", self.START) and self._read(-1).isspace())): | |||||
if this == "|" and next == "}": | if this == "|" and next == "}": | ||||
if self._context & contexts.TABLE_CELL_OPEN: | if self._context & contexts.TABLE_CELL_OPEN: | ||||
return self._handle_table_cell_end() | return self._handle_table_cell_end() | ||||
@@ -1406,10 +1427,12 @@ class Tokenizer(object): | |||||
def tokenize(self, text, context=0, skip_style_tags=False): | def tokenize(self, text, context=0, skip_style_tags=False): | ||||
"""Build a list of tokens from a string of wikicode and return it.""" | """Build a list of tokens from a string of wikicode and return it.""" | ||||
self._skip_style_tags = skip_style_tags | |||||
split = self.regex.split(text) | split = self.regex.split(text) | ||||
self._text = [segment for segment in split if segment] | self._text = [segment for segment in split if segment] | ||||
self._head = self._global = self._depth = self._cycles = 0 | |||||
self._head = self._global = self._depth = 0 | |||||
self._bad_routes = set() | |||||
self._skip_style_tags = skip_style_tags | |||||
try: | try: | ||||
tokens = self._parse(context) | tokens = self._parse(context) | ||||
except BadRoute: # pragma: no cover (untestable/exceptional case) | except BadRoute: # pragma: no cover (untestable/exceptional case) | ||||
@@ -271,7 +271,7 @@ class _ListProxy(_SliceNormalizerMixIn, list): | |||||
return bool(self._render()) | return bool(self._render()) | ||||
def __len__(self): | def __len__(self): | ||||
return (self._stop - self._start) // self._step | |||||
return max((self._stop - self._start) // self._step, 0) | |||||
def __getitem__(self, key): | def __getitem__(self, key): | ||||
if isinstance(key, slice): | if isinstance(key, slice): | ||||
@@ -108,6 +108,9 @@ class StringMixIn(object): | |||||
return str(item) in self.__unicode__() | return str(item) in self.__unicode__() | ||||
def __getattr__(self, attr): | def __getattr__(self, attr): | ||||
if not hasattr(str, attr): | |||||
raise AttributeError("{0!r} object has no attribute {1!r}".format( | |||||
type(self).__name__, attr)) | |||||
return getattr(self.__unicode__(), attr) | return getattr(self.__unicode__(), attr) | ||||
if py3k: | if py3k: | ||||
@@ -1,6 +1,6 @@ | |||||
# -*- coding: utf-8 -*- | # -*- coding: utf-8 -*- | ||||
# | # | ||||
# Copyright (C) 2012-2016 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
# Copyright (C) 2012-2017 Ben Kurtovic <ben.kurtovic@gmail.com> | |||||
# | # | ||||
# Permission is hereby granted, free of charge, to any person obtaining a copy | # Permission is hereby granted, free of charge, to any person obtaining a copy | ||||
# of this software and associated documentation files (the "Software"), to deal | # of this software and associated documentation files (the "Software"), to deal | ||||
@@ -24,7 +24,7 @@ from __future__ import unicode_literals | |||||
from itertools import chain | from itertools import chain | ||||
import re | import re | ||||
from .compat import py3k, range, str | |||||
from .compat import bytes, py3k, range, str | |||||
from .nodes import (Argument, Comment, ExternalLink, Heading, HTMLEntity, | from .nodes import (Argument, Comment, ExternalLink, Heading, HTMLEntity, | ||||
Node, Tag, Template, Text, Wikilink) | Node, Tag, Template, Text, Wikilink) | ||||
from .string_mixin import StringMixIn | from .string_mixin import StringMixIn | ||||
@@ -275,6 +275,21 @@ class Wikicode(StringMixIn): | |||||
else: | else: | ||||
self.nodes.pop(index) | self.nodes.pop(index) | ||||
def contains(self, obj): | |||||
"""Return whether this Wikicode object contains *obj*. | |||||
If *obj* is a :class:`.Node` or :class:`.Wikicode` object, then we | |||||
search for it exactly among all of our children, recursively. | |||||
Otherwise, this method just uses :meth:`.__contains__` on the string. | |||||
""" | |||||
if not isinstance(obj, (Node, Wikicode)): | |||||
return obj in self | |||||
try: | |||||
self._do_strong_search(obj, recursive=True) | |||||
except ValueError: | |||||
return False | |||||
return True | |||||
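Usage sketch for contains() (the wikitext here is made up for illustration):

import mwparserfromhell

code = mwparserfromhell.parse("{{foo|{{bar}}}}")
inner = code.filter_templates(matches=lambda t: t.name == "bar")[0]
assert code.contains(inner)      # exact Node found recursively
assert code.contains("{{bar}}")  # plain strings fall back to substring search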
def index(self, obj, recursive=False): | def index(self, obj, recursive=False): | ||||
"""Return the index of *obj* in the list of nodes. | """Return the index of *obj* in the list of nodes. | ||||
@@ -294,6 +309,52 @@ class Wikicode(StringMixIn): | |||||
return i | return i | ||||
raise ValueError(obj) | raise ValueError(obj) | ||||
def get_ancestors(self, obj): | |||||
"""Return a list of all ancestor nodes of the :class:`.Node` *obj*. | |||||
The list is ordered from the most shallow ancestor (greatest great- | |||||
grandparent) to the direct parent. The node itself is not included in | |||||
the list. For example:: | |||||
>>> text = "{{a|{{b|{{c|{{d}}}}}}}}" | |||||
>>> code = mwparserfromhell.parse(text) | |||||
>>> node = code.filter_templates(matches=lambda n: n == "{{d}}")[0] | |||||
>>> code.get_ancestors(node) | |||||
['{{a|{{b|{{c|{{d}}}}}}}}', '{{b|{{c|{{d}}}}}}', '{{c|{{d}}}}'] | |||||
Will return an empty list if *obj* is at the top level of this Wikicode | |||||
object. Will raise :exc:`ValueError` if it wasn't found. | |||||
""" | |||||
def _get_ancestors(code, needle): | |||||
for node in code.nodes: | |||||
if node is needle: | |||||
return [] | |||||
for code in node.__children__(): | |||||
ancestors = _get_ancestors(code, needle) | |||||
if ancestors is not None: | |||||
return [node] + ancestors | |||||
if isinstance(obj, Wikicode): | |||||
obj = obj.get(0) | |||||
elif not isinstance(obj, Node): | |||||
raise ValueError(obj) | |||||
ancestors = _get_ancestors(self, obj) | |||||
if ancestors is None: | |||||
raise ValueError(obj) | |||||
return ancestors | |||||
def get_parent(self, obj): | |||||
"""Return the direct parent node of the :class:`.Node` *obj*. | |||||
This function is equivalent to calling :meth:`.get_ancestors` and | |||||
taking the last element of the resulting list. Will return None if | |||||
the node exists but does not have a parent; i.e., it is at the top | |||||
level of the Wikicode object. | |||||
""" | |||||
ancestors = self.get_ancestors(obj) | |||||
return ancestors[-1] if ancestors else None | |||||
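Continuing the get_ancestors() example above, the direct parent is just the last ancestor:

>>> code.get_parent(node)
'{{c|{{d}}}}'
>>> code.get_parent(code.get(0)) is None   # a top-level node has no parent
True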
def insert(self, index, value): | def insert(self, index, value): | ||||
"""Insert *value* at *index* in the list of nodes. | """Insert *value* at *index* in the list of nodes. | ||||
@@ -413,22 +474,23 @@ class Wikicode(StringMixIn): | |||||
"""Do a loose equivalency test suitable for comparing page names. | """Do a loose equivalency test suitable for comparing page names. | ||||
*other* can be any string-like object, including :class:`.Wikicode`, or | *other* can be any string-like object, including :class:`.Wikicode`, or | ||||
a tuple of these. This operation is symmetric; both sides are adjusted. | |||||
Specifically, whitespace and markup is stripped and the first letter's | |||||
case is normalized. Typical usage is | |||||
an iterable of these. This operation is symmetric; both sides are | |||||
adjusted. Specifically, whitespace and markup are stripped and the first
letter's case is normalized. Typical usage is | |||||
``if template.name.matches("stub"): ...``. | ``if template.name.matches("stub"): ...``. | ||||
""" | """ | ||||
cmp = lambda a, b: (a[0].upper() + a[1:] == b[0].upper() + b[1:] | cmp = lambda a, b: (a[0].upper() + a[1:] == b[0].upper() + b[1:] | ||||
if a and b else a == b) | if a and b else a == b) | ||||
this = self.strip_code().strip() | this = self.strip_code().strip() | ||||
if isinstance(other, (tuple, list)): | |||||
for obj in other: | |||||
that = parse_anything(obj).strip_code().strip() | |||||
if cmp(this, that): | |||||
return True | |||||
return False | |||||
that = parse_anything(other).strip_code().strip() | |||||
return cmp(this, that) | |||||
if isinstance(other, (str, bytes, Wikicode, Node)): | |||||
that = parse_anything(other).strip_code().strip() | |||||
return cmp(this, that) | |||||
for obj in other: | |||||
that = parse_anything(obj).strip_code().strip() | |||||
if cmp(this, that): | |||||
return True | |||||
return False | |||||
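Usage sketch for the extended matches() behavior (the template is hypothetical):

name = mwparserfromhell.parse("{{Infobox person}}").filter_templates()[0].name
assert name.matches("infobox person")            # case of the first letter is ignored
assert name.matches({"stub", "infobox person"})  # any iterable now works, not just list/tuple
assert not name.matches("Infobox company")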
def ifilter(self, recursive=True, matches=None, flags=FLAGS, | def ifilter(self, recursive=True, matches=None, flags=FLAGS, | ||||
forcetype=None): | forcetype=None): | ||||
@@ -530,23 +592,33 @@ class Wikicode(StringMixIn): | |||||
# Ensure that earlier sections are earlier in the returned list: | # Ensure that earlier sections are earlier in the returned list: | ||||
return [section for i, section in sorted(sections)] | return [section for i, section in sorted(sections)] | ||||
def strip_code(self, normalize=True, collapse=True): | |||||
def strip_code(self, normalize=True, collapse=True, | |||||
keep_template_params=False): | |||||
"""Return a rendered string without unprintable code such as templates. | """Return a rendered string without unprintable code such as templates. | ||||
The way a node is stripped is handled by the | The way a node is stripped is handled by the | ||||
:meth:`~.Node.__strip__` method of :class:`.Node` objects, which | :meth:`~.Node.__strip__` method of :class:`.Node` objects, which | ||||
generally return a subset of their nodes or ``None``. For example, | generally return a subset of their nodes or ``None``. For example, | ||||
templates and tags are removed completely, links are stripped to just | templates and tags are removed completely, links are stripped to just | ||||
their display part, headings are stripped to just their title. If | |||||
*normalize* is ``True``, various things may be done to strip code | |||||
their display part, headings are stripped to just their title. | |||||
If *normalize* is ``True``, various things may be done to strip code | |||||
further, such as converting HTML entities like ``Σ``, ``Σ``, | further, such as converting HTML entities like ``Σ``, ``Σ``, | ||||
and ``Σ`` to ``Σ``. If *collapse* is ``True``, we will try to | and ``Σ`` to ``Σ``. If *collapse* is ``True``, we will try to | ||||
remove excess whitespace as well (three or more newlines are converted | remove excess whitespace as well (three or more newlines are converted | ||||
to two, for example). | |||||
to two, for example). If *keep_template_params* is ``True``, then | |||||
template parameters will be preserved in the output (normally, they are | |||||
removed completely). | |||||
""" | """ | ||||
kwargs = { | |||||
"normalize": normalize, | |||||
"collapse": collapse, | |||||
"keep_template_params": keep_template_params | |||||
} | |||||
nodes = [] | nodes = [] | ||||
for node in self.nodes: | for node in self.nodes: | ||||
stripped = node.__strip__(normalize, collapse) | |||||
stripped = node.__strip__(**kwargs) | |||||
if stripped: | if stripped: | ||||
nodes.append(str(stripped)) | nodes.append(str(stripped)) | ||||
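Usage sketch for the new keep_template_params flag (outputs shown approximately):

code = mwparserfromhell.parse("{{lang|fr|bonjour}} world")
code.strip_code()                            # -> " world"
code.strip_code(keep_template_params=True)   # -> "fr bonjour world"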
@@ -117,11 +117,11 @@ test_release() { | |||||
fi | fi | ||||
pip -q uninstall -y mwparserfromhell | pip -q uninstall -y mwparserfromhell | ||||
echo -n "Downloading mwparserfromhell source tarball and GPG signature..." | echo -n "Downloading mwparserfromhell source tarball and GPG signature..." | ||||
curl -sL "https://pypi.python.org/packages/source/m/mwparserfromhell/mwparserfromhell-$VERSION.tar.gz" -o "mwparserfromhell.tar.gz" | |||||
curl -sL "https://pypi.python.org/packages/source/m/mwparserfromhell/mwparserfromhell-$VERSION.tar.gz.asc" -o "mwparserfromhell.tar.gz.asc" | |||||
curl -sL "https://pypi.io/packages/source/m/mwparserfromhell/mwparserfromhell-$VERSION.tar.gz" -o "mwparserfromhell.tar.gz" | |||||
curl -sL "https://pypi.io/packages/source/m/mwparserfromhell/mwparserfromhell-$VERSION.tar.gz.asc" -o "mwparserfromhell.tar.gz.asc" | |||||
echo " done." | echo " done." | ||||
echo "Verifying tarball..." | echo "Verifying tarball..." | ||||
gpg --verify mwparserfromhell.tar.gz.asc | |||||
gpg --verify mwparserfromhell.tar.gz.asc mwparserfromhell.tar.gz | |||||
if [[ "$?" != "0" ]]; then | if [[ "$?" != "0" ]]; then | ||||
echo "*** ERROR: GPG signature verification failed!" | echo "*** ERROR: GPG signature verification failed!" | ||||
deactivate | deactivate | ||||
@@ -56,12 +56,10 @@ class TestArgument(TreeEqualityTestCase): | |||||
def test_strip(self): | def test_strip(self): | ||||
"""test Argument.__strip__()""" | """test Argument.__strip__()""" | ||||
node = Argument(wraptext("foobar")) | |||||
node1 = Argument(wraptext("foobar")) | |||||
node2 = Argument(wraptext("foo"), wraptext("bar")) | node2 = Argument(wraptext("foo"), wraptext("bar")) | ||||
for a in (True, False): | |||||
for b in (True, False): | |||||
self.assertIs(None, node.__strip__(a, b)) | |||||
self.assertEqual("bar", node2.__strip__(a, b)) | |||||
self.assertIs(None, node1.__strip__()) | |||||
self.assertEqual("bar", node2.__strip__()) | |||||
def test_showtree(self): | def test_showtree(self): | ||||
"""test Argument.__showtree__()""" | """test Argument.__showtree__()""" | ||||
@@ -49,9 +49,7 @@ class TestComment(TreeEqualityTestCase): | |||||
def test_strip(self): | def test_strip(self): | ||||
"""test Comment.__strip__()""" | """test Comment.__strip__()""" | ||||
node = Comment("foobar") | node = Comment("foobar") | ||||
for a in (True, False): | |||||
for b in (True, False): | |||||
self.assertIs(None, node.__strip__(a, b)) | |||||
self.assertIs(None, node.__strip__()) | |||||
def test_showtree(self): | def test_showtree(self): | ||||
"""test Comment.__showtree__()""" | """test Comment.__showtree__()""" | ||||
@@ -66,12 +66,11 @@ class TestExternalLink(TreeEqualityTestCase): | |||||
node2 = ExternalLink(wraptext("http://example.com")) | node2 = ExternalLink(wraptext("http://example.com")) | ||||
node3 = ExternalLink(wraptext("http://example.com"), wrap([])) | node3 = ExternalLink(wraptext("http://example.com"), wrap([])) | ||||
node4 = ExternalLink(wraptext("http://example.com"), wraptext("Link")) | node4 = ExternalLink(wraptext("http://example.com"), wraptext("Link")) | ||||
for a in (True, False): | |||||
for b in (True, False): | |||||
self.assertEqual("http://example.com", node1.__strip__(a, b)) | |||||
self.assertEqual(None, node2.__strip__(a, b)) | |||||
self.assertEqual(None, node3.__strip__(a, b)) | |||||
self.assertEqual("Link", node4.__strip__(a, b)) | |||||
self.assertEqual("http://example.com", node1.__strip__()) | |||||
self.assertEqual(None, node2.__strip__()) | |||||
self.assertEqual(None, node3.__strip__()) | |||||
self.assertEqual("Link", node4.__strip__()) | |||||
def test_showtree(self): | def test_showtree(self): | ||||
"""test ExternalLink.__showtree__()""" | """test ExternalLink.__showtree__()""" | ||||
@@ -52,9 +52,7 @@ class TestHeading(TreeEqualityTestCase): | |||||
def test_strip(self): | def test_strip(self): | ||||
"""test Heading.__strip__()""" | """test Heading.__strip__()""" | ||||
node = Heading(wraptext("foobar"), 3) | node = Heading(wraptext("foobar"), 3) | ||||
for a in (True, False): | |||||
for b in (True, False): | |||||
self.assertEqual("foobar", node.__strip__(a, b)) | |||||
self.assertEqual("foobar", node.__strip__()) | |||||
def test_showtree(self): | def test_showtree(self): | ||||
"""test Heading.__showtree__()""" | """test Heading.__showtree__()""" | ||||
@@ -57,13 +57,13 @@ class TestHTMLEntity(TreeEqualityTestCase): | |||||
node1 = HTMLEntity("nbsp", named=True, hexadecimal=False) | node1 = HTMLEntity("nbsp", named=True, hexadecimal=False) | ||||
node2 = HTMLEntity("107", named=False, hexadecimal=False) | node2 = HTMLEntity("107", named=False, hexadecimal=False) | ||||
node3 = HTMLEntity("e9", named=False, hexadecimal=True) | node3 = HTMLEntity("e9", named=False, hexadecimal=True) | ||||
for a in (True, False): | |||||
self.assertEqual("\xa0", node1.__strip__(True, a)) | |||||
self.assertEqual(" ", node1.__strip__(False, a)) | |||||
self.assertEqual("k", node2.__strip__(True, a)) | |||||
self.assertEqual("k", node2.__strip__(False, a)) | |||||
self.assertEqual("é", node3.__strip__(True, a)) | |||||
self.assertEqual("é", node3.__strip__(False, a)) | |||||
self.assertEqual("\xa0", node1.__strip__(normalize=True)) | |||||
self.assertEqual(" ", node1.__strip__(normalize=False)) | |||||
self.assertEqual("k", node2.__strip__(normalize=True)) | |||||
self.assertEqual("k", node2.__strip__(normalize=False)) | |||||
self.assertEqual("é", node3.__strip__(normalize=True)) | |||||
self.assertEqual("é", node3.__strip__(normalize=False)) | |||||
def test_showtree(self): | def test_showtree(self): | ||||
"""test HTMLEntity.__showtree__()""" | """test HTMLEntity.__showtree__()""" | ||||
@@ -398,6 +398,7 @@ class TestSmartList(unittest.TestCase): | |||||
self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6], child1) | self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6], child1) | ||||
self.assertEqual([4, 3, 2, 1.9, 1.8], child2) | self.assertEqual([4, 3, 2, 1.9, 1.8], child2) | ||||
self.assertEqual([], child3) | self.assertEqual([], child3) | ||||
self.assertEqual(0, len(child3)) | |||||
del child1 | del child1 | ||||
self.assertEqual([1, 4, 3, 2, 1.9, 1.8, 5, 6], parent) | self.assertEqual([1, 4, 3, 2, 1.9, 1.8, 5, 6], parent) | ||||
@@ -103,11 +103,10 @@ class TestTag(TreeEqualityTestCase): | |||||
node1 = Tag(wraptext("i"), wraptext("foobar")) | node1 = Tag(wraptext("i"), wraptext("foobar")) | ||||
node2 = Tag(wraptext("math"), wraptext("foobar")) | node2 = Tag(wraptext("math"), wraptext("foobar")) | ||||
node3 = Tag(wraptext("br"), self_closing=True) | node3 = Tag(wraptext("br"), self_closing=True) | ||||
for a in (True, False): | |||||
for b in (True, False): | |||||
self.assertEqual("foobar", node1.__strip__(a, b)) | |||||
self.assertEqual(None, node2.__strip__(a, b)) | |||||
self.assertEqual(None, node3.__strip__(a, b)) | |||||
self.assertEqual("foobar", node1.__strip__()) | |||||
self.assertEqual(None, node2.__strip__()) | |||||
self.assertEqual(None, node3.__strip__()) | |||||
def test_showtree(self): | def test_showtree(self): | ||||
"""test Tag.__showtree__()""" | """test Tag.__showtree__()""" | ||||
@@ -67,12 +67,19 @@ class TestTemplate(TreeEqualityTestCase): | |||||
def test_strip(self): | def test_strip(self): | ||||
"""test Template.__strip__()""" | """test Template.__strip__()""" | ||||
node1 = Template(wraptext("foobar")) | node1 = Template(wraptext("foobar")) | ||||
node2 = Template(wraptext("foo"), | |||||
[pgenh("1", "bar"), pgens("abc", "def")]) | |||||
for a in (True, False): | |||||
for b in (True, False): | |||||
self.assertEqual(None, node1.__strip__(a, b)) | |||||
self.assertEqual(None, node2.__strip__(a, b)) | |||||
node2 = Template(wraptext("foo"), [ | |||||
pgenh("1", "bar"), pgens("foo", ""), pgens("abc", "def")]) | |||||
node3 = Template(wraptext("foo"), [ | |||||
pgenh("1", "foo"), | |||||
Parameter(wraptext("2"), wrap([Template(wraptext("hello"))]), | |||||
showkey=False), | |||||
pgenh("3", "bar")]) | |||||
self.assertEqual(None, node1.__strip__(keep_template_params=False)) | |||||
self.assertEqual(None, node2.__strip__(keep_template_params=False)) | |||||
self.assertEqual("", node1.__strip__(keep_template_params=True)) | |||||
self.assertEqual("bar def", node2.__strip__(keep_template_params=True)) | |||||
self.assertEqual("foo bar", node3.__strip__(keep_template_params=True)) | |||||
def test_showtree(self): | def test_showtree(self): | ||||
"""test Template.__showtree__()""" | """test Template.__showtree__()""" | ||||
@@ -216,6 +223,7 @@ class TestTemplate(TreeEqualityTestCase):
         node39 = Template(wraptext("a"), [pgenh("1", " b ")])
         node40 = Template(wraptext("a"), [pgenh("1", " b"), pgenh("2", " c")])
         node41 = Template(wraptext("a"), [pgens("1", " b"), pgens("2", " c")])
+        node42 = Template(wraptext("a"), [pgens("b", " \n")])
 
         node1.add("e", "f", showkey=True)
         node2.add(2, "g", showkey=False)
@@ -261,6 +269,7 @@ class TestTemplate(TreeEqualityTestCase):
         node39.add("1", "c")
         node40.add("3", "d")
         node41.add("3", "d")
+        node42.add("b", "hello")
 
         self.assertEqual("{{a|b=c|d|e=f}}", node1)
         self.assertEqual("{{a|b=c|d|g}}", node2)
@@ -308,6 +317,7 @@ class TestTemplate(TreeEqualityTestCase):
         self.assertEqual("{{a|c}}", node39)
         self.assertEqual("{{a| b| c|d}}", node40)
         self.assertEqual("{{a|1= b|2= c|3= d}}", node41)
+        self.assertEqual("{{a|b=hello \n}}", node42)
 
     def test_remove(self):
         """test Template.remove()"""
@@ -49,9 +49,7 @@ class TestText(unittest.TestCase):
     def test_strip(self):
         """test Text.__strip__()"""
         node = Text("foobar")
-        for a in (True, False):
-            for b in (True, False):
-                self.assertIs(node, node.__strip__(a, b))
+        self.assertIs(node, node.__strip__())
 
     def test_showtree(self):
         """test Text.__showtree__()"""
@@ -85,6 +85,17 @@ class TestWikicode(TreeEqualityTestCase):
         self.assertRaises(IndexError, code.set, 3, "{{baz}}")
         self.assertRaises(IndexError, code.set, -4, "{{baz}}")
 
+    def test_contains(self):
+        """test Wikicode.contains()"""
+        code = parse("Here is {{aaa|{{bbb|xyz{{ccc}}}}}} and a [[page|link]]")
+        tmpl1, tmpl2, tmpl3 = code.filter_templates()
+        tmpl4 = parse("{{ccc}}").filter_templates()[0]
+        self.assertTrue(code.contains(tmpl1))
+        self.assertTrue(code.contains(tmpl3))
+        self.assertFalse(code.contains(tmpl4))
+        self.assertTrue(code.contains(str(tmpl4)))
+        self.assertTrue(code.contains(tmpl2.params[0].value))
+
     def test_index(self):
         """test Wikicode.index()"""
         code = parse("Have a {{template}} and a [[page|link]]")
@@ -102,6 +113,22 @@ class TestWikicode(TreeEqualityTestCase):
         self.assertRaises(ValueError, code.index,
                           code.get(1).get(1).value, recursive=False)
 
+    def test_get_ancestors_parent(self):
+        """test Wikicode.get_ancestors() and Wikicode.get_parent()"""
+        code = parse("{{a|{{b|{{d|{{e}}{{f}}}}{{g}}}}}}{{c}}")
+        tmpl = code.filter_templates(matches=lambda n: n.name == "f")[0]
+        parent1 = code.filter_templates(matches=lambda n: n.name == "d")[0]
+        parent2 = code.filter_templates(matches=lambda n: n.name == "b")[0]
+        parent3 = code.filter_templates(matches=lambda n: n.name == "a")[0]
+        fake = parse("{{f}}").get(0)
+
+        self.assertEqual([parent3, parent2, parent1], code.get_ancestors(tmpl))
+        self.assertIs(parent1, code.get_parent(tmpl))
+        self.assertEqual([], code.get_ancestors(parent3))
+        self.assertIs(None, code.get_parent(parent3))
+        self.assertRaises(ValueError, code.get_ancestors, fake)
+        self.assertRaises(ValueError, code.get_parent, fake)
+
     def test_insert(self):
         """test Wikicode.insert()"""
         code = parse("Have a {{template}} and a [[page|link]]")
@@ -433,7 +460,7 @@ class TestWikicode(TreeEqualityTestCase):
         """test Wikicode.strip_code()"""
         # Since individual nodes have test cases for their __strip__ methods,
         # we're only going to do an integration test:
-        code = parse("Foo [[bar]]\n\n{{baz}}\n\n[[a|b]] &Sigma;")
+        code = parse("Foo [[bar]]\n\n{{baz|hello}}\n\n[[a|b]] &Sigma;")
         self.assertEqual("Foo bar\n\nb Σ",
                          code.strip_code(normalize=True, collapse=True))
         self.assertEqual("Foo bar\n\n\n\nb Σ",
@@ -442,6 +469,9 @@ class TestWikicode(TreeEqualityTestCase):
                          code.strip_code(normalize=False, collapse=True))
         self.assertEqual("Foo bar\n\n\n\nb &Sigma;",
                          code.strip_code(normalize=False, collapse=False))
+        self.assertEqual("Foo bar\n\nhello\n\nb Σ",
+                         code.strip_code(normalize=True, collapse=True,
+                                         keep_template_params=True))
 
     def test_get_tree(self):
         """test Wikicode.get_tree()"""
@@ -58,10 +58,8 @@ class TestWikilink(TreeEqualityTestCase):
         """test Wikilink.__strip__()"""
         node = Wikilink(wraptext("foobar"))
         node2 = Wikilink(wraptext("foo"), wraptext("bar"))
-        for a in (True, False):
-            for b in (True, False):
-                self.assertEqual("foobar", node.__strip__(a, b))
-                self.assertEqual("bar", node2.__strip__(a, b))
+        self.assertEqual("foobar", node.__strip__())
+        self.assertEqual("bar", node2.__strip__())
 
     def test_showtree(self):
         """test Wikilink.__showtree__()"""
@@ -346,3 +346,10 @@ name: tables_in_templates_2
 label: catch error handling mistakes when wikitables are inside templates
 input: "{{hello|test\n{|\n| }}"
 output: [TemplateOpen(), Text(text="hello"), TemplateParamSeparator(), Text(text="test\n{"), TemplateParamSeparator(), Text(text="\n"), TemplateParamSeparator(), Text(text=" "), TemplateClose()]
+
+---
+
+name: many_invalid_nested_tags
+label: many unending nested tags that should be treated as plain text, followed by valid wikitext (see issues #42, #183)
+input: "<b><b><b><b><b><b><b><b><b><b><b><b><b><b><b><b><b><b>[[{{x}}"
+output: [Text(text="<b><b><b><b><b><b><b><b><b><b><b><b><b><b><b><b><b><b>[["), TemplateOpen(), Text(text="x"), TemplateClose()]
@@ -646,3 +646,10 @@ name: non_ascii_full
 label: an open/close tag pair containing non-ASCII characters
 input: "<éxamplé></éxamplé>"
 output: [TagOpenOpen(), Text(text="éxamplé"), TagCloseOpen(padding=""), TagOpenClose(), Text(text="éxamplé"), TagCloseClose()]
+
+---
+
+name: single_nested_selfclosing
+label: a single (unpaired) tag with a self-closing tag in the middle (see issue #147)
+input: "<li a <br/> c>foobar"
+output: [TagOpenOpen(), Text(text="li"), TagAttrStart(pad_first=" ", pad_after_eq="", pad_before_eq=" "), Text(text="a"), TagAttrStart(pad_first="", pad_after_eq="", pad_before_eq=" "), TagOpenOpen(), Text(text="br"), TagCloseSelfclose(padding=""), TagAttrStart(pad_first="", pad_after_eq="", pad_before_eq=""), Text(text="c"), TagCloseSelfclose(padding="", implicit=True), Text(text="foobar")]
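This case covers the rare parsing bug with self-closing tags inside the
attributes of an unpaired tag (issue #147). A rough sketch of how such markup
comes out of the parser (assuming mwparserfromhell v0.5; the exact node layout
follows the token stream above):

    import mwparserfromhell

    code = mwparserfromhell.parse("<li a <br/> c>foobar")
    tags = code.filter_tags()
    print([str(tag.tag) for tag in tags])  # expected to include "li" and "br"
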
@@ -694,4 +694,4 @@ output: [Text(text="{{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{
 name: recursion_opens_and_closes
 label: test potentially dangerous recursion: template openings and closings
 input: "{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}"
-output: [Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), TemplateOpen(), Text(text="x"), TemplateParamSeparator(), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x"), TemplateParamSeparator(), Text(text="{{x"), TemplateClose(), Text(text="{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}")]
+output: [Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose()]