
Merge branch 'develop'

tags/v0.4.2
Ben Kurtovic 8 years ago
parent commit e87661b912
43 changed files with 2496 additions and 1611 deletions
  1. +2 -0 .travis.yml
  2. +24 -1 CHANGELOG
  3. +1 -2 README.rst
  4. +64 -0 appveyor.yml
  5. +34 -2 docs/changelog.rst
  6. +8 -10 docs/index.rst
  7. +1 -1 docs/integration.rst
  8. +1 -1 mwparserfromhell/__init__.py
  9. +0 -2 mwparserfromhell/compat.py
  10. +1 -3 mwparserfromhell/definitions.py
  11. +58 -53 mwparserfromhell/nodes/template.py
  12. +2 -1 mwparserfromhell/parser/__init__.py
  13. +10 -8 mwparserfromhell/parser/contexts.py
  14. +125 -0 mwparserfromhell/parser/ctokenizer/common.h
  15. +105 -0 mwparserfromhell/parser/ctokenizer/contexts.h
  16. +78 -0 mwparserfromhell/parser/ctokenizer/tag_data.c
  17. +43 -0 mwparserfromhell/parser/ctokenizer/tag_data.h
  18. +232 -0 mwparserfromhell/parser/ctokenizer/textbuffer.c
  19. +36 -0 mwparserfromhell/parser/ctokenizer/textbuffer.h
  20. +223 -896 mwparserfromhell/parser/ctokenizer/tok_parse.c
  21. +35 -0 mwparserfromhell/parser/ctokenizer/tok_parse.h
  22. +345 -0 mwparserfromhell/parser/ctokenizer/tok_support.c
  23. +62 -0 mwparserfromhell/parser/ctokenizer/tok_support.h
  24. +310 -0 mwparserfromhell/parser/ctokenizer/tokenizer.c
  25. +111 -0 mwparserfromhell/parser/ctokenizer/tokenizer.h
  26. +111 -0 mwparserfromhell/parser/ctokenizer/tokens.c
  27. +69 -0 mwparserfromhell/parser/ctokenizer/tokens.h
  28. +0 -366 mwparserfromhell/parser/tokenizer.h
  29. +29 -15 mwparserfromhell/parser/tokenizer.py
  30. +42 -58 mwparserfromhell/smart_list.py
  31. +3 -0 scripts/README
  32. +16 -16 scripts/release.sh
  33. +0 -58 scripts/win_build.py
  34. +43 -0 scripts/win_wrapper.cmd
  35. +40 -60 setup.py
  36. +41 -26 tests/_test_tokenizer.py
  37. +2 -2 tests/test_docs.py
  38. +20 -12 tests/test_smart_list.py
  39. +12 -3 tests/test_template.py
  40. +77 -0 tests/tokenizer/integration.mwtest
  41. +51 -14 tests/tokenizer/templates.mwtest
  42. +1 -1 tests/tokenizer/text.mwtest
  43. +28 -0 tests/tokenizer/wikilinks.mwtest

+2 -0 .travis.yml

@@ -5,6 +5,8 @@ python:
- 3.2
- 3.3
- 3.4
- 3.5-dev
sudo: false
install:
- pip install coveralls
- python setup.py build


+24 -1 CHANGELOG

@@ -1,4 +1,27 @@
v0.4 (unreleased):
v0.4.1 (released July 30, 2015):

- The process for building Windows binaries has been fixed, and these should be
distributed along with new releases. Windows users can now take advantage of
C speedups without having a compiler of their own.
- Added support for Python 3.5.
- '<' and '>' are now disallowed in wikilink titles and template names. This
includes when denoting tags, but not comments.
- Fixed the behavior of preserve_spacing in Template.add() and keep_field in
Template.remove() on parameters with hidden keys.
- Removed _ListProxy.detach(). SmartLists now use weak references and their
children are garbage-collected properly.
- Fixed parser bugs involving:
- templates with completely blank names;
- templates with newlines and comments.
- Heavy refactoring and fixes to the C tokenizer, including:
- corrected a design flaw in text handling, allowing for substantial speed
improvements when parsing long strings of plain text;
- implemented new Python 3.3 PEP 393 Unicode APIs.
- Fixed various bugs in SmartList, including one that was causing memory issues
on 64-bit builds of Python 2 on Windows.
- Fixed some bugs in the release scripts.

v0.4 (released May 23, 2015):

- The parser now falls back on pure Python mode if C extensions cannot be
built. This fixes an issue that prevented some Windows users from installing
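
The SmartList entry above is easiest to see with a small example. This is an illustrative sketch of the documented behavior, not code taken from the commit; the comments describe what the changelog implies rather than captured output.

from mwparserfromhell.smart_list import SmartList

parent = SmartList([0, 1, 2, 3])
child = parent[1:3]        # slicing returns a live _ListProxy view of the parent
parent[1] = "one"
print(child)               # the child sees changes made through the parent
# As of this release there is no child.detach(): the parent holds only weak
# references to its children, so an unused child is garbage-collected on its own.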


+1 -2 README.rst

@@ -139,7 +139,7 @@ If you're not using a library, you can parse any page using the following code
from urllib.parse import urlencode
from urllib.request import urlopen
import mwparserfromhell
API_URL = "http://en.wikipedia.org/w/api.php"
API_URL = "https://en.wikipedia.org/w/api.php"

def parse(title):
data = {"action": "query", "prop": "revisions", "rvlimit": 1,
@@ -156,7 +156,6 @@ If you're not using a library, you can parse any page using the following code
.. _Legoktm: http://en.wikipedia.org/wiki/User:Legoktm
.. _GitHub: https://github.com/earwig/mwparserfromhell
.. _Python Package Index: http://pypi.python.org
.. _StackOverflow question: http://stackoverflow.com/questions/2817869/error-unable-to-find-vcvarsall-bat
.. _get pip: http://pypi.python.org/pypi/pip
.. _EarwigBot: https://github.com/earwig/earwigbot
.. _Pywikibot: https://www.mediawiki.org/wiki/Manual:Pywikibot
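
The hunk above cuts off partway through the README's fetch-and-parse helper. For context, a completed sketch of the same flow follows; everything past the visible lines (the remaining request fields and the response unpacking) is a reconstruction of the usual MediaWiki API response shape, not text from this diff.

import json
from urllib.parse import urlencode
from urllib.request import urlopen
import mwparserfromhell

API_URL = "https://en.wikipedia.org/w/api.php"

def parse(title):
    # Request the latest revision's wikitext for the given page title.
    data = {"action": "query", "prop": "revisions", "rvlimit": 1,
            "rvprop": "content", "format": "json", "titles": title}
    raw = urlopen(API_URL, urlencode(data).encode()).read()
    res = json.loads(raw)
    pages = res["query"]["pages"]                        # assumed response layout
    text = list(pages.values())[0]["revisions"][0]["*"]
    return mwparserfromhell.parse(text)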


+64 -0 appveyor.yml

@@ -0,0 +1,64 @@
# This config file is used by appveyor.com to build Windows release binaries

version: 0.4.1-b{build}

branches:
only:
- master

skip_tags: true

environment:
global:
# See: http://stackoverflow.com/a/13751649/163740
WRAPPER: "cmd /E:ON /V:ON /C .\\scripts\\win_wrapper.cmd"
PIP: "%WRAPPER% %PYTHON%\\Scripts\\pip.exe"
SETUPPY: "%WRAPPER% %PYTHON%\\python setup.py --with-extension"
PYPI_USERNAME: "earwigbot"
PYPI_PASSWORD:
secure: gOIcvPxSC2ujuhwOzwj3v8xjq3CCYd8keFWVnguLM+gcL0e02qshDHy7gwZZwj0+

matrix:
- PYTHON: "C:\\Python27"
PYTHON_VERSION: "2.7"
PYTHON_ARCH: "32"

- PYTHON: "C:\\Python27-x64"
PYTHON_VERSION: "2.7"
PYTHON_ARCH: "64"

- PYTHON: "C:\\Python33"
PYTHON_VERSION: "3.3"
PYTHON_ARCH: "32"

- PYTHON: "C:\\Python33-x64"
PYTHON_VERSION: "3.3"
PYTHON_ARCH: "64"

- PYTHON: "C:\\Python34"
PYTHON_VERSION: "3.4"
PYTHON_ARCH: "32"

- PYTHON: "C:\\Python34-x64"
PYTHON_VERSION: "3.4"
PYTHON_ARCH: "64"

install:
- "%PIP% install wheel twine"

build_script:
- "%SETUPPY% build"

test_script:
- "%SETUPPY% -q test"

after_test:
- "%SETUPPY% bdist_wheel"

on_success:
- "twine upload dist\\* -u %PYPI_USERNAME% -p %PYPI_PASSWORD%"

artifacts:
- path: dist\*

deploy: off

+34 -2 docs/changelog.rst

@@ -1,11 +1,43 @@
Changelog
=========

v0.4.1
------

`Released July 30, 2015 <https://github.com/earwig/mwparserfromhell/tree/v0.4.1>`_
(`changes <https://github.com/earwig/mwparserfromhell/compare/v0.4...v0.4.1>`__):

- The process for building Windows binaries has been fixed, and these should be
distributed along with new releases. Windows users can now take advantage of
C speedups without having a compiler of their own.
- Added support for Python 3.5.
- ``<`` and ``>`` are now disallowed in wikilink titles and template names.
This includes when denoting tags, but not comments.
- Fixed the behavior of *preserve_spacing* in :meth:`.Template.add` and
*keep_field* in :meth:`.Template.remove` on parameters with hidden keys.
- Removed :meth:`._ListProxy.detach`. :class:`.SmartList`\ s now use weak
references and their children are garbage-collected properly.
- Fixed parser bugs involving:

- templates with completely blank names;
- templates with newlines and comments.

- Heavy refactoring and fixes to the C tokenizer, including:

- corrected a design flaw in text handling, allowing for substantial speed
improvements when parsing long strings of plain text;
- implemented new Python 3.3
`PEP 393 <https://www.python.org/dev/peps/pep-0393/>`_ Unicode APIs.

- Fixed various bugs in :class:`.SmartList`, including one that was causing
memory issues on 64-bit builds of Python 2 on Windows.
- Fixed some bugs in the release scripts.

v0.4
----

Unreleased
(`changes <https://github.com/earwig/mwparserfromhell/compare/v0.3.3...develop>`__):
`Released May 23, 2015 <https://github.com/earwig/mwparserfromhell/tree/v0.4>`_
(`changes <https://github.com/earwig/mwparserfromhell/compare/v0.3.3...v0.4>`__):

- The parser now falls back on pure Python mode if C extensions cannot be
built. This fixes an issue that prevented some Windows users from installing
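
One of the bullets above (``<`` and ``>`` disallowed in wikilink titles and template names) is easier to grasp with a concrete input. A hedged sketch of the expected parsing outcome, based on a reading of the changelog entry rather than output captured from this build:

import mwparserfromhell

# "<" inside a template name is now invalid, so no Template node is produced
# and the text is kept as plain text.
broken = mwparserfromhell.parse("{{foo<bar}}")
print(broken.filter_templates())   # expected: []

ok = mwparserfromhell.parse("{{foo|bar}}")
print(ok.filter_templates())       # expected: ['{{foo|bar}}']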


+8 -10 docs/index.rst

@@ -17,24 +17,22 @@ Development occurs on GitHub_.
Installation
------------

The easiest way to install the parser is through the `Python Package Index`_,
so you can install the latest release with ``pip install mwparserfromhell``
(`get pip`_). Alternatively, get the latest development version::
The easiest way to install the parser is through the `Python Package Index`_;
you can install the latest release with ``pip install mwparserfromhell``
(`get pip`_). On Windows, make sure you have the latest version of pip
installed by running ``pip install --upgrade pip``.

Alternatively, get the latest development version::

git clone https://github.com/earwig/mwparserfromhell.git
cd mwparserfromhell
python setup.py install

If you get ``error: Unable to find vcvarsall.bat`` while installing, this is
because Windows can't find the compiler for C extensions. Consult this
`StackOverflow question`_ for help. You can also set ``ext_modules`` in
``setup.py`` to an empty list to prevent the extension from building.

You can run the comprehensive unit testing suite with ``python setup.py test``.
You can run the comprehensive unit testing suite with
``python setup.py test -q``.

.. _Python Package Index: http://pypi.python.org
.. _get pip: http://pypi.python.org/pypi/pip
.. _StackOverflow question: http://stackoverflow.com/questions/2817869/error-unable-to-find-vcvarsall-bat

Contents
--------
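
As a quick follow-up to the install instructions above, a small sanity check; the values in the comments are what this release should report, treat them as assumptions rather than captured output.

import mwparserfromhell
print(mwparserfromhell.__version__)   # expected: "0.4.1"

from mwparserfromhell import parser
print(parser.use_c)                   # True only if the C extension was built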


+1 -1 docs/integration.rst

@@ -25,7 +25,7 @@ If you're not using a library, you can parse any page using the following code
from urllib.parse import urlencode
from urllib.request import urlopen
import mwparserfromhell
API_URL = "http://en.wikipedia.org/w/api.php"
API_URL = "https://en.wikipedia.org/w/api.php"

def parse(title):
data = {"action": "query", "prop": "revisions", "rvlimit": 1,


+1 -1 mwparserfromhell/__init__.py

@@ -29,7 +29,7 @@ outrageously powerful parser for `MediaWiki <http://mediawiki.org>`_ wikicode.
__author__ = "Ben Kurtovic"
__copyright__ = "Copyright (C) 2012, 2013, 2014, 2015 Ben Kurtovic"
__license__ = "MIT License"
__version__ = "0.4"
__version__ = "0.4.1"
__email__ = "ben.kurtovic@gmail.com"

from . import (compat, definitions, nodes, parser, smart_list, string_mixin,


+0 -2 mwparserfromhell/compat.py

@@ -18,14 +18,12 @@ if py3k:
bytes = bytes
str = str
range = range
maxsize = sys.maxsize
import html.entities as htmlentities

else:
bytes = str
str = unicode
range = xrange
maxsize = sys.maxint
import htmlentitydefs as htmlentities

del sys

+1 -3 mwparserfromhell/definitions.py

@@ -81,10 +81,8 @@ def is_single_only(tag):
"""Return whether or not the given *tag* must exist without a close tag."""
return tag.lower() in SINGLE_ONLY

def is_scheme(scheme, slashes=True, reverse=False):
def is_scheme(scheme, slashes=True):
"""Return whether *scheme* is valid for external links."""
if reverse: # Convenience for C
scheme = scheme[::-1]
scheme = scheme.lower()
if slashes:
return scheme in URI_SCHEMES


+58 -53 mwparserfromhell/nodes/template.py

@@ -82,21 +82,11 @@ class Template(Node):
if char in node:
code.replace(node, node.replace(char, replacement), False)

def _blank_param_value(self, value):
"""Remove the content from *value* while keeping its whitespace.

Replace *value*\ 's nodes with two text nodes, the first containing
whitespace from before its content and the second containing whitespace
from after its content.
"""
match = re.search(r"^(\s*).*?(\s*)$", str(value), FLAGS)
value.nodes = [Text(match.group(1)), Text(match.group(2))]

def _select_theory(self, theories):
"""Return the most likely spacing convention given different options.

Given a dictionary of convention options as keys and their occurrence as
values, return the convention that occurs the most, or ``None`` if
Given a dictionary of convention options as keys and their occurrence
as values, return the convention that occurs the most, or ``None`` if
there is no clear preferred style.
"""
if theories:
@@ -129,34 +119,47 @@ class Template(Node):
after = self._select_theory(after_theories)
return before, after

def _remove_with_field(self, param, i, name):
"""Return True if a parameter name should be kept, otherwise False."""
if param.showkey:
following = self.params[i+1:]
better_matches = [after.name.strip() == name and not after.showkey for after in following]
if any(better_matches):
return False
return True
def _remove_without_field(self, param, i):
"""Return False if a parameter name should be kept, otherwise True."""
if not param.showkey:
dependents = [not after.showkey for after in self.params[i+1:]]
if any(dependents):
return False
return True
def _blank_param_value(self, value):
"""Remove the content from *value* while keeping its whitespace.
Replace *value*\ 's nodes with two text nodes, the first containing
whitespace from before its content and the second containing whitespace
from after its content.
"""
match = re.search(r"^(\s*).*?(\s*)$", str(value), FLAGS)
value.nodes = [Text(match.group(1)), Text(match.group(2))]
def _fix_dependendent_params(self, i):
"""Unhide keys if necessary after removing the param at index *i*."""
if not self.params[i].showkey:
for param in self.params[i + 1:]:
if not param.showkey:
param.showkey = True

def _remove_exact(self, needle, keep_field):
"""Remove a specific parameter, *needle*, from the template."""
for i, param in enumerate(self.params):
if param is needle:
if keep_field or not self._remove_without_field(param, i):
if keep_field:
self._blank_param_value(param.value)
else:
self._fix_dependendent_params(i)
self.params.pop(i)
return
raise ValueError(needle)

def _should_remove(self, i, name):
"""Look ahead for a parameter with the same name, but hidden.

If one exists, we should remove the given one rather than blanking it.
"""
if self.params[i].showkey:
following = self.params[i + 1:]
better_matches = [after.name.strip() == name and not after.showkey
for after in following]
return any(better_matches)
return False

@property
def name(self):
"""The name of the template, as a :class:`.Wikicode` object."""
@@ -213,26 +216,25 @@ class Template(Node):
:func:`.utils.parse_anything`; pipes and equal signs are automatically
escaped from *value* when appropriate.

If *name* is already a parameter in the template, we'll replace its
value.

If *showkey* is given, this will determine whether or not to show the
parameter's name (e.g., ``{{foo|bar}}``'s parameter has a name of
``"1"`` but it is hidden); otherwise, we'll make a safe and intelligent
guess.

If *name* is already a parameter in the template, we'll replace its
value while keeping the same whitespace around it. We will also try to
guess the dominant spacing convention when adding a new parameter using
:meth:`_get_spacing_conventions`.

If *before* is given (either a :class:`.Parameter` object or a name),
then we will place the parameter immediately before this one.
Otherwise, it will be added at the end. If *before* is a name and
exists multiple times in the template, we will place it before the last
occurrence. If *before* is not in the template, :exc:`ValueError` is
raised. The argument is ignored if the new parameter already exists.
raised. The argument is ignored if *name* is an existing parameter.

If *preserve_spacing* is ``False``, we will avoid preserving spacing
conventions when changing the value of an existing parameter or when
adding a new one.
If *preserve_spacing* is ``True``, we will try to preserve whitespace
conventions around the parameter, whether it is new or we are updating
an existing value. It is disabled for parameters with hidden keys,
since MediaWiki doesn't strip whitespace in this case.
"""
name, value = parse_anything(name), parse_anything(value)
self._surface_escape(value, "|")
@@ -245,7 +247,7 @@ class Template(Node):
if not existing.showkey:
self._surface_escape(value, "=")
nodes = existing.value.nodes
if preserve_spacing:
if preserve_spacing and existing.showkey:
for i in range(2): # Ignore empty text nodes
if not nodes[i]:
nodes[i] = None
@@ -271,7 +273,7 @@ class Template(Node):
if not showkey:
self._surface_escape(value, "=")

if preserve_spacing:
if preserve_spacing and showkey:
before_n, after_n = self._get_spacing_conventions(use_names=True)
before_v, after_v = self._get_spacing_conventions(use_names=False)
name = parse_anything([before_n, name, after_n])
@@ -294,36 +296,39 @@ class Template(Node):
and :meth:`get`.

If *keep_field* is ``True``, we will keep the parameter's name, but
blank its value. Otherwise, we will remove the parameter completely
*unless* other parameters are dependent on it (e.g. removing ``bar``
from ``{{foo|bar|baz}}`` is unsafe because ``{{foo|baz}}`` is not what
we expected, so ``{{foo||baz}}`` will be produced instead).
blank its value. Otherwise, we will remove the parameter completely.

When removing a parameter with a hidden name, subsequent parameters
with hidden names will be made visible. For example, removing ``bar``
from ``{{foo|bar|baz}}`` produces ``{{foo|2=baz}}`` because
``{{foo|baz}}`` is incorrect.

If the parameter shows up multiple times in the template and *param* is
not a :class:`.Parameter` object, we will remove all instances of it
(and keep only one if *keep_field* is ``True`` - the first instance if
none have dependents, otherwise the one with dependents will be kept).
(and keep only one if *keep_field* is ``True`` - either the one with a
hidden name, if it exists, or the first instance).
"""
if isinstance(param, Parameter):
return self._remove_exact(param, keep_field)

name = str(param).strip()
removed = False
to_remove = []

for i, param in enumerate(self.params):
if param.name.strip() == name:
if keep_field:
if self._remove_with_field(param, i, name):
self._blank_param_value(param.value)
keep_field = False
else:
to_remove.append(i)
else:
if self._remove_without_field(param, i):
if self._should_remove(i, name):
to_remove.append(i)
else:
self._blank_param_value(param.value)
keep_field = False
else:
self._fix_dependendent_params(i)
to_remove.append(i)
if not removed:
removed = True

if not removed:
raise ValueError(name)
for i in reversed(to_remove):
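
The reworked docstrings above describe how Template.remove() now unhides dependent keys instead of leaving an empty slot behind. A small sketch of that behavior; the printed result is the outcome the docstring promises, not output captured from this commit.

import mwparserfromhell

code = mwparserfromhell.parse("{{foo|bar|baz}}")
tmpl = code.filter_templates()[0]

# Removing the first positional parameter ("1") forces the following
# positional parameter to show its key, keeping its meaning intact.
tmpl.remove("1")
print(code)   # expected: {{foo|2=baz}}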


+2 -1 mwparserfromhell/parser/__init__.py

@@ -40,11 +40,11 @@ class ParserError(Exception):


from .builder import Builder
from .tokenizer import Tokenizer
try:
from ._tokenizer import CTokenizer
use_c = True
except ImportError:
from .tokenizer import Tokenizer
CTokenizer = None
use_c = False

@@ -70,6 +70,7 @@ class Parser(object):
if use_c and CTokenizer:
self._tokenizer = CTokenizer()
else:
from .tokenizer import Tokenizer
self._tokenizer = Tokenizer()
self._builder = Builder()



+10 -8 mwparserfromhell/parser/contexts.py

@@ -89,6 +89,7 @@ Local (stack-specific) contexts:
* :const:`FAIL_ON_LBRACE`
* :const:`FAIL_ON_RBRACE`
* :const:`FAIL_ON_EQUALS`
* :const:`HAS_TEMPLATE`

* :const:`TABLE`

@@ -161,15 +162,16 @@ FAIL_NEXT = 1 << 26
FAIL_ON_LBRACE = 1 << 27
FAIL_ON_RBRACE = 1 << 28
FAIL_ON_EQUALS = 1 << 29
HAS_TEMPLATE = 1 << 30
SAFETY_CHECK = (HAS_TEXT + FAIL_ON_TEXT + FAIL_NEXT + FAIL_ON_LBRACE +
FAIL_ON_RBRACE + FAIL_ON_EQUALS)
TABLE_OPEN = 1 << 30
TABLE_CELL_OPEN = 1 << 31
TABLE_CELL_STYLE = 1 << 32
TABLE_ROW_OPEN = 1 << 33
TABLE_TD_LINE = 1 << 34
TABLE_TH_LINE = 1 << 35
FAIL_ON_RBRACE + FAIL_ON_EQUALS + HAS_TEMPLATE)
TABLE_OPEN = 1 << 31
TABLE_CELL_OPEN = 1 << 32
TABLE_CELL_STYLE = 1 << 33
TABLE_ROW_OPEN = 1 << 34
TABLE_TD_LINE = 1 << 35
TABLE_TH_LINE = 1 << 36
TABLE_CELL_LINE_CONTEXTS = TABLE_TD_LINE + TABLE_TH_LINE + TABLE_CELL_STYLE
TABLE = (TABLE_OPEN + TABLE_CELL_OPEN + TABLE_CELL_STYLE + TABLE_ROW_OPEN +
TABLE_TD_LINE + TABLE_TH_LINE)
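
The renumbering above slots HAS_TEMPLATE into bit 30 and shifts every table context up by one bit, so SAFETY_CHECK now covers seven flags. A toy illustration of how these single-bit contexts combine and get tested; the constants are copied from the new values above, not imported from the module.

HAS_TEMPLATE = 1 << 30     # new flag, now part of SAFETY_CHECK
TABLE_OPEN = 1 << 31       # each table flag moves up one bit to make room
FAIL_ON_EQUALS = 1 << 29

context = HAS_TEMPLATE | TABLE_OPEN
assert context & HAS_TEMPLATE             # flag is set
assert not (context & FAIL_ON_EQUALS)     # flag is not set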


+125 -0 mwparserfromhell/parser/ctokenizer/common.h

@@ -0,0 +1,125 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#pragma once

#ifndef PY_SSIZE_T_CLEAN
#define PY_SSIZE_T_CLEAN // See: https://docs.python.org/2/c-api/arg.html
#endif

#include <Python.h>
#include <structmember.h>
#include <bytesobject.h>

/* Compatibility macros */

#if PY_MAJOR_VERSION >= 3
#define IS_PY3K
#endif

#ifndef uint64_t
#define uint64_t unsigned PY_LONG_LONG
#endif

#define malloc PyObject_Malloc // XXX: yuck
#define realloc PyObject_Realloc
#define free PyObject_Free

/* Unicode support macros */

#if defined(IS_PY3K) && PY_MINOR_VERSION >= 3
#define PEP_393
#endif

#ifdef PEP_393
#define Unicode Py_UCS4
#define PyUnicode_FROM_SINGLE(chr) \
PyUnicode_FromKindAndData(PyUnicode_4BYTE_KIND, &(chr), 1)
#else
#define Unicode Py_UNICODE
#define PyUnicode_FROM_SINGLE(chr) \
PyUnicode_FromUnicode(&(chr), 1)
#define PyUnicode_GET_LENGTH PyUnicode_GET_SIZE
#endif

/* Error handling macros */

#define BAD_ROUTE self->route_state
#define BAD_ROUTE_CONTEXT self->route_context
#define FAIL_ROUTE(context) { \
self->route_state = 1; \
self->route_context = context; \
}
#define RESET_ROUTE() self->route_state = 0

/* Shared globals */

extern char** entitydefs;

extern PyObject* NOARGS;
extern PyObject* definitions;

/* Structs */

typedef struct {
Py_ssize_t capacity;
Py_ssize_t length;
#ifdef PEP_393
PyObject* object;
int kind;
void* data;
#else
Py_UNICODE* data;
#endif
} Textbuffer;

struct Stack {
PyObject* stack;
uint64_t context;
Textbuffer* textbuffer;
struct Stack* next;
};
typedef struct Stack Stack;

typedef struct {
PyObject* object; /* base PyUnicodeObject object */
Py_ssize_t length; /* length of object, in code points */
#ifdef PEP_393
int kind; /* object's kind value */
void* data; /* object's raw unicode buffer */
#else
Py_UNICODE* buf; /* object's internal buffer */
#endif
} TokenizerInput;

typedef struct {
PyObject_HEAD
TokenizerInput text; /* text to tokenize */
Stack* topstack; /* topmost stack */
Py_ssize_t head; /* current position in text */
int global; /* global context */
int depth; /* stack recursion depth */
int cycles; /* total number of stack recursions */
int route_state; /* whether a BadRoute has been triggered */
uint64_t route_context; /* context when the last BadRoute was triggered */
int skip_style_tags; /* temp fix for the sometimes broken tag parser */
} Tokenizer;

+105 -0 mwparserfromhell/parser/ctokenizer/contexts.h

@@ -0,0 +1,105 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#pragma once

/* Local contexts */

#define LC_TEMPLATE 0x0000000000000007
#define LC_TEMPLATE_NAME 0x0000000000000001
#define LC_TEMPLATE_PARAM_KEY 0x0000000000000002
#define LC_TEMPLATE_PARAM_VALUE 0x0000000000000004

#define LC_ARGUMENT 0x0000000000000018
#define LC_ARGUMENT_NAME 0x0000000000000008
#define LC_ARGUMENT_DEFAULT 0x0000000000000010

#define LC_WIKILINK 0x0000000000000060
#define LC_WIKILINK_TITLE 0x0000000000000020
#define LC_WIKILINK_TEXT 0x0000000000000040

#define LC_EXT_LINK 0x0000000000000180
#define LC_EXT_LINK_URI 0x0000000000000080
#define LC_EXT_LINK_TITLE 0x0000000000000100

#define LC_HEADING 0x0000000000007E00
#define LC_HEADING_LEVEL_1 0x0000000000000200
#define LC_HEADING_LEVEL_2 0x0000000000000400
#define LC_HEADING_LEVEL_3 0x0000000000000800
#define LC_HEADING_LEVEL_4 0x0000000000001000
#define LC_HEADING_LEVEL_5 0x0000000000002000
#define LC_HEADING_LEVEL_6 0x0000000000004000

#define LC_TAG 0x0000000000078000
#define LC_TAG_OPEN 0x0000000000008000
#define LC_TAG_ATTR 0x0000000000010000
#define LC_TAG_BODY 0x0000000000020000
#define LC_TAG_CLOSE 0x0000000000040000

#define LC_STYLE 0x0000000000780000
#define LC_STYLE_ITALICS 0x0000000000080000
#define LC_STYLE_BOLD 0x0000000000100000
#define LC_STYLE_PASS_AGAIN 0x0000000000200000
#define LC_STYLE_SECOND_PASS 0x0000000000400000

#define LC_DLTERM 0x0000000000800000

#define LC_SAFETY_CHECK 0x000000007F000000
#define LC_HAS_TEXT 0x0000000001000000
#define LC_FAIL_ON_TEXT 0x0000000002000000
#define LC_FAIL_NEXT 0x0000000004000000
#define LC_FAIL_ON_LBRACE 0x0000000008000000
#define LC_FAIL_ON_RBRACE 0x0000000010000000
#define LC_FAIL_ON_EQUALS 0x0000000020000000
#define LC_HAS_TEMPLATE 0x0000000040000000

#define LC_TABLE 0x0000001F80000000
#define LC_TABLE_CELL_LINE_CONTEXTS 0x0000001A00000000
#define LC_TABLE_OPEN 0x0000000080000000
#define LC_TABLE_CELL_OPEN 0x0000000100000000
#define LC_TABLE_CELL_STYLE 0x0000000200000000
#define LC_TABLE_ROW_OPEN 0x0000000400000000
#define LC_TABLE_TD_LINE 0x0000000800000000
#define LC_TABLE_TH_LINE 0x0000001000000000

/* Global contexts */

#define GL_HEADING 0x1

/* Aggregate contexts */

#define AGG_FAIL (LC_TEMPLATE | LC_ARGUMENT | LC_WIKILINK | LC_EXT_LINK_TITLE | LC_HEADING | LC_TAG | LC_STYLE | LC_TABLE_OPEN)
#define AGG_UNSAFE (LC_TEMPLATE_NAME | LC_WIKILINK_TITLE | LC_EXT_LINK_TITLE | LC_TEMPLATE_PARAM_KEY | LC_ARGUMENT_NAME)
#define AGG_DOUBLE (LC_TEMPLATE_PARAM_KEY | LC_TAG_CLOSE | LC_TABLE_ROW_OPEN)
#define AGG_NO_WIKILINKS (LC_TEMPLATE_NAME | LC_ARGUMENT_NAME | LC_WIKILINK_TITLE | LC_EXT_LINK_URI)
#define AGG_NO_EXT_LINKS (LC_TEMPLATE_NAME | LC_ARGUMENT_NAME | LC_WIKILINK_TITLE | LC_EXT_LINK)

/* Tag contexts */

#define TAG_NAME 0x01
#define TAG_ATTR_READY 0x02
#define TAG_ATTR_NAME 0x04
#define TAG_ATTR_VALUE 0x08
#define TAG_QUOTED 0x10
#define TAG_NOTE_SPACE 0x20
#define TAG_NOTE_EQUALS 0x40
#define TAG_NOTE_QUOTE 0x80

+78 -0 mwparserfromhell/parser/ctokenizer/tag_data.c

@@ -0,0 +1,78 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#include "tag_data.h"
#include "contexts.h"

/*
Initialize a new TagData object.
*/
TagData* TagData_new(TokenizerInput* text)
{
#define ALLOC_BUFFER(name) \
name = Textbuffer_new(text); \
if (!name) { \
TagData_dealloc(self); \
return NULL; \
}

TagData *self = malloc(sizeof(TagData));
if (!self) {
PyErr_NoMemory();
return NULL;
}
self->context = TAG_NAME;
ALLOC_BUFFER(self->pad_first)
ALLOC_BUFFER(self->pad_before_eq)
ALLOC_BUFFER(self->pad_after_eq)
self->quoter = 0;
self->reset = 0;
return self;

#undef ALLOC_BUFFER
}

/*
Deallocate the given TagData object.
*/
void TagData_dealloc(TagData* self)
{
if (self->pad_first)
Textbuffer_dealloc(self->pad_first);
if (self->pad_before_eq)
Textbuffer_dealloc(self->pad_before_eq);
if (self->pad_after_eq)
Textbuffer_dealloc(self->pad_after_eq);
free(self);
}

/*
Clear the internal buffers of the given TagData object.
*/
int TagData_reset_buffers(TagData* self)
{
if (Textbuffer_reset(self->pad_first) ||
Textbuffer_reset(self->pad_before_eq) ||
Textbuffer_reset(self->pad_after_eq))
return -1;
return 0;
}

+43 -0 mwparserfromhell/parser/ctokenizer/tag_data.h

@@ -0,0 +1,43 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#pragma once

#include "common.h"
#include "textbuffer.h"

/* Structs */

typedef struct {
uint64_t context;
Textbuffer* pad_first;
Textbuffer* pad_before_eq;
Textbuffer* pad_after_eq;
Unicode quoter;
Py_ssize_t reset;
} TagData;

/* Functions */

TagData* TagData_new(TokenizerInput*);
void TagData_dealloc(TagData*);
int TagData_reset_buffers(TagData*);

+232 -0 mwparserfromhell/parser/ctokenizer/textbuffer.c

@@ -0,0 +1,232 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#include "textbuffer.h"

#define INITIAL_CAPACITY 32
#define RESIZE_FACTOR 2
#define CONCAT_EXTRA 32

/*
Internal allocation function for textbuffers.
*/
static int internal_alloc(Textbuffer* self, Unicode maxchar)
{
self->capacity = INITIAL_CAPACITY;
self->length = 0;

#ifdef PEP_393
self->object = PyUnicode_New(self->capacity, maxchar);
if (!self->object)
return -1;
self->kind = PyUnicode_KIND(self->object);
self->data = PyUnicode_DATA(self->object);
#else
(void) maxchar; // Unused
self->data = malloc(sizeof(Unicode) * self->capacity);
if (!self->data)
return -1;
#endif

return 0;
}

/*
Internal deallocation function for textbuffers.
*/
static void internal_dealloc(Textbuffer* self)
{
#ifdef PEP_393
Py_DECREF(self->object);
#else
free(self->data);
#endif
}

/*
Internal resize function.
*/
static int internal_resize(Textbuffer* self, Py_ssize_t new_cap)
{
#ifdef PEP_393
PyObject *newobj;
void *newdata;

newobj = PyUnicode_New(new_cap, PyUnicode_MAX_CHAR_VALUE(self->object));
if (!newobj)
return -1;
newdata = PyUnicode_DATA(newobj);
memcpy(newdata, self->data, self->length * self->kind);
Py_DECREF(self->object);
self->object = newobj;
self->data = newdata;
#else
if (!(self->data = realloc(self->data, sizeof(Unicode) * new_cap)))
return -1;
#endif

self->capacity = new_cap;
return 0;
}

/*
Create a new textbuffer object.
*/
Textbuffer* Textbuffer_new(TokenizerInput* text)
{
Textbuffer* self = malloc(sizeof(Textbuffer));
Unicode maxchar = 0;

#ifdef PEP_393
maxchar = PyUnicode_MAX_CHAR_VALUE(text->object);
#endif

if (!self)
goto fail_nomem;
if (internal_alloc(self, maxchar) < 0)
goto fail_dealloc;
return self;

fail_dealloc:
free(self);
fail_nomem:
PyErr_NoMemory();
return NULL;
}

/*
Deallocate the given textbuffer.
*/
void Textbuffer_dealloc(Textbuffer* self)
{
internal_dealloc(self);
free(self);
}

/*
Reset a textbuffer to its initial, empty state.
*/
int Textbuffer_reset(Textbuffer* self)
{
Unicode maxchar = 0;

#ifdef PEP_393
maxchar = PyUnicode_MAX_CHAR_VALUE(self->object);
#endif

internal_dealloc(self);
if (internal_alloc(self, maxchar))
return -1;
return 0;
}

/*
Write a Unicode codepoint to the given textbuffer.
*/
int Textbuffer_write(Textbuffer* self, Unicode code)
{
if (self->length >= self->capacity) {
if (internal_resize(self, self->capacity * RESIZE_FACTOR) < 0)
return -1;
}

#ifdef PEP_393
PyUnicode_WRITE(self->kind, self->data, self->length++, code);
#else
self->data[self->length++] = code;
#endif

return 0;
}

/*
Read a Unicode codepoint from the given index of the given textbuffer.

This function does not check for bounds.
*/
Unicode Textbuffer_read(Textbuffer* self, Py_ssize_t index)
{
#ifdef PEP_393
return PyUnicode_READ(self->kind, self->data, index);
#else
return self->data[index];
#endif
}

/*
Return the contents of the textbuffer as a Python Unicode object.
*/
PyObject* Textbuffer_render(Textbuffer* self)
{
#ifdef PEP_393
return PyUnicode_FromKindAndData(self->kind, self->data, self->length);
#else
return PyUnicode_FromUnicode(self->data, self->length);
#endif
}

/*
Concatenate the 'other' textbuffer onto the end of the given textbuffer.
*/
int Textbuffer_concat(Textbuffer* self, Textbuffer* other)
{
Py_ssize_t newlen = self->length + other->length;

if (newlen > self->capacity) {
if (internal_resize(self, newlen + CONCAT_EXTRA) < 0)
return -1;
}

#ifdef PEP_393
assert(self->kind == other->kind);
memcpy(((Py_UCS1*) self->data) + self->kind * self->length, other->data,
other->length * other->kind);
#else
memcpy(self->data + self->length, other->data,
other->length * sizeof(Unicode));
#endif

self->length = newlen;
return 0;
}

/*
Reverse the contents of the given textbuffer.
*/
void Textbuffer_reverse(Textbuffer* self)
{
Py_ssize_t i, end = self->length - 1;
Unicode tmp;

for (i = 0; i < self->length / 2; i++) {
#ifdef PEP_393
tmp = PyUnicode_READ(self->kind, self->data, i);
PyUnicode_WRITE(self->kind, self->data, i,
PyUnicode_READ(self->kind, self->data, end - i));
PyUnicode_WRITE(self->kind, self->data, end - i, tmp);
#else
tmp = self->data[i];
self->data[i] = self->data[end - i];
self->data[end - i] = tmp;
#endif
}
}

+36 -0 mwparserfromhell/parser/ctokenizer/textbuffer.h

@@ -0,0 +1,36 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#pragma once

#include "common.h"

/* Functions */

Textbuffer* Textbuffer_new(TokenizerInput*);
void Textbuffer_dealloc(Textbuffer*);
int Textbuffer_reset(Textbuffer*);
int Textbuffer_write(Textbuffer*, Unicode);
Unicode Textbuffer_read(Textbuffer*, Py_ssize_t);
PyObject* Textbuffer_render(Textbuffer*);
int Textbuffer_concat(Textbuffer*, Textbuffer*);
void Textbuffer_reverse(Textbuffer*);

+223 -896 mwparserfromhell/parser/ctokenizer/tok_parse.c (file diff suppressed because it is too large)


+35 -0 mwparserfromhell/parser/ctokenizer/tok_parse.h

@@ -0,0 +1,35 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#pragma once

#include "common.h"

static const char MARKERS[] = {
'{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':', '/',
'-', '!', '\n', '\0'};

#define NUM_MARKERS 19

/* Functions */

PyObject* Tokenizer_parse(Tokenizer*, uint64_t, int);

+345 -0 mwparserfromhell/parser/ctokenizer/tok_support.c

@@ -0,0 +1,345 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#include "tok_support.h"
#include "textbuffer.h"
#include "tokens.h"

/*
Add a new token stack, context, and textbuffer to the list.
*/
int Tokenizer_push(Tokenizer* self, uint64_t context)
{
Stack* top = malloc(sizeof(Stack));

if (!top) {
PyErr_NoMemory();
return -1;
}
top->stack = PyList_New(0);
top->context = context;
top->textbuffer = Textbuffer_new(&self->text);
if (!top->textbuffer)
return -1;
top->next = self->topstack;
self->topstack = top;
self->depth++;
self->cycles++;
return 0;
}

/*
Push the textbuffer onto the stack as a Text node and clear it.
*/
int Tokenizer_push_textbuffer(Tokenizer* self)
{
PyObject *text, *kwargs, *token;
Textbuffer* buffer = self->topstack->textbuffer;

if (buffer->length == 0)
return 0;
text = Textbuffer_render(buffer);
if (!text)
return -1;
kwargs = PyDict_New();
if (!kwargs) {
Py_DECREF(text);
return -1;
}
PyDict_SetItemString(kwargs, "text", text);
Py_DECREF(text);
token = PyObject_Call(Text, NOARGS, kwargs);
Py_DECREF(kwargs);
if (!token)
return -1;
if (PyList_Append(self->topstack->stack, token)) {
Py_DECREF(token);
return -1;
}
Py_DECREF(token);
if (Textbuffer_reset(buffer))
return -1;
return 0;
}

/*
Pop and deallocate the top token stack/context/textbuffer.
*/
void Tokenizer_delete_top_of_stack(Tokenizer* self)
{
Stack* top = self->topstack;

Py_DECREF(top->stack);
Textbuffer_dealloc(top->textbuffer);
self->topstack = top->next;
free(top);
self->depth--;
}

/*
Pop the current stack/context/textbuffer, returning the stack.
*/
PyObject* Tokenizer_pop(Tokenizer* self)
{
PyObject* stack;

if (Tokenizer_push_textbuffer(self))
return NULL;
stack = self->topstack->stack;
Py_INCREF(stack);
Tokenizer_delete_top_of_stack(self);
return stack;
}

/*
Pop the current stack/context/textbuffer, returning the stack. We will also
replace the underlying stack's context with the current stack's.
*/
PyObject* Tokenizer_pop_keeping_context(Tokenizer* self)
{
PyObject* stack;
uint64_t context;

if (Tokenizer_push_textbuffer(self))
return NULL;
stack = self->topstack->stack;
Py_INCREF(stack);
context = self->topstack->context;
Tokenizer_delete_top_of_stack(self);
self->topstack->context = context;
return stack;
}

/*
Fail the current tokenization route. Discards the current
stack/context/textbuffer and sets the BAD_ROUTE flag.
*/
void* Tokenizer_fail_route(Tokenizer* self)
{
uint64_t context = self->topstack->context;
PyObject* stack = Tokenizer_pop(self);

Py_XDECREF(stack);
FAIL_ROUTE(context);
return NULL;
}

/*
Write a token to the current token stack.
*/
int Tokenizer_emit_token(Tokenizer* self, PyObject* token, int first)
{
PyObject* instance;

if (Tokenizer_push_textbuffer(self))
return -1;
instance = PyObject_CallObject(token, NULL);
if (!instance)
return -1;
if (first ? PyList_Insert(self->topstack->stack, 0, instance) :
PyList_Append(self->topstack->stack, instance)) {
Py_DECREF(instance);
return -1;
}
Py_DECREF(instance);
return 0;
}

/*
Write a token to the current token stack, with kwargs. Steals a reference
to kwargs.
*/
int Tokenizer_emit_token_kwargs(Tokenizer* self, PyObject* token,
PyObject* kwargs, int first)
{
PyObject* instance;

if (Tokenizer_push_textbuffer(self)) {
Py_DECREF(kwargs);
return -1;
}
instance = PyObject_Call(token, NOARGS, kwargs);
if (!instance) {
Py_DECREF(kwargs);
return -1;
}
if (first ? PyList_Insert(self->topstack->stack, 0, instance):
PyList_Append(self->topstack->stack, instance)) {
Py_DECREF(instance);
Py_DECREF(kwargs);
return -1;
}
Py_DECREF(instance);
Py_DECREF(kwargs);
return 0;
}

/*
Write a Unicode codepoint to the current textbuffer.
*/
int Tokenizer_emit_char(Tokenizer* self, Unicode code)
{
return Textbuffer_write(self->topstack->textbuffer, code);
}

/*
Write a string of text to the current textbuffer.
*/
int Tokenizer_emit_text(Tokenizer* self, const char* text)
{
int i = 0;

while (text[i]) {
if (Tokenizer_emit_char(self, text[i]))
return -1;
i++;
}
return 0;
}

/*
Write the contents of another textbuffer to the current textbuffer,
deallocating it in the process.
*/
int Tokenizer_emit_textbuffer(Tokenizer* self, Textbuffer* buffer)
{
int retval = Textbuffer_concat(self->topstack->textbuffer, buffer);
Textbuffer_dealloc(buffer);
return retval;
}

/*
Write a series of tokens to the current stack at once.
*/
int Tokenizer_emit_all(Tokenizer* self, PyObject* tokenlist)
{
int pushed = 0;
PyObject *stack, *token, *left, *right, *text;
Textbuffer* buffer;
Py_ssize_t size;

if (PyList_GET_SIZE(tokenlist) > 0) {
token = PyList_GET_ITEM(tokenlist, 0);
switch (PyObject_IsInstance(token, Text)) {
case 0:
break;
case 1: {
pushed = 1;
buffer = self->topstack->textbuffer;
if (buffer->length == 0)
break;
left = Textbuffer_render(buffer);
if (!left)
return -1;
right = PyObject_GetAttrString(token, "text");
if (!right)
return -1;
text = PyUnicode_Concat(left, right);
Py_DECREF(left);
Py_DECREF(right);
if (!text)
return -1;
if (PyObject_SetAttrString(token, "text", text)) {
Py_DECREF(text);
return -1;
}
Py_DECREF(text);
if (Textbuffer_reset(buffer))
return -1;
break;
}
case -1:
return -1;
}
}
if (!pushed) {
if (Tokenizer_push_textbuffer(self))
return -1;
}
stack = self->topstack->stack;
size = PyList_GET_SIZE(stack);
if (PyList_SetSlice(stack, size, size, tokenlist))
return -1;
return 0;
}

/*
Pop the current stack, write text, and then write the stack. 'text' is a
NULL-terminated array of chars.
*/
int Tokenizer_emit_text_then_stack(Tokenizer* self, const char* text)
{
PyObject* stack = Tokenizer_pop(self);

if (Tokenizer_emit_text(self, text)) {
Py_DECREF(stack);
return -1;
}
if (stack) {
if (PyList_GET_SIZE(stack) > 0) {
if (Tokenizer_emit_all(self, stack)) {
Py_DECREF(stack);
return -1;
}
}
Py_DECREF(stack);
}
self->head--;
return 0;
}

/*
Internal function to read the codepoint at the given index from the input.
*/
static Unicode read_codepoint(TokenizerInput* text, Py_ssize_t index)
{
#ifdef PEP_393
return PyUnicode_READ(text->kind, text->data, index);
#else
return text->buf[index];
#endif
}

/*
Read the value at a relative point in the wikicode, forwards.
*/
Unicode Tokenizer_read(Tokenizer* self, Py_ssize_t delta)
{
Py_ssize_t index = self->head + delta;

if (index >= self->text.length)
return '\0';
return read_codepoint(&self->text, index);
}

/*
Read the value at a relative point in the wikicode, backwards.
*/
Unicode Tokenizer_read_backwards(Tokenizer* self, Py_ssize_t delta)
{
Py_ssize_t index;

if (delta > self->head)
return '\0';
index = self->head - delta;
return read_codepoint(&self->text, index);
}

+62 -0 mwparserfromhell/parser/ctokenizer/tok_support.h

@@ -0,0 +1,62 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#pragma once

#include "common.h"

/* Functions */

int Tokenizer_push(Tokenizer*, uint64_t);
int Tokenizer_push_textbuffer(Tokenizer*);
void Tokenizer_delete_top_of_stack(Tokenizer*);
PyObject* Tokenizer_pop(Tokenizer*);
PyObject* Tokenizer_pop_keeping_context(Tokenizer*);
void* Tokenizer_fail_route(Tokenizer*);

int Tokenizer_emit_token(Tokenizer*, PyObject*, int);
int Tokenizer_emit_token_kwargs(Tokenizer*, PyObject*, PyObject*, int);
int Tokenizer_emit_char(Tokenizer*, Unicode);
int Tokenizer_emit_text(Tokenizer*, const char*);
int Tokenizer_emit_textbuffer(Tokenizer*, Textbuffer*);
int Tokenizer_emit_all(Tokenizer*, PyObject*);
int Tokenizer_emit_text_then_stack(Tokenizer*, const char*);

Unicode Tokenizer_read(Tokenizer*, Py_ssize_t);
Unicode Tokenizer_read_backwards(Tokenizer*, Py_ssize_t);

/* Macros */

#define MAX_DEPTH 40
#define MAX_CYCLES 100000

#define Tokenizer_CAN_RECURSE(self) \
(self->depth < MAX_DEPTH && self->cycles < MAX_CYCLES)

#define Tokenizer_emit(self, token) \
Tokenizer_emit_token(self, token, 0)
#define Tokenizer_emit_first(self, token) \
Tokenizer_emit_token(self, token, 1)
#define Tokenizer_emit_kwargs(self, token, kwargs) \
Tokenizer_emit_token_kwargs(self, token, kwargs, 0)
#define Tokenizer_emit_first_kwargs(self, token, kwargs) \
Tokenizer_emit_token_kwargs(self, token, kwargs, 1)

+310 -0 mwparserfromhell/parser/ctokenizer/tokenizer.c

@@ -0,0 +1,310 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#include "tokenizer.h"
#include "tok_parse.h"
#include "tokens.h"

/* Globals */

int route_state;
uint64_t route_context;

char** entitydefs;

PyObject* NOARGS;
PyObject* definitions;

static PyObject* ParserError;

/* Forward declarations */

static int load_exceptions(void);

/*
Create a new tokenizer object.
*/
static PyObject*
Tokenizer_new(PyTypeObject* type, PyObject* args, PyObject* kwds)
{
Tokenizer* self = (Tokenizer*) type->tp_alloc(type, 0);
return (PyObject*) self;
}

/*
Deallocate the given tokenizer's text field.
*/
static void dealloc_tokenizer_text(TokenizerInput* text)
{
Py_XDECREF(text->object);
}

/*
Deallocate the given tokenizer object.
*/
static void Tokenizer_dealloc(Tokenizer* self)
{
Stack *this = self->topstack, *next;
dealloc_tokenizer_text(&self->text);

while (this) {
Py_DECREF(this->stack);
Textbuffer_dealloc(this->textbuffer);
next = this->next;
free(this);
this = next;
}
Py_TYPE(self)->tp_free((PyObject*) self);
}

/*
Initialize a new tokenizer instance's text field.
*/
static void init_tokenizer_text(TokenizerInput* text)
{
text->object = Py_None;
Py_INCREF(Py_None);
text->length = 0;
#ifdef PEP_393
text->kind = PyUnicode_WCHAR_KIND;
text->data = NULL;
#else
text->buf = NULL;
#endif
}

/*
Initialize a new tokenizer instance by setting instance attributes.
*/
static int Tokenizer_init(Tokenizer* self, PyObject* args, PyObject* kwds)
{
static char* kwlist[] = {NULL};

if (!PyArg_ParseTupleAndKeywords(args, kwds, "", kwlist))
return -1;
init_tokenizer_text(&self->text);
self->topstack = NULL;
self->head = self->global = self->depth = self->cycles = 0;
self->route_context = self->route_state = 0;
self->skip_style_tags = 0;
return 0;
}

/*
Load input text into the tokenizer.
*/
static int load_tokenizer_text(TokenizerInput* text, PyObject *input)
{
dealloc_tokenizer_text(text);
text->object = input;

#ifdef PEP_393
if (PyUnicode_READY(input) < 0)
return -1;
text->kind = PyUnicode_KIND(input);
text->data = PyUnicode_DATA(input);
#else
text->buf = PyUnicode_AS_UNICODE(input);
#endif
text->length = PyUnicode_GET_LENGTH(input);
return 0;
}

/*
Build a list of tokens from a string of wikicode and return it.
*/
static PyObject* Tokenizer_tokenize(Tokenizer* self, PyObject* args)
{
PyObject *input, *tokens;
uint64_t context = 0;
int skip_style_tags = 0;

if (PyArg_ParseTuple(args, "U|ii", &input, &context, &skip_style_tags)) {
Py_INCREF(input);
if (load_tokenizer_text(&self->text, input))
return NULL;
}
else {
const char *encoded;
Py_ssize_t size;

/* Failed to parse a Unicode object; try a string instead. */
PyErr_Clear();
if (!PyArg_ParseTuple(args, "s#|ii", &encoded, &size, &context,
&skip_style_tags))
return NULL;
if (!(input = PyUnicode_FromStringAndSize(encoded, size)))
return NULL;
if (load_tokenizer_text(&self->text, input))
return NULL;
}

self->head = self->global = self->depth = self->cycles = 0;
self->skip_style_tags = skip_style_tags;
tokens = Tokenizer_parse(self, context, 1);

if ((!tokens && !PyErr_Occurred()) || self->topstack) {
if (!ParserError) {
if (load_exceptions())
return NULL;
}
if (BAD_ROUTE) {
RESET_ROUTE();
PyErr_SetString(ParserError, "C tokenizer exited with BAD_ROUTE");
}
else if (self->topstack)
PyErr_SetString(ParserError,
"C tokenizer exited with non-empty token stack");
else
PyErr_SetString(ParserError, "C tokenizer exited unexpectedly");
return NULL;
}
return tokens;
}

static int load_entities(void)
{
PyObject *tempmod, *defmap, *deflist;
unsigned numdefs, i;
#ifdef IS_PY3K
PyObject *string;
#endif

tempmod = PyImport_ImportModule(ENTITYDEFS_MODULE);
if (!tempmod)
return -1;
defmap = PyObject_GetAttrString(tempmod, "entitydefs");
if (!defmap)
return -1;
Py_DECREF(tempmod);
deflist = PyDict_Keys(defmap);
if (!deflist)
return -1;
Py_DECREF(defmap);
numdefs = (unsigned) PyList_GET_SIZE(defmap);
entitydefs = calloc(numdefs + 1, sizeof(char*));
if (!entitydefs)
return -1;
for (i = 0; i < numdefs; i++) {
#ifdef IS_PY3K
string = PyUnicode_AsASCIIString(PyList_GET_ITEM(deflist, i));
if (!string)
return -1;
entitydefs[i] = PyBytes_AsString(string);
#else
entitydefs[i] = PyBytes_AsString(PyList_GET_ITEM(deflist, i));
#endif
if (!entitydefs[i])
return -1;
}
Py_DECREF(deflist);
return 0;
}

static int load_tokens(void)
{
PyObject *tempmod, *tokens,
*globals = PyEval_GetGlobals(),
*locals = PyEval_GetLocals(),
*fromlist = PyList_New(1),
*modname = IMPORT_NAME_FUNC("tokens");
char *name = "mwparserfromhell.parser";

if (!fromlist || !modname)
return -1;
PyList_SET_ITEM(fromlist, 0, modname);
tempmod = PyImport_ImportModuleLevel(name, globals, locals, fromlist, 0);
Py_DECREF(fromlist);
if (!tempmod)
return -1;
tokens = PyObject_GetAttrString(tempmod, "tokens");
Py_DECREF(tempmod);
load_tokens_from_module(tokens);
Py_DECREF(tokens);
return 0;
}

static int load_defs(void)
{
PyObject *tempmod,
*globals = PyEval_GetGlobals(),
*locals = PyEval_GetLocals(),
*fromlist = PyList_New(1),
*modname = IMPORT_NAME_FUNC("definitions");
char *name = "mwparserfromhell";

if (!fromlist || !modname)
return -1;
PyList_SET_ITEM(fromlist, 0, modname);
tempmod = PyImport_ImportModuleLevel(name, globals, locals, fromlist, 0);
Py_DECREF(fromlist);
if (!tempmod)
return -1;
definitions = PyObject_GetAttrString(tempmod, "definitions");
Py_DECREF(tempmod);
return 0;
}

static int load_exceptions(void)
{
PyObject *tempmod, *parsermod,
*globals = PyEval_GetGlobals(),
*locals = PyEval_GetLocals(),
*fromlist = PyList_New(1),
*modname = IMPORT_NAME_FUNC("parser");
char *name = "mwparserfromhell";

if (!fromlist || !modname)
return -1;
PyList_SET_ITEM(fromlist, 0, modname);
tempmod = PyImport_ImportModuleLevel(name, globals, locals, fromlist, 0);
Py_DECREF(fromlist);
if (!tempmod)
return -1;
parsermod = PyObject_GetAttrString(tempmod, "parser");
Py_DECREF(tempmod);
ParserError = PyObject_GetAttrString(parsermod, "ParserError");
Py_DECREF(parsermod);
return 0;
}

PyMODINIT_FUNC INIT_FUNC_NAME(void)
{
PyObject *module;

TokenizerType.tp_new = PyType_GenericNew;
if (PyType_Ready(&TokenizerType) < 0)
INIT_ERROR;
module = CREATE_MODULE;
if (!module)
INIT_ERROR;
Py_INCREF(&TokenizerType);
PyModule_AddObject(module, "CTokenizer", (PyObject*) &TokenizerType);
Py_INCREF(Py_True);
PyDict_SetItemString(TokenizerType.tp_dict, "USES_C", Py_True);
NOARGS = PyTuple_New(0);
if (!NOARGS || load_entities() || load_tokens() || load_defs())
INIT_ERROR;
#ifdef IS_PY3K
return module;
#endif
}
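
The module initializer above exposes the type as _tokenizer.CTokenizer and sets USES_C = True on its dict. As a rough, hypothetical sketch (not part of this diff), the compiled tokenizer can be driven directly as shown below, assuming the extension built and is importable; the expected token reprs follow the .mwtest fixtures later in the diff:

# Sketch only; assumes the C extension compiled and is importable.
from mwparserfromhell.parser._tokenizer import CTokenizer

print(CTokenizer.USES_C)    # True -- set on the type dict during module init
tokens = CTokenizer().tokenize("foo {{bar}}")
print(tokens)               # [Text(text='foo '), TemplateOpen(), Text(text='bar'), TemplateClose()]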

+ 111
- 0
mwparserfromhell/parser/ctokenizer/tokenizer.h View File

@@ -0,0 +1,111 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#pragma once

#include "common.h"
#include "textbuffer.h"

/* Functions */

static PyObject* Tokenizer_new(PyTypeObject*, PyObject*, PyObject*);
static void Tokenizer_dealloc(Tokenizer*);
static int Tokenizer_init(Tokenizer*, PyObject*, PyObject*);
static PyObject* Tokenizer_tokenize(Tokenizer*, PyObject*);

/* Compatibility macros */

#ifdef IS_PY3K
#define IMPORT_NAME_FUNC PyUnicode_FromString
#define CREATE_MODULE PyModule_Create(&module_def);
#define ENTITYDEFS_MODULE "html.entities"
#define INIT_FUNC_NAME PyInit__tokenizer
#define INIT_ERROR return NULL
#else
#define IMPORT_NAME_FUNC PyBytes_FromString
#define CREATE_MODULE Py_InitModule("_tokenizer", NULL);
#define ENTITYDEFS_MODULE "htmlentitydefs"
#define INIT_FUNC_NAME init_tokenizer
#define INIT_ERROR return
#endif

/* Structs */

static PyMethodDef Tokenizer_methods[] = {
{"tokenize", (PyCFunction) Tokenizer_tokenize, METH_VARARGS,
"Build a list of tokens from a string of wikicode and return it."},
{NULL}
};

static PyMemberDef Tokenizer_members[] = {
{NULL}
};

static PyTypeObject TokenizerType = {
PyVarObject_HEAD_INIT(NULL, 0)
"_tokenizer.CTokenizer", /* tp_name */
sizeof(Tokenizer), /* tp_basicsize */
0, /* tp_itemsize */
(destructor) Tokenizer_dealloc, /* tp_dealloc */
0, /* tp_print */
0, /* tp_getattr */
0, /* tp_setattr */
0, /* tp_compare */
0, /* tp_repr */
0, /* tp_as_number */
0, /* tp_as_sequence */
0, /* tp_as_mapping */
0, /* tp_hash */
0, /* tp_call */
0, /* tp_str */
0, /* tp_getattro */
0, /* tp_setattro */
0, /* tp_as_buffer */
Py_TPFLAGS_DEFAULT, /* tp_flags */
"Creates a list of tokens from a string of wikicode.", /* tp_doc */
0, /* tp_traverse */
0, /* tp_clear */
0, /* tp_richcompare */
0, /* tp_weaklistoffset */
0, /* tp_iter */
0, /* tp_iternext */
Tokenizer_methods, /* tp_methods */
Tokenizer_members, /* tp_members */
0, /* tp_getset */
0, /* tp_base */
0, /* tp_dict */
0, /* tp_descr_get */
0, /* tp_descr_set */
0, /* tp_dictoffset */
(initproc) Tokenizer_init, /* tp_init */
0, /* tp_alloc */
Tokenizer_new, /* tp_new */
};

#ifdef IS_PY3K
static PyModuleDef module_def = {
PyModuleDef_HEAD_INIT,
"_tokenizer",
"Creates a list of tokens from a string of wikicode.",
-1, NULL, NULL, NULL, NULL, NULL
};
#endif

+ 111
- 0
mwparserfromhell/parser/ctokenizer/tokens.c View File

@@ -0,0 +1,111 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#include "tokens.h"

/* Globals */

PyObject* Text;

PyObject* TemplateOpen;
PyObject* TemplateParamSeparator;
PyObject* TemplateParamEquals;
PyObject* TemplateClose;

PyObject* ArgumentOpen;
PyObject* ArgumentSeparator;
PyObject* ArgumentClose;

PyObject* WikilinkOpen;
PyObject* WikilinkSeparator;
PyObject* WikilinkClose;

PyObject* ExternalLinkOpen;
PyObject* ExternalLinkSeparator;
PyObject* ExternalLinkClose;

PyObject* HTMLEntityStart;
PyObject* HTMLEntityNumeric;
PyObject* HTMLEntityHex;
PyObject* HTMLEntityEnd;
PyObject* HeadingStart;
PyObject* HeadingEnd;

PyObject* CommentStart;
PyObject* CommentEnd;

PyObject* TagOpenOpen;
PyObject* TagAttrStart;
PyObject* TagAttrEquals;
PyObject* TagAttrQuote;
PyObject* TagCloseOpen;
PyObject* TagCloseSelfclose;
PyObject* TagOpenClose;
PyObject* TagCloseClose;

/*
Load individual tokens into globals from the given Python module object.
*/
void load_tokens_from_module(PyObject* module)
{
Text = PyObject_GetAttrString(module, "Text");

TemplateOpen = PyObject_GetAttrString(module, "TemplateOpen");
TemplateParamSeparator = PyObject_GetAttrString(module,
"TemplateParamSeparator");
TemplateParamEquals = PyObject_GetAttrString(module,
"TemplateParamEquals");
TemplateClose = PyObject_GetAttrString(module, "TemplateClose");

ArgumentOpen = PyObject_GetAttrString(module, "ArgumentOpen");
ArgumentSeparator = PyObject_GetAttrString(module, "ArgumentSeparator");
ArgumentClose = PyObject_GetAttrString(module, "ArgumentClose");

WikilinkOpen = PyObject_GetAttrString(module, "WikilinkOpen");
WikilinkSeparator = PyObject_GetAttrString(module, "WikilinkSeparator");
WikilinkClose = PyObject_GetAttrString(module, "WikilinkClose");

ExternalLinkOpen = PyObject_GetAttrString(module, "ExternalLinkOpen");
ExternalLinkSeparator = PyObject_GetAttrString(module,
"ExternalLinkSeparator");
ExternalLinkClose = PyObject_GetAttrString(module, "ExternalLinkClose");

HTMLEntityStart = PyObject_GetAttrString(module, "HTMLEntityStart");
HTMLEntityNumeric = PyObject_GetAttrString(module, "HTMLEntityNumeric");
HTMLEntityHex = PyObject_GetAttrString(module, "HTMLEntityHex");
HTMLEntityEnd = PyObject_GetAttrString(module, "HTMLEntityEnd");

HeadingStart = PyObject_GetAttrString(module, "HeadingStart");
HeadingEnd = PyObject_GetAttrString(module, "HeadingEnd");

CommentStart = PyObject_GetAttrString(module, "CommentStart");
CommentEnd = PyObject_GetAttrString(module, "CommentEnd");

TagOpenOpen = PyObject_GetAttrString(module, "TagOpenOpen");
TagAttrStart = PyObject_GetAttrString(module, "TagAttrStart");
TagAttrEquals = PyObject_GetAttrString(module, "TagAttrEquals");
TagAttrQuote = PyObject_GetAttrString(module, "TagAttrQuote");
TagCloseOpen = PyObject_GetAttrString(module, "TagCloseOpen");
TagCloseSelfclose = PyObject_GetAttrString(module, "TagCloseSelfclose");
TagOpenClose = PyObject_GetAttrString(module, "TagOpenClose");
TagCloseClose = PyObject_GetAttrString(module, "TagCloseClose");
}

+ 69
- 0
mwparserfromhell/parser/ctokenizer/tokens.h View File

@@ -0,0 +1,69 @@
/*
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#pragma once

#include "common.h"

/* Token globals */

extern PyObject* Text;

extern PyObject* TemplateOpen;
extern PyObject* TemplateParamSeparator;
extern PyObject* TemplateParamEquals;
extern PyObject* TemplateClose;

extern PyObject* ArgumentOpen;
extern PyObject* ArgumentSeparator;
extern PyObject* ArgumentClose;

extern PyObject* WikilinkOpen;
extern PyObject* WikilinkSeparator;
extern PyObject* WikilinkClose;

extern PyObject* ExternalLinkOpen;
extern PyObject* ExternalLinkSeparator;
extern PyObject* ExternalLinkClose;

extern PyObject* HTMLEntityStart;
extern PyObject* HTMLEntityNumeric;
extern PyObject* HTMLEntityHex;
extern PyObject* HTMLEntityEnd;
extern PyObject* HeadingStart;
extern PyObject* HeadingEnd;

extern PyObject* CommentStart;
extern PyObject* CommentEnd;

extern PyObject* TagOpenOpen;
extern PyObject* TagAttrStart;
extern PyObject* TagAttrEquals;
extern PyObject* TagAttrQuote;
extern PyObject* TagCloseOpen;
extern PyObject* TagCloseSelfclose;
extern PyObject* TagOpenClose;
extern PyObject* TagCloseClose;

/* Functions */

void load_tokens_from_module(PyObject*);

+ 0
- 366
mwparserfromhell/parser/tokenizer.h View File

@@ -1,366 +0,0 @@
/*
Tokenizer Header File for MWParserFromHell
Copyright (C) 2012-2015 Ben Kurtovic <ben.kurtovic@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

#ifndef PY_SSIZE_T_CLEAN
#define PY_SSIZE_T_CLEAN
#endif

#include <Python.h>
#include <math.h>
#include <structmember.h>
#include <bytesobject.h>
#include <stdint.h>

#if PY_MAJOR_VERSION >= 3
#define IS_PY3K
#endif

#define malloc PyObject_Malloc
#define free PyObject_Free

#define DIGITS "0123456789"
#define HEXDIGITS "0123456789abcdefABCDEF"
#define ALPHANUM "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

static const char MARKERS[] = {
'{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':', '/',
'-', '!', '\n', '\0'};

#define NUM_MARKERS 19
#define TEXTBUFFER_BLOCKSIZE 1024
#define MAX_DEPTH 40
#define MAX_CYCLES 100000
#define MAX_BRACES 255
#define MAX_ENTITY_SIZE 8

static int route_state = 0;
static uint64_t route_context = 0;
#define BAD_ROUTE route_state
#define BAD_ROUTE_CONTEXT route_context
#define FAIL_ROUTE(context) route_state = 1; route_context = context
#define RESET_ROUTE() route_state = 0

static char** entitydefs;

static PyObject* EMPTY;
static PyObject* NOARGS;
static PyObject* ParserError;
static PyObject* definitions;


/* Tokens: */

static PyObject* Text;

static PyObject* TemplateOpen;
static PyObject* TemplateParamSeparator;
static PyObject* TemplateParamEquals;
static PyObject* TemplateClose;

static PyObject* ArgumentOpen;
static PyObject* ArgumentSeparator;
static PyObject* ArgumentClose;

static PyObject* WikilinkOpen;
static PyObject* WikilinkSeparator;
static PyObject* WikilinkClose;

static PyObject* ExternalLinkOpen;
static PyObject* ExternalLinkSeparator;
static PyObject* ExternalLinkClose;

static PyObject* HTMLEntityStart;
static PyObject* HTMLEntityNumeric;
static PyObject* HTMLEntityHex;
static PyObject* HTMLEntityEnd;
static PyObject* HeadingStart;
static PyObject* HeadingEnd;

static PyObject* CommentStart;
static PyObject* CommentEnd;

static PyObject* TagOpenOpen;
static PyObject* TagAttrStart;
static PyObject* TagAttrEquals;
static PyObject* TagAttrQuote;
static PyObject* TagCloseOpen;
static PyObject* TagCloseSelfclose;
static PyObject* TagOpenClose;
static PyObject* TagCloseClose;


/* Local contexts: */

#define LC_TEMPLATE 0x0000000000000007
#define LC_TEMPLATE_NAME 0x0000000000000001
#define LC_TEMPLATE_PARAM_KEY 0x0000000000000002
#define LC_TEMPLATE_PARAM_VALUE 0x0000000000000004

#define LC_ARGUMENT 0x0000000000000018
#define LC_ARGUMENT_NAME 0x0000000000000008
#define LC_ARGUMENT_DEFAULT 0x0000000000000010

#define LC_WIKILINK 0x0000000000000060
#define LC_WIKILINK_TITLE 0x0000000000000020
#define LC_WIKILINK_TEXT 0x0000000000000040

#define LC_EXT_LINK 0x0000000000000180
#define LC_EXT_LINK_URI 0x0000000000000080
#define LC_EXT_LINK_TITLE 0x0000000000000100

#define LC_HEADING 0x0000000000007E00
#define LC_HEADING_LEVEL_1 0x0000000000000200
#define LC_HEADING_LEVEL_2 0x0000000000000400
#define LC_HEADING_LEVEL_3 0x0000000000000800
#define LC_HEADING_LEVEL_4 0x0000000000001000
#define LC_HEADING_LEVEL_5 0x0000000000002000
#define LC_HEADING_LEVEL_6 0x0000000000004000

#define LC_TAG 0x0000000000078000
#define LC_TAG_OPEN 0x0000000000008000
#define LC_TAG_ATTR 0x0000000000010000
#define LC_TAG_BODY 0x0000000000020000
#define LC_TAG_CLOSE 0x0000000000040000

#define LC_STYLE 0x0000000000780000
#define LC_STYLE_ITALICS 0x0000000000080000
#define LC_STYLE_BOLD 0x0000000000100000
#define LC_STYLE_PASS_AGAIN 0x0000000000200000
#define LC_STYLE_SECOND_PASS 0x0000000000400000

#define LC_DLTERM 0x0000000000800000

#define LC_SAFETY_CHECK 0x000000003F000000
#define LC_HAS_TEXT 0x0000000001000000
#define LC_FAIL_ON_TEXT 0x0000000002000000
#define LC_FAIL_NEXT 0x0000000004000000
#define LC_FAIL_ON_LBRACE 0x0000000008000000
#define LC_FAIL_ON_RBRACE 0x0000000010000000
#define LC_FAIL_ON_EQUALS 0x0000000020000000

#define LC_TABLE 0x0000000FC0000000
#define LC_TABLE_CELL_LINE_CONTEXTS 0x0000000D00000000
#define LC_TABLE_OPEN 0x0000000040000000
#define LC_TABLE_CELL_OPEN 0x0000000080000000
#define LC_TABLE_CELL_STYLE 0x0000000100000000
#define LC_TABLE_ROW_OPEN 0x0000000200000000
#define LC_TABLE_TD_LINE 0x0000000400000000
#define LC_TABLE_TH_LINE 0x0000000800000000

/* Global contexts: */

#define GL_HEADING 0x1

/* Aggregate contexts: */

#define AGG_FAIL (LC_TEMPLATE | LC_ARGUMENT | LC_WIKILINK | LC_EXT_LINK_TITLE | LC_HEADING | LC_TAG | LC_STYLE | LC_TABLE_OPEN)
#define AGG_UNSAFE (LC_TEMPLATE_NAME | LC_WIKILINK_TITLE | LC_EXT_LINK_TITLE | LC_TEMPLATE_PARAM_KEY | LC_ARGUMENT_NAME)
#define AGG_DOUBLE (LC_TEMPLATE_PARAM_KEY | LC_TAG_CLOSE | LC_TABLE_ROW_OPEN)
#define AGG_NO_WIKILINKS (LC_TEMPLATE_NAME | LC_ARGUMENT_NAME | LC_WIKILINK_TITLE | LC_EXT_LINK_URI)
#define AGG_NO_EXT_LINKS (LC_TEMPLATE_NAME | LC_ARGUMENT_NAME | LC_WIKILINK_TITLE | LC_EXT_LINK)

/* Tag contexts: */

#define TAG_NAME 0x01
#define TAG_ATTR_READY 0x02
#define TAG_ATTR_NAME 0x04
#define TAG_ATTR_VALUE 0x08
#define TAG_QUOTED 0x10
#define TAG_NOTE_SPACE 0x20
#define TAG_NOTE_EQUALS 0x40
#define TAG_NOTE_QUOTE 0x80


/* Miscellaneous structs: */

struct Textbuffer {
Py_ssize_t size;
Py_UNICODE* data;
struct Textbuffer* prev;
struct Textbuffer* next;
};

struct Stack {
PyObject* stack;
uint64_t context;
struct Textbuffer* textbuffer;
struct Stack* next;
};

typedef struct {
PyObject* title;
int level;
} HeadingData;

typedef struct {
uint64_t context;
struct Textbuffer* pad_first;
struct Textbuffer* pad_before_eq;
struct Textbuffer* pad_after_eq;
Py_UNICODE quoter;
Py_ssize_t reset;
} TagData;

typedef struct Textbuffer Textbuffer;
typedef struct Stack Stack;


/* Tokenizer object definition: */

typedef struct {
PyObject_HEAD
PyObject* text; /* text to tokenize */
Stack* topstack; /* topmost stack */
Py_ssize_t head; /* current position in text */
Py_ssize_t length; /* length of text */
int global; /* global context */
int depth; /* stack recursion depth */
int cycles; /* total number of stack recursions */
int skip_style_tags; /* temporary fix for the sometimes broken tag parser */
} Tokenizer;


/* Macros related to Tokenizer functions: */

#define Tokenizer_READ(self, delta) (*PyUnicode_AS_UNICODE(Tokenizer_read(self, delta)))
#define Tokenizer_READ_BACKWARDS(self, delta) \
(*PyUnicode_AS_UNICODE(Tokenizer_read_backwards(self, delta)))
#define Tokenizer_CAN_RECURSE(self) (self->depth < MAX_DEPTH && self->cycles < MAX_CYCLES)

#define Tokenizer_emit(self, token) Tokenizer_emit_token(self, token, 0)
#define Tokenizer_emit_first(self, token) Tokenizer_emit_token(self, token, 1)
#define Tokenizer_emit_kwargs(self, token, kwargs) Tokenizer_emit_token_kwargs(self, token, kwargs, 0)
#define Tokenizer_emit_first_kwargs(self, token, kwargs) Tokenizer_emit_token_kwargs(self, token, kwargs, 1)


/* Macros for accessing definitions: */

#define GET_HTML_TAG(markup) (markup == ':' ? "dd" : markup == ';' ? "dt" : "li")
#define IS_PARSABLE(tag) (call_def_func("is_parsable", tag, NULL, NULL))
#define IS_SINGLE(tag) (call_def_func("is_single", tag, NULL, NULL))
#define IS_SINGLE_ONLY(tag) (call_def_func("is_single_only", tag, NULL, NULL))
#define IS_SCHEME(scheme, slashes, reverse) \
(call_def_func("is_scheme", scheme, slashes ? Py_True : Py_False, reverse ? Py_True : Py_False))


/* Function prototypes: */

static Textbuffer* Textbuffer_new(void);
static void Textbuffer_dealloc(Textbuffer*);

static TagData* TagData_new(void);
static void TagData_dealloc(TagData*);

static PyObject* Tokenizer_new(PyTypeObject*, PyObject*, PyObject*);
static void Tokenizer_dealloc(Tokenizer*);
static int Tokenizer_init(Tokenizer*, PyObject*, PyObject*);
static int Tokenizer_parse_entity(Tokenizer*);
static int Tokenizer_parse_comment(Tokenizer*);
static int Tokenizer_handle_dl_term(Tokenizer*);
static int Tokenizer_parse_tag(Tokenizer*);
static PyObject* Tokenizer_parse(Tokenizer*, uint64_t, int);
static PyObject* Tokenizer_tokenize(Tokenizer*, PyObject*);

static int load_exceptions(void);


/* Macros for Python 2/3 compatibility: */

#ifdef IS_PY3K
#define NEW_INT_FUNC PyLong_FromSsize_t
#define IMPORT_NAME_FUNC PyUnicode_FromString
#define CREATE_MODULE PyModule_Create(&module_def);
#define ENTITYDEFS_MODULE "html.entities"
#define INIT_FUNC_NAME PyInit__tokenizer
#define INIT_ERROR return NULL
#else
#define NEW_INT_FUNC PyInt_FromSsize_t
#define IMPORT_NAME_FUNC PyBytes_FromString
#define CREATE_MODULE Py_InitModule("_tokenizer", NULL);
#define ENTITYDEFS_MODULE "htmlentitydefs"
#define INIT_FUNC_NAME init_tokenizer
#define INIT_ERROR return
#endif


/* More structs for creating the Tokenizer type: */

static PyMethodDef Tokenizer_methods[] = {
{"tokenize", (PyCFunction) Tokenizer_tokenize, METH_VARARGS,
"Build a list of tokens from a string of wikicode and return it."},
{NULL}
};

static PyMemberDef Tokenizer_members[] = {
{NULL}
};

static PyTypeObject TokenizerType = {
PyVarObject_HEAD_INIT(NULL, 0)
"_tokenizer.CTokenizer", /* tp_name */
sizeof(Tokenizer), /* tp_basicsize */
0, /* tp_itemsize */
(destructor) Tokenizer_dealloc, /* tp_dealloc */
0, /* tp_print */
0, /* tp_getattr */
0, /* tp_setattr */
0, /* tp_compare */
0, /* tp_repr */
0, /* tp_as_number */
0, /* tp_as_sequence */
0, /* tp_as_mapping */
0, /* tp_hash */
0, /* tp_call */
0, /* tp_str */
0, /* tp_getattro */
0, /* tp_setattro */
0, /* tp_as_buffer */
Py_TPFLAGS_DEFAULT, /* tp_flags */
"Creates a list of tokens from a string of wikicode.", /* tp_doc */
0, /* tp_traverse */
0, /* tp_clear */
0, /* tp_richcompare */
0, /* tp_weaklistoffset */
0, /* tp_iter */
0, /* tp_iternext */
Tokenizer_methods, /* tp_methods */
Tokenizer_members, /* tp_members */
0, /* tp_getset */
0, /* tp_base */
0, /* tp_dict */
0, /* tp_descr_get */
0, /* tp_descr_set */
0, /* tp_dictoffset */
(initproc) Tokenizer_init, /* tp_init */
0, /* tp_alloc */
Tokenizer_new, /* tp_new */
};

#ifdef IS_PY3K
static PyModuleDef module_def = {
PyModuleDef_HEAD_INIT,
"_tokenizer",
"Creates a list of tokens from a string of wikicode.",
-1, NULL, NULL, NULL, NULL, NULL
};
#endif

+ 29
- 15
mwparserfromhell/parser/tokenizer.py View File

@@ -192,11 +192,14 @@ class Tokenizer(object):
self._fail_route()
return self.END

def _parse_template(self):
def _parse_template(self, has_content):
"""Parse a template at the head of the wikicode string."""
reset = self._head
context = contexts.TEMPLATE_NAME
if has_content:
context |= contexts.HAS_TEMPLATE
try:
template = self._parse(contexts.TEMPLATE_NAME)
template = self._parse(context)
except BadRoute:
self._head = reset
raise
@@ -223,6 +226,7 @@ class Tokenizer(object):
while self._read() == "{":
self._head += 1
braces += 1
has_content = False
self._push()

while braces:
@@ -230,7 +234,7 @@ class Tokenizer(object):
return self._emit_text_then_stack("{")
if braces == 2:
try:
self._parse_template()
self._parse_template(has_content)
except BadRoute:
return self._emit_text_then_stack("{{")
break
@@ -239,11 +243,12 @@ class Tokenizer(object):
braces -= 3
except BadRoute:
try:
self._parse_template()
self._parse_template(has_content)
braces -= 2
except BadRoute:
return self._emit_text_then_stack("{" * braces)
if braces:
has_content = True
self._head += 1

self._emit_all(self._pop())
@@ -253,6 +258,8 @@ class Tokenizer(object):
def _handle_template_param(self):
"""Handle a template parameter at the head of the string."""
if self._context & contexts.TEMPLATE_NAME:
if not self._context & (contexts.HAS_TEXT | contexts.HAS_TEMPLATE):
self._fail_route()
self._context ^= contexts.TEMPLATE_NAME
elif self._context & contexts.TEMPLATE_PARAM_VALUE:
self._context ^= contexts.TEMPLATE_PARAM_VALUE
@@ -271,7 +278,10 @@ class Tokenizer(object):

def _handle_template_end(self):
"""Handle the end of a template at the head of the string."""
if self._context & contexts.TEMPLATE_PARAM_KEY:
if self._context & contexts.TEMPLATE_NAME:
if not self._context & (contexts.HAS_TEXT | contexts.HAS_TEMPLATE):
self._fail_route()
elif self._context & contexts.TEMPLATE_PARAM_KEY:
self._emit_all(self._pop(keep_context=True))
self._head += 1
return self._pop()
@@ -610,7 +620,7 @@ class Tokenizer(object):
self._head += 2
if self._context & contexts.FAIL_NEXT:
# _verify_safe() sets this flag while parsing a template
# name when it encounters what might be a comment -- we
# or link when it encounters what might be a comment -- we
# must unset it to let _verify_safe() know it was correct:
self._context ^= contexts.FAIL_NEXT
return
@@ -1172,29 +1182,33 @@ class Tokenizer(object):
if context & contexts.WIKILINK_TITLE:
if this == "]" or this == "{":
self._context |= contexts.FAIL_NEXT
elif this == "\n" or this == "[" or this == "}":
elif this == "\n" or this == "[" or this == "}" or this == ">":
return False
elif this == "<":
if self._read(1) == "!":
self._context |= contexts.FAIL_NEXT
else:
return False
return True
elif context & contexts.EXT_LINK_TITLE:
return this != "\n"
elif context & contexts.TEMPLATE_NAME:
if this == "{" or this == "}" or this == "[":
if this == "{":
self._context |= contexts.HAS_TEMPLATE | contexts.FAIL_NEXT
return True
if this == "}" or (this == "<" and self._read(1) == "!"):
self._context |= contexts.FAIL_NEXT
return True
if this == "]":
if this == "[" or this == "]" or this == "<" or this == ">":
return False
if this == "|":
return True
if context & contexts.HAS_TEXT:
if context & contexts.FAIL_ON_TEXT:
if this is self.END or not this.isspace():
if this == "<" and self._read(1) == "!":
self._context |= contexts.FAIL_NEXT
return True
return False
else:
if this == "\n":
self._context |= contexts.FAIL_ON_TEXT
elif this == "\n":
self._context |= contexts.FAIL_ON_TEXT
elif this is self.END or not this.isspace():
self._context |= contexts.HAS_TEXT
return True
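
Taken together, the new HAS_TEMPLATE context flag and the stricter _verify_safe() rules mean that templates with entirely blank names, or with '<' or '>' in the name, are now emitted as plain text (see the templates.mwtest and wikilinks.mwtest cases later in this diff). A rough sketch of the resulting behaviour through the public API, for illustration only:

# Illustrative sketch, not part of the diff; mirrors the new tokenizer test cases.
import mwparserfromhell

print(mwparserfromhell.parse("{{foo|bar}}").filter_templates())  # ['{{foo|bar}}']
print(mwparserfromhell.parse("{{}}").filter_templates())         # [] -- a blank name no longer parses as a template
print(mwparserfromhell.parse("{{foo<bar}}").filter_templates())  # [] -- '<' is now disallowed in template names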


+ 42
- 58
mwparserfromhell/smart_list.py View File

@@ -27,8 +27,10 @@ reflect changes made to the main list, and vice-versa.
"""

from __future__ import unicode_literals
from sys import maxsize
from weakref import ref

from .compat import maxsize, py3k
from .compat import py3k

__all__ = ["SmartList"]

@@ -45,16 +47,16 @@ def inheritdoc(method):
class _SliceNormalizerMixIn(object):
"""MixIn that provides a private method to normalize slices."""

def _normalize_slice(self, key):
def _normalize_slice(self, key, clamp=False):
"""Return a slice equivalent to the input *key*, standardized."""
if key.start is not None:
if key.start is None:
start = 0
else:
start = (len(self) + key.start) if key.start < 0 else key.start
if key.stop is None or key.stop == maxsize:
stop = len(self) if clamp else None
else:
start = 0
if key.stop is not None:
stop = (len(self) + key.stop) if key.stop < 0 else key.stop
else:
stop = maxsize
return slice(start, stop, key.step or 1)


@@ -80,13 +82,6 @@ class SmartList(_SliceNormalizerMixIn, list):
[2, 3, 4]
>>> parent
[0, 1, 2, 3, 4]

The parent needs to keep a list of its children in order to update them,
which prevents them from being garbage-collected. If you are keeping the
parent around for a while but creating many children, it is advisable to
call :meth:`._ListProxy.detach` when you're finished with them. Certain
parent methods, like :meth:`reverse` and :meth:`sort`, will do this
automatically.
"""

def __init__(self, iterable=None):
@@ -99,10 +94,11 @@ class SmartList(_SliceNormalizerMixIn, list):
def __getitem__(self, key):
if not isinstance(key, slice):
return super(SmartList, self).__getitem__(key)
key = self._normalize_slice(key)
key = self._normalize_slice(key, clamp=False)
sliceinfo = [key.start, key.stop, key.step]
child = _ListProxy(self, sliceinfo)
self._children[id(child)] = (child, sliceinfo)
child_ref = ref(child, self._delete_child)
self._children[id(child_ref)] = (child_ref, sliceinfo)
return child

def __setitem__(self, key, item):
@@ -110,20 +106,21 @@ class SmartList(_SliceNormalizerMixIn, list):
return super(SmartList, self).__setitem__(key, item)
item = list(item)
super(SmartList, self).__setitem__(key, item)
key = self._normalize_slice(key)
key = self._normalize_slice(key, clamp=True)
diff = len(item) + (key.start - key.stop) // key.step
if not diff:
return
values = self._children.values if py3k else self._children.itervalues
if diff:
for child, (start, stop, step) in values():
if start > key.stop:
self._children[id(child)][1][0] += diff
if stop >= key.stop and stop != maxsize:
self._children[id(child)][1][1] += diff
for child, (start, stop, step) in values():
if start > key.stop:
self._children[id(child)][1][0] += diff
if stop is not None and stop >= key.stop:
self._children[id(child)][1][1] += diff

def __delitem__(self, key):
super(SmartList, self).__delitem__(key)
if isinstance(key, slice):
key = self._normalize_slice(key)
key = self._normalize_slice(key, clamp=True)
else:
key = slice(key, key + 1, 1)
diff = (key.stop - key.start) // key.step
@@ -131,7 +128,7 @@ class SmartList(_SliceNormalizerMixIn, list):
for child, (start, stop, step) in values():
if start > key.start:
self._children[id(child)][1][0] -= diff
if stop >= key.stop and stop != maxsize:
if stop is not None and stop >= key.stop:
self._children[id(child)][1][1] -= diff

if not py3k:
@@ -154,10 +151,16 @@ class SmartList(_SliceNormalizerMixIn, list):
self.extend(other)
return self

def _delete_child(self, child_ref):
"""Remove a child reference that is about to be garbage-collected."""
del self._children[id(child_ref)]

def _detach_children(self):
"""Remove all children and give them independent parent copies."""
children = [val[0] for val in self._children.values()]
for child in children:
child.detach()
child()._parent = list(self)
self._children.clear()

@inheritdoc
def append(self, item):
@@ -226,7 +229,6 @@ class _ListProxy(_SliceNormalizerMixIn, list):
super(_ListProxy, self).__init__()
self._parent = parent
self._sliceinfo = sliceinfo
self._detached = False

def __repr__(self):
return repr(self._render())
@@ -273,24 +275,20 @@ class _ListProxy(_SliceNormalizerMixIn, list):

def __getitem__(self, key):
if isinstance(key, slice):
key = self._normalize_slice(key)
if key.stop == maxsize:
keystop = self._stop
else:
keystop = key.stop + self._start
adjusted = slice(key.start + self._start, keystop, key.step)
key = self._normalize_slice(key, clamp=True)
keystart = min(self._start + key.start, self._stop)
keystop = min(self._start + key.stop, self._stop)
adjusted = slice(keystart, keystop, key.step)
return self._parent[adjusted]
else:
return self._render()[key]

def __setitem__(self, key, item):
if isinstance(key, slice):
key = self._normalize_slice(key)
if key.stop == maxsize:
keystop = self._stop
else:
keystop = key.stop + self._start
adjusted = slice(key.start + self._start, keystop, key.step)
key = self._normalize_slice(key, clamp=True)
keystart = min(self._start + key.start, self._stop)
keystop = min(self._start + key.stop, self._stop)
adjusted = slice(keystart, keystop, key.step)
self._parent[adjusted] = item
else:
length = len(self)
@@ -302,12 +300,10 @@ class _ListProxy(_SliceNormalizerMixIn, list):

def __delitem__(self, key):
if isinstance(key, slice):
key = self._normalize_slice(key)
if key.stop == maxsize:
keystop = self._stop
else:
keystop = key.stop + self._start
adjusted = slice(key.start + self._start, keystop, key.step)
key = self._normalize_slice(key, clamp=True)
keystart = min(self._start + key.start, self._stop)
keystop = min(self._start + key.stop, self._stop)
adjusted = slice(keystart, keystop, key.step)
del self._parent[adjusted]
else:
length = len(self)
@@ -370,7 +366,7 @@ class _ListProxy(_SliceNormalizerMixIn, list):
@property
def _stop(self):
"""The ending index of this list, exclusive."""
if self._sliceinfo[1] == maxsize:
if self._sliceinfo[1] is None:
return len(self._parent)
return self._sliceinfo[1]

@@ -456,17 +452,5 @@ class _ListProxy(_SliceNormalizerMixIn, list):
item.sort(**kwargs)
self._parent[self._start:self._stop:self._step] = item

def detach(self):
"""Detach the child so it operates like a normal list.

This allows children to be properly garbage-collected if their parent
is being kept around for a long time. This method has no effect if the
child is already detached.
"""
if not self._detached:
self._parent._children.pop(id(self))
self._parent = list(self._parent)
self._detached = True


del inheritdoc
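
With detach() removed, child slices register themselves with the parent through weak references and are cleaned up by the _delete_child() callback when they are garbage-collected; reverse() and sort() still call _detach_children(), which now simply hands each surviving child an independent copy of the parent. A rough interactive sketch of the new lifecycle (illustration only, following the docstring and the updated tests):

# Sketch only, not part of the diff.
from mwparserfromhell.smart_list import SmartList

parent = SmartList([0, 1, 2, 3, 4])
child = parent[2:]              # tracked in parent._children via a weak reference
parent.append(5)
print(child)                    # [2, 3, 4, 5] -- still mirrors the parent
del child                       # the weakref callback removes the stale entry
print(len(parent._children))    # 0 -- no explicit detach() needed anymore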

+ 3
- 0
scripts/README View File

@@ -0,0 +1,3 @@
This directory contains support files used for *developing* mwparserfromhell,
not running it. If you are looking for code examples, read the documentation
or explore the source code.

+ 16
- 16
scripts/release.sh View File

@@ -31,10 +31,17 @@ update_version() {
echo " done."
}

update_appveyor() {
filename="appveyor.yml"
echo -n "Updating $filename..."
sed -e "s/version: .*/version: $VERSION-b{build}/" -i "" $filename
echo " done."
}

update_changelog() {
filename="CHANGELOG"
echo -n "Updating $filename..."
sed -e '1s/.*/v'$VERSION' (released '$RELEASE_DATE'):/' -i "" $filename
sed -e "1s/.*/v$VERSION (released $RELEASE_DATE):/" -i "" $filename
echo " done."
}

@@ -45,10 +52,10 @@ update_docs_changelog() {
previous_lineno=$(expr $(grep -n -e "^---" $filename | sed '2q;d' | cut -d ':' -f 1) - 1)
previous_version=$(sed $previous_lineno'q;d' $filename)
sed \
-e '4s/.*/v'$VERSION \
-e '5s/.*/'$dashes \
-e '7s/.*/`Released '$RELEASE_DATE' <https:\/\/github.com\/earwig\/mwparserfromhell\/tree\/v'$VERSION'>`_/' \
-e '8s/.*/(`changes <https:\/\/github.com\/earwig\/mwparserfromhell\/compare\/v'$previous_version'...v'$VERSION'>`__):/' \
-e "4s/.*/v$VERSION/" \
-e "5s/.*/$dashes/" \
-e "7s/.*/\`Released $RELEASE_DATE <https:\/\/github.com\/earwig\/mwparserfromhell\/tree\/v$VERSION>\`_/" \
-e "8s/.*/(\`changes <https:\/\/github.com\/earwig\/mwparserfromhell\/compare\/$previous_version...v$VERSION>\`__):/" \
-i "" $filename
echo " done."
}
@@ -67,25 +74,18 @@ do_git_stuff() {
}

upload_to_pypi() {
# TODO: check whether these commands give output
echo -n "PyPI: uploading source tarball and docs..."
python setup.py register sdist upload -s
python setup.py upload_docs
python setup.py -q register sdist upload -s
python setup.py -q upload_docs
echo " done."
}

windows_build() {
echo "PyPI: building/uploading Windows binaries..."
echo "*** Run in Windows: ./scripts/win_build.py"
echo "*** Press enter when done."
read
}

post_release() {
echo
echo "*** Release completed."
echo "*** Update: https://github.com/earwig/mwparserfromhell/releases/tag/v$VERSION"
echo "*** Verify: https://pypi.python.org/pypi/mwparserfromhell"
echo "*** Verify: https://ci.appveyor.com/project/earwig/mwparserfromhell"
echo "*** Verify: https://mwparserfromhell.readthedocs.org"
echo "*** Press enter to sanity-check the release."
read
@@ -153,11 +153,11 @@ cd "$SCRIPT_DIR/.."

check_git
update_version
update_appveyor
update_changelog
update_docs_changelog
do_git_stuff
upload_to_pypi
windows_build
post_release
test_release



+ 0
- 58
scripts/win_build.py View File

@@ -1,58 +0,0 @@
# Build requirements:
#
# Python 2.6-3.2: Visual C++ Express Edition 2008:
# http://go.microsoft.com/?linkid=7729279
#
# Python 3.3+: Visual C++ Express Edition 2010:
# http://go.microsoft.com/?linkid=9709949
#
# x64 builds: Microsoft Windows SDK for Windows 7 and .NET Framework 3.5 SP1:
# http://www.microsoft.com/en-us/download/details.aspx?id=3138
#
# Python interpreter, 2.6, 2.7, 3.2-3.4:
# https://www.python.org/downloads/
#
# Pip, setuptools, wheel:
# https://bootstrap.pypa.io/get-pip.py
# and run *for each* Python version:
# c:\pythonXX\python get-pip.py
# c:\pythonXX\scripts\pip install wheel
#
# Afterwards, run this script with any of the python interpreters (2.7 suggested)

from __future__ import print_function
import os
from subprocess import call, STDOUT

ENVIRONMENTS = ["26", "27", "32", "33", "34"]

def run(pyver, cmds):
cmd = [r"C:\Python%s\Python.exe" % pyver, "setup.py"] + cmds
print(" ".join(cmd), end=" ")

with open("%s%s.log" % (cmds[0], pyver), "w") as logfile:
retval = call(cmd, stdout=logfile, stderr=STDOUT, cwd="..")
if not retval:
print("[OK]")
else:
print("[FAILED (%i)]" % retval)
return retval

def main():
path = os.path.split(__file__)[0]
if path:
os.chdir(path)

print("Building Windows wheels for Python %s:" % ", ".join(ENVIRONMENTS))
for pyver in ENVIRONMENTS:
print()
try:
os.unlink("mwparserfromhell/parser/_tokenizer.pyd")
except OSError:
pass

if run(pyver, ["test"]) == 0:
run(pyver, ["bdist_wheel", "upload"]) # TODO: add "-s" to GPG sign

if __name__ == "__main__":
main()

+ 43
- 0
scripts/win_wrapper.cmd View File

@@ -0,0 +1,43 @@
:: To build extensions for 64 bit Python 3, we need to configure environment
:: variables to use the MSVC 2010 C++ compilers from GRMSDKX_EN_DVD.iso of:
:: MS Windows SDK for Windows 7 and .NET Framework 4 (SDK v7.1)
::
:: To build extensions for 64 bit Python 2, we need to configure environment
:: variables to use the MSVC 2008 C++ compilers from GRMSDKX_EN_DVD.iso of:
:: MS Windows SDK for Windows 7 and .NET Framework 3.5 (SDK v7.0)
::
:: 32 bit builds do not require specific environment configurations.
::
:: Note: this script needs to be run with the /E:ON and /V:ON flags for the
:: cmd interpreter, at least for (SDK v7.0)
::
:: More details at:
:: https://github.com/cython/cython/wiki/64BitCythonExtensionsOnWindows
:: http://stackoverflow.com/a/13751649/163740
::
:: Author: Olivier Grisel
:: License: CC0 1.0 Universal: http://creativecommons.org/publicdomain/zero/1.0/
@ECHO OFF

SET COMMAND_TO_RUN=%*
SET WIN_SDK_ROOT=C:\Program Files\Microsoft SDKs\Windows

SET MAJOR_PYTHON_VERSION="%PYTHON_VERSION:~0,1%"
IF %MAJOR_PYTHON_VERSION% == "2" (
SET WINDOWS_SDK_VERSION="v7.0"
) ELSE IF %MAJOR_PYTHON_VERSION% == "3" (
SET WINDOWS_SDK_VERSION="v7.1"
) ELSE (
ECHO Unsupported Python version: "%MAJOR_PYTHON_VERSION%"
EXIT 1
)

IF "%PYTHON_ARCH%"=="64" (
SET DISTUTILS_USE_SDK=1
SET MSSdk=1
"%WIN_SDK_ROOT%\%WINDOWS_SDK_VERSION%\Setup\WindowsSdkVer.exe" -q -version:%WINDOWS_SDK_VERSION%
"%WIN_SDK_ROOT%\%WINDOWS_SDK_VERSION%\Bin\SetEnv.cmd" /x64 /release
call %COMMAND_TO_RUN% || EXIT 1
) ELSE (
call %COMMAND_TO_RUN% || EXIT 1
)

+ 40
- 60
setup.py View File

@@ -21,88 +21,67 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

import os
from __future__ import print_function
from distutils.errors import DistutilsError, CCompilerError
from glob import glob
from os import environ
import sys

if (sys.version_info[0] == 2 and sys.version_info[1] < 6) or \
(sys.version_info[1] == 3 and sys.version_info[1] < 2):
raise Exception("mwparserfromhell needs Python 2.6+ or 3.2+")

if sys.version_info >= (3, 0):
basestring = (str, )
if ((sys.version_info[0] == 2 and sys.version_info[1] < 6) or
(sys.version_info[0] == 3 and sys.version_info[1] < 2)):
raise RuntimeError("mwparserfromhell needs Python 2.6+ or 3.2+")

from setuptools import setup, find_packages, Extension
from setuptools.command.build_ext import build_ext

from mwparserfromhell import __version__
from mwparserfromhell.compat import py26, py3k

with open("README.rst", **{'encoding':'utf-8'} if py3k else {}) as fp:
with open("README.rst", **({'encoding':'utf-8'} if py3k else {})) as fp:
long_docs = fp.read()

tokenizer = Extension("mwparserfromhell.parser._tokenizer",
sources=["mwparserfromhell/parser/tokenizer.c"],
depends=["mwparserfromhell/parser/tokenizer.h"])
use_extension = True
fallback = True

use_extension=True
# Allow env var WITHOUT_EXTENSION and args --with[out]-extension:

# Allow env var WITHOUT_EXTENSION and args --with[out]-extension
if '--without-extension' in sys.argv:
use_extension = False
elif '--with-extension' in sys.argv:
pass
elif os.environ.get('WITHOUT_EXTENSION', '0') == '1':
env_var = environ.get("WITHOUT_EXTENSION")
if "--without-extension" in sys.argv:
use_extension = False
elif "--with-extension" in sys.argv:
fallback = False
elif env_var is not None:
if env_var == "1":
use_extension = False
elif env_var == "0":
fallback = False

# Remove the command line argument as it isnt understood by
# setuptools/distutils
sys.argv = [arg for arg in sys.argv
if not arg.startswith('--with')
and not arg.endswith('-extension')]

# Remove the command line argument as it isn't understood by setuptools:

def optional_compile_setup(func=setup, use_ext=use_extension,
*args, **kwargs):
"""
Wrap setup to allow optional compilation of extensions.

Falls back to pure python mode (no extensions)
if compilation of extensions fails.
"""
extensions = kwargs.get('ext_modules', None)

if use_ext and extensions:
try:
func(*args, **kwargs)
return
except SystemExit as e:
assert(e.args)
if e.args[0] is False:
raise
elif isinstance(e.args[0], basestring):
if e.args[0].startswith('usage: '):
raise
else:
# Fallback to pure python mode
print('setup with extension failed: %s' % repr(e))
pass
except Exception as e:
print('setup with extension failed: %s' % repr(e))
sys.argv = [arg for arg in sys.argv
if arg != "--without-extension" and arg != "--with-extension"]

if extensions:
if use_ext:
print('Falling back to pure python mode.')
else:
print('Using pure python mode.')
def build_ext_patched(self):
try:
build_ext_original(self)
except (DistutilsError, CCompilerError) as exc:
print("error: " + str(exc))
print("Falling back to pure Python mode.")
del self.extensions[:]

del kwargs['ext_modules']
if fallback:
build_ext.run, build_ext_original = build_ext_patched, build_ext.run

func(*args, **kwargs)
# Project-specific part begins here:

tokenizer = Extension("mwparserfromhell.parser._tokenizer",
sources=glob("mwparserfromhell/parser/ctokenizer/*.c"),
depends=glob("mwparserfromhell/parser/ctokenizer/*.h"))

optional_compile_setup(
setup(
name = "mwparserfromhell",
packages = find_packages(exclude=("tests",)),
ext_modules = [tokenizer],
ext_modules = [tokenizer] if use_extension else [],
tests_require = ["unittest2"] if py26 else [],
test_suite = "tests.discover",
version = __version__,
@@ -126,6 +105,7 @@ optional_compile_setup(
"Programming Language :: Python :: 3.2",
"Programming Language :: Python :: 3.3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Topic :: Text Processing :: Markup"
],
)

+ 41
- 26
tests/_test_tokenizer.py View File

@@ -42,8 +42,8 @@ class TokenizerTestCase(object):
directory.
"""

@classmethod
def _build_test_method(cls, funcname, data):
@staticmethod
def _build_test_method(funcname, data):
"""Create and return a method to be treated as a test case method.

*data* is a dict containing multiple keys: the *input* text to be
@@ -58,13 +58,35 @@ class TokenizerTestCase(object):
expected = data["output"]
actual = self.tokenizer().tokenize(data["input"])
self.assertEqual(expected, actual)

if not py3k:
inner.__name__ = funcname.encode("utf8")
inner.__doc__ = data["label"]
return inner

@staticmethod
def _parse_test(test, data):
"""Parse an individual *test*, storing its info in *data*."""
for line in test.strip().splitlines():
if line.startswith("name:"):
data["name"] = line[len("name:"):].strip()
elif line.startswith("label:"):
data["label"] = line[len("label:"):].strip()
elif line.startswith("input:"):
raw = line[len("input:"):].strip()
if raw[0] == '"' and raw[-1] == '"':
raw = raw[1:-1]
raw = raw.encode("raw_unicode_escape")
data["input"] = raw.decode("unicode_escape")
elif line.startswith("output:"):
raw = line[len("output:"):].strip()
try:
data["output"] = eval(raw, vars(tokens))
except Exception as err:
raise _TestParseError(err)

@classmethod
def _load_tests(cls, filename, name, text):
def _load_tests(cls, filename, name, text, restrict=None):
"""Load all tests in *text* from the file *filename*."""
tests = text.split("\n---\n")
counter = 1
@@ -72,23 +94,7 @@ class TokenizerTestCase(object):
for test in tests:
data = {"name": None, "label": None, "input": None, "output": None}
try:
for line in test.strip().splitlines():
if line.startswith("name:"):
data["name"] = line[len("name:"):].strip()
elif line.startswith("label:"):
data["label"] = line[len("label:"):].strip()
elif line.startswith("input:"):
raw = line[len("input:"):].strip()
if raw[0] == '"' and raw[-1] == '"':
raw = raw[1:-1]
raw = raw.encode("raw_unicode_escape")
data["input"] = raw.decode("unicode_escape")
elif line.startswith("output:"):
raw = line[len("output:"):].strip()
try:
data["output"] = eval(raw, vars(tokens))
except Exception as err:
raise _TestParseError(err)
cls._parse_test(test, data)
except _TestParseError as err:
if data["name"]:
error = "Could not parse test '{0}' in '{1}':\n\t{2}"
@@ -97,6 +103,7 @@ class TokenizerTestCase(object):
error = "Could not parse a test in '{0}':\n\t{1}"
print(error.format(filename, err))
continue

if not data["name"]:
error = "A test in '{0}' was ignored because it lacked a name"
print(error.format(filename))
@@ -105,27 +112,35 @@ class TokenizerTestCase(object):
error = "Test '{0}' in '{1}' was ignored because it lacked an input or an output"
print(error.format(data["name"], filename))
continue

number = str(counter).zfill(digits)
counter += 1
if restrict and data["name"] != restrict:
continue

fname = "test_{0}{1}_{2}".format(name, number, data["name"])
meth = cls._build_test_method(fname, data)
setattr(cls, fname, meth)
counter += 1

@classmethod
def build(cls):
"""Load and install all tests from the 'tokenizer' directory."""
def load_file(filename):
def load_file(filename, restrict=None):
with codecs.open(filename, "rU", encoding="utf8") as fp:
text = fp.read()
name = path.split(filename)[1][:0-len(extension)]
cls._load_tests(filename, name, text)
name = path.split(filename)[1][:-len(extension)]
cls._load_tests(filename, name, text, restrict)

directory = path.join(path.dirname(__file__), "tokenizer")
extension = ".mwtest"
if len(sys.argv) > 2 and sys.argv[1] == "--use":
for name in sys.argv[2:]:
load_file(path.join(directory, name + extension))
sys.argv = [sys.argv[0]] # So unittest doesn't try to load these
if "." in name:
name, test = name.split(".", 1)
else:
test = None
load_file(path.join(directory, name + extension), test)
sys.argv = [sys.argv[0]] # So unittest doesn't try to parse this
cls.skip_others = True
else:
for filename in listdir(directory):


+ 2
- 2
tests/test_docs.py View File

@@ -115,8 +115,8 @@ class TestDocs(unittest.TestCase):
@unittest.skipIf("NOWEB" in os.environ, "web test disabled by environ var")
def test_readme_5(self):
"""test a block of example code in the README; includes a web call"""
url1 = "http://en.wikipedia.org/w/api.php"
url2 = "http://en.wikipedia.org/w/index.php?title={0}&action=raw"
url1 = "https://en.wikipedia.org/w/api.php"
url2 = "https://en.wikipedia.org/w/index.php?title={0}&action=raw"
title = "Test"
data = {"action": "query", "prop": "revisions", "rvlimit": 1,
"rvprop": "content", "format": "json", "titles": title}


+ 20
- 12
tests/test_smart_list.py View File

@@ -52,6 +52,7 @@ class TestSmartList(unittest.TestCase):
self.assertEqual([0, 1, 2], list1[:3])
self.assertEqual([0, 1, 2, 3, "one", "two"], list1[:])
self.assertEqual([3, "one", "two"], list1[3:])
self.assertEqual([3, "one", "two"], list1[3:100])
self.assertEqual(["one", "two"], list1[-2:])
self.assertEqual([0, 1], list1[:-4])
self.assertEqual([], list1[6:])
@@ -389,28 +390,35 @@ class TestSmartList(unittest.TestCase):
self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1)
self.assertEqual([4, 3, 2, 1.9, 1.8], child2)

child1.detach()
self.assertEqual([1, 4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], parent)
self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1)
child3 = parent[9:]
self.assertEqual([8, 8.1, 8.2], child3)

del parent[8:]
self.assertEqual([1, 4, 3, 2, 1.9, 1.8, 5, 6], parent)
self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6], child1)
self.assertEqual([4, 3, 2, 1.9, 1.8], child2)
self.assertEqual([], child3)

del child1
self.assertEqual([1, 4, 3, 2, 1.9, 1.8, 5, 6], parent)
self.assertEqual([4, 3, 2, 1.9, 1.8], child2)
self.assertEqual([], child3)
self.assertEqual(2, len(parent._children))

del child3
self.assertEqual([1, 4, 3, 2, 1.9, 1.8, 5, 6], parent)
self.assertEqual([4, 3, 2, 1.9, 1.8], child2)
self.assertEqual(1, len(parent._children))

parent.remove(1.9)
parent.remove(1.8)
self.assertEqual([1, 4, 3, 2, 5, 6, 7, 8, 8.1, 8.2], parent)
self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1)
self.assertEqual([1, 4, 3, 2, 5, 6], parent)
self.assertEqual([4, 3, 2], child2)

parent.reverse()
self.assertEqual([8.2, 8.1, 8, 7, 6, 5, 2, 3, 4, 1], parent)
self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1)
self.assertEqual([6, 5, 2, 3, 4, 1], parent)
self.assertEqual([4, 3, 2], child2)
self.assertEqual(0, len(parent._children))

child2.detach()
self.assertEqual([8.2, 8.1, 8, 7, 6, 5, 2, 3, 4, 1], parent)
self.assertEqual([4, 3, 2, 1.9, 1.8, 5, 6, 7, 8, 8.1, 8.2], child1)
self.assertEqual([4, 3, 2], child2)

if __name__ == "__main__":
unittest.main(verbosity=2)

+ 12
- 3
tests/test_template.py View File

@@ -213,6 +213,9 @@ class TestTemplate(TreeEqualityTestCase):
pgens("f", "g")])
node37 = Template(wraptext("a"), [pgenh("1", "")])
node38 = Template(wraptext("abc"))
node39 = Template(wraptext("a"), [pgenh("1", " b ")])
node40 = Template(wraptext("a"), [pgenh("1", " b"), pgenh("2", " c")])
node41 = Template(wraptext("a"), [pgens("1", " b"), pgens("2", " c")])

node1.add("e", "f", showkey=True)
node2.add(2, "g", showkey=False)
@@ -255,6 +258,9 @@ class TestTemplate(TreeEqualityTestCase):
node37.add(1, "b")
node38.add("1", "foo")
self.assertRaises(ValueError, node38.add, "z", "bar", showkey=False)
node39.add("1", "c")
node40.add("3", "d")
node41.add("3", "d")

self.assertEqual("{{a|b=c|d|e=f}}", node1)
self.assertEqual("{{a|b=c|d|g}}", node2)
@@ -299,6 +305,9 @@ class TestTemplate(TreeEqualityTestCase):
self.assertEqual("{{a|b=c|d=h|f=g}}", node36)
self.assertEqual("{{a|b}}", node37)
self.assertEqual("{{abc|foo}}", node38)
self.assertEqual("{{a|c}}", node39)
self.assertEqual("{{a| b| c|d}}", node40)
self.assertEqual("{{a|1= b|2= c|3= d}}", node41)

def test_remove(self):
"""test Template.remove()"""
@@ -395,13 +404,13 @@ class TestTemplate(TreeEqualityTestCase):
self.assertRaises(ValueError, node2.remove, "1")
self.assertEqual("{{foo}}", node2)
self.assertEqual("{{foo||abc=}}", node3)
self.assertEqual("{{foo||baz}}", node4)
self.assertEqual("{{foo|2=baz}}", node4)
self.assertEqual("{{foo|b=c}}", node5)
self.assertEqual("{{foo| a=|b=c}}", node6)
self.assertEqual("{{foo|1 =|2=c}}", node7)
self.assertEqual("{{foo|2=c}}", node8)
self.assertEqual("{{foo||c}}", node9)
self.assertEqual("{{foo||c}}", node10)
self.assertEqual("{{foo|2=c}}", node10)
self.assertEqual("{{foo|b=c|a =d}}", node11)
self.assertEqual("{{foo| a=|b=c|a =d}}", node12)
self.assertEqual("{{foo| a=b|a =d}}", node13)
@@ -410,7 +419,7 @@ class TestTemplate(TreeEqualityTestCase):
self.assertEqual("{{foo| a=b|b=c|a =}}", node16)
self.assertEqual("{{foo|b|c}}", node17)
self.assertEqual("{{foo|1 =|b|c}}", node18)
self.assertEqual("{{foo|1 =a||c}}", node19)
self.assertEqual("{{foo|1 =a|2=c}}", node19)
self.assertEqual("{{foo|1 =a||c}}", node20)
self.assertEqual("{{foo|c=d|e=f}}", node21)
self.assertEqual("{{foo|a=|c=d|e=f}}", node22)


+ 77
- 0
tests/tokenizer/integration.mwtest View File

@@ -241,3 +241,80 @@ name: newline_and_comment_in_template_name_7
label: a template name containing a newline followed by a comment
input: "{{foobar\n<!|key=value}}"
output: [Text(text="{{foobar\n<!|key=value}}")]

---

name: newline_and_comment_in_template_name_8
label: a template name containing a newline followed by a comment
input: "{{<!-- comment -->\nfoobar\n<!-- comment -->}}"
output: [TemplateOpen(), CommentStart(), Text(text=" comment "), CommentEnd(), Text(text="\nfoobar\n"), CommentStart(), Text(text=" comment "), CommentEnd(), TemplateClose()]

---

name: tag_in_link_title
label: HTML tags are invalid in link titles, even when complete
input: "[[foo<i>bar</i>baz]]"
output: [Text(text="[[foo"), TagOpenOpen(), Text(text="i"), TagCloseOpen(padding=""), Text(text="bar"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text="baz]]")]

---

name: tag_in_template_name
label: HTML tags are invalid in template names, even when complete
input: "{{foo<i>bar</i>baz}}"
output: [Text(text="{{foo"), TagOpenOpen(), Text(text="i"), TagCloseOpen(padding=""), Text(text="bar"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text="baz}}")]

---

name: tag_in_link_text
label: HTML tags are valid in link text
input: "[[foo|<i>bar</i>baz]]"
output: [WikilinkOpen(), Text(text="foo"), WikilinkSeparator(), TagOpenOpen(), Text(text="i"), TagCloseOpen(padding=""), Text(text="bar"), TagOpenClose(), Text(text="i"), TagCloseClose(), Text(text="baz"), WikilinkClose()]

---

name: comment_in_link_title
label: comments are valid in link titles
input: "[[foo<!--bar-->baz]]"
output: [WikilinkOpen(), Text(text="foo"), CommentStart(), Text(text="bar"), CommentEnd(), Text(text="baz"), WikilinkClose()]

---

name: incomplete_comment_in_link_title
label: incomplete comments are invalid in link titles
input: "[[foo<!--bar--baz]]"
output: [Text(text="[[foo<!--bar--baz]]")]

---

name: incomplete_comment_in_link_title_2
label: incomplete comments are invalid in link titles
input: "[[foo<!--barbaz]]"
output: [Text(text="[[foo<!--barbaz]]")]

---

name: incomplete_comment_in_link_title_3
label: incomplete comments are invalid in link titles
input: "[[foo<!barbaz]]"
output: [Text(text="[[foo<!barbaz]]")]

---

name: incomplete_comment_in_link_title_4
label: incomplete comments are invalid in link titles
input: "[[foo<!--]]"
output: [Text(text="[[foo<!--]]")]

---

name: incomplete_comment_in_link_title_5
label: incomplete comments are invalid in link titles
input: "[[foo<!-"
output: [Text(text="[[foo<!-")]

---

name: incomplete_comment_in_link_title_6
label: incomplete comments are invalid in link titles
input: "[[foo<!--bar"
output: [Text(text="[[foo<!--bar")]

+ 51
- 14
tests/tokenizer/templates.mwtest View File

@@ -1,17 +1,3 @@
name: blank
label: template with no content
input: "{{}}"
output: [TemplateOpen(), TemplateClose()]

---

name: blank_with_params
label: template with no content, but pipes and equal signs
input: "{{||=|}}"
output: [TemplateOpen(), TemplateParamSeparator(), TemplateParamSeparator(), TemplateParamEquals(), TemplateParamSeparator(), TemplateClose()]

---

name: no_params
label: simplest type of template
input: "{{template}}"
@@ -61,6 +47,13 @@ output: [TemplateOpen(), Text(text="foo"), TemplateParamSeparator(), Text(text="

---

name: blank_params
label: template with blank parameters (mix of pipes and equal signs)
input: "{{,||=|}}"
output: [TemplateOpen(), Text(text=","), TemplateParamSeparator(), TemplateParamSeparator(), TemplateParamEquals(), TemplateParamSeparator(), TemplateClose()]

---

name: nested_unnamed_param
label: nested template as an unnamed parameter
input: "{{foo|{{bar}}}}"
@@ -390,6 +383,34 @@ output: [TemplateOpen(), Text(text="foo\n "), TemplateParamSeparator(), Text(te

---

name: invalid_blank
label: invalid template with no content
input: "{{}}"
output: [Text(text="{{}}")]

---

name: invalid_blank_whitespace
label: invalid template with no content, but whitespace
input: "{{ }}"
output: [Text(text="{{ }}")]

---

name: invalid_blank_pipe
label: invalid template with no content, but a parameter
input: "{{|foo}}"
output: [Text(text="{{|foo}}")]

---

name: invalid_blank_whitespace_pipe
label: invalid template with no content, but whitespace and a parameter
input: "{{ |foo}}"
output: [Text(text="{{ |foo}}")]

---

name: invalid_name_left_brace_middle
label: invalid characters in template name: left brace in middle
input: "{{foo{bar}}"
@@ -509,6 +530,20 @@ output: [TemplateOpen(), Text(text="foo"), TemplateParamSeparator(), Text(text="

---

name: invalid_left_angle_bracket
label: invalid template: left angle bracket in name
input: "{{foo<bar}}"
output: [Text(text="{{foo<bar}}")]

---

name: invalid_right_angle_bracket
label: invalid template: right angle bracket in name
input: "{{foo>bar}}"
output: [Text(text="{{foo>bar}}")]

---

name: incomplete_stub
label: incomplete templates that should fail gracefully: just an opening
input: "{{"
@@ -651,5 +686,5 @@ output: [Text(text="{{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{ {{

name: recursion_opens_and_closes
label: test potentially dangerous recursion: template openings and closings
input: "{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}"
output: [Text(text="{{|"), TemplateOpen(), TemplateClose(), Text(text="{{|"), TemplateOpen(), TemplateClose(), TemplateOpen(), TemplateParamSeparator(), TemplateOpen(), TemplateClose(), Text(text="{{"), TemplateParamSeparator(), Text(text="{{"), TemplateClose(), Text(text="{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}{{|{{}}")]
input: "{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}"
output: [Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x|"), TemplateOpen(), Text(text="x"), TemplateClose(), TemplateOpen(), Text(text="x"), TemplateParamSeparator(), TemplateOpen(), Text(text="x"), TemplateClose(), Text(text="{{x"), TemplateParamSeparator(), Text(text="{{x"), TemplateClose(), Text(text="{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}{{x|{{x}}")]

+ 1
- 1
tests/tokenizer/text.mwtest View File

@@ -27,6 +27,6 @@ output: [Text(text="𐌲𐌿𐍄𐌰𐍂𐌰𐌶𐌳𐌰")]
---

name: large
label: a lot of text, requiring multiple textbuffer blocks in the C tokenizer
label: a lot of text, requiring proper storage in the C tokenizer
input: "ZWfsZYcZyhGbkDYJiguJuuhsNyHGFkFhnjkbLJyXIygTHqcXdhsDkEOTSIKYlBiohLIkiXxvyebUyCGvvBcYqFdtcftGmaAanKXEIyYSEKlTfEEbdGhdePVwVImOyKiHSzAEuGyEVRIKPZaNjQsYqpqARIQfvAklFtQyTJVGlLwjJIxYkiqmHBmdOvTyNqJRbMvouoqXRyOhYDwowtkcZGSOcyzVxibQdnzhDYbrgbatUrlOMRvFSzmLWHRihtXnddwYadPgFWUOxAzAgddJVDXHerawdkrRuWaEXfuwQSkQUmLEJUmrgXDVlXCpciaisfuOUjBldElygamkkXbewzLucKRnAEBimIIotXeslRRhnqQjrypnLQvvdCsKFWPVTZaHvzJMFEahDHWcCbyXgxFvknWjhVfiLSDuFhGoFxqSvhjnnRZLmCMhmWeOgSoanDEInKTWHnbpKyUlabLppITDFFxyWKAnUYJQIcmYnrvMmzmtYvsbCYbebgAhMFVVFAKUSvlkLFYluDpbpBaNFWyfXTaOdSBrfiHDTWGBTUCXMqVvRCIMrEjWpQaGsABkioGnveQWqBTDdRQlxQiUipwfyqAocMddXqdvTHhEwjEzMkOSWVPjJvDtClhYwpvRztPmRKCSpGIpXQqrYtTLmShFdpKtOxGtGOZYIdyUGPjdmyvhJTQMtgYJWUUZnecRjBfQXsyWQWikyONySLzLEqRFqcJYdRNFcGwWZtfZasfFWcvdsHRXoqKlKYihRAOJdrPBDdxksXFwKceQVncmFXfUfBsNgjKzoObVExSnRnjegeEhqxXzPmFcuiasViAFeaXrAxXhSfSyCILkKYpjxNeKynUmdcGAbwRwRnlAFbOSCafmzXddiNpLCFTHBELvArdXFpKUGpSHRekhrMedMRNkQzmSyFKjVwiWwCvbNWjgxJRzYeRxHiCCRMXktmKBxbxGZvOpvZIJOwvGIxcBLzsMFlDqAMLtScdsJtrbIUAvKfcdChXGnBzIxGxXMgxJhayrziaCswdpjJJJhkaYnGhHXqZwOzHFdhhUIEtfjERdLaSPRTDDMHpQtonNaIgXUYhjdbnnKppfMBxgNSOOXJAPtFjfAKnrRDrumZBpNhxMstqjTGBViRkDqbTdXYUirsedifGYzZpQkvdNhtFTOPgsYXYCwZHLcSLSfwfpQKtWfZuRUUryHJsbVsAOQcIJdSKKlOvCeEjUQNRPHKXuBJUjPuaAJJxcDMqyaufqfVwUmHLdjeYZzSiiGLHOTCInpVAalbXXTMLugLiwFiyPSuSFiyJUKVrWjbZAHaJtZnQmnvorRrxdPKThqXzNgTjszQiCoMczRnwGYJMERUWGXFyrSbAqsHmLwLlnJOJoXNsjVehQjVOpQOQJAZWwFZBlgyVIplzLTlFwumPgBLYrUIAJAcmvHPGfHfWQguCjfTYzxYfbohaLFAPwxFRrNuCdCzLlEbuhyYjCmuDBTJDMCdLpNRVqEALjnPSaBPsKWRCKNGwEMFpiEWbYZRwaMopjoUuBUvMpvyLfsPKDrfQLiFOQIWPtLIMoijUEUYfhykHrSKbTtrvjwIzHdWZDVwLIpNkloCqpzIsErxxKAFuFEjikWNYChqYqVslXMtoSWzNhbMuxYbzLfJIcPGoUeGPkGyPQNhDyrjgdKekzftFrRPTuyLYqCArkDcWHTrjPQHfoThBNnTQyMwLEWxEnBXLtzJmFVLGEPrdbEwlXpgYfnVnWoNXgPQKKyiXifpvrmJATzQOzYwFhliiYxlbnsEPKbHYUfJLrwYPfSUwTIHiEvBFMrEtVmqJobfcwsiiEudTIiAnrtuywgKLOiMYbEIOAOJdOXqroPjWnQQcTNxFvkIEIsuHLyhSqSphuSmlvknzydQEnebOreeZwOouXYKlObAkaWHhOdTFLoMCHOWrVKeXjcniaxtgCziKEqWOZUWHJQpcDJzYnnduDZrmxgjZroBRwoPBUTJMYipsgJwbTSlvMyXXdAmiEWGMiQxhGvHGPLOKeTxNaLnFVbWpiYIVyqN"
output: [Text(text="ZWfsZYcZyhGbkDYJiguJuuhsNyHGFkFhnjkbLJyXIygTHqcXdhsDkEOTSIKYlBiohLIkiXxvyebUyCGvvBcYqFdtcftGmaAanKXEIyYSEKlTfEEbdGhdePVwVImOyKiHSzAEuGyEVRIKPZaNjQsYqpqARIQfvAklFtQyTJVGlLwjJIxYkiqmHBmdOvTyNqJRbMvouoqXRyOhYDwowtkcZGSOcyzVxibQdnzhDYbrgbatUrlOMRvFSzmLWHRihtXnddwYadPgFWUOxAzAgddJVDXHerawdkrRuWaEXfuwQSkQUmLEJUmrgXDVlXCpciaisfuOUjBldElygamkkXbewzLucKRnAEBimIIotXeslRRhnqQjrypnLQvvdCsKFWPVTZaHvzJMFEahDHWcCbyXgxFvknWjhVfiLSDuFhGoFxqSvhjnnRZLmCMhmWeOgSoanDEInKTWHnbpKyUlabLppITDFFxyWKAnUYJQIcmYnrvMmzmtYvsbCYbebgAhMFVVFAKUSvlkLFYluDpbpBaNFWyfXTaOdSBrfiHDTWGBTUCXMqVvRCIMrEjWpQaGsABkioGnveQWqBTDdRQlxQiUipwfyqAocMddXqdvTHhEwjEzMkOSWVPjJvDtClhYwpvRztPmRKCSpGIpXQqrYtTLmShFdpKtOxGtGOZYIdyUGPjdmyvhJTQMtgYJWUUZnecRjBfQXsyWQWikyONySLzLEqRFqcJYdRNFcGwWZtfZasfFWcvdsHRXoqKlKYihRAOJdrPBDdxksXFwKceQVncmFXfUfBsNgjKzoObVExSnRnjegeEhqxXzPmFcuiasViAFeaXrAxXhSfSyCILkKYpjxNeKynUmdcGAbwRwRnlAFbOSCafmzXddiNpLCFTHBELvArdXFpKUGpSHRekhrMedMRNkQzmSyFKjVwiWwCvbNWjgxJRzYeRxHiCCRMXktmKBxbxGZvOpvZIJOwvGIxcBLzsMFlDqAMLtScdsJtrbIUAvKfcdChXGnBzIxGxXMgxJhayrziaCswdpjJJJhkaYnGhHXqZwOzHFdhhUIEtfjERdLaSPRTDDMHpQtonNaIgXUYhjdbnnKppfMBxgNSOOXJAPtFjfAKnrRDrumZBpNhxMstqjTGBViRkDqbTdXYUirsedifGYzZpQkvdNhtFTOPgsYXYCwZHLcSLSfwfpQKtWfZuRUUryHJsbVsAOQcIJdSKKlOvCeEjUQNRPHKXuBJUjPuaAJJxcDMqyaufqfVwUmHLdjeYZzSiiGLHOTCInpVAalbXXTMLugLiwFiyPSuSFiyJUKVrWjbZAHaJtZnQmnvorRrxdPKThqXzNgTjszQiCoMczRnwGYJMERUWGXFyrSbAqsHmLwLlnJOJoXNsjVehQjVOpQOQJAZWwFZBlgyVIplzLTlFwumPgBLYrUIAJAcmvHPGfHfWQguCjfTYzxYfbohaLFAPwxFRrNuCdCzLlEbuhyYjCmuDBTJDMCdLpNRVqEALjnPSaBPsKWRCKNGwEMFpiEWbYZRwaMopjoUuBUvMpvyLfsPKDrfQLiFOQIWPtLIMoijUEUYfhykHrSKbTtrvjwIzHdWZDVwLIpNkloCqpzIsErxxKAFuFEjikWNYChqYqVslXMtoSWzNhbMuxYbzLfJIcPGoUeGPkGyPQNhDyrjgdKekzftFrRPTuyLYqCArkDcWHTrjPQHfoThBNnTQyMwLEWxEnBXLtzJmFVLGEPrdbEwlXpgYfnVnWoNXgPQKKyiXifpvrmJATzQOzYwFhliiYxlbnsEPKbHYUfJLrwYPfSUwTIHiEvBFMrEtVmqJobfcwsiiEudTIiAnrtuywgKLOiMYbEIOAOJdOXqroPjWnQQcTNxFvkIEIsuHLyhSqSphuSmlvknzydQEnebOreeZwOouXYKlObAkaWHhOdTFLoMCHOWrVKeXjcniaxtgCziKEqWOZUWHJQpcDJzYnnduDZrmxgjZroBRwoPBUTJMYipsgJwbTSlvMyXXdAmiEWGMiQxhGvHGPLOKeTxNaLnFVbWpiYIVyqN")]

+ 28
- 0
tests/tokenizer/wikilinks.mwtest View File

@@ -124,6 +124,34 @@ output: [Text(text="[[foo|"), WikilinkOpen(), Text(text="bar"), WikilinkClose()]

---

name: invalid_left_angle_bracket
label: invalid wikilink: left angle bracket
input: "[[foo<bar]]"
output: [Text(text="[[foo<bar]]")]

---

name: invalid_right_angle_bracket
label: invalid wikilink: right angle bracket
input: "[[foo>bar]]"
output: [Text(text="[[foo>bar]]")]

---

name: invalid_newline_at_start
label: invalid wikilink: newline at start of title
input: "[[\nfoobar]]"
output: [Text(text="[[\nfoobar]]")]

---

name: invalid_newline_at_end
label: invalid wikilink: newline at end of title
input: "[[foobar\n]]"
output: [Text(text="[[foobar\n]]")]

---

name: incomplete_open_only
label: incomplete wikilinks: just an open
input: "[["

