This regression seems more severe than the bug the commit was
attempting to fix (incorrect parsing of nested wikilinks in normal
links), so that bug is reintroduced until localization-aware parsing
that allows us to detect file links is added.
This commit partially reverts fac60dee48.
* Proposed fix for https://github.com/earwig/mwparserfromhell/issues/197
* Port the fix for #197 to the C tokenizer
* Fix parsing of external links where the URL is terminated by some special character
- One existing test case was found to be wrong -- the current MediaWiki
version always terminates the URL when an opening bracket is
encountered.
- Other test cases added: a double quote, two single quotes, and angle
brackets always terminate the URL (regardless of whether it is a free
link or an external link inside brackets). A single quote does not
terminate the URL.
* Fix case-insensitive parsing of URI schemes
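
The termination and scheme rules listed above can be sketched roughly as
follows. This is a minimal illustration, not the tokenizer's actual code:
`find_url_end`, `is_known_scheme`, and `KNOWN_SCHEMES` are hypothetical names,
the scheme set is an arbitrary subset, and treating whitespace and a closing
bracket as terminators is an assumption added to make the example complete.

```python
# Illustrative sketch only; names and the scheme list are assumptions.
KNOWN_SCHEMES = {"http", "https", "ftp", "mailto"}


def is_known_scheme(scheme: str) -> bool:
    """Compare schemes case-insensitively, so "HTTP" matches "http"."""
    return scheme.lower() in KNOWN_SCHEMES


def find_url_end(text: str, start: int = 0) -> int:
    """Return the index at which a URL beginning at `start` terminates."""
    i = start
    while i < len(text):
        ch = text[i]
        # An opening bracket, a double quote, and angle brackets always end
        # the URL. Whitespace and "]" are assumed terminators (not from the
        # commit message).
        if ch.isspace() or ch in '[]"<>':
            return i
        # Two consecutive single quotes end the URL; a lone one does not.
        if text.startswith("''", i):
            return i
        i += 1
    return len(text)
```

For example, `find_url_end("http://example.com/it's")` runs to the end of the
string, while the same URL followed by `''` stops at the first of the two
quotes.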
Also removed the max cycles stop-gap, allowing much more complex pages
to be parsed quickly without losing nodes at the end.
Also fixes #65, fixes #102, fixes #165, fixes #183
Also fixes #81 (Rafael Nadal parsing bug)
Also fixes #53, fixes #58, fixes #88, fixes #152 (duplicate issues)
Tests were not correctly testing the situations without a table close.
Fixed tests and then fixed tokenizers for failing tests. Also refactored
pytokenizer to more closely match the ctokenizer by only holding the
`_parse` methods in the try blocks and no other code.
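
A minimal sketch of the try-block refactoring described above, assuming a
route-failure exception (called `BadRoute` here as a stand-in) and a
hypothetical helper `try_route`. Only the parse call sits inside `try`, so a
failure raised by the follow-up code is not swallowed by this handler.

```python
# Illustrative sketch only; BadRoute and try_route are stand-in names.
class BadRoute(Exception):
    """Signals that the current parse route failed and must be abandoned."""


def try_route(parse, on_success, on_failure):
    """Run `parse`, then dispatch to exactly one of the two callbacks."""
    try:
        # Keep only the call that may fail for *this* route in the try block.
        result = parse()
    except BadRoute:
        # The route failed: fall back, e.g. by treating the markup as text.
        return on_failure()
    else:
        # Runs outside the handler, so a BadRoute raised here propagates to
        # the caller instead of being mistaken for a failure of `parse`.
        return on_success(result)
```

With more code inside the `try`, a `BadRoute` raised while emitting tokens
would trigger the fallback for the wrong reason, which is the kind of
mismatch between two implementations that this layout avoids.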
Table tags no longer self-closing. Rows and cells now contain their
contents. Also refactored out an `emit_table_tag` method.
Note: this will require changes to the Tag node and possibly the builder;
those changes will be in the next commit.
Removed the `StopIteration()` exception for handling table style
and instead call `_handle_table_cell_end()` with a new parameter.
Also added some random tests for table openings.
Changed row recursion handling to make sure the tag is emitted even
when hitting recursion limits. Need to test table recursion to make
sure that works. Also fixed a bug in which tables were eating the
trailing token. Added several tests for rows and trailing tokens with
tables.
Tables and rows use newlines as padding, partly because these characters
are pretty important to the integrity of the table. They might need
to be in the preceding whitespace of inner tags instead of in the padding
after; not sure.
Started table parsing and added initial table support.
This is a big commit (ugh) and it should probably be split up into
multiple smaller ones if possible, but that seems unworkable as of
right now because of all the dependencies. Also breaks tests of
CTokenizer (double ugh) because I haven't started table support there.
May want to pick line by line on this commit later but I need to save
my work for now.