A Python parser for MediaWiki wikicode https://mwparserfromhell.readthedocs.io/

mwparserfromhell
================

.. image:: https://travis-ci.org/earwig/mwparserfromhell.png?branch=develop
  :alt: Build Status
  :target: http://travis-ci.org/earwig/mwparserfromhell

**mwparserfromhell** (the *MediaWiki Parser from Hell*) is a Python package
that provides an easy-to-use and outrageously powerful parser for MediaWiki_
wikicode. It supports Python 2 and Python 3.

Developed by Earwig_ with contributions from `Σ`_, Legoktm_, and others.
Full documentation is available on ReadTheDocs_. Development occurs on GitHub_.

Installation
------------

The easiest way to install the parser is through the `Python Package Index`_,
so you can install the latest release with ``pip install mwparserfromhell``
(`get pip`_). Alternatively, get the latest development version::

    git clone https://github.com/earwig/mwparserfromhell.git
    cd mwparserfromhell
    python setup.py install

If you get ``error: Unable to find vcvarsall.bat`` while installing, this is
because Windows can't find the compiler for C extensions. Consult this
`StackOverflow question`_ for help. You can also set ``ext_modules`` in
``setup.py`` to an empty list to prevent the extension from building.

You can run the comprehensive unit testing suite with
``python setup.py test -q``.

Usage
-----

Normal usage is rather straightforward (where ``text`` is page text)::

    >>> import mwparserfromhell
    >>> wikicode = mwparserfromhell.parse(text)

``wikicode`` is a ``mwparserfromhell.Wikicode`` object, which acts like an
ordinary ``unicode`` object (or ``str`` in Python 3) with some extra methods.
For example::

    >>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
    >>> wikicode = mwparserfromhell.parse(text)
    >>> print wikicode
    I has a template! {{foo|bar|baz|eggs=spam}} See it?
    >>> templates = wikicode.filter_templates()
    >>> print templates
    ['{{foo|bar|baz|eggs=spam}}']
    >>> template = templates[0]
    >>> print template.name
    foo
    >>> print template.params
    ['bar', 'baz', 'eggs=spam']
    >>> print template.get(1).value
    bar
    >>> print template.get("eggs").value
    spam

Since nodes can contain other nodes, getting nested templates is trivial::

    >>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}"
    >>> mwparserfromhell.parse(text).filter_templates()
    ['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}']

You can also pass ``recursive=False`` to ``filter_templates()`` and explore
templates manually. This is possible because nodes can contain additional
``Wikicode`` objects::

    >>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}")
    >>> print code.filter_templates(recursive=False)
    ['{{foo|this {{includes a|template}}}}']
    >>> foo = code.filter_templates(recursive=False)[0]
    >>> print foo.get(1).value
    this {{includes a|template}}
    >>> print foo.get(1).value.filter_templates()[0]
    {{includes a|template}}
    >>> print foo.get(1).value.filter_templates()[0].get(1).value
    template

Templates can be easily modified to add, remove, or alter params. ``Wikicode``
objects can be treated like lists, with ``append()``, ``insert()``,
``remove()``, ``replace()``, and more. They also have a ``matches()`` method
for comparing page or template names, which takes care of capitalization and
whitespace::

    >>> text = "{{cleanup}} '''Foo''' is a [[bar]]. {{uncategorized}}"
    >>> code = mwparserfromhell.parse(text)
    >>> for template in code.filter_templates():
    ...     if template.name.matches("Cleanup") and not template.has("date"):
    ...         template.add("date", "July 2012")
    ...
    >>> print code
    {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{uncategorized}}
    >>> code.replace("{{uncategorized}}", "{{bar-stub}}")
    >>> print code
    {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
    >>> print code.filter_templates()
    ['{{cleanup|date=July 2012}}', '{{bar-stub}}']

You can then convert ``code`` back into a regular ``unicode`` object (for
saving the page!) by calling ``unicode()`` on it::

    >>> text = unicode(code)
    >>> print text
    {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
    >>> text == code
    True

Likewise, use ``str(code)`` in Python 3.
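
To make the name comparison done by ``matches()`` more concrete, here is a rough
sketch of that kind of check written against plain strings. ``loosely_matches``
is a hypothetical helper, not part of the library. It assumes the usual
MediaWiki convention that surrounding whitespace is ignored and only the first
letter is case-insensitive; the real method works on ``Wikicode`` objects and
may differ in details, such as how it treats markup inside the name.

```python
def loosely_matches(a, b):
    """Approximate a page-name comparison on plain strings (hypothetical helper)."""
    # Whitespace around a template or page name is not significant.
    a, b = a.strip(), b.strip()
    if not a or not b:
        # An empty name only matches another empty name.
        return a == b
    # Assumption: only the first character is case-insensitive;
    # the rest of the name must match exactly.
    return a[0].upper() + a[1:] == b[0].upper() + b[1:]
```

Under these assumptions ``loosely_matches(" cleanup\n", "Cleanup")`` is true,
while ``loosely_matches("cleanup", "CleanUp")`` is not.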

Integration
-----------

``mwparserfromhell`` is used by and originally developed for EarwigBot_;
``Page`` objects have a ``parse`` method that essentially calls
``mwparserfromhell.parse()`` on ``page.get()``.

If you're using Pywikipedia_, your code might look like this::

    import mwparserfromhell
    import wikipedia as pywikibot

    def parse(title):
        site = pywikibot.getSite()
        page = pywikibot.Page(site, title)
        text = page.get()
        return mwparserfromhell.parse(text)

If you're not using a library, you can parse any page using the following code
(via the API_)::

    import json
    import urllib
    import mwparserfromhell

    API_URL = "http://en.wikipedia.org/w/api.php"

    def parse(title):
        data = {"action": "query", "prop": "revisions", "rvlimit": 1,
                "rvprop": "content", "format": "json", "titles": title}
        raw = urllib.urlopen(API_URL, urllib.urlencode(data)).read()
        res = json.loads(raw)
        text = res["query"]["pages"].values()[0]["revisions"][0]["*"]
        return mwparserfromhell.parse(text)
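
The snippet above targets Python 2. Under Python 3, ``urlopen`` and
``urlencode`` live in ``urllib.request`` and ``urllib.parse``, and
``dict.values()`` can no longer be indexed. A possible Python 3 adaptation is
sketched below. The helper names ``build_query`` and ``extract_text``, the
``https`` URL, and the ``User-Agent`` string are assumptions made for
illustration. The sketch relies on the same response layout as the code above.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # assumption: https variant of the URL above
# Placeholder value; replace with something that identifies your own script.
USER_AGENT = "example-wikicode-fetcher/0.1 (hypothetical)"


def build_query(title):
    """Return the URL-encoded POST body asking for the latest revision of a page."""
    data = {"action": "query", "prop": "revisions", "rvlimit": 1,
            "rvprop": "content", "format": "json", "titles": title}
    # urlopen() expects bytes for POST data in Python 3.
    return urllib.parse.urlencode(data).encode("utf-8")


def extract_text(res):
    """Pull the wikitext of the first page out of a decoded API response."""
    # dict.values() is a view in Python 3, so take its first element via an iterator.
    page = next(iter(res["query"]["pages"].values()))
    return page["revisions"][0]["*"]


def parse(title):
    # Imported here so the helpers above can be used without the parser installed.
    import mwparserfromhell
    request = urllib.request.Request(API_URL, data=build_query(title),
                                     headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        res = json.loads(response.read().decode("utf-8"))
    return mwparserfromhell.parse(extract_text(res))
```

Splitting the request body and the response handling into small functions keeps
the network call in one place, so the other two pieces can be checked offline
with a hand-written response dictionary.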

.. _MediaWiki:              http://mediawiki.org
.. _ReadTheDocs:            http://mwparserfromhell.readthedocs.org
.. _Earwig:                 http://en.wikipedia.org/wiki/User:The_Earwig
.. _Σ:                      http://en.wikipedia.org/wiki/User:%CE%A3
.. _Legoktm:                http://en.wikipedia.org/wiki/User:Legoktm
.. _GitHub:                 https://github.com/earwig/mwparserfromhell
.. _Python Package Index:   http://pypi.python.org
.. _StackOverflow question: http://stackoverflow.com/questions/2817869/error-unable-to-find-vcvarsall-bat
.. _get pip:                http://pypi.python.org/pypi/pip
.. _EarwigBot:              https://github.com/earwig/earwigbot
.. _Pywikipedia:            https://www.mediawiki.org/wiki/Manual:Pywikipediabot
.. _API:                    http://mediawiki.org/wiki/API