Ben Kurtovic
4baab6f57c
Implement lazy importing of root-level modules and packages.
- Simplify all imports
- Update dependency version in setup.py
- Change waitTime default from three seconds to two
12 years ago
Ben Kurtovic
8d8703358c
More fixes and tweaks; cleanup; etc.
12 years ago
Ben Kurtovic
f993b847ab
Encode URLs as UTF-8 before opening them.
12 years ago
Ben Kurtovic
570168ed0e
Institute a timeout so we don't try to open these suspicious URLs forever.
12 years ago
Ben Kurtovic
439b855254
Fully implement logging; fix non-unicode log messages.
12 years ago
Ben Kurtovic
a074da853b
More work on copyvios, including an exclusions database (#5)
* Added exclusions module with a fully implemented ExclusionsDB that can pull
from multiple sources for different sites.
* Moved CopyvioCheckResult to its own module, to be imported by __init__.
* Some other related changes.
12 years ago
Ben Kurtovic
c260648bdb
Finish chunking algorithm, improve !link, other fixes.
12 years ago
Ben Kurtovic
569c815d99
Implement NLTK for chunking article content (#5).
12 years ago
Ben Kurtovic
1af4217b63
Update copyright notices and some other improvements.
12 years ago
Ben Kurtovic
d45e342bac
DOCUMENT EVERYTHING (#5)
Also implementing MWParserFromHell, plus some cleanup.
12 years ago
Ben Kurtovic
d87c226417
__repr__ and __str__ for everything per #5 and #22.
12 years ago
Ben Kurtovic
7dbbe9683c
Update imports and exceptions.
12 years ago
Ben Kurtovic
5ca1d91f3e
Use __all__ within e.w.copyvios and shorter imports
12 years ago
Ben Kurtovic
86a8440730
Moving parsers to own file.
12 years ago
Ben Kurtovic
d4e947b98b
earwigbot.wiki.copyvios.search module split
12 years ago
Ben Kurtovic
e6a381f3f7
Restructuring copyvio stuff as its own package.
12 years ago
Ben Kurtovic
9434a416a1
Moved search engine/credential info into config proper.
- In config.json, search config relocated from
tasks.afc_copyvios to wiki.
- Site.__init__() takes a `search_config' argument, which is
auto-supplied from its value in config.json by get_site().
- Page.copyvio_check() doesn't ask for search config
anymore, meaning doing checks from the command line
is less painful.
- Added a Page.copyvio_compare() function, which works
just like copyvio_check() but on a specified URL; this is
for cache retrieval on the web front-end.
12 years ago
Ben Kurtovic
f382ceb38e
Pushing some smarter logic for MarkovChains
- Incomplete; need this for the TS rewrite
- Also starting work on docstrings for some methods
12 years ago
Ben Kurtovic
755dff9714
Copyvios: auto-fail very small articles (< 20 chain links)
12 years ago
Ben Kurtovic
6009c050f9
Minor integer division fix.
12 years ago
Ben Kurtovic
df7868da3e
Updates to copyright violation stuff.
12 years ago
Ben Kurtovic
ee2b1133bb
Algorithm for comparing article content against a suspected source using MarkovChains
12 years ago
Ben Kurtovic
2da906109b
Copyright update for 2012.
12 years ago
Ben Kurtovic
13100533b9
CopyrightMixin needs Page._site
12 years ago
Ben Kurtovic
c48073515b
#wikipedia-en-afc -> #wikipedia-en-afc-feed
12 years ago
Ben Kurtovic
24f7eabb77
Some more work on copyvio detection code
Also removed the hardcoded version in user-agent strings.
12 years ago
Ben Kurtovic
56e6140284
More work on copyright violation detection code.
12 years ago
Ben Kurtovic
0b6d5eac5e
Some code for copyvio detection, including querying Yahoo! BOSS correctly.
12 years ago