Ben Kurtovic | 466d3a42f1 | copyvios: Minor refactor for cleaner stack frames. | 5 years ago
Ben Kurtovic | b4b079ffd0 | Update copyright year for 2016. | 8 years ago
Ben Kurtovic | 4828cbad69 | Catch possible ValueError when doing opener.open(). | 8 years ago
Ben Kurtovic | f52fb06c19 | Add a debug message when catching ParserExclusionError. | 8 years ago
Ben Kurtovic | 91846ce4fb | Refactor out mirror hinting logic in source parsers. | 8 years ago
Ben Kurtovic | 147b46f572 | A couple more fixes and cleanup. | 8 years ago
Ben Kurtovic | 03910b6cb5 | Add mirror detection logic to parsers; fixes. | 8 years ago
Ben Kurtovic | 81a090c923 | Allow content parsers to signal that a source should be excluded. | 8 years ago
Ben Kurtovic | bb819c9306 | Explicitly include excluded URLs in the result set; mark as excluded. | 8 years ago
Ben Kurtovic | 4e8be871b7 | Update copyright year for 2015. | 9 years ago
Ben Kurtovic | 9ffc3f1bf5 | Raise file crawl size limit for PDFs. | 9 years ago
Ben Kurtovic | 3f2dd1094f | Catch HTTPException in opener.open. | 9 years ago
Ben Kurtovic | 08d02917f2 | Strange typo. | 9 years ago
Ben Kurtovic | c2a5946874 | Fix generating -0.0 as a confidence value. | 9 years ago
Ben Kurtovic | 106e58b164 | Update confidence function comments. | 9 years ago
Ben Kurtovic | 5194525a32 | Note when sources might have been missed. | 9 years ago
Ben Kurtovic | 065d9ea498 | Fix; should always return a float. | 9 years ago
Ben Kurtovic | 290f81abed | Prevent -0.0 from being a confidence value. | 9 years ago
Ben Kurtovic | 932b93572a | Simplify function. | 9 years ago
Ben Kurtovic | 30f72df470 | Refactor parsers; fix empty document behavior. | 9 years ago
Ben Kurtovic | 5349179088 | Fix parsing of plain text documents (earwig/copyvios#3) | 9 years ago
Ben Kurtovic | f10908e34e | Handle struct.error from GzipFile.read() (Python bug?) | 9 years ago
Ben Kurtovic | 303c39c8c7 | Add an option to disable short-circuiting. | 9 years ago
Ben Kurtovic | f8f4669460 | Remove unnecessary key attribute of sources. | 9 years ago
Ben Kurtovic | 9fd145da5c | Add some docs; better sorting function. | 9 years ago
Ben Kurtovic | 7afb484cea | Refactor a bunch of copyvio internals. Store all sources with a result object. | 9 years ago
Ben Kurtovic | 54ddff049f | Make CopyvioSource public; tweaks. | 9 years ago
Ben Kurtovic | 0438766ee4 | Handle empty URLs better. | 9 years ago
Ben Kurtovic | 2147207388 | Remove unnecessary variable assign. | 9 years ago
Ben Kurtovic | f37621e5ec | Use a deque for a FIFO instead of the python list LIFO. | 9 years ago
Ben Kurtovic | 8e439e1eea | source.join() now blocks when in the middle of processing. | 9 years ago
Ben Kurtovic | dbb1ae5483 | Handle empty queues correctly. Remove some log messages. | 9 years ago
Ben Kurtovic | 2fa8aeba5b | Fix a blocking issue. | 9 years ago
Ben Kurtovic | 939d8be08f | Fix variable. | 9 years ago
Ben Kurtovic | 3ed8837a3e | Fix stopping queues in local mode. | 9 years ago
Ben Kurtovic | de7576728f | Fix dequeueing logic a bit. | 9 years ago
Ben Kurtovic | b939262b11 | Bugfix. | 9 years ago
Ben Kurtovic | 32ef0fbf1f | Add a bunch of temporary debugging code. | 9 years ago
Ben Kurtovic | c7b3b7bc7f | CopyvioSource.workspace should be public. | 9 years ago
Ben Kurtovic | e73e626994 | Some locks needed to be tightened. | 9 years ago
Ben Kurtovic | 486c4692ed | Remove _workers attr of workspaces. | 9 years ago
Ben Kurtovic | 7c0e98596c | Some bugfixes. | 9 years ago
Ben Kurtovic | 361f7709f8 | Starting work on global workers. | 9 years ago