Add:
bitshift/crawler/crawler.py
-Add more efficient method of querying GitHub's API for stargazer
counts, by batching 25 repositories per request.
-Add watcher counts for Bitbucket repositories, by querying the
Bitbucket API once per repository (inefficient, but the API in question
isn't sufficiently robust to accommodate a better approach, and Git
repositories surface so infrequently that there shouldn't be any query
limit problems).
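A minimal sketch of the 25-per-request batching noted above for GitHub stargazer counts. The search endpoint, the `repo:` qualifier, and the helper names `batch()` and `build_star_queries()` are assumptions for illustration, not necessarily what `crawler.py` does; the sketch only builds the query URLs and sends no requests.

```python
from urllib.parse import quote

SEARCH_URL = "https://api.github.com/search/repositories"  # assumed endpoint
REPOS_PER_QUERY = 25  # batch size named in the commit message


def batch(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_star_queries(repo_names, size=REPOS_PER_QUERY):
    """Return one search URL per batch of "owner/name" strings (hypothetical)."""
    urls = []
    for names in batch(repo_names, size):
        # One `repo:owner/name` qualifier per repository, joined with "+".
        query = "+".join("repo:" + quote(name, safe="/") for name in names)
        urls.append("%s?q=%s" % (SEARCH_URL, query))
    return urls
```

With 60 repository names this yields three URLs (25, 25, and 10 repositories), so the number of requests drops from one per repository to one per batch.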
Several of the closed issues were partly addressed in previous commits;
this update, the final one to the crawler package for the moment, closes
them definitively.
Ref:
bitshift/crawler/indexer.py
-move all `GitIndexer`-specific functions (e.g., `_decode()`,
`_is_ascii()`) from the global scope to the class definition.
Add:
bitshift/
__init__.py
-add `_configure_logging()`, which sets up a more robust logging
infrastructure than was previously used: log files are rotated once
per hour, and have some additional formatting rules.
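A rough sketch of hourly log rotation using the standard library's `TimedRotatingFileHandler`. The function name `configure_logging()`, the log directory argument, the format string, and the backup count are assumptions; the actual `_configure_logging()` may differ.

```python
import logging
import logging.handlers
import os


def configure_logging(log_dir, name="app"):
    """Attach an hourly-rotating file handler to the root logger (a sketch)."""
    os.makedirs(log_dir, exist_ok=True)

    # Assumed format: timestamp, level, logger name, message.
    formatter = logging.Formatter(
        fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
        datefmt="%y-%m-%d %H:%M:%S")

    # when="H", interval=1 rolls the file over once per hour; backupCount
    # (assumed value) caps how many rotated files are kept.
    handler = logging.handlers.TimedRotatingFileHandler(
        os.path.join(log_dir, name + ".log"),
        when="H", interval=1, backupCount=24)
    handler.setFormatter(formatter)

    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.DEBUG)
    return handler
```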
(crawler, indexer).py
-add hierarchically-descending loggers to individual threaded
classes (`GitHubCrawler`, `GitIndexer`, etc.); add logging calls.
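One way the per-class loggers could be derived, sketched with `Logger.getChild()`. The parent logger name `"bitshift.crawler"` and the `index()` method are assumptions, and `GitIndexer` here is only a stand-in with the threading omitted.

```python
import logging

# Parent logger for the package; the name "bitshift.crawler" is assumed.
package_logger = logging.getLogger("bitshift.crawler")


class GitIndexer(object):
    """Stand-in for a threaded class; only the logger setup is shown."""

    def __init__(self):
        # Child logger named "bitshift.crawler.GitIndexer"; its records
        # propagate up to handlers on the parent and root loggers.
        self._logger = package_logger.getChild(self.__class__.__name__)

    def index(self, repo_name):
        # Hypothetical method, present only to show a logging call.
        self._logger.info("Indexing repository: %s", repo_name)
```

Because records propagate upward, handlers only need to be configured once at the top of the hierarchy.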
indexer.py
-remove file filtering regex matches from `_get_tracked_files()`,
as non-code files will be discarded by the parsers.
Add:
bitshift/crawler/crawler.py
-add `_get_repo_stars()` to `GitHubCrawler`, which queries the GitHub
API for the number of stars that a given repository has.
-log the `next_api_url` every time it's generated by `GitHubCrawler` and
`BitbucketCrawler` to their respective log files.
Add:
bitshift/crawler/
(crawler, indexer).py
-comment out all logging statements, as they may be causing a
memory leak: the crawler is meant to run perpetually, so, depending
on how the `logging` module is implemented, it may be accumulating
logged strings in memory.
bitshift/crawler/indexer.py
-make `_index_repository()` and `_index_repository_codelets()`
methods of the `GitIndexer` class.
-replace `_get_tracked_files()` subprocess call, which found the
files in a Git repository and removed any that were non-ASCII, with
a pure Python solution.
-add `_is_ascii()`.
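A simplified sketch of a pure-Python replacement for the subprocess call. It walks the working tree rather than asking Git which files are tracked, and it treats a file as ASCII if its first few kilobytes decode as ASCII; both choices, along with the `sample_size` parameter, are assumptions and may not match the real `_get_tracked_files()` and `_is_ascii()`.

```python
import os


def is_ascii(path, sample_size=4096):
    """Return True if the first `sample_size` bytes of `path` are ASCII."""
    try:
        with open(path, "rb") as handle:
            handle.read(sample_size).decode("ascii")
    except (UnicodeDecodeError, OSError):
        # Non-ASCII bytes, or the file could not be read at all.
        return False
    return True


def get_tracked_files(repo_root):
    """Return sorted ASCII file paths under `repo_root`, relative to it."""
    tracked = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune `.git` in place so os.walk never descends into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            if is_ascii(full_path):
                tracked.append(os.path.relpath(full_path, repo_root))
    return sorted(tracked)
```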
Add:
bitshift/crawler/
__init__.py
-Initialize 'BitbucketCrawler()' singleton.
-Instantiate all thread instances on-the-fly in a 'threads' array, as
opposed to individual named variables.
crawler.py
-Add 'BitbucketCrawler()', to crawl Bitbucket for repositories.
-Not entirely tested for proper functionality.
-The Bitbucket framework is not yet accounted for in
'indexer._generate_file_url()'.
Add:
bitshift/crawler/indexer.py
-Add 'pkill git' to the 'git clone' subprocess in '_clone_repository()',
to kill hanging remotes -- it's un-Pythonic, but, thus far, the only
method that's proved successful. The RAM problem still persists; the
latest dry-run lasted 01:11:00 before terminating due to a lack of
allocatable memory.
-Add exception names to `logging` messages.
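For comparison with the 'pkill git' workaround above, a sketch of a different technique: bounding a subprocess with the standard library's `timeout` argument. This is not what `_clone_repository()` does, and `run_with_timeout()` is a hypothetical helper. Note that the timeout kills only the direct child, so helper processes spawned by `git clone` might survive it.

```python
import subprocess
import sys


def run_with_timeout(command, timeout):
    """Run `command`; return its exit code, or None if it timed out."""
    try:
        completed = subprocess.run(
            command, timeout=timeout,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the direct child.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a hanging `git clone`: a child that sleeps too long.
    print(run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], timeout=0.5))
```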
bitshift/assets
-Update 'tag()' docstring to current 'bitshift' standards (add ':type'
and ':rtype:' fields).
Add:
bitshift/crawler/
(crawler, indexer).py
-add a 'time.sleep()' call whenever a thread is blocking on items
in a Queue, to prevent excessive polling (which hogs system
resources).
indexer.py
-move 'git clone' functionality from the 'GitIndexer' singleton to
a separate, threaded '_GitCloner'.
-'crawler.GitHubCrawler' now shares a "clone" queue with
'_GitCloner', which shares an "index" queue with 'GitIndexer'.
-both indexing and cloning are time-intensive processes, so this
improvement should (hypothetically) boost performance.
-add `GitRepository` class, instances of which are passed around in
the queues.
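A compact sketch of the crawler -> "clone" queue -> cloner -> "index" queue -> indexer arrangement described above, including the short sleep when a queue is empty. The `GitRepository` fields, the worker functions, and `run_pipeline()` are hypothetical, and the actual cloning and indexing are replaced by placeholders.

```python
import queue
import threading
import time
from collections import namedtuple

# Hypothetical fields; the real `GitRepository` class may carry more.
GitRepository = namedtuple("GitRepository", ["url", "name"])

POLL_DELAY = 0.05  # seconds to sleep when a queue is empty (assumed value)


def cloner(clone_queue, index_queue, stop):
    """Move repositories from the clone queue to the index queue."""
    while not stop.is_set():
        try:
            repo = clone_queue.get_nowait()
        except queue.Empty:
            time.sleep(POLL_DELAY)  # avoid busy-polling an empty queue
            continue
        # A real cloner would run `git clone repo.url` here.
        index_queue.put(repo)


def indexer(index_queue, indexed, stop):
    """Consume cloned repositories and record their names."""
    while not stop.is_set():
        try:
            repo = index_queue.get_nowait()
        except queue.Empty:
            time.sleep(POLL_DELAY)
            continue
        indexed.append(repo.name)  # placeholder for real indexing


def run_pipeline(repos, timeout=5.0):
    """Feed `repos` through both stages; return names in indexing order."""
    clone_queue, index_queue = queue.Queue(), queue.Queue()
    indexed, stop = [], threading.Event()
    threads = [
        threading.Thread(target=cloner, args=(clone_queue, index_queue, stop)),
        threading.Thread(target=indexer, args=(index_queue, indexed, stop)),
    ]
    for thread in threads:
        thread.start()
    for repo in repos:  # the crawler's role: enqueue discovered repositories
        clone_queue.put(repo)
    deadline = time.monotonic() + timeout
    while len(indexed) < len(repos) and time.monotonic() < deadline:
        time.sleep(0.01)
    stop.set()
    for thread in threads:
        thread.join()
    return indexed
```

Since cloning and indexing run in separate threads, one repository can be indexed while the next is still being cloned.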