Several of the closed issues were partly addressed in previous commits;
this update to the crawler package, the final one for the moment, closes
them definitively.
Ref:
bitshift/crawler/indexer.py
-move all `GitIndexer`-specific functions (eg, `_decode()`,
`_is_ascii()`) from the global scope to the class definition.
Add:
bitshift/
__init__.py
-add `_configure_logging()`, which sets up a more robust logging
infrastructure than was previously used: log files are rotated once
per hour, and have some additional formatting rules.
(crawler, indexer).py
-add hierarchically-descending loggers to individual threaded
classes (`GitHubCrawler`, `GitIndexer`, etc.); add logging calls.
indexer.py
-remove file filtering regex matches from `_get_tracked_files()`,
as non-code files will be discarded by the parsers.
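A minimal sketch of the kind of setup `_configure_logging()` describes,
assuming the standard library's `TimedRotatingFileHandler` is used for the
hourly rotation. The function signature, log directory, format string,
`backupCount` value and the `Worker` class are illustrative assumptions, not
the package's actual code.

```python
import logging
import logging.handlers
import os


def configure_logging(log_dir, name="crawler"):
    """Attach an hourly-rotating file handler to the logger called `name`."""
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)

    # when="H", interval=1 rolls the file over once per hour.
    handler = logging.handlers.TimedRotatingFileHandler(
        os.path.join(log_dir, name + ".log"), when="H", interval=1,
        backupCount=20)  # assumed retention; tune as needed
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)-8s [%(name)s] %(message)s"))

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger


class Worker(object):
    """Hypothetical threaded class that logs beneath the parent logger."""

    def __init__(self, parent="crawler"):
        # "crawler.Worker" is a child of "crawler", so its records propagate
        # up to the rotating handler attached in configure_logging().
        self._logger = logging.getLogger(
            "%s.%s" % (parent, self.__class__.__name__))
```

Naming each class's logger as a dotted child of the package logger means
only the parent needs a handler; the children inherit its level and output.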
Add:
bitshift/crawler/crawler.py
-add `_get_repo_stars()` to `GitHubCrawler`, which queries the GitHub
API for the number of stars that a given repository has.
-log the `next_api_url` every time it's generated by `GitHubCrawler` and
`BitbucketCrawler`, each to its own log file.
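One way a star lookup like `_get_repo_stars()` could work, as a sketch only:
the GitHub REST API returns a `stargazers_count` field for
`GET /repos/{owner}/{repo}`. The function names, the injectable `fetch`
parameter and the fallback to 0 are assumptions made for illustration;
authentication and rate-limit handling are left out.

```python
import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def _fetch_json(url):
    """Download `url` and decode the response body as JSON."""
    response = urlopen(url)
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()


def get_repo_stars(full_name, fetch=_fetch_json):
    """Return the star count for a repository named like "owner/name".

    Returns 0 if the response carries no `stargazers_count` field.
    """
    data = fetch("https://api.github.com/repos/%s" % full_name)
    return int(data.get("stargazers_count", 0))
```

Passing `fetch` as a parameter keeps the parsing testable without touching
the network.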
Add:
bitshift/crawler/
(crawler, indexer).py
-comment out all logging statements, as they may be causing a
memory leak (the crawler is meant to run perpetually, meaning that,
depending on how the `logging` module is implemented, it may be
accumulating logged strings in memory.)
bitshift/crawler/indexer.py
-make `_index_repository()` and `_index_repository_codelets()`
functions of the `GitIndexer` class.
-replace `_get_tracked_files()` subprocess call, which found the
files in a Git repository and removed any that were non-ASCII, with
a pure Python solution.
-add `_is_ascii()`.
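A rough sketch of what a pure-Python replacement for the subprocess call
might look like. It walks the working tree rather than asking Git which
files are tracked, so untracked files would be included too; that
simplification, the function names and the `sample_size` parameter are
assumptions, not the commit's actual code.

```python
import os


def is_ascii(path, sample_size=4096):
    """Guess whether the file at `path` is ASCII.

    Only the first `sample_size` bytes are inspected; an empty file passes.
    """
    with open(path, "rb") as source:
        sample = bytearray(source.read(sample_size))
    return all(byte < 128 for byte in sample)


def get_tracked_files(repo_root):
    """Return sorted repo-relative paths of ASCII files, skipping `.git`."""
    tracked = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune in place so os.walk() never descends into .git.
        dirnames[:] = [name for name in dirnames if name != ".git"]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.isfile(path) and is_ascii(path):
                tracked.append(os.path.relpath(path, repo_root))
    return sorted(tracked)
```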
Add:
bitshift/crawler/
__init__.py
-Initialize 'BitbucketCrawler()' singleton.
-Instantiate all thread instances on-the-fly in a 'threads' array, as
opposed to individual named variables.
crawler.py
-Add 'BitbucketCrawler()', to crawl Bitbucket for repositories.
-Not entirely tested for proper functionality.
-The Bitbucket framework is not yet accounted for in
'indexer._generate_file_url()'.
Add:
bitshift/crawler/indexer.py
-Add 'pkill git' to the 'git clone' subprocess in '_clone_repository()',
to kill hanging remotes -- it's un-Pythonic, but, thus far, the only
method that's proved successful. The RAM problem still persists; the
latest dry-run lasted 01:11:00 before terminating due to a lack of
allocatable memory.
-Add exception names to `logging` messages.
bitshift/assets
-Update 'tag()' docstring to current 'bitshift' standards (add ':type'
and ':rtype:' fields).
Add:
bitshift/crawler/
(crawler, indexer).py
-add a 'time.sleep()' call whenever a thread is blocking on items
in a Queue, to prevent excessive polling (which hogs system
resources).
indexer.py
-move 'git clone' functionality from the 'GitIndexer' singleton to
a separate, threaded '_GitCloner'.
-'crawler.GitHubCrawler' now shares a "clone" queue with
'_GitCloner', which shares an "index" queue with 'GitIndexer'.
-both indexing and cloning are time-intensive processes, so this
improvement should (hypothetically) boost performance.
-add `GitRepository` class, instances of which are passed around in
the queues.
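The crawler -> cloner -> indexer hand-off and the sleep-while-empty polling
can be sketched roughly as follows. `Stage`, its parameters and the
`poll_interval` default are hypothetical stand-ins for `_GitCloner` and
`GitIndexer`, not their real interfaces.

```python
import threading
import time

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2


class Stage(threading.Thread):
    """Hypothetical worker: take from `inbox`, apply `work`, feed `outbox`."""

    def __init__(self, inbox, work, outbox=None, poll_interval=0.05):
        super(Stage, self).__init__()
        self.daemon = True
        self.inbox = inbox
        self.work = work
        self.outbox = outbox
        self.poll_interval = poll_interval

    def run(self):
        while True:
            try:
                item = self.inbox.get_nowait()
            except queue.Empty:
                # Nothing queued yet: sleep instead of spinning on the queue.
                time.sleep(self.poll_interval)
                continue
            result = self.work(item)
            if self.outbox is not None:
                self.outbox.put(result)
            self.inbox.task_done()
```

Usage, with placeholder work functions standing in for cloning and indexing:

```python
clone_queue, index_queue = queue.Queue(), queue.Queue()
indexed = []
Stage(clone_queue, lambda url: url + "#cloned", index_queue).start()
Stage(index_queue, indexed.append).start()
clone_queue.put("repo")
clone_queue.join()
index_queue.join()
```

A blocking `Queue.get()` would avoid polling altogether; the sleep is shown
here only because it is what the note above describes.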
Add:
bitshift/crawler/indexer.py
-add two `try: except: pass` blocks, one to _decode() and another to
GitIndexer.run(); bad practice, but GitIndexer has numerous unreliable
moving parts that can throw too many unforeseeable exceptions.
Currently the only viable option.
-add file-extension regex ignore rules (for text, markdown, etc. files)
to _get_tracked_files().
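For illustration, an extension-based ignore rule might look like the sketch
below. The commit does not list which extensions it covers, so the pattern
and the names `IGNORED_EXTENSIONS` and `filter_tracked()` are assumptions.

```python
import re

# Assumed ignore list; extend or trim to match the real rules.
IGNORED_EXTENSIONS = re.compile(r"\.(txt|md|markdown|rst)$", re.IGNORECASE)


def filter_tracked(paths):
    """Drop paths whose file extension matches the ignore pattern."""
    return [path for path in paths if not IGNORED_EXTENSIONS.search(path)]
```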
Add:
bitshift/crawler/indexer.py
-add _debug().
-add content to the module docstring; add documentation to GitIndexer,
and the functions that were lacking it.
-add another perl one-liner to supplement the `git clone` subprocess
call, which terminates it after a set amount of time (should it have
frozen) -- fixes a major bug that caused the entire indexer to hang.
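The same time limit can be expressed in Python rather than perl. This is an
alternative sketch, not the one-liner the commit added: it starts the
command with `subprocess.Popen` and kills it from a `threading.Timer` if it
outlives the limit. `run_with_timeout()` and its return convention are
hypothetical.

```python
import subprocess
import threading


def run_with_timeout(command, timeout):
    """Run `command`; kill it if it is still alive after `timeout` seconds.

    Returns the exit code, or None if the process had to be killed.
    """
    process = subprocess.Popen(command)
    timed_out = threading.Event()

    def _kill():
        timed_out.set()
        try:
            process.kill()
        except OSError:
            pass  # the process exited on its own as the timer fired

    timer = threading.Timer(timeout, _kill)
    timer.start()
    try:
        process.wait()
    finally:
        timer.cancel()
    return None if timed_out.is_set() else process.returncode
```

A clone would then be attempted with something like
`run_with_timeout(["git", "clone", url, destination], 500)`, where the
500-second limit is an arbitrary example value.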
Add, Fix:
bitshift/crawler/
__init__.py
-add module and crawl() docstrings.
-add repository_queue size limit.
crawler.py
-account for time spent executing an API query in the run() loop
sleep() interval.
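Accounting for query time in the sleep interval amounts to sleeping only
for what is left of the interval once the query returns. A minimal sketch,
with `remaining_delay()` and `run_loop()` as hypothetical names:

```python
import time


def remaining_delay(started, interval, now):
    """Seconds left in `interval`, given work began at `started`.

    Never negative: if the work overran the interval, return 0.
    """
    return max(0.0, interval - (now - started))


def run_loop(query, interval, iterations):
    """Call `query` roughly once per `interval` seconds."""
    for _ in range(iterations):
        started = time.time()
        query()
        time.sleep(remaining_delay(started, interval, time.time()))
```

With a 2-second interval and a query that took 0.5 seconds, the loop sleeps
1.5 seconds instead of the full 2.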
Additions are not tested and not yet documented.
Add:
crawler.py
-add threaded GitHubCrawler class, which interacts with a GitIndexer
via a Queue.
git_indexer.py
-add threaded GitIndexer class, which interacts with GitHubCrawler via
a Queue.
-rename context-manager ChangeDir class to _ChangeDir, because it's
essentially "private".
__init__.py
-add body to crawl(), which creates instances of GitHubCrawler and
GitIndexer and starts them.
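A directory-changing context manager of the kind `_ChangeDir` is described
as could be as small as the sketch below; the attribute names are
assumptions, and the real class may differ.

```python
import os


class ChangeDir(object):
    """Context manager: chdir into `new_path`, restore the old cwd on exit."""

    def __init__(self, new_path):
        self.new_path = new_path
        self.old_path = None

    def __enter__(self):
        self.old_path = os.getcwd()
        os.chdir(self.new_path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the original directory even if the body raised.
        os.chdir(self.old_path)
        return False  # do not swallow exceptions
```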
Add:
bitshift/codelet.py
-add name field to Codelet.
bitshift/crawler/crawler.py
-fix previously defunct code (which was committed at a point of
incompletion) -- incorrect dictionary keys, etc.
-reformat some function calls' argument alignment to fit PEP 8.
bitshift/crawler.py
-add sleep() to ensure that an API query is made at regular intervals
(determined by the GitHub API limit).
Add:
bitshift/crawler/(crawler, git_indexer).py
-move Codelet creation from the crawler to the git_indexer, in
preparation for making crawling/indexing independent, threaded
processes.
Mod:
bitshift/codelet.py
-modify documentation for the author instance variable.
Add:
bitshift/crawler/crawler.py
-add base crawler module
-add github(), to index GitHub.
Mod:
bitshift/crawler/
-add package subdirectory for the crawler module, and any subsidiary
modules (eg, git_indexer).
bitshift/author_files.py > bitshift/crawler/git_indexer.py
-rename the module to "git_indexer", to better reflect its use.
-convert from stand-alone script to a module whose functions integrate
cleanly with the rest of the application.
-add all necessary, tested functions, with Sphinx documentation.
Add:
author_files.py
-add prototype script to output metadata about every file in a Git
repository: filename, author names, dates of creation and modification.
-lacking Sphinx documentation.