a rule for running it in the Rakefile.
Add:
parser_server.rb:
- listens for connections from the Python client process
parser.rb:
- creates a syntax tree from the input and returns relevant data
about it to the client
Mod:
Parser.java:
- Moved client reading and writing methods to the abstract
parser class, so that they are not specific to the JavaParser
JavaParser.java:
- Implemented NodeVisitor._cache. The cache is a stack of data
packets. When a node that we want information on is first
visited, a new packet of data is pushed onto the stack. The
child nodes of that original node then add information to the
packet, and when the original node is traversed again on the
way up the tree, the data is popped from the cache and added
to the symbols. This makes it possible to gather information
about various levels of the tree easily.
JavaSymbols.java:
- Refactor all the insertMethods to simply add a packet of data
to the appropriate HashMap.
Symbols.java:
- Add a createCoord method which returns an ArrayList
representing a point in a document.
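
Note: the NodeVisitor._cache behaviour described above can be sketched
in Python as follows. This is only an illustration under assumptions,
not the Java implementation: the `Node` class, the `CachingVisitor`
class, and the set of "wanted" node kinds are all hypothetical.

```python
class Node:
    """Hypothetical tree node: a kind, a name, and child nodes."""
    def __init__(self, kind, name, children=()):
        self.kind = kind
        self.name = name
        self.children = list(children)


class CachingVisitor:
    """Collects one data packet per node of interest, using a stack."""
    def __init__(self, wanted=("class", "method")):
        self.wanted = set(wanted)
        self.cache = []    # stack of packets still being filled in
        self.symbols = []  # finished packets

    def visit(self, node):
        tracked = node.kind in self.wanted
        if tracked:
            # First visit: push a fresh packet for this node.
            self.cache.append(
                {"kind": node.kind, "name": node.name, "children": []})
        elif self.cache:
            # A child adds information to the innermost open packet.
            self.cache[-1]["children"].append(node.name)
        for child in node.children:
            self.visit(child)
        if tracked:
            # Back at the node on the way up: pop and record its packet.
            self.symbols.append(self.cache.pop())


tree = Node("class", "Foo", [
    Node("field", "x"),
    Node("method", "bar", [Node("variable", "y")]),
])
visitor = CachingVisitor()
visitor.visit(tree)
```

Because packets are popped on the way back up, inner nodes (here the
method) are recorded before the nodes that enclose them.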
Parse.java:
Added comments
JavaParser.java:
Updated the genSymbols method and a private class 'NodeVisitor' which
implements ASTVisitor. genSymbols returns an instance of the
Symbols class containing all relevant data about the Java code.
JavaSymbols.java:
Add fields which map class, interface, method, field, and
variable names to positions.
Add:
c.py
- CTreeCutter class is very similar to PyTreeCutter. Unlike
PyTreeCutter, it already makes use of self.cache.
- CTreeCutter visit functions simply add start and end lines of
the node to the cache, and visit_Decl pushes the cache onto
accum.
- parse_c performs a task identical to parse_py. However, many
C files need to be pre-processed before they are parsed.
Add:
bitshift/crawler/crawler.py
-Add a more efficient method of querying GitHub's API for stargazer
counts, batching 25 repositories per request.
-Add watcher counts for Bitbucket repositories, by querying the
Bitbucket API once per repository (inefficient, but the API in question
isn't sufficiently robust to accommodate a better approach, and Git
repositories surface so infrequently that there shouldn't be any query
limit problems).
Several of the closed issues were addressed partly in previous commits;
close them definitively with this update to the crawler package, which is
the final one for the moment.
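
Note: the batching of 25 repositories per request could look roughly
like the sketch below. The helper names `batches` and
`build_batch_query` are hypothetical, and the `repo:owner/name` query
format is an assumption about how a batch might be expressed, not
necessarily what the crawler sends.

```python
def batches(items, size=25):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_query(repo_names):
    """Join repository names into one (assumed) search query string."""
    return " ".join("repo:" + name for name in repo_names)


# 60 repositories become three requests instead of 60.
names = ["owner/repo%d" % i for i in range(60)]
queries = [build_batch_query(chunk) for chunk in batches(names)]
```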
Ref:
bitshift/crawler/indexer.py
-move all `GitIndexer`-specific functions (e.g., `_decode()`,
`_is_ascii()`) from the global scope to the class definition.
Add:
bitshift/
__init__.py
-add `_configure_logging()`, which sets up a more robust logging
infrastructure than was previously used: log files are rotated once
per hour, and have some additional formatting rules.
(crawler, indexer).py
-add hierarchically-descending loggers to individual threaded
classes (`GitHubCrawler`, `GitIndexer`, etc.); add logging calls.
indexer.py
-remove file filtering regex matches from `_get_tracked_files()`,
as non-code files will be discarded by the parsers.
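
Note: a minimal sketch of hourly log rotation with hierarchical loggers,
using the standard library's `TimedRotatingFileHandler`. The function
name `configure_logging`, the logger name "bitshift_example", the file
name "app.log", the format string, and the backup count are all
assumptions made for illustration; the real `_configure_logging()` may
differ.

```python
import logging
import logging.handlers
import os


def configure_logging(log_dir, name="bitshift_example"):
    """Attach an hourly-rotating file handler to the named logger."""
    os.makedirs(log_dir, exist_ok=True)
    handler = logging.handlers.TimedRotatingFileHandler(
        os.path.join(log_dir, "app.log"),
        when="H", interval=1, backupCount=20)  # rotate once per hour
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger
```

A class-level logger such as
`logging.getLogger("bitshift_example.crawler")` needs no handler of its
own: its records propagate up to the parent configured here.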
Add:
bitshift/crawler/crawler.py
-add `_get_repo_stars()` to `GitHubCrawler`, which queries the GitHub
API for the number of stars that a given repository has.
-log the `next_api_url` every time it's generated by `GitHubCrawler` and
`BitbucketCrawler` to two respective log files.
Add:
bitshift/crawler/
(crawler, indexer).py
-comment out all logging statements, as they may be causing a
memory leak (the crawler is meant to run perpetually, meaning that,
depending on how the `logging` module is implemented, it may be
accumulating logged strings in memory.)
bitshift/crawler/indexer.py
-make `_index_repository()` and `_index_repository_codelets()`
functions of the `GitIndexer` class.
-replace `_get_tracked_files()` subprocess call, which found the
files in a Git repository and removed any that were non-ASCII, with
a pure Python solution.
-add `_is_ascii()`.
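
Note: one way a pure Python replacement might look is sketched below.
It is an approximation under assumptions: it walks the working tree
rather than asking Git which files are tracked, it only samples the
start of each file, and the function names and `sample_size` parameter
are hypothetical.

```python
import os


def is_ascii(path, sample_size=4096):
    """Return True if the first `sample_size` bytes of the file are ASCII."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    try:
        sample.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


def get_tracked_files(root):
    """Walk `root`, skipping .git, and return relative paths of ASCII files."""
    tracked = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune the .git directory in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            if is_ascii(full_path):
                tracked.append(os.path.relpath(full_path, root))
    return sorted(tracked)
```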
Add:
bitshift/crawler/
__init__.py
-Initialize 'BitbucketCrawler()' singleton.
-Instantiate all thread instances on-the-fly in a 'threads' array, as
opposed to individual named variables.
crawler.py
-Add 'BitbucketCrawler()', to crawl Bitbucket for repositories.
-Not entirely tested for proper functionality.
-The Bitbucket framework is not yet accounted for in
'indexer._generate_file_url()'.
Add:
bitshift/crawler/indexer.py
-Add 'pkill git' to the 'git clone' subprocess in '_clone_repository()',
to kill hanging remotes -- it's un-Pythonic, but, thus far, the only
method that's proved successful. The RAM problem still persists; the
latest dry-run lasted 01:11:00 before terminating due to a lack of
allocatable memory.
-Add exception names to `logging` messages.
bitshift/assets
-Update 'tag()' docstring to current 'bitshift' standards (add a ':type'
and ':rtype:' field).
Add:
bitshift/crawler/
(crawler, indexer).py
-add a 'time.sleep()' call whenever a thread is blocking on items
in a Queue, to prevent excessive polling (which hogs system
resources).
indexer.py
-move 'git clone' functionality from the 'GitIndexer' singleton to
a separate, threaded '_GitCloner'.
-'crawler.GitHubCrawler' now shares a "clone" queue with
'_GitCloner', which shares an "index" queue with 'GitIndexer'.
-both indexing and cloning are time-intensive processes, so this
improvement should (hypothetically) boost performance.
-add `GitRepository` class, instances of which are passed around in
the queues.
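
Note: the crawler -> cloner -> indexer hand-off described above can be
sketched with two queues and two threads. Everything here is a
simplified stand-in: the `SENTINEL` shutdown marker, the function names,
and the trimmed-down `GitRepository` are assumptions, and this sketch
uses blocking `Queue.get()` calls rather than the `time.sleep()` polling
mentioned in the commit.

```python
import queue
import threading

SENTINEL = None  # hypothetical shutdown marker, not from the original code


class GitRepository:
    """Minimal stand-in for the object passed around in the queues."""
    def __init__(self, url, name):
        self.url = url
        self.name = name
        self.cloned = False


def cloner(clone_queue, index_queue):
    """Take repositories off the clone queue and pass them to the indexer."""
    while True:
        repo = clone_queue.get()
        if repo is SENTINEL:
            index_queue.put(SENTINEL)  # tell the indexer to stop as well
            break
        repo.cloned = True  # a real cloner would run `git clone` here
        index_queue.put(repo)


def indexer(index_queue, indexed):
    """Take cloned repositories off the index queue and record them."""
    while True:
        repo = index_queue.get()
        if repo is SENTINEL:
            break
        indexed.append(repo.name)


def run_pipeline(repos):
    clone_queue, index_queue, indexed = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=cloner, args=(clone_queue, index_queue)),
        threading.Thread(target=indexer, args=(index_queue, indexed)),
    ]
    for worker in workers:
        worker.start()
    for repo in repos:  # the crawler's role: feed the clone queue
        clone_queue.put(repo)
    clone_queue.put(SENTINEL)
    for worker in workers:
        worker.join()
    return indexed
```

With the stages decoupled like this, the cloner can fetch the next
repository while the indexer is still working on the previous one.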
Add:
bitshift/crawler/indexer.py
-add two `try: except: pass` blocks, one to _decode() and another to
GitIndexer.run(); bad practice, but GitIndexer has numerous unreliable
moving parts that can throw too many unforeseeable exceptions. This is
currently the only viable option.
-add file-extension regex ignore rules (for text, markdown, etc. files)
to _get_tracked_files().
Add:
bitshift/crawler/indexer.py
-add _debug().
-add content to the module docstring; add documentation to GitIndexer,
and the functions that were lacking it.
-add another perl one-liner to supplement the `git clone` subprocess
call, which terminates it after a set amount of time (should it have
frozen) -- fixes a major bug that caused the entire indexer to hang.
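
Note: the commit itself relies on a perl one-liner to enforce the time
limit. As a different technique with the same goal, Python 3's
`subprocess.run()` accepts a `timeout` argument; the sketch below shows
that alternative. The helper name `run_with_timeout` is hypothetical,
and the commands in the usage lines are placeholders rather than the
real `git clone` call.

```python
import subprocess
import sys


def run_with_timeout(command, timeout):
    """Run `command`; return its exit code, or None if it timed out."""
    try:
        # On timeout the child is killed and TimeoutExpired is raised.
        return subprocess.run(command, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return None


# A process that hangs for 5 seconds is cut off after half a second.
hung = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5)
# A process that finishes promptly returns its normal exit code.
finished = run_with_timeout([sys.executable, "-c", "pass"], timeout=10)
```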
Add, Fix:
bitshift/crawler/
__init__.py
-add module and crawl() docstrings.
-add repository_queue size limit.
crawler.py
-account for time spent executing an API query in the run() loop
sleep() interval.
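
Note: accounting for query time in the sleep interval can be sketched as
below. The function name `run_at_interval` and its injectable `clock`
and `sleep` parameters are assumptions added so the behaviour can be
checked without real waiting; the actual run() loop may be structured
differently.

```python
import time


def run_at_interval(task, interval, iterations,
                    clock=time.monotonic, sleep=time.sleep):
    """Call `task` repeatedly, starting calls about `interval` seconds apart."""
    for _ in range(iterations):
        started = clock()
        task()
        elapsed = clock() - started
        # Sleep only for what is left of the interval, never a negative time.
        remaining = interval - elapsed
        if remaining > 0:
            sleep(remaining)
```

If a query takes 0.5 seconds and the interval is 2 seconds, the loop
sleeps for 1.5 seconds; if a query takes longer than the interval, it
does not sleep at all.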
Additions are not tested and not yet documented.
Add:
crawler.py
-add threaded GitHubCrawler class, which interacts with a GitIndexer
via a Queue.
git_indexer.py
-add threaded GitIndexer class, which interacts with GitHubCrawler via
a Queue.
-rename context-manager ChangeDir class to _ChangeDir, because it's
essentially "private".
__init__.py
-add body to crawl(), which creates instances of GitHubCrawler and
GitIndexer and starts them.
Add:
bitshift/codelet.py
-add name field to Codelet.
bitshift/crawler/crawler.py
-fix previously defunct code (which was committed at a point of
incompletion) -- incorrect dictionary keys, etc.
-reformat some function calls' argument alignment to fit PEP 8 standards.
bitshift/crawler.py
-add sleep() to ensure that an API query is made at regular intervals
(determined by the GitHub API limit).
Add:
bitshift/crawler/(crawler, git_indexer).py
-move Codelet creation from the crawler to the git_indexer, in
preparation for making crawling/indexing independent, threaded
processes.
Mod:
bitshift/codelet.py
-modify documentation for the author instance variable.