bitshift

Commit Graph

Author	SHA1	Message	Date
Benjamin Attal	2338887a52	Working version of java parser up and running.	10 years ago
Benjamin Attal	19a5457f07	Change directory structure for java	10 years ago
Benjamin Attal	306875dae7	Make Parser implement runnable so parsing tasks can be started in separate threads. Make Parser constructor accept a client socket, add reading and writing methods for the socket to JavaParser. Parse main method sets up a server for accepting parse jobs from the crawler, and starts threads for each parse task.	10 years ago
Benjamin Attal	77e2b6f524	Fix errors in java parser, mostly casting issues. In Parse.java, set up a tcp server for communication with python processes. Builds with maven	10 years ago
Benjamin Attal	669c30cac7	Mod: Parse.java: Added comments JavaParser.java: Updated the genSymbols method and a private class 'NodeVisitor' which implements ASTVisitor. genSymbols returns an instance of the Symbols class containing all relevant data about the Java code. JavaSymbols.java: Add fields which map class, interface, method, field, and variable names to positions.	10 years ago
Benjamin Attal	63b09caa6c	Changed directory structure of java parser. Decided on multiple parsers in different languages, refactored bitshift/parser to fit with that paradigm.	10 years ago
Benjamin Attal	a1066dd093	Modify parser/__init__.py so that it communicates with the Java parsing process and reads a result back from a unique file. Add template files for Java parsers.	10 years ago
Benjamin Attal	3bc748242d	Refactor parser/__init__.py for new parsing mechanism	10 years ago
Benjamin Attal	430b7d3588	Remove unnecessary submodule.	10 years ago
Benjamin Attal	a8f918f7c4	Update class names. Move language ids to languages.py	10 years ago
Benjamin Attal	0a57cf50e6	Add first version of the c parser Add: c.py - CTreeCutter class is very similar to PyTreeCutter. It utilizes self.cache as opposed to PyTreeCutter which doesn't yet. - CTreeCutter visit functions simply add start and end lines of the node to the cache, and visit_Decl pushes the cache onto accum. - parse_c performs a task identical to parse_py. However, many c files need to be pre-processed before they are parsed.	10 years ago
Benjamin Attal	847410b13c	Minor fix-ups in python parser. Mod: python.py - Add self.cache to allow for saving of unassociated metadata as the PyTreeCutter moves down the syntax tree. - Update docstrings.	10 years ago
Benjamin Attal	d485b87f21	Fix docstring in bitshift/parser/python.py	10 years ago
Benjamin Attal	b77db873c1	Refactor parsing in python by adding node visitor class. Performs same tasks as previous version, but is more concise. Add: bitshift/parser/python.py: Add PyTreeCutter class to perform actions on specific nodes.	10 years ago
Benjamin Attal	4d8c818c05	Corrected documentation in bitshift/codelet.py and bitshift/parser/__init__.py	10 years ago
Benjamin Attal	5db273a773	Bugfixes for _serialize function in bitshift/parser/python.py	10 years ago
Benjamin Attal	0c5e4572f8	Add placeholder functions for parsing c and java in bitshift/parser. Add parse_py function with helper functions. Parse_py grabs relevant information on variables, functions, and classes from abstract syntax tree of codelet code.	10 years ago
Benjamin Attal	903e4ccc05	Add constants in bitshift/config.py for languages instead of just strings.	10 years ago
Benjamin Attal	efdcb3793a	Add docstrings for functions in parser. Add ivar for syntax tree to codelet documentation.	10 years ago
Benjamin Attal	d88e68e16e	Add dispatch 'parse' function to parser __init__.py. Basic code language identification as well. Included pycparser as a dependency.	10 years ago
Ben Kurtovic	4dfd297472	Update some documentation.	10 years ago
Ben Kurtovic	c4816c2bb8	Merge branch 'develop' into feature/query_parser	10 years ago
Ben Kurtovic	2cf98df3e2	Merge branch 'develop' of github.com:earwig/bitshift into develop Conflicts: app.py setup.py	10 years ago
Ben Kurtovic	a3b1f6d0c3	Merge branch 'feature/database' into develop	10 years ago
Ben Kurtovic	56f23e682a	Database to v6; flesh out a lot of Database.search().	10 years ago
Severyn Kozak	7c5c9fc7e1	Add GitHub stars, Bitbucket watchers; close #14 . Add: bitshift/crawler/crawler.py -Add more efficient method of querying GitHub's API for stargazer counts, by batching 25 repositories per request. -Add watcher counts for Bitbucket repositories, by querying the Bitbucket API once per repository (inefficient, but the API in question isn't sufficiently robust to accommodate a better approach, and Git repositories surface so infrequently that there shouldn't be any query limit problems).	10 years ago
Severyn Kozak	d142f1fd55	Complete Crawler. Close #15 , #14 , #11 , #8 . Several of the closed issues were addressed partly in previous commits; definitively close them with this, for the moment, final update to the crawler package. Ref: bitshift/crawler/indexer.py -move all `GitIndexer` specific functions (eg, `_decode`, `_is_ascii()`)from the global scope to the class definition.	10 years ago
Severyn Kozak	6762c1fa3d	Re-add logging, rem file filters. Add: bitshift/ __init__.py -add `_configure_logging()`, which sets up a more robust logging infrastructure than was previously used: log files are rotated once per hour, and have some additional formatting rules. (crawler, indexer).py -add hierarchically-descending loggers to individual threaded classes (`GitHubCrawler`, `GitIndexer`, etc.); add logging calls. indexer.py -remove file filtering regex matches from `_get_tracked_files()`, as non-code files will be discarded by the parsers.	10 years ago
Severyn Kozak	1b2739f8c4	Add GitHub repo star count, simple logging. Add: bitshift/crawler/crawler.py -add `_get_repo_stars()` to `GitHubCrawler`, which queries the GitHub API for the number of stars that a given repository has. -log the `next_api_url` every time it's generated by `GitHubCrawler` and `BitbucketCrawler` to two respective log-files.	10 years ago
Severyn Kozak	ad7ce9d9cf	Commit latest crawler, continue fix of #8 . Add: bitshift/crawler/*.py -Remove use of the `logging` module, which appeared to be causing a memory leak even with log-file rotation.	10 years ago
Severyn Kozak	f38772760b	Remove some subprocesses, comment out logging. Add: bitshift/crawler/ (crawler, indexer).py -comment out all logging statements, as they may be causing a memory leak (the crawler is meant to run perpetually, meaning that, depending on how the `logging` module is implemented, it may be accumulating logged strings in memory.) bitshift/crawler/indexer.py -make `_index_repository()` and `_index_repository_codelets()` functions of the `GitIndexer` class. -replace `_get_tracked_files()` subprocess call, which found the files in a Git repository and removed any that were non-ASCII, with a pure Python solution. -add `_is_ascii()`.	10 years ago
Severyn Kozak	2954161747	Add partially integrated BitbucketCrawler(). Add: bitshift/crawler/ __init__.py -Initialize 'BitbucketCrawler()' singleton. -Instantiate all thread instances on-the-fly in a 'threads' array, as opposed to individual named variables. crawler.py -Add 'BitbucketCrawler()', to crawl Bitbucket for repositories. -Not entirely tested for proper functionality. -The Bitbucket framework is not yet accounted for in 'indexer._generate_file_url()'.	10 years ago
Severyn Kozak	93ed68645d	Add partially integrated BitbucketCrawler(). Add: bitshift/crawler/ __init__.py -Initialize 'BitbucketCrawler()' singleton. -Instantiate all thread instances on-the-fly in a 'threads' array, as opposed to individual named variables. crawler.py -Add 'BitbucketCrawler()', to crawl Bitbucket for repositories. -Not entirely tested for proper functionality. -The Bitbucket framework is not yet accounted for in 'indexer._generate_file_url()'.	10 years ago
Severyn Kozak	6718650a8c	First part of #8 fix. Add: bitshift/crawler/indexer.py -Add 'pkill git' to the 'git clone' subprocess in '_clone_repository()', to kill hanging remotes -- it's un-Pythonic, but, thus far, the only method that's proved successful. The RAM problem still persists; the latest dry-run lasted 01:11:00 before terminating due to a lack of allocatable memory. -Add exception names to `logging` messages. bitshift/assets -Update 'tag()' docstring to current 'bitshift' standards (add a ':type' and ':rtype:' field).	10 years ago
Severyn Kozak	3ce399adbf	Add threaded cloner, GitRepository class (#7 ). Add: bitshift/crawler/ (crawler, indexer).py -add a 'time.sleep()' call whenever a thread is blocking on items in a Queue, to prevent excessive polling (which hogs system resources). indexer.py -move 'git clone' functionality from the 'GitIndexer' singleton to a separate, threaded '_GitCloner'. -'crawler.GitHubCrawler' now shares a "clone" queue with '_GitCloner', which shares an "index" queue with 'GitIndexer'. -both indexing and cloning are time-intensive processes, so this improvement should (hypothetically) boost performance. -add `GitRepository` class, instances of which are passed around in the queues.	10 years ago
Severyn Kozak	755dce6ae3	Add logging to crawler/indexer. Add: bitshift/crawler/(__init__, crawler, indexer).py -add `logging` module to all `bitshift.crawler` modules, for some basic diagnostic output.	10 years ago
Severyn Kozak	f4b28e6178	Add file-ext regex rules, exception handlers. Add: bitshift/crawler/indexer.py -add two `try: except: pass` blocks, one to _decode() and another to GitIndexer.run(); bad practice, but GitIndexer has numerous unreliable moving parts that can throw too many unforeseeable exceptions. Only current viable option. -add file-extension regex ignore rules (for text, markdown, etc. files) to _get_tracked_files().	10 years ago
Severyn Kozak	627c848f20	Add tested indexer. Add: bitshift/crawler/indexer.py -add _debug(). -add content to the module docstring; add documentation to GitIndexer, and the functions that were lacking it. -add another perl one-liner to supplement the `git clone` subprocess call, which terminates it after a set amount of time (should it have frozen) -- fixes a major bug that caused the entire indexer to hang.	10 years ago
Severyn Kozak	b680756f8d	Test crawler, complete documentation. Add, Fix: bitshift/crawler/ __init__.py -add module and crawl() docstrings. -add repository_queue size limit. crawler.py -account for time spent executing an API query in the run() loop sleep() interval.	10 years ago
Severyn Kozak	b7ccec0501	Add untested threaded indexer/crawler prototype. Additions are not tested and not yet documented. Add: crawler.py -add threaded GitHubCrawler class, which interacts with a GitIndexer via a Queue. git_indexer.py -add threaded GitIndexer class, which interacts with GitHubCrawler via a Queue. -rename context-manager ChangeDir class to _ChangeDir, because it's essentially "private". __init__.py -add body to crawl(), which creates instances of GitHubCrawler and GitIndexer and starts them.	10 years ago
Severyn Kozak	97198ee523	Update Crawler documentation. Add: bitshift/crawler/git_indexer.py -add some missing docstrings, complete others.	10 years ago
Severyn Kozak	c655d97f48	Add class ChangeDir, amend unsafe subprocess. Add: bitshift/crawler/git_indexer.py -add ChangeDir class, a context-management wrapper for os.chdir(). -replace unsafe "rm -rf" subprocess call with shutil.rmtree()	10 years ago
Severyn Kozak	9fc4598001	Clean up crawler/, fix minor bugs. Add: bitshift/codelet.py -add name field to Codelet. bitshift/crawler/crawler.py -fix previously defunct code (which was committed at a point of incompletion) -- incorrect dictionary keys, etc.. -reformat some function calls' argument alignment to fit PEP standards. bitshift/crawler.py -add sleep() to ensure that an API query is made at regular intervals (determined by the GitHub API limit).	10 years ago
Severyn Kozak	77b448c3de	Mod Codelet, mov codelet creation from crawler. Add: bitshift/crawler/(crawler, git_indexer).py -move Codelet creation from the crawler to the git_indexer, in preparation for making crawling/indexing independent, threaded processes. Mod: bitshift/codelet.py -modify documentation for the author instance variable.	10 years ago
Severyn Kozak	ef9c0609fe	Mov author_files > git_indexer, heavily refactor. Add: bitshift/crawler/crawler.py -add base crawler module -add github(), to index Github. Mod: bitshift/crawler/ -add package subdirectory for the crawler module, and any subsidiary modules (eg, git_indexer). bitshift/author_files.py > bitshift/crawler/git_indexer.py -rename the module to "git_indexer", to better reflect its use. -convert from stand-alone script to a module whose functions integrate cleanly with the rest of the application. -add all necessary, tested functions, with Sphinx documentation.	10 years ago
Severyn Kozak	ef73c04347	Add prototype repo-indexer script author_files.py. Add: author_files.py -add prototype script to output metadata about every file in a Git repository: filename, author names, dates of creation and modification. -lacking Sphinx documentation.	10 years ago
Ben Kurtovic	950b6994f0	Database to v5; finish Database.insert().	10 years ago
Ben Kurtovic	d6ccdbd16d	Fix a couple Database bugs.	10 years ago
Ben Kurtovic	d2aef2829e	Finish database insertion, except for origins.	10 years ago
Ben Kurtovic	97b0644bf0	Database to v4: split off symbol_locations table.	10 years ago
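
Commit c655d97f48 above describes a ChangeDir class that wraps os.chdir() as a context manager, and the replacement of an "rm -rf" subprocess call with shutil.rmtree(). The repository's actual code is not shown in this log, so the following is only a minimal sketch of that idea. The attribute names and the remove_clone() helper are hypothetical, chosen for illustration.

```python
import os
import shutil
import tempfile


class ChangeDir(object):
    """Enter a directory on __enter__ and restore the previous one on __exit__.

    Hypothetical sketch; the real class in bitshift may differ.
    """

    def __init__(self, new_path):
        self.new_path = os.path.abspath(new_path)
        self.old_path = None

    def __enter__(self):
        # Remember where we were so it can be restored even if the body raises.
        self.old_path = os.getcwd()
        os.chdir(self.new_path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        os.chdir(self.old_path)
        return False  # do not swallow exceptions


def remove_clone(path):
    """Delete a cloned repository directory without shelling out to rm -rf.

    Hypothetical helper name; ignore_errors=True skips files that cannot be removed.
    """
    shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    clone_dir = tempfile.mkdtemp()  # stands in for a freshly cloned repository
    with ChangeDir(clone_dir):
        print(os.getcwd())          # inside the clone directory
    remove_clone(clone_dir)
```

Restoring the working directory in __exit__ matters for a long-running crawler: an exception while indexing one repository should not leave the process stranded in a directory that is about to be deleted.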

180 Commits (c6e5b4f0cc17ed18fb64ac3770df12498db5db2a)