bitshift

Commit Graph

Author	SHA1	Message	Date
Ben Kurtovic	54dfe93e8a	Fix filenames as unicode.	10 years ago
Ben Kurtovic	a748baf83f	Fix authors as unicode.	10 years ago
Benjamin Attal	3fd25b944e	Change parser commands into subprocesses rather than servers.	10 years ago
Ben Kurtovic	8363f2b0f4	Generic try/catch directly around parse() for parser errors.	10 years ago
Ben Kurtovic	36dd97dd1f	Support only searching for symbol decls/uses (fixes #53)	10 years ago
Ben Kurtovic	fb35774790	Remove funcName from log messages.	10 years ago
Ben Kurtovic	e4ddd3ec5f	Fix for repo.git.log().	10 years ago
Ben Kurtovic	55980f33fd	Better method for determining commit history.	10 years ago
Ben Kurtovic	627deadc86	Improve metadata retrieval.	10 years ago
Ben Kurtovic	782b9b9faf	Use the branch name.	10 years ago
Ben Kurtovic	2987ae27cb	Sadface.	10 years ago
Ben Kurtovic	4c34055849	Another bugfix.	10 years ago
Ben Kurtovic	be091dff9b	Assorted bugfixes.	10 years ago
Ben Kurtovic	a1a5252aa7	Typo.	10 years ago
Ben Kurtovic	afc5980683	Rewrite much of the indexer to use GitPython.	10 years ago
Ben Kurtovic	3dbdf86ff7	Fix.	10 years ago
Ben Kurtovic	4c6f4039a2	Ugly, but fixes a crawler threading bug.	10 years ago
Ben Kurtovic	ac5f0981cc	Rearrange sleep.	10 years ago
Ben Kurtovic	1c0c4104e5	Support crawling specific repos; add some logging.	10 years ago
Ben Kurtovic	bf25b3af66	Don't configure logging twice.	10 years ago
Ben Kurtovic	53a8ad91fa	Fix for symbol locs.	10 years ago
Ben Kurtovic	91ab08f99c	Try something.	10 years ago
Ben Kurtovic	d609c233a1	Attempt to fix /tmp race condition.	10 years ago
Ben Kurtovic	11b460eaa0	Fix repo names.	10 years ago
Ben Kurtovic	e77de2305c	Start working on new language system.	10 years ago
Benjamin Attal	2d643b1069	Stop ruby parser from failing. Add other parser fixes. Should be good to go now.	10 years ago
Ben Kurtovic	ddcb5b221f	Use logs to calculate ranks (closes #61).	10 years ago
Ben Kurtovic	10e7491a40	Fix indexer breaking http:// URLs.	10 years ago
Ben Kurtovic	1015298109	Make it easy to stop crawler/parsers. Cleanup.	10 years ago
Benjamin Attal	4202552a1e	Remove unnecessary import	10 years ago
Benjamin Attal	21cf52ea65	Call start_parse_servers from crawl.py	10 years ago
Ben Kurtovic	f02dc4497c	Fixes.	10 years ago
Severyn Kozak	94953624c8	Fix #34. Add: bitshift/crawler/indexer.py -Add a `try-except` block to catch the `UnsupportedFileError` exception.	10 years ago
Ben Kurtovic	5a83720617	Strip encoding lines.	10 years ago
Severyn Kozak	fc8d478060	Untested fix #33. Add: bitshift/crawler/indexer.py -Add conditional to remove the full path of a repository if the owner's directory contains only one sub-directory.	10 years ago
Ben Kurtovic	a3eacc287e	Try to make exception reporting more useful.	10 years ago
Ben Kurtovic	9f935bbb74	This is ugly, but it improves the current setup.	10 years ago
Severyn Kozak	b698a16c98	Add parse() and insert() calls to crawler. Add: bitshift/crawler/indexer.py -Add `parse()` and `insert()` calls to `_insert_repository_codelets()`.	10 years ago
Severyn Kozak	f8436fa484	Part of #26. Move __init__.py to crawl.py. Add: bitshift/crawler/(__init__, crawl).py -Move `__init__.py` to `crawl.py`, and add a `main` block to allow running the crawler via `python -m`.	10 years ago
Severyn Kozak	7c5c9fc7e1	Add GitHub stars, Bitbucket watchers; close #14. Add: bitshift/crawler/crawler.py -Add more efficient method of querying GitHub's API for stargazer counts, by batching 25 repositories per request. -Add watcher counts for Bitbucket repositories, by querying the Bitbucket API once per repository (inefficient, but the API in question isn't sufficiently robust to accommodate a better approach, and Git repositories surface so infrequently that there shouldn't be any query limit problems).	10 years ago
Severyn Kozak	d142f1fd55	Complete Crawler. Close #15, #14, #11, #8. Several of the closed issues were addressed partly in previous commits; definitively close them with this, for the moment, final update to the crawler package. Ref: bitshift/crawler/indexer.py -move all `GitIndexer` specific functions (eg, `_decode`, `_is_ascii()`) from the global scope to the class definition.	10 years ago
Severyn Kozak	6762c1fa3d	Re-add logging, rem file filters. Add: bitshift/ __init__.py -add `_configure_logging()`, which sets up a more robust logging infrastructure than was previously used: log files are rotated once per hour, and have some additional formatting rules. (crawler, indexer).py -add hierarchically-descending loggers to individual threaded classes (`GitHubCrawler`, `GitIndexer`, etc.); add logging calls. indexer.py -remove file filtering regex matches from `_get_tracked_files()`, as non-code files will be discarded by the parsers.	10 years ago
Severyn Kozak	1b2739f8c4	Add GitHub repo star count, simple logging. Add: bitshift/crawler/crawler.py -add `_get_repo_stars()` to `GitHubCrawler`, which queries the GitHub API for the number of stars that a given repository has. -log the `next_api_url` every time it's generated by `GitHubCrawler` and `BitbucketCrawler` to two respective log-files.	10 years ago
Severyn Kozak	ad7ce9d9cf	Commit latest crawler, continue fix of #8. Add: bitshift/crawler/*.py -Remove use of the `logging` module, which appeared to be causing a memory leak even with log-file rotation.	10 years ago
Severyn Kozak	f38772760b	Remove some subprocesses, comment out logging. Add: bitshift/crawler/ (crawler, indexer).py -comment out all logging statements, as they may be causing a memory leak (the crawler is meant to run perpetually, meaning that, depending on how the `logging` module is implemented, it may be accumulating logged strings in memory.) bitshift/crawler/indexer.py -make `_index_repository()` and `_index_repository_codelets()` functions of the `GitIndexer` class. -replace `_get_tracked_files()` subprocess call, which found the files in a Git repository and removed any that were non-ASCII, with a pure Python solution. -add `_is_ascii()`.	10 years ago
Severyn Kozak	2954161747	Add partially integrated BitbucketCrawler(). Add: bitshift/crawler/ __init__.py -Initialize 'BitbucketCrawler()' singleton. -Instantiate all thread instances on-the-fly in a 'threads' array, as opposed to individual named variables. crawler.py -Add 'BitbucketCrawler()', to crawl Bitbucket for repositories. -Not entirely tested for proper functionality. -The Bitbucket framework is not yet accounted for in 'indexer._generate_file_url()'.	10 years ago
Severyn Kozak	93ed68645d	Add partially integrated BitbucketCrawler(). Add: bitshift/crawler/ __init__.py -Initialize 'BitbucketCrawler()' singleton. -Instantiate all thread instances on-the-fly in a 'threads' array, as opposed to individual named variables. crawler.py -Add 'BitbucketCrawler()', to crawl Bitbucket for repositories. -Not entirely tested for proper functionality. -The Bitbucket framework is not yet accounted for in 'indexer._generate_file_url()'.	10 years ago
Severyn Kozak	6718650a8c	First part of #8 fix. Add: bitshift/crawler/indexer.py -Add 'pkill git' to the 'git clone' subprocess in '_clone_repository()', to kill hanging remotes -- it's un-Pythonic, but, thus far, the only method that's proved successful. The RAM problem still persists; the latest dry-run lasted 01:11:00 before terminating due to a lack of allocatable memory. -Add exception names to `logging` messages. bitshift/assets -Update 'tag()' docstring to current 'bitshift' standards (add a ':type' and ':rtype:' field).	10 years ago
Severyn Kozak	3ce399adbf	Add threaded cloner, GitRepository class (#7). Add: bitshift/crawler/ (crawler, indexer).py -add a 'time.sleep()' call whenever a thread is blocking on items in a Queue, to prevent excessive polling (which hogs system resources). indexer.py -move 'git clone' functionality from the 'GitIndexer' singleton to a separate, threaded '_GitCloner'. -'crawler.GitHubCrawler' now shares a "clone" queue with '_GitCloner', which shares an "index" queue with 'GitIndexer'. -both indexing and cloning are time-intensive processes, so this improvement should (hypothetically) boost performance. -add `GitRepository` class, instances of which are passed around in the queues.	10 years ago
Severyn Kozak	755dce6ae3	Add logging to crawler/indexer. Add: bitshift/crawler/(__init__, crawler, indexer).py -add `logging` module to all `bitshift.crawler` modules, for some basic diagnostic output.	10 years ago


61 Commits (develop)