bitshift

Commit Graph

Author	SHA1	Message	Date
Severyn Kozak	d142f1fd55	Complete Crawler. Close #15, #14, #11, #8. Several of the closed issues were addressed partly in previous commits; definitively close them with this, for the moment, final update to the crawler package. Ref: bitshift/crawler/indexer.py -move all `GitIndexer` specific functions (eg, `_decode`, `_is_ascii()`) from the global scope to the class definition.	10 years ago
Severyn Kozak	6762c1fa3d	Re-add logging, rem file filters. Add: bitshift/ __init__.py -add `_configure_logging()`, which sets up a more robust logging infrastructure than was previously used: log files are rotated once per hour, and have some additional formatting rules. (crawler, indexer).py -add hierarchically-descending loggers to individual threaded classes (`GitHubCrawler`, `GitIndexer`, etc.); add logging calls. indexer.py -remove file filtering regex matches from `_get_tracked_files()`, as non-code files will be discarded by the parsers.	10 years ago
Severyn Kozak	1b2739f8c4	Add GitHub repo star count, simple logging. Add: bitshift/crawler/crawler.py -add `_get_repo_stars()` to `GitHubCrawler`, which queries the GitHub API for the number of stars that a given repository has. -log the `next_api_url` every time it's generated by `GitHubCrawler` and `BitbucketCrawler` to two respective log-files.	10 years ago
Severyn Kozak	ad7ce9d9cf	Commit latest crawler, continue fix of #8. Add: bitshift/crawler/*.py -Remove use of the `logging` module, which appeared to be causing a memory leak even with log-file rotation.	10 years ago
Severyn Kozak	f38772760b	Remove some subprocesses, comment out logging. Add: bitshift/crawler/ (crawler, indexer).py -comment out all logging statements, as they may be causing a memory leak (the crawler is meant to run perpetually, meaning that, depending on how the `logging` module is implemented, it may be accumulating logged strings in memory.) bitshift/crawler/indexer.py -make `_index_repository()` and `_index_repository_codelets()` functions of the `GitIndexer` class. -replace `_get_tracked_files()` subprocess call, which found the files in a Git repository and removed any that were non-ASCII, with a pure Python solution. -add `_is_ascii()`.	10 years ago
Severyn Kozak	2954161747	Add partially integrated BitbucketCrawler(). Add: bitshift/crawler/ __init__.py -Initialize 'BitbucketCrawler()' singleton. -Instantiate all thread instances on-the-fly in a 'threads' array, as opposed to individual named variables. crawler.py -Add 'BitbucketCrawler()', to crawl Bitbucket for repositories. -Not entirely tested for proper functionality. -The Bitbucket framework is not yet accounted for in 'indexer._generate_file_url()'.	10 years ago
Severyn Kozak	93ed68645d	Add partially integrated BitbucketCrawler(). Add: bitshift/crawler/ __init__.py -Initialize 'BitbucketCrawler()' singleton. -Instantiate all thread instances on-the-fly in a 'threads' array, as opposed to individual named variables. crawler.py -Add 'BitbucketCrawler()', to crawl Bitbucket for repositories. -Not entirely tested for proper functionality. -The Bitbucket framework is not yet accounted for in 'indexer._generate_file_url()'.	10 years ago
Severyn Kozak	6718650a8c	First part of #8 fix. Add: bitshift/crawler/indexer.py -Add 'pkill git' to the 'git clone' subprocess in '_clone_repository()', to kill hanging remotes -- it's un-Pythonic, but, thus far, the only method that's proved successful. The RAM problem still persists; the latest dry-run lasted 01:11:00 before terminating due to a lack of allocatable memory. -Add exception names to `logging` messages. bitshift/assets -Update 'tag()' docstring to current 'bitshift' standards (add a ':type' and ':rtype:' field).	10 years ago
Severyn Kozak	3ce399adbf	Add threaded cloner, GitRepository class (#7). Add: bitshift/crawler/ (crawler, indexer).py -add a 'time.sleep()' call whenever a thread is blocking on items in a Queue, to prevent excessive polling (which hogs system resources). indexer.py -move 'git clone' functionality from the 'GitIndexer' singleton to a separate, threaded '_GitCloner'. -'crawler.GitHubCrawler' now shares a "clone" queue with '_GitCloner', which shares an "index" queue with 'GitIndexer'. -both indexing and cloning are time-intensive processes, so this improvement should (hypothetically) boost performance. -add `GitRepository` class, instances of which are passed around in the queues.	10 years ago
Severyn Kozak	755dce6ae3	Add logging to crawler/indexer. Add: bitshift/crawler/(__init__, crawler, indexer).py -add `logging` module to all `bitshift.crawler` modules, for some basic diagnostic output.	10 years ago
Severyn Kozak	f4b28e6178	Add file-ext regex rules, exception handlers. Add: bitshift/crawler/indexer.py -add two `try: except: pass` blocks, one to _decode() and another to GitIndexer.run(); bad practice, but GitIndexer has numerous unreliable moving parts that can throw too many unforeseeable exceptions. Only current viable option. -add file-extension regex ignore rules (for text, markdown, etc. files) to _get_tracked_files().	10 years ago
Severyn Kozak	627c848f20	Add tested indexer. Add: bitshift/crawler/indexer.py -add _debug(). -add content to the module docstring; add documentation to GitIndexer, and the functions that were lacking it. -add another perl one-liner to supplement the `git clone` subprocess call, which terminates it after a set amount of time (should it have frozen) -- fixes a major bug that caused the entire indexer to hang.	10 years ago
Severyn Kozak	b680756f8d	Test crawler, complete documentation. Add, Fix: bitshift/crawler/ __init__.py -add module and crawl() docstrings. -add repository_queue size limit. crawler.py -account for time spent executing an API query in the run() loop sleep() interval.	10 years ago
Severyn Kozak	b7ccec0501	Add untested threaded indexer/crawler prototype. Additions are not tested and not yet documented. Add: crawler.py -add threaded GitHubCrawler class, which interacts with a GitIndexer via a Queue. git_indexer.py -add threaded GitIndexer class, which interacts with GitHubCrawler via a Queue. -rename context-manager ChangeDir class to _ChangeDir, because it's essentially "private". __init__.py -add body to crawl(), which creates instances of GitHubCrawler and GitIndexer and starts them.	10 years ago
Severyn Kozak	97198ee523	Update Crawler documentation. Add: bitshift/crawler/git_indexer.py -add some missing docstrings, complete others.	10 years ago
Severyn Kozak	c655d97f48	Add class ChangeDir, amend unsafe subprocess. Add: bitshift/crawler/git_indexer.py -add ChangeDir class, a context-management wrapper for os.chdir(). -replace unsafe "rm -rf" subprocess call with shutil.rmtree()	10 years ago
Severyn Kozak	9fc4598001	Clean up crawler/, fix minor bugs. Add: bitshift/codelet.py -add name field to Codelet. bitshift/crawler/crawler.py -fix previously defunct code (which was committed at a point of incompletion) -- incorrect dictionary keys, etc. -reformat some function calls' argument alignment to fit PEP standards. bitshift/crawler.py -add sleep() to ensure that an API query is made at regular intervals (determined by the GitHub API limit).	10 years ago
Severyn Kozak	77b448c3de	Mod Codelet, mov codelet creation from crawler. Add: bitshift/crawler/(crawler, git_indexer).py -move Codelet creation from the crawler to the git_indexer, in preparation for making crawling/indexing independent, threaded processes. Mod: bitshift/codelet.py -modify documentation for the author instance variable.	10 years ago
Severyn Kozak	ef9c0609fe	Mov author_files > git_indexer, heavily refactor. Add: bitshift/crawler/crawler.py -add base crawler module -add github(), to index GitHub. Mod: bitshift/crawler/ -add package subdirectory for the crawler module, and any subsidiary modules (eg, git_indexer). bitshift/author_files.py > bitshift/crawler/git_indexer.py -rename the module to "git_indexer", to better reflect its use. -convert from stand-alone script to a module whose functions integrate cleanly with the rest of the application. -add all necessary, tested functions, with Sphinx documentation.	10 years ago
Severyn Kozak	ef73c04347	Add prototype repo-indexer script author_files.py. Add: author_files.py -add prototype script to output metadata about every file in a Git repository: filename, author names, dates of creation and modification. -lacking Sphinx documentation.	10 years ago
Ben Kurtovic	950b6994f0	Database to v5; finish Database.insert().	10 years ago
Ben Kurtovic	d6ccdbd16d	Fix a couple Database bugs.	10 years ago
Ben Kurtovic	d2aef2829e	Finish database insertion, except for origins.	10 years ago
Ben Kurtovic	97b0644bf0	Database to v4: split off symbol_locations table.	10 years ago
Ben Kurtovic	e3a838220c	Flesh out most of Database.insert().	10 years ago
Ben Kurtovic	821a6ae4f1	DB -> v3 for symbol->code assoc vs. ->codelet (fixes #13)	10 years ago
Ben Kurtovic	0b655daaff	Finish migration to v2.	10 years ago
Ben Kurtovic	a5cc3537cb	Credits.	10 years ago
Ben Kurtovic	22d6b62547	Update schema to v2; database updates.	10 years ago
Ben Kurtovic	0d0a74f9df	Some more work on db stuff.	10 years ago
Ben Kurtovic	54bca5894f	Move database stuff to a subpackage; updates.	10 years ago
Ben Kurtovic	ad3de0615f	Fix some typos in the schema.	10 years ago
Ben Kurtovic	fb4e0d5916	FULLTEXT KEYs where appropriate.	10 years ago
Ben Kurtovic	75b243f685	Remove languages table; add indexed field for codelet rank.	10 years ago
Ben Kurtovic	1cbe669c02	More work on db schema; all except FTS indices.	10 years ago
Ben Kurtovic	bc3b9e7587	Some more database design work.	10 years ago
Ben Kurtovic	085fd62704	Database schema, hashing module, some other things.	10 years ago
Ben Kurtovic	962dd9aef5	Docstrings for Database methods; oursql dependency.	10 years ago
Ben Kurtovic	34e629b3cd	Some early work on various query objects.	10 years ago
Severyn Kozak	20b518fccc	Minor refactor of codelet. Add: bitshift/codelet.py -complete docstrings, add filename to Codelet constructor.	10 years ago
Severyn Kozak	6a4ba580ed	Add Codelet, crawler dependencies to setup. Add: bitshift/codelet.py -add Codelet class with constructor. README.md -add SASS stylesheet documentation	10 years ago
Ben Kurtovic	902d734c28	Update __init__.py.	10 years ago
Severyn Kozak	b70e2c961d	Update assets module with template docstring. Mod: bitshift/assets.py -convert existing docstrings to the Sphinx auto-doc format.	10 years ago
Ben Kurtovic	0c68988982	CREATE THE THINGS	10 years ago
Ben Kurtovic	6a9598fe12	Basic setup.py.	10 years ago
Ben Kurtovic	08249e086e	Fix __init__.py and add some info to README.	10 years ago
Ben Kurtovic	6adea4a97e	Adding basic sphinx documentation.	10 years ago
Ben Kurtovic	404a2fb7e3	Fix names in license.	10 years ago
Ben Kurtovic	82147c7b51	Fix description.	10 years ago
Severyn Kozak	6ff65c0906	Merge branch 'master' into develop Conflicts: app.py	10 years ago
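
Commit c655d97f48 above describes a `ChangeDir` context-management wrapper for os.chdir() and the replacement of an "rm -rf" subprocess call with shutil.rmtree(). The repository's actual implementation is not shown on this page, so the following is only a minimal sketch of that pattern. The attribute names, the `remove_clone()` helper, and the `demo()` function are assumptions made for illustration.

```python
import os
import shutil
import tempfile


class ChangeDir:
    """Sketch of a context manager around os.chdir() (names are assumed)."""

    def __init__(self, new_path):
        self.new_path = new_path
        self.old_path = None

    def __enter__(self):
        # Remember where we started so the directory can be restored later.
        self.old_path = os.getcwd()
        os.chdir(self.new_path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the original directory even if the body raised.
        os.chdir(self.old_path)
        return False  # do not swallow exceptions


def remove_clone(path):
    """Hypothetical helper: delete a directory tree without spawning a shell."""
    shutil.rmtree(path, ignore_errors=True)


def demo():
    """Enter a temporary directory, leave it, then delete it."""
    start = os.getcwd()
    tmp = tempfile.mkdtemp()
    with ChangeDir(tmp):
        inside = os.getcwd()
    restored = os.getcwd() == start
    remove_clone(tmp)
    return inside, restored, os.path.exists(tmp)
```

Restoring the directory in `__exit__` keeps a failed operation inside the `with` body from leaving the process in the wrong working directory, and shutil.rmtree() avoids passing a path through a shell command.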


354 Commits (d31c49b1fd5866c33b8529156178571cf07912cf)