diff --git a/_posts/2014-08-20-copyvio-detector.md b/_posts/2014-08-20-copyvio-detector.md
index c1a80bb..f7194e3 100644
--- a/_posts/2014-08-20-copyvio-detector.md
+++ b/_posts/2014-08-20-copyvio-detector.md
@@ -34,10 +34,10 @@ Sources are fetched and then parsed differently depending on the document type
 handled by [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/)), and
 normalized to a plain text form. We then create multiple [Markov
 chains](https://en.wikipedia.org/wiki/Markov_chain) – the *article
-chain* is built from word trigrams from the article text, and a *source chain*
-is built from each source text. A *delta chain* is created for each source
-chain, representing the intersection of it and the article chain by examining
-which nodes are shared.
+chain* is built from word [5-grams](https://en.wikipedia.org/wiki/N-gram) from
+the article text, and a *source chain* is built from each source text. A *delta
+chain* is created for each source chain, representing the intersection of it
+and the article chain by examining which nodes are shared.
 
 But how do we use these chains to decide whether a violation is present?
 