In my previous post, Highlighting Duplicate Sentences with PHP, I described a method for highlighting any sentences that appear multiple times within a string. Now it’s time to explore highlighting duplicate phrases. This tutorial assumes that you have a basic understanding of PHP, HTML, and CSS.
While the difference between a phrase and a sentence may seem minimal, the distinction actually adds another dimension of complexity. First of all, the definition of a phrase is not as concrete as the definition of a sentence. For our purposes, we will define a phrase as 3 to 10 consecutive words. As we will see, the smaller the range of possible phrase sizes, the faster the algorithm will run. Another complication is that phrases, unlike sentences, can overlap. Also, a phrase cannot span a period. For instance, the string “My name is Asa. I like bikes” is not a single seven-word phrase. Because of the period, “My name is Asa” and “I like bikes” are separate phrases.
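To make the definition concrete, here is a minimal sketch that lists every candidate phrase in a string. It is only an illustration of the rules above, not the final algorithm: the function name candidatePhrases and its parameters are my own assumptions, and it treats a period as the only sentence boundary.

```php
<?php
// Hypothetical helper (not part of the final script): list every candidate
// phrase of $min to $max consecutive words, never crossing a period.
function candidatePhrases(string $text, int $min = 3, int $max = 10): array
{
    $phrases = [];
    // Split on periods first so that no phrase can span two sentences.
    foreach (explode('.', $text) as $sentence) {
        // Break the sentence into words on runs of whitespace.
        $words = preg_split('/\s+/', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);
        $count = count($words);
        // Every start position and every allowed length yields one phrase,
        // which is why phrases overlap.
        for ($start = 0; $start < $count; $start++) {
            for ($len = $min; $len <= $max && $start + $len <= $count; $len++) {
                $phrases[] = implode(' ', array_slice($words, $start, $len));
            }
        }
    }
    return $phrases;
}
```

Calling candidatePhrases('My name is Asa. I like bikes') returns four phrases: “My name is”, “My name is Asa”, “name is Asa”, and “I like bikes”. The nested loops also show why a narrower range of phrase sizes means less work: each extra allowed length adds another pass over every start position.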
It has come to my attention that there is some interest in a script that can find repeated sentences within some text, so I decided to whip up some code and write a little tutorial. My solution is probably not the most efficient, but it should work fine for most online applications. I tested it on the United States Declaration of Independence, a roughly 1,300-word document, and there was no noticeable delay in loading the page.
Feel free to check out the demo and download the source code.
The basic idea behind my solution is to split the given string into an array of sentences. Then we loop through the array, find the repeated sentences, and add some tags to highlight the repeated sentences. The biggest issues are remembering which sentences are repeated and inserting the tags. I handled these issues by adding span tags to the sentence strings immediately after recognizing a duplicate. Once a duplicate has been highlighted, the script goes back to the first occurrence of the sentence and highlights the appropriate string in the array.
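The idea described above can be sketched roughly as follows. This is a simplified illustration rather than the downloadable source: the function name highlightDuplicateSentences and the "duplicate" CSS class are assumptions, and instead of editing the strings as soon as a duplicate is found, the sketch records which array indexes are duplicates and adds the span tags in a second pass.

```php
<?php
// A sketch under stated assumptions, not the original script.
// The function name and the "duplicate" class are hypothetical.
function highlightDuplicateSentences(string $text): string
{
    // Split after sentence-ending punctuation, keeping the punctuation.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $firstSeen = [];   // lowercased sentence => index of its first occurrence
    $isDuplicate = []; // index => true for every sentence to highlight

    foreach ($sentences as $i => $sentence) {
        $key = strtolower($sentence);
        if (isset($firstSeen[$key])) {
            // Mark this repeat, then go back and mark the first occurrence.
            $isDuplicate[$i] = true;
            $isDuplicate[$firstSeen[$key]] = true;
        } else {
            $firstSeen[$key] = $i;
        }
    }

    // Second pass: escape each sentence and wrap the marked ones in a span.
    foreach ($sentences as $i => $sentence) {
        $safe = htmlspecialchars($sentence);
        $sentences[$i] = isset($isDuplicate[$i])
            ? '<span class="duplicate">' . $safe . '</span>'
            : $safe;
    }
    return implode(' ', $sentences);
}
```

Keeping the index of the first occurrence is what makes it possible to go back and highlight it once a repeat turns up. A CSS rule for the span (for example, a background color on .duplicate) then does the actual highlighting.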