Search engines use shingles (contiguous runs of words in “exact match” form) and shingle analysis to extract block-level contextual references and assemble the content of a web page. You hear the warnings about duplicate content with regard to SEO, but what does it really mean?
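To make the idea concrete, here is a minimal sketch of w-shingling: slide a fixed-size window across the text and collect each run of consecutive words as one shingle. The window size of 4 is an illustrative assumption, not a documented search engine parameter.

```python
# A minimal w-shingling sketch: every run of w consecutive words
# becomes one "exact match" shingle. Search engines compare these
# sets across pages to spot copied blocks.

def shingles(text, w=4):
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

sample = "the quick brown fox jumps over the lazy dog"
print(sorted(shingles(sample)))
```

A nine-word sentence with a four-word window yields six shingles; any page that reuses the sentence verbatim reproduces all six.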
Search engine etiquette mirrors human behavior in many facets: link popularity emulates cliques, third-party referrals (a.k.a. endorsed editorial links) emulate authority, and expertise and reputation are emulated by trust rank. Variables such as link popularity and trust sculpt why, where and how billions of websites rank in a search engine’s index.
Just like you would be penalized if a teacher caught you cheating or copying from others, search engines are algorithmically inclined to seek out the source of a shingle (to eliminate duplicates from their index).
There are two types of duplicate content for a website: (1) server headers that are not properly configured, so each page in your site is available from both the http:// prefix and the http://www. prefix, which means that each page is a potential replica of the other (remedied by an .htaccess rewrite rule), or (2) duplicate shingles, “exact match” segments of words, spread across (a) your own pages or (b) multiple websites.
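The non-www/www duplication in case (1) is typically fixed with a 301 (permanent) redirect so every page resolves to a single canonical hostname. A sketch for Apache with mod_rewrite enabled, with example.com standing in for your own domain:

```apache
# Permanently redirect all non-www requests to the www hostname,
# so each page has exactly one canonical URL.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The same pattern works in reverse if you prefer the bare domain as canonical; what matters is picking one and redirecting the other.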
Regardless of whether the shingles are smattered across your own website or multiple sites online, your pages can incur a penalty and potentially cancel each other out. From the standpoint of search engine storage (which costs money), let’s face it, a copy is really not that original.
From their perspective (if spiders have one), search engines keep the original sources intact as the authority, and the parrots that scrape or attempt to spin articles ultimately leave a trail unless the segments are sufficiently scrambled and rewritten. Often, simply writing unique content would have been less effort.
Scraping content for reproduction from topical sources via RSS feeds is a very common tactic for building automated MFA (made-for-AdSense) sites. There is nothing more annoying than seeing a post or article that took the author hours to create end up on some “no name site” under a “fictitious writer” who is now competing with your website for the very same keywords, tags and titles you just meticulously crafted with care.
Whatever the reason webmasters copy or scrape content, whether (1) they are just plain lazy, (2) to build up topical sites so they can implement a 301 redirect to their “real money maker website” after the site gains PageRank, (3) for AdSense or affiliate revenues, or (4) for search engine rankings based on the labors and content of others, the impact of duplicate content ripples across the web, keeping spiders busy repressing blatant plagiarism.
The more sophisticated and savvy search engines like Google can smell a scraped shingle a mile away and make adjustments to ensure that the original source gets the credit and the duplicate simply recedes into the penalty zone where its semantic currency is capped and quarantined.
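One standard textbook measure of this kind of overlap (an illustration of shingle-based duplicate detection, not Google’s actual algorithm) is the Jaccard similarity of two pages’ shingle sets: the size of the intersection divided by the size of the union. Near-duplicates score close to 1.0; unrelated pages score near 0.0. The window size of 3 is an assumption for the sketch.

```python
# Compare two texts by the Jaccard similarity of their word shingles.
# A scraped or lightly "spun" copy shares most shingles with the
# original and scores high; genuinely unique content scores low.

def shingles(text, w=3):
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

original = "search engines reward original content with better rankings"
scraped  = "search engines reward original content with better visibility"
print(round(jaccard(original, scraped), 2))
```

Swapping a single word, as a lazy scraper might, still leaves most shingles intact, which is exactly the trail the article describes.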
When it comes down to it, you are better off avoiding penalties (large or small) to streamline the velocity of your website as it moves through the evaluation process. Just like barnacles impede a ship’s trajectory, duplicate content impedes your relevance score unless your site is the original source.
What this means in layman’s terms is, if you are using article marketing as a tactic for building links or driving traffic to your pages, make sure the content is original. Believe me, search engines know, so in closing, just write your own.
If you are seeking solutions to quell plagiarism, services such as Copyscape exist that will alert you if your content is being cannibalized or blatantly scraped. Or, at any time, just grab a segment of your own content, put “quotes around it” in a Google search box and hit return to see what is retrieved. If your site is the only one using the exact-match formation of the snippet, then your content stands as the original.
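That quoted-search check is easy to script. A sketch using only the Python standard library: wrap a snippet of your copy in quotes and URL-encode it into a Google search URL you can open in a browser (the helper name is mine, and this builds the URL rather than fetching results).

```python
# Build an exact-match Google search URL for a snippet of your own
# content. Opening it in a browser lists every indexed page that
# contains the phrase verbatim.
from urllib.parse import quote_plus

def exact_match_query(snippet):
    return "https://www.google.com/search?q=" + quote_plus(f'"{snippet}"')

print(exact_match_query("duplicate content impedes your relevance score"))
```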
You can also use Google Blog Search and set alerts for titles or specific content “in quotes” to get email alerts as they happen, and use Domain Tools to find the phone number of the offending webmaster in question (to give them a ring when they least expect it).
So, now if someone asks you about block-level shingles or shingle analysis, you know that in the context of search engines, they are not referring to a roof.