in Search Engine Optimization by Jeffrey_Smith

Avoid Duplicate Content for SEO

Search engines use shingles (groups of content or clusters of words in “exact match” form) and shingle analysis to extract block-level contextual references and assemble the content of a web page. You hear the warnings about duplicate content with regard to SEO, but what does it really mean?
Duplicate Content and Shingle Analysis for SEO, by SEO Design Solutions.
Search engine etiquette in many facets mirrors human behavior. For example, link popularity emulates cliques and third-party referrals (a.k.a. endorsed editorial links), while authority, expertise and reputation are emulated by trust rank. Variables such as link popularity and trust sculpt why, where and how billions of websites rank in a search engine’s index.

Just like you would be penalized if a teacher caught you cheating or copying from others, search engines are algorithmically inclined to seek out the source of a shingle (to eliminate duplicates from their index).
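The shingle idea above can be sketched in a few lines of Python. This is an illustrative simplification under simple assumptions (whitespace tokenization, word-level shingles); real engines use hashed shingles and block-level weighting, and this is not Google’s actual algorithm. The sample sentences are hypothetical:

```python
def shingles(text, k=4):
    """Return the set of k-word shingles ("exact match" word groups) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def jaccard(a, b):
    """Jaccard similarity of two texts' shingle sets: 1.0 means identical shingles."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "search engines use shingle analysis to detect duplicate content"
scraped = "search engines use shingle analysis to find copied content"
print(jaccard(original, original))           # 1.0 -- an exact copy shares every shingle
print(round(jaccard(original, scraped), 2))  # 0.33 -- a light rewrite still overlaps heavily
```

Notice that even a partial rewrite leaves a third of the shingles intact, which is the “trail” scrapers and article spinners leave behind.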

There are two types of duplicate content for a website: (1) server-configuration duplicates, where your server headers are not properly configured and each page on your site is available from both the http:// prefix and the http://www. prefix, making every page a potential replica of the other (remedied by an .htaccess rewrite preference); or (2) duplicate shingles, “exact match” segments of words spread across (a) your own pages or (b) multiple websites.
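As a sketch of remedy (1), a common .htaccess fix is a permanent (301) redirect that collapses the two hostnames into one canonical version. This assumes Apache with mod_rewrite enabled, and example.com is a placeholder for your own domain:

```apache
# Redirect every non-www request to the www hostname with a 301 redirect,
# so each page resolves at exactly one canonical URL.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The same pattern works in reverse if you prefer the non-www hostname; the point is simply to pick one and redirect the other.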

Regardless of whether the shingles are smattered across your own website or across multiple sites online, your pages can essentially invoke a penalty and potentially cancel each other out. From the standpoint of storage, which costs search engines money, let’s face it: a copy is really not that original.

From their perspective (if spiders have one), search engines keep the original source intact as the authority, and the parrots that scrape or attempt to spin articles ultimately leave a trail unless the segments are sufficiently scrambled and rewritten. Often, simply writing unique content would have been less effort.

Scraping content for reproduction from topical sources via RSS feeds is a very common tactic for building automated MFA (made-for-AdSense) sites. There is nothing more annoying than seeing a post or article that took the author hours to create end up on some “no name site” under a “fictitious writer” who is now competing with your website for the very same keywords, tags and titles you just meticulously crafted with care.

Whatever the reason webmasters copy or scrape content, whether (1) they are just plain lazy, (2) they want to build up topical sites so they can implement a 301 redirect to their “real money maker” website after it gains PageRank, (3) they are after AdSense or affiliate revenues, or (4) they want search engine rankings based on the labors and content of others, the impact of duplicate content ripples across the web, keeping spiders busy repressing blatant plagiarism.

The more sophisticated and savvy search engines like Google can smell a scraped shingle a mile away and make adjustments to ensure that the original source gets the credit and the duplicate simply recedes into the penalty zone where its semantic currency is capped and quarantined.

When it comes down to it, you are better off avoiding penalties (large or small) to streamline the velocity of your website as it moves through the evaluation process. Just like barnacles impede a ship’s progress, duplicate content impedes your relevance score unless your site is the original source.

What this means in layman’s terms is that if you are using article marketing as a tactic for building links or driving traffic to your pages, make sure the content is original. Believe me, search engines know, so in closing, just write your own.

If you are seeking solutions to quell plagiarism, services such as Copyscape will alert you if your content is being cannibalized or blatantly scraped. Or, at any time, just grab a segment of your own content, put “quotes around it” in a Google search box and hit return to see what is retrieved. If your site is the only one using the exact-match formation of the snippet, then your content is cited as the original.
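The quoted-search check can also be scripted. A minimal sketch using only the Python standard library (the query-string form is the standard Google search URL; the snippet text is hypothetical):

```python
from urllib.parse import quote_plus

def exact_match_query(snippet):
    """Build a Google search URL wrapping snippet in exact-match quotes."""
    return "https://www.google.com/search?q=" + quote_plus('"' + snippet + '"')

url = exact_match_query("duplicate content impedes your relevance score")
print(url)  # the %22...%22 in the query string are the URL-encoded quotes
```

Opening that URL in a browser performs the same exact-match search described above.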

You can also use Google Blog Search and set alerts for titles or specific content “in quotes” to receive email alerts as they happen, and use Domain Tools to find the phone number of the offending webmaster in question (to give them a ring when they least expect it).

So, now if someone asks you about block-level shingles or shingle analysis, you know that in the context of search engines, they are not referring to a roof.


About Jeffrey_Smith

In 2006, Jeffrey Smith founded SEO Design Solutions (an SEO provider that now develops SEO software for WordPress).

Jeffrey has actively been involved in internet marketing since 1995 and brings a wealth of collective experiences and marketing strategies to increase rankings, revenue and reach.

35 thoughts on “Duplicate Content and Shingle Analysis for SEO”
  1. Michael Mok says:

While this makes sense for web content, it does not work for shopping sites. E.g., our site sells the same game as 20 other shopping sites. Being the same game means the name and description are the same. We are currently penalised while other sites with the same content stay in front.

2. On the contrary, Michael, it does apply to your site. The reason is duplicate content, which essentially means that you need to add additional content to shift the keyword density of your shingles.

    If possible edit the order to keep the context and meaning, but shift the actual words (rewrite it)…

    Then, use static pages (such as a sitemap) or blog post with deep links to link to the page with preferred anchor text.

That should get the page reindexed, and if you add at least 5-10 links per page (if possible), with unique content, you will see a complete correction of how your site ranks.

3. I used to publish my articles, but now I wonder whether I should stop doing this because of the risk of a duplicate content penalty. Should I stop publishing my articles on article directories?

4. Sometimes I wonder about going too far with emphasizing my keywords… a lot of my external links will use 2 keyterms, where term A is always the same and term B is different. Do you think that model would be flagged?

5. It can be; better to add a modifier or plural variation to term B just to be safe…

    Thanks for visiting

  6. Alec Difrawi says:

I completely agree. Those who steal content may not get caught immediately, but in the long run those who copy others’ articles will be penalized, leaving the original content ranking at the top.

7. I agree with the duplicate content point, but what if your site is based on embedding videos from a YouTube account you own? Do you get penalized by search engines for using the videos you upload as content? Also, since bots can’t crawl Flash files, they have to rank them based on the page’s rich text, correct?

Comments are closed.