Duplicate content in search engines
Opinions vary on how Google and other search engines deal with duplicate content. Google’s own comments on the matter give an indication of their approach, and the main take-home points are:
- They can usually spot unintentional (i.e. non-malicious) duplicate content
- In the above case, Google will simply index the content it feels is most relevant to the query that’s been typed in (so the analysis/decision is done on a per-query basis), and will give users the option to see similar-looking results from that site.
- In the case of webmasters trying to game search results by keyword/content-stuffing, they’ll spot that too – and will penalise accordingly.
This is good news and everything, but it does rather raise the question: How can Google tell if you’re malicious or not? And what if Google misunderstands your content and accidentally tags you as a spammer?
If you’re new to web copywriting or content management, duplication perhaps doesn’t sound like an issue… but from a search perspective, it can be. Why could you be penalised? How do you get round it? What are the options? Read on and find out…
Why does Google penalise for duplicate content?
In a nutshell, Google gets unhappy about malicious duplicate content because – in its quest for relevancy and for providing people with what it thinks are the best results – the search algorithm has been refined to ignore, and in fact penalise, sites which repeat content. It sounds obvious, and it’s a fallout from the bad old days of search optimisation, when the early algorithm looked for keyword density on pages.
During this period, webmasters would look to game the results by simply stuffing sections of their sites with vast amounts of repeated spammy bullshit content, in an effort to convince Google it was the most relevant for a particular set of search terms. Google and other engines are now wise to it.
However, this could have the unwanted effect of penalising entirely legit companies who have repeated lots of content without realising it. It’s not definite… but it could happen.
The financial services sector is one industry that could easily fall foul of this, due to the kind of content it publishes regarding investments and insurance, and the weighty legal pages firms are required to make publicly available for each product/service. Other sites may have ‘printer-friendly’ pages that could be considered copies of other content.
How do you get round it?
The primary weapon against duplication is the ‘canonical’ link tag. If you’ve got lots of pages of identical (or near-identical) content, it tells search engines which page you want them to consider the ‘original’. The code looks like this:
<link rel="canonical" href="[your canonical url here]" />
From Greenlight’s excellent “Is Duplicate Content A Thing Of The Past” whitepaper PDF:
Like all <link> tags this sits in the <head> of your HTML markup before any of the visible content. When added to a duplicate page this acts as a strong suggestion to search engines to treat the page as if it were 301 redirected (permanently forwarded) to the canonical version. Providing the pages are identical or very similar search engines should index and pass all link equity to the canonical version of the page instead of the duplicate.
It’s recommended that you use absolute rather than relative URLs when specifying your canonical URL, and you can only reference pages on the same domain as a canonical URL. Google supports this tag now, Microsoft will support it in an imminent release and Yahoo! will roll out support over the coming months.
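If you want to check that your duplicate pages actually carry the tag, a small script can do it. The following is only a rough sketch using Python’s standard library: the names `CanonicalFinder`, `find_canonical` and `is_absolute` are made up for illustration, and the example.com page is a hypothetical printer-friendly duplicate, not a real site.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, so split before comparing
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonical(html):
    """Return a list of canonical URLs declared in the given HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


def is_absolute(url):
    """True if the URL has both a scheme and a host (absolute URLs are recommended)."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


# Hypothetical printer-friendly page pointing back at the 'original'
PRINT_PAGE = """
<html><head>
<title>Glossary (printer-friendly)</title>
<link rel="canonical" href="https://www.example.com/glossary" />
</head><body><p>Same text as the main glossary page.</p></body></html>
"""

canonicals = find_canonical(PRINT_PAGE)
```

A page with no entry in `canonicals`, more than one, or a relative URL is worth a second look before you rely on the tag.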
What other options are available to you?
In order to help combat duplication – and to maintain general web hygiene – there are other things you can do to help ensure content is indexed properly.
Use rel=nofollow on links pointing to content you don’t want to be followed. Useful, but can be a pain to maintain if there are a lot of pages/links that need to be restricted.
Use a robots.txt file to indicate which files/folders you do/don’t want to be crawled. This is discouraged by Google in the case of handling duplicate content, though.
Both of the above rely on the search engine obeying the common rules laid down. The main players like Google, Yahoo and Bing will… but that doesn’t mean others will too.
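To get a feel for how a well-behaved crawler reads a robots.txt file, you can test your rules before publishing them. This is a minimal sketch using Python’s built-in `urllib.robotparser`. The `/print/` folder, the example.com URLs and the helper names `build_parser` and `is_crawlable` are assumptions for illustration only, and the result tells you nothing about crawlers that ignore the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all crawlers out of the printer-friendly folder
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string rather than fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_crawlable(parser, url, user_agent="*"):
    """True if a crawler that obeys robots.txt may fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print_page_ok = is_crawlable(parser, "https://www.example.com/print/glossary")
main_page_ok = is_crawlable(parser, "https://www.example.com/glossary")
```

Here `print_page_ok` comes out as `False` and `main_page_ok` as `True`, which is the behaviour you would expect from an engine that plays by the rules.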
Conclusion and other stuff
For me, this also raises other questions regarding:
- Indexation of PDFs, and whether there are penalties for similar-looking content in them. Does anyone know what the deal is here?
- As an addition to that, could it be possible to game results using PDF content?
- Whilst researching some uses of the canonical tag, I spotted that www.lv.com have very similar-looking content in their glossary sections – yet they’ve used the canonical tag to tell Google that a different page (not similar at all in anything other than page name) is the original. And Google doesn’t seem to mind. Bizarre.










