According to Google Search Console, “Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.”
Technically, duplicate content may or may not be penalized, but it can still impact search engine rankings. When multiple pieces of what Google calls “appreciably similar” content appear in more than one location on the Internet, search engines have difficulty deciding which version is most relevant to a given search query.
Why does duplicate content matter to search engines? Because it creates three main issues for them:
- They don’t know which version to include or exclude from their indices.
- They don’t know whether to direct the link metrics (trust, authority, anchor text, etc.) to one page, or keep them separated between multiple versions.
- They don’t know which version to rank for query results.
When duplicate content is present, site owners can suffer rankings and traffic losses. These losses often stem from two problems:
- To provide the best search query experience, search engines will rarely show multiple versions of the same content, and thus are forced to choose which version is most likely to be the best result. This dilutes the visibility of each of the duplicates.
- Link equity can be further diluted because other sites have to choose between the duplicates as well. Instead of all inbound links pointing to one piece of content, they link to multiple pieces, spreading the link equity among the duplicates. Because inbound links are a ranking factor, this can then impact the search visibility of a piece of content.
The eventual result is that a piece of content will not achieve the desired search visibility it otherwise would.
Scraped or copied content refers to content scrapers (websites using software tools) that steal your content for their own blogs. This includes not only blog posts and editorial content, but also product information pages. Scrapers republishing your blog content on their own sites may be the more familiar source of duplicate content, but there’s a common problem for e-commerce sites as well: product descriptions. If many different websites sell the same items, and they all use the manufacturer’s descriptions of those items, identical content winds up in multiple locations across the web. Such duplicate content is not penalized.
How to fix duplicate content issues? This all comes down to the same central idea: specifying which of the duplicates is the “correct” one.
Whenever content on a site can be found at multiple URLs, it should be canonicalized for search engines. Let’s go over the main ways to do this: using a 301 redirect to the correct URL, using the rel=canonical attribute, using the meta robots noindex tag, or using the parameter handling tool in Google Search Console.
301 redirect: In many cases, the best way to combat duplicate content is to set up a 301 redirect from the “duplicate” page to the original content page.
When multiple pages with the potential to rank well are combined into a single page, they not only stop competing with one another; they also create a stronger relevancy and popularity signal overall. This will positively impact the “correct” page’s ability to rank well.
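As an illustrative sketch only (assuming an Apache server with mod_rewrite enabled; the paths and domain are hypothetical), a 301 redirect from a duplicate URL to the original, plus a site-wide non-www to www redirect, might look like this in an .htaccess file:

```apache
# .htaccess — hypothetical example, adjust paths and domain to your site
RewriteEngine On

# Permanently redirect a duplicate page to the original content page
RewriteRule ^old-duplicate-page/?$ /original-page/ [R=301,L]

# Permanently redirect the bare domain to the www version
RewriteCond %{HTTP_HOST} ^yoursite\.com$ [NC]
RewriteRule ^(.*)$ http://www.yoursite.com/$1 [R=301,L]
```

The R=301 flag tells browsers and crawlers the move is permanent, which is what signals search engines to consolidate ranking signals onto the target URL.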
Rel=”canonical”: Another option for dealing with duplicate content is to use the rel=canonical attribute. This tells search engines that a given page should be treated as though it were a copy of a specified URL, and all of the links, content metrics, and “ranking power” that search engines apply to this page should actually be credited to the specified URL.
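As a minimal sketch (the URL shown is a hypothetical example), the canonical tag is placed in the HTML head of the duplicate page and points at the preferred URL:

```html
<!-- In the <head> of the duplicate page; href is a hypothetical example URL -->
<link rel="canonical" href="http://www.yoursite.com/original-page/" />
```

Unlike a 301 redirect, visitors can still load the duplicate page; only search engines are told to credit the specified URL.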
Meta Robots Noindex: One meta tag that can be particularly useful in dealing with duplicate content is the meta robots tag, used with the values “noindex, follow.” Commonly called meta noindex,follow, this tag can be added to the HTML head of each individual page that should be excluded from a search engine’s index.
The meta robots tag allows search engines to crawl the links on a page but keeps them from including those links in their indices. It’s important that the duplicate page can still be crawled, even though you’re telling Google not to index it, because Google explicitly cautions against restricting crawl access to duplicate content on your website. (Search engines like to be able to see everything in case you’ve made an error in your code. It allows them to make a [likely automated] “judgment call” in otherwise ambiguous situations.) Using meta robots is a particularly good solution for duplicate content issues related to pagination.
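A minimal sketch of the tag as it would appear on a page to be excluded from the index (for example, a paginated archive page):

```html
<!-- In the <head> of each page to exclude from the index -->
<!-- "noindex" keeps the page out of the index; "follow" still lets crawlers follow its links -->
<meta name="robots" content="noindex,follow">
```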
Google Search Console allows you to set the preferred domain of your site (e.g. yoursite.com instead of www.yoursite.com) and specify whether Googlebot should crawl various URL parameters differently (parameter handling).
The main drawback to using parameter handling as your primary method for dealing with duplicate content is that the changes you make only work for Google. Any rules put in place using Google Search Console will not affect how Bing or any other search engine’s crawlers interpret your site; you’ll need to use the webmaster tools for other search engines in addition to adjusting the settings in Search Console.
While not all scrapers will port over the full HTML code of their source material, some will. For those that do, the self-referential rel=canonical tag will ensure your site’s version gets credit as the “original” piece of content.
Duplicate content is fixable, and it should be fixed; the rewards are worth the effort. Getting rid of duplicate content on your site, combined with a concerted effort to create quality content, will result in better rankings.