We’ve been moving a fairly large site from a proprietary platform onto WordPress. As you probably know, when you move a site from one platform to another, a lot of the URLs will change. In our case, amongst other things, many URLs went from having an .aspx extension to having none. On top of that, some features or pages just don’t exist anymore.
To combat the URL confusion, we proactively monitor and attempt to fix 404s before, during, and for a few months after a move. To do that we have some proprietary tools, some log parsers, and we also use Google Webmaster Tools.
I noticed a new-ish type of error being reported in Webmaster Tools labeled as a Soft 404. I hadn’t heard the term before so I did a bit of digging.
It turns out that a Soft 404 is a response where your site returns an HTTP status code of 200, but Google thinks it should have been a 404. Why does Google think it should be a 404? Well, from what I can tell, they are looking to see if the same page content is duplicated over many URLs (and maybe doing a bit of text processing). This is important to note, because if you have several URLs doing a 301 redirect to the same page, that seems to get flagged as a Soft 404 too.
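I don’t know how Google actually detects these, but the duplicate-content guess can be approximated in a few lines. The sketch below is only my assumption of the idea: it takes crawl results (the `crawl` dictionary is made-up sample data, and `body_fingerprint` and `duplicate_200_groups` are names I invented) and groups URLs that returned 200 with the same normalized body.

```python
import hashlib
import re
from collections import defaultdict


def body_fingerprint(body: str) -> str:
    """Hash a page body after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicate_200_groups(pages):
    """Group URLs that returned 200 with an identical normalized body.

    `pages` maps URL -> (status_code, body). Only groups with more than
    one URL are returned, since a single URL cannot be a duplicate.
    """
    groups = defaultdict(list)
    for url, (status, body) in pages.items():
        if status == 200:
            groups[body_fingerprint(body)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl results, for illustration only.
crawl = {
    "/old-page.aspx": (200, "<h1>Page not found</h1>"),
    "/missing-feature": (200, "<h1>Page  not found</h1>\n"),
    "/about/": (200, "<h1>About us</h1>"),
    "/gone": (404, "<h1>Page not found</h1>"),
}

suspects = duplicate_200_groups(crawl)
print(suspects)
```

Any group this prints is a set of URLs that look like the same page served with a 200, which is where I would start looking for Soft 404 candidates. A real 404 (like `/gone` above) is left out on purpose.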
Ok, so what does this have to do with WordPress?
When you request something that doesn’t exist on a WordPress site, the request eventually ends up loading the 404.php file in the theme directory. Depending on how you have your site configured (the site in question is in a WordPress Network configuration), WordPress can sometimes process the 404 file but still return an HTTP status code of 200! The page looks like an error to the end user, but the HTTP response says everything is fine.
Here is a screenshot of a well-behaved 404 Not Found request on a WordPress site:
Most WordPress installs respond this way, but not all of them. I’ve checked several WordPress sites running the latest version of WordPress, and almost all of them behave correctly. However, this larger one, the one in the Network configuration, seems to respond with 200 instead of 404 when the request reaches the 404.php file.
A simple workaround for this, to stop the bleeding, is to force the status header at the top of the 404.php file. For example:
<?php header('HTTP/1.0 404 Not Found'); /* send the 404 status before any output */ ?>
<?php get_header(); ?>
That will return a 404 HTTP response code along with the file’s contents, which is what the 404 template was meant to do in the first place.
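To confirm the workaround from outside the site, you can request a URL that should not exist and look at the status code that comes back. Here is a minimal sketch using only Python’s standard library. The function names and the probe path are my own inventions, and treating 410 as acceptable is my assumption.

```python
import urllib.error
import urllib.request
import uuid


def fetch_status(url, timeout=10):
    """Return the final HTTP status code for a GET request to `url`.

    urlopen follows redirects, so a redirect to a 200 page reports 200.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx/5xx; the status is on the exception.
        return error.code


def classify_missing_page(status):
    """Classify the status returned for a URL that is known not to exist."""
    if status in (404, 410):
        return "ok"
    if status == 200:
        return "soft 404"
    return "unexpected"


def check_site(base_url):
    """Request a random, almost certainly nonexistent path under `base_url`."""
    probe = f"{base_url.rstrip('/')}/soft-404-probe-{uuid.uuid4().hex}"
    return classify_missing_page(fetch_status(probe))


# No network needed to see how the classification behaves:
print(classify_missing_page(404))  # ok
print(classify_missing_page(200))  # soft 404
```

Calling `check_site()` with your own site’s base URL should come back as "ok" once the header fix is in place. On a Network install it is probably worth running against each site in the network, since in my case only one of them misbehaved.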
I think there is a second class of Soft 404s that we ran into as well. These seem to be caused by 301 redirects and changed content. For example, on the old system there were separate contact pages, but on the new site there is only one contact page. To try to keep a nice user experience, we added a rewrite rule like the following:
RewriteRule ^(.*)/contact\.aspx$ /contact-us/ [R=301,L]
This ends up redirecting several different URLs to one redesigned page. From the user’s point of view, this is what they were after: all the old links work, and the user just sees a new page with a new URL. However, it seems the redirecting pages might get marked as Soft 404s.
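To see which old URLs collapse onto the same target, the rule above can be replayed against a list of old paths. This is a rough sketch, not how mod_rewrite itself works. The regex is my Python translation of the pattern, and the sample paths are hypothetical and written without a leading slash on the assumption that the rule lives in an .htaccess file.

```python
import re
from collections import defaultdict

# Python translation of the rewrite pattern; the dot is escaped so it
# matches a literal "." only.
CONTACT_RULE = re.compile(r"^(.*)/contact\.aspx$")
CONTACT_TARGET = "/contact-us/"


def redirect_target(path):
    """Return the 301 target for `path`, or None if the rule does not match."""
    return CONTACT_TARGET if CONTACT_RULE.match(path) else None


def many_to_one(paths):
    """Map each redirect target to its old paths, keeping shared targets only."""
    by_target = defaultdict(list)
    for path in paths:
        target = redirect_target(path)
        if target is not None:
            by_target[target].append(path)
    return {t: p for t, p in by_target.items() if len(p) > 1}


# Hypothetical old URLs, for illustration only.
old_paths = ["sales/contact.aspx", "support/contact.aspx", "about/team.aspx"]
shared = many_to_one(old_paths)
print(shared)
```

Every key in the result is a page that more than one old URL redirects to, which is the group I would compare against the Soft 404 report first.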
I am just guessing, but I think that to clear Google’s Soft 404s for these types of pages, Google wants no 301 redirect at all. Instead, the old URL would return an HTTP 404 and display the actual 404 page, which could include a hyperlink that says “Did you mean [new_url]?”. In other words, I think they only want 301 redirects that have a one-to-one relationship (in an effort to avoid duplicate content?).
Take this last bit with a grain of salt. It is just speculation, as I am working through a large number of Soft 404s and I have absolutely no ties with Google.
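For what it’s worth, here is roughly what that speculative alternative would look like as a response: a 404 status, no Location header, and a body that links to the new page. This is only a sketch of my guess, the `suggestion_404` helper is a name I made up, and nothing here comes from Google.

```python
from html import escape


def suggestion_404(new_url):
    """Build a (status, headers, body) triple for a 404 that links to `new_url`."""
    safe_url = escape(new_url, quote=True)
    body = (
        "<h1>Page not found</h1>"
        f'<p>Did you mean <a href="{safe_url}">{safe_url}</a>?</p>'
    )
    # Deliberately no Location header: the client is told the page is gone
    # and is offered a link instead of being redirected.
    headers = {"Content-Type": "text/html; charset=utf-8"}
    return 404, headers, body


status, headers, body = suggestion_404("/contact-us/")
print(status)
print(body)
```

The trade-off is obvious: visitors following an old link have to click once more instead of landing on the new page automatically.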
I hope my guess is wrong and that this is just a reporting error, because it doesn’t match the documented usage of a 301 (granted, this is just Wikipedia):
The HTTP response status code 301 Moved Permanently is used for permanent redirection, meaning current links or records using the URL that the 301 Moved Permanently response is received for should be updated to the new URL provided in the Location field of the response. This status code should be used with the location header. RFC 2616 states that:
- If a client has link-editing capabilities, it should update all references to the Request URI.
- The response is cachable.
- Unless the request method was HEAD, the entity should contain a small hypertext note with a hyperlink to the new URI(s).
- If the 301 status code is received in response to a request of any type other than GET or HEAD, the client must ask the user before redirecting.
I am far from a scholar, but the way I read that, a 301 doesn’t need to be in a one-to-one relationship with the content. So Webmaster Tools may just be reporting some oddities in the Soft 404s section.
So, long way around, if you have some WordPress sites, double check that your 404 pages really are returning 404s. And you might want to keep an eye on the Soft 404s section in Webmaster Tools and compare the URLs listed there to your 301 redirects to see if that’s where the errors are coming from. I don’t know if these things impact your SEO or rankings, but it’s often nice to straighten things up.
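That comparison is easy to script once you have both lists. The sketch below assumes you have the reported Soft 404 URLs and your redirect sources as plain lists of paths. I am not relying on any particular Webmaster Tools export format, and the sample data is invented.

```python
def split_soft_404s(soft_404_urls, redirect_sources):
    """Split reported Soft 404 URLs by whether they are 301 redirect sources."""
    sources = set(redirect_sources)
    reported = set(soft_404_urls)
    redirected = sorted(url for url in reported if url in sources)
    other = sorted(url for url in reported if url not in sources)
    return redirected, other


# Hypothetical data, for illustration only.
reported = ["/sales/contact.aspx", "/support/contact.aspx", "/old-feature"]
redirects = ["/sales/contact.aspx", "/support/contact.aspx", "/news.aspx"]

redirected, other = split_soft_404s(reported, redirects)
print("explained by 301 redirects:", redirected)
print("check the 404 template for:", other)
```

URLs in the first list point at the redirect theory. URLs in the second list are more likely pages where the 404 template is being served with a 200.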