January 7th, 2013

Learn About Robots.txt with Interactive Examples

One of the things that excites me most about the development of the web is the growth in learning resources. When I went to college in 1998, it was exciting enough to be able to search journals, get access to thousands of dollars-worth of textbooks, and download open source software. These days, technologies like Khan Academy, iTunesU, Treehouse, and Codecademy take that to another level.

I've been particularly excited by the possibilities for interactive learning we see coming out of places like Codecademy. It's obviously most suited to learning things that look like programming languages - where computers are naturally good at interpreting the "answer" - which got me thinking about what bits of online marketing look like that.

The kinds of things that computers are designed to interpret in our marketing world are:

  • Search queries, particularly those that look more like programming constructs than natural-language queries, such as [site:distilled.net -inurl:www]
  • The on-site part of setting up analytics: setting custom variables and events, adding virtual pageviews, modifying e-commerce tracking, and the like
  • Robots.txt syntax and rules
  • HTML constructs like links, meta page information, alt attributes, etc.
  • Skills like Excel formulae that many of us find a critical part of our day-to-day job

I've been gradually building out Codecademy-style interactive learning environments for all of these things for DistilledU, our online training platform, but most of them are only available to paying members. I thought it would make a nice start to 2013 to pull one of these modules out from behind the paywall and give it away to the SEOmoz community. I picked the robots.txt one because our in-app feedback shows that it's one of the modules people learned the most from.

Also, despite years of experience, I discovered some things I didn't know as I wrote this module (particularly about precedence of different rules and the interaction of wildcards with explicit rules). I'm hoping that it'll be useful to many of you as well - beginners and experts alike.

Interactive guide to Robots.txt

Robots.txt is a plain-text file found in the root of a domain (e.g. www.example.com/robots.txt). It is a widely-acknowledged standard and allows webmasters to control all kinds of automated consumption of their site, not just by search engines.

In addition to reading about the protocol, robots.txt is one of the more accessible areas of SEO, since you can access any site's robots.txt. Once you have completed this module, you will find value in making sure you understand the robots.txt files of some large sites (for example, Google and Amazon).
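Before looking at live files from large sites, it may help to see the basic shape of the format. The following is a minimal, hypothetical robots.txt (the paths and user-agent names here are purely illustrative):

```text
# Rules for all crawlers
User-agent: *
Disallow: /admin/
Allow: /admin/public/

# Rules specific to Google's crawler
User-agent: Googlebot
Disallow: /no-google/
```

Each `User-agent` line starts a group of rules, and `Disallow`/`Allow` lines within the group apply to URL paths relative to the root of the domain.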

For each of the following sections, modify the text in the textareas and see them go green when you get the right answer.

Create Website

Web design encompasses many different skills and disciplines in the production and maintenance of websites. The different areas of web design include web graphic design; interface design; authoring, including standardised code and proprietary software; user experience design; and search engine optimization. Often many individuals will work in teams covering different aspects of the design process, although some designers will cover them all.[2] The term web design is normally used to describe the design process relating to the front-end (client side) design of a website, including writing mark up, but this is a grey area as this is also covered by web development. Web designers are expected to have an awareness of usability, and if their role involves creating mark up then they are also expected to be up to date with web accessibility guidelines.

Excel Statistics for SEO and Data Analysis

Everybody has probably already realized that there is almost no data we cannot get. We can get data about our website by using free tools, but we also spend tons of money on paid tools to get even more. Analyzing the competition is just as easy; competitive intelligence tools are everywhere, and we often use Compete or Hitwise. Open Site Explorer is great for getting more data about our own and our competitors' backlink profiles. No matter what information we are trying to get, we can, by spending a fortune or no money at all. My favorite part is that almost every tool has one common feature: the "Export" button. This is the most powerful feature of all these tools, because by exporting the data into Excel we can sort it, filter it, and model it any way we want. Most of us use Excel on a regular basis and are familiar with the basic functions, but Excel can do far more than that. In the following article I will try to present the most common statistical techniques, and the best part is that we don't have to memorize complicated statistical equations: it's all built into Excel!

Statistics is all about collecting, analyzing, and interpreting data. It comes in very handy when decision making faces uncertainty. By using statistics, we can overcome these situations and generate actionable analysis.

Statistics is divided into two major branches: descriptive and inferential.

Descriptive statistics are used when you know all the values in the dataset. For example, you take a survey of 1,000 people asking if they like oranges, with two choices (Yes and No). You collect the results and find out that 900 answered Yes and 100 answered No. The proportions are 90% Yes and 10% No. Pretty simple, right?
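The survey example above can be sketched in a few lines of Python (the numbers are the ones from the text):

```python
# Descriptive statistics: we can observe every value in the dataset.
answers = ["Yes"] * 900 + ["No"] * 100  # 1,000 survey responses

total = len(answers)
yes_share = answers.count("Yes") / total
no_share = answers.count("No") / total

print(yes_share, no_share)  # 0.9 and 0.1, i.e. 90% Yes and 10% No
```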

But what happens when we cannot observe all the data?

When you know only part of your data, then you have to use inferential statistics. Inferential statistics is used when you know only a sample (a small part) of your data and you make guesses about the entire population (data).

Let's say you want to calculate the email open rate for the last 24 months, but you have data only for the last six months. In this case, assume that out of 1,000 emails you had 200 people opening the email, leaving 800 emails that weren't opened. This equates to a 20% open rate and 80% who did not open. This data is true for the last six months, but it might not be true for 24 months. Inferential statistics helps us understand how close we are to the entire population and how confident we can be in this assumption.

The open rate for the sample may be 20%, but it may vary a little. Therefore, let's consider ±3%; in this case, the range is from 17% to 23%. This sounds pretty good, but how confident are we in these data? Put another way, what percentage of random samples taken from the entire population (data set) will fall in the range of 17%-23%?

In statistics, the 95% confidence level is considered to produce reliable data. This means 95% of the samples we take from the entire population will produce an open rate of 17-23%; the other 5% will be either above 23% or below 17%. In other words, we are 95% sure that the open rate is 20% ± 3%.
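The ±3% figure can be approximated with the standard margin-of-error formula for a proportion at the 95% confidence level (z ≈ 1.96). Note that the exact result for n = 1,000 and p = 0.2 comes out closer to ±2.5%, which the ±3% above rounds up; this sketch just shows the mechanics:

```python
import math

n = 1000        # sample size (emails sent)
opened = 200    # emails opened
p = opened / n  # sample open rate: 0.2

# Margin of error for a proportion at the 95% confidence level.
z = 1.96
margin = z * math.sqrt(p * (1 - p) / n)

low, high = p - margin, p + margin
print(f"open rate: {p:.0%} +/- {margin:.1%} ({low:.1%} to {high:.1%})")
```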

The term data stands for any value that describes an object or an event, such as visitors, surveys, or emails.

The term data set has two components: the observation unit (for example, visitors) and the variables, which can represent the demographic characteristics of your visitors, such as age, salary, or education level. Population refers to every member of your group, or in web analytics, all of your visitors. Let's assume 10,000 visitors.

A sample is only a part of your population, based on a date range, visitors who converted, etc., but in statistics the most valuable sample is considered to be a random sample.

The data distribution is given by the frequency with which the values in the data set occur. By plotting the frequencies on a chart, with the range of the values on the horizontal axis and the frequencies on the vertical axis, we obtain the distribution curve. The most commonly used distribution is the normal distribution, or the bell-shaped curve.
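A frequency distribution can be sketched as a simple tally; the values below are made up purely for illustration:

```python
from collections import Counter

# Hypothetical variable: visitor ages from a small sample.
ages = [25, 30, 30, 35, 30, 25, 40, 35, 30, 25]

# Frequency of each value; plotting value vs. frequency gives the distribution curve.
frequencies = Counter(ages)
for value in sorted(frequencies):
    print(value, "x" * frequencies[value])
```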

5 .htaccess File Snippets You Should Have Handy

In the Moz Q&A, there are often questions that are directly asked about, or answered with, a reference to the all-powerful .htaccess file. I've put together a few useful .htaccess snippets which are often helpful. For those who aren't aware, the .htaccess file is a type of config file for the Apache server, which allows you to manipulate and redirect URLs amongst other things.

Everyone will be familiar with tip number four, which is the classic 301 redirect that SEOs have come to know and love. However, the other tips in this list are less common, but are quite useful to know when you need them. After you've read this post, bookmark it, and hopefully it will save you some time in the future.

1) Make URLs SEO-friendly and future-proof

Back when I was more of a developer than an SEO, I built an e-commerce site selling vacations, with a product URL structure:


A nicer URL would probably be:


The second version will allow me to move away from PHP later, is probably better for SEO, and even allows me to add further sub-folders later if I want. However, it isn't realistic to create a new folder for every product or category. Besides, it all normally lives in a database.

Apache identifies files, and decides how to handle them, by their extensions, which we can override on a file-by-file basis:

<Files magic>
ForceType application/x-httpd-php5
</Files>

This will allow the 'magic' file, which is a PHP file without an extension, to then look like a folder and handle the 'inner' folders as parameters. You can test it out here (try changing the folder names inside the magic 'folder'):


2) Apply rel="canonical" to PDFs and images

The SEO community has adopted rel="canonical" quickly, and it is usually kicked around in discussions about IA and canonicalization issues, where before we only had redirects and blocking to solve a problem. It is a handy little tag that goes in the head section of an HTML page.

However, many people still don't know that you can apply rel="canonical" in an alternative way, using HTTP, for cases where there is no HTML to insert a tag into. An often cited example that can be used for applying rel="canonical" to PDFs is to point to an HTML version or to the download page for a PDF document.

An alternative use would be for applying rel="canonical" to image files. This suggestion came from a client of mine recently, and is something a couple of us had kicked about once before in the Distilled office. My first reaction to the client was that this practice sounded a little bit 'dodgy,' but the more I think about it, the more it seems reasonable.

They had a product range that attracts people to link to their images, but that isn't very helpful to them in terms of SEO (any traffic coming from image search is unlikely to convert), but rel="canonical" those links to images to the product page, and suddenly they are helpful links, and the rel="canonical" seems pretty reasonable.

Here is an example of applying HTTP rel="canonical" to a PDF and a JPG file:

<Files download.pdf>
Header add Link '<http://www.tomanthony.co.uk/httest/pdf-download.html>; rel="canonical"'
</Files>

<Files product.jpg>
Header add Link '<http://www.tomanthony.co.uk/httest/product-page.html>; rel="canonical"'
</Files>

We could also use some variables magic (you didn't know .htaccess could do variables!?) to apply this to all PDFs in a folder, linking back to the HTML page with the same name (be careful with this if you are unsure):

RewriteRule ([^/]+)\.pdf$ - [E=FILENAME:$1]
<FilesMatch "\.pdf$">
Header add Link '<http://www.tomanthony.co.uk/httest/%{FILENAME}e.html>; rel="canonical"'
</FilesMatch>

You can read more about it here:


3) Robots directives

You can't instruct all search engines not to index a page unless you allow them to access the page. If you block a page with robots.txt, then Google might still index it if it has a lot of links pointing to it. You need to put the noindex Meta Robots tag on every page you want to issue that instruction on. If you aren't using a CMS, or are using one that makes this difficult, that could be a lot of work. .htaccess to the rescue!

You can apply directives to all files in a directory by creating an .htaccess file in that directory and adding this command:

Header set X-Robots-Tag "noindex, noarchive, nosnippet"
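If you only want the header on certain file types rather than everything in a directory, a FilesMatch block can scope it. This is a sketch; applying noindex to PDFs is just an illustrative choice:

```apache
# Send the noindex directive only for PDF files in this directory.
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
```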

If you want to read a bit more about this, I suggest this excellent post from Yoast:


4) Various types of redirect

The most common SEO redirect is ensuring that a canonical domain is used, normally www vs. non-www. There are also a couple of other redirects you might find useful. I have kept them simple here, but oftentimes you will want to combine these to ensure you avoid chaining redirects:

# Ensure www on all URLs.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
# Ensure we are using HTTPS version of the site.
RewriteCond %{HTTPS} !on
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
# Ensure all URLs have a trailing slash.
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ http://www.example.com/$1/ [L,R=301]
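To avoid chaining (e.g. http://example.com redirecting to http://www.example.com and then again to https://www.example.com), the first two checks can be combined into a single rule. This is a sketch, assuming the same example.com domain as above:

```apache
# Redirect to the https://www version in one hop if either part is wrong.
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
```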

5) Custom 404 error page

None of your visitors should be seeing a white error page with black techno-babble when they end up at a broken URL. You should always serve a nice 404 page which also gives the visitor links to get back on track.

You can also end up getting lots of links and traffic if you put time and effort into a cool 404 page, like Distilled's:

This is very easy to set up with .htaccess:

ErrorDocument 404 /cool404.html
# Can also do the same for other errors...
ErrorDocument 500 /cool500.html

Potential of Image Search - Ways to Track Traffic

Among the basic guidelines on how to optimize a website, there are issues concerning graphics optimization. In SEO audit checklists, issues such as alternative text attributes or properly constructed image file names appear next to recommendations for changes in meta titles, descriptions, headers, and basic technical SEO issues.

It seems that graphics often play a merely complementary role to more important technical issues, and their optimization is not associated with any specific strategy. This situation is aggravated by the fact that Google rarely refers to Image Search in press releases or on the webmaster blog, and Google Analytics does not include image traffic in its default reports. In addition, the topic rarely appears at industry conferences and is rarely included in expert articles.

Key Findings

The main conclusion of this analysis is that image search potential is underestimated. Some topics prompt users to search for images: for example, issues concerning interior design and those associated with features of appearance. You will also find that Google Webmaster Tools gives great opportunities to analyze image search potential. The Search Queries report in Google Webmaster Tools is a powerful tool for any SEO expert or website owner. Despite the heavy rounding used in GWT, the values shown are very similar to what you can see in Google Analytics. In order to see the exact traffic values in Google Analytics, you have to make changes to the tracking code, which is shown in the last part of this article.

The Potential of Image Search

There are phrases that are associated by default with graphics, for example: “inspirations and ideas for interior decoration,” “gardens,” “breeds of dogs,” etc. In such cases, page views for the first page of image search results amount to 40-60% of the total number of views for the first page of organic results. This could mean that the number of users using image search is only about 50% of the number of users using the standard search engine.

Let us look at the example of the "German dog" query. The www.psy.elk.pl website achieves 10,000 page views per month for the first page of organic search results (data from Google Webmaster Tools). The image search page views reach 75,000, but up to 15 of its photos show up in the first hundred results. It can be assumed that the first page of the image search results is viewed about 75,000/15 = 5,000 times (50% of the organic figure).
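That back-of-the-envelope estimate can be sketched out as follows. The numbers are the ones quoted above, and the key assumption (as in the text) is that the image page views divide evenly across the photos shown:

```python
organic_first_page_views = 10_000  # per month, from Google Webmaster Tools
image_search_page_views = 75_000   # views across all the site's photos
photos_in_top_results = 15         # the site's photos in the first hundred results

# Rough estimate of how often the first page of image results is viewed.
first_page_image_views = image_search_page_views / photos_in_top_results
share_vs_organic = first_page_image_views / organic_first_page_views

print(first_page_image_views, share_vs_organic)  # 5000.0 and 0.5 (50% of organic)
```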

Image search can be a major source of traffic!

According to that data, in the case of websites well-optimized for graphics, the number of visits resulting from image search may vary from 20% to even 60% of all visits from Google.

Measuring the number of visits

The most effective and accurate way to measure the level of traffic from images is to use Google Analytics. Unfortunately, by default, that kind of traffic isn't presented in the reports as organic traffic, and it is impossible to separate it from the genuine organic traffic. Only by modifying the tracking code and using a JavaScript function to identify a frame can we separate the image traffic from the organic traffic (more at the end of the article).

An interesting tool that allows quick verification of both the level of traffic and the potential of image search is the report provided by Google Webmaster Tools. Having entered Traffic and then the Search Queries section, we get a report that shows the frequency of the website's appearance on Google and the number of clicks for each query. At the top of the report we also have access to a variety of filters. The one that interests us is marked as Image.

Responding to a Bing Malware Warning, Inside and Out

If you’re like me, an alert like this causes a lot of stress. Lots of scary thoughts run through your head. Everything from how badly your organic visits are going to plummet to how are you going to respond to the inevitable angry call from the client.

Going through this process helped me better understand how to get a malware notification removed. And there were some things that really surprised me. The goal of this post is to outline the steps I took, so that I might help others who have been impacted by a Bing malware notification (please note: this is my first blog post on SEOmoz).

Find the Malware, Obliterate it, and Re-evaluate

My first step was to talk to the site’s developer. I explained the problem and asked him to look through the site to see if he could identify any malicious code. He found it and cleaned it up.

Now that the site was clean, I logged into Bing Webmaster Tools and completed a Malware Re-Evaluation Form (http://www.bing.com/webmaster/help/malware-re-evaluation-e6982183). This process is fairly straightforward, and requires some basic information, including: