Tag Archive | "Indexing"

Rewriting the Beginner’s Guide to SEO, Chapter 2: Crawling, Indexing, and Ranking

Posted by BritneyMuller

It’s been a few months since our last share of our work-in-progress rewrite of the Beginner’s Guide to SEO, but after a brief hiatus, we’re back to share our draft of Chapter Two with you! This wouldn’t have been possible without the help of Kameron Jenkins, who has thoughtfully contributed her great talent for wordsmithing throughout this piece.

This is your resource, the guide that likely kicked off your interest in and knowledge of SEO, and we want to do right by you. You left amazingly helpful commentary on our outline and draft of Chapter One, and we’d be honored if you would take the time to let us know what you think of Chapter Two in the comments below.


Chapter 2: How Search Engines Work – Crawling, Indexing, and Ranking

First, show up.

As we mentioned in Chapter 1, search engines are answer machines. They exist to discover, understand, and organize the internet’s content in order to offer the most relevant results to the questions searchers are asking.

In order to show up in search results, your content needs to first be visible to search engines. It’s arguably the most important piece of the SEO puzzle: If your site can’t be found, there’s no way you’ll ever show up in the SERPs (Search Engine Results Page).

How do search engines work?

Search engines have three primary functions:

  1. Crawl: Scour the Internet for content, looking over the code/content for each URL they find.
  2. Index: Store and organize the content found during the crawling process. Once a page is in the index, it’s in the running to be displayed as a result to relevant queries.
  3. Rank: Provide the pieces of content that will best answer a searcher’s query. Order the search results by the most helpful to a particular query.

What is search engine crawling?

Crawling, is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to find new and updated content. Content can vary — it could be a webpage, an image, a video, a PDF, etc. — but regardless of the format, content is discovered by links.

The bot starts out by fetching a few web pages, and then follows the links on those webpages to find new URLs. By hopping along this path of links, crawlers are able to find new content and add it to their index — a massive database of discovered URLs — to later be retrieved when a searcher is seeking information that the content on that URL is a good match for.

What is a search engine index?

Search engines process and store information they find in an index, a huge database of all the content they’ve discovered and deem good enough to serve up to searchers.

Search engine ranking

When someone performs a search, search engines scour their index for highly relevant content and then orders that content in the hopes of solving the searcher’s query. This ordering of search results by relevance is known as ranking. In general, you can assume that the higher a website is ranked, the more relevant the search engine believes that site is to the query.

It’s possible to block search engine crawlers from part or all of your site, or instruct search engines to avoid storing certain pages in their index. While there can be reasons for doing this, if you want your content found by searchers, you have to first make sure it’s accessible to crawlers and is indexable. Otherwise, it’s as good as invisible.

By the end of this chapter, you’ll have the context you need to work with the search engine, rather than against it!

Note: In SEO, not all search engines are equal

Many beginners wonder about the relative importance of particular search engines. Most people know that Google has the largest market share, but how important it is to optimize for Bing, Yahoo, and others? The truth is that despite the existence of more than 30 major web search engines, the SEO community really only pays attention to Google. Why? The short answer is that Google is where the vast majority of people search the web. If we include Google Images, Google Maps, and YouTube (a Google property), more than 90% of web searches happen on Google — that’s nearly 20 times Bing and Yahoo combined.

Crawling: Can search engines find your site?

As you’ve just learned, making sure your site gets crawled and indexed is a prerequisite for showing up in the SERPs. First things first: You can check to see how many and which pages of your website have been indexed by Google using “site:yourdomain.com“, an advanced search operator.

Head to Google and type “site:yourdomain.com” into the search bar. This will return results Google has in its index for the site specified:

Screen Shot 2017-08-03 at 5.19.15 PM.png

The number of results Google displays (see “About __ results” above) isn’t exact, but it does give you a solid idea of which pages are indexed on your site and how they are currently showing up in search results.

For more accurate results, monitor and use the Index Coverage report in Google Search Console. You can sign up for a free Google Search Console account if you don’t currently have one. With this tool, you can submit sitemaps for your site and monitor how many submitted pages have actually been added to Google’s index, among other things.

If you’re not showing up anywhere in the search results, there are a few possible reasons why:

  • Your site is brand new and hasn’t been crawled yet.
  • Your site isn’t linked to from any external websites.
  • Your site’s navigation makes it hard for a robot to crawl it effectively.
  • Your site contains some basic code called crawler directives that is blocking search engines.
  • Your site has been penalized by Google for spammy tactics.

If your site doesn’t have any other sites linking to it, you still might be able to get it indexed by submitting your XML sitemap in Google Search Console or manually submitting individual URLs to Google. There’s no guarantee they’ll include a submitted URL in their index, but it’s worth a try!

Can search engines see your whole site?

Sometimes a search engine will be able to find parts of your site by crawling, but other pages or sections might be obscured for one reason or another. It’s important to make sure that search engines are able to discover all the content you want indexed, and not just your homepage.

Ask yourself this: Can the bot crawl through your website, and not just to it?

Is your content hidden behind login forms?

If you require users to log in, fill out forms, or answer surveys before accessing certain content, search engines won’t see those protected pages. A crawler is definitely not going to log in.

Are you relying on search forms?

Robots cannot use search forms. Some individuals believe that if they place a search box on their site, search engines will be able to find everything that their visitors search for.

Is text hidden within non-text content?

Non-text media forms (images, video, GIFs, etc.) should not be used to display text that you wish to be indexed. While search engines are getting better at recognizing images, there’s no guarantee they will be able to read and understand it just yet. It’s always best to add text within the <HTML> markup of your webpage.

Can search engines follow your site navigation?

Just as a crawler needs to discover your site via links from other sites, it needs a path of links on your own site to guide it from page to page. If you’ve got a page you want search engines to find but it isn’t linked to from any other pages, it’s as good as invisible. Many sites make the critical mistake of structuring their navigation in ways that are inaccessible to search engines, hindering their ability to get listed in search results.

Common navigation mistakes that can keep crawlers from seeing all of your site:

  • Having a mobile navigation that shows different results than your desktop navigation
  • Any type of navigation where the menu items are not in the HTML, such as JavaScript-enabled navigations. Google has gotten much better at crawling and understanding Javascript, but it’s still not a perfect process. The more surefire way to ensure something gets found, understood, and indexed by Google is by putting it in the HTML.
  • Personalization, or showing unique navigation to a specific type of visitor versus others, could appear to be cloaking to a search engine crawler
  • Forgetting to link to a primary page on your website through your navigation — remember, links are the paths crawlers follow to new pages!

This is why it’s essential that your website has a clear navigation and helpful URL folder structures.

Information architecture

Information architecture is the practice of organizing and labeling content on a website to improve efficiency and fundability for users. The best information architecture is intuitive, meaning that users shouldn’t have to think very hard to flow through your website or to find something.

Your site should also have a useful 404 (page not found) page for when a visitor clicks on a dead link or mistypes a URL. The best 404 pages allow users to click back into your site so they don’t bounce off just because they tried to access a nonexistent link.

Tell search engines how to crawl your site

In addition to making sure crawlers can reach your most important pages, it’s also pertinent to note that you’ll have pages on your site you don’t want them to find. These might include things like old URLs that have thin content, duplicate URLs (such as sort-and-filter parameters for e-commerce), special promo code pages, staging or test pages, and so on.

Blocking pages from search engines can also help crawlers prioritize your most important pages and maximize your crawl budget (the average number of pages a search engine bot will crawl on your site).

Crawler directives allow you to control what you want Googlebot to crawl and index using a robots.txt file, meta tag, sitemap.xml file, or Google Search Console.

Robots.txt

Robots.txt files are located in the root directory of websites (ex. yourdomain.com/robots.txt) and suggest which parts of your site search engines should and shouldn’t crawl via specific robots.txt directives. This is a great solution when trying to block search engines from non-private pages on your site.

You wouldn’t want to block private/sensitive pages from being crawled here because the file is easily accessible by users and bots.

Pro tip:

  • If Googlebot can’t find a robots.txt file for a site (40X HTTP status code), it proceeds to crawl the site.
  • If Googlebot finds a robots.txt file for a site (20X HTTP status code), it will usually abide by the suggestions and proceed to crawl the site.
  • If Googlebot finds neither a 20X or a 40X HTTP status code (ex. a 501 server error) it can’t determine if you have a robots.txt file or not and won’t crawl your site.

Meta directives

The two types of meta directives are the meta robots tag (more commonly used) and the x-robots-tag. Each provides crawlers with stronger instructions on how to crawl and index a URL’s content.

The x-robots tag provides more flexibility and functionality if you want to block search engines at scale because you can use regular expressions, block non-HTML files, and apply sitewide noindex tags.

These are the best options for blocking more sensitive*/private URLs from search engines.

*For very sensitive URLs, it is best practice to remove them from or require a secure login to view the pages.

WordPress Tip: In Dashboard > Settings > Reading, make sure the “Search Engine Visibility” box is not checked. This blocks search engines from coming to your site via your robots.txt file!

Avoid these common pitfalls, and you’ll have clean, crawlable content that will allow bots easy access to your pages.

Once you’ve ensured your site has been crawled, the next order of business is to make sure it can be indexed.

Sitemaps

A sitemap is just what it sounds like: a list of URLs on your site that crawlers can use to discover and index your content. One of the easiest ways to ensure Google is finding your highest priority pages is to create a file that meets Google’s standards and submit it through Google Search Console. While submitting a sitemap doesn’t replace the need for good site navigation, it can certainly help crawlers follow a path to all of your important pages.

Google Search Console

Some sites (most common with e-commerce) make the same content available on multiple different URLs by appending certain parameters to URLs. If you’ve ever shopped online, you’ve likely narrowed down your search via filters. For example, you may search for “shoes” on Amazon, and then refine your search by size, color, and style. Each time you refine, the URL changes slightly. How does Google know which version of the URL to serve to searchers? Google does a pretty good job at figuring out the representative URL on its own, but you can use the URL Parameters feature in Google Search Console to tell Google exactly how you want them to treat your pages.

Indexing: How do search engines understand and remember your site?

Once you’ve ensured your site has been crawled, the next order of business is to make sure it can be indexed. That’s right — just because your site can be discovered and crawled by a search engine doesn’t necessarily mean that it will be stored in their index. In the previous section on crawling, we discussed how search engines discover your web pages. The index is where your discovered pages are stored. After a crawler finds a page, the search engine renders it just like a browser would. In the process of doing so, the search engine analyzes that page’s contents. All of that information is stored in its index.

Read on to learn about how indexing works and how you can make sure your site makes it into this all-important database.

Can I see how a Googlebot crawler sees my pages?

Yes, the cached version of your page will reflect a snapshot of the last time googlebot crawled it.

Google crawls and caches web pages at different frequencies. More established, well-known sites that post frequently like https://www.nytimes.com will be crawled more frequently than the much-less-famous website for Roger the Mozbot’s side hustle, http://www.rogerlovescupcakes.com (if only it were real…)

You can view what your cached version of a page looks like by clicking the drop-down arrow next to the URL in the SERP and choosing “Cached”:

You can also view the text-only version of your site to determine if your important content is being crawled and cached effectively.

Are pages ever removed from the index?

Yes, pages can be removed from the index! Some of the main reasons why a URL might be removed include:

  • The URL is returning a “not found” error (4XX) or server error (5XX) – This could be accidental (the page was moved and a 301 redirect was not set up) or intentional (the page was deleted and 404ed in order to get it removed from the index)
  • The URL had a noindex meta tag added – This tag can be added by site owners to instruct the search engine to omit the page from its index.
  • The URL has been manually penalized for violating the search engine’s Webmaster Guidelines and, as a result, was removed from the index.
  • The URL has been blocked from crawling with the addition of a password required before visitors can access the page.

If you believe that a page on your website that was previously in Google’s index is no longer showing up, you can manually submit the URL to Google by navigating to the “Submit URL” tool in Search Console.

Ranking: How do search engines rank URLs?

How do search engines ensure that when someone types a query into the search bar, they get relevant results in return? That process is known as ranking, or the ordering of search results by most relevant to least relevant to a particular query.

To determine relevance, search engines use algorithms, a process or formula by which stored information is retrieved and ordered in meaningful ways. These algorithms have gone through many changes over the years in order to improve the quality of search results. Google, for example, makes algorithm adjustments every day — some of these updates are minor quality tweaks, whereas others are core/broad algorithm updates deployed to tackle a specific issue, like Penguin to tackle link spam. Check out our Google Algorithm Change History for a list of both confirmed and unconfirmed Google updates going back to the year 2000.

Why does the algorithm change so often? Is Google just trying to keep us on our toes? While Google doesn’t always reveal specifics as to why they do what they do, we do know that Google’s aim when making algorithm adjustments is to improve overall search quality. That’s why, in response to algorithm update questions, Google will answer with something along the lines of: “We’re making quality updates all the time.” This indicates that, if your site suffered after an algorithm adjustment, compare it against Google’s Quality Guidelines or Search Quality Rater Guidelines, both are very telling in terms of what search engines want.

What do search engines want?

Search engines have always wanted the same thing: to provide useful answers to searcher’s questions in the most helpful formats. If that’s true, then why does it appear that SEO is different now than in years past?

Think about it in terms of someone learning a new language.

At first, their understanding of the language is very rudimentary — “See Spot Run.” Over time, their understanding starts to deepen, and they learn semantics—- the meaning behind language and the relationship between words and phrases. Eventually, with enough practice, the student knows the language well enough to even understand nuance, and is able to provide answers to even vague or incomplete questions.

When search engines were just beginning to learn our language, it was much easier to game the system by using tricks and tactics that actually go against quality guidelines. Take keyword stuffing, for example. If you wanted to rank for a particular keyword like “funny jokes,” you might add the words “funny jokes” a bunch of times onto your page, and make it bold, in hopes of boosting your ranking for that term:

Welcome to funny jokes! We tell the funniest jokes in the world. Funny jokes are fun and crazy. Your funny joke awaits. Sit back and read funny jokes because funny jokes can make you happy and funnier. Some funny favorite funny jokes.

This tactic made for terrible user experiences, and instead of laughing at funny jokes, people were bombarded by annoying, hard-to-read text. It may have worked in the past, but this is never what search engines wanted.

The role links play in SEO

When we talk about links, we could mean two things. Backlinks or “inbound links” are links from other websites that point to your website, while internal links are links on your own site that point to your other pages (on the same site).

Links have historically played a big role in SEO. Very early on, search engines needed help figuring out which URLs were more trustworthy than others to help them determine how to rank search results. Calculating the number of links pointing to any given site helped them do this.

Backlinks work very similarly to real life WOM (Word-Of-Mouth) referrals. Let’s take a hypothetical coffee shop, Jenny’s Coffee, as an example:

  • Referrals from others = good sign of authority
    Example: Many different people have all told you that Jenny’s Coffee is the best in town
  • Referrals from yourself = biased, so not a good sign of authority
    Example: Jenny claims that Jenny’s Coffee is the best in town
  • Referrals from irrelevant or low-quality sources = not a good sign of authority and could even get you flagged for spam
    Example: Jenny paid to have people who have never visited her coffee shop tell others how good it is.
  • No referrals = unclear authority
    Example: Jenny’s Coffee might be good, but you’ve been unable to find anyone who has an opinion so you can’t be sure.

This is why PageRank was created. PageRank (part of Google’s core algorithm) is a link analysis algorithm named after one of Google’s founders, Larry Page. PageRank estimates the importance of a web page by measuring the quality and quantity of links pointing to it. The assumption is that the more relevant, important, and trustworthy a web page is, the more links it will have earned.

The more natural backlinks you have from high-authority (trusted) websites, the better your odds are to rank higher within search results.

The role content plays in SEO

There would be no point to links if they didn’t direct searchers to something. That something is content! Content is more than just words; it’s anything meant to be consumed by searchers — there’s video content, image content, and of course, text. If search engines are answer machines, content is the means by which the engines deliver those answers.

Any time someone performs a search, there are thousands of possible results, so how do search engines decide which pages the searcher is going to find valuable? A big part of determining where your page will rank for a given query is how well the content on your page matches the query’s intent. In other words, does this page match the words that were searched and help fulfill the task the searcher was trying to accomplish?

Because of this focus on user satisfaction and task accomplishment, there’s no strict benchmarks on how long your content should be, how many times it should contain a keyword, or what you put in your header tags. All those can play a role in how well a page performs in search, but the focus should be on the users who will be reading the content.

Today, with hundreds or even thousands of ranking signals, the top three have stayed fairly consistent: links to your website (which serve as a third-party credibility signals), on-page content (quality content that fulfills a searcher’s intent), and RankBrain.

What is RankBrain?

RankBrain is the machine learning component of Google’s core algorithm. Machine learning is a computer program that continues to improve its predictions over time through new observations and training data. In other words, it’s always learning, and because it’s always learning, search results should be constantly improving.

For example, if RankBrain notices a lower ranking URL providing a better result to users than the higher ranking URLs, you can bet that RankBrain will adjust those results, moving the more relevant result higher and demoting the lesser relevant pages as a byproduct.

Like most things with the search engine, we don’t know exactly what comprises RankBrain, but apparently, neither do the folks at Google.

What does this mean for SEOs?

Because Google will continue leveraging RankBrain to promote the most relevant, helpful content, we need to focus on fulfilling searcher intent more than ever before. Provide the best possible information and experience for searchers who might land on your page, and you’ve taken a big first step to performing well in a RankBrain world.

Engagement metrics: correlation, causation, or both?

With Google rankings, engagement metrics are most likely part correlation and part causation.

When we say engagement metrics, we mean data that represents how searchers interact with your site from search results. This includes things like:

  • Clicks (visits from search)
  • Time on page (amount of time the visitor spent on a page before leaving it)
  • Bounce rate (the percentage of all website sessions where users viewed only one page)
  • Pogo-sticking (clicking on an organic result and then quickly returning to the SERP to choose another result)

Many tests, including Moz’s own ranking factor survey, have indicated that engagement metrics correlate with higher ranking, but causation has been hotly debated. Are good engagement metrics just indicative of highly ranked sites? Or are sites ranked highly because they possess good engagement metrics?

What Google has said

While they’ve never used the term “direct ranking signal,” Google has been clear that they absolutely use click data to modify the SERP for particular queries.

According to Google’s former Chief of Search Quality, Udi Manber:

“The ranking itself is affected by the click data. If we discover that, for a particular query, 80% of people click on #2 and only 10% click on #1, after a while we figure out probably #2 is the one people want, so we’ll switch it.”

Another comment from former Google engineer Edmond Lau corroborates this:

“It’s pretty clear that any reasonable search engine would use click data on their own results to feed back into ranking to improve the quality of search results. The actual mechanics of how click data is used is often proprietary, but Google makes it obvious that it uses click data with its patents on systems like rank-adjusted content items.”

Because Google needs to maintain and improve search quality, it seems inevitable that engagement metrics are more than correlation, but it would appear that Google falls short of calling engagement metrics a “ranking signal” because those metrics are used to improve search quality, and the rank of individual URLs is just a byproduct of that.

What tests have confirmed

Various tests have confirmed that Google will adjust SERP order in response to searcher engagement:

  • Rand Fishkin’s 2014 test resulted in a #7 result moving up to the #1 spot after getting around 200 people to click on the URL from the SERP. Interestingly, ranking improvement seemed to be isolated to the location of the people who visited the link. The rank position spiked in the US, where many participants were located, whereas it remained lower on the page in Google Canada, Google Australia, etc.
  • Larry Kim’s comparison of top pages and their average dwell time pre- and post-RankBrain seemed to indicate that the machine-learning component of Google’s algorithm demotes the rank position of pages that people don’t spend as much time on.
  • Darren Shaw’s testing has shown user behavior’s impact on local search and map pack results as well.

Since user engagement metrics are clearly used to adjust the SERPs for quality, and rank position changes as a byproduct, it’s safe to say that SEOs should optimize for engagement. Engagement doesn’t change the objective quality of your web page, but rather your value to searchers relative to other results for that query. That’s why, after no changes to your page or its backlinks, it could decline in rankings if searchers’ behaviors indicates they like other pages better.

In terms of ranking web pages, engagement metrics act like a fact-checker. Objective factors such as links and content first rank the page, then engagement metrics help Google adjust if they didn’t get it right.

The evolution of search results

Back when search engines lacked a lot of the sophistication they have today, the term “10 blue links” was coined to describe the flat structure of the SERP. Any time a search was performed, Google would return a page with 10 organic results, each in the same format.

In this search landscape, holding the #1 spot was the holy grail of SEO. But then something happened. Google began adding results in new formats on their search result pages, called SERP features. Some of these SERP features include:

  • Paid advertisements
  • Featured snippets
  • People Also Ask boxes
  • Local (map) pack
  • Knowledge panel
  • Sitelinks

And Google is adding new ones all the time. It even experimented with “zero-result SERPs,” a phenomenon where only one result from the Knowledge Graph was displayed on the SERP with no results below it except for an option to “view more results.”

The addition of these features caused some initial panic for two main reasons. For one, many of these features caused organic results to be pushed down further on the SERP. Another byproduct is that fewer searchers are clicking on the organic results since more queries are being answered on the SERP itself.

So why would Google do this? It all goes back to the search experience. User behavior indicates that some queries are better satisfied by different content formats. Notice how the different types of SERP features match the different types of query intents.

Query Intent

Possible SERP Feature Triggered

Informational

Featured Snippet

Informational with one answer

Knowledge Graph / Instant Answer

Local

Map Pack

Transactional

Shopping

We’ll talk more about intent in Chapter 3, but for now, it’s important to know that answers can be delivered to searchers in a wide array of formats, and how you structure your content can impact the format in which it appears in search.

Localized search

A search engine like Google has its own proprietary index of local business listings, from which it creates local search results.

If you are performing local SEO work for a business that has a physical location customers can visit (ex: dentist) or for a business that travels to visit their customers (ex: plumber), make sure that you claim, verify, and optimize a free Google My Business Listing.

When it comes to localized search results, Google uses three main factors to determine ranking:

  1. Relevance
  2. Distance
  3. Prominence

Relevance

Relevance is how well a local business matches what the searcher is looking for. To ensure that the business is doing everything it can to be relevant to searchers, make sure the business’ information is thoroughly and accurately filled out.

Distance

Google use your geo-location to better serve you local results. Local search results are extremely sensitive to proximity, which refers to the location of the searcher and/or the location specified in the query (if the searcher included one).

Organic search results are sensitive to a searcher’s location, though seldom as pronounced as in local pack results.

Prominence

With prominence as a factor, Google is looking to reward businesses that are well-known in the real world. In addition to a business’ offline prominence, Google also looks to some online factors to determine local ranking, such as:

Reviews

The number of Google reviews a local business receives, and the sentiment of those reviews, have a notable impact on their ability to rank in local results.

Citations

A “business citation” or “business listing” is a web-based reference to a local business’ “NAP” (name, address, phone number) on a localized platform (Yelp, Acxiom, YP, Infogroup, Localeze, etc.).

Local rankings are influenced by the number and consistency of local business citations. Google pulls data from a wide variety of sources in continuously making up its local business index. When Google finds multiple consistent references to a business’s name, location, and phone number it strengthens Google’s “trust” in the validity of that data. This then leads to Google being able to show the business with a higher degree of confidence. Google also uses information from other sources on the web, such as links and articles.

Check a local business’ citation accuracy here.

Organic ranking

SEO best practices also apply to local SEO, since Google also considers a website’s position in organic search results when determining local ranking.

In the next chapter, you’ll learn on-page best practices that will help Google and users better understand your content.

[Bonus!] Local engagement

Although not listed by Google as a local ranking determiner, the role of engagement is only going to increase as time goes on. Google continues to enrich local results by incorporating real-world data like popular times to visit and average length of visits…

Screenshot of Google SERP result for a local business showing busy times of day

…and even provides searchers with the ability to ask the business questions!

Screenshot of the Questions & Answers portion of a local Google SERP result

Undoubtedly now more than ever before, local results are being influenced by real-world data. This interactivity is how searchers interact with and respond to local businesses, rather than purely static (and game-able) information like links and citations.

Since Google wants to deliver the best, most relevant local businesses to searchers, it makes perfect sense for them to use real time engagement metrics to determine quality and relevance.


You don’t have to know the ins and outs of Google’s algorithm (that remains a mystery!), but by now you should have a great baseline knowledge of how the search engine finds, interprets, stores, and ranks content. Armed with that knowledge, let’s learn about choosing the keywords your content will target!

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in IM NewsComments Off

Google Clarifies Seven Points On Mobile-First Indexing After Much Confusion

Image credit to Shutterstock

Let me start by saying all of these points, in my opinion, is something we’ve covered here before but the interesting thing is that Google felt these seven items are things SEOs who give presentations have confused over the past several months…


Search Engine Roundtable

Posted in IM NewsComments Off

Search Buzz Video Recap: Google Mobile-First Indexing Rollout, Recipe Update, AdWords Editor & Matt Cutts Stories

This month, we posted our monthly Google Webmaster Report, so catch up there. Google began the second batch of mobile-first indexing, and sent out notices to those sites as well. The thing is, Google is moving over many sites that are desktop only…


Search Engine Roundtable

Posted in IM NewsComments Off

Search Buzz Video Recap: Google Algorithm & Ranking Update, Job Listings Penalty & Mobile-First Indexing

This week we saw a long going Google algorithm update that has been ongoing for a few days or so now. Google said job listings that are no longer open that have schema can be penalized. Google said do not make your site unmobile-friendly just to stay away from the mobile-first indexing…


Search Engine Roundtable

Posted in IM NewsComments Off

How Mobile-First Indexing Disrupts the Link Graph

Posted by rjonesx.

It’s happened to all of us. You bring up a webpage on your mobile device, only to find out that a feature you were accustomed to using on desktop simply isn’t available on mobile. While frustrating, it has always been a struggle for web developers and designers alike to simplify and condense their site on mobile screens without needing to strip features or content that would otherwise clutter a smaller viewport. The worst-case scenario for these trade-offs is that some features would be reserved for desktop environments, or perhaps a user might be able to opt out of the mobile view. Below is an example of how my personal blog displays the mobile version using a popular plugin by ElegantThemes called HandHeld. As you can see, the vast page is heavily stripped down and is far easier to read… but at what cost? And at what cost to the link graph?

My personal blog drops 75 of the 87 links, and all of the external links, when the mobile version is accessed. So what happens when the mobile versions of sites become the primary way the web is accessed, at scale, by the bots which power major search engines?

Google’s announcement to proceed with a mobile-first index raises new questions about how the link structure of the web as a whole might be influenced once these truncated web experiences become the first (and sometimes only) version of the web Googlebot encounters.

So, what’s the big deal?

The concern, which no doubt Google engineers have studied internally, is that mobile websites often remove content and links in order to improve user experience on a smaller screen. This abbreviated content fundamentally alters the link structure which underlies one of the most important factors in Google’s rankings. Our goal is to try and understand the impact this might have.

Before we get started, one giant unknown variable which I want to be quick to point out is we don’t know what percentage of the web Google will crawl with both its desktop and mobile bots. Perhaps Google will choose to be “mobile-first” only on sites that have historically displayed an identical codebase to both the mobile and desktop versions of Googlebot. However, for the purposes of this study, I want to show the worst-case scenario, as if Google chose not only to go “mobile-first,” but in fact to go “mobile-only.”

Methodology: Comparing mobile to desktop at scale

For this brief research, I decided to grab 20,000 random websites from the Quantcast Top Million. I would then crawl two levels deep, spoofing both the Google mobile and Google desktop versions of Googlebot. With this data, we can begin to compare how different the link structure of the web might look.

Homepage metrics

Let’s start with some descriptive statistics of the home pages of these 20,000 randomly selected sites. Of the sites analyzed, 87.42% had the same number of links on their homepage regardless of whether the bot was mobile- or desktop-oriented. Of the remaining 12.58%, 9% had fewer links and 3.58% had more. This doesn’t seem too disparate at first glance.

Perhaps more importantly, only 79.87% had identical links on the homepage when visited by desktop and mobile bots. Just because the same number of links were found didn’t mean they were actually the same links. This is important to take into consideration because links are the pathways which bots use to find content on the web. Different paths mean a different index.

Among the homepage links, we found a 7.4% drop in external links. This could mean a radical shift in some of the most important links on the web, given that homepage links often carry a great deal of link equity. Interestingly, the biggest “losers” as a percentage tended to be social sites. In retrospect, it seems reasonable that one of the common types of links a website might remove from their mobile version would be social share buttons because they’re often incorporated into the “chrome” of a page rather than the content, and the “chrome” often changes to accommodate a mobile version.

The biggest losers as a percentage in order were:

  1. linkedin.com
  2. instagram.com
  3. twitter.com
  4. facebook.com

So what’s the big deal about 5–15% differences in links when crawling the web? Well, it turns out that these numbers tend to be biased towards sites with lots of links that don’t have a mobile version. However, most of those links are main navigation links. When you crawl deeper, you just find the same links. But those that do deviate end up having radically different second-level crawl links.

Second-level metrics

Now this is where the data gets interesting. As we continue to crawl out on the web using crawl sets that are influenced by the links discovered by a mobile bot versus a desktop bot, we’ll continue to get more and more divergent results. But how far will they diverge? Let’s start with size. While we crawled an identical number of home pages, the second-tier results diverged based on the number of links found on those original home pages. Thus, the mobile crawlset was 977,840 unique URLs, while the desktop crawlset was 1,053,785. Already we can see a different index taking shape — the desktop index would be much larger. Let’s dig deeper.

I want you to take a moment and really focus on this graph. Notice there are three categories:

  • Mobile Unique: Blue bars represent unique items found by the mobile bot
  • Desktop Unique: Orange bars represent unique items found by the desktop bot
  • Shared: Gray bars represent items found by both

Notice also that there are there are four tests:

  • Number of URLs discovered
  • Number of Domains discovered
  • Number of Links discovered
  • Number of Root Linking Domains discovered

Now here is the key point, and it’s really big. There are more URLs, Domains, Links, and Root Linking Domains unique to the desktop crawl result than there are shared between the desktop and mobile crawler. The orange bar is always taller than the gray. This means that by just the second level of the crawl, the majority of link relationships, pages, and domains are different in the indexes. This is huge. This is a fundamental shift in the link graph as we have come to know it.

And now for the big question, what we all care about the most — external links.

A whopping 63% of external links are unique to the desktop crawler. In a mobile-only crawling world, the total number of external links was halved.

What is happening at the micro level?

So, what’s really causing this huge disparity in the crawl? Well, we know it has something to do with a few common shortcuts to making a site “mobile-friendly,” which include:

  1. Subdomain versions of the content that have fewer links or features
  2. The removal of links and features by user-agent detecting plugins

Of course, these changes might make the experience better for your users, but it does create a different experience for bots. Let’s take a closer look at one site to see how this plays out.

This site has ~10,000 pages according to Google and has a Domain Authority of 72 and 22,670 referring domains according to the new Moz Link Explorer. However, the site uses a popular WordPress plugin that abbreviates the content down to just the articles and pages on the site, removing links from descriptions in the articles on the category pages and removing most if not all extraneous links from the sidebar and footer. This particular plugin is used on over 200,000 websites. So, what happens when we fire up a six-level-deep crawl with Screaming Frog? (It’s great for this kind of analysis because we can easily change the user-agent and restrict settings to just crawl HTML content.)

The difference is shocking. First, notice that in the mobile crawl on the left, there is clearly a low number of links per page and that number of links is very steady as you crawl deeper through the site. This is what produces such a steady, exponential growth curve. Second, notice that the crawl abruptly ended at level four. The site just didn’t have any more pages to offer the mobile crawler! Only ~3,000 of the ~10,000 pages Google reports were found.

Now, compare this to the desktop crawler. It explodes in pages at level two, collecting nearly double the total pages of the mobile crawl at this level alone. Now, recall the graph before showing that there were more unique desktop pages than there were shared pages when we crawled 20,000 sites. Here is confirmation of exactly how it happens. Ultimately, 6x the content was made available to the desktop crawler in the same level of crawl depth.

But what impact did this have on external links?

Wow. 75% of the external, outbound links were culled in the mobile version. 4,905 external links were found in the desktop version while only 1,162 were found in the mobile. Remember, this is a DA 72 site with over twenty thousand referring domains. Imagine losing that link because the mobile index no longer finds the backlink. What should we do? Is the sky falling?

Take a deep breath

Mobile-first isn’t mobile-only

The first important caveat to all this research is that Google isn’t giving up on the desktop — they’re simply prioritizing the mobile crawl. This makes sense, as the majority of search traffic is now mobile. If Google wants to make sure quality mobile content is served, they need to shift their crawl priorities. But they also have a competing desire to find content, and doing so requires using a desktop crawler so long as webmasters continue to abbreviate the mobile versions of their sites.

This reality isn’t lost on Google. In the Original Official Google Mobile First Announcement, they write…

If you are building a mobile version of your site, keep in mind that a functional desktop-oriented site can be better than a broken or incomplete mobile version of the site.

Google took the time to state that a desktop version can be better than an “incomplete mobile version.” I don’t intend to read too much into this statement other than to say that Google wants a full mobile version, not just a postcard.

Good link placements will prevail

One anecdotal outcome of my research was that the external links which tended to survive the cull of a mobile version were often placed directly in the content. External links in sidebars like blog-rolls were essentially annihilated from the index, but in-content links survived. This may be a signal Google picks up on. External links that are both in mobile and desktop tend to be the kinds of links people might click on.

So, while there may be fewer links powering the link graph (or at least there might be a subset that is specially identified), if your links are good, content-based links, then you have a chance to see improved performance.

I was able to confirm this by looking at a subset of known good links. Using Fresh Web Explorer, I looked up fresh links to toysrus.com which is currently gaining a great deal of attention due to stores closing. We can feel confident that most of these links will be in-content because the articles themselves are about the relevant, breaking news regarding Toys R Us. Sure enough, after testing 300+ mentions, we found the links to be identical in the mobile and desktop crawls. These were good, in-content links and, subsequently, they showed up in both versions of the crawl.

Selection bias and convergence

It is probably the case that popular sites are more likely to have a mobile version than non-popular sites. Now, they might be responsive — at which point they would yield no real differences in the crawl — but at least some percentage would likely be m.* domains or utilize plugins like those mentioned above which truncate the content. At the lower rungs of the web, older, less professional content is likely to have only one version which is shown to mobile and desktop devices alike. If this is the case, we can expect that over time the differences in the index might begin to converge rather than diverge, as my study looked only at sites that were in the top million and only crawled two levels deep.

Moreover (this one is a bit speculative), but I think over time that there will be convergence between a mobile and desktop index. I don’t think the link graphs will grow exponentially different as the linked web is only so big. Rather, the paths to which certain pages are reached, and the frequency with which they are reached, will change quite a bit. So, while the link graph will differ, the set of URLs making up the link graph will largely be the same. Of course, some percentage of the mobile web will remain wholly disparate. The large number of sites that use dedicated mobile subdomains or plugins that remove substantial sections of content will remain like mobile islands in the linked web.

Impact on SERPs

It’s difficult at this point to say what the impact on search results will be. It will certainly not leave the SERPs unchanged. What would be the point of Google making and announcing a change to its indexing methods if it didn’t improve the SERPs?

That being said, this study wouldn’t be complete without some form of impact assessment. Hat tip to JR Oakes for giving me this critique, otherwise I would have forgotten to take a look.

First, there are a couple of things which could mitigate dramatic shifts in the SERPs already, regardless of the veracity of this study:

  • A slow rollout means that shifts in SERPs will be lost to the natural ranking fluctuations we already see.
  • Google can seed URLs found by mobile or by desktop into their respective crawlers, thereby limiting index divergence. (This is a big one!)
  • Google could choose to consider, for link purposes, the aggregate of both mobile and desktop crawls, not counting one to the exclusion of the other.

Second, the relationships between domains may be less affected than other index metrics. What is the likelihood that the relationship between Domain X and Domain Y (more or less links) is the same for both the mobile- and desktop-based indexes? If the relationships tend to remain the same, then the impact on SERPs will be limited. We will call this relationship being “directionally consistent.”

To accomplish this part of the study, I took a sample of domain pairs from the mobile index and compared their relationship (more or less links) to their performance in the desktop index. Did the first have more links than the second in both the mobile and desktop? Or did they perform differently?

It turns out that the indexes were fairly close in terms of directional consistency. That is to say that while the link graphs as a whole were quite different, when you compared one domain to another at random, they tended in both data sets to be directionally consistent. Approximately 88% of the domains compared maintained directional consistency via the indexes. This test was only run comparing the mobile index domains to the desktop index domains. Future research might explore the reverse relationship.

So what’s next?: Moz and the mobile-first index

Our goal for the Moz link index has always been to be as much like Google as possible. It is with that in mind that our team is experimenting with a mobile-first index as well. Our new link index and Link Explorer in Beta seeks to be more than simply one of the largest link indexes on the web, but the most relevant and useful, and we believe part of that means shaping our index with methods similar to Google. We will keep you updated!

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in IM NewsComments Off

The Google Request Indexing Quota Seems To Be Enforced

Last week we reported how Google removed the quota notification within the request indexing feature in the Google Search Console fetch as Google feature. We noted the documentation still had the quota but the interface removed the quota notification…


Search Engine Roundtable

Posted in IM NewsComments Off

Does Googlebot Support HTTP/2? Challenging Google’s Indexing Claims – An Experiment

Posted by goralewicz

I was recently challenged with a question from a client, Robert, who runs a small PR firm and needed to optimize a client’s website. His question inspired me to run a small experiment in HTTP protocols. So what was Robert’s question? He asked…

Can Googlebot crawl using HTTP/2 protocols?

You may be asking yourself, why should I care about Robert and his HTTP protocols?

As a refresher, HTTP protocols are the basic set of standards allowing the World Wide Web to exchange information. They are the reason a web browser can display data stored on another server. The first was initiated back in 1989, which means, just like everything else, HTTP protocols are getting outdated. HTTP/2 is one of the latest versions of HTTP protocol to be created to replace these aging versions.

So, back to our question: why do you, as an SEO, care to know more about HTTP protocols? The short answer is that none of your SEO efforts matter or can even be done without a basic understanding of HTTP protocol. Robert knew that if his site wasn’t indexing correctly, his client would miss out on valuable web traffic from searches.

The hype around HTTP/2

HTTP/1.1 is a 17-year-old protocol (HTTP 1.0 is 21 years old). Both HTTP 1.0 and 1.1 have limitations, mostly related to performance. When HTTP/1.1 was getting too slow and out of date, Google introduced SPDY in 2009, which was the basis for HTTP/2. Side note: Starting from Chrome 53, Google decided to stop supporting SPDY in favor of HTTP/2.

HTTP/2 was a long-awaited protocol. Its main goal is to improve a website’s performance. It’s currently used by 17% of websites (as of September 2017). Adoption rate is growing rapidly, as only 10% of websites were using HTTP/2 in January 2017. You can see the adoption rate charts here. HTTP/2 is getting more and more popular, and is widely supported by modern browsers (like Chrome or Firefox) and web servers (including Apache, Nginx, and IIS).

Its key advantages are:

  • Multiplexing: The ability to send multiple requests through a single TCP connection.
  • Server push: When a client requires some resource (let’s say, an HTML document), a server can push CSS and JS files to a client cache. It reduces network latency and round-trips.
  • One connection per origin: With HTTP/2, only one connection is needed to load the website.
  • Stream prioritization: Requests (streams) are assigned a priority from 1 to 256 to deliver higher-priority resources faster.
  • Binary framing layer: HTTP/2 is easier to parse (for both the server and user).
  • Header compression: This feature reduces overhead from plain text in HTTP/1.1 and improves performance.

For more information, I highly recommend reading “Introduction to HTTP/2” by Surma and Ilya Grigorik.

All these benefits suggest pushing for HTTP/2 support as soon as possible. However, my experience with technical SEO has taught me to double-check and experiment with solutions that might affect our SEO efforts.

So the question is: Does Googlebot support HTTP/2?

Google’s promises

HTTP/2 represents a promised land, the technical SEO oasis everyone was searching for. By now, many websites have already added HTTP/2 support, and developers don’t want to optimize for HTTP/1.1 anymore. Before I could answer Robert’s question, I needed to know whether or not Googlebot supported HTTP/2-only crawling.

I was not alone in my query. This is a topic which comes up often on Twitter, Google Hangouts, and other such forums. And like Robert, I had clients pressing me for answers. The experiment needed to happen. Below I’ll lay out exactly how we arrived at our answer, but here’s the spoiler: it doesn’t. Google doesn’t crawl using the HTTP/2 protocol. If your website uses HTTP/2, you need to make sure you continue to optimize the HTTP/1.1 version for crawling purposes.

The question

It all started with a Google Hangouts in November 2015.

When asked about HTTP/2 support, John Mueller mentioned that HTTP/2-only crawling should be ready by early 2016, and he also mentioned that HTTP/2 would make it easier for Googlebot to crawl pages by bundling requests (images, JS, and CSS could be downloaded with a single bundled request).

“At the moment, Google doesn’t support HTTP/2-only crawling (…) We are working on that, I suspect it will be ready by the end of this year (2015) or early next year (2016) (…) One of the big advantages of HTTP/2 is that you can bundle requests, so if you are looking at a page and it has a bunch of embedded images, CSS, JavaScript files, theoretically you can make one request for all of those files and get everything together. So that would make it a little bit easier to crawl pages while we are rendering them for example.”

Soon after, Twitter user Kai Spriestersbach also asked about HTTP/2 support:

His clients started dropping HTTP/1.1 connections optimization, just like most developers deploying HTTP/2, which was at the time supported by all major browsers.

After a few quiet months, Google Webmasters reignited the conversation, tweeting that Google won’t hold you back if you’re setting up for HTTP/2. At this time, however, we still had no definitive word on HTTP/2-only crawling. Just because it won’t hold you back doesn’t mean it can handle it — which is why I decided to test the hypothesis.

The experiment

For months as I was following this online debate, I still received questions from our clients who no longer wanted want to spend money on HTTP/1.1 optimization. Thus, I decided to create a very simple (and bold) experiment.

I decided to disable HTTP/1.1 on my own website (https://goralewicz.com) and make it HTTP/2 only. I disabled HTTP/1.1 from March 7th until March 13th.

If you’re going to get bad news, at the very least it should come quickly. I didn’t have to wait long to see if my experiment “took.” Very shortly after disabling HTTP/1.1, I couldn’t fetch and render my website in Google Search Console; I was getting an error every time.

My website is fairly small, but I could clearly see that the crawling stats decreased after disabling HTTP/1.1. Google was no longer visiting my site.

While I could have kept going, I stopped the experiment after my website was partially de-indexed due to “Access Denied” errors.

The results

I didn’t need any more information; the proof was right there. Googlebot wasn’t supporting HTTP/2-only crawling. Should you choose to duplicate this at home with our own site, you’ll be happy to know that my site recovered very quickly.

I finally had Robert’s answer, but felt others may benefit from it as well. A few weeks after finishing my experiment, I decided to ask John about HTTP/2 crawling on Twitter and see what he had to say.

(I love that he responds.)

Knowing the results of my experiment, I have to agree with John: disabling HTTP/1 was a bad idea. However, I was seeing other developers discontinuing optimization for HTTP/1, which is why I wanted to test HTTP/2 on its own.

For those looking to run their own experiment, there are two ways of negotiating a HTTP/2 connection:

1. Over HTTP (unsecure) – Make an HTTP/1.1 request that includes an Upgrade header. This seems to be the method to which John Mueller was referring. However, it doesn’t apply to my website (because it’s served via HTTPS). What is more, this is an old-fashioned way of negotiating, not supported by modern browsers. Below is a screenshot from Caniuse.com:

2. Over HTTPS (secure) – Connection is negotiated via the ALPN protocol (HTTP/1.1 is not involved in this process). This method is preferred and widely supported by modern browsers and servers.

A recent announcement: The saga continues

Googlebot doesn’t make HTTP/2 requests

Fortunately, Ilya Grigorik, a web performance engineer at Google, let everyone peek behind the curtains at how Googlebot is crawling websites and the technology behind it:

If that wasn’t enough, Googlebot doesn’t support the WebSocket protocol. That means your server can’t send resources to Googlebot before they are requested. Supporting it wouldn’t reduce network latency and round-trips; it would simply slow everything down. Modern browsers offer many ways of loading content, including WebRTC, WebSockets, loading local content from drive, etc. However, Googlebot supports only HTTP/FTP, with or without Transport Layer Security (TLS).

Googlebot supports SPDY

During my research and after John Mueller’s feedback, I decided to consult an HTTP/2 expert. I contacted Peter Nikolow of Mobilio, and asked him to see if there were anything we could do to find the final answer regarding Googlebot’s HTTP/2 support. Not only did he provide us with help, Peter even created an experiment for us to use. Its results are pretty straightforward: Googlebot does support the SPDY protocol and Next Protocol Navigation (NPN). And thus, it can’t support HTTP/2.

Below is Peter’s response:


I performed an experiment that shows Googlebot uses SPDY protocol. Because it supports SPDY + NPN, it cannot support HTTP/2. There are many cons to continued support of SPDY:

    1. This protocol is vulnerable
    2. Google Chrome no longer supports SPDY in favor of HTTP/2
    3. Servers have been neglecting to support SPDY. Let’s examine the NGINX example: from version 1.95, they no longer support SPDY.
    4. Apache doesn’t support SPDY out of the box. You need to install mod_spdy, which is provided by Google.

To examine Googlebot and the protocols it uses, I took advantage of s_server, a tool that can debug TLS connections. I used Google Search Console Fetch and Render to send Googlebot to my website.

Here’s a screenshot from this tool showing that Googlebot is using Next Protocol Navigation (and therefore SPDY):

I’ll briefly explain how you can perform your own test. The first thing you should know is that you can’t use scripting languages (like PHP or Python) for debugging TLS handshakes. The reason for that is simple: these languages see HTTP-level data only. Instead, you should use special tools for debugging TLS handshakes, such as s_server.

Type in the console:

sudo openssl s_server -key key.pem -cert cert.pem -accept 443 -WWW -tlsextdebug -state -msg
sudo openssl s_server -key key.pem -cert cert.pem -accept 443 -www -tlsextdebug -state -msg

Please note the slight (but significant) difference between the “-WWW” and “-www” options in these commands. You can find more about their purpose in the s_server documentation.

Next, invite Googlebot to visit your site by entering the URL in Google Search Console Fetch and Render or in the Google mobile tester.

As I wrote above, there is no logical reason why Googlebot supports SPDY. This protocol is vulnerable; no modern browser supports it. Additionally, servers (including NGINX) neglect to support it. It’s just a matter of time until Googlebot will be able to crawl using HTTP/2. Just implement HTTP 1.1 + HTTP/2 support on your own server (your users will notice due to faster loading) and wait until Google is able to send requests using HTTP/2.


Summary

In November 2015, John Mueller said he expected Googlebot to crawl websites by sending HTTP/2 requests starting in early 2016. We don’t know why, as of October 2017, that hasn’t happened yet.

What we do know is that Googlebot doesn’t support HTTP/2. It still crawls by sending HTTP/ 1.1 requests. Both this experiment and the “Rendering on Google Search” page confirm it. (If you’d like to know more about the technology behind Googlebot, then you should check out what they recently shared.)

For now, it seems we have to accept the status quo. We recommended that Robert (and you readers as well) enable HTTP/2 on your websites for better performance, but continue optimizing for HTTP/ 1.1. Your visitors will notice and thank you.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in IM NewsComments Off

Going Beyond Google: Are Search Engines Ready for JavaScript Crawling & Indexing?

Posted by goralewicz

I recently published the results of my JavaScript SEO experiment where I checked which JavaScript frameworks are properly crawled and indexed by Google. The results were shocking; it turns out Google has a number of problems when crawling and indexing JavaScript-rich websites.

Google managed to index only a few out of multiple JavaScript frameworks tested. And as I proved, indexing content doesn’t always mean crawling JavaScript-generated links.

This got me thinking. If Google is having problems with JavaScript crawling and indexing, how are Google’s smaller competitors dealing with this problem? Is JavaScript going to lead you to full de-indexing in most search engines?

If you decide to deploy a client-rendered website (meaning a browser or Googlebot needs to process the JavaScript before seeing the HTML), you’re not only risking problems with your Google rankings — you may completely kill your chances at ranking in all the other search engines out there.

Google + JavaScript SEO experiment

To see how search engines other than Google deal with JavaScript crawling and indexing, we used our experiment website, http:/jsseo.expert, to check how Googlebot crawls and indexes JavaScript (and JavaScript frameworks’) generated content.

The experiment was quite simple: http://jsseo.expert has subpages with content parsed by different JavaScript frameworks. If you disable JavaScript, the content isn’t visible — i.e. if you go to http://jsseo.expert/angular2/, all the content within the red box is generated by Angular 2. If the content isn’t indexed in Yahoo, for example, we know that Yahoo’s indexer didn’t process the JavaScript.

Here are the results:

As you can see, Google and Ask are the only search engines to properly index JavaScript-generated content. Bing, Yahoo, AOL, DuckDuckGo, and Yandex are completely JavaScript-blind and won’t see your content if it isn’t HTML.

The next step: Can other search engines index JavaScript?

Most SEOs only cover JavaScript crawling and indexing issues when talking about Google. As you can see, the problem is much more complex. When you launch a client-rendered JavaScript-rich website (JavaScript is processed by the browser/crawler to “build” HTML), you can be 100% sure that it’s only going to be indexed and ranked in Google and Ask. Unfortunately, Google and Ask cover only ~64% of the whole search engine market, according to statista.com.

This means that your new, shiny, JavaScript-rich website can cost you ~36% of your website’s visibility on all search engines.

Let’s start with Yahoo, Bing, and AOL, which are responsible for 35% of search queries in the US.

Yahoo, Bing, and AOL

Even though Yahoo and AOL were here long before Google, they’ve obviously fallen behind its powerful algorithm and don’t invest in crawling and indexing as much as Google. One reason is likely the relatively high cost of crawling and indexing the web compared to the popularity of the website.

Google can freely invest millions of dollars in growing their computing power without worrying as much about return on investment, whereas Bing, AOL, and Ask only have a small percentage of the search market.

However, Microsoft-owned Bing isn’t out of the running. Their growth has been quite aggressive over last 8 years:

Unfortunately, we can’t say the same about one of the market pioneers: AOL. Do you remember the days before Google? This video will surely bring back some memories from a simpler time.

If you want to learn more about search engine history, I highly recommend watching Marcus Tandler’s spectacular TEDx talk.

Ask.com

What about Ask.com? How is it possible that Ask, with less than 1% of the market, can invest in crawling and indexing JavaScript? It makes me question if the Ask network is powered by Google’s algorithm and crawlers. It’s even more interesting looking at Ask’s aversion towards Google. There were already some speculations about Ask’s relationship with Google after Google Penguin in 2012, but we can now confirm that Ask’s crawling is using Google’s technology.

DuckDuckGo and Yandex

Both DuckDuckGo and Yandex had no problem indexing all the URLs within http://jsseo.expert, but unfortunately, the only content that was indexed properly was the 100% HTML page (http://jsseo.expert/html/).

Baidu

Despite my best efforts, I didn’t manage to index http://jsseo.expert in Baidu.com. It turns out you need a mainland China phone number to do that. I don’t have any previous experience with Baidu, so any and all help with indexing our experimental website would be appreciated. As soon as I succeed, I will update this article with Baidu.com results.

Going beyond the search engines

What if you don’t really care about search engines other than Google? Even if your target market is heavily dominated by Google, JavaScript crawling and indexing is still in an early stage, as my JavaScript SEO experiment documented.

Additionally, even if crawled and indexed properly, there is proof that JavaScript reliance can affect your rankings. Will Critchlow saw a significant traffic improvement after shifting from JavaScript-driven pages to non-JavaScript reliant.

Is there a JavaScript SEO silver bullet?

There is no search engine that can understand and process JavaScript at the level our modern browsers can. Even so, JavaScript isn’t inherently bad for SEO. JavaScript is awesome, but just like SEO, it requires experience and close attention to best practices.

If you want to enjoy all the perks of JavaScript without worrying about problems like Hulu.com’s JavaScript SEO issues, look into isomorphic JavaScript. It allows you to enjoy dynamic and beautiful websites without worrying about SEO.

If you’ve already developed a client-rendered website and can’t go back to the drawing board, you can always use pre-rendering services or enable server-side rendering. They often aren’t ideal solutions, but can definitely help you solve the JavaScript crawling and indexing problem until you come up with a better solution.

Regardless of the search engine, yet again we come back to testing and experimenting as a core component of technical SEO.

The future of JavaScript SEO

I highly recommend you follow along with how http://jsseo.expert/ is indexed in Google and other search engines. Even if some of the other search engines are a little behind Google, they’ll need to improve how they deal with JavaScript-rich websites to meet the exponentially growing demand for what JavaScript frameworks offer, both to developers and end users.

For now, stick to HTML & CSS on your front-end. :)

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in IM NewsComments Off

Evidence of the Surprising State of JavaScript Indexing

Posted by willcritchlow

Back when I started in this industry, it was standard advice to tell our clients that the search engines couldn’t execute JavaScript (JS), and anything that relied on JS would be effectively invisible and never appear in the index. Over the years, that has changed gradually, from early work-arounds (such as the horrible escaped fragment approach my colleague Rob wrote about back in 2010) to the actual execution of JS in the indexing pipeline that we see today, at least at Google.

In this article, I want to explore some things we’ve seen about JS indexing behavior in the wild and in controlled tests and share some tentative conclusions I’ve drawn about how it must be working.

A brief introduction to JS indexing

At its most basic, the idea behind JavaScript-enabled indexing is to get closer to the search engine seeing the page as the user sees it. Most users browse with JavaScript enabled, and many sites either fail without it or are severely limited. While traditional indexing considers just the raw HTML source received from the server, users typically see a page rendered based on the DOM (Document Object Model) which can be modified by JavaScript running in their web browser. JS-enabled indexing considers all content in the rendered DOM, not just that which appears in the raw HTML.

There are some complexities even in this basic definition (answers in brackets as I understand them):

  • What about JavaScript that requests additional content from the server? (This will generally be included, subject to timeout limits)
  • What about JavaScript that executes some time after the page loads? (This will generally only be indexed up to some time limit, possibly in the region of 5 seconds)
  • What about JavaScript that executes on some user interaction such as scrolling or clicking? (This will generally not be included)
  • What about JavaScript in external files rather than in-line? (This will generally be included, as long as those external files are not blocked from the robot — though see the caveat in experiments below)

For more on the technical details, I recommend my ex-colleague Justin’s writing on the subject.

A high-level overview of my view of JavaScript best practices

Despite the incredible work-arounds of the past (which always seemed like more effort than graceful degradation to me) the “right” answer has existed since at least 2012, with the introduction of PushState. Rob wrote about this one, too. Back then, however, it was pretty clunky and manual and it required a concerted effort to ensure both that the URL was updated in the user’s browser for each view that should be considered a “page,” that the server could return full HTML for those pages in response to new requests for each URL, and that the back button was handled correctly by your JavaScript.

Along the way, in my opinion, too many sites got distracted by a separate prerendering step. This is an approach that does the equivalent of running a headless browser to generate static HTML pages that include any changes made by JavaScript on page load, then serving those snapshots instead of the JS-reliant page in response to requests from bots. It typically treats bots differently, in a way that Google tolerates, as long as the snapshots do represent the user experience. In my opinion, this approach is a poor compromise that’s too susceptible to silent failures and falling out of date. We’ve seen a bunch of sites suffer traffic drops due to serving Googlebot broken experiences that were not immediately detected because no regular users saw the prerendered pages.

These days, if you need or want JS-enhanced functionality, more of the top frameworks have the ability to work the way Rob described in 2012, which is now called isomorphic (roughly meaning “the same”).

Isomorphic JavaScript serves HTML that corresponds to the rendered DOM for each URL, and updates the URL for each “view” that should exist as a separate page as the content is updated via JS. With this implementation, there is actually no need to render the page to index basic content, as it’s served in response to any fresh request.

I was fascinated by this piece of research published recently — you should go and read the whole study. In particular, you should watch this video (recommended in the post) in which the speaker — who is an Angular developer and evangelist — emphasizes the need for an isomorphic approach:

Resources for auditing JavaScript

If you work in SEO, you will increasingly find yourself called upon to figure out whether a particular implementation is correct (hopefully on a staging/development server before it’s deployed live, but who are we kidding? You’ll be doing this live, too).

To do that, here are some resources I’ve found useful:

Some surprising/interesting results

There are likely to be timeouts on JavaScript execution

I already linked above to the ScreamingFrog post that mentions experiments they have done to measure the timeout Google uses to determine when to stop executing JavaScript (they found a limit of around 5 seconds).

It may be more complicated than that, however. This segment of a thread is interesting. It’s from a Hacker News user who goes by the username KMag and who claims to have worked at Google on the JS execution part of the indexing pipeline from 2006–2010. It’s in relation to another user speculating that Google would not care about content loaded “async” (i.e. asynchronously — in other words, loaded as part of new HTTP requests that are triggered in the background while assets continue to download):

“Actually, we did care about this content. I’m not at liberty to explain the details, but we did execute setTimeouts up to some time limit.

If they’re smart, they actually make the exact timeout a function of a HMAC of the loaded source, to make it very difficult to experiment around, find the exact limits, and fool the indexing system. Back in 2010, it was still a fixed time limit.”

What that means is that although it was initially a fixed timeout, he’s speculating (or possibly sharing without directly doing so) that timeouts are programmatically determined (presumably based on page importance and JavaScript reliance) and that they may be tied to the exact source code (the reference to “HMAC” is to do with a technical mechanism for spotting if the page has changed).

It matters how your JS is executed

I referenced this recent study earlier. In it, the author found:

Inline vs. External vs. Bundled JavaScript makes a huge difference for Googlebot

The charts at the end show the extent to which popular JavaScript frameworks perform differently depending on how they’re called, with a range of performance from passing every test to failing almost every test. For example here’s the chart for Angular:

Slide5.PNG

It’s definitely worth reading the whole thing and reviewing the performance of the different frameworks. There’s more evidence of Google saving computing resources in some areas, as well as surprising results between different frameworks.

CRO tests are getting indexed

When we first started seeing JavaScript-based split-testing platforms designed for testing changes aimed at improving conversion rate (CRO = conversion rate optimization), their inline changes to individual pages were invisible to the search engines. As Google in particular has moved up the JavaScript competency ladder through executing simple inline JS to more complex JS in external files, we are now seeing some CRO-platform-created changes being indexed. A simplified version of what’s happening is:

  • For users:
    • CRO platforms typically take a visitor to a page, check for the existence of a cookie, and if there isn’t one, randomly assign the visitor to group A or group B
    • Based on either the cookie value or the new assignment, the user is either served the page unchanged, or sees a version that is modified in their browser by JavaScript loaded from the CRO platform’s CDN (content delivery network)
    • A cookie is then set to make sure that the user sees the same version if they revisit that page later
  • For Googlebot:
    • The reliance on external JavaScript used to prevent both the bucketing and the inline changes from being indexed
    • With external JavaScript now being loaded, and with many of these inline changes being made using standard libraries (such as JQuery), Google is able to index the variant and hence we see CRO experiments sometimes being indexed

I might have expected the platforms to block their JS with robots.txt, but at least the main platforms I’ve looked at don’t do that. With Google being sympathetic towards testing, however, this shouldn’t be a major problem — just something to be aware of as you build out your user-facing CRO tests. All the more reason for your UX and SEO teams to work closely together and communicate well.

Split tests show SEO improvements from removing a reliance on JS

Although we would like to do a lot more to test the actual real-world impact of relying on JavaScript, we do have some early results. At the end of last week I published a post outlining the uplift we saw from removing a site’s reliance on JS to display content and links on category pages.

odn_additional_sessions.png

A simple test that removed the need for JavaScript on 50% of pages showed a >6% uplift in organic traffic — worth thousands of extra sessions a month. While we haven’t proven that JavaScript is always bad, nor understood the exact mechanism at work here, we have opened up a new avenue for exploration, and at least shown that it’s not a settled matter. To my mind, it highlights the importance of testing. It’s obviously our belief in the importance of SEO split-testing that led to us investing so much in the development of the ODN platform over the last 18 months or so.

Conclusion: How JavaScript indexing might work from a systems perspective

Based on all of the information we can piece together from the external behavior of the search results, public comments from Googlers, tests and experiments, and first principles, here’s how I think JavaScript indexing is working at Google at the moment: I think there is a separate queue for JS-enabled rendering, because the computational cost of trying to run JavaScript over the entire web is unnecessary given the lack of a need for it on many, many pages. In detail, I think:

  • Googlebot crawls and caches HTML and core resources regularly
  • Heuristics (and probably machine learning) are used to prioritize JavaScript rendering for each page:
    • Some pages are indexed with no JS execution. There are many pages that can probably be easily identified as not needing rendering, and others which are such a low priority that it isn’t worth the computing resources.
    • Some pages get immediate rendering – or possibly immediate basic/regular indexing, along with high-priority rendering. This would enable the immediate indexation of pages in news results or other QDF results, but also allow pages that rely heavily on JS to get updated indexation when the rendering completes.
    • Many pages are rendered async in a separate process/queue from both crawling and regular indexing, thereby adding the page to the index for new words and phrases found only in the JS-rendered version when rendering completes, in addition to the words and phrases found in the unrendered version indexed initially.
  • The JS rendering also, in addition to adding pages to the index:
    • May make modifications to the link graph
    • May add new URLs to the discovery/crawling queue for Googlebot

The idea of JavaScript rendering as a distinct and separate part of the indexing pipeline is backed up by this quote from KMag, who I mentioned previously for his contributions to this HN thread (direct link) [emphasis mine]:

“I was working on the lightweight high-performance JavaScript interpretation system that sandboxed pretty much just a JS engine and a DOM implementation that we could run on every web page on the index. Most of my work was trying to improve the fidelity of the system. My code analyzed every web page in the index.

Towards the end of my time there, there was someone in Mountain View working on a heavier, higher-fidelity system that sandboxed much more of a browser, and they were trying to improve performance so they could use it on a higher percentage of the index.”

This was the situation in 2010. It seems likely that they have moved a long way towards the headless browser in all cases, but I’m skeptical about whether it would be worth their while to render every page they crawl with JavaScript given the expense of doing so and the fact that a large percentage of pages do not change substantially when you do.

My best guess is that they’re using a combination of trying to figure out the need for JavaScript execution on a given page, coupled with trust/authority metrics to decide whether (and with what priority) to render a page with JS.

Run a test, get publicity

I have a hypothesis that I would love to see someone test: That it’s possible to get a page indexed and ranking for a nonsense word contained in the served HTML, but not initially ranking for a different nonsense word added via JavaScript; then, to see the JS get indexed some period of time later and rank for both nonsense words. If you want to run that test, let me know the results — I’d be happy to publicize them.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in IM NewsComments Off

Google begins mobile-first indexing, using mobile content for all search rankings

While called an ‘experiment,’ it’s actually the first move in Google’s planned shift to looking primarily at mobile content, rather than desktop, when deciding how to rank results.

The post Google begins mobile-first indexing, using mobile content for all search rankings appeared first on Search…



Please visit Search Engine Land for the full article.


Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing

Posted in IM NewsComments Off

Advert