Tag Archive | "Large"

Google: Large Sites Should Not Use The Google Crawl Setting

Google’s John Mueller said “really large sites” should not use the crawl setting because “setting maximum value is probably too small for what you need.” Larger sites need to go beyond that maximum value in Google Search Console, and that is not possible with the crawl setting.

Search Engine Roundtable

Posted in IM News | Comments Off

Internal Linking & Mobile First: Large Site Crawl Paths in 2018 & Beyond

Posted by Tom.Capper

By now, you’ve probably heard as much as you can bear about mobile first indexing. For me, there’s been one topic that’s been conspicuously missing from all this discussion, though, and that’s the impact on internal linking and previous internal linking best practices.

In the past, there have been a few popular methods for providing crawl paths for search engines — bulky main navigations, HTML sitemap-style pages that exist purely for internal linking, or blocks of links at the bottom of indexed pages. Larger sites have typically used at least two or often three of these methods. I’ll explain in this post why all of these are now looking pretty shaky, and what I suggest you do about it.

Quick refresher: WTF are “internal linking” & “mobile-first,” Tom?

Internal linking is and always has been a vital component of SEO — it’s easy to forget in all the noise about external link building that some of our most powerful tools to affect the link graph are right under our noses. If you’re looking to brush up on internal linking in general, it’s a topic that gets pretty complex pretty quickly, but there are a couple of resources I can recommend to get started:

I’ve also written in the past that links may be mattering less and less as a ranking factor for the most competitive terms, and though that may be true, they’re still the primary way you qualify for that competition.

A great example I’ve seen recently of what happens if you don’t have comprehensive internal linking is eflorist.co.uk. (Disclaimer: eFlorist is not a client or prospective client of Distilled, nor are any other sites mentioned in this post)

eFlorist has local landing pages for all sorts of locations, targeting queries like “Flower delivery in [town].” However, even though these pages are indexed, they’re not linked to internally. As a result, if you search for something like “flower delivery in London,” despite eFlorist having a page targeted at this specific query (which can be found pretty much only through use of advanced search operators), they end up ranking on page 2 with their “flowers under £30” category page:


If you’re looking for a reminder of what mobile-first indexing is and why it matters, these are a couple of good posts to bring you up to speed:

In short, though, Google is increasingly looking at pages as they appear on mobile for all the things it was previously using desktop pages for — namely, establishing ranking factors, the link graph, and SEO directives. You may well have already seen an alert from Google Search Console telling you your site has been moved over to primarily mobile indexing, but if not, it’s likely not far off.

Get to the point: What am I doing wrong?

If you have more than a handful of landing pages on your site, you’ve probably given some thought in the past to how Google can find them and how to make sure they get a good chunk of your site’s link equity. A rule of thumb often used by SEOs is how many clicks a landing page is from the homepage, also known as “crawl depth.”
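To make “crawl depth” concrete: it’s just a breadth-first traversal of your internal link graph, counting the minimum number of clicks from the homepage to each page. As an illustrative sketch (not from the post itself), here’s how you might compute it in Python, assuming you’ve already extracted a page-to-links mapping from a crawl:

```python
from collections import deque

def crawl_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Breadth-first walk of an internal link graph: each page's depth
    is the minimum number of clicks needed to reach it from `home`."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

Pages that never show up in the result at all are orphans like those eFlorist landing pages: indexed, but with no internal crawl path to them.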

Mobile-first indexing impacts this on two fronts:

  1. Some of your links aren’t present on mobile (as is common), so your internal linking simply won’t work in a world where Google is going primarily with the mobile version of your page
  2. If your links are visible on mobile, they may be hideous or overwhelming to users, given the reduced on-screen real estate vs. desktop

If you don’t believe me on the first point, check out this Twitter conversation between Will Critchlow and John Mueller:

In particular, that section I’ve underlined in red should be of concern — it’s unclear how much time we have, but sooner or later, if your internal linking on the mobile version of your site doesn’t cut it from an SEO perspective, neither does your site.

And for the links that do remain visible, an internal linking structure that can be rationalized on desktop can quickly look overbearing on mobile. Check out this example from Expedia.co.uk’s “flights to London” landing page:

Many of these links are part of the site-wide footer, but they vary according to what page you’re on. For example, on the “flights to Australia” page, you get different links, allowing a tree-like structure of internal linking. This is a common tactic for larger sites.

In this example, there’s more unstructured linking both above and below the section screenshotted. For what it’s worth, although it isn’t pretty, I don’t think this is terrible, but it’s also not the sort of thing I can be particularly proud of when I go to explain to a client’s UX team why I’ve asked them to ruin their beautiful page design for SEO reasons.

I mentioned earlier that there are three main methods of establishing crawl paths on large sites: bulky main navigations, HTML-sitemap-style pages that exist purely for internal linking, or blocks of links at the bottom of indexed pages. I’ll now go through these in turn, and take a look at where they stand in 2018.

1. Bulky main navigations: Fail to scale

The most extreme example I was able to find of this is from Monoprice.com, with a huge 711 links in the sitewide top-nav:

Here’s how it looks on mobile:

This is actually fairly usable, but you have to consider the implications of having this many links on every page of your site — this isn’t going to concentrate equity where you need it most. In addition, you’re potentially asking customers to do a lot of work in terms of finding their way around such a comprehensive navigation.

I don’t think mobile-first indexing changes the picture here much; it’s more that this was never the answer in the first place for sites above a certain size. Many sites have tens of thousands of landing pages (or more) to worry about, not hundreds. So simply using the main navigation is not a realistic option, let alone an optimal one, for creating crawl paths and distributing equity in a proportionate or targeted way.

2. HTML sitemaps: Ruined by the counterintuitive equivalence of noindex,follow & noindex,nofollow

This is a slightly less common technique these days, but still used reasonably widely. Take this example from Auto Trader UK:

The idea is that this page is linked to from Auto Trader’s footer, and allows link equity to flow through into deeper parts of the site.

However, there’s a complication: in an ideal world, this page would be “noindex,follow.” It turns out, though, that over time, Google ends up treating “noindex,follow” like “noindex,nofollow.” It’s not 100% clear what John Mueller meant by this, but it does make sense that, given the low crawl priority of “noindex” pages, Google could eventually stop crawling them altogether, causing them to behave in effect like “noindex,nofollow.” Anecdotally, this is also how third-party crawlers like Moz and Majestic behave, and it’s how I’ve seen Google behave with test pages on my personal site.

That means that at best, Google won’t discover new links you add to your HTML sitemaps, and at worst, it won’t pass equity through them either. The jury is still out on this worst case scenario, but it’s not an ideal situation in either case.

So, you have to index your HTML sitemaps. For a large site, this means you’re indexing potentially dozens or hundreds of pages that are just lists of links. It is a viable option, but if you care about the quality and quantity of pages you’re allowing into Google’s index, it might not be an option you’re so keen on.

3. Link blocks on landing pages: Good, bad, and ugly, all at the same time

I already mentioned that example from Expedia above, but here’s another extreme example from the Kayak.co.uk homepage:

Example 1

Example 2

It’s no coincidence that both these sites come from the travel search vertical, where having to sustain a massive number of indexed pages is a major challenge. Just like their competitor, Kayak have perhaps gone overboard in the sheer quantity here, but they’ve taken it an interesting step further — notice that the links are hidden behind dropdowns.

This is something that was mentioned in the post from Bridget Randolph I mentioned above, and I agree so much I’m just going to quote her verbatim:

Note that with mobile-first indexing, content which is collapsed or hidden in tabs, etc. due to space limitations will not be treated differently than visible content (as it may have been previously), since this type of screen real estate management is actually a mobile best practice.

Combined with a more sensible quantity of internal linking, and taking advantage of the significant height of many mobile landing pages (i.e., this needn’t be visible above the fold), this is probably the most broadly applicable method for deep internal linking at your disposal going forward. As always, though, we need to be careful as SEOs not to see a working tactic and rush to push it to its limits — usability and moderation are still important, just as with overburdened main navigations.

Summary: Bite the on-page linking bullet, but present it well

Overall, the most scalable method for getting large numbers of pages crawled, indexed, and ranking on your site is going to be on-page linking — simply because you already have a large number of pages to place the links on, and in all likelihood a natural “tree” structure, by the very nature of the problem.

Top navigations and HTML sitemaps have their place, but lack the scalability or finesse to deal with this situation, especially given what we now know about Google’s treatment of “noindex,follow” tags.

However, the more we emphasize mobile experience, while simultaneously relying on this method, the more we need to be careful about how we present it. In the past, as SEOs, we might have been fairly nervous about placing on-page links behind tabs or dropdowns, just because it felt like deceiving Google. And on desktop, that might be true, but on mobile, this is increasingly going to become best practice, and we have to trust Google to understand that.

All that said, I’d love to hear your strategies for grappling with this — let me know in the comments below!

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Moz Blog

Posted in IM News | Comments Off

China’s New Large Solar-Powered Drone Reaches 20,000 Meters in Altitude

China’s first domestically designed large solar-powered unmanned plane recently reached an altitude above 20,000 meters on a test flight in the country’s northwest.

The drone was developed by the China Academy of Aerospace Aerodynamics (CAAA). Its developers have kept the drone’s exact size secret, but based on earlier prototypes it is believed to be about 14 meters long with a 45-meter wingspan.


Latest solar news

Posted in IM News | Comments Off


Scripting SEO: 5 Panda-Fighting Tricks for Large Sites

Posted by Corey Northcutt

For anyone who's experienced the joys of doing SEO on an exceedingly large site, you know that keeping your content in check isn't easy. Continued iterations of the Panda algorithm have made this fact brutally obvious for anyone responsible for more than a few hundred thousand pages.

As an SEO with a programming background and a few large sites to babysit, I was forced to fight the various Panda updates throughout this year through some creative server-side scripting. I'd like to share some with you now, and in case you're not well-versed in nerdspeak (data formats, programming, and Klingon), I'll start each item with a conceptual problem, the solution (so at least you can tell your developer what to do), and a few code examples for implementation (assumes that they didn't understand you when you told them what to do). My links to the actual code are in PHP/MySQL, but realize that these methods translate pretty simply into most any scenario.

OBLIGATORY DISCLAIMER: Although I've been successful at implementing each of these tricks, be careful. Keep current backups, log everything you do so that you can roll-back, and if necessary, ask an adult for help.

1.) Fix Duplicate Content between Your Own Articles

The Problem

Sure, you know not to copy someone else's content. But what happens when, over time, your users load your database full of duplicate articles (jerks)? You can write some code that checks whether articles are an exact match, but no two are going to be completely identical. You need something that's smart enough to analyze similarity, and you need to be about as clever as Google is at it.

The Solution

There's a sophisticated way to measure how similar two bodies of text are, using something called Levenshtein distance. It measures how many edits would be necessary to transform one string into another, and can be translated into a percentage/ratio of how similar one string is to another. When running this maintenance script on 1 million+ articles of 50-400 words, deleting only duplicate articles with a 90% similarity in Levenshtein ratio, the margin of error was 0 in each of my trials (and the list of deletions was a little scary, to say the least).
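The post's actual code is PHP/MySQL; as an illustrative Python sketch of the same idea, here's a plain Levenshtein distance function and the similarity ratio derived from it:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Levenshtein distance expressed as a 0-1 similarity ratio."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

Two articles with a `similarity()` of 0.9 or above would be flagged as duplicates under the threshold described above.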

The Technical

Levenshtein comparison functions are available in basically every programming language and are pretty simple to use. Running comparisons on 10,000 individual articles against one another all at once is definitely going to make your web/database server angry, however, so it takes a bit of creativity to finish this process while we're all still alive to see your ugly database.

levenshtein distance function

What follows may not be ideal practice, or something you want to experiment with heavily on a live server, but it gets this tough job done in my experience.

  1. Create a new database table where you can store a single INT value (or if this is your own application and you're comfortable doing it, just add a row somewhere for now). Then create one row that has a default value of 0.
  2. Have your script connect to the database, and get the value from the table above. That will represent the primary key of the last article we've checked (since there's no way you're getting through all articles in one run).
  3. Select that article, and check it against all other articles by comparing Levenshtein distance. Doing this in the application layer will be far faster than running comparisons as a database stored procedure (I found the best results occurred when using levenshteinDistance2(), available in the comments section of levenshtein() on php.net). If your database size makes this run like poop through a funnel (checking just 1 article against all others at once), consider only comparing articles by the same author, of similar length, posted in a similar date range, or other factors that might help reduce your data set of likely duplicates.
  4. Handle the duplicates as you see fit. In my case, I deleted the newer entry and stored a log in a new table with full text of both, so individual mistakes could later be reverted (there were none, however). If your database isn't so messy or you still fear mistakes after testing a bit, it may very well be good enough just to store a log and later review them by hand.
  5. After you're done, store the primary key of the last article that you checked in the database entry from step 1. You can loop through steps 2-4 a few more times on this run if it didn't take too long to execute. Run this script as many times as necessary on a one-minute cronjob or with the Windows Task Scheduler until complete, and keep a close eye on your system load.
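The five steps above can be sketched roughly as follows. This is a hypothetical Python/SQLite version (the post's actual code is PHP/MySQL); the table and column names (`articles`, `dedupe_cursor`, `dupes_log`) are invented for illustration, and `difflib.SequenceMatcher` stands in for a true Levenshtein ratio:

```python
import sqlite3
from difflib import SequenceMatcher  # stand-in for a Levenshtein ratio

def run_pass(conn, threshold=0.9, batch=10):
    """Process the next `batch` articles after the stored cursor,
    logging any article that is >= threshold similar to an earlier one,
    then advance the cursor so the next run picks up where we left off."""
    cur = conn.cursor()
    last_id = cur.execute("SELECT last_id FROM dedupe_cursor").fetchone()[0]
    rows = cur.execute(
        "SELECT id, body FROM articles WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch)).fetchall()
    for art_id, body in rows:
        for other_id, other in cur.execute(
                "SELECT id, body FROM articles WHERE id < ?",
                (art_id,)).fetchall():
            if SequenceMatcher(None, body, other).ratio() >= threshold:
                # Log the newer entry as the duplicate, for hand review
                conn.execute(
                    "INSERT INTO dupes_log (newer_id, older_id) VALUES (?, ?)",
                    (art_id, other_id))
                break
        conn.execute("UPDATE dedupe_cursor SET last_id = ?", (art_id,))
    conn.commit()
```

In a real deployment you'd narrow the inner query by author, length, or date range as suggested in step 3, rather than comparing against every earlier article.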

2.) Spell-Check Your Database

The Problem

Sure, it would be best if your users were all above a third grade reading level, but we know that's not the case. You could have a professional editor run through content before it went live on your site, but now it's too late. Your content is now a jumbled mess of broken English, and in dire need of a really mean English teacher to set it all straight.

The Solution

Since you don't have an English teacher, we'll need automation. In PHP, for example, we have fun built-in tools like soundex(), or even levenshtein(), but when analyzing individual words, these just don't cut it. You could grab a list of the most common misspelled English words, but that's going to be hugely incomplete. The best solution that I've found is an open source (free) spell checking tool called the Portable Spell Checker Interface Library (Pspell), which uses the Aspell library and works very well.

The Technical

Once you get it set up, working with Pspell is really simple. After you've installed it using the link above, include the libraries in your code and use this function, which returns an array of suggestions for each word, with the word at array key 0 being the closest match found. Consider the basic logic from 1.) if it looks like too much to tackle at once: increment your place as you step through the database, log all actions in a new table, and (carefully) choose whether you like the results well enough to automate the fixes or if you'd prefer to chase them by hand.

pspell example
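Pspell itself is a PHP extension, so as a rough stand-in for its suggestion API, here's a Python sketch of the same closest-words idea using `difflib.get_close_matches` against a word list. The tiny hard-coded dictionary is purely illustrative; in practice you'd load a full Aspell/Hunspell-style word list:

```python
from difflib import get_close_matches

# Tiny stand-in dictionary; load a real word list in practice.
DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def suggestions(word: str, n: int = 5) -> list[str]:
    """Return up to n dictionary words closest to `word`, best match
    first; a word already in the dictionary comes back as itself."""
    if word.lower() in DICTIONARY:
        return [word]
    return get_close_matches(word.lower(), DICTIONARY, n=n, cutoff=0.6)
```

As with Pspell, treating the first suggestion as "the fix" is the risky automation step — the batching-and-logging caution above applies just as much here.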

3.) Implement rel="canonical" in Bulk

The Problem

link rel="canonical" is a very useful tag for eliminating confusion when two URLs might return the same content, such as when Googlebot makes its way to your site using an affiliate ID. In fact, the SEOmoz automated site analysis will yell at you on every page that doesn't have one. Unfortunately, since this tag is page-specific, you can't just paste some HTML into the static header of your site.

The Solution

As this assumes that you have a custom application, let's say that you can't simply install ALL IN ONE SEO on your WordPress, or install a similar SEO plugin (because if you can, don't re-invent the wheel). Otherwise, we can tailor a function to serve your unique purposes.

The Technical

I've quickly crafted this PHP function with the intent of being as flexible as possible. Note that desired URL structures differ between sites and scripts, so think about everything that's installed under a given umbrella. Use the flags mentioned in the description section so that it can best mesh with the needs of your site.
canonical link function
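As an illustration of what such a function might do (the author's actual PHP function is linked above), here's a Python sketch that strips a hypothetical set of tracking parameters and normalizes the host before emitting the tag. The `STRIP_PARAMS` names are assumptions, not anything from the post:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed never to change page content; extend with
# your own affiliate/tracking keys.
STRIP_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "aff_id"}

def canonical_tag(url: str) -> str:
    """Build a <link rel="canonical"> tag for `url`, dropping tracking
    parameters, lowercasing the host, and normalizing an empty path."""
    scheme, netloc, path, query, _ = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query) if k not in STRIP_PARAMS]
    clean = urlunsplit((scheme, netloc.lower(), path or "/",
                        urlencode(kept), ""))
    return '<link rel="canonical" href="%s" />' % clean
```

Echo the result into the `<head>` of each page template and every affiliate-ID variant points back at one canonical URL.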

4.) Remove Microsoft Word's "Smart Quote" Characters

The Problem

In what could be Microsoft's greatest crime against humanity, MS Word was shipped with a genius feature that automatically "tilts" double and single quotes towards a word (called "smart quotes"), in a style that's sort of like handwriting. You can turn this off, but most don't, and unfortunately, these characters are not a part of the ASCII set. This means that various character sets used on the web and in databases that store them will often fail to present them, and instead, return unusable junk that users (and very likely, search engines) will hate.

The Solution

This one's easy: use find/replace on the database table that stores your articles.

The Technical

Here's an example of how to fix this using MySQL database queries. Place the script on an occasional cron in Linux or the Task Scheduler in Windows, and say goodbye to these characters ever appearing on your site again.

smart quotes mysql
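The same find/replace idea, sketched in Python rather than the MySQL `UPDATE ... REPLACE()` queries the post links to, with the usual smart-punctuation code points mapped to ASCII:

```python
# Map Word's "smart" punctuation to plain ASCII equivalents.
SMART_CHARS = {
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2026": "...", # ellipsis
}

def ascii_quotes(text: str) -> str:
    """Replace each smart-punctuation character with its ASCII stand-in."""
    for smart, plain in SMART_CHARS.items():
        text = text.replace(smart, plain)
    return text
```

Run the same substitutions in SQL (one `REPLACE()` per character) if you'd rather fix the rows in place than filter on output.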

5.) Fix Failed Contractions

The Problem

Your contributors are probably going to make basic grammar mistakes like this all over the map, and Google definitely cares. While it's important never to make too many assumptions, I've generally found that fixing common contractions is very sensible.

The Solution

You can use find/replace here too, but it's not as simple as the smart quotes fix, so you need to be careful. For example, "wed" might need to be "we'd", or it might not. Other contractions might make sense while standing on their own, but a bare find/replace will also match strings that are pieces of other words. So, we need to account for this as well.

The Technical

Note that there are two versions of each word. This is because in my automated proofreading trials, I've found it's common not only for an apostrophe to be omitted, but also for a simple typo to put the apostrophe after the last letter when Word's automated fix isn't on hand. Each word is also surrounded by spaces to eliminate a margin of error (this is key: just look at how many other words include 'dont' on one of those sites people use to cheat at word games). Here's an example of how this works. The list is a bit incomplete and probably leaves the most room for improvement. Feel free to generate your own using this list of English contractions.
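To illustrate the two-versions-per-word idea with whole-word matching (a regex word boundary doing the job of the surrounding spaces), here's a hedged Python sketch. The word list is deliberately tiny and skips ambiguous forms like "wed" and "wont", per the caveat above:

```python
import re

# Common contractions keyed by their broken (apostrophe-less) form.
# Ambiguous forms like "wed"/"wont" are deliberately excluded.
CONTRACTIONS = {"dont": "don't", "didnt": "didn't", "isnt": "isn't",
                "doesnt": "doesn't", "youre": "you're", "theyre": "they're"}

def fix_contractions(text: str) -> str:
    """Repair broken contractions, matching whole words only so that
    substrings of longer words survive untouched."""
    for broken, fixed in CONTRACTIONS.items():
        # misplaced apostrophe first: "dont'" -> "don't"
        text = re.sub(r"\b%s'" % broken, fixed, text)
        # then missing apostrophe: "dont" -> "don't"
        text = re.sub(r"\b%s\b" % broken, fixed, text)
    return text
```

Handling the misplaced-apostrophe form before the missing-apostrophe form matters: the other order would turn "dont'" into "don't'" and leave the stray apostrophe behind.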

That should about do it. I hope everyone enjoyed my first post here on SEOmoz, and hopefully this stirs some ideas on how to clean up some large sites!


SEOmoz Daily SEO Blog

Posted in IM News | Comments Off