SEO audit for websites with 1 million plus pages

As an experienced SEO professional, I’ve often audited websites with more than one million pages. Audits at this scale require some heavy lifting, and the traditional SEO toolset simply cannot handle the amount of data involved.

So where do you start your on-page audit?

There is no right or wrong answer to this. You could have your crawler traverse the website the same way search engine crawlers do, but this might leave pages with wrong canonical tags, hreflang tags or nofollow links out of the audit.

You have to start with the assumption that everything is wrong until proven otherwise.

Therefore, crawl EVERYTHING and archive your results; at this point you are mainly acquiring data for your analysis.

Should I use the client’s sitemap files?

You should use them to add pages to your audit, but not as the primary source of URLs / pages.
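As a rough illustration of treating the sitemap as a secondary source, the sketch below merges sitemap URLs into a crawl result and records which URLs appear in only one of the two. The function names `sitemap_urls` and `merge_sources` and the example URLs are hypothetical, and it assumes a plain `urlset` sitemap rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    }


def merge_sources(crawled, in_sitemap):
    """Combine both URL sources and note where each URL was seen."""
    return {
        "audit_set": crawled | in_sitemap,
        "only_in_sitemap": in_sitemap - crawled,  # possibly orphaned pages
        "only_in_crawl": crawled - in_sitemap,    # missing from the sitemap
    }


# Hypothetical example data.
example_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc>https://example.com/b</loc></url>"
    "</urlset>"
)
example_crawl = {"https://example.com/a", "https://example.com/c"}
result = merge_sources(example_crawl, sitemap_urls(example_sitemap))
```

URLs that only show up in the sitemap are worth a closer look, since the crawler could not reach them through internal links.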

What is the most common SEO audit issue you find?

Indexing issues where clients confuse noindex, nofollow, canonical and robots rules. Also, every client that had rel-alternate tags on its pages had implemented them incorrectly. I’ve also never seen a sitemap file with less than a 2% error rate (keep in mind that some search engines will completely ignore sitemaps with more than 2% errors).
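To make the kind of confusion described above concrete, here is a minimal sketch that flags contradictory indexing signals per page and computes a sitemap error rate. The `PageSignals` record, the function names and the specific rules treated as errors are assumptions for illustration, not a complete rule set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageSignals:
    """Hypothetical per-URL record assembled from crawl data."""
    url: str
    blocked_by_robots: bool
    meta_noindex: bool
    canonical: Optional[str]
    in_sitemap: bool


def find_conflicts(p):
    """Return a list of contradictory indexing signals for one page."""
    issues = []
    if p.blocked_by_robots and p.meta_noindex:
        # A crawler that obeys robots.txt never fetches the page,
        # so it never gets to read the noindex tag.
        issues.append("noindex cannot be read: URL is blocked in robots.txt")
    if p.meta_noindex and p.canonical and p.canonical != p.url:
        issues.append("noindex combined with a canonical pointing elsewhere")
    return issues


def is_sitemap_error(p):
    """Assumption: a listed URL should be indexable and canonical."""
    non_canonical = bool(p.canonical) and p.canonical != p.url
    return p.in_sitemap and (p.meta_noindex or p.blocked_by_robots or non_canonical)


def sitemap_error_rate(pages):
    """Share of sitemap-listed URLs that fail the check above."""
    listed = [p for p in pages if p.in_sitemap]
    if not listed:
        return 0.0
    return sum(1 for p in listed if is_sitemap_error(p)) / len(listed)


pages = [
    PageSignals("https://example.com/a", False, False, "https://example.com/a", True),
    PageSignals("https://example.com/b", True, True, None, True),
    PageSignals("https://example.com/c", False, True, "https://example.com/a", False),
    PageSignals("https://example.com/d", False, False, None, True),
]
rate = sitemap_error_rate(pages)
```

Comparing the resulting rate against the 2% figure mentioned above gives a quick first indication of whether a sitemap needs attention.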

What type of analysis program do you use for this type of task?

I try to keep things simple, as my team should also be able to help me. Therefore I use Microsoft Power BI, as Excel simply cannot handle the amount of data. (Keep in mind that it is not just the crawler data; we often call more than 50 other APIs directly in our audit to add data to our report.)

Things to consider before performing a large on-page SEO audit for your client:

  • How much processing power/time the crawl will cost the client, as we will be requesting a large number of pages
  • Make sure that the client’s web analytics solution does not count your visits
  • Make sure to perform the audit at off-peak times
  • Advise the client’s IT / infrastructure team that the crawl will take place and ask them to whitelist your IP
    • This is because you will be requesting 100 pages / second, and servers often mistake this for a brute-force attack.
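The figures in the list above can be turned into a quick estimate: at 100 pages / second, 1 million pages take 10,000 seconds, or roughly 2.8 hours, which helps when agreeing an off-peak window with the client. The sketch below computes that and shows one simple way to cap a custom crawler's request rate. The `Throttle` class and its parameters are hypothetical, not part of any tool mentioned in this article.

```python
import time


def estimated_crawl_hours(pages, pages_per_second):
    """Lower bound on crawl duration, ignoring retries and slow responses."""
    return pages / pages_per_second / 3600


class Throttle:
    """Space out calls so that at most `rate` requests start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock      # injectable so the class can be tested
        self.sleep = sleep
        self.next_slot = clock()

    def wait(self):
        """Block until the next request is allowed to start."""
        now = self.clock()
        if now < self.next_slot:
            self.sleep(self.next_slot - now)
            now = self.next_slot
        self.next_slot = now + self.interval


hours = estimated_crawl_hours(1_000_000, 100)
```

In a crawler loop you would call `throttle.wait()` immediately before each request; lowering `rate` is the easiest concession to make if the client's infrastructure team is worried about load.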

Tools I use for on-page SEO audit

Screaming Frog

Obviously one of the best crawlers around, but make sure that you increase the amount of memory allocated to the crawler. I use a 32 GB workstation for crawling and have seen it take up more than 20 GB of RAM.

Xenu

Xenu is also a very good tool for finding 404 errors and extracting meta titles and descriptions. It is also very good for performing the first crawl, where you are setting up the overall page map.

Moz, SearchMetrics, Sistrix and SEOlytics

I use all of them to acquire the SERP positions of keywords as well as backlink information. Each one has its own benefits and market focus: Sistrix and SEOlytics are a lot better for DACH markets than SearchMetrics and Moz, whereas SearchMetrics is good for the Nordics and Moz is amazing for the US market.

Whilst these tools are great, they are often very expensive. If you are crawling 1 million plus pages, it is no longer realistic to use such tools. Therefore, I would consider Screaming Frog.

Custom tools

I’ve created a number of custom crawlers for reviewing the first Xenu extract, as well as custom crawlers for identifying rel-alternate issues and mobile deep-linking problems. On top of that, paying an external service for SERP ranking reports on 1 million keywords is simply too expensive, which is why we use our own methods for acquiring this data. We only do so for large projects, though, as it takes a lot of effort from our team before doing it in-house pays off.
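One of the most common rel-alternate problems is a missing return link: page A declares page B as an alternate, but B does not point back to A. The sketch below shows how such a check could look once the annotations have been extracted. It is an illustration under assumed data structures, not the custom crawler described above; `missing_return_links` and the input format are hypothetical.

```python
def missing_return_links(alternates):
    """Find rel-alternate annotations that are not reciprocated.

    alternates: {page_url: {hreflang_code: target_url}}
    (hypothetical structure built from crawl data)
    """
    problems = []
    for page, links in alternates.items():
        for lang, target in links.items():
            if target == page:
                continue  # self-reference needs no return link
            back = alternates.get(target)
            if back is None:
                problems.append((page, lang, target, "target not crawled"))
            elif page not in back.values():
                problems.append((page, lang, target, "no return link"))
    return problems


# Hypothetical example: the German page does not link back to the English one.
example = {
    "https://example.com/en/": {
        "en": "https://example.com/en/",
        "de": "https://example.com/de/",
    },
    "https://example.com/de/": {
        "de": "https://example.com/de/",
    },
}
problems = missing_return_links(example)
```

Because the check only needs a dictionary lookup per annotation, it scales comfortably to millions of pages once the crawl data is in memory or in a key-value store.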

Other general marketing tools

Again, automation and reporting are everything!

As this article is focused on 1 million plus pages, I would recommend using the Power Query instance within Excel, as it can handle considerably more data than a traditional Excel table. Additionally, if you always use the same raw CSV files exported from your crawlers, you can create automated data processing tasks such as cleaning, filtering and building summary tables from those queries. Over time you will build up many automated reports, saving you a considerable amount of time.
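The same idea of a repeatable query over a fixed crawler export can be sketched outside Power Query as well. The Python example below streams a CSV export row by row and builds a summary table of status code classes. The column name `Status Code` and the sample rows are assumptions; adjust them to whatever your crawler actually exports.

```python
import csv
import io
from collections import Counter


def status_summary(csv_file, status_column="Status Code"):
    """Count rows per HTTP status class (2xx, 3xx, ...) in a crawl export.

    Rows are streamed one at a time, so the file never has to fit in memory.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        code = (row.get(status_column) or "").strip()
        if code.isdigit():
            counts[code[0] + "xx"] += 1
        else:
            counts["unknown"] += 1  # empty or malformed status value
    return dict(counts)


# Hypothetical sample standing in for a real export file.
sample = io.StringIO(
    "Address,Status Code\n"
    "https://example.com/,200\n"
    "https://example.com/old,404\n"
    "https://example.com/moved,301\n"
    "https://example.com/page,200\n"
)
summary = status_summary(sample)
```

For a real export you would pass an open file handle instead of the in-memory sample, for example `open("export.csv", newline="", encoding="utf-8")`.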


At the end of the day, all these tools do is give you the data you need for your analysis; only an experienced SEO professional is able to identify the issues, prioritize them and create an action plan that is tailored to the client. A data-driven website audit is the only way to really have an action plan to improve your search visibility. If you would like help in performing an SEO audit on your website, feel free to contact me.

