How to Find All Existing and Archived URLs on a Website

There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
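If you'd rather script this step than scrape the interface, the Wayback Machine also exposes a public CDX API you can query directly. Here's a minimal sketch in Python; the domain, limit, and filtering choices are placeholders to adapt to your site.

```python
# Minimal sketch: pull archived URLs for a domain from the Wayback
# Machine's CDX API. "example.com" and the limit are placeholders.
import requests

def wayback_urls(domain: str, limit: int = 10000) -> list[str]:
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": f"{domain}/*",   # everything under the domain
            "output": "json",
            "fl": "original",       # return only the original URL field
            "collapse": "urlkey",   # deduplicate by normalized URL
            "limit": limit,
        },
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()
    return [row[0] for row in rows[1:]]  # first row is the header

urls = wayback_urls("example.com")
print(len(urls), urls[:5])
```

Expect plenty of resource files and malformed entries in the output, so plan on filtering before you merge this list with the others.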

Moz Pro
Although you might typically use a link index to find external sites linking to you, these tools also discover URLs on your site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're managing a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
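For the API route, a short script can page through results. Treat the sketch below as a rough illustration only: the endpoint URL, auth style, and response field names are assumptions based on my reading of Moz's v2 Links API, so verify them against Moz's current documentation before using it.

```python
# Rough sketch: page inbound links out of the Moz Links API.
# ASSUMPTIONS: the v2 endpoint, basic-auth credentials, and the
# "results"/"next_token" response shape are unverified; check Moz's docs.
import requests

MOZ_AUTH = ("YOUR_ACCESS_ID", "YOUR_SECRET_KEY")  # placeholder credentials

def moz_target_urls(site: str, max_pages: int = 10) -> set[str]:
    targets: set[str] = set()
    next_token = None
    for _ in range(max_pages):
        body = {"target": site, "target_scope": "root_domain", "limit": 50}
        if next_token:
            body["next_token"] = next_token
        resp = requests.post("https://lz.moz.com/v2/links",
                             auth=MOZ_AUTH, json=body, timeout=60)
        resp.raise_for_status()
        data = resp.json()
        # Assumed shape: each result names the page on your site being linked to
        targets.update(r.get("target", {}).get("page", "")
                       for r in data.get("results", []))
        next_token = data.get("next_token")
        if not next_token:
            break
    targets.discard("")
    return targets
```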

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.

Google Search Console
Google Search Console offers several valuable tools for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you may need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
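As an example of the API route, here's a minimal Python sketch that pages every page with impressions out of the Search Console API via google-api-python-client. The property URL and dates are placeholders, and OAuth credentials are assumed to be set up already.

```python
# Minimal sketch: page all URLs with impressions out of the
# Search Console API. SITE_URL and the dates are placeholders.
from googleapiclient.discovery import build

SITE_URL = "https://www.example.com/"  # or "sc-domain:example.com"

def gsc_pages(service, start_date: str, end_date: str) -> list[str]:
    pages, start_row = [], 0
    while True:
        resp = service.searchanalytics().query(
            siteUrl=SITE_URL,
            body={
                "startDate": start_date,
                "endDate": end_date,
                "dimensions": ["page"],
                "rowLimit": 25000,   # the API's maximum per request
                "startRow": start_row,
            },
        ).execute()
        rows = resp.get("rows", [])
        pages += [row["keys"][0] for row in rows]
        if len(rows) < 25000:   # last page of results
            return pages
        start_row += 25000

# service = build("searchconsole", "v1", credentials=creds)
# urls = gsc_pages(service, "2024-01-01", "2024-03-31")
```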

Indexing → Pages report:


This section offers exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
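If you outgrow the UI, the GA4 Data API can pull page paths programmatically. Here's a minimal sketch using the google-analytics-data client; the property ID and date range are placeholders, and credentials are assumed to be configured via application default credentials.

```python
# Minimal sketch: pull page paths from the GA4 Data API.
# PROPERTY_ID and the dates are placeholders.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest
)

PROPERTY_ID = "123456789"  # placeholder GA4 property ID

def ga4_page_paths() -> list[str]:
    client = BetaAnalyticsDataClient()
    request = RunReportRequest(
        property=f"properties/{PROPERTY_ID}",
        dimensions=[Dimension(name="pagePath")],
        metrics=[Metric(name="screenPageViews")],
        date_ranges=[DateRange(start_date="2024-01-01", end_date="today")],
        limit=100000,  # mirrors the UI report's row limit
    )
    response = client.run_report(request)
    return [row.dimension_values[0].value for row in response.rows]
```

Note this returns paths, not full URLs, so you'll want to prefix your domain before merging with the other lists.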

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.

Things to consider:

Data size: Log files can be huge, and many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process, and a short script goes a long way (see the sketch below).
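As a starting point, here's a minimal sketch that extracts unique URL paths from an access log in Common/Combined Log Format. The file name is a placeholder, and real logs (gzipped rotations, CDN-specific formats) will need extra handling.

```python
# Minimal sketch: collect unique request paths from an access log.
# "access.log" is a placeholder path; assumes Common/Combined Log Format.
import re

REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

def paths_from_log(logfile: str) -> set[str]:
    paths = set()
    with open(logfile, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = REQUEST_RE.search(line)
            if match:
                paths.add(match.group(1).split("?")[0])  # drop query strings
    return paths

print(len(paths_from_log("access.log")))
```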
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel or Google Sheets; for larger datasets, use a tool like a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
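For the Jupyter route, here's a minimal sketch using pandas. The CSV file names are placeholders for the exports gathered above, each assumed to have a "url" column.

```python
# Minimal sketch: combine the exports, normalize formatting, deduplicate.
# File names are placeholders; each CSV is assumed to have a "url" column.
import pandas as pd

frames = [pd.read_csv(f) for f in
          ["wayback.csv", "moz.csv", "gsc.csv", "ga4.csv", "logs.csv"]]
urls = pd.concat(frames, ignore_index=True)["url"].astype(str)

# Normalize so near-duplicates collapse together
urls = (urls.str.strip()
            .str.replace(r"#.*$", "", regex=True)  # drop fragments
            .str.rstrip("/"))                      # unify trailing slashes

deduped = urls.drop_duplicates().sort_values()
deduped.to_csv("all_urls.csv", index=False, header=["url"])
print(f"{len(deduped)} unique URLs")
```

Depending on your goal, you may want gentler normalization (for example, keeping trailing slashes if your site treats them as distinct pages).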

And voilà: you now have a comprehensive list of current, former, and archived URLs. Good luck!
