How to Find All Existing and Archived URLs on a Website


There are several reasons you might need to find all the URLs on a website, and your exact goal will determine what you're looking for. For example, you might want to:

Identify every indexed URL to investigate issues like cannibalization or index bloat
Gather current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors

In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list and then deduplicate the data using a spreadsheet or Jupyter Notebook, depending on your website's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is an invaluable, donation-funded tool for SEO tasks. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the missing export button, use a browser scraping plugin like Dataminer.io. Still, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
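If you'd rather skip the scraping plugin entirely, Archive.org also exposes its index through the Wayback Machine CDX API. Here's a minimal sketch (not an official client; the parameter choices and file name are my assumptions) that pulls archived URLs for a domain and saves the unique ones to a text file:

```python
# Sketch: pull archived URLs for a domain from the Wayback Machine CDX API.
import requests

DOMAIN = "example.com"  # replace with your site

params = {
    "url": f"{DOMAIN}/*",   # match every path under the domain
    "output": "json",
    "fl": "original",       # only return the original URL column
    "collapse": "urlkey",   # fold repeated captures of the same URL
    "limit": 50000,
}

resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params, timeout=60)
resp.raise_for_status()

rows = resp.json()
urls = {row[0] for row in rows[1:]}  # first row is the header
with open("archive_org_urls.txt", "w") as f:
    f.write("\n".join(sorted(urls)))

print(f"Saved {len(urls)} unique archived URLs")
```

The collapse=urlkey parameter folds repeated captures of the same URL into a single row, which keeps the output manageable for larger sites.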

Moz Pro
While you would typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs on your site. If you're dealing with a huge website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
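If you go the CSV export route, a few lines of pandas will pull out the unique on-site URLs. This is a rough sketch; the "Target URL" column name is an assumption, so adjust it to whatever header your Moz Pro export actually uses:

```python
# Sketch: extract the unique target URLs from a Moz Pro inbound links CSV export.
import pandas as pd

df = pd.read_csv("moz_inbound_links.csv")   # placeholder file name

target_urls = (
    df["Target URL"]      # assumed column name; check your export's header
    .dropna()
    .str.strip()
    .drop_duplicates()
    .sort_values()
)

target_urls.to_csv("moz_target_urls.txt", index=False, header=False)
print(f"{len(target_urls)} unique target URLs found")
```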

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Much like Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you may need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
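If the UI export is too small, a short script can page through the Search Console API instead. This is a minimal sketch using the google-api-python-client library; it assumes you've already obtained OAuth credentials with access to the property, and the function name and dates are placeholders:

```python
# Sketch: collect every page with search impressions via the Search Console API.
from googleapiclient.discovery import build

def fetch_gsc_pages(credentials, site_url, start_date, end_date):
    service = build("searchconsole", "v1", credentials=credentials)
    pages, start_row = set(), 0
    while True:
        body = {
            "startDate": start_date,
            "endDate": end_date,
            "dimensions": ["page"],
            "rowLimit": 25000,     # maximum rows per request
            "startRow": start_row,
        }
        response = service.searchanalytics().query(siteUrl=site_url, body=body).execute()
        rows = response.get("rows", [])
        if not rows:
            break
        pages.update(row["keys"][0] for row in rows)
        start_row += len(rows)
    return pages

# Example usage (credentials object omitted here):
# urls = fetch_gsc_pages(creds, "https://example.com/", "2024-01-01", "2024-03-31")
```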

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Better yet, you can apply filters to create separate URL lists, effectively working around the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to your report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
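If you'd rather pull the same page paths programmatically than click through segments, a rough sketch with the GA4 Data API (the google-analytics-data client library) might look like this. The property ID, date range, and metric choice are all placeholder assumptions:

```python
# Sketch: pull page paths from GA4 via the Data API instead of the UI export.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses application default credentials
request = RunReportRequest(
    property="properties/123456789",               # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    limit=100000,
)
response = client.run_report(request)

paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths retrieved")
```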

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Challenges:

Data size: Log files can be huge, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be tricky, but various tools are available to simplify the process.
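As a starting point, a short script can do much of the heavy lifting. This sketch assumes logs in the common Apache/Nginx combined format and simply collects the unique request paths; adapt the regex if your server or CDN uses a different layout:

```python
# Sketch: extract unique request paths from an access log in combined format.
import re
from urllib.parse import urlsplit

# '"GET /blog/post-1?utm=x HTTP/1.1"' -> capture the requested path
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[^"]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            # strip query strings so /page?a=1 and /page?a=2 collapse together
            paths.add(urlsplit(match.group(1)).path)

with open("log_paths.txt", "w") as out:
    out.write("\n".join(sorted(paths)))

print(f"{len(paths)} unique paths found in the log")
```
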
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
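In a Jupyter Notebook, the combining and deduplication step might look something like this. It's a sketch that assumes each source was saved as a plain-text file with one URL per line (the file-name pattern is just an example):

```python
# Sketch: merge, normalize, and deduplicate URL lists collected from each source.
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase scheme and host, drop fragments, and trim trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

all_urls = set()
for source in Path(".").glob("*_urls.txt"):   # e.g. archive_org_urls.txt, moz_target_urls.txt
    for line in source.read_text().splitlines():
        if line.strip():
            all_urls.add(normalize(line))

Path("all_urls_deduped.txt").write_text("\n".join(sorted(all_urls)))
print(f"{len(all_urls)} unique URLs across all sources")
```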

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
