What is content? Content can be many things: a file, a video, a picture, a backup, or a website feature. When we talk about content discovery, we’re not talking about the obvious things we can see on a website; we mean the things that aren’t immediately presented to us and that weren’t necessarily intended for public access.
This content could be, for example, pages or portals intended for staff usage, older versions of the website, backup files, configuration files, administration panels, etc.
There are three main ways of discovering content on a website, which we’ll cover: manual, automated, and OSINT (Open-Source Intelligence).
Manual Discovery – Robots.txt
The robots.txt file is a document that tells search engine crawlers which pages they are and aren’t allowed to crawl, and so show in their search results; it can also ban specific search engines from crawling the website altogether. It is common practice to restrict certain website areas so they aren’t displayed in search engine results. These may be areas such as administration portals or files meant for the website’s customers. For us as penetration testers, this file is a convenient list of locations on the website that the owners don’t want discovered.
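As a rough illustration of how those restricted locations can be pulled out of the file, here is a minimal Python sketch. The function name `disallowed_paths` and the sample robots.txt content are hypothetical; the sketch only reads `Disallow` lines and ignores which user agent each rule applies to.

```python
def disallowed_paths(robots_text: str) -> list[str]:
    """Return the unique Disallow paths from robots.txt content, in file order."""
    paths = []
    for raw_line in robots_text.splitlines():
        # Drop comments (everything after '#') and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow value means "nothing is disallowed", so skip it.
        if field.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths


# Hypothetical robots.txt content, used only for illustration.
sample = """\
User-agent: *
Allow: /
Disallow: /staff-portal   # internal use
Disallow: /backups/
Disallow:
"""
print(disallowed_paths(sample))
```

Each returned path is a candidate to visit manually in the browser.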
Unlike the robots.txt file, which restricts what search engine crawlers can look at, the sitemap.xml file gives a list of every file the website owner wishes to be listed on a search engine. It can sometimes include areas of the website that are a bit more difficult to navigate to, or even list old webpages that the current site no longer uses but that still work behind the scenes.
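To show what reading a sitemap might look like in practice, the sketch below collects every `<loc>` entry from sitemap XML using Python's standard library. The function name `sitemap_urls` and the sample document are assumptions made for illustration; real sitemaps may also be index files pointing at further sitemaps, which this sketch does not follow.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Tags are reported as '{namespace}loc'; compare only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


# Hypothetical sitemap content, used only for illustration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://target.com/</loc></url>
  <url><loc>http://target.com/old-news/2019</loc></url>
</urlset>
"""
print(sitemap_urls(sample))
```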
When we make requests to the web server, the server returns various HTTP headers. These headers can sometimes contain useful information such as the webserver software and possibly the programming/scripting language in use. For example, the headers might reveal that the webserver is NGINX version 1.18.0 and runs PHP version 7.4.3. Using this information, we could check whether the versions in use have known vulnerabilities. Try running curl against the web server with the -v switch, which enables verbose mode and outputs the headers (there might be something interesting!).
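The same check can be scripted. The following sketch, with the hypothetical helpers `fetch_headers` and `fingerprint`, picks out the two headers that most often reveal server software and language. Which headers a given server actually sends is an assumption; many servers hide or alter them.

```python
import urllib.request

# Headers that commonly (but not always) reveal software and versions.
FINGERPRINT_HEADERS = ("Server", "X-Powered-By")


def fetch_headers(url: str, timeout: float = 5.0) -> dict[str, str]:
    """Request the URL and return its response headers as a plain dict.

    Note: urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return dict(response.headers.items())


def fingerprint(headers: dict[str, str]) -> dict[str, str]:
    """Return only the fingerprinting headers, matched case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        name: lowered[name.lower()]
        for name in FINGERPRINT_HEADERS
        if name.lower() in lowered
    }


# Hypothetical headers matching the versions mentioned in the text.
sample = {"server": "nginx/1.18.0", "X-Powered-By": "PHP/7.4.3", "Content-Type": "text/html"}
print(fingerprint(sample))
```

Against a live target, `fingerprint(fetch_headers("http://target.com/"))` would perform the request first; only do this against systems you are authorized to test.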
OSINT – Google Hacking / Dorking
Google hacking / Dorking utilizes Google’s advanced search engine features, which allow you to pick out custom content. For instance, you can pick out results from a certain domain name using the site: filter (site:target.com). You can then combine this with search terms, such as the word admin (site:target.com admin), which would only return results from the target.com website that contain the word admin in their content. You can combine multiple filters as well. Here are some more filters you can use:
|Filter|Example|Description|
|---|---|---|
|site|site:target.com|returns results only from the specified website address|
|inurl|inurl:admin|returns results that have the specified word in the URL|
|filetype|filetype:pdf|returns results which are a particular file extension|
|intitle|intitle:admin|returns results that contain the specified word in the title|
More information about Google hacking can be found here: https://en.wikipedia.org/wiki/Google_hacking
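Combining filters is just string building, which the short sketch below illustrates. The helper names `build_dork` and `google_search_url` are hypothetical, and the search URL format is assumed to be the ordinary `q=` query parameter.

```python
from urllib.parse import quote_plus


def build_dork(terms: str = "", **filters: str) -> str:
    """Join filter:value pairs and free-text terms into one search query."""
    parts = [f"{name}:{value}" for name, value in filters.items()]
    if terms:
        parts.append(terms)
    return " ".join(parts)


def google_search_url(query: str) -> str:
    """URL-encode the query so it can be opened in a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = build_dork("admin", site="target.com")
print(query)                      # site:target.com admin
print(google_search_url(query))

# Several filters can be combined in one query.
print(build_dork(site="target.com", filetype="pdf", intitle="admin"))
```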
Wappalyzer (https://www.wappalyzer.com/) is an online tool and browser extension that helps identify what technologies a website uses, such as frameworks, Content Management Systems (CMS), payment processors and much more, and it can even find version numbers as well.
The Wayback Machine (https://archive.org/web/) is a historical archive of websites that dates back to the late 90s. You can search a domain name, and it will show you all the times the service scraped the web page and saved the contents. This service can help uncover old pages that may still be active on the current website.
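Besides the web interface, the Wayback Machine exposes a CDX query endpoint that can list archived URLs for a domain. The sketch below builds such a query and parses a JSON response; the helper names are hypothetical, the parameter choices (`fl`, `collapse`, `limit`) are one possible configuration, and the sample response is made up for illustration.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str, limit: int = 50) -> str:
    """Build a CDX query listing archived URLs under the given domain."""
    params = {
        "url": f"{domain}/*",          # match every path under the domain
        "output": "json",              # ask for JSON instead of plain text
        "fl": "original,timestamp",    # fields to return
        "collapse": "urlkey",          # one row per distinct URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(body: str) -> list[dict[str, str]]:
    """Turn the JSON rows (first row is the header) into dictionaries."""
    rows = json.loads(body)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Made-up response body in the header-row-first shape.
sample = '[["original","timestamp"],["http://target.com/old-login","20150101000000"]]'
print(cdx_query_url("target.com"))
print(parse_cdx_json(sample))
```

Any archived path returned this way is worth requesting against the live site to see whether it still responds.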
What is Automated Discovery?
Automated discovery is the process of using tools to discover content rather than doing it manually. The process is automated because it usually involves hundreds, thousands or even millions of requests to a web server. These requests check whether a file or directory exists on a website, giving us access to resources we didn’t previously know existed. This process is made possible by a resource called a wordlist.
What are wordlists?
Wordlists are just text files that contain a long list of commonly used words, and they can cover many different use cases. For example, a password wordlist would include the most frequently used passwords; since we’re looking for content, we need a list containing the most commonly used directory and file names. An excellent resource for wordlists, preinstalled on the THM AttackBox, is SecLists (https://github.com/danielmiessler/SecLists), which Daniel Miessler curates.
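To make the idea concrete, the sketch below turns wordlist entries into candidate URLs, one per line of the file. The function name `candidate_urls` is hypothetical, and skipping lines that start with `#` is an assumption that suits wordlists with comment headers.

```python
def candidate_urls(base_url: str, wordlist_lines) -> list[str]:
    """Combine a base URL with each non-empty wordlist entry."""
    base = base_url.rstrip("/") + "/"
    urls = []
    for line in wordlist_lines:
        word = line.strip()
        # Skip blank lines and comment lines (an assumption about the format).
        if not word or word.startswith("#"):
            continue
        urls.append(base + word.lstrip("/"))
    return urls


# A tiny in-memory wordlist; a real one would be read with open(path).
words = ["admin\n", "\n", "# common names", "backup.zip\n", "/robots.txt\n"]
for url in candidate_urls("http://target.com", words):
    print(url)
```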
Although there are many different content discovery tools available, each with its own features and flaws, we’re going to cover three that are preinstalled on the AttackBox: ffuf, dirb and gobuster.
$ ffuf -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt -u http://target.com/FUZZ
$ dirb http://target.com/ /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
$ gobuster dir --url http://target.com/ -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
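Conceptually, these tools all request one URL per wordlist entry and report the ones that do not come back as "not found". The sketch below is a deliberately simplified illustration of that loop, not how ffuf, dirb or gobuster are actually implemented; the names `http_status` and `discover` are hypothetical, and treating only 404 as "missing" is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to the URL.

    urlopen follows redirects, so a redirect is reported as its final status.
    Connection failures (urllib.error.URLError) are not handled here.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses arrive as exceptions but still carry a code.
        return error.code


def discover(
    base_url: str,
    words: Iterable[str],
    get_status: Callable[[str], int] = http_status,
    ignore: frozenset = frozenset({404}),
) -> list[tuple[str, int]]:
    """Probe base_url/word for each word and keep the non-ignored results."""
    found = []
    for word in words:
        url = base_url.rstrip("/") + "/" + word.strip()
        status = get_status(url)
        if status not in ignore:
            found.append((url, status))
    return found


# Offline demonstration with a fake status lookup instead of real requests.
fake_site = {"http://target.com/admin": 200, "http://target.com/private": 403}
print(discover("http://target.com/", ["admin", "nope", "private"],
               get_status=lambda url: fake_site.get(url, 404)))
```

Passing the status function as a parameter keeps the loop testable without network access; with the default `http_status`, it would send real requests, so it should only be pointed at systems you are authorized to test.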
Using the information we gather, we can find confidential files, server IP addresses, and so on, any of which can amount to an information disclosure vulnerability for the company.