March 30th, 2010 | Published in Google Online Security
To help protect you from a wide array of Internet scams you may encounter while searching, we analyze millions of webpages daily for phishing behavior. Each year, we find hundreds of thousands of phishing pages and add them to our list that we use to directly warn users of Firefox, Safari, and Chrome via our SafeBrowsing API. How do we find all these phishing pages? We certainly don’t do it by hand!
Instead, we have taught computers to look for certain telltale signs, as we describe in a paper that we recently presented at the 17th Annual Network and Distributed System Security Symposium. In a nutshell, our system looks at many of the same elements of a webpage that you would check to help evaluate whether the page is designed for phishing. If our system determines that a page is being used for phishing, it will automatically produce a warning for all of our users who try to visit that page. This scalable design enables us to review all these potentially “phishy” pages in about a minute each.
What we look for
Our system analyzes a number of webpage features to help make a verdict about whether a site is a phishing site. Starting with a page’s URL, we look to see if there is anything unusual about the host, such as whether the hostname is unusually long or whether the URL uses an IP address to specify the host. We also look to see if the URL contains any phrases like “banking” or “login” that might indicate that the page is trying to steal information.
We don’t just look at the URL, though. After all, a perfectly legitimate site could certainly use words like “banking” or “login.” We collect a snapshot of the page’s content to examine it closely for phishing behavior. For example, we check to see if the page has a password field or whether most of the links point to a common phishing target, as both of these characteristics can be a sign of phishing. Additionally, we pick out some of the most characteristic terms that show up on a page (as defined by their TF-IDF scores), and look for terms like “password” or “PIN number,” which also may indicate that the page is intended for phishing.
We also check the page’s hosting information to find out which networks host the page and where the page’s servers are located geographically. If a site purporting to be an American bank runs its servers in a different country and is hosted on a local residential ISP’s network, we have a strong signal that the site is bad.
Finally, we check the page’s PageRank to see if the page is popular or not, and we check the spam reputation of the page’s domain. We discovered in our research findings that almost all phishing pages are found on domains that almost exclusively send spam. You can observe this trend in the CCDF graph of the spam reputation scores for phishing pages as compared to the graph of other, non-phishing pages.
How we learn to recognize phishing pages
We use a sample of the data that our system generates to train the classifier that lies at the core of our automatic system using a machine learning algorithm. Coming up with good labels (phishing/not phishing) for this data is tricky because we can’t label each of the millions of pages ourselves. Instead, we use our published phishing page list, largely generated by our classifier, to assign labels for the training data.
You might be wondering if this system is going to lead to situations where the classifier makes a mistake, puts that mistake on our list, and then uses the list to learn to make more mistakes. Fortunately, the chain doesn’t make it that far. Our classifier only makes a relatively small number of mistakes, which we can correct manually when you report them to us. Our learning algorithms can handle a few temporary errors in the training labels, and the overall learning process remains stable.
How well does this work?
Of the millions of webpages that our scanners analyze for phishing, we successfully identify 9 out of 10 phishing pages. Our classification system only incorrectly flags a non-phishing site as a phishing site about 1 in 10,000 times, which is significantly better than similar systems. In our experience, these “false positive” sites are usually built to distribute spam or may be involved with other suspicious activity. While phishers are constantly changing their strategies, we find that they do not change them enough to reliably escape our system. Our experiments showed that our classification system remained effective for over a month without retraining.
If you are a webmaster and would like more information about how to keep your site from looking like a phishing site, please check out our post on the Webmaster Central Blog. If you find that your site has been added to our phishing page list ("Reported Web Forgery!") by mistake, please report the error to us.