How does Google’s system identify phishing pages?

The Google Online Security blog has described Google’s system for identifying phishing pages. “Of the millions of webpages that our scanners analyze for phishing, we successfully identify 9 out of 10. Our classification system only incorrectly flags a non-phishing site as a phishing site about 1 in 10,000 times, which is significantly better than similar systems. In our experience, these ‘false positive’ sites are usually built to distribute spam or may be involved with other suspicious activity. If your site has been added to our phishing page list (‘Reported Web Forgery!’) by mistake, please report the error to us. On the other hand, if your site has been added to our malware list (‘This site may harm your computer’), you should follow the instructions. Our team tries to address all complaints within one day, and we usually respond within a few hours,” notes Google.

“Our system analyzes a number of webpage features. Starting with a page’s URL, we look to see if there is anything unusual about the host, such as whether the hostname is unusually long or whether the URL uses an IP address to specify the host. If a site purporting to be an American bank runs its servers in a different country and is hosted on a local residential ISP’s network, that is a strong signal that the site is bad. We also look to see if the URL contains any phrases like ‘banking’ or ‘login’ that might indicate the page is trying to steal information. Additionally, we pick the most characteristic terms on a page (as defined by their TF-IDF scores) and look for terms like ‘password’ or ‘PIN number,’ which may also indicate that the page is intended for phishing. Finally, we check the page’s PageRank, and we check the spam reputation of the page’s domain.”
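
To make the URL and page-text signals from the quote more concrete, here is a minimal Python sketch. It is not Google's implementation: the length threshold, the word lists, the function names, and the smoothed IDF formula are all assumptions chosen for illustration, and the PageRank, hosting, and spam-reputation signals are left out because they need data this sketch does not have.

```python
import ipaddress
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical values: the quoted post gives no thresholds or word lists.
MAX_HOSTNAME_LENGTH = 40
SUSPICIOUS_URL_PHRASES = ("banking", "login")


def host_is_ip_address(host):
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url):
    """Compute the three URL signals mentioned in the quote."""
    host = urlparse(url).hostname or ""
    lowered = url.lower()
    return {
        "long_hostname": len(host) > MAX_HOSTNAME_LENGTH,
        "ip_host": host_is_ip_address(host),
        "suspicious_phrase": any(p in lowered for p in SUSPICIOUS_URL_PHRASES),
    }


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_tfidf_terms(page_text, corpus, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    corpus is a list of reference documents used for document frequency.
    The IDF is smoothed (an assumption) so unseen terms do not divide by zero.
    """
    tokens = tokenize(page_text)
    if not tokens:
        return []
    counts = Counter(tokens)
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets)
    scores = {}
    for term, count in counts.items():
        df = sum(1 for words in doc_sets if term in words)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / len(tokens)) * idf
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Calling `url_features("http://192.168.10.5/secure/login.php")` would flag both the IP-address host and the "login" phrase. A real classifier would feed such features, along with the top TF-IDF terms, into a trained model rather than treat any single one as a verdict.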
 

More info: Phishing phree | Will the Real Please Stand Up?