Web spam and its countermeasures
by Alex Goh Kwang Leng and Associate Professor Dr. Ashutosh Kumar Singh
In 1991, Tim Berners-Lee launched the World Wide Web with the idea of sharing data across machines that had no common hardware and no common software. Today, his creation has become the platform to share and exchange information. With the recent exponential growth in the mobile communication industry, more users than ever are connecting to the World Wide Web to retrieve information.
According to a survey conducted by Netcraft, an Internet services company, some 629,939,191 websites are scattered across the World Wide Web (Netcraft 2013).
Nowadays, the Web search engine has become the default information retrieval tool for Web users who need to extract relevant information. Searching for relevant data in this information warehouse is a challenging task, however, since the World Wide Web is the largest knowledge repository mankind has ever created.
Traditionally, Web search engines did not take the ranking order of Web documents into serious consideration. They employed computer programmes known as Web crawlers or Web spiders to find and download Web pages, and incorporated another programme to sort them based on information such as domain names, page headings, page titles, anchor text and metadata.
In recent years, Web search engines have incorporated hyperlinks into the ranking mechanism. Authors of Web pages create hyperlinks as references that connect their pages to other Web pages. These referrals provide valuable information about the relationships between documents and about user behaviour. The study of these referrals for information retrieval is commonly known as link analysis.
Link analysis is an emerging technology that tries to comprehend the relationships between Web documents, and thus orders search results according to their importance and relevance to users’ queries.
From this technology, ranking algorithms were developed. The first link analysis algorithm, the RankDex technology developed by Li Yanhong, was incorporated into search engines to measure the quality of websites (“About Rankdex” 1997).
PageRank, developed by Sergey Brin and Larry Page and used in the famous Google search engine, ranks pages according to the probability that a random surfer following links will land on them. Meanwhile, Jon Kleinberg proposed Hyperlink-Induced Topic Search (HITS), which rates Web pages by assigning each page an authority score and a hub score.
Lastly, the Stochastic Approach for Link-Structure Analysis (SALSA), proposed by Lempel and Moran, ranks Web pages by examining random walks on graphs derived from the link structure.
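To make the random-surfer idea behind PageRank more concrete, the sketch below computes ranks for a tiny, made-up link graph in Python. The four pages, the damping factor of 0.85 and the fixed number of iterations are assumptions chosen for illustration only; they are not taken from any particular search engine.

```python
# PageRank sketch: a page's rank is the probability that a "random surfer",
# who mostly follows links but sometimes jumps to a random page, is on it.

# Hypothetical link graph: page -> pages it links to.
# (Every page here has at least one outgoing link; dangling pages
#  would need special handling that is omitted in this sketch.)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85                               # chance the surfer follows a link
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start from a uniform distribution

for _ in range(50):                          # power iteration until ranks settle
    new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share        # pass rank on to each linked page
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph, page C ends up with the highest rank because it receives links from three of the four pages, which is exactly the behaviour spammers later learned to exploit.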
In recent times, content providers have used a lot of tricks to get their pages ranked higher than they deserve, because the position of a page in the search results is highly correlated with the profits of the business behind it. In 2007, the cost of Web spam was estimated at US$100 billion globally, with the United States alone suffering an estimated cost of US$35 billion.
The intention of Web spam is to mislead search engines by boosting pages to undeserved ranks. Consequently, it leads Web users to irrelevant information. This kind of exploitation degrades Web search engines by causing them to return inappropriate or biased query results.
Web spam has been identified as one of the most important challenges facing the Web search engine industry. Search engine companies generally employ human experts who specialise in detecting Web spam, constantly scanning the Web for spamming activities. However, the spam detection process is time-consuming, expensive and difficult to automate.
Gyongyi et al. drew the attention of the anti-Web spam community by writing a comprehensive taxonomy of spamming techniques, covering both boosting and hiding techniques. Boosting techniques are methods that achieve high relevance or importance for pages; hiding techniques do not directly influence search engine rankings, but conceal the boosting techniques from the user’s view. One example is manipulating the colour of anchor text so that it blends into the page background.
Boosting techniques can be further divided into term spamming (also referred to as content spamming) and link spamming, while hiding techniques can be divided into content hiding, cloaking and redirection. Cloaking serves the search engine a different version of a page from the one the Web user sees.
Redirection, on the other hand, automatically sends the Web user to another URL (Uniform Resource Locator) while the current URL is loading.
Content hiding refers to spam terms or links in a Web page that are invisible to the user.
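As a rough illustration of how hiding techniques show up in practice, the sketch below scans a page’s HTML for two naive telltales: an instant meta-refresh redirection and text coloured to match the background. The sample HTML and the simple string patterns are invented for illustration; real spam filters rely on far more robust parsing and many more signals.

```python
import re

# A hypothetical spam page combining two hiding techniques:
# an instant meta-refresh redirection and text hidden by colour matching.
sample_html = """
<html>
  <head><meta http-equiv="refresh" content="0; url=http://example.com/landing"></head>
  <body bgcolor="#ffffff">
    <font color="#ffffff">cheap flights cheap hotels cheap loans</font>
    <p>Welcome to our page.</p>
  </body>
</html>
"""

def find_hiding_signals(html):
    """Return a list of naive hiding-technique signals found in the HTML."""
    signals = []

    # Redirection: an instant meta refresh that sends the visitor elsewhere.
    if re.search(r'http-equiv="refresh"\s+content="0', html, re.IGNORECASE):
        signals.append("instant meta-refresh redirection")

    # Content hiding: text whose colour matches the page background.
    background = re.search(r'bgcolor="(#[0-9a-fA-F]{6})"', html)
    if background:
        same_colour_text = r'(?<!bg)color="' + re.escape(background.group(1)) + '"'
        if re.search(same_colour_text, html, re.IGNORECASE):
            signals.append("text coloured the same as the background")

    return signals

print(find_hiding_signals(sample_html))
# ['instant meta-refresh redirection', 'text coloured the same as the background']
```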
Understanding spamming techniques is important in order to propose appropriate counter-measures.
Trust- or badness-based methods (also known as trust and distrust models) have shown significant results in eliminating Web spam. Initially, these algorithms run a seed selection process, in which a small portion of the Web is selected and evaluated as spam or non-spam to form seed sets. Based on the evaluated seed sets, spam pages are used to propagate distrust for the detection of Web spam, while non-spam pages are used to propagate trust for its demotion.
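The sketch below gives a rough sense of how such seed-based propagation might work. The tiny link graph, the choice of seeds and the decay factor of 0.85 are all assumptions for illustration, loosely following a TrustRank-style propagation rather than any specific published algorithm.

```python
# Rough sketch of seed-based trust and distrust propagation.
# Trust flows forward along outgoing links from non-spam seeds (demotion),
# while distrust flows backwards from spam seeds to the pages linking to
# them (detection). Graph, seeds and decay factor are illustrative only.

links = {
    "good_seed": ["news", "blog"],
    "news": ["blog"],
    "blog": [],
    "spam_seed": ["spam_farm"],
    "spam_farm": ["blog", "spam_seed"],
}

decay = 0.85
pages = list(links)

trust = {p: (1.0 if p == "good_seed" else 0.0) for p in pages}      # non-spam seed
distrust = {p: (1.0 if p == "spam_seed" else 0.0) for p in pages}   # spam seed

for _ in range(10):
    new_trust = dict(trust)
    new_distrust = dict(distrust)
    for page, outlinks in links.items():
        for target in outlinks:
            # a page linked from a trusted page inherits a decayed share of trust
            new_trust[target] = max(new_trust[target],
                                    decay * trust[page] / len(outlinks))
            # a page that links to a distrusted page becomes distrusted itself
            new_distrust[page] = max(new_distrust[page],
                                     decay * distrust[target])
    trust, distrust = new_trust, new_distrust

for p in pages:
    print(f"{p:10s} trust={trust[p]:.2f}  distrust={distrust[p]:.2f}")
```

In this toy graph, the pages reachable from the good seed accumulate trust, while the spam pages end up with high distrust and no trust, which is the intuition behind using trust for demotion and distrust for detection.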
The trust and distrust model can be categorised into two types of algorithms: Web spam detection and Web spam demotion. Both are equally important in combating Web spam.
Demotion can act as a counter-bias that reduces the rank boosts spam pages might gain, whereas detection helps remove spam pages at the earliest stage.
Besides trust- or badness-based methods, machine learning methods have been actively used in recent years to detect Web spam. The machine learning approach in the anti-Web spam community can be divided into two parts: features and structure.
A feature is an individual, measurable attribute of a page, whereas the structure is the machine learning model that takes those features as input for classification. Some of the aforementioned trust- or badness-based algorithms are used as features to help machines learn the underlying patterns of Web spam.
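As a simple illustration of this feature-and-structure split, the sketch below feeds a few hand-picked, hypothetical page features into an off-the-shelf classifier from scikit-learn. The feature values, labels and the choice of a random forest as the “structure” are assumptions for illustration; real systems learn from large labelled corpora with many more features.

```python
from sklearn.ensemble import RandomForestClassifier

# Each row describes a page by hypothetical features, for example:
# [trust score, fraction of words that are popular query terms,
#  outgoing links per word of content]. Values and labels are invented.
train_features = [
    [0.90, 0.05, 0.01],   # reputable news page
    [0.80, 0.10, 0.02],   # personal blog
    [0.05, 0.70, 0.40],   # keyword-stuffed link farm
    [0.10, 0.60, 0.35],   # another spam page
]
train_labels = [0, 0, 1, 1]   # 0 = non-spam, 1 = spam

# The "structure": a machine learning model that maps features to a label.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_features, train_labels)

# Classify a new, unseen page from its feature values.
unseen_page = [[0.15, 0.55, 0.30]]
print("spam" if model.predict(unseen_page)[0] == 1 else "non-spam")
```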
In summary, information retrieval is growing at an exponential pace and link analysis provides an efficient way to sort documents based on their link structures. However, Web spammers also take advantage of link analysis to boost their pages to undeserved ranks.
Such unethical acts are known as Web spamming. Web spamming techniques can be categorised into different groups such as boosting techniques and hiding techniques. The trust and distrust model and machine learning model are the two most well-known models used to counter Web spam.
Alex Goh Kwang Leng is a Master of Philosophy (Computer Science) student at Curtin Sarawak. His research interests include machine learning models, adversarial information retrieval, and learning to rank. Alex Goh can be contacted by e-mail to alex.goh@curtin.edu.my.
Dr. Ashutosh Kumar Singh is an Associate Professor and Head of the Department of Electrical and Computer Engineering at Curtin Sarawak’s School of Engineering and Science. His research interests include verification, synthesis, design and testing of digital circuits. He has published over 70 research papers on these subjects in various conference proceedings and research journals. He co-authored two books, ‘Digital Systems Fundamentals’ and ‘Computer System Organisation and Architecture’, and has delivered talks on computer engineering in several countries including Australia, the United Kingdom and the United States. Dr. Ashutosh can be contacted by e-mail to ashutosh.s@curtin.edu.my.