Proceedings of The 2nd International Conference on Innovative Research in Science and Engineering
A Multi-Threaded Algorithm for Mining the Arabic Web Structure
Mohammad A.R. Abdeen and Sami Albouq
The World Wide Web has become the default destination for acquiring information in just about any knowledge domain. This knowledge covers a wide spectrum, from the latest technological advances in medical procedures to the most effective drugs, the most popular sports cars, and personal accessories. The way the web is structured embodies implicit information that turns out to be useful in information retrieval and in search engine applications. This paper presents an initial attempt at mining the structure of the Arabic portion of the web, with the aim of improving search engine results in the Arabic webspace. We present a multi-threaded algorithm that crawls the web and collects interlink information on as many as one million Arabic websites. The algorithm provides an initial analysis of the most “cited” websites and presents a list of the top 20 websites, using the website citation count as the ranking measure. It is worth noting that the top-ranked websites do not belong to specific categories such as sports, news, or lifestyle. Rather, the websites at the top of the list are mostly directory websites. Some news websites appear in the top 20 list, although not at the very top. Other categories are also represented, such as technology, although by only one website at the bottom of the list.
Keywords: Web Structure Mining, Web Graph, Multi-threading, Web Crawling, Arabic, Text Processing.
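As a rough illustration of the ranking idea summarized in the abstract, the following minimal sketch counts citations per website using a pool of worker threads. It is not the authors' implementation. The in-memory `TOY_WEB` data, the `fetch_outlinks` helper, and the choice to count each citing website only once per target are all assumptions made for the example, and it processes a fixed list of pages rather than crawling.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

# Toy stand-in for the web: page URL -> outgoing link URLs (hypothetical data).
TOY_WEB = {
    "http://a.example/": ["http://b.example/x", "http://c.example/"],
    "http://b.example/x": ["http://c.example/", "http://c.example/about"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": ["http://c.example/", "http://b.example/x"],
}

def fetch_outlinks(url):
    """Hypothetical fetcher; a real crawler would download and parse the page."""
    return TOY_WEB.get(url, [])

def site_of(url):
    """Reduce a page URL to its host name, used here as the website identity."""
    return urlparse(url).netloc

def citing_pairs(url):
    """Return the (source_site, target_site) pairs found on one page."""
    src = site_of(url)
    # Links within the same website are ignored (an assumption of this sketch).
    return {(src, site_of(t)) for t in fetch_outlinks(url) if site_of(t) != src}

def rank_by_citations(page_urls, top_n=20, workers=4):
    """Process pages in parallel and rank websites by distinct citing websites."""
    pairs = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results are merged in the main thread, so no lock is needed.
        for page_pairs in pool.map(citing_pairs, page_urls):
            pairs |= page_pairs
    counts = Counter(target for _, target in pairs)
    return counts.most_common(top_n)

if __name__ == "__main__":
    for site, citations in rank_by_citations(list(TOY_WEB)):
        print(site, citations)
```

Counting distinct citing websites rather than raw links keeps one heavily self-repeating source from inflating a target's rank. The abstract does not say which variant the paper uses, so this choice is illustrative only.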