A website is a set of interrelated web pages sharing a single domain, usually interlinked by hyperlinks and distinguished from other sites by its unique content. This definition does not account for sub-site and sub-domain relationships. For instance, the St. Petersburg State University site www.spbu.ru contains a sub-site for the Faculty of Applied Mathematics and Control Processes, apmath.spbu.ru. This paper does not take such relationships into account, and the two sites are treated as distinct. There are several reasons why this approach is justified, the main one being that sub-sites are usually administered by different people. These groups may administer different sections or departments, and because the departments serve different functions, they pursue different goals.
To build a website model, several details are required: the site's documents, the hyperlinks connecting them, and well-defined functions for the individual HTML pages. Webometrics is the science commonly used to analyze website structure. It relies on web crawlers, programs that scan websites and gather various kinds of information. Crawlers are valuable because they not only provide useful statistics but can also be used to preserve site resources (Pant et al., 2004). Different crawlers serve different functions: the JSpider crawler, for example, can check a website for errors, analyze outgoing and internal links, build sitemaps, and download complete websites.
Although JSpider has been shown to be a useful program, the preferred tool here is the RCC (Rapid Configurable Crawler). The choice of the RCC crawler stems from the constant need for improved web crawlers: interpreting the accumulated information leads to new tasks and corresponding functions, and a lack of documentation sometimes makes those tasks and functions harder to implement. The RCC crawler can be configured for specific needs and takes particular research approaches into account. Before settling on a crawler, researchers weigh several factors, including crawling convenience and the type of results required. Crawlers differ in structure and in the quality of their results, so researchers choose the crawler best aligned with their needs.
The RCC crawler has notable capabilities. It navigates a site in a specific order, starting at the homepage and following the internal hyperlinks level by level, a process known as breadth-first search (Baeza-Yates and Castillo, 2004). The crawler also lets the researcher set the depth of a scan. Once it finishes navigating the site, it produces CSV files containing lists of the discovered internal hyperlinks. Pechnikov and Lankin (2016) discuss the RCC crawler's functionality and structure, the various ways it can be used, and several examples and illustrations of its usage.
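The traversal described above, breadth-first from the homepage with a configurable depth limit, emitting the discovered hyperlinks as CSV rows, can be sketched as follows. This is a minimal illustration, not the RCC crawler's actual implementation: the in-memory toy site and all page paths are assumptions standing in for fetched HTML.

```python
from collections import deque
import csv
import io

# Toy in-memory "site": page -> outgoing internal links.
# A real crawler would obtain these edges by fetching and parsing HTML.
SITE = {
    "/": ["/about", "/depts"],
    "/about": ["/"],
    "/depts": ["/depts/math", "/depts/cs"],
    "/depts/math": ["/", "/depts"],
    "/depts/cs": ["/"],
}

def bfs_crawl(site, start="/", max_depth=2):
    """Breadth-first traversal from the homepage, honouring a depth limit."""
    seen = {start: 0}          # page -> depth at which it was discovered
    order = [start]            # pages in the order they were visited
    edges = []                 # discovered hyperlinks (source, target)
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = seen[page]
        for link in site.get(page, []):
            edges.append((page, link))
            if link not in seen and depth < max_depth:
                seen[link] = depth + 1
                order.append(link)
                queue.append(link)
    return order, edges

order, edges = bfs_crawl(SITE)
# Dump the discovered hyperlinks as CSV rows, one arc per line.
buf = io.StringIO()
csv.writer(buf).writerows(edges)
```

Because the queue is first-in first-out, all level-one pages are visited before any level-two page, which is exactly the level-by-level order the paper attributes to the crawler.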
The webgraph of a website can be constructed from its HTML pages, internal hyperlinks, and web documents, information the RCC crawler typically provides. A webgraph consists of a set of nodes and a set of arcs: the HTML pages and other documents form the nodes, and the discovered hyperlinks form the arcs. A webgraph has three properties: it contains a dedicated source node (the index page), its level structure can be determined, and it is usually a connected graph. The level of a node is its distance from the homepage.
Broder et al. (2000) discuss the laws governing such graphs; for instance, the source node of a webgraph is always at level zero. Moving one step from the source node reaches the level-one nodes, moving two steps reaches the level-two nodes, and in general the number of steps from the source corresponds to the level of the nodes reached. This connectivity holds for all webgraphs built from a set of HTML pages and web documents, and it allows subsequent pages to be generated simply by analyzing the previous ones. Some nodes correspond to documents rather than pages; these serve as leaf nodes, which have incoming arcs but no outgoing ones. It is worth noting, however, that not all webgraphs are strongly connected: graphs containing leaf nodes usually are not, while webgraphs built solely from HTML pages usually are. A strongly connected graph is one in which every page can be reached from every other page, which in practice requires links leading back toward the homepage.
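Both properties above, the level structure and the failure of strong connectivity in the presence of leaf nodes, can be checked mechanically. The sketch below is illustrative (the toy arcs and the document node "/doc.pdf" are assumptions): levels come from BFS distances, and strong connectivity is tested by reachability in the graph and in its reverse.

```python
from collections import deque

# Toy webgraph: "/doc.pdf" is a leaf node with incoming arcs only.
ARCS = {
    "/": ["/a", "/b"],
    "/a": ["/", "/doc.pdf"],
    "/b": ["/"],
    "/doc.pdf": [],
}

def levels(graph, source="/"):
    """Level of a node = its BFS distance from the source (homepage)."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def reachable(graph, source):
    """All nodes reachable from `source` by following arcs."""
    seen = {source}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                q.append(v)
    return seen

def strongly_connected(graph, source="/"):
    """True iff every node reaches the source and is reached from it."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    rev = {n: [] for n in nodes}
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)   # reverse every arc
    return reachable(graph, source) == nodes and reachable(rev, source) == nodes
```

On this toy graph the document node sits at level two, and the graph is not strongly connected precisely because nothing leads out of the leaf; a graph of HTML pages only, such as `{"/": ["/a"], "/a": ["/"]}`, passes the check.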
During research, webgraphs usually need to be analyzed. One program that can perform such analysis is Gephi, which can handle virtually any kind of graph, supports various visualizations and drawings, and calculates site characteristics such as graph density, modularity, and PageRank values. When creating a website, it is important not to lose sight of the original objectives. In webometrics, a correctness criterion is used to avoid such methodological problems: before building a site, one should formulate a set of rules and objectives, since these form the basis of the compliance criteria. The correctness criterion an author adopts depends on experience, preferences, and research conducted on real websites.
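Of the characteristics listed, graph density is the simplest: for a directed webgraph it is the ratio of arcs present to the n(n-1) arcs possible among n nodes. A minimal illustration (the node and arc counts are made up):

```python
def graph_density(n_nodes, n_arcs):
    """Directed graph density: arcs present / arcs possible, n*(n-1)."""
    return n_arcs / (n_nodes * (n_nodes - 1))

# e.g. a webgraph of 4 pages joined by 6 hyperlinks
d = graph_density(4, 6)
```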
Different pages on the web differ in value or usefulness, and the value or quality of a page can be determined with a tool known as PageRank (Babak and Babak, 2009). Pages on a website serve different functions and therefore carry different degrees of importance. PageRank uses a site's internal structure to determine the importance of its pages, calculating a PR value for each node of the webgraph; pages with the highest PR values are regarded as the most valuable. This is one of the most common correctness criteria used when developing the structure of a website. The homepage usually stands out as one of the most valuable pages, along with pages corresponding to large sections of the website. In this paper, these valuable pages are referred to as valuable nodes.
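PageRank itself can be sketched with power iteration over the webgraph. The sketch below is a textbook version under standard assumptions, not the paper's own implementation: the damping factor 0.85 is the conventional choice, rank from dangling nodes is redistributed uniformly, and the toy graph is illustrative.

```python
def pagerank(graph, d=0.85, iters=100):
    """Power-iteration PageRank over a dict graph: node -> outgoing links."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}          # uniform starting vector
    for _ in range(iters):
        # Rank held by dangling nodes (no outgoing links) is spread evenly.
        dangling = sum(pr[u] for u in nodes if not graph.get(u))
        new = {u: (1 - d) / n + d * dangling / n for u in nodes}
        for u in nodes:
            for v in graph.get(u, []):
                new[v] += d * pr[u] / len(graph[u])
        pr = new
    return pr

# Toy site: both inner pages link back to the homepage,
# so the homepage accumulates the most rank.
GRAPH = {
    "/": ["/news", "/about"],
    "/news": ["/"],
    "/about": ["/"],
}
pr = pagerank(GRAPH)
```

On this graph the homepage ends up with the highest PR, matching the expectation that the most valuable node is the index page.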
On most sites, PR values decrease as one moves down the level structure, though this pattern is more common in modern websites than in earlier ones. There are, however, unusual cases where valuable nodes exhibit low PR values, and others where non-valuable nodes exhibit high ones. When this happens, the correctness criterion is considered violated, and the issue needs to be corrected through an elaborate procedure. Control measures for rectifying such an error include removing and/or adding specific pages or internal hyperlinks. The suitability of a given control action depends on the desired results and on the convenience of using one action over another; more research is needed on the applicability of the various control actions and measures.
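One way to operationalize the correctness criterion above is to flag every pair of nodes where a deeper node outranks a shallower one. The PR values, levels, and page names below are purely illustrative assumptions; in practice they would come from a PageRank computation and the webgraph's level structure.

```python
# Hypothetical analysis results: PageRank value and level per page.
PR = {"/": 0.38, "/depts": 0.22, "/depts/cs": 0.05, "/old-page": 0.30}
LEVEL = {"/": 0, "/depts": 1, "/depts/cs": 2, "/old-page": 2}

def violations(pr, level):
    """Pairs (deeper, shallower) where the deeper node outranks the shallower."""
    out = []
    for u in pr:
        for v in pr:
            if level[u] < level[v] and pr[u] < pr[v]:
                out.append((v, u))
    return out

bad = violations(PR, LEVEL)
```

Here the level-two page `/old-page` outranks the level-one section page `/depts`, the kind of anomaly that would prompt a control action such as removing or re-linking the offending page.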
Websites are usually built to match the objectives, goals, and activities of their organizations. For instance, when a university hires a new professor or lecturer, it must update its website accordingly, creating the new professor's profile and the appropriate links. This is why the addition or removal of pages and links is said to be driven by the behavior of the external environment. That behavior is highly unpredictable, which makes it difficult to formalize or model. One action that can always be carried out, however, is the deletion of web pages, links, or documents. Before removing or adding a page, the site should be analyzed to determine which pages need to go, and developers use different criteria for this. For instance, when a university dismisses a lecturer, it must update its records accordingly: the profile and any links connecting the professor's page to other pages are deleted. In some cases a professor has several pages, usually organized as a directory (Cormen et al., 2001), and these too are deleted. The aim of such a measure is to clear all the links and the directory of the person in question.
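The deletion step described above, removing every page under a directory and stripping all hyperlinks that point into it, can be sketched on the webgraph model. The toy site and the `/staff/smith/` prefix are illustrative assumptions, not paths from any real university site.

```python
def remove_directory(graph, prefix):
    """Drop every node under `prefix` and every arc pointing into it."""
    return {
        u: [v for v in vs if not v.startswith(prefix)]
        for u, vs in graph.items()
        if not u.startswith(prefix)
    }

SITE = {
    "/": ["/staff/smith/", "/depts"],
    "/staff/smith/": ["/staff/smith/cv.html", "/"],
    "/staff/smith/cv.html": ["/staff/smith/"],
    "/depts": ["/"],
}
pruned = remove_directory(SITE, "/staff/smith/")
```

After pruning, no page in the directory remains as a node and no surviving page links into it, which is precisely the goal of the control action.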
Eliminating a directory can raise complications. A professor may, for instance, have created a personal website rather than a simple directory on the university's site, which makes deletion harder. Removing such a site is an elaborate process: a new site is created while the old one is deleted and replaced, ensuring the old directory is completely removed. The advantage of this method is that it is easy to formalize. Removing outdated, irrelevant, or empty pages is much simpler than deleting directories. It is worth noting that removing a web page does not necessarily mean the page is physically deleted; in most cases an existing page is transformed into a separate site with its own domain name, while the old hyperlinks are retained and continue to connect to the parent site.
Many university web resources develop in this way, with sites restructured as circumstances change. One resource that has undergone such changes is the St. Petersburg State University website. Pechnikov (2016) describes how the university's web presence has been in constant development for over 20 years. Initially it served only as the official university website, but over time it has grown to include more than 280 independent sites covering departments, faculties, and administrative units; before becoming independent, these sites were merely sections of the university's main site. From this, it may be argued that control actions determine whether specific pages of a university website are deleted. Deleting parts of a real website can be a daunting task, and such actions are best carried out by professional web developers.