Introduction

The Semantic Web is an extension of the traditional Web beyond documents and applications into the realm of data. Tim Berners-Lee describes the Semantic Web as a system that would "open up the knowledge and workings of humankind to meaningful analysis by software agents" [1]. The Semantic Web differs from the Web in two key ways: there can be relations between URIs, and URIs can be used to describe objects, rather than simply documents. With these two additional tools, information on the Web could be used to make statements about objects in both the physical world and online. The Semantic Web refers to the specific case in which these statements are machine-readable, and therefore machine-explorable. In such a Web, computer agents could answer user queries via inference, using the links between statements to traverse the search space. Though Tim introduced the idea of the Semantic Web more than fifteen years ago, content that supports the Semantic Web makes up a tiny fraction of the Web. Implementation has been slow, and there are many obstacles - technical, logical, and practical - to its widespread adoption. The Web, as it is now, is not really a Semantic Web - though a number of sites do support Semantic content - but it could become one.

The advent of the Web changed the way documents were shared. It leveraged existing technologies, and by integrating those technologies it provided a common ground for sharing information. There are many reasons the Web succeeded, but one of the most significant - and deeply consequential to the implementation of a Semantic Web - was that it was designed for human agents. This is apparent in the Web's methods of content transmission (HTTP/1.0 headers were text/plain [2]; URIs are plaintext [3] and, though they are not required to, "are primarily intended for human interpretation" [4]) and in its content itself (HTML is declarative and self-describing). Lowering the barriers to entry makes adopting a technology easier, encouraging network effects. The emphasis on human-readability in Tim's original Web played a key role in the success of the internet.

In 1991, when Tim announced the first version of his World Wide Web, the idea of a global, distributed, and human-explorable information network was novel and revolutionary [5]. Though it might have been possible, the merit of adding further features to the Web to make it machine-explorable would have appeared questionable at best; there wasn't yet enough information online to make that type of exploration attractive, and even if there had been, the kinds of programs that might explore in this way didn't exist yet. As the amount of information on the Web grew, services rooted in machine-exploration of the Web, like Search, began to flourish.

These tools took advantage of the self-describing nature of HTML to extract meaning from webpages, examining only the "raw textual content... in the HTTP response body" [6]. With the introduction of JavaScript in 1995, the variety of data the Web could support increased dramatically [7]. Though JavaScript made it possible to create robust Web applications, it simultaneously made machine-interpretation of Web content more difficult and such interpretation more valuable. JavaScript lets you do more, but this power comes at the cost of transparency. The goal of the Semantic Web is to make data on the modern Web, in spite of its profusion of technologies, clear and accessible.

The value in the Semantic Web lies in its ability to reduce vast quantities of data on the Web in a way that is meaningful to a user. As the amount of data on the Web increases [8], as we spend more time on the Web (7.5 hours per day, 186% more in 2014 than 2010 [9]), and as that data is more meaningfully integrated into our lives [10], the value a World Wide Semantic Web would provide increases.


What is the Semantic Web and What Does it Do?

The Semantic Web refers to the idea of an extension to the World Wide Web that would provide standardized methods for sharing data. The end goal of the Semantic Web is to "enable computers to do more useful work" by enabling machine-exploration of 'Linked Data' across the web [11]. There can be no Semantic Web without Linked Data, and as such the standardization of Linked Data is the primary objective for proponents of the Semantic Web.

Linked Data

Tim specifies four qualities necessary for data to be considered linked:

  1. "Use URIs as names for things.

  2. Use HTTP URIs so that people can look up those names.

  3. When someone looks up a URI, provide useful information using the standards

  4. Include links to other URIs. so that they can discover more things." [12]

This definition of Linked Data specifies the conditions necessary for Semantic Web-supporting data. The first condition guarantees that names on the Semantic Web, just as on the Web, dereference uniquely, regardless of whether or not they indicate a resource. However, it also expands the types of things URIs can represent beyond Web documents; on the Semantic Web, a URI can refer to any kind of information, whether it be a person, a document, an application, or a value. The second condition applies to content on the Web and on the Semantic Web equally, and ensures the Semantic Web is compatible with the World Wide Web.

The final two conditions for Linked Data are what differentiate the Semantic Web from the Web. A Semantic Web URI dereference must by default provide the dereferencer with both standardized information about the dereferenced URI (condition three) and a set of URIs related to the dereferenced URI (condition four). A dereference on the Semantic Web does not necessarily link to a document; instead, it provides information about that URI and how it is connected to other content on the Semantic Web. These characteristics make Linked Data self-describing.
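Conditions three and four can be sketched in Python. The in-memory store, the URIs, and the `dereference` helper below are all illustrative assumptions; a real dereference would be an HTTP GET returning RDF:

```python
# Minimal sketch of a Linked Data dereference (hypothetical in-memory store).
# Looking up a URI yields both a description of the thing (condition three)
# and links to related URIs (condition four).

STORE = {
    "http://example.org/person/alice": {
        "description": {"type": "Person", "name": "Alice"},
        "links": {
            "knows": ["http://example.org/person/bob"],
            "authorOf": ["http://example.org/doc/report"],
        },
    },
}

def dereference(uri):
    """Return (description, related URIs) for a URI, or None if unknown."""
    entry = STORE.get(uri)
    if entry is None:
        return None
    related = [u for uris in entry["links"].values() for u in uris]
    return entry["description"], related

desc, related = dereference("http://example.org/person/alice")
print(desc["name"])   # Alice
```

Because the result pairs each related URI with the relation it participates in, a machine agent can continue dereferencing without any human guidance.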

Queries

Linked Data is not a goal in and of itself; rather, it is a prerequisite for query-resolution by machines, the core feature of the Semantic Web. Query-resolution upon Linked Data is significantly different from Query-resolution in traditional relational databases: over Linked Data, a result is obtained by recursively exploring connections in an instance of Linked Data to other distributed instances of Linked Data; over relational databases, queries are assessed by accessing tables within that database. Relational databases are 'flat' and can be thought of linearly; Linked Data forms a graph [13].

Query-resolution over Linked Data has a number of advantages over relational database query-resolution, particularly in a decentralized and distributed environment like the Web. First, querying data spread across multiple sources is handled naturally by the graph structure of Linked Data, as opposed to requiring joins across different relational databases. Second, because Linked Data query-resolution occurs over HTTP, caching can be implemented easily [14].
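The contrast can be sketched in Python: over Linked Data, a query is resolved by following links from node to node rather than by joining tables. The triples and URIs below are illustrative assumptions:

```python
from collections import deque

# Sketch: resolving a query over Linked Data by traversing the graph of
# triples. Each triple is (subject, predicate, object); the data is made up.
TRIPLES = [
    ("ex:alice", "knows", "ex:bob"),
    ("ex:bob", "knows", "ex:carol"),
    ("ex:carol", "livesIn", "ex:paris"),
]

def reachable(start):
    """Breadth-first exploration: every URI connected to `start` by some chain."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, _, o in TRIPLES:
            if s == node and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

# A question like 'who can Alice reach, and where do they live?' is answered
# by traversal, even when the triples live on different servers.
print(reachable("ex:alice"))
```

In a real system, the loop over `TRIPLES` would be replaced by dereferencing each URI over HTTP, so the exploration naturally spans distributed sources.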

Ontologies

The set of rules that define the validity of a statement and the logic of the inference rules for some Linked Data is called an Ontology or a Vocabulary [15]. Multiple Vocabularies can exist on the Semantic Web, though the value of a Vocabulary is closely related to the amount of information within it. Per Metcalfe's Law, a single large Vocabulary provides more value than two Vocabularies half the size [16].

An Ontology must be internally consistent to be useful, but there is no requirement that different Ontologies be consistent with each other. This issue is discussed further in the section on Ontology implementations.


Components of the Semantic Web

There are two components to an Ontology: Knowledge Representation, and Inference Rules. In coordination with Linked Data, these two components make machine-resolution of queries possible.

Knowledge Representation

The Semantic Web differs from the Web by attaching meaning to content. The meaning of this content is used to facilitate machine-exploration and query-resolution. A Knowledge Representation system sets out a method to associate attributes with objects. Given the distributed nature of the Semantic Web, several special considerations must be taken in structuring a knowledge representation system.

  • Vagueness

    Vague statements pose an interesting challenge for any knowledge representation system, and the way these statements are handled has a deep effect on what that system can be used for. A statement is vague when "there are possible states of things concerning which it is intrinsically uncertain whether [those states are] excluded or allowed by the proposition... because the speaker's habits of language were indeterminate" [17]. The sentence "Hunter is young" is a vague sentence, as the meaning of 'young' is context dependent.

Knowledge Representation systems that are able to describe an unbounded number of subjects, objects, and relations are realistically guaranteed to contain vague statements. On the other hand, a bounded Knowledge Representation system might avoid the issue of vague statements by sufficiently restricting the set of allowable statements.

Inference

Inference is the last component of the Semantic Web necessary for automatic Query-resolution. In the Semantic Web, inference is the process by which "automatic procedures can generate new relationships based on the data and based on some additional information in the form of a vocabulary" [18]. There are three issues related to Inference which require special attention:

  • Inconsistency

    Any Inference rule for use on the Semantic Web must deal with inconsistencies. This is a deep and fundamental logical issue: in classical logic, a contradiction entails anything, a result that is unacceptable on the Semantic Web [19]. A logical system is described as 'explosive' if, in that system, "A and not A entails B, for every A and B" [20]. Defeasible Reasoning Systems, like paraconsistent logic, are non-explosive, promoting a weaker form of logic in which contradictions are permissible [21]. These kinds of non-explosive inference systems are generally used on the Semantic Web.

  • Uncertainty

    Regardless of the Knowledge Representation System, there are some queries for which there is more than one possible answer. As an example, if the data is 'a polygon with four sides of equal length meeting at right angles' and the query is 'what shape is this polygon?', there are two equally valid answers, though they differ in specificity: 'square' or 'rectangle'. Probabilistic analysis can be used to determine the most likely answer in these situations [22].

  • Vagueness

    If vague statements are allowed in the Knowledge Representation system, the inference rules must have some means of assessing their truth-value. Fuzzy Logic, in which truth statements are assigned a value between 0 and 1 (as opposed to a binary value in traditional logic), can be applied in this situation [23].
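Two of the issues above lend themselves to a short sketch in Python: a fuzzy truth-value for the vague predicate 'young' (the membership function is an illustrative assumption, not a standard), and a defeasible default rule that is blocked, rather than exploding, when its conclusion is explicitly contradicted. The atom syntax ('-' marking negation) is hypothetical:

```python
# Sketch 1: fuzzy truth-values in [0, 1] for a vague predicate. The
# membership function for 'young' is an illustrative assumption.
def young(age):
    """Degree to which an age counts as 'young': 1.0 at/below 20, 0.0 at/above 60."""
    if age <= 20:
        return 1.0
    if age >= 60:
        return 0.0
    return (60 - age) / 40   # linear falloff between the two bounds

# Sketch 2: defeasible (non-explosive) inference. A default rule fires only
# if its conclusion is not already contradicted, so 'birds fly' is blocked
# by '-flies(tweety)' instead of the contradiction entailing everything.
def negate(atom):
    return atom[1:] if atom.startswith("-") else "-" + atom

def defeasible_closure(facts, default_rules):
    """Forward-chain default rules, skipping any rule whose conclusion is contradicted."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in default_rules:
            if (premise in derived and conclusion not in derived
                    and negate(conclusion) not in derived):
                derived.add(conclusion)
                changed = True
    return derived

facts = {"bird(tweety)", "penguin(tweety)", "-flies(tweety)"}
rules = [("bird(tweety)", "flies(tweety)")]   # default: birds fly
derived = defeasible_closure(facts, rules)
# 'flies(tweety)' is not derived: the explicit exception blocks the default.
```

Both sketches trade logical strength for robustness, which is exactly the trade-off non-explosive systems make on the Semantic Web.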


Implementation of the Semantic Web

There are a number of implementations of Semantic Web technologies, and each implementation is administered by a different group. Each of these systems makes different design decisions which impact their ease of use, expressiveness, and extensibility. Many of the Semantic Web technologies implemented feature a tight coupling between Inference Rules, Knowledge Representation, Ontology, and Query Language.

Knowledge Representation Implementations

  • RDFa (Resource Description Framework in Attributes)

    The W3C proposes RDFa as the standard format for Semantic Web knowledge representation, with the aim of solving "the problem of marking up machine-readable data in HTML documents" [24]. RDFa uses subject-predicate-object triples and is capable of describing arbitrary relations and IRIs [25]. The subject identifies what the statement is about; the predicate identifies the type of relation in the statement; and the object indicates the value that the subject takes for the predicate. The W3C spec specifies that subjects and predicates must be IRIs, but objects may be literals. This requirement stipulates that subjects and predicates be unique references, while the object can have any value.

  • Microformats

    Microformats are an alternative to RDFa. The key difference between microformats and RDFa is that microformats, unlike RDFa, are not "Infinitely extensible and open-ended" [26]. Microformats are meant to reuse existing data formats, like hCard, for Semantic Web markup and have a low barrier to entry.

  • Microdata

    Microdata is another alternative, supported by many of the major Search engines [27]. This format defines relations on the schema.org website. Like microformats, the microdata format does not allow arbitrary attributes to be defined. However, unlike microformats, which aim to reuse existing formats, the creators of microdata define their own attributes and relations.

A more complete overview on the differences between RDFa, Microdata, and Microformats can be found here.
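The triple constraint RDFa imposes - subjects and predicates must be IRIs, while objects may be IRIs or literals - can be sketched as follows. The `is_iri` check is a deliberately loose stand-in for the full IRI grammar, and the example URIs are illustrative:

```python
# Sketch of the subject-predicate-object triple model: subject and predicate
# must be IRIs, the object may be an IRI or a literal. The IRI check here is
# a loose stand-in (just 'scheme://'), not the full IRI grammar.

def is_iri(value):
    return isinstance(value, str) and "://" in value

def make_triple(subject, predicate, obj):
    """Build a (subject, predicate, object) triple, enforcing the IRI rules."""
    if not is_iri(subject):
        raise ValueError("subject must be an IRI")
    if not is_iri(predicate):
        raise ValueError("predicate must be an IRI")
    return (subject, predicate, obj)   # object may be an IRI or a literal

# A literal object is allowed; a literal subject is not.
t = make_triple("http://example.org/alice",
                "http://xmlns.com/foaf/0.1/name",
                "Alice")
```

The asymmetry is the point: subjects and predicates must be globally dereferenceable names, while the object is where concrete values enter the graph.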

W3C Ontology Implementations

The W3C has developed a number of ontologies, each for a different purpose.

  • RDFS (Resource Description Framework Schema)

    RDFS is an ontology that extends the RDF Knowledge Representation format, providing "mechanisms for describing groups of related resources and the relationships between these resources" [28].

  • OWL (Web Ontology Language)

    OWL, unlike RDF, RDFa, or RDFS, which define abstract ways objects can relate, is an ontology with a "formally defined meaning" [29]. This ontology is specifically intended to be used for information that "needs to be processed, as opposed to situations where the content only needs to be presented to humans" [30].

  • SKOS (Simple Knowledge Organization System)

    SKOS is an ontology which "aims to provide a bridge between different communities of practice within the library and information sciences involved in the design and application of knowledge organization systems" [31].

Other Technologies in the W3C Semantic Web Stack

There are a number of other tools in the W3C Semantic Web Stack.

  • SPARQL (SPARQL Protocol and RDF Query Language)

    SPARQL is the W3C standard Semantic Query Language. SPARQL supports "aggregation, subqueries, negation, creating value expressions, extensible value testing, and constraining queries by source RDF graph" [32].

  • RIF (Rule Interchange Format)

    RIF is a W3C initiative which "aims to develop a standard for exchanging rules among disparate systems, especially on the Semantic Web" [33]. The initiative aims to develop a group of interoperable languages, using those languages' rigorously defined semantics and syntax to accomplish this task.
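The core operation behind a SPARQL SELECT - matching a triple pattern containing ?variables against a graph and returning bindings - can be sketched in Python. The graph and the single-pattern query are simplifying assumptions; real SPARQL adds multi-pattern joins, aggregation, negation, and more:

```python
# Sketch of SPARQL-style pattern matching: positions beginning with '?' are
# variables that bind to whatever appears in that slot; other positions must
# match exactly. Graph data and URIs are illustrative.

GRAPH = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:age", "30"),
    ("ex:bob", "ex:knows", "ex:carol"),
]

def match(pattern, graph):
    """Yield one {variable: value} binding per triple matching the pattern."""
    results = []
    for triple in graph:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val
            elif pat != val:
                break
        else:
            results.append(binding)
    return results

# Roughly: SELECT ?who WHERE { ex:alice ex:knows ?who }
print(match(("ex:alice", "ex:knows", "?who"), GRAPH))
```

A full engine composes many such patterns and joins their bindings, but each individual pattern resolves exactly as above.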


Issues Facing the Semantic Web

There are many obstacles to the realization of the Semantic Web, the most substantial of which are: the sheer quantity of information to be described; intentionally fraudulent or false data; high barriers to Semantic Web content creation; and the possible misuse of Semantic Resources.

Quantity

The massive volume of pages and data on the Web is the single greatest challenge to the spread of the Semantic Web. There were 4.54 billion pages as of May 2015, and marking up each of those pages with Semantic Web metadata is far from trivial [34]. That said, the size of the Web is a double-edged sword, as it also increases the value of a widespread Semantic Web.

Barriers to Entry

The steep learning curve of most Semantic Web technologies makes widespread adoption difficult. Compare the learning curve of HTML to that of the Semantic Web: to learn basic HTML, a user need only learn a few dozen tags; to learn 'basic' Semantic Markup, the user must know how statements are written in the Knowledge Representation format and be familiar with the ontology they are working with.

Complaints about the high barriers to entry are levied against the W3C Semantic Web Stack in particular, and are one of the reasons some choose to mark up their pages with the simpler Microdata or Microformat Knowledge Representation formats. Ultimately, the decision to implement Semantic Content is a value judgement made by the user. Given the end goal of a global web of data, the ease with which content can be added to the Semantic Web is a serious concern.

Implementing Semantic Web markup does place a burden on the user, but this is a tradeoff for the gain in value that markup information provides. There's no free lunch - in order to get more out of a system, you have to put more in. The trick is to strike the right balance between expressiveness and ease of implementation.

Deception

Intentionally incorrect or misleading data poses a significant problem for the Semantic Web. The Semantic Web's key offering - increased convenience and access to data via automatic machine-exploration - is dependent on total user trust. As a result, the Semantic Web's margin for error when it comes to deceit is quite small, as even a small mistake can undermine the user's confidence in the System. Cryptographic techniques can be used to address the issue of authorship of a Semantic Web statement.
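As one hedged sketch of that cryptographic idea: an author could sign a canonical serialization of each statement. The example below uses an HMAC with a shared secret so it stays within the standard library; a real deployment would use public-key signatures so any reader can verify authorship without holding the signer's secret. All names and the secret are illustrative:

```python
import hashlib
import hmac

# Sketch of authenticating a Semantic Web statement by signing a canonical
# serialization of its triple. Shared-secret HMAC is used for simplicity;
# public-key signatures would be the realistic choice on an open Web.

SECRET = b"example-shared-secret"   # illustrative placeholder

def sign_statement(subject, predicate, obj, key=SECRET):
    """Return a hex signature over the newline-joined triple."""
    message = "\n".join((subject, predicate, obj)).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify_statement(subject, predicate, obj, signature, key=SECRET):
    """Recompute the signature and compare in constant time."""
    expected = sign_statement(subject, predicate, obj, key)
    return hmac.compare_digest(expected, signature)

sig = sign_statement("ex:alice", "ex:authorOf", "ex:report")
```

A signature like this ties a statement to its author, so a consumer can at least decide whose statements to trust, even though it cannot make a false statement true.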

Censorship

The Semantic Web would make machine-understanding easier and more widespread, and it is possible this could be used to more aggressively censor content. I think this is one of the strongest points against the Semantic Web, as it presents a real potential negative effect of implementing such a global system. However, many of the arguments that advocate against its implementation could be made against the Web itself. Many powerful systems that provide real and significant value also open up new avenues for malicious actions; any tool can be used for good or bad. That said, this is a serious issue, and adequate provisions protecting the freedom of information need to be laid out. As the amount of content on the Semantic Web grows, this issue will become more serious, as it too obeys Metcalfe's Law.


Progress Towards the Semantic Web

In 2002, Tim described the Semantic Web as "currently in a state comparable to that of hypertext before the advent of the Web: it is clearly a good idea, and some very nice demonstrations exist, but it has not yet changed the world" [1]. Though the number of Semantic Web pages is increasing, the majority of webpages do not have Semantic Web content, and the idea has yet to 'change the world' [35].

The sheer amount of content on the Web makes implementing a Knowledge Representation format able to handle all Semantic Content a monumentally difficult task. Many of those who support Semantic Web content choose to do so in restricted domains, avoiding the scaling problems that come with encompassing the whole Web.


Conclusion

As content on the Web began to use more and more powerful tools, the ability to automatically analyze that content decreased. The Semantic Web is a response to that complexity. The aim is to make the content on the Web available and understandable again, to not just humans but machines.

The Semantic Web extends the Web by making machine-exploration and inference possible. On the World Wide Web, URIs can be found in two ways: through a reference on another page, or by human-exploration. On the Semantic Web, in addition to these two methods, dereferencing a URI produces a set of related URIs paired with a description of their relationship to the original URI. This simple modification to the result of a URI dereference gives machine-agents all the information needed to explore the web.

The difference between the Web as it is now and Tim Berners-Lee's vision for the Semantic Web does not lie in the amount of information online, but in how that information is stored, and how it can be accessed. With two simple ideas - subject-predicate-object metadata in HTML and the inclusion of related URIs on dereference - the Semantic Web changes the nature of the Web and opens up a world of opportunity for information exchange.

There are many hurdles to overcome before the Web is a Semantic Web, but I believe the value of the Semantic Web far overshadows these obstacles. It isn't a question of whether the Semantic Web will exist - it's a question of when.


Bibliography

  1. The Semantic Web

    - Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American, 2002.

  2. Hypertext Transfer Protocol -- HTTP/1.0

    - T. Berners-Lee, R. Fielding, H. Frystyk, IETF, 1996.

  3. Hypertext Transfer Protocol -- HTTP/1.0

    - T. Berners-Lee, R. Fielding, H. Frystyk, IETF, 1996.

  4. Uniform Resource Identifier (URI): Generic Syntax

    - T. Berners-Lee, R. Fielding, L. Masinter, IETF, 2005.

  5. Aug. 7, 1991: Ladies and Gentlemen, the World Wide Web

    - Tim Berners-Lee (original post), Tony Long (Wired article author), 2008.

  6. Understanding web pages better

    - Erik Hendriks, Michael Xu, Kazushi Nagayama, Google, 2014.

  7. Netscape and Sun announce JavaScript, the Open, Cross-Platform Object Scripting Language for Enterprise Networks and the Internet

    - Netscape Communications, 1995.

  8. 2014 Bot Traffic Report: Just the Droids You were Looking for

    - Igal Zeifman, Incapsula, 2014.

  9. Average Daily Media Use in the United States from 2010 to 2014 (in minutes)

    - Statista, 2015.

  10. The Internet of Things Is Far Bigger Than Anyone Realizes

    - Daniel Burrus, Wired Magazine, 2014.

  11. Semantic Web

    - W3C.

  12. Linked Data

    - Tim Berners-Lee, 2006.

  13. Semantic Queries in Databases: Problems and Challenges

    - Lipyeow Lim, Haixun Wang, and Min Wang, ACM, 2009.

  14. Advantages of RDF over Relational Databases

    - Semantic Web Forum, 2012.

  15. Vocabularies

    - W3C.

  16. Metcalfe's Law, Web 2.0, and the Semantic Web

    - James Hendler and Jennifer Golbeck, University of Maryland, 2008.

  17. Vagueness

    - Roy Sorensen, Stanford Encyclopedia of Philosophy, 2013.

  18. Inference

    - W3C.

  19. Ex Contradictione Non Sequitur Quodlibet

    - Walter A. Carnielli and João Marcos, BARK, 2001.

  20. Paraconsistent Logic

    - Graham Priest, Koji Tanaka, and Zach Weber, Stanford Encyclopedia of Philosophy, 2015.

  21. Defeasible Logic

    - Donald Nute, University of Georgia, 2003.

  22. Calculating Semantic Uncertainty

    - Gunther Wirsching, CogInfoCom, 2012.

  23. Mathematics of Fuzzy Logic

    - Francis Jeffry Pelletier, 2000.

  24. HTML+RDFa 1.1

    - Ben Adida, Mark Birbeck and Steven Pemberton, W3C, 2012.

  25. RDFa Core 1.1

    - W3C, 2011.

  26. About Microformats

    - microformats.org, 2015.

  27. Introducing schema.org: Search engines come together for a richer web

    - Google, 2011.

  28. RDF Schema 1.1

    - Dan Brickley and R.V. Guha, W3C, 2014.

  29. OWL 2 Web Ontology Language Documentation Overview (Second Edition)

    - W3C Owl Working Group, 2012.

  30. OWL Web Ontology Language Overview

    - Deborah L. McGuinness and Frank van Harmelen, W3C, 2004.

  31. SKOS Simple Knowledge Organization System Reference

    - Alistair Miles and Sean Bechhofer, W3C, 2009.

  32. SPARQL 1.1 Query Language

    - Steve Harris, Andy Seaborne, W3C, 2013.

  33. Rule Interchange Format: The Framework

    - Michael Kifer, 2008.

  34. The size of the World Wide Web (The Internet)

    - worldwidewebsize.com, 2015.

  35. Deployment of RDFa, Microdata, and Microformats on the Web - A Quantitative Analysis

    - Christian Bizer, Kai Eckert, Robert Meusel, Hannes Muhleisen, Michael Schumacher and Johanna Volker, University of Mannheim, 2013.