Tuesday, August 4, 2009

Can Librarians Be Put Directly Onto the Semantic Web?


The professor who taught "Introduction to Computer Programming" my freshman year of college told us that it was easier to teach a (doctor, lawyer, architect) to program a computer than it was to teach a computer programmer to be a (doctor, lawyer, architect). I was never really sure whether he meant that it was easy to teach people programming, or whether he meant that it was impossible to teach programmers anything else. Many years later, I met the doctor he collaborated a lot with, and decided that my professor's conclusion was based on an unrepresentative data set, because the doctor had the personality of a programmer who accidentally went to medical school.

I was reminded of that professor by one of Martha Yee's questions in her article "Can Bibliographic Data Be Put Directly Onto the Semantic Web?":
Do all possible inverse relationships need to be expressed, or can they be inferred? My model is already quite large, and I have not yet defined the inverse of every property as I really should to have a correct RDF model. In other words, for every property there needs to be an inverse property; for example, the property isCreatorOf needs to have the inverse property isCreatedBy; thus "Twain" has the property isCreatorOf, while "Adventures of Tom Sawyer" has the property isCreatedBy. Perhaps users and inputters will not actually have to see the huge, complex RDF data model that would result from creating all the inverse relationships, but those who maintain the model will need to deal with a great deal of complexity. However, since I'm not a programmer, I don't know how the complexity of RDF compares to the complexity of existing ILS software.
Although there are many incorrect statements in this passage, the most important one to correct here is in the last sentence. Whether she likes it or not, Martha Yee has become a programmer. Congratulations, Martha!

In many respects, the most important question for the library world in examining semantic web technologies is whether librarians can successfully transform their expertise in working with metadata into expertise in working with ontologies or models of knowledge. Whereas traditional library metadata has always been focused on helping humans find and make use of information, semantic web ontologies are focused on helping machines find and make use of information. Traditional library metadata is meant to be seen and acted on by humans, and as such has always been an uncomfortable match with relational database technology. Semantic web ontologies, in contrast, are meant to make metadata meaningful and actionable for machines. An ontology is thus a sort of computer program, and the effort of making an RDF schema is the first step of telling a computer how to process a type of information. Martha Yee's development of an RDF class to represent an Author is precisely analogous to a Java programmer's development of a Java class to do the same thing.

RDF is the first layer of the program; OWL (Web Ontology Language) is the next layer. In OWL, you can describe relationships and constraints on classes and properties. For example, an ontology could contain the statement:
<owl:ObjectProperty rdf:ID="isCreatorOf">
<owl:inverseOf rdf:resource="#isCreatedBy" />
</owl:ObjectProperty>
which defines isCreatorOf as the inverse of isCreatedBy. With this definition, a reasoning engine that encounters an isCreatorOf relationship will know that it can simplify the data graph by replacing it with the inverse isCreatedBy relationship. This does NOT mean that a good ontology should define an inverse for every one of its properties; in fact, quite the opposite is true. OWL constructs such as inverseOf (and sameAs) are meant to make it easier to link separate ontologies, not to encourage ontologies to have redundant property definitions.
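
To make the inference concrete, here is a minimal Python sketch of what a reasoner could do with an inverseOf declaration. It is not how any particular OWL reasoner is implemented; the tuple representation, the INVERSES table, and the with_inverses function are all invented for illustration.

```python
# Hypothetical data model: each triple is a (subject, property, object) tuple.
# INVERSES stands in for the owl:inverseOf declaration in the ontology.
INVERSES = {"isCreatorOf": "isCreatedBy"}


def with_inverses(triples, inverses):
    """Return the stated triples plus every triple implied by an inverse declaration."""
    # inverseOf works in both directions, so register each pair both ways.
    both = dict(inverses)
    both.update({v: k for k, v in inverses.items()})

    inferred = set(triples)
    for s, p, o in triples:
        if p in both:
            # Swap subject and object and use the inverse property.
            inferred.add((o, both[p], s))
    return inferred


# Only one direction is ever entered by a person.
stated = {("Twain", "isCreatorOf", "Adventures of Tom Sawyer")}
graph = with_inverses(stated, INVERSES)

for triple in sorted(graph):
    print(triple)
```

The data only ever states one direction; the other direction is derived from the single declaration in the ontology, which is why nobody has to enter or maintain both.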

I'm not sure where the notion that "for every property there needs to be an inverse property" came from, but I'll venture two guesses. It's true that if you want to browse easily in both directions from one entity to a related entity, you need to have the relationship expressed at both ends, particularly in a distributed data environment. Most application scenarios for RDF data involve gathering the data into large datastores for this reason. But you don't need an inverse property to be defined for this purpose.

Another possible source for the inverse property confusion comes from the way that relational databases work. In order to efficiently display sorted lists using a relational database, you need to have prepared indices for each field you want to use. So if you want to display authors alphabetically by book title, and also books alphabetically by author name, you need to have relationships defined in both directions. If you're using an RDF tuple store, by contrast, all the data goes in a single table, and thus indices are all predefined.
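
A rough Python sketch of that last point, assuming a toy in-memory store of RDF triples: because every statement has the same three-part shape, the store can index every statement by subject and by object as it is added, and a single property can then be followed in either direction. The TripleStore class and its method names are hypothetical, and real stores are far more sophisticated than this.

```python
from collections import defaultdict


class TripleStore:
    """Toy store: one collection of triples, indexed from both ends on insert."""

    def __init__(self):
        self._by_subject = defaultdict(set)  # (subject, property) -> objects
        self._by_object = defaultdict(set)   # (property, object) -> subjects

    def add(self, subject, prop, obj):
        # Both indices are maintained for every triple, whatever the property.
        self._by_subject[(subject, prop)].add(obj)
        self._by_object[(prop, obj)].add(subject)

    def objects(self, subject, prop):
        """Follow a property forwards, returning results in sorted order."""
        return sorted(self._by_subject.get((subject, prop), ()))

    def subjects(self, prop, obj):
        """Follow the same property backwards; no inverse property is needed."""
        return sorted(self._by_object.get((prop, obj), ()))


store = TripleStore()
# Illustrative data; only isCreatedBy is ever stated.
store.add("Adventures of Tom Sawyer", "isCreatedBy", "Mark Twain")
store.add("Adventures of Huckleberry Finn", "isCreatedBy", "Mark Twain")

print(store.objects("Adventures of Tom Sawyer", "isCreatedBy"))
print(store.subjects("isCreatedBy", "Mark Twain"))
```

Listing the books by an author here uses the same isCreatedBy statements that list the author of a book; no isCreatorOf property was defined.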

The fact that ontologies are programs that encode domain knowledge should remove a lot of mechanical drudgery for "users and inputters". To take a trivial example, the cataloguer of a new version of "Adventures of Tom Sawyer" would not have to enter "Samuel Clemens" as an alternate author name for "Mark Twain" once the isCreatedBy relationship has been made. In fact, if the ontology contained a relationship "isVersionOf", then the cataloguer wouldn't even need to enter the title or create a new isCreatedBy relationship. A library catalog that used semantic web technologies wouldn't need separate programming to make these relationships; they would come directly from the ontology being used.
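
Here is a small Python sketch of the Tom Sawyer example, under some loudly stated assumptions: the identifiers (work:TomSawyer, person:Twain, edition:new), the name and alternateName properties, and the rule that a version takes its title and creator from the work it is a version of are all invented for illustration. In a real system that rule would be stated in the ontology or in rules layered on it, not hard-coded as it is here.

```python
# Hypothetical catalog data as (subject, property, object) triples.
triples = {
    ("work:TomSawyer", "title", "Adventures of Tom Sawyer"),
    ("work:TomSawyer", "isCreatedBy", "person:Twain"),
    ("person:Twain", "name", "Mark Twain"),
    ("person:Twain", "alternateName", "Samuel Clemens"),
    # The only statement the cataloguer enters for the new version:
    ("edition:new", "isVersionOf", "work:TomSawyer"),
}


def values(triples, subject, prop):
    """All objects of the given property on the given subject."""
    return {o for s, p, o in triples if s == subject and p == prop}


def describe(triples, item):
    """Gather title and creator names for an item, following isVersionOf."""
    # Assumed rule: look at the item itself and at whatever it is a version of.
    sources = {item} | values(triples, item, "isVersionOf")
    record = {"title": set(), "creatorNames": set()}
    for source in sources:
        record["title"] |= values(triples, source, "title")
        for creator in values(triples, source, "isCreatedBy"):
            # Every name attached to the creator comes along for free.
            record["creatorNames"] |= values(triples, creator, "name")
            record["creatorNames"] |= values(triples, creator, "alternateName")
    return record


record = describe(triples, "edition:new")
print(record)
```

The cataloguer typed one relationship; the title and both forms of the author's name were found by following it.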

To some extent, the success of the semantic web in any domain is predicated on the successful embodiment of that domain's knowledge in ontological code. Either coders need to learn the domain knowledge, or domain experts need to learn to code. People need to talk.

1 comment:

  1. I see your perspective on the encoding of knowledge and that is certainly one way to use Semantic Web technologies and techniques in libraries. However, there are others.

    The approach you propose has been called "Big S" semantics by Prof. Jim Hendler of RPI. The alternative, which uses a bare bones minimum of RDF to express critical relationships between resources of interest at the time of consumption, was called "little s" semantics. 'Little s' semantics is analogous to hyperlinking on the World Wide Web in that not all possible relationships need to be defined in advance for value to be obtained.

    The Library of Congress, for example, is working with Zepheira on a system based upon 'little s' semantics to curate distributed metadata from external partners. See http://zepheira.com/publications/news/#LC_Recollection for an overview.

    Regards,
    Dave
