A brave new world for PDFs? Utopia Documents explained.

21 Jan, 11 | by BMJ

Following on from last week’s discussion of information-seeking behaviour, today we’ll be exploring one way of transforming individual articles into portals to greater information: Utopia Documents.

What is Utopia Documents?

At its most basic level, Utopia Documents is a PDF reading tool that allows articles to be augmented with interactive content, and helps the reader explore data associated with a particular paper. It’s a desktop application for reading and exploring papers, and functions in many respects like a normal PDF reader. Its real potential becomes clear when it is configured with appropriate domain-specific ontologies and plugins. Once these are in place, the software transforms PDF versions of articles from static facsimiles of their printed counterparts into dynamic gateways to additional knowledge, linking both explicit and implicit information embedded in the articles to online resources, as well as providing access to auxiliary data and interactive visualisation and analysis tools. For a thorough demonstration of the software, take a look at the video below:

Why concentrate on PDFs and not HTML?

Given the huge investment in XML/HTML versions of articles, are we taking a step backwards by semantically tagging our PDFs? Professor Teresa K. Attwood, who led the bioinformatics component of the EPSRC/DTI-funded UTOPIA(d) project, argues that:

“Utopia Documents was developed in response to the realization that, in spite of the benefits of ‘enhanced HTML’ articles online, most papers are still read, and stored by researchers in personal archives, as PDF files. Several factors likely contribute to this reluctance to move entirely to reading articles online: PDFs can be ‘owned’ and stored locally, without concerns about web sites disappearing, papers being withdrawn or modified, or journal subscriptions expiring; as self-contained objects, PDFs are easy to read offline and share with peers (even if the legality of the latter may sometimes be dubious); and, centuries of typographic craft have led to convergence on journal formats that (on paper and in PDF) are familiar, broadly similar, aesthetically pleasing and easy to read.”

The authors have also dismissed reservations about the semantically limited nature of PDFs as a non-issue:

“We argue that PDFs are merely a mechanism for rendering words and figures, and are thus no more or less ‘semantic’ than the HTML used to generate web pages. Utopia Documents is hence an attempt to provide a semantic bridge that connects the benefits of both the static and the dynamic online incarnations of published texts.”

What are the main features of Utopia Documents?

In an interview at the Guardian, Utopia’s Phillip McDermott says:

“Utopia Documents links scientific research papers to the data and to the community. It enables publishers to enhance their publications with additional material, interactive graphs and models. It allows the reader to access a wealth of data resources directly from the paper they are viewing, make private notes and start public conversations. It does all this on normal PDFs, and never alters the original file. We are targeting the PDF, since they still have around 80% readership over online viewing.”

Explore article content

An integrated semantic search bar enables users to explore the biological content of an article from within a PDF reader. This offers readers the opportunity to investigate aspects of a scientific article further or clarify given terms.

Discover published metadata

If a publisher has invested in the appropriate domain-specific ontologies and plugins, Utopia Documents can provide access to additional context, from database entries to glossary definitions. All new articles in the Semantic Biochemical Journal, for example, include publisher-curated annotations of the most salient facts.

Comment on articles

The software allows readers to annotate their PDFs, either privately for personal reference or publicly as part of an online discussion.

Interact with live data

Utopia Documents allows users to interact directly with curated database entries. Within the familiar setting of a PDF reader, they can play with molecular structures, edit sequence and alignment data, and even plot curated tabular data.

Many scholars of research behaviour argue that for electronic journals to survive and thrive, they must be different from their print antecedents. Although it is certainly true that online journals must offer added functionality, it would be more appropriate to refer to the printed versions as competitors rather than predecessors. Designers and publishers must therefore fully exploit the electronic medium’s basic properties, with ‘interactivity’ as the primary characteristic of new technologies. Utopia Documents allows the user to search through an integrated search bar, play with molecular structures and annotate documents for online collaboration. Reading an electronic journal is not the same as reading a print copy, and it’s time to make the most of that difference by offering users advanced features and novel forms of functionality beyond what is possible in print.

Utopia Documents is free and can be downloaded here: http://getutopia.com/documents/

Semantic publishing: how to create richer metadata

10 Dec, 10 | by BMJ

Following a previous post on the Semantic Web, this week we’ll be exploring the implications of this web of data for the publishing world. Semantic web technologies, as opposed to the grander idea of the Semantic Web itself, offer tools that can help publishers assemble and distribute their content more efficiently.

What is semantic publishing?

Fundamentally, semantic web publishing refers to information published on the web, accompanied by semantic markup. Semantic publication makes information search and data integration more effective by equipping computers with the ability to understand the structure and even the meaning of the published information. In the Semantic Web, published information is accompanied by metadata describing the information, thereby providing a ‘semantic’ context.
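As a rough illustration of what ‘accompanied by metadata’ can mean in practice, the sketch below describes an article as subject-predicate-object statements, the general shape used by semantic web data models such as RDF. The identifiers (`article:123`, `dc:publisher` and so on) and the helper `describe` are hypothetical, chosen only for this example, and plain Python tuples stand in for a real RDF library.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers and values here are made up for illustration.
triples = [
    ("article:123", "dc:title", "An example article"),
    ("article:123", "dc:publisher", "org:bmj"),
    ("article:123", "dc:subject", "topic:semantic-publishing"),
    ("org:bmj", "rdfs:label", "BMJ Group"),
]

def describe(subject, statements):
    """Collect every predicate/object pair stated about one subject."""
    return {pred: obj for subj, pred, obj in statements if subj == subject}

article = describe("article:123", triples)
print(article["dc:publisher"])
```

Because the publisher is itself an identifier rather than a free-text string, a program can follow it to further statements about the publisher, which is the kind of context that plain keyword metadata cannot offer.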

What difference could this make to the publishing world?

Many believe that semantic publishing has the potential to revolutionise scientific publishing. Tim Berners-Lee predicted in 2001 that the Semantic Web “will likely profoundly change the very nature of how scientific knowledge is produced and shared, in ways that we can now barely imagine”. Revisiting the Semantic Web in 2006, he and his colleagues argued that it “could bring about a revolution in how, for example, scientific content is managed throughout its life cycle”. Researchers could directly self-publish their experiment data in ‘semantic’ format on the web and semantic search engines could then make these data widely available.

Creating richer metadata – the technical bit

Metadata is used by most publishers in some capacity. The majority also use taxonomies (a hierarchy of terms used to categorise content), although they might not be aware of this name. The next step towards richer metadata is the use of ontologies. Just as taxonomies add structure to plain metadata, ontologies add structure to taxonomies, making them look ‘flat’ by comparison. Ontologies describe more detailed relationships among concepts and provide a higher level of richness in the metadata.

Taxonomies are very similar to the animal and plant kingdom taxonomies, in which every species is located in a particular branch. However, more conceptual objects don’t always fit so nicely into this basic lineage. If a publisher created a taxonomy based on colours with the following—red, yellow, and blue—as the top nodes, purple would need to be related to both red and blue. In a simple taxonomy, the term ‘purple’ would probably be repeated under both, but in a technical sense they would actually be two distinct nodes that have the same name.

In an ontology, however, purple can be represented as a single concept that is related to both red and blue. This is possible because ontologies are not tree-like: they are a complex mapping of concepts with defined relationships between those concepts (such as ‘subclass of’ or ‘part of’).
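The colour example can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the relation name `mixture_of`, the concept `primary_colour` and the helper `related` are invented for illustration, and a real ontology would normally be written in a dedicated language such as OWL rather than as Python tuples.

```python
# Taxonomy as a strict tree: every node has exactly one parent, so
# "purple" has to be duplicated as two separate nodes.
taxonomy = {
    "red": ["purple"],     # node 1 named "purple"
    "blue": ["purple"],    # node 2, a different node with the same name
    "yellow": [],
}

# Ontology as a graph of typed relationships: one "purple" concept,
# linked to as many other concepts as needed.
ontology = [
    ("purple", "mixture_of", "red"),
    ("purple", "mixture_of", "blue"),
    ("red", "subclass_of", "primary_colour"),
    ("blue", "subclass_of", "primary_colour"),
    ("yellow", "subclass_of", "primary_colour"),
]

def related(concept, relation, graph):
    """Return the concepts that `concept` points to via `relation`."""
    return sorted(obj for subj, rel, obj in graph
                  if subj == concept and rel == relation)

print(related("purple", "mixture_of", ontology))
```

In the tree, nothing tells a program that the two ‘purple’ entries are the same thing. In the graph there is only one ‘purple’, and the named relationship says how it connects to each of its neighbours.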

In the video below, Louise Tutton, COO at Publishing Technology, talks about the Semantic Web and its opportunities at Online Information, London (30th November).

http://www.youtube.com/watch?v=Ky_JUDWXEDU

The Semantic Web: what’s the point?

19 Nov, 10 | by BMJ

Much of the data we use on a daily basis is not part of the Web. We can see bank statements and photographs online, as well as appointments in a calendar. But can we view our photos in a calendar to ascertain what we were doing when we took them? Or on a map so we know where we took them? Can we see bank statement lines in a calendar to help us put our purchases into context? The answer, currently, is no.

But why not? The simple answer is that we don’t have a web of data. Data is controlled by applications, and each application keeps its data to itself; applications don’t often like to share.

What’s different about the Semantic Web?

The Semantic Web (sometimes referred to as Web 3.0, Web 2.1 or Web 2.0++) is a web of data. The original Web mainly concentrated on the interchange of documents. The Semantic Web, however, is about more than that. It concentrates on common formats for integration and the combination of data drawn from diverse sources. It is also about language for recording how the data relates to real-world objects. This allows a person, or a machine, to start off in one database, and then move through an unending set of databases which are connected not by wires but by being about the same thing.
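To make ‘connected by being about the same thing’ a little more concrete, here is a minimal sketch in Python that answers the photos-in-a-calendar question once both data sets are available in a common format. The records, field names and the function `photos_with_context` are all hypothetical; the shared ‘thing’ in this toy case is simply the date.

```python
from datetime import date

# Two separate "applications", each holding its own data (values are made up).
photos = [
    {"file": "IMG_001.jpg", "taken_on": date(2010, 11, 13)},
    {"file": "IMG_002.jpg", "taken_on": date(2010, 11, 14)},
]
appointments = [
    {"title": "Conference in London", "on": date(2010, 11, 13)},
]

def photos_with_context(photos, appointments):
    """Pair each photo with the appointments that share its date."""
    by_day = {}
    for appt in appointments:
        by_day.setdefault(appt["on"], []).append(appt["title"])
    return [(p["file"], by_day.get(p["taken_on"], [])) for p in photos]

for file, events in photos_with_context(photos, appointments):
    print(file, events)
```

The join itself is trivial. The hard part, and the part the Semantic Web sets out to address, is getting applications to expose their data in a form where such shared reference points can be recognised at all.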

Tim Berners-Lee described the Semantic Web vision in the following terms:

I have a dream for the Web [in which computers] become capable of analysing all the data on the Web, the content, links, and transactions between people and computers. A Semantic Web, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The intelligent agents people have touted for ages will finally materialise. (1999)

Whereas Web 2.0 is focused on people, the Semantic Web is focused on machines. The Web requires a human operator, using computer systems to perform the tasks required to find, search and aggregate its information. It’s impossible for a computer to do these tasks without human guidance because Web pages are specifically designed for human readers. The Semantic Web is a project that aims to change that by presenting Web page data in such a way that it is understood by computers, enabling machines to do the searching, aggregating and combining of the Web’s information — without a human operator. So what are the real benefits offered by this web of data?

Intelligent search results

The major advantage of the Semantic Web is more intelligent searching, either across the web or in large-scale data repositories. ‘Intelligent’ here stands in contrast to the conventional keyword-based methods employed by search engines. For instance, if you search Google for ‘medical publishing’, you will notice that the vast majority of results on the first few pages contain the keywords ‘medical publishing’ in the page text. That is because the search engine does not process the available content semantically, so the results, though accurate, will be far from complete.

This is where the semantic web comes into play. The vision is to get a list of what you asked for even if your keyword does not exist within the web page. In the example above, a page of BMJ Group articles will not be considered relevant if the words ‘medical publishing’ do not appear on the page. In the semantic web world the system would ‘know’ that the BMJ Group publishes medical articles and would therefore return our articles to the user performing the query.
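The difference between the two kinds of search in the ‘medical publishing’ example can be sketched as follows. This is a toy model, not a description of how any real search engine works: the page texts, the `published_by` and `activity` relations, and both functions are assumptions made up for illustration.

```python
# Page texts are invented for this example.
pages = {
    "page:a": "A guide to medical publishing for new authors",
    "page:b": "Latest research articles from BMJ Group journals",
}

# Hypothetical background knowledge the system 'knows' about the pages.
facts = [
    ("page:b", "published_by", "BMJ Group"),
    ("BMJ Group", "activity", "medical publishing"),
]

def keyword_search(query, pages):
    """Return pages whose text literally contains the query."""
    return sorted(pid for pid, text in pages.items()
                  if query.lower() in text.lower())

def semantic_search(query, pages, facts):
    """Keyword hits plus pages published by anyone whose activity matches."""
    hits = set(keyword_search(query, pages))
    publishers = {s for s, p, o in facts
                  if p == "activity" and o.lower() == query.lower()}
    hits |= {s for s, p, o in facts
             if p == "published_by" and o in publishers}
    return sorted(hits)

print(keyword_search("medical publishing", pages))
print(semantic_search("medical publishing", pages, facts))
```

The keyword search finds only the page that contains the phrase, while the second search also returns the BMJ Group page, because the stated facts link that page to a publisher whose activity is medical publishing.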

Inferring knowledge

Another benefit is the capacity to infer knowledge from existing data. A system built using semantic web technologies, with the support of reasoning procedures, can logically (and independently) deduce information. A classic example is that from the statements ‘all men are mortal’ and ‘Socrates is a man’, we can deduce that ‘Socrates is mortal’. This kind of rule (membership of a class carrying over to its broader class), in combination with a wider set of properties such as transitivity, can augment the knowledge inserted in a system without requiring a human to insert each and every fact, thereby reducing both error and workload.

By stating, say, 5 facts to a system that uses an ontology (a formal vocabulary of concepts and their relationships) and a reasoner, the system might be able to deduce 15 facts by applying rules of logic (reasoning). This is precisely what allows the intelligent queries mentioned in the medical publishing example. Such a system, when asked “Is Socrates mortal?”, will return a YES. Systems without reasoning would produce the answer NO (or UNKNOWN in other cases). Similarly, Socrates would be included in a search like “show me all the mortals in the system”. This is, in fact, what is meant by ‘machine understandable’ information: the ability for a machine to process information independently.
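A very small reasoner for the Socrates example might look like the sketch below. It is an illustration under stated assumptions rather than how production reasoners are built: the relation names `is_a` and `subclass_of` and the function `infer` are invented here, and only one rule is applied (if X is a C and C is a subclass of D, then X is a D).

```python
# The only facts a human has stated to the system.
stated = {
    ("Socrates", "is_a", "man"),
    ("man", "subclass_of", "mortal"),   # 'all men are mortal'
}

def infer(facts):
    """Forward chaining: apply the rule until no new facts appear."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for x, r1, c in list(known):
            if r1 != "is_a":
                continue
            for c2, r2, d in list(known):
                if r2 == "subclass_of" and c2 == c:
                    new_fact = (x, "is_a", d)
                    if new_fact not in known:
                        known.add(new_fact)
                        changed = True
    return known

known = infer(stated)
question = ("Socrates", "is_a", "mortal")
print(question in stated)   # without reasoning
print(question in known)    # with reasoning
print(sorted(x for x, r, c in known if r == "is_a" and c == "mortal"))
```

The fact that Socrates is mortal was never stated, so a plain lookup does not find it, but it appears once the rule has been applied, and the final line answers the ‘show me all the mortals’ query from the inferred facts.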

For a good basic introduction to the Semantic Web, take a few minutes to watch the following video. No previous knowledge required!

http://www.youtube.com/watch?v=OGg8A2zfWKg
