The Wellcome Player, the Universal Viewer and IIIF

Digirati have been working with the Wellcome Library, The British Library and the National Library of Wales to further develop the viewing technologies introduced in the Wellcome Player.

Please see IIIF and the Universal Viewer before implementing the data model described on these pages.

Search within

Many books on the Wellcome Library site are searchable. This feature is present in the open source player, which will display a search box for a given assetSequence if its supportSearch property is true.

It's important to appreciate that "search within" is a very different technical problem from "search across my catalogue". In many ways it's a lot simpler.

  • Search Across... is something that you'd typically see at the library discovery layer level. Users will expect it to work like Google. It will potentially be searching across millions of pages of text. It's the kind of thing you'd build on top of Solr. Your search solution would use techniques like lemmatisation and word proximity to give users meaningful results, ranked by relevance.
  • Search within is confined to the text of the current book. Users (whether consciously or not) may expect it to work like searching a document in a word processor: it's about finding strings within a larger string. The larger string is the full text of the work. Even for a long book this is going to be a string that can be manipulated in code very quickly; computers are good at strings. You might use some more advanced search engine techniques, but people will still find it useful even if it does the most primitive kind of string matching. You don't need Solr to provide a useful implementation.

The current Wellcome implementation of Search Within relies on the fact that the work has been OCRed to produce ALTO files. Each image in an assetSequence has a corresponding ALTO file that describes the words OCRed from the image. ALTO is an XML format that, in simplest terms, contains an XML element for every word on the page, giving you not only the text of the word but also the position and size of the rectangle that coincides with the word on the image.
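As a rough illustration of reading words and rectangles out of ALTO, here is a minimal Python sketch. It is not the Wellcome implementation (which is in C#). It assumes the standard ALTO `String` element with its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes, and it assumes the coordinates are already in pixels; the sample XML and the `read_words` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO-like fragment, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Iron" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="40"/>
    <SP/>
    <String CONTENT="deficiency." HPOS="190" VPOS="200" WIDTH="210" HEIGHT="40"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def read_words(alto_xml):
    """Return one dict per ALTO String element, in document order."""
    words = []
    for element in ET.fromstring(alto_xml).iter():
        # Compare the local name only, so any ALTO namespace version matches.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append({
            "text": element.get("CONTENT", ""),
            # Assumption: the file's measurement unit is pixels.
            "x": int(float(element.get("HPOS", "0"))),
            "y": int(float(element.get("VPOS", "0"))),
            "w": int(float(element.get("WIDTH", "0"))),
            "h": int(float(element.get("HEIGHT", "0"))),
        })
    return words


words = read_words(SAMPLE_ALTO)
```

Running `read_words` over the ALTO file for each image in turn, and tagging each word with its image index, would give the raw material for the structures described next.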

This allows us to build a map of all the words in the document.

It also allows us to obtain the full text of the document as a single string, obtained by walking through the full set of ALTO files for every image in the work:

We can also normalise this, to remove capitalisation, punctuation etc, to make it easier to search:

While we are walking all the word elements in all the ALTO files to build this string, we can store each word in a map (named Words on this diagram):

In this simplified view of our C# implementation, each Word is an entry in a Dictionary; the key of the dictionary is the position of that word in the normalised text string. So if I search the normalised text string for the string "dysentery" I can find that there are 8 occurrences, and I can find the start position of each occurrence in the string (.indexOf("dysentery")). I can look up each of these start indexes in the dictionary to obtain the Word I stored earlier, which then gives me the position and size, and the original "unnormalised" text for that word.
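The idea can be sketched in a few lines of Python. This is only an approximation of the approach described here, not the C# code itself: the `Word` tuple, `build_index` and `search` are hypothetical names, the normalisation (lower-casing and stripping ASCII punctuation) is an assumption, and matches that begin in the middle of a word are simply ignored.

```python
import string
from collections import namedtuple

# Original text plus the image index and rectangle for one OCRed word.
Word = namedtuple("Word", "text image x y w h")

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalise(text):
    """Lower-case the text and strip ASCII punctuation (an assumed scheme)."""
    return text.lower().translate(_PUNCTUATION)


def build_index(words):
    """Return (normalised_text, positions); positions maps the start offset
    of each word in normalised_text to the Word it came from."""
    parts, positions, offset = [], {}, 0
    for word in words:
        token = normalise(word.text)
        if not token:
            continue  # the word was nothing but punctuation
        positions[offset] = word
        parts.append(token)
        offset += len(token) + 1  # +1 for the joining space
    return " ".join(parts), positions


def search(term, normalised_text, positions):
    """Return the Word at every occurrence of term that starts a word."""
    needle, hits = normalise(term), []
    if not needle:
        return hits
    start = normalised_text.find(needle)
    while start != -1:
        word = positions.get(start)
        if word is not None:  # skip matches that begin mid-word
            hits.append(word)
        start = normalised_text.find(needle, start + 1)
    return hits


sample = [
    Word("Iron", 0, 100, 200, 80, 40),
    Word("medication.", 0, 190, 200, 210, 40),
    Word("Zinc", 1, 100, 300, 90, 40),
    Word("medications.", 1, 200, 300, 230, 40),
]
text, positions = build_index(sample)
hits = search("medication", text, positions)
```

Each hit carries the original text ("medication." with its full stop) together with the rectangle to highlight, even though the search ran against the normalised string.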

In practice we need to take into account spaces, search terms that are more than one word, hyphenation, words running between pages and other artefacts of typography. We also need to coalesce results into single rectangles where words are contiguous.
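Coalescing could look something like the following Python sketch. It assumes that "contiguous" means consecutive words on the same image whose rectangles overlap vertically (that is, they sit on the same line); the `coalesce` helper and the sample coordinates are invented for illustration, and the real implementation may use different rules.

```python
def coalesce(rects):
    """Merge consecutive word rectangles that sit on the same line.

    Each rect is a dict with x, y, w and h keys, given in reading order
    and all taken from the same image.
    """
    merged = []
    for rect in rects:
        last = merged[-1] if merged else None
        same_line = (
            last is not None
            and rect["y"] < last["y"] + last["h"]   # vertical overlap...
            and last["y"] < rect["y"] + rect["h"]   # ...in both directions
            and rect["x"] >= last["x"]              # and further along the line
        )
        if same_line:
            # Grow the previous rectangle into the bounding box of both.
            right = max(last["x"] + last["w"], rect["x"] + rect["w"])
            bottom = max(last["y"] + last["h"], rect["y"] + rect["h"])
            last["x"] = min(last["x"], rect["x"])
            last["y"] = min(last["y"], rect["y"])
            last["w"] = right - last["x"]
            last["h"] = bottom - last["y"]
        else:
            merged.append(dict(rect))  # copy so the input is not mutated
    return merged


# A three-word phrase that wraps onto a second line after two words.
phrase = [
    {"x": 1500, "y": 100, "w": 120, "h": 40},
    {"x": 1640, "y": 102, "w": 300, "h": 40},
    {"x": 200, "y": 160, "w": 150, "h": 40},
]
highlights = coalesce(phrase)
```

Here the first two words collapse into one highlight, and the word on the next line keeps its own rectangle.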

If we have a list of words in the document, we can also reduce that list to the set of distinct words, and then we can partition that set into smaller sets. This allows us to provide a fast autocomplete service. The AutoCompleteBuckets Dictionary uses three-letter keys to partition the words (all words beginning with "cat" and so on). Once the user has typed 4 or more letters, it's fast enough just to filter the words in the matching three-letter bucket; we don't need further structure.
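A minimal Python sketch of the bucket idea follows. The function names are hypothetical, and two details are assumptions rather than facts about the Wellcome code: words shorter than three letters are left out of the buckets, and no suggestions are offered until three letters have been typed.

```python
from collections import defaultdict


def build_buckets(words, prefix_length=3):
    """Partition the distinct words by their first prefix_length letters."""
    buckets = defaultdict(set)
    for word in words:
        if len(word) >= prefix_length:  # assumption: skip very short words
            buckets[word[:prefix_length]].add(word)
    # Sort each bucket once so suggestions come back in a stable order.
    return {prefix: sorted(bucket) for prefix, bucket in buckets.items()}


def autocomplete(typed, buckets, prefix_length=3):
    """Suggest words starting with what the user has typed so far."""
    typed = typed.lower()
    if len(typed) < prefix_length:
        return []
    bucket = buckets.get(typed[:prefix_length], [])
    # Four or more letters: a linear filter of one small bucket is enough.
    return [word for word in bucket if word.startswith(typed)]


normalised_words = ["cat", "catalogue", "catarrh", "cathartic", "dysentery", "cat"]
buckets = build_buckets(normalised_words)
suggestions = autocomplete("cata", buckets)
```

Because each bucket holds only the words sharing one three-letter prefix, the filter on longer input touches a tiny fraction of the book's vocabulary.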

We find that we can parse a book's worth of ALTO files into these structures in a few seconds. We then serialise this structure to disk, using binary serialisation, on the grounds that binary deserialisation is a faster way to recover the object than re-parsing the ALTO files. We also keep the structure in memory in the server for a short time, on the grounds that a user searching within a book is likely to perform several searches in a short space of time.
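The two levels of caching might be sketched like this in Python, with `pickle` standing in for the binary serialisation mentioned above. Everything named here is hypothetical: the `load_index` helper, the five-minute lifetime, the file naming and the book identifier "b1" are all assumptions made for the example.

```python
import pickle
import tempfile
import time
from pathlib import Path

CACHE_SECONDS = 300  # hypothetical "short time" to keep an index in memory
_memory = {}         # book id -> (expiry time, index)


def load_index(book_id, cache_dir, build):
    """Return the index for a book: from memory if still fresh, else from
    disk, and only as a last resort by calling build (the slow ALTO parse)."""
    now = time.monotonic()
    cached = _memory.get(book_id)
    if cached and cached[0] > now:
        return cached[1]
    path = Path(cache_dir) / f"{book_id}.pickle"
    if path.exists():
        with path.open("rb") as handle:
            index = pickle.load(handle)
    else:
        index = build(book_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as handle:
            pickle.dump(index, handle)
    _memory[book_id] = (now + CACHE_SECONDS, index)
    return index


calls = []


def slow_build(book_id):
    """Stand-in for parsing every ALTO file in the book."""
    calls.append(book_id)
    return {"text": "iron medication", "positions": {0: "Iron", 5: "medication."}}


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_index("b1", cache_dir, slow_build)   # parsed, then written to disk
    second = load_index("b1", cache_dir, slow_build)  # served from memory
    _memory.clear()                                   # simulate the entry expiring
    third = load_index("b1", cache_dir, slow_build)   # read back from disk
```

In this run the slow build happens only once; later requests are answered from memory or from the serialised file.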

Together this gives us autocomplete and search within:



Search within returns the rectangles (as pixel coordinates on the original image) that should be highlighted:


    [
        {
            "index": 464,
            "rects": [
                {
                    "x": 1774,
                    "y": 2673,
                    "w": 269,
                    "h": 43,
                    "hit": 0,
                    "before": "t is adequate. The principal clinical feature of iron deficiency is hypochromic, microcytic anaemia. It comes on insidiously and will respond to iron ",
                    "word": "medication.",
                    "after": " The bone marrow shows micronormoblasts and absence of haemosiderin. Serum iron is reduced, while total iron-binding capacity (TIBC) is increased so t"
                }
            ]
        },
        {
            "index": 466,
            "rects": [
                {
                    "x": 870,
                    "y": 2997,
                    "w": 293,
                    "h": 44,
                    "hit": 1,
                    "before": "ome of the tissue changes seen with iron deficiency may be caused by zinc deficiency (Hallberg, 1964). T races of zinc accompany iron in foods and in ",
                    "word": "medications.",
                    "after": " Zinc-deficient animals show hyperkeratinization of the oesophagus and changes in the nails (Follis et al., 1941; Nishimura, 1953). In diagnosing iron"
                }
            ]
        }
    ]