Ask for what you want: Tips for creating speech-based content search

By Charles Dawes, Rovi
June 14th 2016 at 2:48PM

The widespread adoption of portable devices has led to renewed interest in speech-based content search. Rovi’s Charles Dawes explains the essentials of designing a speech search system that works.

For many years, effective voice-based search technologies have eluded businesses trying to bring next-generation input methods to customers. Confined to basic navigation and so-called “magic words”, command-based speech systems have been ineffective and hard for consumers to use.

However, the widespread adoption of smartphones and tablets, with their minimised keyboards, has renewed interest in this technology. Apple’s Siri, Amazon’s Alexa and Google Now’s ‘OK Google’ have progressed beyond basic menu navigation and struck a chord with consumers and businesses alike. In fact, any device with a microphone has potential for speech-based commands, and can become an intelligent discovery system that uses a sophisticated entertainment brain to understand customer desires.

This technology is important and under-explored by the TV industry, which often appears to have been left behind in terms of intuitive discovery functionality. Hundreds of channels and a bewildering array of programmes present consumers with a complex picture, and the grid-based TV guide is still navigated with a clunky remote control. For content providers, giving customers easy access to their favourite shows and genres should be paramount, and voice-based search and recommendation should be a core part of their customer service provision.


Speaking the viewer’s language

Video is a difficult medium to search, and people examine video content in a unique fashion, combining preferred selections and considerations across cast, plot, and genre, all of which differ depending on the user and their preferences. Conversational interfaces simulate the qualities of natural communication and remove the need to conform to hierarchical menu structures. For these, the technology must understand when a user is drilling into a particular genre in detail, and when they have lost interest and switched topics completely. To be successful, natural language search needs to address three points:

  • Disambiguation: Natural language technology must understand and interpret the user’s intent. For example, the phonetic sound “Kroos” could refer to Tom Cruise or Penelope Cruz, and the system should be able to work out which the user means from the context of the original query. “City” can apply to Manchester City or Norwich City in a sporting context, so again, the system must learn the user’s preference.
  • Statefulness: In the course of a dialogue with a user, the system should be able to maintain context, and understand that people often jump from one item to another. For example, the user could say that they are “in a mood for thrillers”, then jump to “Bond” and then to “old ones”. Ideally, the system should understand these requests, and serve up a series of older James Bond films for the viewer to select from. If the user then says “The Young Ones”, the system needs to know it should jump to the cult British comedy and not to newer James Bond films.
  • Personalisation: Conversational systems need to understand their users on an individual basis. For example, the system should learn that a user based in Manchester who asks “when is the game tonight” wants to know about their local team, and if they say, “when is the City game” they mean Manchester City.
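
The statefulness example above can be sketched in a few lines of Python. This is only an illustrative toy, not a description of any vendor's system: the catalogue, the tag names, the `Conversation` class and the cutoff year used for "old ones" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical toy catalogue: the tags and the "old" cutoff are illustrative assumptions.
CATALOGUE = [
    {"title": "Goldfinger", "year": 1964, "tags": {"thriller", "bond"}},
    {"title": "Skyfall", "year": 2012, "tags": {"thriller", "bond"}},
    {"title": "Heat", "year": 1995, "tags": {"thriller"}},
    {"title": "The Young Ones", "year": 1982, "tags": {"comedy"}},
]
KNOWN_TAGS = ("thriller", "bond", "comedy")
OLD_CUTOFF_YEAR = 1990  # assumed meaning of "old ones"


@dataclass
class Conversation:
    """Keeps the filters accumulated over one dialogue."""
    tags: Set[str] = field(default_factory=set)
    max_year: Optional[int] = None

    def say(self, utterance: str) -> List[str]:
        text = utterance.lower().strip()
        # An exact title match is treated as a topic switch: drop all context.
        for item in CATALOGUE:
            if item["title"].lower() == text:
                self.tags.clear()
                self.max_year = None
                return [item["title"]]
        # Otherwise refine the current context with whatever the utterance adds.
        if "old ones" in text:
            self.max_year = OLD_CUTOFF_YEAR
        self.tags.update(tag for tag in KNOWN_TAGS if tag in text)
        return [
            item["title"]
            for item in CATALOGUE
            if self.tags <= item["tags"]
            and (self.max_year is None or item["year"] <= self.max_year)
        ]
```

Checking for a full title before applying filter words is the design choice that lets “The Young Ones” switch topic instead of being read as a request for newer Bond films. A real system would rely on speech recognition and a far richer language model than substring matching.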

Taking understanding to the next level

Behind successful natural language technology lie excellent search capabilities. Search providers have blazed a trail in harnessing new technologies to better provide for their customers. In 2012, Google announced its “Knowledge Graph”, which was designed to understand that keywords weren’t just strings of characters but that they referred to real things in the world that are related to each other in meaningful ways.

In 2013, Facebook revealed “Graph Search”, which trawls for results based on the searcher’s friends, content and relationships, as well as wider trends on the site. Unlike a traditional database, a graph is much more scalable and flexible because it allows all sorts of possibly unexpected kinds of information to be connected to records, without relying on “tables”. These technologies have introduced high-quality and relevant search results to consumers everywhere, and have set a benchmark across industries.
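
To make the contrast with tables concrete, here is a minimal sketch of a graph held as labelled edges. The `EntityGraph` class and the relation names are hypothetical, and this is not how Google or Facebook implement their graphs; it only shows that a new kind of fact can be attached to a record without altering any fixed schema.

```python
from collections import defaultdict


class EntityGraph:
    """Undirected graph whose edges carry a free-form relation label."""

    def __init__(self):
        self._edges = defaultdict(list)

    def connect(self, source, relation, target):
        # Store the edge in both directions so either end can be explored.
        self._edges[source].append((relation, target))
        self._edges[target].append((relation, source))

    def neighbours(self, node, relation=None):
        # Optionally restrict the result to one relation label.
        return [
            other
            for rel, other in self._edges.get(node, [])
            if relation is None or rel == relation
        ]


graph = EntityGraph()
graph.connect("Tom Cruise", "acted_in", "Top Gun")
graph.connect("Top Gun", "has_genre", "Action")
# A new kind of fact needs no schema change, only a new relation label.
graph.connect("Top Gun", "released_in", "1986")
```

Calling `graph.neighbours("Top Gun")` returns every connected entity, while `graph.neighbours("Top Gun", "acted_in")` narrows the result to the cast.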

In the context of TV, most consumers have viewing patterns that can be mapped to provide highly personalised results to searches. This is more accurate than user-created profiles or ‘thumbs up/down’ ratings, both of which are error-prone and do not automatically take into account users’ changing tastes and preferences over time. The ability to make personalisation precise and extremely relevant – what the industry is now terming hyper-personalisation – is closely tied to the knowledge graph’s semantic capabilities.

Getting graphic

At its core, a quality conversational search engine should include the following aspects:

  • Knowledge graph: This makes it possible to map search results to intention, and prioritise those results based on the weight of their connection, rather than simply keywords and search terms. A knowledge graph focused on entertainment should be able to:
    • Look at named entities in media, entertainment and geography and extract, de-duplicate and disambiguate the entities across sources
    • Recognise similarities and build relationships between entities
    • Identify a multidimensional view of popularity and how audience interest in the entities shifts over time
    • Generate a large vocabulary such as keywords and sub-genres to help search systems identify relevant content
  • Personal graph: Crucial to true conversational systems, the personal graph tunes the conversational system to individuals to enable natural conversations that have a deep understanding of that individual user’s preferences and context. The personal graph is:
    • Based on statistical machine learning
    • Able to learn individual behavioural patterns and interests
    • Able to learn how time and device affect recommendations
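
As a rough sketch of how the two graphs might combine, the snippet below ranks the candidates for an ambiguous spoken name by adding a per-user affinity to a general popularity weight. Everything here is assumed for illustration: the `PHONETIC_INDEX` and `POPULARITY` tables, their values, the `resolve` function and the simple additive scoring. A production system would learn such weights statistically.

```python
from typing import Dict, List

# Hypothetical knowledge-graph data: candidates for a phonetic query and
# illustrative (made-up) general popularity weights.
PHONETIC_INDEX: Dict[str, List[str]] = {"kroos": ["Tom Cruise", "Penelope Cruz"]}
POPULARITY: Dict[str, float] = {"Tom Cruise": 0.6, "Penelope Cruz": 0.4}


def resolve(sound: str, personal_affinity: Dict[str, float]) -> List[str]:
    """Rank candidate entities by general popularity plus personal affinity."""
    candidates = PHONETIC_INDEX.get(sound.lower(), [])
    scored = [
        (POPULARITY[name] + personal_affinity.get(name, 0.0), name)
        for name in candidates
    ]
    # Highest combined score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]
```

With an empty personal graph, `resolve("Kroos", {})` falls back to general popularity. A viewer whose (assumed) history gives Penelope Cruz an affinity of 0.5 would see her ranked first instead.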

At the front end of the system, the conversational query engine is required to bind all aspects together. This brings together key algorithms to map and learn linguistic features and provide content discovery features to customers.

Intuitive search and recommendation

Natural language technology backed with knowledge graphs can provide a revolution in TV search and recommendation. Based on excellent metadata that covers actors and actresses, content synopses and even famous quotations from films, TV providers can create a second-to-none entertainment brain that offers customers speedy and accurate access to their favourite shows, and to similar content that they might enjoy. Voice-based discovery around knowledge graphs is no gimmick – it is set to change the way that people interact with their TV sets – as long as providers make it personalised, intuitive, and natural.