Generative AI resets the playing field
Generative AI is an umbrella term for artificial intelligence models that can generate content: text, code, images, video, music and, of course, voices. Examples of generative AI include Midjourney, DALL-E, and ChatGPT.
Large language models (LLMs) are a type of generative AI that is trained on text and produces textual content. ChatGPT is the best-known example of generative text AI. Add in existing technology for text-to-speech and speech-to-text, and companies have what they need to create virtual assistants based upon generative AI.
This is a primary reason why the virtual assistant market stands at a crossroads in technical innovation. Advancements in large language models and generative AI are now considered fundamental to expanding the efficacy of voice interfaces and virtual assistant platforms. Natural and fluid conversational interfaces, coupled with the prospect of AI providing contextual awareness and greater personalisation, now appear to be the key characteristics necessary to unlock future potential.
Despite the efforts of assistant platform owners, user engagement with traditional “intent-based” solutions had stalled significantly. Although the number of interactions expanded, the industry never quite solved the challenge of steering consumers towards monetisable products and services through voice alone. Instead, “command and control” interaction dominated, “voice first” interfaces never quite reached maturity, and it became doubtful that the financial investment in virtual assistants as a standalone product could ever be fully justified.
Fortuitously, the development effort achieved throughout those years was not entirely sacrificed: instead, it has formed the baseline of genuinely groundbreaking solutions constructed around AI and neural computation. Indeed, generative AI offers a welcome disruption in the technology landscape, with large language models promising to improve the accuracy of understanding user intents while significantly enhancing the construction of vocalised responses in a new generation of virtual assistant platforms. This has essentially reset the playing field for voice interfaces.
Innovation pathway
All leading assistant platform vendors are now following broadly the same innovation pathway, with AI viewed as the catalyst for more intelligent voice and visual interfaces. Microsoft, Amazon and Google were arguably the first companies to illustrate the way forward here. Microsoft introduced Copilot on PCs as an AI-based productivity tool in late 2023; Amazon are integrating large language models into Alexa; and Google is migrating users of Google Assistant over to their Gemini platform.
This momentum continued throughout 2024: Apple introduced a new version of Siri based upon generative AI, and Baidu are further developing ERNIE, their own large language model, for products in China. Meanwhile, Samsung are busy expanding Bixby’s capabilities, employing their own generative AI, called Samsung Gauss; and both Huawei and Yandex are pursuing broadly similar efforts with Pangu for Celia, and YandexGPT for Alice.
Each of these efforts is already demonstrating promising new capabilities in voice interfaces. Nevertheless, there are difficulties to overcome.
Challenges ahead
The enormous size of large language models is causing concern for platform owners. The costs involved in both training these AI leviathans and providing the server infrastructure necessary to run them appear completely disproportionate to the revenue opportunity these assistants may eventually deliver. Therefore, platform owners are seeking new methods of monetisation, often via subscription tiers that unlock access to the largest and most capable models, plus a free variant that may be suitable for simple interactions or basic “command and control” functions.
The eradication of hallucinations – incorrect or fabricated responses generated by LLMs – remains awkwardly unresolved. This must be addressed because trust is vital for user engagement with new virtual assistant platforms. If that trust is eroded by haphazard responses that are worse than those of intent-based solutions, then usage will dwindle.
Moreover, voice has always promised to create a natural, frictionless interface for humans to interact with apps and services, although this dream has never fully materialised. Large language models are no doubt helping to solve this by presenting countless ways for users to express intent and phrase queries, with improved accuracy in task recognition. However, outside of the enterprise and industrial segments, the “killer app” for voice is arguably yet to be discovered.
Why large language models are desirable
Large language models will form the basis of future virtual assistants because they can present information in a clear, conversational style, easy for users to comprehend. These models excel at linguistic tasks including language translation, sentiment analysis, sentence completion and formulating answers to questions. As such, they can succinctly extract user intent from a sequence of verbal exchanges; moreover, they can frame additional questions that establish dialogue with users and help qualify meaning when the original intent is unclear.
Large language models can exhibit what is known as "in-context learning": once a model has been pretrained, it can adapt to a new task from instructions and examples supplied in the prompt, without any further training. This enables these models to adapt quickly, because they need only a handful of examples rather than a large training set to arrive at the desired outcome. In-context learning is valuable because it can be combined with historic context to maintain knowledge of past conversations and remember user preferences, helping to personalise virtual assistant platforms. This might prove a difficult pathway to pursue, given the privacy concerns and regulation that could ensue and thwart the potential here. But by drawing on a database of previous interactions, large language models, and generative AI more widely, can construct far better reasoning and situational awareness.
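As a rough illustration of in-context learning combined with stored preferences, the sketch below assembles a "few-shot" prompt for intent recognition. It is a minimal sketch only: the intent labels, the example utterances, the preference string and the `build_few_shot_prompt` helper are all hypothetical, and no model is actually called.

```python
from typing import List, Tuple

# Hypothetical labelled examples; the intents and phrasing are illustrative only.
EXAMPLES: List[Tuple[str, str]] = [
    ("Turn the living room lights off", "smart_home.lights_off"),
    ("What's the weather like tomorrow?", "weather.forecast"),
    ("Play some jazz in the kitchen", "media.play_music"),
]


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], history: List[str], utterance: str
) -> str:
    """Assemble a prompt that shows the model worked examples and prior context.

    No model weights change: the examples and the stored history are simply
    placed in the prompt ahead of the new utterance.
    """
    lines = ["Classify the user's request into an intent label.", ""]
    for text, intent in examples:
        lines.append(f"User: {text}")
        lines.append(f"Intent: {intent}")
        lines.append("")
    if history:
        lines.append("Known user preferences:")
        lines.extend(f"- {item}" for item in history)
        lines.append("")
    lines.append(f"User: {utterance}")
    lines.append("Intent:")  # the model would be asked to complete this line
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    EXAMPLES,
    history=["Prefers temperatures in Celsius"],  # assumed stored preference
    utterance="Will I need an umbrella on Friday?",
)
print(prompt)
```

The resulting text would be sent to whichever language model the assistant uses. Swapping the examples changes the task without any retraining, which is the property the paragraph above describes.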
The performance of large language models continues to improve as more training data and parameters are introduced. As highlighted earlier, the costs of compute and storage are the prohibiting factors governing model size. But this can be offset to some extent through optimisation techniques such as “quantisation”, where a model’s weights are stored at lower numerical precision, reducing memory and compute requirements at some expense of accuracy. In parallel, the development of “small” language models (SLMs), those optimised for on-device use or specific tasks, will broaden the opportunity for new assistant platforms to emerge, especially those designed for dedicated verticals and applications where precision is paramount, such as in healthcare settings or the automotive sector.
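To make the trade-off concrete, here is a minimal sketch of one common form of quantisation: mapping floating-point weights to 8-bit integers with a single shared scale factor. The weight values are made up for illustration, and production toolchains use more elaborate schemes (per-channel scales, calibration data and so on).

```python
from typing import List, Tuple


def quantise_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantise(values: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in values]


weights = [0.82, -1.27, 0.031, 0.5, -0.004]  # made-up example weights
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# The rounding step is where accuracy is lost.
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("largest rounding error:", worst_error)
print("bytes as 32-bit floats:", 4 * len(weights), "-> bytes as int8:", len(weights))
```

Each weight shrinks from four bytes to one, while the rounding error per weight stays within half of the scale factor. Whether that loss is acceptable depends on the application.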
Further considerations
Large language models give the impression that they understand meaning and can respond accurately. But fundamentally, they are nothing more than an enormous array of statistical computations whose output is not always predictable, which is why hallucinations are perhaps the most documented concern. Today’s large language models simply predict the most statistically likely next word or word fragment; they cannot wholly interpret human meaning or emotions, so the output is sometimes incorrect or misleading.
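A toy sketch can show why statistical next-word prediction sometimes yields a fluent but wrong answer. The scores below are invented for illustration, and real models choose among tens of thousands of word fragments rather than three whole words.

```python
import math
import random
from typing import Dict


def softmax_with_temperature(scores: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = {tok: s / temperature for tok, s in scores.items()}
    peak = max(scaled.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_next_token(scores: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Draw one candidate at random, weighted by its probability."""
    probs = softmax_with_temperature(scores, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Invented scores for the word following "The capital of Australia is".
scores = {"Canberra": 2.0, "Sydney": 1.6, "Melbourne": 0.9}

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
probs = softmax_with_temperature(scores, temperature=1.0)
draws = [sample_next_token(scores, 1.0, rng) for _ in range(1000)]

print(probs)
print({tok: draws.count(tok) for tok in scores})
```

Under these assumed scores the correct answer is the single most likely word, yet the plausible but wrong alternatives still carry substantial probability and are sampled regularly. Nothing in the sampling step checks whether the chosen word is true.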
Moreover, generative AI can present critical security risks when triggered with certain inputs. There are examples of Generative Pretrained Transformer (GPT) language models revealing large cohesive parts of the training data verbatim when asked to produce infinite output, although it is fair to state that this type of prompting has now been closed off by the industry.
Training bias is another important consideration. The types of data used to train language models heavily influence the outputs any given model creates. Therefore, if the training dataset lacks appropriate diversity, or is insufficiently representative, the outputs produced by a virtual assistant based upon a large language model will also exhibit bias.
Much has been written on the topic of consent. By their nature, large language models are trained on a plethora of datasets, some of which may have been used without consent. The scraping of data from the internet to train AI has been documented to ignore copyright licenses, copy written content, and repurpose proprietary content without permission from the original owners. It becomes almost impossible to track data sources, and often no credit is conferred on the creators, which can expose users to copyright infringement issues. This issue can be partly resolved by using synthetic data created specifically for training, although the rules of bias again apply here insofar as synthesized data must also present a sufficiently sizable and unbiased training set.
Finally, it can be difficult to scale and maintain these platforms, given the significant compute and storage necessary for training, improvement and optimisation. The deployment of a virtual assistant using a large language model demands significant expertise in deep learning, an appropriate transformer model, plus distributed software and hardware.
Technology changes, but virtual assistants endure
In terms of market sizing, the challenging economic situation in most regions means that consumers are not replacing products as frequently. Nevertheless, shipments of voice-enabled consumer products in 2024 are projected to rise by 16% to approach 2.4 billion “built-in” units globally. The increase is broadly attributable to the introduction of Microsoft Copilot into PCs and laptops, boosting the number of devices with integrated virtual assistant technology. Correspondingly, the installed base rises to 7.2 billion units, an increase of 7% year-over-year.
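For readers who want the prior-year baselines these growth rates imply, the short calculation below backs them out. The results are derived here rather than quoted from Futuresource. They assume the 7.2 billion installed base is also a 2024 figure, and they treat "approaching 2.4 billion" as exactly 2.4 billion, so they should be read as approximate.

```python
def previous_value(current: float, growth_pct: float) -> float:
    """Back out last year's figure from this year's figure and the growth rate."""
    return current / (1 + growth_pct / 100)


shipments_2024 = 2.4       # billion units; the article says "approach 2.4 billion"
installed_base_2024 = 7.2  # billion units; assumed to refer to 2024

shipments_2023 = previous_value(shipments_2024, 16)
installed_base_2023 = previous_value(installed_base_2024, 7)

print(f"Implied 2023 shipments: ~{shipments_2023:.2f} billion")
print(f"Implied 2023 installed base: ~{installed_base_2023:.2f} billion")
```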
Given the investment in generative AI and large language models presently underway, Futuresource anticipate several product iterations in the next twelve to eighteen months that significantly expand the capabilities of all popular virtual assistant platforms in use today. Voice technology shows no sign of vanishing entirely from the consumer electronics ecosystem, but whether platform vendors maintain their existing and well-known assistant brands during the transition to generative AI is still to be decided.
Futuresource Consulting is a market research and consulting company, providing its clients with expertise in Professional AV, Consumer Electronics, Education Technology, Content & Entertainment, Professional Broadcast and Automotive. Combining strong methodologies and unsurpassed data refinement with in-depth market knowledge and forecasting, Futuresource deliver the latest insights and technological developments to drive business decision-making.
For more information on the Futuresource coverage in this area, please visit our website. To discuss the latest market and end user reports in detail, contact benedict.greenwood@futuresource-hq.com.
Multi-sector AI will be a topic in the new 2024 Future Sessions by Futuresource Consulting. Sign up to be the first to hear the latest on multi-sector AI.