Leading companies are taking transformative steps in Artificial Intelligence (AI) throughout 2024, as the technology continues to penetrate virtually every aspect of computing. The advent of pre-trained transformer models has revolutionised the AI discipline. Meanwhile, the semiconductor industry is making astonishing progress in designing silicon chips with processors specifically engineered to execute AI-based tasks more efficiently.
This combination of factors is creating the foundation for AI to develop even faster, as innovation in neural computation methods intersects with more powerful computing hardware. This progress was deftly illustrated by OpenAI and Google, who both demonstrated significant step-changes in AI performance during announcements in May this year: OpenAI’s GPT-4o and advancements to Google’s Gemini programme.
OpenAI’s new multi-modal AI model, GPT-4o, offers a significant increase in performance and capability over GPT-3.5. According to OpenAI, the new model is designed around a single neural network that can accept any combination of text, audio and image as input. This approach allows GPT-4o to retain critical information and context that would otherwise be forfeited in the pipeline of independent models employed in earlier versions. For example, GPT-3.5 relies on speech-to-text engines to convert dialogue into text, so the subtle emotional nuances found in voices are lost during that process. In contrast, GPT-4o handles vocal inputs complete with intonation and can associate this with visual cues to further improve contextual understanding. In achieving this, OpenAI delivers a tool that is adept at handling natural human interaction. In vocal exchanges, it feels more empathetic, more engaging, more emotionally attuned and better able to hold conversations; frictionless, even.
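To make the contrast between the earlier cascaded pipeline and a single multi-modal model more concrete, here is a minimal sketch using the publicly documented OpenAI Python SDK. It is illustrative only: the file name and prompts are hypothetical, the cascaded path stands in for the older voice pipeline (speech-to-text feeding a text-only model), and an image is used as the second modality because that input route is publicly available in the API.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1) Cascaded pipeline (earlier approach): transcribe speech first, then pass
#    plain text to a text-only model. Intonation and emotion are discarded at
#    the transcription step, so the model never "hears" how it was said.
with open("question.mp3", "rb") as audio_file:  # hypothetical recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )
pipeline_reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)

# 2) Unified multi-modal call: GPT-4o accepts mixed content in one request,
#    so context from each modality reaches a single model.
unified_reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this scene?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/scene.jpg"}},
        ],
    }],
)
print(unified_reply.choices[0].message.content)
```

The point of the sketch is architectural rather than API-specific: in the cascaded path, whatever the transcription step discards is gone before the language model ever sees the request.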
Multi-modality enables the AI model to extract deeper context and meaning by chaining together those inputs before generating results that are markedly more useful and coherent. It can perform more complex tasks such as identifying objects in video, reading hand-written text and providing guidance on solving mathematical problems. Alongside this, the conversational aspects are significantly improved, with demonstrations of real-time translation and verbal responses containing expressive elements such as intonation, emotion and laughter.
GPT-4o appears to be more efficient in its use of cloud compute resources and is faster to respond overall. OpenAI is therefore offering users limited free access to GPT-4o within its desktop application and smartphone app, whereas access to more capable (and expensive) models was previously a paid service. Any reduction in compute requirements is welcome: AI must become sustainable and economically viable, yet the direction of AI development has been emphatically towards ever larger models and increasing load on data centres.
Google's announcement comprised extensive updates to its Gemini programme. Several new AI-powered features are being integrated into existing Google products, including Gmail, Google Docs, Google Photos and, of course, the Google search engine. Most are not yet available worldwide, with access restricted either to the USA or to the developer beta programme.
The main emphasis is on multi-modality and long context windows to enhance Gemini’s AI capabilities. Gemini 1.5 Pro now has a one-million-token context window – think of this as the input capacity of the neural model, or how much information it can be given to work on. That window is equivalent to the text in a 1,500-page document or the data in around an hour of video (a rough sense-check of this arithmetic is sketched below), and Google announced it will be doubled to two million tokens later this year, allowing Gemini to handle even more data. Concurrently, a new, smaller Gemini 1.5 Flash model was revealed, designed to be lightweight and fast, with lower latency for simpler tasks.
A multi-modal version of Gemini Nano will target Google Pixel smartphones later in 2024, capable of processing audio-visual data alongside text and speech. Google will clearly use Pixel smartphones as a demonstration platform for Gemini, so we expect AI elements to penetrate deeper into flagship Android smartphones thereafter, with extensions in computational photography, image manipulation and voice search functions.
Another new model, called Veo, can generate convincing HD video from just a text prompt. It is being furnished with creative tools to enable finer control of the output and editing of the results. Veo is likely to reside in the cloud initially, given its application in professional video production, but could eventually be ported to consumer devices as AI models become smaller and more efficient over time.
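As a rough sense-check of those context-window comparisons, the short sketch below converts a token budget into approximate page and video-length equivalents. All of the constants are illustrative assumptions (words per page, tokens per word, video tokens per second), not figures published by Google; they simply show that one million tokens lands in the region of a 1,500-page document or roughly an hour of video.

```python
# Back-of-envelope conversion of a context window (in tokens) into rough
# document-length and video-length equivalents. The constants are assumed
# heuristics for illustration, not published Google figures.

WORDS_PER_PAGE = 500            # assumption: a dense page of text
TOKENS_PER_WORD = 1.3           # assumption: typical English tokenisation
VIDEO_TOKENS_PER_SECOND = 280   # assumption: ~1 sampled frame/sec plus audio

def pages_for(tokens: int) -> float:
    """Approximate number of text pages that would fill the window."""
    return tokens / (WORDS_PER_PAGE * TOKENS_PER_WORD)

def video_minutes_for(tokens: int) -> float:
    """Approximate minutes of video that would fill the window."""
    return tokens / VIDEO_TOKENS_PER_SECOND / 60

for window in (1_000_000, 2_000_000):
    print(f"{window:,} tokens ≈ {pages_for(window):,.0f} pages "
          f"or about {video_minutes_for(window):,.0f} minutes of video")
```

With these assumptions, one million tokens works out to roughly 1,540 pages or an hour of video, and two million tokens to roughly double that, in line with the comparisons Google gave.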
But perhaps of most interest is Project Astra. This is a programme being run by Google’s DeepMind AI unit, showcasing Google’s latest advancement toward an all-encompassing AI assistant. It uses Gemini to combine video and voice inputs in real time to identify objects, answer questions about the purpose of computer code, and read and comprehend text. More impressive still is the ability to remember past conversations or things that Gemini has seen in previous interactions: the demonstration showed the AI correctly remembering where the presenter had placed their reading glasses and explaining how to find them. There was a suggestion of an AR wearable for Astra, which implies that a product similar to Google Glass might be revealed at a later date. Alternatively, Google might instead elect to enable Gemini-based AR features in Android or Wear OS for other manufacturers.
The announcements accentuate Google’s focus on AI as it aims to rise above competitors, including OpenAI. However, the increased attention on AI undoubtedly brings financial risks to a decades-old ecosystem that depends heavily on digital advertising, especially where internet “search” is superseded by “generation”.
The capabilities demonstrated this week are truly astonishing, with 2024 already confirmed as the year in which AI shifts beyond intriguing science projects and into seriously useful tools. Enormous models underpin AI today, yet companies across the sector acknowledge the need to pivot quickly towards smaller, more efficient models to make AI-based tools more cost-effective and accessible to all.
In the future, the narrative around AI will focus less upon the models themselves and instead transition to what each AI agent can achieve. As such, Futuresource expects to see more development around AI for specific applications or industry verticals, such as education, healthcare and automotive. In fact, any organisation that regularly processes large data sets will be compelled to implement AI in its workflows, surfacing analysis and insights that would ordinarily be hidden or challenging to extract.
In the meantime, issues with AI bias and hallucinations remain unresolved. So, while generative AI is undoubtedly useful for creative processes and conversational activities, work remains to make knowledge-based outputs from AI more truthful and accurate. Only then can human trust in the results be improved.
In the short term, the demonstrations illustrate a revolution in virtual assistant capabilities. Gemini is clearly on a path to replace Google Assistant, initially on Android smartphones before being rolled out to other platforms. And it’s known that Amazon is already working on newer AI-based versions of Alexa. News reports appear to confirm a partnership between OpenAI and Apple for generative AI in iOS, hence Futuresource expects Apple to reveal a completely revised version of Siri during its Worldwide Developer Conference in June. And in achieving this, these companies have the ideal platform to showcase new conversational AI capabilities to millions of consumers worldwide.