Seeing through the Hype: How AI Is Shaping Web 4.0
The Coming Emergence of Conversational AI, Web 4.0, and the Hype around Them
TL;DR
We’ve had smartphones for over 15 years, and most daily activities now involve user interfaces that were unimagined when the iPhone first launched. These developments form the context for recent media interest in ChatGPT and renewed investment in AI research by tech firms and venture capitalists alike, which in turn raises two questions: (1) what will be the next iteration of the web, beyond the currently imagined Web 3.0? and (2) what technologies, particularly within the realm of AI, will be necessary to build that next iteration?
“Web 4.0” will be built at the intersection of evolving AI technologies and Web 3.0, where internet, mobile, visual, and audio interfaces combine to produce a single, seamless user experience: conversational AI. Conversational AI makes for a meaningful next iteration of the web because, whether spoken or typed, conversation reduces friction better than any other interface. AI and the web both function to facilitate the interaction of humans with their environment, and the more intuitive and effortless technology becomes, the more alluring it is. Converted into products, primarily software with some supporting hardware, such advantages command market power and thus attract investment capital. Market forces will push tech firms toward developing conversational AI.
I’ll first make a business case for conversational AI and Web 4.0 by briefly summarizing the market advantages of conversational AI. Next, I will discuss conversational AI in some depth, to better understand how it might evolve over the next ten years. Finally, I will outline areas of potential demand and investment, a task complicated by the fact that the more complex a technology becomes, the harder it is to identify the early key technologies on which later developments will depend.
1. Identifying Opportunities: Why Understanding Conversational AI is Important
To enable users to harness AI without conscious effort, conversational AI will require a complex network of technologies, all orchestrated out of sight. The current generation of user technologies, especially the smartphone, concentrates transaction costs in the user interface. Consumers, both young and old, increasingly want to avoid endless swiping, scrolling, and media selection. Current solutions tend toward changing the “human” side of that interface, for example by urging habit change or the curating of data feeds, while some users argue for technology holidays. Technology will never remove the need for humans to adapt themselves to an interface, but the success of the iPhone and Facebook demonstrates that consumers prefer technologies that minimize such adaptation: the more a new technology is perceived to “adapt” to its user, instead of the other way around, the more desirable it becomes. Likewise, the more security and availability a technology demonstrates over time, the more sustainable its market success becomes.
Conversational AI represents the logical next step in this evolution, as it targets a state in which users converse with a technology on their own terms, and, through burgeoning forms of generative AI, the interface adapts seamlessly to their preferences. Such capabilities offer many overall benefits, including:
Reducing friction in marginal or discretionary transactions, producing an advantage in all markets involving consumer spending (which drives roughly 70% of U.S. GDP);
Increasing innovation by removing control over technological changes from coders, enabling everyone to be an innovator by interacting verbally with software; and
Driving overall productivity growth by enabling workers across all industries to better control and improve operations and streamline processes.
Of course, the complexity under the hood needed to enable these changes makes picking winning technologies harder. Oftentimes future technological change depends on some key early process or adoption. Railways would never have been possible without the steam engine, which was originally invented to power cotton spinners (not locomotives!). The steam engine, in turn, was a necessary “keystone” predecessor to the internal combustion engine and the automobile. The analogous keystone step in today’s tech lifecycle includes the seamless integration and abstraction of user interfaces, the democratization of coding, and natural language processing.
Code-suggestion and debugging technologies like GitHub Copilot, GPT, and possibly Bard are already tackling the latter two of these key early processes. Companies like Figma and Bubble are likewise introducing interesting apps on the no-code/low-code application design and development front. And at the infrastructure level, AWS (for instance) provides a crucial layer that many modern businesses rely on for cloud computing power, database storage, and content delivery; many new startups in the Web 4.0 and AI space will use it for much of their backend. But the integration and abstraction of user interfaces, from the modern graphical UI to the futuristic conversational UI, is the necessary next step to accelerate the adoption of conversational AI and the next iteration of the web more generally. This has yet to occur, but as more founders and venture capitalists pour into the generative AI space, a small number of survivors are bound to emerge from the forming bubble.
2. Understanding Conversational AI by Predicting its Development
What exactly is conversational AI, and how does it work? According to Robb Wilson, author of Age of Invisible Machines (2023), it is the combination of three key technologies: conversational user interfaces, composable architecture, and no-code (“code-free”) rapid application development. Instead of focusing on using software, code-free tools let users create software atop a composable architecture. That architecture draws on natural language processing and understanding, large language models, probabilistic reasoning, reinforcement learning, and nearly every other modern AI and ML technique in between. Let’s explore these concepts, first by imagining the future, and then by walking through the most important of them in more detail.
A. Let’s Imagine the Future
The sci-fi film Her depicts a future conversational AI in action. “Samantha” is the identity given to the artificially intelligent virtual assistant of the film’s protagonist, Theodore, and “she” becomes his gateway to the other new technologies he encounters. Theodore rapidly adapts to Samantha because he converses with her everywhere he goes, and Samantha “sees” what he experiences via the camera on his cellphone. He speaks to her using an earpiece, and when she wants to show him something, she sends an image to any nearby device. Through Samantha, Theodore easily interacts with other technologies; she is his personalized interface.
We are already growing closer to this imagined future. Let’s say you FaceTime your friend, Charlie, who tells you about a new jacket they just purchased on an e-commerce website. They cannot remember the website’s name, but they recall wearing the jacket at a concert last weekend and sharing pictures on Instagram. Wanting to buy the same jacket, you have the following conversation with your virtual assistant: “Hey, Sam, I want to buy the same jean jacket that Charlie and I were just talking about over FaceTime. I think they were wearing it at the concert on Saturday. If you don’t know the one I’m talking about, search through our photos together in the cloud from this past weekend. I don’t remember the brand, but when you find it, search for it in size Large first on Levi’s before looking at Amazon. If you can’t find the exact same jacket there, search for the cheapest options available online.” Shortly thereafter, Sam sends you a text message with a link showing the same jean jacket on levi.com in size Large. When you reply, “Yes. Please buy,” Sam places the order. Similar interactions would empower employees in their interactions with workplace technologies, across all industries.
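To make the moving parts concrete, here is a minimal, hypothetical Python sketch of the tool-routing logic an assistant like Sam might run behind the scenes. Every function here (search_photos, identify_product, search_store) is an invented stand-in for a real photo, vision, or retail integration, not any vendor’s actual API.

```python
# Hypothetical orchestration sketch for the jacket scenario above. The stub
# functions stand in for real integrations; they are illustrative only.

def search_photos(people, date_range):
    # Stub: would query a cloud photo library for shared pictures.
    return ["photo_of_charlie_in_jacket.jpg"]

def identify_product(photos, category):
    # Stub: would run a vision model over the photos to identify the item.
    return {"category": category, "brand": "unknown", "color": "denim blue"}

def search_store(store, product, size):
    # Stub: would hit the retailer's search. Here, only levi.com "has stock."
    if store == "levi.com":
        return {"store": store, "size": size, "price": 89.50,
                "url": f"https://{store}/jean-jacket"}
    return None

def find_jacket(preferred_stores=("levi.com", "amazon.com"), size="L"):
    photos = search_photos(["me", "Charlie"], "last weekend")
    product = identify_product(photos, "jean jacket")
    for store in preferred_stores:  # honor the user's stated retailer order
        match = search_store(store, product, size)
        if match:
            return match
    return None  # a real assistant would fall back to a web-wide price search

print(find_jacket())
```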
Such changes are only possible through an umbrella conversational interface that becomes the user’s primary point of interaction with other technologies. Once embedded, users will no longer think of technology in terms of different apps, because they will rarely need to open and interact with separate applications. Critically, because human conversation extends well beyond the spoken or written word to include pacing, gesture, and tone, as well as metaphor and other visual or aural aids, a text conversation with a conversational AI interface might involve the AI providing a video or drawing a graph to illustrate a point. In this way conversational AI becomes multimodal, allowing interaction across talk, text, images, and video.
B. The Technologies Needed for that Imagined Future
Moving towards this future state is impossible when users must dig through nested tabs or apps—as many leading tech companies have discovered. When Salesforce acquired Slack, for example, its CEO admitted the company needed Slack to integrate divergent internal technologies. Microsoft is taking a similar approach with Teams, betting that an integrated communication platform and unifying conversational interface—one app that connects to everything—will benefit both its own employees and its customers. ChatGPT maintains this focus by keeping users in a single interface, removing the need to enter other applications. Alexa, Siri, Google Home, and other technologies seek to provide this connectivity for users in their homes and on their smartphones, albeit at a very basic level.
None of these technologies has fully solved for ease of user interface, the protection of user privacy, or control over data. Nonetheless, as natural conversation becomes the objective for the interface between users and machines, the machine becomes less visible (and less intrusive) as the transaction costs of using the interface diminish.
This line of thinking should be familiar to UX designers. One of the hallmarks of successful UX design is an interface that gets out of the way. The further the interface recedes into the background during an experience, the more frictionless that experience becomes, lightening the user’s cognitive load and helping them get what they need from the technology more effectively. The goal of conversational AI is thus to remove the visible user interface and replace it with a seamless, personalized UX. No more swiping; just more integration, hyper-automation, and connectivity.
C. Current Technologies Essential to Conversational AI: AI-driven User Interfaces and Coding
Conversational AI and the first iterations of applications in what can be called Web 4.0 are already appearing today. For example, Ex-Human is building an interactive platform that allows anyone to create fully customized “Digital Humans” from scratch, with customizable personalities and engaging, emotionally driven chat features. ChatGPT works by using machine learning to analyze and understand the meaning of text input and then generating a response based on that input; its model is trained on large amounts of text data, allowing it to learn the patterns and structures of natural language.
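To make that request-response loop concrete, here is a minimal sketch using OpenAI’s official Python client (version 1 or later). The model name and prompts are illustrative choices, and running it requires an API key.

```python
# Minimal sketch: sending text to a GPT-style chat model and reading the reply.
# Assumes the official openai Python package (v1+) and an OPENAI_API_KEY
# environment variable; the model name below is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "In one sentence, what is conversational AI?"},
    ],
)

print(response.choices[0].message.content)
```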
One of the greatest use cases of ChatGPT, though, has been its ability to produce and debug code. Microsoft, for example, is already using GPT-3’s natural language model in its no-code/low-code Power Apps service to translate spoken text into code in its recently announced Power Fx language. In partnership with OpenAI, Microsoft also recently announced an integration with ChatGPT as a “copilot for the web”: the new version of Bing shows search results both as links and as an AI-generated summary in a sidebar. Additionally, the new Bing can cite its sources and uses a new OpenAI model custom-designed for search. In other words, GPT is now connected to the internet.
Microsoft is using this technology to improve its Microsoft Edge browser, recently announcing that it is building an AI virtual assistant into the browser that can read any web page, even a PDF, and combine information from sources across the internet. For example, the new Edge can summarize a product brochure and then, on request, lay out pros and cons versus another brand in a table. Microsoft wants AI to play an everyday role in the UX of the internet, not just in a search engine.
Google, for its part, recently announced its own ChatGPT rival, a conversational AI agent called Bard, built on LaMDA (Language Model for Dialogue Applications), alongside Imagen, an image-generating AI. According to the Financial Times, Google pumped roughly $300 million into OpenAI competitor Anthropic for a 10% stake in late 2022, in a deal that reportedly requires Anthropic to use Google Cloud to train its models. Anthropic’s general-purpose chatbot Claude looks similar to ChatGPT but employs a model trained with “Constitutional AI,” an approach that supposedly enhances safety and transparency.
Microsoft and Google are not alone in adopting GPT-like technology to their advantage: GitHub Copilot uses OpenAI Codex to turn natural language prompts into code suggestions, spanning dozens of languages and completing entire functions in real time, right from one’s editor. In short, one can write a comment describing the desired logic, and GitHub Copilot will immediately suggest code to implement the solution.
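To illustrate that comment-to-code workflow, in the sketch below a developer types only the comment, and a Copilot-style assistant proposes the function that follows. The example is a plausible illustration, not captured Copilot output.

```python
# A developer writes only the comment below; a Copilot-style assistant
# proposes the implementation. (Illustrative, not actual Copilot output.)

# return True if the string is a palindrome, ignoring case and punctuation
def is_palindrome(text: str) -> bool:
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]

print(is_palindrome("A man, a plan, a canal: Panama"))  # True
```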
While these changes are already widely known, what has not been as well publicized is OpenAI’s ongoing effort to make GPT multimodal. Consider this quote from Sam Altman, co-founder of OpenAI, on what to expect soon from GPT:
I think we’ll get multimodal models [soon], and that’ll open up new things. I think people are doing amazing work with agents that can use computers to do things for you, use programs and this idea of a language interface where you say a natural language – what you want in this kind of dialogue back and forth. You can iterate and refine it, and the computer just does it for you. You see some of this with DALL-E and CoPilot in very early ways. I think this is going to be a massive trend, and very large businesses will get built with this as the interface, and more generally [I think] that these very powerful models will be one of the genuine new technological platforms, which we haven’t really had since mobile. And there’s always an explosion of new companies right after, so that’ll be cool.
Altman has yet to specify whether GPT-4 will be multimodal, but clearly the trend now extends beyond generative AI alone; entire companies are springing up around this multimodal theme, hinting at the serious disruption already underway.
Toucan, for example, is looking to bridge the gap into the multimodal AI world. Although the startup has a team of human translators, it also relies on machine learning and natural language processing to understand the context of each word and make sure it is translated properly. According to its Co-Founder and CEO, Taylor Nieman, the company also takes an intelligent, personalized approach to the translations that appear over time, allowing them to become more complex to keep challenging users. The ultimate vision of Toucan, though, is to be “layered wherever you are.” “We want to be this augmented layer of learning on the web, on mobile browsing, in the most popular social apps and even in the physical world,” Nieman said, predicting that in the future, you might be “wearing a crazy cool contact lens that can translate a sign on the subway and provide you with those same micro-moments of learning.”
On the design front, “trained on thousands of top user experience designs” and leveraging large language models, Galileo AI turns simple natural-language prompts into graphical UI designs at “lightning speed,” editable in Figma. Galileo can then populate a design with “carefully curated AI-generated illustrations and images” to match one’s design needs. Though not yet conversational in nature, Galileo seems one step away from conversational capabilities that could allow any individual to become a UI/UX designer and developer, without the typical prerequisite knowledge of computer programming.
Another key company in this AI space is Attention. The startup’s co-founders faced similar pain points while working in sales: making sure Salesforce was consistently and accurately updated, and ramping up new sales reps quickly. To address these issues, Attention takes unstructured data from sales calls (whether on Zoom, Meet, Microsoft Teams, or dialing software) and uses generative AI to turn it into a utility for the sales team. The app drafts follow-up emails based on call data and the sales rep’s instructions, and can also present the rep with battlecards that help them handle sales conversations in real time. For example, if a prospect asks how the product stacks up against the competition, a battlecard can appear with talking points. This gives Attention the multimodal feel beginning to emerge across the tech industry and highlights a couple of key trends.
First, the company shows how NLU and generation will allow humans to interact with computing without changing their behavior: instead of adapting themselves to a computing interface, users can communicate naturally and have computing understand them. This should grow the entire addressable market for computing.
Second, as with NLU, there is a growing proliferation of AI generation tooling, which in turn offers lower-level hooks to attract developers. Once that happens, startups like Attention can build competitive moats by customizing these models for their particular use cases and training them on proprietary datasets. For example, as Attention establishes itself as a leader and accumulates the most usage, it should be able to write the best follow-up emails from a sales call.
Not surprisingly, Attention has competitors, including Gong and Chorus, both of which analyze customer conversations. According to one of the co-founders, Anis Bennaceur, Attention’s advantage is its ability to flexibly understand conversations, display real-time prompts during calls and provide A/B testing for its coaching. “We haven’t seen any of these players flexibly export conversations into CRMs, and this is a strong edge that we currently have,” he said.
D. Current Technologies Essential to Conversational AI: Infrastructure
Ongoing related innovations are not limited to user interfaces or the largest tech companies. At the infrastructure layer beneath some of the new technologies mentioned above sits Hugging Face. This startup provides an open-source library for NLP technologies called Transformers (which can be found on GitHub). Transformers enables users to leverage popular NLP models (e.g., BERT, GPT, XLNet, T5, or DistilBERT) to manipulate text: classifying it, extracting information from it, automatically answering questions, summarizing, generating text, and more. Monzo, for example, uses Hugging Face to answer customer inquiries via a chatbot; it is one of approximately 5,000 companies using Hugging Face to improve business operations and scale via AI/NLP, including Microsoft with its search engine Bing. In terms of its business model, the startup recently launched a way to get prioritized support, manage private models, and host the inference application programming interface (API), with new clients including Bloomberg and Typeform.
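For a sense of how accessible this library is, here is a minimal sketch using the Transformers pipeline API; the model checkpoints named below are public defaults chosen for illustration, not the configurations any particular company runs.

```python
# Minimal sketch of the Hugging Face Transformers "pipeline" API described
# above. Requires: pip install transformers torch. Checkpoints are illustrative.
from transformers import pipeline

# Text classification with a DistilBERT checkpoint
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("The new interface is effortless to use."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# Summarization with a T5 checkpoint
summarizer = pipeline("summarization", model="t5-small")
print(summarizer("Conversational AI combines conversational user interfaces, "
                 "composable architecture, and no-code rapid application "
                 "development into a single seamless user experience.",
                 max_length=30, min_length=5))
```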
Developers think of APIs as the future of technology integration, but APIs might die away (at least conceptually) within a few years: once conversational AI technologies start communicating with one another using natural language, hand-coding the exchange of information will no longer be a requirement. Such infrastructure developments facilitate communication between and among machines without human intervention, easing (for example) inter-vehicle communications to prevent collisions.
According to Robb Wilson, the same way that conversational interfaces will replace Graphical User Interfaces (GUIs) for end-users, they will also replace APIs as an interface between machines. Not only will this paradigm make it easier for technologies to share information, it will make it far simpler for users to supervise how the technologies are sharing information, presenting a massive change to the ways software integrations are designed and maintained.
Robb Wilson’s own company, OneReach.ai, has created a number of practical examples of conversational AI bots that connect to multiple other bots and external integrations behind the scenes to provide a seamless user experience. In one example, a user requests a password reset. The virtual assistant uses NLU from both LUIS and ServiceNow to give the user the best response to their question, tells the user which engine the answer came from, and initiates the password reset by creating a ticket in ServiceNow to get the user the help they need. Other examples of conversational AI bots created by OneReach.ai include roadside assistance and order management systems, fraud detection, medical assistance (i.e., collecting symptoms and helping users get medical help), HR technologies, and bots that have reduced calls to customer service centers by more than 40%.
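Here is a simplified, hypothetical sketch of that multi-engine pattern: an orchestrator asks two NLU engines to interpret an utterance, keeps the higher-confidence reading, reports which engine won, and opens a ticket. The stub functions merely stand in for LUIS and ServiceNow integrations; they are not those products’ actual APIs.

```python
# Hypothetical sketch of multi-engine NLU routing, loosely modeled on the
# password-reset example above. The stubs are illustrative; they are not
# the real LUIS or ServiceNow client libraries.

def luis_nlu(utterance: str) -> dict:
    # Stub: a real integration would call the LUIS REST API.
    return {"engine": "LUIS", "intent": "reset_password", "confidence": 0.92}

def servicenow_nlu(utterance: str) -> dict:
    # Stub: a real integration would call ServiceNow's NLU service.
    return {"engine": "ServiceNow", "intent": "reset_password", "confidence": 0.85}

def create_ticket(intent: str) -> str:
    # Stub: a real integration would open an incident via ServiceNow's API.
    return f"INC0012345 ({intent})"

def handle(utterance: str) -> str:
    # Ask both engines, keep the higher-confidence interpretation
    best = max((luis_nlu(utterance), servicenow_nlu(utterance)),
               key=lambda r: r["confidence"])
    ticket = create_ticket(best["intent"])
    return (f"Routed by {best['engine']} "
            f"(confidence {best['confidence']:.0%}). Ticket {ticket} created.")

print(handle("I forgot my password, can you reset it?"))
```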
Another company to watch is Brain Technologies. This Series B startup recently announced, along with $50 million in fresh funding, the release of Natural, an iOS app that founder and CEO Jerry Yue claims will be the world’s first “generative computer interface.” Natural is intended as a natural language search engine and “super app”: based on user requests or commands, it provides a solution that might take the form of links or actions inside other existing apps. For example, the statement “I’d like sushi tonight” produces options for ordering sushi (including past favorite dishes) from a selection of food-ordering apps, restaurants at which to eat sushi, and recipes for making sushi (including options for buying ingredients online). Similarly, travel searches return results that dip into multiple silos from airlines and airline aggregators, enabling rapid purchases using existing payment details and removing the multi-site navigation steps that Google (for example) currently requires. Natural’s business model anticipates that the more one uses the app, the more it learns one’s preferences and questions, as “a tool that learns to use other tools.”
3. Web 4.0: Anticipating Areas of Potential Demand and Investment
What can we take away from Natural, OneReach.ai, and the rest? We are in the midst of the development of the key early processes and technology adoptions on which the future of technological change in artificial intelligence and the web depends. The infrastructure layer is being laid down, and the democratization of AI tools and the integration of user interfaces and backends are opening new, fillable and/or disruptable gaps in diverse markets. A burgeoning number of founders and venture capitalists are moving in to try to fill those gaps. Though a bubble is forming and will eventually burst, a few key players will inevitably survive. Regardless of whether the companies mentioned in this article emerge successfully, the way we interact with AI and the web will fundamentally change over the next few decades, perhaps within the rest of the 2020s alone.
We are transitioning from using AI as a mere substitute within current workflows and tasks to using AI as “a tool that learns to use other tools.” While voice-based and other personal assistant apps (which use natural language and hefty AI engines in the backend to source information, do your e-commerce bidding, or control one electronic device or another in your home) have been around for years, they have too often come up short on user experience, failing to nail the right solutions to users’ queries. In other words, we have yet to achieve the hyper-automation portrayed in movies like Her or Iron Man.
So, ironically, the end result of a successful AI like this is not to make us feel more technologically powerful, but to get us away from our devices and the time spent on them, and back into the world. In this way, conversational AI is all about abstracting the GUIs we have come to know so well and interact with so regularly, to the point that users have no awareness of what goes on behind the scenes; we will simply be conversing with a single Conversational User Interface (a CUI, if you will).
Given that trend, the next iteration of the web needs focus and investment so that it can support these powerful emergent technologies with simplified design and deployment, elevating those who can benefit most from their use. Done right, we won’t just be jumping on the generative AI hype train and letting ethics and user privacy slip our minds; instead, we will be turning anyone with a problem to solve into a software designer, with full integration throughout the suite of devices and tools at their disposal, all initiated by simple voice commands.
I can only see this as possible through open-source software. Collaboration is key, and building off of each other in an open ecosystem is likely the only way to make the whole ecosystem fully integrated, conversational, hyper-automated, and capable of building whatever solution the user’s problem at hand requires. For example, if I ask my device to help me crop a photo, I just want quick access to the best cropping tool available; I don’t care whether it is part of Photoshop’s suite of tools. This could make Photoshop’s GUI rather meaningless, since I simply want to reach photo-cropping functionality easily via a CUI. An open ecosystem orchestrating modular and multimodal technologies has no real use for licensed bundles of technology; open ecosystems will value flexibility and frictionless software deployment over all else.
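As a toy illustration of that idea, the sketch below shows a CUI resolving a requested capability to the best available tool regardless of vendor. The registry, tool names, and quality scores are all invented for illustration.

```python
# Toy sketch: a CUI resolves a user's request ("crop a photo") to the best
# registered tool for that capability. Tool names and quality scores are
# invented; a real registry would be community-maintained and far larger.

TOOL_REGISTRY = {
    "crop_photo": [
        {"name": "photoshop.crop", "quality": 0.90, "open_source": False},
        {"name": "gimp.crop",      "quality": 0.88, "open_source": True},
    ],
}

def resolve(capability: str, prefer_open: bool = True) -> str:
    tools = TOOL_REGISTRY.get(capability, [])
    if not tools:
        raise LookupError(f"No tool registered for '{capability}'")
    # Rank by quality, breaking near-ties in favor of open-source tools
    def rank(tool):
        return (round(tool["quality"], 1),
                tool["open_source"] if prefer_open else 0)
    return max(tools, key=rank)["name"]

print(resolve("crop_photo"))  # -> gimp.crop (open source wins the near-tie)
```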
If Web 3.0 is about owning, in addition to the reading and writing capabilities instilled by Web 1.0 and 2.0, respectively, Web 4.0 will be about reading, writing, owning, and creating. In Web 4.0, anyone will have the capability and tools at their disposal to design software and build a world around them in which they want to live, all within a single, seamless user experience.
References
To access the original written document for this thesis, use this link: https://docs.google.com/document/d/1IqmQ9N50ybCOaKlBrh4xyV4iQ8OgPPDDq-5R3iUxX4s/edit?usp=sharing
All references are linked or provided in-text.