Foreword
2023 was a pivotal year for the advancement of AI technology, particularly in the realm of Generative AI, with the notable rise of ChatGPT (Chat Generative Pre-trained Transformer) and Large Language Models (LLMs). These models have demonstrated remarkable capabilities in understanding human language and making decisions that closely emulate human intelligence.
ChatGPT achieved an extraordinary milestone, amassing 1 million users within just five days. Since then, major technology companies have swiftly joined the competition, launching numerous LLMs, both open source and proprietary. Notable examples include LaMDA (Google AI), Megatron-Turing NLG (NVIDIA), PaLM (Google AI), Llama-2 (Meta AI), Bloom (Hugging Face), Wu Dao 2.0 (Beijing Academy of Artificial Intelligence), Jurassic-1 Jumbo (AI21 Labs), and Bard (Google AI).
In parallel with this competitive landscape, the business adoption of ChatGPT and LLMs is rapidly increasing. According to the Master of Code Global report, “Statistics of ChatGPT & Generative AI in Business: 2023 Report,” 49% of companies currently use ChatGPT, while 30% plan to implement it in the future. Additionally, a Forbes report indicates that 70% of organizations are exploring generative AI, including LLMs. This trend highlights the growing recognition of LLMs’ potential to transform businesses.
Dr. Dao Huu Hung, our Chief AI Scientist, provides insights into the future of AI and its profound impact on businesses and society.
Multimodal Generative AI
While ChatGPT and most other LLMs have excelled in understanding text-based human language, text represents only one type of data humans encounter daily. Multimodal data, encompassing images, audio, and video, is pervasive in the real world and presents substantial challenges for AI systems, including data heterogeneity, alignment, fusion, representation, model complexity, computational costs, and evaluation metrics. Therefore, the AI community often addresses unimodal data challenges before tackling the complexities of multimodal data.
Inspired by the success of LLMs, the AI community has been developing Large Multimodal Models (LMMs) capable of achieving similar levels of generality and expressiveness across various data types. LMMs leverage vast amounts of multimodal data to perform diverse tasks with minimal supervision. These models can handle tasks involving text, images, audio, and videos, such as image captioning, visual question answering, and editing images using natural language commands.
OpenAI has led the way in developing GPT-4V, an upgraded multimodal version of GPT-4 that can process both text and image inputs. GPT-4V excels in tasks such as answering questions about images, describing and interpreting visual scenes, and reasoning over combined text-and-image inputs.
LLaVA-1.5 is another model that processes both text and images, performing tasks like image captioning, visual question answering, and following natural language instructions grounded in visual content. Meanwhile, Adept is aiming to create an AI model that interacts with all software on a computer, turning plain language goals into actions.
The race among Big Tech companies to develop LMMs is accelerating, though it may take several years for LMMs to reach the current levels of LLMs.
Generating vs. Leveraging Large Foundation Models
Creating AI applications for diverse tasks has become significantly easier and more efficient. In the past, developing a sentiment analysis application could take months, but with LLMs, such applications can now be developed in a matter of days by simply formulating a prompt to evaluate text.
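As an illustration of how little code such a prompt-based application needs, the sketch below wraps input text in a zero-shot classification prompt and parses the model's one-word reply. The `call_llm` parameter is a hypothetical stand-in for whatever chat-completion API an enterprise uses; the stub at the bottom exists only so the example runs end to end.

```python
# Minimal sketch of LLM-based sentiment analysis via prompting.
# `call_llm` is a hypothetical stand-in for any chat-completion API.

def build_sentiment_prompt(text: str) -> str:
    """Wrap the input text in a zero-shot classification prompt."""
    return (
        "Classify the sentiment of the following review as "
        "Positive, Negative, or Neutral. Reply with one word.\n\n"
        f"Review: {text}\nSentiment:"
    )

def parse_sentiment(response: str) -> str:
    """Normalize the model's reply to one of the three expected labels."""
    label = response.strip().split()[0].lower().rstrip(".")
    return label if label in {"positive", "negative", "neutral"} else "neutral"

def analyze_sentiment(text: str, call_llm) -> str:
    """Build the prompt, query the model, and normalize the answer."""
    return parse_sentiment(call_llm(build_sentiment_prompt(text)))

if __name__ == "__main__":
    fake_llm = lambda prompt: "Positive."  # stub standing in for a real model
    print(analyze_sentiment("Great battery life and a sharp screen!", fake_llm))
```

The entire "application" is a prompt template plus a response parser, which is precisely why development time drops from months to days.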
In computer vision, visual prompting techniques introduced by Landing AI leverage Large Vision Models (LVMs) to address various tasks like object detection, recognition, and semantic segmentation. Visual prompting uses visual cues to reprogram pretrained LVMs for new tasks, reducing the need for extensive data labeling and model training.
Generating pre-trained Large Foundation Models (LFMs), including LLMs and LVMs, requires significant AI expertise and infrastructure investment. The race to create LFMs among Big Tech companies will continue into 2024 and beyond, with both proprietary and open-source models offering diverse options for enterprises. SMEs and AI startups will primarily focus on creating LFM applications, driving commercialization.
Agent Concept in Generative AI
The agent concept represents a new trend in Generative AI with the potential to revolutionize human-computer interactions. Agents are software modules that act autonomously or semi-autonomously on a user's behalf, orchestrating language models, tools, and workflow steps to achieve specific goals. They can automate many tasks currently performed by humans, allowing people to focus on more strategic and creative activities.
Here are some trends related to the agent concept in Generative AI:
- Increased Use of Agents to Automate Tasks: As Generative AI advances, agents will increasingly automate tasks, such as creating and deploying AI models.
- Accessibility: Agents will make Generative AI more user-friendly and accessible, enabling a wider range of users to leverage this technology.
- New Tools and Platforms: The development of new agent-based Generative AI tools and platforms will simplify the creation and deployment of AI applications.
Examples of agent-based Generative AI tools include Auto-GPT and BabyAGI, while platforms like Google’s AI Platform and AWS’s SageMaker facilitate the deployment and management of these applications. Agent-based Generative AI applications are already being used to create new products and services, automate tasks, and enhance accessibility.
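At its core, a tool like Auto-GPT runs a plan-act-observe loop: a model proposes the next action, a tool executes it, and the observation feeds back into the next planning step. The sketch below shows that loop in miniature; the names (`run_agent`, `TOOLS`, `scripted_plan`) are illustrative and not the API of any real framework, and a scripted planner stands in for the LLM so the example is self-contained.

```python
# Hedged sketch of an agent loop in the Auto-GPT/BabyAGI style.
# A planner picks one action per step; tools execute it; the observation
# is appended to history and informs the next planning call.

TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def run_agent(goal, plan, max_steps=5):
    """plan(goal, history) returns ("tool_name", args) or ("finish", answer)."""
    history = []
    for _ in range(max_steps):
        action, args = plan(goal, history)
        if action == "finish":
            return args
        observation = TOOLS[action](*args)
        history.append((action, args, observation))
    return None  # step budget exhausted without finishing

# A scripted planner standing in for an LLM: add two numbers, then report.
def scripted_plan(goal, history):
    if not history:
        return ("add", (2, 3))
    return ("finish", f"Result: {history[-1][2]}")

print(run_agent("add 2 and 3", scripted_plan))  # prints "Result: 5"
```

In a production agent, `scripted_plan` would be replaced by an LLM call whose prompt includes the goal and the accumulated history, and `TOOLS` would expose search, file, or API operations.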
AI at the Edge
‘At-the-edge’ AI involves deploying AI models on devices such as laptops, smartphones, cameras, drones, robots, and sensors. This approach is gaining momentum due to its benefits in speed, privacy, security, and energy efficiency. It reduces reliance on cloud servers by bringing AI processing closer to the data source.
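One reason edge deployment is feasible at all is model compression, most commonly post-training quantization, which edge toolchains apply to shrink models for on-device inference. The pure-Python sketch below illustrates the core idea of symmetric int8 quantization under a single per-tensor scale; it is for intuition only and is not how any specific vendor toolkit is invoked.

```python
# Illustrative sketch of symmetric int8 post-training quantization,
# the kind of compression applied when shrinking models for edge devices.

def quantize(weights, num_bits=8):
    """Map float weights to signed integers with one shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer representation."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.05, 0.4]
q, scale = quantize(weights)
approx = dequantize(q, scale)
# Each reconstructed weight lies within one quantization step of the original.
assert all(abs(a - w) <= scale for a, w in zip(approx, weights))
```

Storing 8-bit integers instead of 32-bit floats cuts model size roughly 4x and lets integer-optimized NPUs and DSPs on edge chips run inference faster and at lower power.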
NVIDIA has been a pioneer in edge AI with its Jetson platform, leveraging its early investment in high-performance GPU technology. NVIDIA’s comprehensive software ecosystems and tools, such as TensorRT and Deepstream, support efficient AI model development and acceleration. Despite higher GPU costs, NVIDIA remains a mainstream choice in the AI community.
Several competitors offer cheaper and faster alternatives to Jetson. These include Google’s Edge TPU, Intel’s Movidius Myriad X, Xilinx’s Zynq UltraScale+ MPSoC, NXP’s i.MX 8M Plus, and Qualcomm’s Snapdragon 865, each pairing dedicated hardware with a software ecosystem to run AI models efficiently.
Apple has also entered this field with its M1 and M2 chips, featuring powerful Neural Engines ideal for AI tasks like image recognition and natural language processing. Apple’s A16 Bionic chip in the iPhone 14 series and the A17 Pro chip in the iPhone 15 Pro further enhance AI performance with low power consumption.
Qualcomm’s Snapdragon 8 Gen 3, arriving in devices in early 2024, promises significant advancements in AI processing speed and efficiency. The competition in edge AI devices is expected to intensify in 2024 and beyond.