Google Gemini is a family of new AI models from Google. Despite Google being a leader in AI research for almost a decade and developing the transformer architecture—one of the key technologies in large language models (LLMs)—OpenAI and its GPT models are dominating the conversation. 

Gemini Nano, Gemini Pro, and Gemini Ultra are Google’s attempts to play catch-up. All three versions are multimodal, which means that in addition to text, they can understand and work with images, audio, videos, and code. Let’s dig a little deeper and see if Google can get back in the AI game. 

What is Google Gemini?

Google Gemini is a family of AI models, like OpenAI’s GPT. The major difference: while Gemini can understand and generate text like other LLMs, it can also natively understand, operate on, and combine other kinds of information like images, audio, videos, and code. For example, you can give it a prompt like “What’s going on in this picture?” and attach an image, and it will describe the image and respond to further prompts asking for more complex information. 
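
To make that concrete, here’s a minimal sketch of sending that kind of mixed text-and-image prompt through Google’s google-generativeai Python SDK. The API key, image file, and the gemini-pro-vision model name are illustrative placeholders rather than details from this article.

```python
# Minimal multimodal prompt sketch using the google-generativeai SDK
# (pip install google-generativeai pillow). Key, image, and model name
# are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # hypothetical API key

# A vision-capable Gemini model accepts a list mixing text and images
model = genai.GenerativeModel("gemini-pro-vision")

image = Image.open("vacation-photo.jpg")  # any local image file
response = model.generate_content(["What's going on in this picture?", image])
print(response.text)

# Follow-up prompts can ask for more complex information about the same image
follow_up = model.generate_content(
    ["How many people are in this picture, and what are they wearing?", image]
)
print(follow_up.text)
```

Because generate_content is stateless, the follow-up prompt passes the image in again; a chat session is another way to keep that context around.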

Because we’ve now entered the corporate competition era of AI, most companies are keeping pretty quiet on the specifics of how their models work and differ. Still, Google has confirmed that the Gemini models use a transformer architecture and rely on strategies like pretraining and fine-tuning, much as other LLMs like GPT-4 do. The main difference between Gemini and a typical LLM is that it’s trained on images, audio, and videos at the same time it’s being trained on text; its multimodal abilities aren’t the result of a separate model bolted on at the end. 
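
As a rough, conceptual illustration of what training on everything at once means (a toy sketch, not Gemini’s actual tokenizer or architecture), you can picture every modality being turned into tokens and interleaved into a single sequence that one transformer learns from:

```python
# Toy illustration only: not Gemini's real tokenizer or architecture.
# The point is that image and text tokens end up in ONE training sequence
# for ONE model, rather than being handled by a separate vision model.
from typing import List

def text_tokens(text: str) -> List[str]:
    return [f"txt:{word}" for word in text.split()]

def image_tokens(image_id: str, num_patches: int = 4) -> List[str]:
    # stand-in for the patch embeddings an image would be split into
    return [f"img:{image_id}:patch{i}" for i in range(num_patches)]

def training_sequence(caption: str, image_id: str) -> List[str]:
    # one interleaved sequence -> the model learns both modalities together
    return image_tokens(image_id) + text_tokens(caption)

print(training_sequence("monkeys in suits throwing poo", "photo_001"))
```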

In theory, this should mean it understands things in a more intuitive manner. Take a phrase like “monkey business”: if an AI is just trained on images tagged “monkey” and “business,” it’s likely to just think of monkeys in suits when asked to draw something related to it. On the other hand, if the AI for understanding images and the AI for understanding language are trained at the same time, the entire model should have a deeper understanding of the mischievous and deceitful connotations of the phrase. It’s ok for the monkeys to be wearing suits—but they’d better be throwing poo. 

While this all makes Google Gemini more interesting, it doesn’t make it unique: GPT-4 Vision (GPT-4V) is a similar multimodal model from OpenAI that adds image processing to GPT-4’s LLM capabilities. (Although it did fail my “monkey business” test.)

Google Gemini comes in three sizes

Gemini is designed to run on almost any device. Google claims that its three versions—Gemini Ultra, Gemini Pro, and Gemini Nano—are capable of running efficiently on everything from data centers to smartphones. 

  • Gemini Ultra is the largest model designed for the most complex tasks. In LLM benchmarks like MMLU, Big-Bench Hard, and HumanEval, it outperformed GPT-4, and in multimodal benchmarks like MMMU, VQAv2, and MathVista, it outperformed GPT-4V. It’s still undergoing testing and is due to be released next year. 
  • Gemini Pro offers a balance between scalability and performance. It’s designed to be used for a variety of different tasks. Right now, a specially trained version of it is used by the Google Gemini chatbot (formerly called Bard) to handle more complex queries. In independent testing, Gemini Pro was found to achieve “accuracy that is close but slightly inferior to the corresponding GPT 3.5 Turbo” model. (A quick way to check which Gemini models your own API key can reach is sketched just after this list.)
  • Gemini Nano is designed to operate locally on smartphones and other mobile devices. In theory, this would allow your smartphone to respond to simple prompts and do things like summarize text far faster than if it had to connect to an external server.
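
If you want to see which of these models you can actually use, the google-generativeai SDK can list what’s available to your account. This is a hedged sketch; model names and availability vary by account, region, and SDK version.

```python
# List the Gemini models your API key can access via the
# google-generativeai SDK. Names shown in the comment are examples only.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical API key

for m in genai.list_models():
    # only models that support text generation are useful for prompting
    if "generateContent" in m.supported_generation_methods:
        print(m.name)  # e.g. models/gemini-pro, models/gemini-pro-vision
```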

How does Google Gemini work?

According to Google, before Gemini, most multimodal AI models were developed by combining multiple separately trained AI models. The text and image processing, for example, would be trained separately and then combined into a single model that could approximate the features of a true multimodal model.

Gemini, by contrast, was trained on text, images, audio, and video together from the start. That native multimodality allows the Gemini models to respond to prompts with both text and generatively created images, much like ChatGPT can do using a combination of DALL·E and GPT.

Aside from Gemini’s greater capacity to understand different kinds of input, text generation works much the same as it does with any other AI model: its neural network tries to generate plausible follow-on text to any given prompt based on the training data it’s seen in the past. The version of Gemini Pro fine-tuned for the Gemini chatbot, for example, is designed to interact like a chatbot, while the version of Gemini Nano embedded in the Pixel 8 Pro’s Recorder app is designed to create text summaries from the automatically generated transcripts.
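
Here’s a hedged sketch of that transcript-summarization pattern using the google-generativeai SDK. (The Recorder app actually runs Gemini Nano on-device; the model name, transcript, and prompt below are placeholders for illustration.)

```python
# Sketch of prompting a Gemini model to summarize a transcript via the
# google-generativeai SDK. The on-device Recorder feature uses Gemini Nano
# instead; the model name and transcript here are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical API key
model = genai.GenerativeModel("gemini-pro")

transcript = """Speaker 1: Let's push the launch to Thursday.
Speaker 2: Fine, but marketing needs the final assets by Tuesday."""

# The model only ever generates plausible follow-on text; framing the
# prompt as a summarization task steers that generation toward a summary.
response = model.generate_content(
    "Summarize this meeting transcript in two bullet points:\n" + transcript
)
print(response.text)
```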

What are Gemini’s limitations?

Training data: Like all AI chatbots, Gemini must learn to give correct answers. To do that, the models need to be trained on information that’s accurate and not misleading; errors or gaps in the training data carry through to its answers.

Originality and creativity: There are limits on how original and creative Gemini’s output can be. This is particularly true of the free version, which has had trouble processing complicated prompts with multiple steps and nuances and producing adequate output.

Bias and potential harm: AI training is an ongoing, compute-intensive process because there’s always new information to learn. Across all Gemini models, Google claims to have followed responsible development practices, including extensive evaluation to help limit the risk of bias and potential harm.
