Google Gemini is a family of new AI models from Google. Despite Google being a leader in AI research for almost a decade and developing the transformer architecture—one of the key technologies in large language models (LLMs)—OpenAI and its GPT models are dominating the conversation. 

Gemini Nano, Gemini Pro, and Gemini Ultra are Google’s attempts to play catch-up. All three versions are multimodal, which means that in addition to text, they can understand and work with images, audio, videos, and code. Let’s dig a little deeper and see if Google can get back in the AI game. 

What is Google Gemini?

Google Gemini is a family of AI models, like OpenAI’s GPT. The major difference: while Gemini can understand and generate text like other LLMs, it can also natively understand, operate on, and combine other kinds of information like images, audio, videos, and code. For example, you can give it a prompt like “What’s going on in this picture?” and attach an image, and it will describe the image and respond to further prompts asking for more complex information. 
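
To make that concrete, here’s a minimal sketch of sending that kind of mixed text-and-image prompt through Google’s google-generativeai Python SDK. The API key, image file, and the gemini-pro-vision model name are illustrative placeholders rather than details from this article.

```python
# Minimal multimodal prompt sketch using the google-generativeai SDK
# (pip install google-generativeai pillow). Key, image, and model name
# are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # hypothetical API key

# A vision-capable Gemini model accepts a list mixing text and images
model = genai.GenerativeModel("gemini-pro-vision")

image = Image.open("vacation-photo.jpg")  # any local image file
response = model.generate_content(["What's going on in this picture?", image])
print(response.text)

# Follow-up prompts can ask for more complex information about the same image
follow_up = model.generate_content(
    ["How many people are in this picture, and what are they wearing?", image]
)
print(follow_up.text)
```

Because generate_content is stateless, the follow-up prompt passes the image in again; a chat session is another way to keep that context around.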

Because we’ve now entered the corporate competition era of AI, most companies are keeping pretty quiet on the specifics of how their models work and differ. Still, Google has confirmed that the Gemini models use a transformer architecture and rely on strategies like pretraining and fine-tuning, much as other LLMs like GPT-4 do. The main difference between Gemini and a typical LLM is that it’s trained on images, audio, and videos at the same time it’s being trained on text; its multimodal abilities aren’t the result of a separate model bolted on at the end. 
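
As a rough, conceptual illustration of what training on everything at once means (a toy sketch, not Gemini’s actual tokenizer or architecture), you can picture every modality being turned into tokens and interleaved into a single sequence that one transformer learns from:

```python
# Toy illustration only: not Gemini's real tokenizer or architecture.
# The point is that image and text tokens end up in ONE training sequence
# for ONE model, rather than being handled by a separate vision model.
from typing import List

def text_tokens(text: str) -> List[str]:
    return [f"txt:{word}" for word in text.split()]

def image_tokens(image_id: str, num_patches: int = 4) -> List[str]:
    # stand-in for the patch embeddings an image would be split into
    return [f"img:{image_id}:patch{i}" for i in range(num_patches)]

def training_sequence(caption: str, image_id: str) -> List[str]:
    # one interleaved sequence -> the model learns both modalities together
    return image_tokens(image_id) + text_tokens(caption)

print(training_sequence("monkeys in suits throwing poo", "photo_001"))
```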

In theory, this should mean it understands things in a more intuitive manner. Take a phrase like “monkey business”: if an AI is just trained on images tagged “monkey” and “business,” it’s likely to just think of monkeys in suits when asked to draw something related to it. On the other hand, if the AI for understanding images and the AI for understanding language are trained at the same time, the entire model should have a deeper understanding of the mischievous and deceitful connotations of the phrase. It’s ok for the monkeys to be wearing suits—but they’d better be throwing poo. 

While this all makes Google Gemini more interesting, it doesn’t make it unique: GPT-4 Vision (GPT-4V) is a similar multimodal model from OpenAI that adds image processing to GPT-4’s LLM capabilities. (Although it did fail my “monkey business” test.)

Google Gemini comes in three sizes

Gemini is designed to run on almost any device. Google claims that its three versions—Gemini Ultra, Gemini Pro, and Gemini Nano—are capable of running efficiently on everything from data centers to smartphones. 

  • Gemini Ultra is the largest model designed for the most complex tasks. In LLM benchmarks like MMLU, Big-Bench Hard, and HumanEval, it outperformed GPT-4, and in multimodal benchmarks like MMMU, VQAv2, and MathVista, it outperformed GPT-4V. It’s still undergoing testing and is due to be released next year. 
  • Gemini Pro offers a balance between scalability and performance. It’s designed to be used for a variety of different tasks. Right now, a specially trained version of it is used by the Google Gemini chatbot (formerly called Bard) to handle more complex queries. In independent testing, Gemini Pro was found to achieve “accuracy that is close but slightly inferior to the corresponding GPT 3.5 Turbo” model. (A quick way to check which Gemini models your own API key can reach is sketched just after this list.)
  • Gemini Nano is designed to operate locally on smartphones and other mobile devices. In theory, this would allow your smartphone to respond to simple prompts and do things like summarize text far faster than if it had to connect to an external server.
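
If you want to see which of these models you can actually use, the google-generativeai SDK can list what’s available to your account. This is a hedged sketch; model names and availability vary by account, region, and SDK version.

```python
# List the Gemini models your API key can access via the
# google-generativeai SDK. Names shown in the comment are examples only.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical API key

for m in genai.list_models():
    # only models that support text generation are useful for prompting
    if "generateContent" in m.supported_generation_methods:
        print(m.name)  # e.g. models/gemini-pro, models/gemini-pro-vision
```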

How does Google Gemini work?

According to Google, before Gemini, most multimodal AI models were developed by combining multiple separately trained AI models. The text and image processing, for example, would be trained separately and then combined into a single model that could approximate the features of a true multimodal model.

Gemini, by contrast, was trained on text, images, audio, and video together from the start. That native multimodality allows the Gemini models to respond to prompts with both text and generatively created images, much like ChatGPT can do using a combination of DALL·E and GPT.

Aside from Gemini’s greater capacity to understand different kinds of input, text generation works much the same as it does with any other AI model: its neural network tries to generate plausible follow-on text to any given prompt based on the training data it’s seen in the past. The version of Gemini Pro fine-tuned for the Gemini chatbot, for example, is designed to interact like a chatbot, while the version of Gemini Nano embedded in the Pixel 8 Pro’s Recorder app is designed to create text summaries from the automatically generated transcripts.
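
Here’s a hedged sketch of that transcript-summarization pattern using the google-generativeai SDK. (The Recorder app actually runs Gemini Nano on-device; the model name, transcript, and prompt below are placeholders for illustration.)

```python
# Sketch of prompting a Gemini model to summarize a transcript via the
# google-generativeai SDK. The on-device Recorder feature uses Gemini Nano
# instead; the model name and transcript here are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical API key
model = genai.GenerativeModel("gemini-pro")

transcript = """Speaker 1: Let's push the launch to Thursday.
Speaker 2: Fine, but marketing needs the final assets by Tuesday."""

# The model only ever generates plausible follow-on text; framing the
# prompt as a summarization task steers that generation toward a summary.
response = model.generate_content(
    "Summarize this meeting transcript in two bullet points:\n" + transcript
)
print(response.text)
```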

What are Gemini’s limitations?

Training data: Like all AI chatbots, Gemini must learn to give correct answers. To do that, the models need to be trained on information that’s accurate and not misleading; errors or gaps in the training data carry through to its answers.

Originality and creativity: There are limits on how original and creative Gemini’s output can be. This is particularly true of the free version, which has had trouble processing complicated prompts with multiple steps and nuances and producing adequate output.

Bias and potential harm: AI training is an ongoing, compute-intensive process because there’s always new information to learn. Across all Gemini models, Google claims to have followed responsible development practices, including extensive evaluation to help limit the risk of bias and potential harm.
