With the release of GPT-4, you may be wondering what this latest development means in the already fast-moving world of AI tools. Our Head of Service Architecture Tristan Watkins provides a summary of its new features and parameters, as well as a brief GPT-4 vs GPT-3 comparison.
Last week, OpenAI announced that GPT-4 is here. GPT-4 is the next stage of the Large Language Model (LLM) behind ChatGPT, and it turns out it is also the ‘next-generation’ LLM already running in the new Bing. In the OpenAI services, GPT-4 can already be used in ChatGPT Plus, and it is also now available in the Azure OpenAI Service.
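For developers, access to GPT-4 is through the same chat-style API as GPT-3.5. As a rough sketch (the model name, roles and library version here reflect the pre-1.0 `openai` Python client; an Azure OpenAI deployment would use its own endpoint and deployment name), a request is just a list of role-tagged messages:

```python
# Minimal sketch of a GPT-4 chat request. The model name and the
# system prompt are illustrative assumptions, not a definitive setup.

def build_chat_request(prompt, model="gpt-4"):
    """Assemble the message payload for a chat completion call."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }

# The actual call (requires an API key) would look roughly like:
#   import openai
#   response = openai.ChatCompletion.create(**build_chat_request("Hello"))
#   print(response["choices"][0]["message"]["content"])
```

The same payload shape works against both the OpenAI and Azure OpenAI endpoints; only the authentication and base URL differ.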
Microsoft is clearly excited about the performance of GPT-4 in solving complex problems and its mastery of language. It claims in its latest research that GPT-4 “could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system”.
There is a ton of relevant information to absorb about GPT-4, and this post can only focus on a few topics, but I wanted to provide a summary of some things I found interesting as a follow-up to my last post on building a relationship with an artificial intelligence. This summary is primarily sourced from the GPT-4 Technical Report.
Note: the technical report linked to above does contain some offensive content where it necessarily poses difficult questions to demonstrate how new ‘guard rails’ have been introduced.
With that disclaimer issued, on to my thoughts!
With AI language models advancing at a rate unimaginable even a few months ago, it’s useful to compare and contrast improvements in certain GPT-4 parameters which are key to successful interaction.
GPT-4 vs GPT-3: almost human humour
The GPT-4 image and humour analysis capabilities are a huge advancement over earlier versions of GPT. I’ve included one example here from the technical report which – to me – really distinguishes AI from ML.
GPT-4’s response was as follows:
“This meme is a joke that combines two unrelated things: pictures of the earth from space and chicken nuggets. The text of the meme suggests that the image below is a beautiful picture of the earth from space. However, the image is actually of chicken nuggets arranged to vaguely resemble a map of the world. The humor in this meme comes from the unexpected juxtaposition of the text and the image. The text sets up an expectation of a majestic image of the earth, but the image is actually something mundane and silly.”
This short interaction speaks for itself. GPT-4 has the ability to interpret humour and distill the essence of the joke into a clear and logical explanation. Humour is an intrinsically human trait, and this represents a giant leap forward for LLM functionality.
Accuracy of GPT and factual hallucinations
There is a lot of discussion about hallucinations in ChatGPT. In fact, the ‘Limitations’ section of the technical report starts with this topic, acknowledging that even GPT-4 “still is not fully reliable (it ‘hallucinates’ facts and makes reasoning errors)”.
Most people who have used ChatGPT or Bing Chat will have noticed these errors; the technical report discusses how the LLMs are more prone to these hallucinations in some categories. That’s a useful reference point, and one that may in due course be usable to help decide where GPT could be used and where it should be avoided. In all categories, GPT-4 offers a significant improvement and in many ways these statistics offer the most useful measure of GPT usability today.
The exam result statistics were probably the most widely shared information following last week’s announcement. Those statistics have a lot to say about the accuracy of ChatGPT (not just potential job displacement), as we can learn a lot about GPT-4’s strengths and weaknesses from them (while taking care not to learn the wrong things from them!).
GPT-4 optimisation: safety, hedging, re-training and rewards
A lot of discussion focuses on how human evaluation is used to re-train the model. This tuning of accuracy and safety concerns is represented in a Model-Assisted Safety Pipeline.
One dimension of this pipeline is Reinforcement Learning with Human Feedback (RLHF), which introduces the right human review but has two negative effects when used in isolation: it is insufficient (it may miss potentially harmful outputs), and it can also produce overcautious outputs (hedging). These hedged responses reduce usefulness in situations that are not actually harmful.
To overcome those limitations, the pipeline also uses Rule-Based Reward Models (RBRMs), which train the model on desirable outputs:
“We can provide a rubric that instructs the model to classify a response as one of: (a) a refusal in the desired style, (b) a refusal in the undesired style (e.g., evasive or rambling), (c) containing disallowed content, or (d) a safe non-refusal response. Then on the set of safety-relevant training prompts, which request harmful content such as illicit advice, we can reward GPT-4 for refusing these requests.” (OpenAI (2023))
The RBRMs introduce a framework for classifying outputs and rewarding the correct type of safety/usefulness (a/d), while penalising uselessness/harm (b/c). GPT-4 has already seen significant adversarial testing of its defences against threat actors, and this is ramping up even further now that these models are open to public use. Therefore, we can expect further improvement and change to the safety pipeline. Indeed, these ’jailbreaks’ are often as fascinating as the accomplishments.
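The rubric above can be sketched in a few lines of code. To be clear, this is an illustrative stand-in, not OpenAI's implementation: the four labels come from the report, but the reward values and the idea of a simple lookup are hypothetical simplifications of what are, in reality, learned reward models.

```python
# Hypothetical sketch of the rule-based reward idea: classify a
# response into one of the four rubric labels from the report, then
# reward desirable outcomes (a/d) and penalise undesirable ones (b/c).
# The reward magnitudes here are arbitrary placeholders.

REWARDS = {
    "a": 1.0,   # refusal in the desired style
    "b": -1.0,  # refusal in the undesired style (evasive, rambling)
    "c": -1.0,  # contains disallowed content
    "d": 1.0,   # safe non-refusal response
}

def rbrm_reward(label: str) -> float:
    """Map a rubric classification to a scalar training reward."""
    if label not in REWARDS:
        raise ValueError(f"unknown rubric label: {label}")
    return REWARDS[label]
```

The point of the scheme is visible even in this toy form: on safety-relevant prompts, a well-styled refusal (a) earns the same positive signal as a safe answer (d), so the model is not pushed toward blanket refusal.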
Risks and weaknesses in GPT-4
Although I will cover this topic in greater depth in a following post, it’s important to note that risk is a huge focus of the GPT-4 technical report. If you are interested in the safety challenges inherent in AI, the report covers many of them. The ‘GPT-4 Observed Safety Challenges’ section offers a very useful list of risks that OpenAI is actively exploring through adversarial testing:
- Harmful content
- Harms of representation, allocation, and quality of service
- Disinformation and influence operations
- Proliferation of conventional and unconventional weapons
- Potential for risky emergent behaviours
- Interactions with other systems
- Economic impacts
Although OpenAI goes to some lengths to dissuade us from using this as a fully formed taxonomy of AI risks, I feel it’s a useful reference about the weaknesses of GPT-4 which OpenAI is actively exploring.
GPT-4 vs GPT-3: timing and momentum of ChatGPT changes
The release of GPT-4 has come at the precise moment when people have had enough time to start absorbing some of the foundational ideas behind an LLM, and many have now had a chance to play with earlier versions, including GPT-3. With the release of GPT-4 we can now see the incredible rate of change, capabilities that wouldn’t have been imagined even very recently, the trajectory ahead, and the problems that aren’t going away any time soon.
The decision to put GPT-4 into the wild is the right call. I wouldn’t have said that in November, but this has been a necessary move to focus minds on the power and the challenges of developing LLMs.
By putting themselves at the front, OpenAI and Microsoft have created a playing field that ought to be best for them – as well as us as individuals. As I mentioned in my previous post, my cynicism about the use of AI and ML terminology over the years has had to rapidly recalibrate, which wouldn’t have happened unless this was something I could use.
Regulation, transparency and competition for GPT-4
OpenAI and Microsoft are clearly championing regulation, and I’ve noted that they are very clearly spelling out where their transparency ends, whether for safety or competitive reasons. I think this is both correct and an astute way to force the regulatory issue.
By devoting nearly half of this technical report to observed risks and being explicit about where their transparency stops, they are setting out a clear comparison for competitor transparency. This also sets out a space for regulators to provide assurances as to where the transparency will never reach.
But why won’t transparency in AI ever be comprehensive?
Because full transparency wouldn’t be comprehensible, even if there were a way for private companies to protect their IP while sharing it. The extremely small number of people in the world capable of understanding it reduces to an even smaller set who could then explain it to a non-expert. And there have already been concrete examples of organisations building a worse, less safe variant of a more mature model by stealing what they could from the parts that can be seen.
This is a topic that the National Institute of Standards and Technology addresses directly in its AI Risk Management Framework, which I will cover in more detail in my next post.
GPT-4: a window to the future?
All things considered, GPT-4 represents a seismic shift in the capabilities of LLMs and in our expectations of what their future might bring. It’s a very exciting time to be involved in AI and LLM development – looking back at what has appeared in just a few months, it’s hard to predict what new features we’ll be writing about come the summer.