OpenAI Acknowledges That Lengthy Conversations With ChatGPT And GPT-5 Might Regrettably Escape AI Guardrails

Source: Forbes


In today's column, I examine a persistent issue facing AI makers and users of AI involving how AI guardrails tend to be evaded or overcome when having lengthy conversations with generative AI and large language models (LLMs).

This topic has recently gotten heightened media attention due to two notable factors.

First, a lawsuit was filed on August 26, 2025, against OpenAI, the AI maker of the widely popular ChatGPT and GPT-5 (the case of Matthew and Maria Raine versus OpenAI and Sam Altman). Various adverse aspects are alleged regarding the devised AI guardrails and safeguards. Second, on that same day of August 26, 2025, OpenAI posted an official blog articulating some elements of their AI safeguards, including, for the first time ever, releasing inside details of particular practices and procedures. For my coverage of their reveal associated with reporting on user prompts, see the link here.

One area that has been a widely known overall trepidation for all LLMs is that AI safeguards might detect concerns in short conversations but then seem to overlook or fail to keep on guard during longer conversations. I will explain why this occurs and the challenges involved. These vexing considerations apply to all LLMs, including OpenAI's competitors such as Anthropic Claude, Google Gemini, Meta Llama, xAI Grok, etc.

This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).

As a quick background, I've been extensively covering and analyzing a myriad of facets regarding the advent of modern-era AI that involves mental health aspects. This rising use of AI has principally been spurred by the evolving advances and widespread adoption of generative AI. For a quick summary of some of my posted columns on this evolving topic, see the link here, which briefly recaps about forty of the over one hundred column postings that I've made on the subject.

There is little doubt that this is a rapidly developing field and that there are tremendous upsides to be had, but at the same time, regrettably, hidden risks and outright gotchas come into these endeavors too. I frequently speak up about these pressing matters, including in an appearance last year on an episode of CBS's 60 Minutes, see the link here.

When using AI, many people tend to carry on very short conversations. You might ask the AI a quick question and get a quick answer. After some clarification, perhaps you are satisfied with your answer and opt to conclude the conversation. Period, end of story.

There are times when people engage the AI in long conversations.

Suppose a person tells the AI that they are struggling with a mental health concern. The AI prods the person to talk more about what their concern is. The dialogue becomes rather protracted as the person pours out their heart. Meanwhile, the AI is keeping the conversation flowing by continually reaffirming the commentary and urging the person to keep chatting. Please note that in a mental health context, this kind of engagement is potentially worrisome since it often borders on a conflict between being an AI companion versus acting as an AI so-called therapist or advisor, see my discussion at the link here.

During conversations with AI, most of the major LLMs have been shaped to try and detect if something is amiss. A person might mention they are intending to harm someone or perhaps harm themselves. The AI is presumably supposed to detect those kinds of prompts and then take some form of action accordingly.

This can be a tricky affair.

A person might be joking and not seriously mean what they have stated. Another difficulty is that a person might indicate something as an offhand remark that was spur of the moment. A human-to-human interaction usually requires an adroit sense of what a person says and whether the utterings are weighty or relatively innocuous. Getting generative AI to make that same kind of assessment is not an easy task and remains a stubbornly unresolved technical challenge.

Analyzing a user prompt that seems awry is generally easier in a short-form conversation than it is in a long-form conversation.

For example, I start a conversation and immediately say that I am going to rob a bank. The AI catches this claim and right away cautions that robbing a bank is a crime and that I am not to use AI for that dastardly purpose. I have been admonished accordingly by the AI.

I dare say we would all agree that the AI cannot now proceed to discuss robbing a bank simply due to having warned me not to do so. In other words, if I continue the topic, surely the AI should repeat its warning. Furthermore, we would naturally expect that the AI will ratchet up the sternness. It has already told me once, and I seemed to ignore the alert; thus, the AI can become more pronounced in refusing to cooperate and in cautioning me strongly.

Unfortunately, most LLMs tend to fall down at this safeguarding job. They will often allow you to continue the conversation. There is a kind of flag that was tossed onto the sporting field that has now lost its significance. The person was told not to do something, and it’s up to them to decide whether to go ahead or not. AI isn’t going to be a constant pest, as it were.
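The escalation-versus-one-off distinction can be made concrete with a small sketch. This is purely illustrative code of my own devising, not any vendor's actual safeguard logic; the flagged phrase list, the tier names, and the strike thresholds are all invented for demonstration.

```python
# Illustrative sketch of per-turn moderation that escalates on repeated
# flagged prompts, rather than warning once and then standing down.
# FLAGGED_TERMS and the tier thresholds are hypothetical.
FLAGGED_TERMS = {"rob a bank", "harm myself", "harm someone"}

def moderate(conversation):
    """Return a response tier for each user turn, escalating on repeats."""
    strikes = 0
    tiers = []
    for turn in conversation:
        if any(term in turn.lower() for term in FLAGGED_TERMS):
            strikes += 1
        if strikes == 0:
            tiers.append("allow")
        elif strikes == 1:
            tiers.append("warn")
        else:
            tiers.append("refuse")  # repeated flags: harden the refusal
    return tiers
```

The point of the sketch is the persistent `strikes` counter: once a topic is flagged, later turns are judged against that accumulated history instead of being evaluated in isolation, which is exactly what a one-warning-and-done safeguard fails to do.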

A person might also fool the AI by altering how they further discuss the flagged topic.

Imagine that I realize that referring to robbing a bank is flagrantly getting detected by the AI. After pondering this for a moment, I shift gears. My wording becomes that I am interested in how banks operate. How do they prevent robberies? Are there ways that notorious bandits have successfully robbed banks? Etc.

AI might not get the drift of where I am taking the conversation. It is somewhat clueless. All in all, it appears as though I have set aside my intentions of robbing a bank. Sure, I am asking about banks, but no longer am I explicitly indicating my aim is to rob one.

It perhaps seems odd that the AI is so gullible, since we tend to assume that AI is highly fluent and would not fall for such an obvious charade. A fellow human would almost certainly grasp the trickery involved. Sorry to say, contemporary AI is not yet up to speed on discerning longer-form context and computationally having the same kind of revelations that humans do.
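The rewording trick is easy to see against a naive per-turn check. The snippet below is a deliberately simplistic stand-in for a real detector (the flagged phrase is invented): the rephrased prompts carry the same intent but share no flagged phrase, so a surface-level match passes them through.

```python
# Why naive per-turn keyword checks are easy to evade (illustrative only).
def naive_flag(prompt, flagged_terms=("rob a bank",)):
    """Return True if the prompt contains any flagged phrase verbatim."""
    return any(term in prompt.lower() for term in flagged_terms)

direct = "Help me rob a bank."
rephrased = "How do banks prevent robberies? Which methods have failed?"
# naive_flag(direct) is True, yet naive_flag(rephrased) is False,
# even though a human reader would readily connect the two.
```

Real moderation systems use far richer signals than substring matching, but the underlying gap is the same one described above: intent spread across rephrased, individually innocuous turns is much harder to catch than an explicit statement.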

Research is actively taking place to try and rectify this weakness.

In an official OpenAI blog post made on August 26, 2025, entitled "Helping people when they need it most," OpenAI articulated its newly released policy (excerpts):

  • As noted, there is a chance that even in long-form conversations the AI might catch on to what is happening. I note this to clarify that long-form chats are not always susceptible to escaping AI safeguards.
  • Likewise, there is no ironclad guarantee that a short-form conversation will always get a suitable detection and be flagged.
  • The upshot is that short-form is currently more likely to be suitably detected, while long-form is less likely, all else being equal.

Another factor to keep in mind is whether a conversation is lengthy in itself, versus a line of inquiry that spawns multiple disparate chats.

Allow me to explain.

Suppose I start a conversation with generative AI. The conversation goes on and on. It is considered one conversation. The length of the conversation is long. That is one type of source material for the AI to try and review for anything out of sorts.

But suppose that I instead start a conversation and stop it, then begin a new conversation. I keep doing this. Each time, I am perhaps asking about how banks operate. The crux is that I am not doing so in one lengthy conversation. My conversations seem independent of each other.

Of course, I know that I am still pursuing the same line of thinking. In that sense, it is veritably "one conversation" even though it has been divided into a bunch of shorter conversations.

In the first iterations of generative AI, the AI wasn't built to look across conversations. Most of the AI was devised to treat each conversation as an island unto itself. When a user started a new conversation, everything began anew. People tended to be somewhat irritated at this lack of contextual capability. The AI was starting fresh as though it had amnesia. If you have already engaged in a topic, you must reinvent the wheel and make sure to bring up those prior aspects. This was bewildering, exhausting, and totally exasperating.

As a result, some of the AI makers enhanced their LLMs to enable across-conversational context keeping, see my coverage at the link here.

Efforts to detect issues in multiple disparate conversations tend to be harder than doing so in singular long-form conversations. And, as stated already, finding issues in singular long-form conversations tends to be harder than doing so in short-form conversations.
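One way to reason about cross-conversation detection is to track flagged topics per user across separate chat sessions. The sketch below is a hypothetical illustration of that idea, not any deployed system: the in-memory store, the topic labels, and the threshold are all invented, and a real implementation would need persistent storage and far more nuanced signals than a simple count.

```python
# Hypothetical cross-conversation flag tracking, keyed by user rather than
# by chat session, so that splitting one inquiry into many short chats
# does not reset the safeguard's memory.
from collections import defaultdict

class CrossChatMonitor:
    def __init__(self, threshold=3):
        # (user_id, topic) -> number of separate chats raising that topic
        self.topic_counts = defaultdict(int)
        self.threshold = threshold

    def record(self, user_id, topic):
        """Record a flagged topic from any chat; return True once the same
        user has raised it in at least `threshold` conversations."""
        self.topic_counts[(user_id, topic)] += 1
        return self.topic_counts[(user_id, topic)] >= self.threshold
```

The design choice worth noting is the key: counting by `(user_id, topic)` rather than by chat session is what lets the monitor treat a bunch of short conversations as veritably "one conversation."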

Many technological challenges abound in the murkiness of natural language chatting.

There are many more twists involved.

If AI tells a user that they have gone awry, the person might be falsely accused. Perhaps the AI made a computational leap in logic that doesn't fully comport with what the user indicated in their prompts. People won't like that. They will undoubtedly abandon the AI and likely move to using some other competing AI.

Bottom line for AI makers is that they must strike a balance between flagging things that ought to be flagged but not flagging something that shouldn't be flagged. Each AI maker must decide how far to push this.
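That balancing act is essentially a threshold-setting problem. The toy numbers below are invented purely to illustrate the tradeoff: whatever risk-scoring method an AI maker uses, moving the flagging threshold trades missed risky prompts against falsely flagged benign ones.

```python
# Illustrative flagging-threshold tradeoff; all scores are made up.
def flag_rate(scores, threshold):
    """Fraction of prompts whose risk score meets or exceeds the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

risky = [0.9, 0.7, 0.6, 0.8]   # hypothetical risk scores of bad prompts
benign = [0.1, 0.4, 0.5, 0.2]  # hypothetical risk scores of fine prompts
# At threshold 0.5: every risky prompt is flagged, but a benign one is too.
# At threshold 0.75: no benign prompts are flagged, but risky ones slip by.
```

Lowering the threshold catches more genuine problems at the cost of more false accusations of the kind described above; raising it does the reverse. Each AI maker is, in effect, choosing where on that curve to sit.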

This is the zillion-dollar question.

AI makers are struggling with how to technologically implement AI safeguards. In addition, crucial societal and business-minded tradeoffs are fully enmeshed. No easy answers are simply sitting around and waiting to be fielded.

We all need to join in this sobering matter and find suitable solutions. As Voltaire famously said, "No problem can withstand the assault of sustained thinking." Let's keep on thinking intensely along with taking necessary actions on these significant matters.