ChatGPT and the oil spill

Johannes Stiehler
Random Rant
Cover Image for ChatGPT and the oil spill

"Data is the new oil," said British mathematician Clive Humby in 2006. He probably didn't realize how right he was and how far this analogy can be taken.

  • Data is like oil, without processing it is useless.
  • Data is like oil, essential processes are driven by it.
  • Data is like oil, it is - in a sense - bad for the environment.
  • Data is like oil, it is valuable, though available in relatively large quantities.

But one of the key differences is that, unlike oil, data doesn't deplete. You can use the same data sets over and over again in different contexts.

Classical machine learning has always been about collecting and, more importantly, preparing data first. Automatic translation, for example, hinges on the existence of so-called "alined texts", i.e. documents that are available in two different languages and can ideally be compared sentence by sentence. On this basis, a machine learning model can then be trained to spit out the (often) correct counterpart also for new sentences. In the contest of such systems, the decisive factor was often who had the cleaner and larger data sets and not who had the better algorithm or the better technology.

The technical approaches are often based on freely available research anyway; it's the data that makes the difference.

With deep learning, this hunger for data has become much bigger. While previous methods worked quite well with 100 positive examples per category (e.g. 100 financial news vs 100 political news to train a classifier), deep learning methods needed at least one order of magnitude more. But then suddenly something crazy like image search or image categorization started to work.

With Large Language Models (and other forms of generative AI) we have ignited the next level of data craving. All the talk here is really about billions and then some. Granted, this input data is just uncategorized "tokens" instead of manually presorted documents, but the amounts of data used are still gigantic.

A key difference, however, is that in the case of ChatGPT and other such models, someone else has collected all sorts of multi-colored data for me and mixed it into a mash - an oil spill, so to speak.

As a result, the model can answer almost any question without any further data input, but conversely cannot guarantee truly reliable, accurate, or even reproducible answers for any question. The big data mush spits out an answer to almost everything, but the thinner the data was in the training set, the more questionable the answer becomes.

Huh, but ChatGPT was trained on all the knowledge in the world, wasn't it?

Even if it were (both "total" and "knowledge" and "world" are probably not appropriate terms), it needs not just one document on each topic, but a great many to end up with factually correct answers. You can see this very nicely if you go a bit to the "periphery" of the world knowledge: If there were few documents on a topic in the training set and if one perhaps even surprises the system with aberrant query languages such as German, then hallucinations are never far away:

Amtsgericht, 1. attempt

No, that is not correct, Georg Lohmeier wrote this series and the cases are all set in 1911/12.

But maybe the hyphen was confusing, in the original it is written without it:

Amtsgericht, 2. attempt

No, that is equally wrong. Interestingly, the model remains in the correct "environment", both Ernst Maria Lang and Max Neal are humorists from Bavaria in the broadest sense. But can the hyphen cause such a reinterpretation?

Let's rather try this again with a literal repetition (in the same chat window):

Amtsgericht, 3. attempt

Ah, another Bavarian humorist, again the wrong one. For all mentioned gentlemen (including the real author) Wikipedia pages exist and also other information in the net. Nevertheless, ChatGPT-4 (in the very latest manifestation from Sep 12, 2023) happily fantasizes in different directions.

If we move a little bit more into the center of the world of ChatGPT, the answers become correct, consistent and above all repeatable again:

Oily data, 1. attempt

Every repetition of this question produces (for me, as of today) the same answer. And even the English version is consistent and virtually a 1:1 translation of the German one:

Oily data, 2. attempt

But does it matter? Is anyone really interested in the "Königlich Bayerisches Amtsgericht"?

Probably yes, but the more substantial problem is that countless start-ups are founded in extremely narrow niches with the claim to build a great "AI-powered" product for that niche, which in reality often means that these products send prompts to ChatGPT in a more or less imaginative way and then process its output.

The aforementioned example documents vividly what quality can be expected from such approaches once you leave the lush core of OpenAi training data.

Johannes Stiehler
CO-Founder NEOMO GmbH
Johannes has spent his entire professional career working on software solutions that process, enrich and surface textual information.

There's more where this came from!

Subscribe to our newsletter

If you want to disconnect from the Twitter madness and LinkedIn bubble but still want our content, we are honoured and we got you covered: Our Newsletter will keep you posted on all that is noteworthy.

Please use the form below to subscribe.

NEOMO is committed to protecting and respecting your privacy and will only use your personal data to manage your account and provide the information you request. In order to provide you with the requested content, we need to store and process your personal data. If you consent to us storing your personal data for this purpose, please check the box below.

Follow us for insights, updates and random rants!

Whenever new content is available or something noteworthy is happening in the industry, we've got you covered.

Follow us on LinkedIn and Twitter to get the news and on YouTube for moving pictures.

Sharing is caring

If you like what we have to contribute, please help us get the word out by activating your own network.

More blog posts


ChatGPT "knows" nothing

Language models are notoriously struggling to recall facts reliably. Unfortunately, they also almost never answer "I don't know". The burden of distinguishing between hallucination and truth is therefore entirely on the user. This effectively means that this user must verify the information from the language model - by simultaneously obtaining the fact they are looking for from another, reliable source. LLMs are therefore more than useless as knowledge repositories.


Rundify - read, understand, verify

Digital technology has overloaded people with information, but technology can also help them to turn this flood into a source of knowledge. Large language models can - if used correctly - be a building block for this. Our "rundify" tool shows what something like this could look like.


Meaningful autocomplete still missing

We take a look at major German media portals and how they deal with the issue of search and autocomplete. A spoiler in advance: The situation remains sad.