The proposed watermarking strategy lowers the probabilities of specific words in LLM output, essentially creating a “to avoid” list. If a text contains these low-probability words, it is likely human-written, because the watermarked LLM is steered away from those terms.
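A minimal sketch of this idea, assuming the model exposes raw logits (the avoid list, the penalty value, and the toy logits below are all hypothetical, not part of any published scheme):

```python
import math

AVOID = {"delve", "tapestry"}   # hypothetical "to avoid" words
PENALTY = 5.0                   # illustrative logit penalty

def biased_next_token_probs(logits: dict) -> dict:
    # Subtract a fixed penalty from the logits of avoid-list words,
    # then renormalize with a softmax.
    adjusted = {t: (v - PENALTY if t in AVOID else v) for t, v in logits.items()}
    z = sum(math.exp(v) for v in adjusted.values())
    return {t: math.exp(v) / z for t, v in adjusted.items()}
```

With equal logits, an avoid-list word ends up far less likely than a neutral one, which is exactly the statistical footprint a detector would look for.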
Post-hoc watermark
Post-hoc watermarks insert a hidden message into LLM-generated text for later verification. To check the watermark, we extract this hidden message from the text in question. These watermarking methods mainly fall into two categories: rule-based and neural-based.
Inference time watermark
This method modifies the selection of words during the decoding phase. The model creates a probability distribution for the next word, and the watermark is embedded by biasing this selection process. Specifically, a hash code generated from the previous token splits the vocabulary into “green list” and “red list” words, and the next token is chosen from the green list.
In the illustration below, a random seed is generated by hashing the previously predicted token “a”, dividing the entire vocabulary into “green list” and “red list”. The next “all-in” token is chosen from the green list.
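The green/red split described above can be sketched as follows (the toy vocabulary, SHA-256 seeding, and 50/50 split ratio are my assumptions, not the exact published scheme):

```python
import hashlib
import random

VOCAB = ["all-in", "bet", "fold", "call", "raise", "check"]  # toy vocabulary

def green_list(prev_token: str, gamma: float = 0.5) -> set:
    # Hash the previous token into a deterministic seed, shuffle the
    # vocabulary with it, and keep the first gamma fraction as "green".
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    shuffled = sorted(VOCAB)
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * gamma)])

def green_fraction(tokens: list) -> float:
    # Detection side: what fraction of tokens fall in the green list
    # seeded by their predecessor? A high fraction suggests a watermark.
    hits = sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```

Because the split depends only on the previous token, a verifier can recompute each green list without access to the model's weights.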
Implementing this approach assumes that users are working with a modified model, which is unlikely. Additionally, if the avoid list is public, one could manually add these words to the AI-generated text to evade detection.
The core of this approach is to create a binary classifier. The task is similar to traditional machine learning, where the accuracy of the model depends on the variety of data and the quality of the feature set.
I’ll delve deeper into several interesting implementations.
DetectGPT calculates the log-probabilities of tokens in a text, considering the conditional probabilities on previous tokens. By multiplying these conditional probabilities, we obtain the joint probability of the text.
The method then perturbs the text and compares the probabilities. If the probability of the perturbed text is significantly lower than that of the original, the original text was probably generated by an AI. If the probabilities are similar, the text is likely human-written.
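The comparison can be sketched with a toy unigram model standing in for the LLM (the vocabulary and probabilities below are invented; DetectGPT itself uses the conditional probabilities of a real model and model-generated perturbations):

```python
import math

# Toy unigram "language model": token -> probability (values invented).
TOY_LM = {"the": 0.2, "cat": 0.05, "sat": 0.04, "on": 0.1, "zebra": 0.001}

def log_prob(tokens: list) -> float:
    # Joint log-probability of the text = sum of per-token log-probabilities.
    return sum(math.log(TOY_LM.get(t, 1e-6)) for t in tokens)

def detectgpt_score(original: list, perturbations: list) -> float:
    # DetectGPT criterion: how far the original's log-probability sits
    # above the average log-probability of its perturbed variants.
    # A clearly positive score suggests AI-generated text.
    avg = sum(log_prob(p) for p in perturbations) / len(perturbations)
    return log_prob(original) - avg
```

A score near zero means the perturbations are about as likely as the original, which is the signature of human-written text.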
GPTZero is a linear regression model designed to estimate text perplexity. Perplexity is linked to the log-probability of the text, much like in DetectGPT. It is calculated as the exponent of the negative average log-probability of the text. Lower perplexity indicates less random, more predictable text. LLMs aim to reduce perplexity by maximizing text probability.
Unlike previous methods, GPTZero does not label text as AI-generated or not. It provides a perplexity score for comparative analysis.
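The formula itself is simple. Given per-token probabilities (the inputs below are invented for illustration), perplexity is the exponent of the negative average log-probability:

```python
import math

def perplexity(token_probs: list) -> float:
    # PPL = exp(-(1/N) * sum(log p_i))
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

A sequence where every token has probability 0.5 gives a perplexity of exactly 2; less predictable tokens push the score higher, which is why human text tends to score above LLM text.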
ZipPy uses the Lempel–Ziv–Markov chain algorithm (LZMA) compression ratio to measure the novelty/perplexity of input samples against a small corpus (< 100 KB) of AI-generated text.
In the past, compression was used as a simple anomaly detection system. By running the network event logs through a compression algorithm and evaluating the resulting compression ratio, it is possible to assess the level of novelty or anomaly of the entry. A significant change in input will result in a lower compression ratio, serving as an alert for anomalies. Conversely, recurring background event traffic, already accounted for in the dictionary, will produce a higher compression ratio.
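Python's standard `lzma` module is enough to sketch the idea (the corpus string below is a stand-in for ZipPy's AI-text corpus, and the scoring formula is my simplification):

```python
import lzma

def novelty_score(corpus: str, sample: str) -> float:
    # Extra compressed bytes per input byte that the sample adds on top
    # of the corpus: familiar text reuses the compressor's dictionary
    # and costs little, novel text costs more.
    base = len(lzma.compress(corpus.encode()))
    combined = len(lzma.compress((corpus + sample).encode()))
    return (combined - base) / len(sample.encode())
```

Text that resembles the AI-generated corpus compresses almost for free, so a low score would flag the sample as likely AI-generated.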
4. OpenAI (no longer available)
On January 31, OpenAI launched a new tool to detect AI-generated text. According to the site’s description, the model is a fine-tuned GPT variant that performs binary classification. The training dataset includes both human-written and AI-written text passages.
However, the service is no longer available at the moment:
What happened: OpenAI updated its original blog post, stating that it was suspending the classifier due to low prediction accuracy:
As of July 20, 2023, the AI classifier is no longer available due to its low accuracy rate.
The main limitation of black box methods is their rapid degradation. There is a constant need to improve the algorithm to meet new quality standards. This problem is particularly visible in light of OpenAI’s recent decision to shut down its detector.
I decided to conduct my own evaluation to see how well some services detect generated text. To do this, I generated and paraphrased text using different models (GPT-3.5, GPT-4, LLaMA-2 70B, QuillBot). I compiled a dataset containing 10 samples for each category:
- Story: “Write me a short story about…”
- Paraphrase: “Rewrite and paraphrase the story…”
- Email: “Write me a quick email to Paul and ask him…”
- ELI5: “Explain to me like a little child what…”
I was amazed by the results; only a few services managed to accurately identify the generated text:
If you want to explore further research in this area, I suggest reading the article “LLM Generated Text Detection in Computer Science Education”, which already offers a complete comparison of these services. Here’s what the article had to say:
Our results show that CopyLeaks is the most accurate LLM-generated text detector, GPTKit is the best LLM-generated text detector for reducing false positives, and GLTR is the most resilient LLM-generated text detector… Finally, note that all LLM-generated text detectors are less accurate with code and other languages (apart from English), and after using paraphrasing tools (such as QuillBot).
Here is the full list of services currently available: