By Christian Prokopp on 2023-11-07
OpenAI's DevDay announcements yesterday address issues I raised about the infeasibility of RAG after building Llamar.ai this summer. Did I get it wrong? Working through the details will take some time, but some immediate observations stand out.
On the surface, two key issues I faced, the cost per token and the context length, are being addressed. The improvements, a 128k-token context window and a cost per token cut to between a half and a third of previous prices, are at the upper limit of what I expected, better than I had anticipated. Many other detailed improvements in natural language processing are promising before we even get to the multi-modal aspects.
For example, function calling now supports parallelism, something I had built into Llamar.ai and experimented with; it improves the user experience by accelerating complex tasks and queries. Thread persistence and management remove a headache for developers who want to get things done, although advanced use cases may still need more sophistication and customisation than built-in threads offer. I experimented with strategies for managing conversations by removing the least important information, which, depending on your application, is not necessarily the oldest. Additionally, intelligent rolling summaries that combine important and recent context can improve the user experience tremendously. But I digress.
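To illustrate the kind of importance-based trimming I mean, here is a minimal sketch. The score_importance and count_tokens helpers are hypothetical stand-ins for whatever your application uses; this is the idea, not Llamar.ai's implementation.

```python
def trim_thread(messages, budget, count_tokens, score_importance):
    """Drop the least important messages until the thread fits the token budget.

    count_tokens and score_importance are hypothetical, application-specific
    helpers; the oldest message is not necessarily the least important one.
    """
    kept = set(range(len(messages)))
    total = sum(count_tokens(m) for m in messages)
    # Consider messages in order of importance, least important first.
    for i in sorted(range(len(messages)), key=lambda i: score_importance(messages[i])):
        if total <= budget:
            break
        kept.discard(i)
        total -= count_tokens(messages[i])
    # Surviving messages keep their original conversational order.
    return [messages[i] for i in sorted(kept)]
```

A rolling-summary variant would replace the dropped messages with a short generated summary instead of discarding them outright.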
One of my main hurdles with Llamar.ai's customer use case was the monthly cost of up to a million queries and the restrictive context window size for RAG. Is it resolved? Firstly, the cost needed to drop to a twentieth, so we are still an order of magnitude away, and it needs to happen a few more times before large-scale, cost-sensitive use cases become feasible. It will happen, but I wonder if even OpenAI knows when. Moreover, the benefit of longer context windows comes at a cost: more tokens. How much of the price cut you actually capture as savings depends on how good you want your RAG product to be.
The longer context window is a relief to many, indeed. But be conscious that using it can cost up to $3.84 per query. Realistically, that would be closer to $1.50 or $2 when using most of the context window for input. Imagine applications where every interaction can cost you a dollar. A user having a long conversation, each question tacked onto the previous thread, could rack up tens of dollars of cost in one session. Developers and designers must use their thread context cleverly or switch models to manage cost. Power users can undoubtedly ruin your cost profile, and fair-use policies will likely become commonplace in T&Cs.
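For transparency on where those numbers come from, here is the back-of-the-envelope arithmetic, assuming the announced GPT-4 Turbo pricing of $0.01 per 1k input tokens and $0.03 per 1k output tokens (an assumption worth checking against OpenAI's current price list):

```python
# Back-of-the-envelope cost per query for a 128k-token context window.
# Assumed pricing: $0.01 per 1k input tokens, $0.03 per 1k output tokens.
INPUT_PRICE_PER_TOKEN = 0.01 / 1000
OUTPUT_PRICE_PER_TOKEN = 0.03 / 1000

def query_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_PRICE_PER_TOKEN + output_tokens * OUTPUT_PRICE_PER_TOKEN

# Theoretical worst case: the entire 128k window billed at the output rate.
print(f"{query_cost(0, 128_000):.2f}")      # 3.84
# More typical: most of the window filled with input, a short answer as output.
print(f"{query_cost(124_000, 4_000):.2f}")  # 1.36
```

The worst case assumes the whole window billed at the output rate; realistic queries fill most of the window with input, which lands in the $1.30 to $2 range once longer answers and follow-up turns are counted.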
Llamar.ai instantly had competitors in many forms, and a lot of investment money has already flowed into them. The field was accelerated by libraries like LangChain and LlamaIndex and by the maturing of vector databases, which simplified the prototyping and deployment of RAG apps. One option in such an active, crowded field is throwing PE money at the problem, swiftly capturing the market with lots of sales, and ignoring the cost. By now, it should be evident that this is a dangerous move. There is no guarantee that the cost profile will develop favourably before the investment money dries up and you go bankrupt.
Assuming that does not kill you, what about the data integration monster? The easy use cases are public-facing and low-value, like RAG over documentation, blog posts, user forums, etc. The real value is locked away in Jira, Confluence, Salesforce, PDFs, Office documents, FTPs, RDBMSs, mainframes, CRMs, etc. But if you have ever had to integrate processes between them, you will know that there is no magic SaaS or single adapter to solve this. That is before we talk about data governance, privacy and IP hurdles.
And then there is OpenAI. If you paid attention, you would have noticed a little mention of Retrieval as one of the three core tools besides Function Calling and Code Interpreter in the latest OpenAI documentation. Retrieval "... will automatically chunk your documents, index and store the embeddings, and implement vector search to retrieve relevant content to answer user queries", which is RAG by API. That means capable organisations can quickly achieve RAG capabilities independently. At the same time, fine-tuning is made more accessible and cheaper, with each release eroding it as a differentiator.
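For context on what "RAG by API" replaces, here is a minimal sketch of the loop the Retrieval tool automates: chunk the documents, embed the chunks, and run a vector search at query time. The embed_fn placeholder stands in for whatever embedding model you use, and a real system would use a vector database rather than brute-force cosine similarity; this illustrates the pattern, not OpenAI's implementation.

```python
import numpy as np

def chunk(text, size=500):
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(documents, embed_fn):
    """Chunk every document and embed each chunk; embed_fn is your model call."""
    chunks = [c for doc in documents for c in chunk(doc)]
    vectors = np.array([embed_fn(c) for c in chunks])
    return chunks, vectors

def retrieve(query, chunks, vectors, embed_fn, k=5):
    """Return the k chunks most similar to the query by cosine similarity."""
    q = np.asarray(embed_fn(query))
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```

The retrieved chunks are then prepended to the prompt, which is exactly the step that inflates the input token count discussed above.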
Of course, as a service with integration into documentation and chatbots, there is value to be provided to customers. But it is built on top of commoditised APIs, plural, since Amazon, Meta, Alphabet, and open-source players like Anyscale will join. For RAG, that means a race to the bottom and maximum scale for a few players and specialisation for the rest. The latter will have to tackle data integration and high-value problems for customers or reinvent themselves with new products.
Lastly, technology providers of ERP, CRM, BI, storage and other tools are not standing still; they are integrating AI capabilities into their products and tackling the market from the incumbent's position of strength. We saw that play out with analytics over recent years, and now it is AI. For some providers, that will mean updating their marketing copy and sales pitch, while others will genuinely innovate. Either way, it creates a confusing and competitive landscape for customers.
RAG as a product is on the fast track to commoditisation. Anyone playing in the field must adapt quickly to find a niche and specialise or gamble on being one of the few who can survive as the GoDaddy or Namecheap of RAG. Undoubtedly, there will be unforeseen upsides of simplifying RAG, fine-tuning, thread management and agent development that will further push the imagination and opportunities for product development with AI.
For organisations, this is another opportunity wrapped in a headache. Another place for data to be moved to for insights, analytics, automation and value generation. Like with Big Data, Data Science, Analytics and Machine Learning previously, the hype and cost precede the value, and the complexity and risks are plentiful. And like previously, start with the business problem and value proposition, not the technology, before spending big.
Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at christian@bolddata.biz for inquiries.