Below is a list of blog posts discussing topics including Data, Analytics, Machine Learning, Artificial Intelligence, and Large-Language Models.
Large-language models (LLMs) are great generalists, but modifications are required for optimisation or specialist tasks. The easiest choice is Retrieval Augmented Generation (RAG) with in-context learning, which has become easier with OpenAI's custom GPTs. The hardest and least attainable option is creating a specialist LLM from scratch. But you can achieve much of the benefit by simply adjusting an existing LLM like ChatGPT, a process called fine-tuning. You can do that with open-source or commercial models...
Recently, at its DevDay, OpenAI released the GPT-4 Turbo preview with a 128k-token context window. That addresses a serious limitation for Retrieval Augmented Generation (RAG) applications, which I described in detail for Llamar.ai. A 2¹⁷-token context amounts to nearly 200 pages of text, assuming approximately 0.75 words per token and 500 words per page.
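The page estimate is simple back-of-the-envelope arithmetic. A quick sketch, using the rules of thumb stated above (not exact tokeniser figures), confirms it:

```python
# Back-of-the-envelope check of the context-window estimate.
# Assumed rules of thumb: ~0.75 words per token, ~500 words per page.
tokens = 2**17          # 131,072 tokens, marketed as "128k"
words = tokens * 0.75   # ~98,304 words
pages = words / 500     # ~197 pages
print(f"{tokens:,} tokens ≈ {words:,.0f} words ≈ {pages:.0f} pages")
```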
Today, I received access to the new custom GPT feature on ChatGPT, and it appears to do what Sam Altman demonstrated. The implications are far-reaching, beyond the death of the RAG business model. How can OpenAI achieve and capitalise on Omniscience in a world of organisational data silos and increasingly defensive content creators and legislators?
OpenAI's DevDay announcement yesterday addresses issues I wrote about regarding the infeasibility of RAG after building Llamar.ai this summer. Did I get it wrong? Working through the details will take some time, but some interesting observations are immediately apparent.
Over four months, I created a working retrieval-augmented generation (RAG) product prototype for a sizeable potential customer using a Large-Language Model (LLM). It became a ChatGPT-like expert agent with deep, up-to-date domain knowledge and conversational skills. But I am shutting it down instead of rolling it out. Here is why, how I got there and what it means for the future.
Learn to harness the potential of ChatGPT4, your virtual programming partner, with nine prompting tips. Improve your programming skills by communicating clearly, engaging in conversation, using the proper syntax, and iterating on complexity. Keep context fresh, and ChatGPT4 will be invaluable in your coding journey.
Open Source libraries offer user documentation. But expert users and contributors have a deeper understanding of the inner workings stemming from a mental model and architecture derived from deep dives into the code. That understanding and model are helpful to employ the library more effectively, debug issues when using it, and teach interesting concepts on how to structure complex reusable code.
Five tips for Cloud Engineers to deploy Databricks' Delta Lake on AWS safely.
Prevent errors and inconsistencies with Delta Lake's robust data management technology.
Discover the power of the Delta Lake transaction log - ensuring Data reliability and consistency.
Get the insights you need to compete with Amazon from a comprehensive dataset.
Microsoft could follow Google's $100bn loss. I tried the new Bing Chat (ChatGPT) feature, which was great until it went disastrously wrong. It even started arguing with me while being wrong and making up source code.
The Battle of the AI Chatbots Begins: Google's Bard Takes on ChatGPT.
ChatGPT is a state-of-the-art language model developed by OpenAI, utilising the Transformer model and fine-tuned through reinforcement learning to produce accurate and ethical text responses.
ChatGPT can combine Data with natural language and has extensive information about most subjects. That lends itself to novel applications like creating informative data dictionaries.
Programming with ChatGPT using an iterative approach is difficult, as I have demonstrated previously. Maybe ChatGPT can benefit from Test-driven development (TDD). Could it aid LLMs as it does humans?
Can ChatGPT help you develop software in Python? Let us ask ChatGPT to write code to query AWS Athena to test if and how we can do it step-by-step.
ChatGPT and similar language models have recently been gaining attention for their potential to revolutionise code generation and enhance developer productivity. I was curious to see what all the hype was about, so I decided to try it out for some development work.
How Bold Data achieved an astonishing 2.3x improvement by switching from x86 to ARM.
Data is the root of all my worries ...
OpenAI's ChatGPT has made the news recently as a next-generation conversational agent. It has a surprising breadth which made me wonder, could OpenAI generate specific technology content good enough to post, and what would that imply for the future?
Finally. AWS re:Invent 2022 brought the answer to the worst limitations of both Databricks and Athena. Athena Spark promises to bring Delta Lake scale-out processing effortlessly and inexpensively.
Should you switch your Data Lake to a Delta Lake? At first glance, Delta Lakes offer benefits and features like ACID transactions. But at what cost?
There is one simple thing most companies miss about their data. It has been instrumental in my work as a data professional ever since.
I have worked with data for decades. Here are the two key lessons I share with every customer, stakeholder and beginner in the field. Firstly, follow the data, not (only) your gut or experience. Secondly, never trust the data. Why, and what does this have to do with Luddites?
Insurance works because it shares costs in the face of uncertainty. What happens when Tesla removes uncertainty and distributes cost seemingly more fairly? First partially and eventually wholly? Will insurance fail, doing more harm than good?
I never wanted to be a solo founder. Yet, in 2021, I quit my job and started Bold Data to mine the Internet single-handedly. Trust me, it sounds as insane to write as to read. What on earth possessed me, and more importantly, would I do it again?
Your hard work is not appreciated. So why should you still do it? There is a good reason.
When I mentor university students or discuss careers with the people I lead, I often draw from four pieces of advice. I wish I had known these when I started, but they come with experience, perspective and confidence. Three things most of us lack at the beginning.
According to an adage, big data is anything too big for Excel, i.e. more than 1,048,576 rows. It is a bit tongue-in-cheek, but, as with many jokes, it is grounded in truth. Many business processes run on Excel to this day. That is an issue when analysing datasets like Amazon product data for valuable insights on pricing, production and supply planning, and new product or category development. Excel cannot load a single country's Amazon bestseller list. Even if you use more scalable systems, ma...
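A minimal sketch makes the worksheet limit concrete. It assumes the 4.9 million-row bestseller dataset mentioned elsewhere in this list; a single country's list may differ in size:

```python
# Sketch: Excel's hard worksheet limit vs an Amazon bestseller dataset.
# Assumption: 4.9 million rows, the size of Bold Data's free bestseller
# dataset; an individual country's list may be smaller or larger.
EXCEL_MAX_ROWS = 1_048_576        # worksheet row limit since Excel 2007
bestseller_rows = 4_900_000
sheets_needed = -(-bestseller_rows // EXCEL_MAX_ROWS)  # ceiling division
print(sheets_needed)  # 5 worksheets just to hold the rows
```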
Get huge, valuable datasets with 4.9 million Amazon bestsellers for free. No payment, registration or credit card is needed.
Many Amazon marketplace customers know that its huge product catalogue has data quality issues. However, they might expect its top sellers, which they frequently see and buy, to be accurate. Bold Data, which processes hundreds of millions of products daily, has a unique ability to find hidden insights and issues, for example, active Amazon bestsellers whose names result from data processing errors.
Public data has an enormous commercial and social impact. For example, in Ukraine, it affects war and peace, and with the Coronavirus, it involves life and death. We must keep public data accessible for the public good.