Below is a list of blog posts discussing topics including Data, Analytics, Machine Learning, Artificial Intelligence, and Large-Language Models.
Large-language models (LLMs) are great generalists, but modifications are required for optimisation or specialist tasks. The easiest choice is Retrieval Augmented Generation (RAG) with in-context learning, which has become easier with OpenAI's custom GPTs. The hardest and least attainable option is creating a specialist LLM from scratch. But you can achieve much of the benefit by simply adjusting an existing LLM like ChatGPT, a process called fine-tuning. You can do that with open-source or commercial models...
Recently, at its DevDay, OpenAI released the GPT-4 Turbo preview with a 128k-token context window. That addresses a serious limitation for Retrieval Augmented Generation (RAG) applications, which I described in detail for Llamar.ai. A 2¹⁷-token context amounts to nearly 200 pages of text, assuming approximately 0.75 words per token and 500 words per page.
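The page estimate is simple back-of-the-envelope arithmetic. A quick sketch, using the rules of thumb stated above (not exact tokeniser figures), confirms it:

```python
# Back-of-the-envelope check of the context-window estimate.
# Assumed rules of thumb: ~0.75 words per token, ~500 words per page.
tokens = 2**17          # 131,072 tokens, marketed as "128k"
words = tokens * 0.75   # ~98,304 words
pages = words / 500     # ~197 pages
print(f"{tokens:,} tokens ≈ {words:,.0f} words ≈ {pages:.0f} pages")
```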
Today, I received access to the new custom GPT feature on ChatGPT, and it appears to do what Sam Altman demonstrated. The implications are far-reaching, beyond the death of the RAG business model. How can OpenAI achieve and capitalise on Omniscience in a world of organisational data silos and increasingly defensive content creators and legislators?
OpenAI's DevDay announcement yesterday addresses issues I wrote about regarding the infeasibility of RAG after building Llamar.ai this summer. Did I get it wrong? Working through the details will take some time, but some interesting observations are immediately apparent.
Over four months, I created a working retrieval-augmented generation (RAG) product prototype for a sizeable potential customer using a Large-Language Model (LLM). It became a ChatGPT-like expert agent with deep, up-to-date domain knowledge and conversational skills. But I am shutting it down instead of rolling it out. Here is why, how I got there and what it means for the future.
Learn to harness the potential of ChatGPT4, your virtual programming partner, with nine prompting tips. Improve your programming skills by communicating clearly, engaging in conversation, using the proper syntax, and iterating on complexity. Keep context fresh, and ChatGPT4 will be invaluable in your coding journey.
Open Source libraries offer user documentation. But expert users and contributors have a deeper understanding of the inner workings stemming from a mental model and architecture derived from deep dives into the code. That understanding and model are helpful to employ the library more effectively, debug issues when using it, and teach interesting concepts on how to structure complex reusable code.
Five tips for Cloud Engineers to deploy Databricks' Delta Lake on AWS safely.
Prevent errors and inconsistencies with Delta Lake's robust data management technology.
Discover the power of the Delta Lake transaction log - ensuring Data reliability and consistency.
Get the insights you need to compete with Amazon from a comprehensive dataset.
Microsoft could follow Google's $100bn loss. I tried the new Bing Chat (ChatGPT) feature, which was great until it went disastrously wrong. It even started arguing with me while being wrong and making up source code.
The Battle of the AI Chatbots Begins: Google's Bard Takes on ChatGPT.
ChatGPT is a state-of-the-art language model developed by OpenAI, utilising the Transformer model and fine-tuned through reinforcement learning to produce accurate and ethical text responses.
ChatGPT can combine Data with natural language and has extensive information about most subjects. That lends itself to novel applications like creating informative data dictionaries.
Programming with ChatGPT using an iterative approach is difficult, as I have demonstrated previously. Maybe ChatGPT can benefit from Test-driven development (TDD). Could it aid LLMs as it does humans?
Can ChatGPT help you develop software in Python? Let us ask ChatGPT to write code to query AWS Athena to test if and how we can do it step-by-step.
ChatGPT and similar language models have recently been gaining attention for their potential to revolutionise code generation and enhance developer productivity. I was curious to see what all the hype was about, so I decided to try it out for some development work.
How Bold Data achieved an astonishing 2.3x improvement by switching from x86 to ARM.
Data is the root of all my worries ...
OpenAI's ChatGPT has made the news recently as a next-generation conversational agent. It has a surprising breadth which made me wonder, could OpenAI generate specific technology content good enough to post, and what would that imply for the future?
Finally. AWS re:Invent 2022 brought the answer to the worst limitations of both Databricks and Athena. Athena Spark promises to bring Delta Lake scale-out processing effortlessly and inexpensively.
Should you switch your Data Lake to a Delta Lake? At first glance, Delta Lakes offer benefits and features like ACID transactions. But at what cost?
There is one simple thing most companies miss about their data. It has been instrumental in my work as a data professional ever since.
I have worked with data for decades. Here are the two key lessons I share with every customer, stakeholder and beginner in the field. Firstly, follow the data, not (only) your gut or experience. Secondly, never trust the data. Why, and what does this have to do with Luddites?
Insurance works because it shares costs in the face of uncertainty. What happens when Tesla removes uncertainty and distributes cost seemingly more fairly? First partially and eventually wholly? Will insurance fail, doing more harm than good?
I never wanted to be a solo founder. Yet, in 2021, I quit my job and started Bold Data to mine the Internet single-handedly. Trust me, it sounds as insane to write as to read. What on earth possessed me, and more importantly, would I do it again?
Your hard work is not appreciated. So why should you still do it? There is a good reason.
When I mentor university students or discuss careers with the people I lead, I often draw from four pieces of advice. I wish I had known these when I started, but they come with experience, perspective and confidence. Three things most of us lack at the beginning.
According to an adage, big data is anything too big for Excel, i.e. more than 1,048,576 rows. It is a bit tongue-in-cheek, but, as with many jokes, it is grounded in truth. Many business processes run on Excel to this day. That is an issue when analysing datasets like Amazon product data for valuable insights on pricing, production and supply planning, and new product or category development. Excel cannot load a single country's Amazon bestseller list. Even if you use more scalable systems, ma...
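A minimal sketch makes the worksheet limit concrete. It assumes the 4.9 million-row bestseller dataset mentioned elsewhere in this list; a single country's list may differ in size:

```python
# Sketch: Excel's hard worksheet limit vs an Amazon bestseller dataset.
# Assumption: 4.9 million rows, the size of Bold Data's free bestseller
# dataset; an individual country's list may be smaller or larger.
EXCEL_MAX_ROWS = 1_048_576        # worksheet row limit since Excel 2007
bestseller_rows = 4_900_000
sheets_needed = -(-bestseller_rows // EXCEL_MAX_ROWS)  # ceiling division
print(sheets_needed)  # 5 worksheets just to hold the rows
```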
Get huge, valuable datasets with 4.9 million Amazon bestsellers for free. No payment, registration or credit card is needed.
Many Amazon marketplace customers know that its huge product catalogue has data quality issues. However, they might expect its top sellers, which they frequently see and buy, to be accurate. Bold Data, which processes hundreds of millions of products daily, has a unique ability to find hidden insights and issues, for example, active Amazon bestsellers whose names result from data processing errors.
Public data has an enormous commercial and social impact. For example, in Ukraine, it affects war and peace, and with the Coronavirus, it involves life and death. We must keep public data accessible for the public good.