💡 Editor’s Note: I have been comparing various models over the past months, most extensively Gemini 2.0 and o1-pro since they came out. The long-context performance of these reasoning models is amazing - and how to structure instructions for them is quite interesting, too. That being said, there are still limits on context, and finding the right information remains the key to boosting performance for both reasoning and non-reasoning models. I have compiled a list of readings toward the end of the article for those who are curious!
Join us to learn more about RAG on Feb. 28!
On February 28, Steven Bonilla, an LLM Engineer at Microsoft, will share how he built a RAG-powered chatbot for financial analysts. From architecture to best practices, his insights will help you leverage RAG to harness AI for real-time, high-impact applications.
Sign up today at https://lu.ma/3e0y47ei !
The era of huge contexts - how do we give the bot what it needs?
AI models, especially large language models (LLMs), can technically take in a lot of context – Gemini boasts context windows long enough to fit whole books or even an encyclopedia. But just because we can give an AI all our information at once doesn’t mean we should.
Why?
I still remember, a few years ago, trying to explain my research to my lab mates - a rather complicated effort with lots of maths, programs, and psychology concepts intertwined. Back then I was still rather inexperienced at presentations (still improving today, of course), and I showed every single detail of the project. At the time I felt pretty good, but of course, many of my friends didn't quite follow my train of thought and were overwhelmed by the large amount of information I threw at them.
Sound familiar? It turns out that dumping too much information all at once is not a good way to present - or to hold just about any conversation. Just like my lab mates and professors who got lost in my sea of details, AI models can get overwhelmed, too!
Aside from sheer context size, there are also times when we cannot upload information to public AI chatbots at all - privileged research data, unpublished manuscripts, client-specific information, etc. - but we still want to leverage the convenience of LLMs. (On that note, make sure to turn off data sharing when using public chatbots!)
In short, too much data, private data sources, regulations, etc. all make it difficult for us to provide the AI with everything it needs to perform its task - even today.
Deep Dive: Interestingly, the newer reasoning models do quite well on long contexts (Databricks, 2024). However, such is not the case for all models. Older models like GPT-4o (yes, older - this is how fast things evolve :D) were not very good at handling long contexts, effectively utilizing only ~10-20% of the tokens supplied (Kuratov et al., 2024). OpenAI's o1 (pro) and o3 models accept up to 200k tokens, while Gemini boasts 2 million tokens (2.0 Thinking); in contrast, DeepSeek-R1 has a limit of 128k.
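To make these limits concrete, here is a minimal sketch of how you might check whether a prompt even fits before sending it. It uses OpenAI's open-source tiktoken tokenizer; the 200k figure is just the o1/o3 limit cited above, and you would swap in whatever limit your model actually has:

```python
# pip install tiktoken
import tiktoken

CONTEXT_LIMIT = 200_000  # e.g. the o1/o3 input limit mentioned above; adjust per model

def fits_in_context(prompt: str, model: str = "gpt-4o") -> bool:
    """Rough check: does this prompt fit within the model's context window?"""
    enc = tiktoken.encoding_for_model(model)  # assumes a tiktoken version that knows this model
    n_tokens = len(enc.encode(prompt))
    print(f"{n_tokens:,} tokens (limit: {CONTEXT_LIMIT:,})")
    return n_tokens <= CONTEXT_LIMIT

fits_in_context("Summarize the following research notes: ...")
```

Of course, fitting under the limit is the easy part; as the benchmarks above suggest, whether the model can actually use all those tokens is another matter.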
So... what can we do?
One method we could use is Retrieval-Augmented Generation (RAG). In real life, we humans don't memorize entire textbooks (well, unless we have an exam tomorrow - then we cram it all and, of course, forget everything soon after); we just look up what we need at the moment, whatever is most closely related to our problem.
RAG does the same thing for AI. It pulls in only the relevant bits of information (chunks of context - think of how you chop up a baguette; it would be rare to see someone eat it whole) right when the bot needs them. This way, the AI isn't juggling a mountain of irrelevant details; it's laser-focused on the stuff that matters. The result? Fewer mistakes, clearer answers, and a much happier bot (if bots had feelings, of course!). Now, there are many complex terms in Figure 1; for those, I recommend checking out Chapter 6 of AI Engineering (Huyen, 2025).
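For the curious, here is a minimal sketch of that retrieval step: chop the document into chunks, embed everything, and hand the model only the top-k chunks closest to the question. The `embed()` here is a toy hashed bag-of-words stand-in so the sketch runs on its own; in practice you would plug in a real embedding model:

```python
import numpy as np

def chunk(text: str, size: int = 500) -> list[str]:
    """Chop the 'baguette' into bite-sized pieces (fixed-size character windows)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    """Toy embedding: hashed bag-of-words. Swap in a real embedding model in practice."""
    dim = 512
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for word in t.lower().split():
            vecs[i, hash(word) % dim] += 1.0
    return vecs

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity)."""
    vecs = embed(chunks + [question])
    doc_vecs, q_vec = vecs[:-1], vecs[-1]
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]

# The prompt then contains only the relevant bits, not the whole document:
# context = "\n".join(retrieve(question, chunk(document)))
```

Real pipelines add a vector database, smarter chunking, and reranking on top, but the core idea is exactly this: retrieve a few relevant pieces, not the whole loaf.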
Now, getting relevant chunks optimally is an interesting challenge in itself - Anthropic has a good article on contextual retrieval (Anthropic, 2024). Think of it as adding 'decorations' to help you find items more easily - like the name stickers on my kids' diapers at day care. Parents often buy the same brands, and it would be quite difficult to tell them apart without some form of name tag!
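Here is a minimal sketch of that 'name tag' idea, in the spirit of the approach Anthropic describes: before indexing, ask an LLM to write a short blurb situating each chunk within the whole document, then prepend it to the chunk. The `llm` callable and the prompt wording below are placeholders, not any particular vendor's API:

```python
# Assumption: `llm` is any callable that takes a prompt string and returns a string
# (e.g. a thin wrapper around your chat API of choice) - a placeholder, not a real client.
CONTEXT_PROMPT = """<document>
{doc}
</document>
Here is a chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Give a short, succinct context situating this chunk within the overall document
to improve search retrieval of the chunk. Answer only with the context."""

def contextualize(doc: str, chunks: list[str], llm) -> list[str]:
    """Prepend an LLM-written 'name tag' to each chunk before embedding/indexing."""
    tagged = []
    for c in chunks:
        tag = llm(CONTEXT_PROMPT.format(doc=doc, chunk=c))
        tagged.append(f"{tag}\n{c}")  # the tag and the original chunk are embedded together
    return tagged
```

The tagged chunks then go through the same embed-and-retrieve step as before - they are just easier to tell apart, like the diapers.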
So, when should we consider it?
Figure 3 provides one possible decision process for choosing RAG. In reality, things can be much more nuanced, because there are many practical considerations, such as modality (do you need a multi-modal model?), cost ($$), regulation, and the technical expertise of your team (can you maintain a local RAG setup in the long term?).
If you do choose to use RAG and would like to learn more about the trade-offs and design considerations, then join our study group for an in-depth discussion on the topic from Steven Bonilla @ Microsoft!
Up Next
Stay tuned for our next “Builder’s Corner” article: Working with Thinking Chatbots: How to Speak Simply and Directly, where we’ll explore how to leverage reasoning models effectively with lots of context. Seriously, how can I tell the model where to look? By the way, be sure to check out ‘Reasoning best practices’ by OpenAI (2025). The article says to just keep your prompt ‘simple and direct’ - a feat not as simple as it sounds.
When was the last time you were frustrated because you couldn’t get your words across? If we could all just speak ‘simply and directly’, life would be much easier - and there wouldn’t be as many misunderstandings and miscommunications.
It is the same with reasoning models. Oftentimes I find myself describing a task to the model, then instantly realizing that what I said could be misinterpreted - and, not surprisingly, it often ends up misinterpreted... If you have similar struggles with reasoning models, please comment below!
References & Reading List
Databricks. (2024). The long context RAG capabilities of OpenAI o1 and Google Gemini. https://www.databricks.com/blog/long-context-rag-capabilities-openai-o1-and-google-gemini
Kuratov, Y., Bulatov, A., Anokhin, P., Rodkin, I., Sorokin, D. I., Sorokin, A., & Burtsev, M. (2024). BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Anthropic. (2024). Introducing contextual retrieval. https://www.anthropic.com/news/contextual-retrieval
Huyen, C. (2025). AI engineering: Building applications with foundation models. O'Reilly.
OpenAI. (2025). Reasoning best practices. https://platform.openai.com/docs/guides/reasoning-best-practices