Previously, I explored the design patterns of AI agentic applications. After getting started with this new type of agentic application, I naturally wanted to try building my own. Today, I’ll briefly share a prototype I coded in about 2–3 hours, modeled after a GPT agent.
This post is about the AI Agent idea, which I find genuinely exciting. An AI Agent is so much more than a chatbot window: it can actually reason on your behalf, get tasks done, and then give you a neat summary of the results.
We can start with the basics by building our own AI Agent, choosing the GPT Agent as our reference. This simple project will help us really understand the core elements of a great AI Agent: Reasoning, Planning, and Tool Use.
From there, we can look at more advanced features to make it even better, like Memory, Reflection (which is like self-correction), and Perception.
It's pretty cool when you think about it—if you look closely, GPT has been an agent all along, even from the very beginning. It doesn't just answer your questions; it follows a whole process of reasoning, planning, and using tools to get you what you need.
I hope this gives you a better sense of what an AI Agent is all about. Feel free to reach out anytime to chat more and keep exploring this fascinating world together!
1. Building an AI Agent – a GPT Agent-Like Prototype
In this experiment, we created a case where an Agent can automatically transform a user’s intent into a concrete action—a service scenario similar to a GPT Agent.
In the actual demo, the user asked the following question:
“I really want to visit that beautiful island. Could it be in Japan?”
The system’s response was then displayed (with a few seconds of screen delay due to technical issues, which can be ignored).
This demo was built in a short time—around 2–3 hours—to connect several technical components. While there are still many details to refine and only a few tools were used, the demo already demonstrates how a GPT Agent–like system could operate.
The workflow can be summarized as follows:
1. User Intent Reasoning (LLM — Intent Reasoning): infer the user’s intent from the input.
2. Decomposition into Actions (LLM — Planning, Intent to Action): break down the inferred intent into executable actions.
3. Tool Matching (Match Tool Definition): align each action with the corresponding tool or reference configuration.
4. Execution Loop: execute the actions step by step; if a tool is required, dispatch the appropriate tool.
5. Result Collection: gather outputs for each action (e.g., screenshots and text extraction).
6. Final Report Generation (LLM — Summary Generation): integrate the results into a comprehensive summary report.
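The workflow above can be sketched in a few lines of Python. This is only a control-flow illustration: `call_llm` is a stub standing in for a real model call, and the tool registry holds placeholder lambdas rather than real browser tools.

```python
# Minimal sketch of the GPT Agent-like workflow described above.
# call_llm is a stand-in for a real LLM API call; it is stubbed here so
# the end-to-end control flow can be followed and run.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM (e.g. Gemini).
    if "intent" in prompt:
        return "find_island_location"
    if "plan" in prompt:
        return "search_web;capture_page"
    return "summary: the island appears to be in Japan"

TOOLS = {
    "search_web": lambda: {"text": "island info"},
    "capture_page": lambda: {"screenshot": "page.png", "html": "<html>...</html>"},
}

def run_agent(user_input: str) -> dict:
    # 1. User intent reasoning
    intent = call_llm(f"Infer the intent of: {user_input}")
    # 2. Planning: decompose the intent into actions
    actions = call_llm(f"plan actions for {intent}").split(";")
    # 3-5. Tool matching, execution loop, result collection
    results = [TOOLS[a]() for a in actions if a in TOOLS]
    # 6. Final report generation
    report = call_llm(f"Summarize these results: {results}")
    return {"intent": intent, "actions": actions, "results": results, "report": report}

out = run_agent("I really want to visit that beautiful island. Could it be in Japan?")
```

Swapping the stubs for real LLM and tool calls preserves this same six-step shape.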
Reception and triggering of the user’s intent.
Through LLM reasoning and planning, the input is decomposed into structured actions that fit the provided tool definitions.
The tools are executed step by step according to those actions, performing the necessary data collection (my requirements: take screenshots and collect the final HTML content).
The final results are obtained via the tools (screenshots and the collected final HTML content).
The LLM is then applied again to generate a summary and bilingual (Chinese and English) versions based on the previously collected results, forming the final report.
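"Structured actions that fit the provided tool definitions" might look something like the following. The field names and tool names here are illustrative, not the ones from the original demo; the point is the validation step that keeps the planner's output aligned with what the tools actually accept.

```python
# Hypothetical tool definitions and planner output; validation keeps only
# actions whose tool exists and whose arguments match the definition.

TOOL_DEFS = {
    "browser.open": {"args": ["url"]},
    "browser.screenshot": {"args": []},
    "browser.get_html": {"args": []},
}

planned_actions = [
    {"tool": "browser.open", "args": {"url": "https://example.com/island"}},
    {"tool": "browser.screenshot", "args": {}},
    {"tool": "browser.get_html", "args": {}},
]

def validate(actions, tool_defs):
    # Drop any action the tool layer cannot execute as specified.
    valid = []
    for a in actions:
        spec = tool_defs.get(a["tool"])
        if spec and set(a["args"]) == set(spec["args"]):
            valid.append(a)
    return valid

checked = validate(planned_actions, TOOL_DEFS)
```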
2. Key Design Points of the AI Agent
The key points of this type of agentic system are as follows:
Core Component — The Planner. At the heart of the system is the core component, which in this case acts as the Planner (I used Gemini here). Based on the prompt design, it analyzes the user’s intent and converts it into corresponding practical actions (Intention → Action).
Sufficient Tools Provided. Another crucial aspect is providing appropriate tools that can fulfill these intentions-turned-actions. In this example, I configured a browser operation tool.
High-Quality Tool Matching. Between these steps, there must be a strong adaptation mechanism that maps the defined actions into tool-usable text. This ensures smooth orchestration of operations, ultimately completing a series of automated tasks that simulate user actions.
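A planner prompt for the Intention → Action conversion could take roughly the shape below. The wording and the tool list are my own illustration, not the prompt used in the original demo.

```python
# Illustrative planner prompt: describe the available tools and ask the
# LLM to emit actions only in terms of those tools.

PLANNER_PROMPT = """You are a planner. Given the user's request, output a
JSON list of actions. Each action must use exactly one of these tools:

- browser.open(url): open a web page
- browser.screenshot(): capture the current page
- browser.get_html(): collect the final HTML content

User request: {request}
Output only JSON."""

def build_planner_prompt(request: str) -> str:
    return PLANNER_PROMPT.format(request=request)

prompt = build_planner_prompt("Could that beautiful island be in Japan?")
```

Constraining the output to a fixed tool vocabulary is what makes the later tool-matching step tractable.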
3. Prototype Code of the AI Agent
Mainly Reasoning, Planning, and Execution.
The Reasoning and Planning parts are simply guided using prompts.
This also includes reference information and requirements for tool matching.
Finally, execute the Execution step iteratively until completion.
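The iterative Execution step can be sketched as a simple loop over the planned actions, collecting each tool's output until the plan is exhausted. The tools here are stubs for illustration.

```python
# Sketch of the Execution loop: dispatch each action to its tool and
# collect the outputs; unknown tools are recorded rather than crashing.

def execute(actions, tools):
    collected = []
    pending = list(actions)
    while pending:  # iterate until completion
        action = pending.pop(0)
        tool = tools.get(action["tool"])
        if tool is None:
            collected.append({"action": action["tool"], "error": "unknown tool"})
            continue
        collected.append({"action": action["tool"], "output": tool(**action["args"])})
    return collected

# Stub tools matching my data-collection requirements (screenshot + HTML).
tools = {
    "screenshot": lambda: "shot.png",
    "get_html": lambda: "<html>...</html>",
}
results = execute(
    [{"tool": "screenshot", "args": {}}, {"tool": "get_html", "args": {}}],
    tools,
)
```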
A simple experiment compared four different prompting modes for handling and decomposing the user’s intent, and demonstrated the advantages of ReAct.
We explored the effectiveness of multiple approaches. With a basic standard prompt (Normal Prompt), both ReAct and CoT showed limited performance on their own. We therefore ran hybrid experiments (CoT-SC + ReAct or ReAct + CoT-SC), in which increasing the number of iterations achieved higher accuracy.
The paper also notes that, ultimately, combining internal and external knowledge is necessary to improve results, since the training data of large models is inherently limited to a specific point in time.
Regarding CoT, although CoT-SC improves upon CoT, it is still prone to hallucinations—a traditional issue—since it only relies on internal iterative reasoning. In contrast, ReAct’s actions can combine internal and external knowledge, and the ability to incorporate up-to-date external information is a major advantage.
The combination of these two approaches results in a more effective method.
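The key structural difference is that ReAct interleaves reasoning ("thoughts") with external actions whose observations feed the next thought, whereas CoT-style methods reason purely internally. A minimal sketch of that loop, with the LLM and the search tool stubbed out:

```python
# Sketch of a ReAct-style loop: thought -> action -> observation, repeated
# until the model commits to an answer. llm_step and search are stubs.

def llm_step(history):
    # Stub: decide to search once, then finish once an observation exists.
    if not any(h.startswith("Observation") for h in history):
        return "Action: search[island location]"
    return "Answer: the island is in Japan"

def search(query):
    # Stub external tool; this is where up-to-date knowledge enters,
    # which pure CoT / CoT-SC reasoning lacks.
    return "External source says the island is in Japan."

def react(question, max_steps=5):
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm_step(history)
        if step.startswith("Answer:"):
            return step, history
        obs = search(step)
        history += [step, f"Observation: {obs}"]
    return "Answer: (no conclusion)", history

answer, trace = react("Could that island be in Japan?")
```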
The corresponding experiments were conducted on GPT at the time.
4. Core Roles of an AI Agent
This paper provided a lot of inspiration. Reflecting on what is now called an Agent, or the so-called post-AI software era, what are the core differences that define it?
It is clear that the core of the Agent era lies in AI (mostly LLMs), which lets software play far more roles than before. Unlike traditional software, which operated on rule-based logic, we no longer rely on large-scale rule sets; instead, we base the system on LLMs and probabilistic reasoning, enabling key components to exhibit greater diversity and flexibility.
The following key roles are both core and common, forming the foundation of basic Agent applications:
Reasoning: The Agent can use reasoning to understand user intent, analyze the environment, evaluate the potential outcomes of different actions, diagnose problems, and determine when and how to use tools.
Planning: Based on reasoning results, the Agent generates an action blueprint. It can decompose a high-level goal into a series of coherent, ordered, and executable sub-steps. Planning also considers practical tool usage and coordinates the workflow in text form.
Tool Use: An important ability extension. With access to various tools, the LLM is no longer limited to textual outputs; it can orchestrate different tools, integrate their results, and enhance the overall value of the application.
Going a step further, we can strengthen the following directions to continuously improve the quality and depth of our applications:
Memory: Pure reasoning, planning, and tool use are limited to single operations. The ability to remember is an important feature of LLMs. Effective management of memory, as well as strategies for compressing and extracting key memories, can be further developed to enhance applications.
Reflection: Planning may go wrong, and tool usage may fail. An Agent needs the ability to evaluate its own actions and outcomes. Adding a Reflection layer to the original flow allows the Agent to think iteratively, check results, and propose better reasoning and actions, ultimately optimizing the final output.
Perception: While Tool Use can also be a means of gathering information, broader perception—especially in multi-modal Agents—is essential. This includes the recently popular concept of Context Engineering, which involves sensing and understanding the overall environment, thereby improving the quality of final outputs.
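Of the three advanced directions, Reflection is the easiest to bolt onto the existing flow. A minimal sketch, with the agent step and evaluator stubbed: after each attempt an evaluator checks the result, and if it is unsatisfactory the critique is folded into the next attempt.

```python
# Sketch of a Reflection layer: attempt -> evaluate -> retry with critique.
# attempt() and evaluate() are stubs; real versions would be LLM calls.

def attempt(task, critique=None):
    # Stub "agent step": improves only after it has seen a critique.
    return "good result" if critique else "draft result"

def evaluate(result):
    # Stub evaluator: accept only the improved result.
    ok = result == "good result"
    critique = None if ok else "draft is incomplete; add missing details"
    return ok, critique

def run_with_reflection(task, max_rounds=3):
    critique = None
    for round_no in range(1, max_rounds + 1):
        result = attempt(task, critique)
        ok, critique = evaluate(result)
        if ok:
            return result, round_no
    return result, max_rounds

result, rounds = run_with_reflection("summarize the island findings")
```

The same wrapper pattern works whether the inner step is a single LLM call or the full plan-and-execute loop.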
5. Exploring More Applications of AI Agents
Regarding agentic applications, you can also refer to some well-known GitHub repositories showcasing various LLM applications. However, these often require preparing the necessary API keys and setting up the corresponding environment in advance.
Shubhamsaboo / awesome-llm-apps
GitHub – Shubhamsaboo/awesome-llm-apps: Collection of awesome LLM apps with AI Agents and RAG using OpenAI, Anthropic, Gemini and opensource models.
6. What to Improve Next
This is quite interesting. Essentially, this approach can be used to implement most automated agents. In the end, it all comes down to refining each scenario and each detail — and I believe that’s also the deciding factor for this kind of application. For example:
Reasoning: How to interpret varied user intents accurately and reason about them effectively.
Planner: How to decompose the reasoning results into correct and reasonable actions.
Tool Use: How to provide not only abundant tools but also ensure the Planner selects the most appropriate tool for each task.
Workflow Refinement: For example, adding a Router played by the LLM before the Planner to classify intentions, or stacking Reflection / Evaluator-Optimizer layers into the workflow, enabling iterative thinking and optimization.
Extension: Integrating and providing more tools, such as connecting to Google Workspace or other productivity tools.
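The Router idea above can be sketched as a classification step placed before the Planner. The categories and the classifier here are illustrative stubs; a real router would be an LLM call returning a category.

```python
# Sketch of an LLM-played Router in front of the Planner: classify the
# intent first, then hand off to a per-category planner.

ROUTES = {
    "travel": lambda req: ["search destination", "collect pages", "summarize"],
    "productivity": lambda req: ["open workspace tool", "run task"],
}

def classify(request: str) -> str:
    # Stub router; a real one would prompt an LLM for the category.
    return "travel" if "visit" in request or "island" in request else "productivity"

def route_and_plan(request: str):
    category = classify(request)
    plan = ROUTES[category](request)
    return category, plan

category, plan = route_and_plan("I really want to visit that beautiful island.")
```

Routing first keeps each planner's prompt small and focused on one intent family.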
In short, this was a quick experiment built in just a few hours, documented here as a simple sharing.
At this stage, I haven’t used any frameworks or SDKs. Later, to speed things up, I could leverage LangChain, LangGraph, Google Agent SDK, GPT Agent SDK, and others.