
Vibe-Coding And The Making Of A Lost Game
I’ve recently been experimenting with vibe-coding, where I describe to a Large Language Model (LLM) what I want it to do in a prompt, the LLM generates the software, and I only test and refine the source code it produces. When used with a powerful model that performs well on coding benchmarks (like OpenAI’s GPT-4.1 or o3 family of models, Claude 3.5/3.7, or Gemini 2.5 Pro), I’ve found that these models are surprisingly capable of quickly generating decent-quality code, and those with larger context windows can understand the codebase exceptionally well.
Over the last few weekends and evenings, I’ve been trying to recreate a game that I used to have on my phone. Unfortunately, the original developer stopped maintaining it, so it no longer works on the latest versions of my phone’s operating system. This seemed like the perfect candidate for a vibe-coding challenge: recreate the game as a Progressive Web App (PWA) so that it could be installed on my phone but would also work on the web, meaning I wouldn’t have to worry about software versions any more.
I began with an open-source PWA starter kit using TypeScript, which is a particularly powerful language to use with LLMs because it annotates plain JavaScript with types that help the LLM debug issues and improve the quality of its code. It also helps me understand the code and the LLM’s intentions, so I can be quicker to spot misinterpretations, which in turn helps me debug (more on that later!). I started with Cline, a free extension for my VS Code IDE, and used Gemini 2.5 Pro because Google Cloud gave me $300 of free credits and I was excited by its very large context window. I used Cline’s ‘Plan’ mode and gave it permission to read files without asking me first. Plan mode meant I could use the LLM to think about how it would structure the code after I told it how the game should work and which technology I had selected. (If I’d asked, it could also have helped me choose the technology.) Once I was happy with the plan, I switched Cline into ‘Act’ mode and it got to work, going file by file and writing code to execute the plan’s items, asking my permission to save each change it made. The LLM didn’t get everything right the first time, but Cline automatically sent it the errors and warnings from my IDE, and it was usually good at fixing them.
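As a purely illustrative sketch (these names are hypothetical, not the interfaces from my actual game), this is the kind of typed structure I mean: a wrongly typed value gets flagged by the compiler immediately, and Cline feeds that error straight back to the model.

```typescript
// Hypothetical example of a typed game state; names are illustrative,
// not the real ones from my codebase.
interface GameState {
  currentPlayer: "light" | "dark";
  diceRoll: number;
}

function nextTurn(state: GameState): GameState {
  // If the LLM (or I) tried to return { currentPlayer: "red", diceRoll: "4" },
  // TypeScript would reject it at compile time, and Cline would surface
  // that error to the model for an automatic fix.
  return {
    currentPlayer: state.currentPlayer === "light" ? "dark" : "light",
    diceRoll: 0,
  };
}
```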
After it had generated the basic game, a few things didn’t work the way I expected, so I prompted it to correct those areas, which it usually did very well. A few times I stepped in and adjusted things myself to keep it on the right track, but these interventions were minimal and only happened when prompting the AI would have taken longer than making the change by hand. Whereas I would normally review all the code, for this experiment I only took a cursory look through it and never gave it a thorough code review.
This worked well, and I was able to work with the AI to add features and get the game closer to what I remembered of the original. However, I noticed one bug where the game was stacking pieces in the wrong order. I started by explaining the issue, and the AI tried to fix it, but every change it made either made the problem worse, created other problems, or made no difference despite the many code changes involved. After about five attempts with different prompts, including asking it to think through each stage where the code could be going wrong, I realised I had to get my hands dirty, deep-dive into the issue, and conduct a root-cause analysis myself.
Because I hadn’t written the code myself, it took me a long time to understand what all the relevant code was doing. Eventually, using traditional debugging techniques, I established that the issue was an accidental duplication of entries in a Map: the setter sometimes added entries with the key as a string and other times as a number. This was a subtle bug, caused by the LLM incorrectly using type-casting, which hid the issue from TypeScript’s oversight. Because I didn’t write the code, the debugging took me a lot longer, probably a couple of hours once all accounted for. The fix in the end was really quite trivial and I implemented it manually.
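To illustrate the class of bug (the names below are a hypothetical, simplified reconstruction, not the actual game code), a type assertion can let a string key slip into a number-keyed Map unnoticed:

```typescript
// Hypothetical, simplified reconstruction of the bug; not the actual game code.
const piecesBySquare = new Map<number, string[]>();

function addPiece(squareId: number | string, piece: string): void {
  // The generated code asserted the key's type instead of converting it,
  // so a string "3" (e.g. read from a DOM attribute) was treated as the number 3.
  const key = squareId as number; // the cast hides the mismatch from TypeScript
  const existing = piecesBySquare.get(key) ?? [];
  piecesBySquare.set(key, [...existing, piece]);
}

addPiece(3, "first piece");    // stored under the number 3
addPiece("3", "second piece"); // stored under the string "3" (a duplicate entry!)

console.log(piecesBySquare.size); // 2, not 1, so any stacking logic breaks
```

In a case like this, the fix can be as simple as converting the key (for example with Number(squareId)) instead of asserting its type, so that every entry uses a genuinely numeric key.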
Overall, and despite this debugging experience, I found vibe-coding an incredibly effective way of turning my hope of getting my lost game back into a real product. Quite frankly, I wouldn’t have sunk the time into making the game manually, but by using the LLM agent to do most of the work for me, I was able to do it significantly quicker and with much less effort on my side. The two hours lost to debugging pale in comparison to the time that would have been lost to a handwritten implementation.
I think this technique is particularly effective for low-stakes implementations like one-off scripts, games, or prototypes. It might work well for less throwaway implementations too, but the reviewing overhead would increase significantly, and from this experiment alone I can’t guarantee the technique would prove effective in that context. In addition, while my prompts started at a few cents each, by the time I was finishing, each prompt was costing over a dollar. In total I used about $65 of my free credit, which I think is expensive, despite Gemini 2.5 Pro being quite a cost-effective model (see below for how I could have decreased this cost). For a business doing the same amount of work on a prototype, this could be a very reasonable cost compared to paying engineers by the hour.
- View the finished game at https://ur.sint.dev
- View the code at https://github.com/davidsint/ur
P.S. How to decrease the cost of the AI usage
In hindsight, I could have significantly reduced the cost of using the model by adopting two techniques. The primary cause of the expense was that my entire conversation with the AI happened in one chat, which meant I was sending an absolutely enormous amount of context with every request. This was often unnecessary, and I could have reduced the number of input tokens used had I:
- Worked with the AI to create a rules file (i.e. .clinerules) that explained the codebase and its files, so that only the important ones needed to be put into context for a given request (see the sketch below)
- Put each separate task into a different ‘chat’. The main chat that I had within Cline used 32m input tokens and 137k output tokens, with an API cost of $66.71. However, subsequent tasks in fresh chats each cost less than $0.50 (example tasks I did in unique chats were adding SEO, adding accessibility features, standardising styling, and saving game state).
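For illustration, a rules file is just free-form instructions that Cline loads into every chat. Something along these lines (the contents here are a hypothetical sketch, not my actual file) would have saved the model from re-reading the whole codebase on each request:

```
# .clinerules (hypothetical sketch)

## Project overview
- A TypeScript Progressive Web App recreating a board game.
- Game rules/state logic, rendering, and PWA setup live in separate modules.

## Conventions
- Strict TypeScript; avoid type assertions (as) and convert values explicitly.
- Keep game logic independent of DOM/rendering code.

## When handling a request
- Only read the files relevant to the request; ask before scanning the whole repo.
```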