This project set out to improve the style and presentation of the Blog Posts on this website, and to test how different LLMs would tackle the problem.
The winner was DeepCoder, but it had some user experience issues that I patched with an implementation provided later by Tulu3:70b. The conversations leading to the final mergeable code can be found in the same folder as this ReadMe.
More information about the judging can be found in this GitHub branch, which holds the Rankings and the related LLM Reasons for Rank.
A video series of my sessions working with the LLMs is available at LLM Sessions by Mind of a Fighting Lion Enthusiast.
There is also a collection of Pull Requests that reflect the changes offered by each LLM, given the timebox and context I could provide.
It did not go very well, but part of that could have been me still working out what I wanted from the experiments.
This one was the winner, so it went pretty well.
It was fast to respond, but it often had the wrong answer the first time around. The code did not fare well in the Rankings either.
It did pretty well, but it could not figure out how to fix the Text Flow issue at the end, so I borrowed the answer from another LLM.
It got things wrong often, and it did not do well in the Rankings.
It got everything done on the first go without many mistakes. It felt good and productive to work with.
Relatively quick, but I ran into several issues that needed to be fixed before getting to the right output.
This one did work with Roo, although it ran into bugs that forced me to switch over to Continue. It got through all the experiment stages in one session, although it was stuck on the Back button persistence issue for a long time.
It is good to have a mix of LLMs to query. I liked working with Tulu3 the most, but I think several LLMs have merit. In future experiments, I will probably drop phi4, qwen3, nemotron, and gemma3 because they did not feel good to work with.
My scientific process was not very consistent across iterations. In future experiments, I could spend more time in the Preparation stage to get a more consistent test, one where I give less input in each loop.