In Practice: Turning an Idea into a Finished Project (A Case Study)

November 11, 2024 8 minutes read | 1501 words by Ruben Berenguel

I got some questions about how exactly I used Gemini when creating a project after publishing my previous post. This is the answer.

As I have mentioned in the past, this year I have started and finished more than 15 personal projects. This has been thanks to using Gemini to avoid getting stuck early, as I explained in Turning ‘Someday’ Ideas into ‘Today’ Projects with Gemini.

Just after publishing that post I started and finished a new project: goita. It is a Chrome extension to annotate screenshots from web pages. Below you can see a video of it in action (but likely you want to open it in a new page).

Like the others, this project was bootstrapped with the help of Gemini, and I used it to save time reading documentation or implementing annoying parts. I already knew all the parts that would be involved in the implementation, which meant that getting Gemini to do the right thing was going to be easier.

I will use this as an example of how I approach starting a project nowadays.

For goita, I had a conversation with Gemini involving around 40 related questions or refinements. Sometimes I would ask for A and I would get something close to A but not quite right, so I’d need to ask again in a different, more precise way.

Project goal: screenshot annotation extension

The goal of the project (the definition of done, if you will) is to have:

A Chrome extension that
Can take a screenshot of the current tab and
Annotate it with floating SVG which then
Exports as a data:base64 encoded image
With the URL of the annotated tab clickable and/or visible.

I start projects with Gemini exactly in the same way I start projects without Gemini.

What is my main unknown for this project?

In almost every project there is some question you do not know the answer to, or some particular area that may be more complicated. Going through the requirements above in order:

I have done 2 extensions already this year, I know how to do this.
I know there is an API for this, but have not used it.
I would not be able to implement it off the top of my head, but I know how to do this with no issue.
I have done similar things before and I know the limitations of URL length.
This is pretty much an implementation detail.
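The URL-length concern in the fourth point comes from how a base64 data URL grows: every 3 bytes of image become 4 characters. A small sketch of that arithmetic, where `dataUrlLength` is a hypothetical helper for illustration and not part of goita:

```javascript
// Hypothetical helper: estimates how many characters a base64 data URL needs
// for an image of `byteLength` bytes.
function dataUrlLength(byteLength, mimeType = "image/png") {
  const prefix = `data:${mimeType};base64,`;
  // Base64 encodes each group of 3 bytes as 4 characters, padding the last group.
  const payload = 4 * Math.ceil(byteLength / 3);
  return prefix.length + payload;
}

// A 300 KB screenshot already needs a URL of roughly 410,000 characters.
console.log(dataUrlLength(300 * 1024)); // 409622
```

How long a URL can get before something truncates it depends on the browser and on wherever the link is pasted, so this is only a way to get a feel for the sizes involved.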

Let’s start the project

So, the basic problem is getting a screenshot, which is exactly how I started my conversation with Gemini.

Give me the minimum amount of code to have a working Chrome extension that can take a screenshot of the current webpage and display it in a page (a page controlled by the extension)

Gemini provided manifest.json, popup.js and popup.html, but these were not quite right: they were not using the Chrome APIs I knew I needed to use.

This does not work (tab id is not provided). Also, there is a native Chrome screenshot capability for extensions I think.
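The native capability in question is `tabs.captureVisibleTab`. A minimal sketch of calling it, not the code Gemini produced, could look like the following. It assumes Manifest V3, where the call returns a promise, and `captureCurrentTab` is a hypothetical helper name; the `tabsApi` parameter stands in for `chrome.tabs`.

```javascript
// Sketch only: `tabsApi` stands in for `chrome.tabs`, passed in so the helper
// can be exercised outside an extension. `captureCurrentTab` is a hypothetical name.
function captureCurrentTab(tabsApi, format = "png") {
  // captureVisibleTab resolves to a data URL with the visible area of the
  // active tab in the current window. The extension needs either the
  // "activeTab" or the "<all_urls>" permission for this call to succeed.
  return tabsApi.captureVisibleTab({ format });
}

// Inside the extension (for example in the popup script) it could be used as:
//   const dataUrl = await captureCurrentTab(chrome.tabs);
//   document.querySelector("img").src = dataUrl;
```

Since no tab id is involved, this also sidesteps the missing tab id from the first attempt.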

The answer now used tabs.captureVisibleTab and successfully got a screenshot. From here we went into the back-and-forth: the conversation spanned a couple of days and around a hundred comments, code snippets, and clarifications, covering roughly these steps:

Sample code to draw a rectangle in SVG on click + mousemove
Dragging the rectangle
Fixes for SVG z-index and pointer events
Tweak how the screenshot triggers
Refine dragging
Add a drop shadow to the rectangle when selected, and do the filter-creation once.
How to add an invisible fill to them.
Ok, now draw arrows, drag them, cancel drawing them.
Add a text block, drag it.
Scratch that and let us use contenteditable divs instead.
Make dragging smoother.
Get a screenshot of the container, to be able to export the annotated image.
Add two buttons to do that, one for image and one for data URL.
Fix issues with extension permissions.
Add the source link.
Add keyboard shortcuts for the extension.
Tweak SVG rectangles with rounded corners.
Embed the image as an SVG-contained image instead of a div under an SVG overlay.
Paste images from the clipboard into the SVG canvas (and all things involved with that).
Clip the original screenshot with clip paths.
Allow dragging the clip path and the screenshot.
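The first item in the list, drawing a rectangle from a click plus mousemove, mostly comes down to normalizing two points, since an SVG rectangle cannot have a negative width or height. A minimal sketch with hypothetical helper names, not goita's actual code:

```javascript
// Hypothetical helper: turns the mousedown point and the current mouse point
// into attributes for an SVG <rect>, whichever direction the user drags in.
function rectFromPoints(start, current) {
  return {
    x: Math.min(start.x, current.x),
    y: Math.min(start.y, current.y),
    width: Math.abs(current.x - start.x),
    height: Math.abs(current.y - start.y),
  };
}

// Copies the computed geometry onto an existing <rect> element.
function applyRect(rectElement, geometry) {
  for (const [name, value] of Object.entries(geometry)) {
    rectElement.setAttribute(name, String(value));
  }
}

// Dragging up and to the left still yields a valid rectangle.
console.log(rectFromPoints({ x: 120, y: 80 }, { x: 40, y: 30 }));
// { x: 40, y: 30, width: 80, height: 50 }
```

In a mousemove handler, the start point would be whatever was stored on mousedown and the current point would come from the event.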

For instance, when I wanted to add a drop shadow to a selected rectangle, the conversation went like this:

Ok, can we add a subtle red drop shadow on the selected rectangle so it’s obvious it is selected?

[Gemini provides code for that]

Pressing escape should deselect any selected rectangle and remove any drop shadows

[Gemini provides a new keydown handler]

We shouldn’t keep adding drop shadow filters, one is enough. Maybe it can be added on canvas creation and applied where needed?

[Gemini adjusts the creation of the SVG to add the filter there]
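The end state of that exchange, one filter defined when the canvas is created and referenced wherever needed, can be sketched roughly as follows. The filter id, the shadow values and both function names are assumptions for illustration; the `doc` parameter defaults to the page's `document`.

```javascript
const SVG_NS = "http://www.w3.org/2000/svg";
const SHADOW_ID = "selected-shadow"; // hypothetical id

// Adds the filter definition to the SVG canvas; meant to run once, at creation.
function addShadowFilter(svg, doc = document) {
  const defs = doc.createElementNS(SVG_NS, "defs");
  const filter = doc.createElementNS(SVG_NS, "filter");
  filter.setAttribute("id", SHADOW_ID);
  const shadow = doc.createElementNS(SVG_NS, "feDropShadow");
  shadow.setAttribute("dx", "0");
  shadow.setAttribute("dy", "0");
  shadow.setAttribute("stdDeviation", "3");
  shadow.setAttribute("flood-color", "red");
  shadow.setAttribute("flood-opacity", "0.6");
  filter.appendChild(shadow);
  defs.appendChild(filter);
  svg.appendChild(defs);
}

// Selecting or deselecting a shape only toggles a reference to that filter.
function setSelected(element, selected) {
  if (selected) {
    element.setAttribute("filter", `url(#${SHADOW_ID})`);
  } else {
    element.removeAttribute("filter");
  }
}
```

With this split, the Escape handler only has to call `setSelected(shape, false)` on whatever is currently selected.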

At some point (between adding text and images) I started cleaning up the code: Gemini was good at getting 80% of the work done, but there was a lot of repetition.

Another pain point was the constant use of globals. Creating an arrow or a rectangle would use a rect or arrow global that was then reused for dragging and selection. While this is manageable with rectangles and arrows, it becomes unwieldy when we have 5 or 6 different entity types, as is the case here. Since it was getting out of hand, I started reusing some state and adding TODOs to fix it better later.

With the code complexity more or less under control, I pushed to finish it. I know that for me, perfect is the enemy of the good, and getting to a good working project makes me more likely to polish it later into a better one.

The interaction with Gemini above got me to version 0.1 or something like that: all code in one file, with a big global state machine that would be hard to extend. But it was working and usable.

Refactoring with AI assistance

For cleaning up, I used Gemini for the annoying parts once I knew what I wanted. Reading the code, the most obvious improvement in readability and maintainability was going to be moving each “object” to its own class, keeping most of its state and properties internal. So, I asked for that:

Given the current state of the code I provided a bit earlier, how could we encapsulate an Arrow object (javascript object) so that I can start refactoring the code in a way that removes most logic of detecting state and keeping global coordinates from the mousemove and mousedown event handlers?

The answer to this request was accurate. One thing missing was adding identifiers to the generated objects: Gemini kept missing the point of avoiding repeated lookups, and kept using lists instead of maps. That went against the design I had in mind, and no matter how many times I told it, it kept writing the code differently. I also rewrote part of its proposed code to handle dragging more consistently: my goal was to have the same methods in all objects.
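A minimal sketch of the kind of design described here, with an identifier on each object and a map for lookups, might look like this. `ArrowShape` and `ShapeRegistry` are hypothetical names, and the real goita classes may well be organized differently.

```javascript
// Each arrow keeps its own state and carries an identifier.
class ArrowShape {
  constructor(id, start, end) {
    this.id = id;
    this.start = { ...start };
    this.end = { ...end };
  }

  // Moves both endpoints by the same offset, as a drag would.
  moveBy(dx, dy) {
    this.start = { x: this.start.x + dx, y: this.start.y + dy };
    this.end = { x: this.end.x + dx, y: this.end.y + dy };
  }
}

// Shapes are stored in a Map keyed by id, so finding one is a direct lookup
// instead of a scan through a list.
class ShapeRegistry {
  constructor() {
    this.shapes = new Map();
    this.nextId = 0;
  }

  addArrow(start, end) {
    const arrow = new ArrowShape(`arrow-${this.nextId++}`, start, end);
    this.shapes.set(arrow.id, arrow);
    return arrow;
  }

  get(id) {
    return this.shapes.get(id);
  }
}

const registry = new ShapeRegistry();
const arrow = registry.addArrow({ x: 0, y: 0 }, { x: 10, y: 5 });
registry.get(arrow.id).moveBy(5, 5);
console.log(arrow.end); // { x: 15, y: 10 }
```

If the id is also stored on the SVG element (for example in a data attribute), an event handler can go from the clicked element straight to its object.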

I kept refactoring in a loop together with Gemini, moving each block out into its own class, tweaking the drag and click methods for consistency and smoothness, and removing unnecessary code from the main loop.

For example, one thing I asked during this process:

I’d like you to rewrite updatePosition for Rect to actually be updateShape, and take in the mouse event directly. This way I can have a uniform interface across arrows and rects [block of code]

and after getting a perfectly pasteable answer:

Awesome. Can you do the same for Arrow?

I was trading keystrokes, English for code. These changes were straightforward, even mechanical. I could type them, or I could just ask for the result, and get it.
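To make the uniform interface concrete, here is a sketch of what `updateShape` taking the mouse event directly could look like for both classes. The fields and the use of `clientX`/`clientY` are assumptions; real code would likely convert these viewport coordinates into SVG coordinates first.

```javascript
class Rect {
  constructor(x, y) {
    this.origin = { x, y }; // where the drag started
    this.x = x;
    this.y = y;
    this.width = 0;
    this.height = 0;
  }

  // Resizes the rectangle so it spans from the origin to the mouse position.
  updateShape(event) {
    this.x = Math.min(this.origin.x, event.clientX);
    this.y = Math.min(this.origin.y, event.clientY);
    this.width = Math.abs(event.clientX - this.origin.x);
    this.height = Math.abs(event.clientY - this.origin.y);
  }
}

class Arrow {
  constructor(x, y) {
    this.x1 = x;
    this.y1 = y;
    this.x2 = x;
    this.y2 = y;
  }

  // Moves the arrow head to the mouse position; the tail stays put.
  updateShape(event) {
    this.x2 = event.clientX;
    this.y2 = event.clientY;
  }
}

// The handler no longer needs to know which kind of shape is being drawn.
function onMouseMove(activeShape, event) {
  if (activeShape) {
    activeShape.updateShape(event);
  }
}
```

The payoff is in `onMouseMove`: adding a new entity type means adding a class with the same method, not another branch in the handler.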

Finishing strokes

As I was getting closer to the end, the questions became less about how to implement something, and more about JavaScript or browser handling. I would want to do something, like handle the blur event of the text editing div in a particular way, and I would ask for how to do that in the abstract.

I did not want Gemini to provide code any more: it was no longer good enough to be pasted without editing.

Final steps and conclusion

Once I had the full refactor, I added in-browser tests with Mocha and Chai. For this, there were no questions aside from the typical brain farts I always get, like how to iterate through all keys in a JavaScript Map.
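For the record, iterating through the keys of a Map is short once you remember it. The shape ids below are made up for the example:

```javascript
const shapes = new Map([
  ["rect-0", { kind: "rect" }],
  ["arrow-0", { kind: "arrow" }],
]);

// Map.prototype.keys() returns an iterator that follows insertion order.
for (const id of shapes.keys()) {
  console.log(id);
}

// Spreading the iterator gives a plain array when array methods are needed.
const ids = [...shapes.keys()];
console.log(ids); // [ 'rect-0', 'arrow-0' ]
```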

This project was the perfect example for me, because I finished it in record time (4 days), only because I had Gemini assist with

Getting over the initial unknowns.
Working faster through boring parts and refactors.
Avoiding some of the documentation.

I hope this brings some light to what I explained in Turning ‘Someday’ Ideas into ‘Today’ Projects with Gemini from the practical side of a specific project.
