Recently, I was presented with an intriguing challenge by a client. They were using a tablet-based point of sale (POS) system to sell memberships. The process involved a salesperson entering the buyer’s information and then having the buyer provide their signature after verifying all the details. However, once the contract was generated, it frequently failed validation because the details entered by the sales rep didn’t match the buyer’s license. This discrepancy wasted a significant amount of time and effort.
While investigating this issue, I discovered that the app was already capturing and storing an image of the buyer’s driver’s license. This sparked an idea: why not use Optical Character Recognition (OCR) to convert the driver’s license image into text and pre-fill the form? However, just using OCR wasn’t enough; the extracted information needed to be structured properly in JSON format. It became clear that I needed a Large Language Model (LLM) to achieve this. Eventually, I implemented a solution using Google Vision AI combined with GPT-4o, and it performed quite well.
Although the initial solution was effective, I knew it could be improved further. That’s when I turned to DSPY, a framework designed to program (not just prompt) language models. I decided to fully integrate GPT Vision with DSPY, allowing me to read the image and receive structured JSON with a single API call. Since DSPY didn’t have a built-in abstract LM class to support GPT Vision, I had to do some digging online and on GitHub. Here’s what I found:
The above implementation worked beautifully, especially when combined with my signature:
So far, I am quite satisfied with this implementation. I’m not sure what I’ll add next, but I’m excited about the possibilities!
As a thorough software architect, I bring precision and passion to every software project I tackle. My goal is to always produce innovative and high-quality software that pushes the boundaries of what's possible. I have a love for experimenting with new programming languages, and you can catch me blogging about my experience and insights in the software development world. Join me in my journey as I explore the ever-evolving world of technology and programming.