
Spring AI's Transcription API

Photo by Elijah Merrell on Unsplash

Some large language models (LLMs) can transcribe audio into text, and businesses are rapidly adopting this capability to reap productivity benefits.

We've seen glimpses of this technology in Zoom, Microsoft Teams, and other collaboration and communication tools, where call transcriptions can be generated automatically. The entertainment industry is also adopting it for movies, advertisements, and other productions.

Spring AI aims to provide a unified transcription API that integrates with LLM providers like OpenAI and Azure OpenAI. This article will explore how the Transcription API can be used to transcribe an audio file using OpenAI.

Transcription API Key Classes

We'll start by learning some important Spring AI Transcription API classes:

(Diagram: Transcription API key classes)

We can divide the components into two groups: one specific to the underlying LLM service provider and the other generic. The OpenAiAudioTranscriptionModel and AzureOpenAiAudioTranscriptionModel are client classes that help make API calls to the underlying provider-hosted transcription APIs. Similarly, the OpenAiAudioTranscriptionOptions and AzureOpenAiAudioTranscriptionOptions classes help set up the provider-specific transcription API parameters. The Spring framework auto-configures them by default if it finds their configurations in the application properties file.

However, we can also configure the client classes programmatically using AudioTranscriptionOptions implementations such as OpenAiAudioTranscriptionOptions and AzureOpenAiAudioTranscriptionOptions. These options set common properties such as the language of the input audio, the model name, and the temperature.

Further, we can encapsulate the provider-agnostic classes, such as Resource and AudioTranscriptionOptions, in an AudioTranscriptionPrompt object. We then pass the prompt object to Model#call() to invoke the LLM service. Finally, we receive the response in an AudioTranscriptionResponse object.
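The flow above can be sketched as follows. This is a minimal illustration, assuming `transcriptionModel` is an already-configured transcription client (for example, Spring's auto-configured OpenAiAudioTranscriptionModel bean) and the audio path is a placeholder:

```java
// Assumes `transcriptionModel` is an auto-configured
// OpenAiAudioTranscriptionModel bean; the audio path is a placeholder.
Resource audio = new ClassPathResource("/audio/song.mp3");

// Provider-specific options still satisfy the generic
// AudioTranscriptionOptions contract.
OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
  .model("whisper-1")
  .language("en")
  .build();

// Wrap the provider-agnostic resource and the options in a prompt ...
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audio, options);

// ... invoke the model, and unwrap the transcribed text from the response.
AudioTranscriptionResponse response = transcriptionModel.call(prompt);
String transcribedText = response.getResult().getOutput();
```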

Transcribe an Audio File

Now that we know the key classes, we're ready to develop a program that transcribes an audio file using OpenAI's transcription service.

First, we'll develop a program that doesn't require a customized transcription client, so we won't override the configurations in the application properties file. In the second example, we'll programmatically change the prompt configurations where runtime customization is necessary.

Prerequisites

As usual, the Spring Boot application must import Spring AI's OpenAI-specific starter library:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    <version>1.0.0-SNAPSHOT</version>
</dependency>

Generally, the online Spring Initializr tool is the least error-prone way to import these libraries.
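Since the version above is a snapshot, the build also needs Spring's snapshot repository in the POM. The ids and URL below follow Spring's documented defaults:

```xml
<!-- Required for -SNAPSHOT versions of Spring AI artifacts -->
<repository>
    <id>spring-snapshots</id>
    <name>Spring Snapshots</name>
    <url>https://repo.spring.io/snapshot</url>
    <releases>
        <enabled>false</enabled>
    </releases>
</repository>
```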

The next step is to sign up on the OpenAI platform and subscribe to a service plan to use its APIs. We must have an API key to invoke the LLM service. For our example, we'll use the Whisper Speech-to-Text model.

Finally, we'll use an mp3 file for our experiment with the transcription API:

(Audio clip: Twinkle, Twinkle, Little Star)

Client Configured from the Properties File

For OpenAI, the Transcription API's configuration properties are defined under the namespace spring.ai.openai.audio.transcription. However, we'll set only the few that are enough to auto-configure the client. We'll define them in the application-tr.properties file:

spring.ai.openai.api-key=sk-proj-xxxxxx

spring.ai.openai.audio.transcription.api-key=sk-proj-xxxxxx
spring.ai.openai.audio.transcription.options.model=whisper-1
spring.ai.openai.audio.transcription.options.temperature=0.2
spring.ai.openai.audio.transcription.options.language=en
spring.ai.openai.audio.transcription.options.response-format=text
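For comparison, the Azure OpenAI starter exposes an analogous namespace. The property names below follow Spring AI's Azure OpenAI conventions, but the endpoint and deployment name are placeholders; verify them against your Azure resource and the reference documentation:

```properties
spring.ai.azure.openai.api-key=xxxxxx
spring.ai.azure.openai.endpoint=https://your-resource.openai.azure.com/

spring.ai.azure.openai.audio.transcription.options.deployment-name=whisper
spring.ai.azure.openai.audio.transcription.options.language=en
spring.ai.azure.openai.audio.transcription.options.response-format=text
```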

We specified some important configurations such as:

  • The API key for invoking OpenAI's transcription service
  • The language of the input audio file
  • The OpenAI transcription model to invoke

Now, let's look at a Spring Boot test that invokes an OpenAI Transcription API call:

@SpringBootTest
@ActiveProfiles("tr")
public class TranscriptionAILiveTest {
    private final Logger logger = LoggerFactory.getLogger(TranscriptionAILiveTest.class);

    @Autowired
    private Model<AudioTranscriptionPrompt, AudioTranscriptionResponse> aiModel;

    @Test
    void whenOpenAIConfigsDefinedInPropFile_thenTranscribe() {
        assertInstanceOf(OpenAiAudioTranscriptionModel.class, aiModel);
        String audioFile = "/audio/song.mp3";
        Resource resource = new ClassPathResource(audioFile);
        OpenAiAudioTranscriptionModel openAiAudioTranscriptionModel
          = OpenAiAudioTranscriptionModel.class.cast(aiModel);
        String transcribedText = openAiAudioTranscriptionModel.call(resource);
        logger.info("The transcribed text: {}", transcribedText);
        assertTrue(transcribedText.toUpperCase().contains("LIKE A DIAMOND IN THE SKY"));
    }
}

The test class uses the tr profile. Hence, the Spring Boot application uses the configurations from the application-tr.properties file to auto-configure the OpenAiAudioTranscriptionModel bean. In the program, we autowired the Model bean. At the beginning of the test method, we verify that aiModel is of type OpenAiAudioTranscriptionModel. Later, we pass the audio file resource to the OpenAiAudioTranscriptionModel#call() method. Finally, we get the transcribed text:

The transcribed text: Twinkle Twinkle Little Star How I wonder what you are Up above the
world so high Like a diamond in the sky

Client Configured Programmatically

Real-world applications require customized AI transcription clients to deal with many dynamic scenarios. Hence, let's learn to configure the client at runtime to override the configurations set in the properties file:

@Test
void whenOpenAIConfigsDefinedProgrammatically_thenTranscribe() {
    OpenAiAudioTranscriptionModel openAiAudioTranscriptionModel
      = OpenAiAudioTranscriptionModel.class.cast(aiModel);

    String audioFile = "/audio/song.mp3";
    Resource resource = new ClassPathResource(audioFile);

    OpenAiAudioTranscriptionOptions transcriptionOptions 
      = OpenAiAudioTranscriptionOptions.builder()
      .language("en")
      .model("whisper-1")
      .temperature(0.5f)
      .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.VTT)
      .build();
    AudioTranscriptionPrompt audioTranscriptionPrompt
      = new AudioTranscriptionPrompt(resource, transcriptionOptions);

    AudioTranscriptionResponse audioTranscriptionResponse
      = openAiAudioTranscriptionModel.call(audioTranscriptionPrompt);

    String transcribedText = audioTranscriptionResponse.getResult().getOutput();
    logger.info("The transcribed text: {}", transcribedText);
    assertTrue(transcribedText.toUpperCase().contains("LIKE A DIAMOND IN THE SKY"));
}

The program builds the OpenAiAudioTranscriptionOptions object and configures the input language, model, temperature, and response format. Unlike the previous example, it sets the transcription format to VTT (Web Video Text Tracks) to display synchronized captions or subtitles with video content.

Later, it uses the audio resource and the transcription options to create the AudioTranscriptionPrompt object. Finally, the program passes the prompt object to OpenAiAudioTranscriptionModel#call() and receives the AudioTranscriptionResponse object.

The response object contains the VTT transcription:

The transcribed text: WEBVTT

00:00:00.000 --> 00:00:19.000
Twinkle, twinkle, little star. How I wonder what you are. Up above the world so high, Like
a diamond in the sky.
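Spring AI exposes OpenAI's documented response_format values through the OpenAiAudioApi.TranscriptResponseFormat enum. As a sketch, switching to SRT (SubRip subtitles) only means swapping the constant when building the options:

```java
// Swapping the transcript format only requires a different enum constant;
// the constants mirror OpenAI's documented response_format values:
// JSON, TEXT, SRT, VERBOSE_JSON, and VTT.
OpenAiAudioTranscriptionOptions srtOptions = OpenAiAudioTranscriptionOptions.builder()
  .model("whisper-1")
  .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.SRT)
  .build();
```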

Conclusion

The Audio Transcription API in Spring AI is an excellent feature for integrating applications with service providers' transcription APIs, such as OpenAI and Azure OpenAI. Unfortunately, the abstraction is not yet complete enough to switch service providers without code changes. In the future, as support for more service providers is added, Spring AI may introduce a common AudioTranscriptionModel interface.

For now, the existing integration works well, and we can watch for further upgrades.
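Until such an interface arrives, one way to limit provider lock-in is to depend on the generic Model type, as the tests above do, and confine any provider-specific options or casts to a single class. The TranscriptionService helper below is our own hypothetical sketch, not a Spring AI class:

```java
// Hypothetical helper that keeps provider-specific details in one place;
// TranscriptionService and its method name are our own invention, not part
// of Spring AI.
@Service
public class TranscriptionService {

    // Depend on the generic Model contract rather than a concrete client.
    private final Model<AudioTranscriptionPrompt, AudioTranscriptionResponse> model;

    public TranscriptionService(Model<AudioTranscriptionPrompt, AudioTranscriptionResponse> model) {
        this.model = model;
    }

    // Callers pass provider-agnostic arguments; only the injected bean and
    // the options instance are provider-specific.
    public String transcribe(Resource audio, AudioTranscriptionOptions options) {
        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audio, options);
        return model.call(prompt).getResult().getOutput();
    }
}
```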

Visit our GitHub repository to access the article's source code.
