
Spring AI's Transcription API

Photo by Elijah Merrell on Unsplash

Some large language models (LLMs) can transcribe audio into text, and businesses are rapidly adopting this capability to reap productivity benefits.

We've seen glimpses of this technology in Zoom, Microsoft Teams, and other collaboration and communication tools, where call transcriptions can be generated automatically. The entertainment industry is also adopting it for movies, advertisements, and other productions.

Spring AI aims to provide a unified transcription API that integrates with LLM providers like OpenAI and Azure OpenAI. This article will explore how the Transcription API can be used to transcribe an audio file using OpenAI.

Transcription API Key Classes

We'll start by learning some important Spring AI Transcription API classes:

(Diagram: Transcription API key classes)

We can divide the components into two groups: one specific to the underlying LLM service provider and the other generic. The OpenAiAudioTranscriptionModel and AzureOpenAiAudioTranscriptionModel are client classes that help make API calls to the underlying provider-hosted transcription APIs. Similarly, the OpenAiAudioTranscriptionOptions and AzureOpenAiAudioTranscriptionOptions classes help set up the provider-specific transcription API parameters. The Spring framework auto-configures them by default if it finds their configurations in the application properties file.

However, we can also configure the client classes programmatically using AudioTranscriptionOptions implementations such as OpenAiAudioTranscriptionOptions and AzureOpenAiAudioTranscriptionOptions. These options set common properties such as the language of the input audio, the model name, and the temperature.

Further, we can encapsulate the provider-agnostic classes, such as Resource and AudioTranscriptionOptions, in an AudioTranscriptionPrompt object. We then pass the prompt object to Model#call() to invoke the LLM service. Finally, we receive the response in an AudioTranscriptionResponse object.
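The flow above can be sketched as follows. This is a minimal illustration, assuming `transcriptionModel` is an already-configured transcription client (for example, Spring's auto-configured OpenAiAudioTranscriptionModel bean) and the audio path is a placeholder:

```java
// Assumes `transcriptionModel` is an auto-configured
// OpenAiAudioTranscriptionModel bean; the audio path is a placeholder.
Resource audio = new ClassPathResource("/audio/song.mp3");

// Provider-specific options still satisfy the generic
// AudioTranscriptionOptions contract.
OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
  .model("whisper-1")
  .language("en")
  .build();

// Wrap the provider-agnostic resource and the options in a prompt ...
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audio, options);

// ... invoke the model, and unwrap the transcribed text from the response.
AudioTranscriptionResponse response = transcriptionModel.call(prompt);
String transcribedText = response.getResult().getOutput();
```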

Transcribe an Audio File

Now that we know the key classes, we're ready to develop a program that transcribes an audio file using OpenAI's transcription service.

First, we'll develop a program that doesn't require a customized transcription client, so we won't override the configurations in the application properties file. In the second example, we'll programmatically change the prompt configurations where runtime customization is necessary.

Prerequisites

As usual, the Spring Boot application must import Spring AI's OpenAI-specific starter library:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    <version>1.0.0-SNAPSHOT</version>
</dependency>

Generally, the online Spring Initializr tool is the least error-prone way to import these libraries.
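Since the version above is a snapshot, the build also needs Spring's snapshot repository in the POM. The ids and URL below follow Spring's documented defaults:

```xml
<!-- Required for -SNAPSHOT versions of Spring AI artifacts -->
<repository>
    <id>spring-snapshots</id>
    <name>Spring Snapshots</name>
    <url>https://repo.spring.io/snapshot</url>
    <releases>
        <enabled>false</enabled>
    </releases>
</repository>
```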

The next step is to sign up on the OpenAI platform and subscribe to a service plan to use its APIs. We must have an API key to invoke the LLM service. For our example, we'll use the Whisper Speech-to-Text model.

Finally, we'll use an mp3 file for our experiment with the transcription API:

(Audio clip: Twinkle, Twinkle, Little Star)

Client Configured from the Properties File

For OpenAI, the Transcription API's configuration properties are defined under the namespace spring.ai.openai.audio.transcription. However, we'll set only the few that are enough to auto-configure the client. We'll define them in the application-tr.properties file:

spring.ai.openai.api-key=sk-proj-xxxxxx

spring.ai.openai.audio.transcription.api-key=sk-proj-xxxxxx
spring.ai.openai.audio.transcription.options.model=whisper-1
spring.ai.openai.audio.transcription.options.temperature=0.2
spring.ai.openai.audio.transcription.options.language=en
spring.ai.openai.audio.transcription.options.response-format=text
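For comparison, the Azure OpenAI starter exposes an analogous namespace. The property names below follow Spring AI's Azure OpenAI conventions, but the endpoint and deployment name are placeholders; verify them against your Azure resource and the reference documentation:

```properties
spring.ai.azure.openai.api-key=xxxxxx
spring.ai.azure.openai.endpoint=https://your-resource.openai.azure.com/

spring.ai.azure.openai.audio.transcription.options.deployment-name=whisper
spring.ai.azure.openai.audio.transcription.options.language=en
spring.ai.azure.openai.audio.transcription.options.response-format=text
```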

We specified some important configurations such as:

  • The API key for invoking OpenAI's transcription service
  • The language of the input audio file
  • The OpenAI transcription model to invoke

Now, let's look at a Spring Boot test that invokes an OpenAI Transcription API call:

@SpringBootTest
@ActiveProfiles("tr")
public class TranscriptionAILiveTest {
    private final Logger logger = LoggerFactory.getLogger(TranscriptionAILiveTest.class);

    @Autowired
    private Model<AudioTranscriptionPrompt, AudioTranscriptionResponse> aiModel;

    @Test
    void whenOpenAIConfigsDefinedInPropFile_thenTranscribe() {
        assertInstanceOf(OpenAiAudioTranscriptionModel.class, aiModel);
        String audioFile = "/audio/song.mp3";
        Resource resource = new ClassPathResource(audioFile);
        OpenAiAudioTranscriptionModel openAiAudioTranscriptionModel
          = OpenAiAudioTranscriptionModel.class.cast(aiModel);
        String transcribedText = openAiAudioTranscriptionModel.call(resource);
        logger.info("The transcribed text: {}", transcribedText);
        assertTrue(transcribedText.toUpperCase().contains("LIKE A DIAMOND IN THE SKY"));
    }
}

The test class uses the tr profile. Hence, the Spring Boot application uses the configurations from the application-tr.properties file to auto-configure the OpenAiAudioTranscriptionModel bean. In the program, we autowired the Model bean. At the beginning of the test method, we verify that aiModel is of type OpenAiAudioTranscriptionModel. Later, we pass the audio file resource to the OpenAiAudioTranscriptionModel#call() method. Finally, we get the transcribed text:

The transcribed text: Twinkle Twinkle Little Star How I wonder what you are Up above the
world so high Like a diamond in the sky

Client Configured Programmatically

Real-world applications require customized AI transcription clients to deal with many dynamic scenarios. Hence, let's learn to configure the client at runtime to override the configurations set in the properties file:

@Test
void whenOpenAIConfigsDefinedProgrammatically_thenTranscribe() {
    OpenAiAudioTranscriptionModel openAiAudioTranscriptionModel
      = OpenAiAudioTranscriptionModel.class.cast(aiModel);

    String audioFile = "/audio/song.mp3";
    Resource resource = new ClassPathResource(audioFile);

    OpenAiAudioTranscriptionOptions transcriptionOptions 
      = OpenAiAudioTranscriptionOptions.builder()
      .language("en")
      .model("whisper-1")
      .temperature(0.5f)
      .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.VTT)
      .build();
    AudioTranscriptionPrompt audioTranscriptionPrompt
      = new AudioTranscriptionPrompt(resource, transcriptionOptions);

    AudioTranscriptionResponse audioTranscriptionResponse
      = openAiAudioTranscriptionModel.call(audioTranscriptionPrompt);

    String transcribedText = audioTranscriptionResponse.getResult().getOutput();
    logger.info("The transcribed text: {}", transcribedText);
    assertTrue(transcribedText.toUpperCase().contains("LIKE A DIAMOND IN THE SKY"));
}

The program builds the OpenAiAudioTranscriptionOptions object and configures the input language, model, temperature, and response format. Unlike the previous example, it sets the transcription format to VTT (Web Video Text Tracks) to display synchronized captions or subtitles with video content.

Later, it uses the audio resource and the transcription options to create the AudioTranscriptionPrompt object. Finally, the program passes the prompt object to OpenAiAudioTranscriptionModel#call() and receives the AudioTranscriptionResponse object.

The response object contains the VTT transcription:

The transcribed text: WEBVTT

00:00:00.000 --> 00:00:19.000
Twinkle, twinkle, little star. How I wonder what you are. Up above the world so high, Like
a diamond in the sky.
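Spring AI exposes OpenAI's documented response_format values through the OpenAiAudioApi.TranscriptResponseFormat enum. As a sketch, switching to SRT (SubRip subtitles) only means swapping the constant when building the options:

```java
// Swapping the transcript format only requires a different enum constant;
// the constants mirror OpenAI's documented response_format values:
// JSON, TEXT, SRT, VERBOSE_JSON, and VTT.
OpenAiAudioTranscriptionOptions srtOptions = OpenAiAudioTranscriptionOptions.builder()
  .model("whisper-1")
  .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.SRT)
  .build();
```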

Conclusion

The Audio Transcription API in Spring AI is an excellent feature for integrating applications with service providers' transcription APIs, such as OpenAI and Azure OpenAI. Unfortunately, the abstraction is not yet complete enough to switch service providers without code changes. In the future, as support for more service providers is added, Spring AI may introduce a common AudioTranscriptionModel interface.

For now, the existing integration works well, and we can watch for further upgrades.
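Until such an interface arrives, one way to limit provider lock-in is to depend on the generic Model type, as the tests above do, and confine any provider-specific options or casts to a single class. The TranscriptionService helper below is our own hypothetical sketch, not a Spring AI class:

```java
// Hypothetical helper that keeps provider-specific details in one place;
// TranscriptionService and its method name are our own invention, not part
// of Spring AI.
@Service
public class TranscriptionService {

    // Depend on the generic Model contract rather than a concrete client.
    private final Model<AudioTranscriptionPrompt, AudioTranscriptionResponse> model;

    public TranscriptionService(Model<AudioTranscriptionPrompt, AudioTranscriptionResponse> model) {
        this.model = model;
    }

    // Callers pass provider-agnostic arguments; only the injected bean and
    // the options instance are provider-specific.
    public String transcribe(Resource audio, AudioTranscriptionOptions options) {
        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audio, options);
        return model.call(prompt).getResult().getOutput();
    }
}
```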

Visit our GitHub repository to access the article's source code.
