
Intro

Yes, it’s another OpenAI blog post (although it’s my first one, so I’m definitely late to the party).

In this post, we’ll explore how to utilise an OpenAI endpoint from a Jupyter workspace within Domo. Our objective will be to translate data in a dataset from English into multiple languages, and then to give users the option to switch the language of the content they see via a filter at the dashboard level.

We’ll be using the following components to do this:

  • Domo Stories Layouts
  • Domo Variables
  • Domo Dataset Views
  • Domo Jupyter Workspaces
  • OpenAI

To give you an idea, here’s what the end result will look like: a filter at the top of the page to select a language, which changes the language of the departments in the bar chart series labels and in the data table. You can try it out below…

Process & Prerequisites

The general flow will be to take our dataset, which contains a column of departments in English, and load it into a Jupyter workspace. From there, we will extract the unique list of departments, pass them to OpenAI, and receive translations in three languages: German, French, and Spanish. We will then use Dataset Views to join the newly translated columns back onto our original dataset before creating the dashboard.

In the dashboard, we’ll utilise variables to allow the user to select a language, which will replace the relevant department column with the chosen language.

Before we get started, you’ll need an account with OpenAI; this is free and comes with some credits to use against their API. This example cost me less than 10 cents of the $18.00 free developer credits I have with my account.

Once set up, grab your API key from this page.
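A quick side note: in the notebook below I paste the key straight into the code for simplicity, but a slightly safer habit is to read it from an environment variable instead (this is my own suggestion, not part of the original workbook; it assumes you’ve set OPENAI_API_KEY in your environment beforehand):

```python
import os

# Read the key from an environment variable instead of hard-coding it
# (assumes you export OPENAI_API_KEY before launching the workspace)
api_key = os.environ.get("OPENAI_API_KEY", "not-set")
```

You’d then assign this to `openai.api_key` rather than pasting the token into the cell.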

Starting Dataset

Here’s a view of our starting dataset with the department column we’ll be translating…

Jupyter

I’ll create a new Jupyter workspace in Domo with the kernel set to Python. I’ve added my dataset as an input and specified an output to contain our translated departments… (My standard disclaimer applies here: I’m not a data scientist, so it’s very possible there’s a better way to write this code.)

If you’d like to just clone the whole workbook into your own Jupyter workspace, you can do so from my GitHub here.

# Import libraries
import domojupyter as domo
import os
import json
import pandas as pd
!pip install openai
import openai
# Load the dataset (swap this for your own dataset)
input_df = domo.read_dataframe('Sample Sales for Translation', query='SELECT * FROM table')

# Select the department column, rename it to 'text' and remove duplicates to give a distinct list
df = input_df[['department']]
df = df.rename(columns={'department': 'text'})
df = df.drop_duplicates()
df.tail(10)
# Swap in your OpenAI token below, then define a function that will construct a prompt
openai.api_key = "your_openAI_token_here"

def translate_text(text, target_language):
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"Translate '{text}' to {target_language}:",
        max_tokens=1024,
        n=1,
        stop=None,
        temperature=0.7,
    )
    translation = response.choices[0].text.strip()
    return translation
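One thing the workbook doesn’t handle is the API occasionally timing out or rate-limiting partway through the loop. A small, generic retry helper could wrap each call; this is my own sketch (`with_retries` is a hypothetical name, not part of the openai library), demonstrated here with a stand-in flaky function rather than a real API call:

```python
import time

def with_retries(fn, attempts=3, delay=0.1):
    """Call fn(), retrying on any exception with a fixed pause between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# Example with a hypothetical flaky call that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "Ventes"

result = with_retries(flaky)
print(result)  # prints "Ventes" on the third attempt
```

In the translation loop you would then call something like `with_retries(lambda: translate_text(text, lang))` instead of calling the function directly.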
# Specify the languages to translate to & create a new dataframe to store our results
target_languages = ['fr', 'es', 'de'] # French, Spanish, German

translated_df = pd.DataFrame(columns=['text'] + [f'translated_{lang}' for lang in target_languages])
# Loop through each department in our list and, for each of the three languages,
# call the function we created earlier to construct the prompt and translate.
for i, row in df.iterrows():
    text = row['text']
    translation_row = {'text': text}
    for lang in target_languages:
        translation = translate_text(text, lang)
        translation_row[f'translated_{lang}'] = translation
    translated_df = pd.concat([translated_df, pd.DataFrame(translation_row, index=[0])], ignore_index=True)
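As an aside, concatenating to a DataFrame on every iteration re-copies the frame each time; collecting plain dicts in a list and building the frame once is usually faster on larger lists. A sketch of that variant, with a stand-in translator where the real `translate_text` call would go (`fake_translate` and the toy department list are mine, for illustration only):

```python
import pandas as pd

target_languages = ['fr', 'es', 'de']

def fake_translate(text, lang):
    # Stand-in for translate_text(text, lang) so this runs without an API key
    return f"{text}-{lang}"

departments = ['Sales', 'Finance']
rows = []
for text in departments:
    row = {'text': text}
    for lang in target_languages:
        row[f'translated_{lang}'] = fake_translate(text, lang)
    rows.append(row)

# Build the frame once from the collected rows
translated_df = pd.DataFrame(rows)
print(list(translated_df.columns))
# ['text', 'translated_fr', 'translated_es', 'translated_de']
```

For a handful of departments the difference is negligible, so the loop above is perfectly fine as-is.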
# Print the translated dataframe
print(translated_df)
# Write our resulting dataframe out to a new Domo dataset
domo.write_dataframe(translated_df, 'Translated Departments')

After the notebook runs, I have an output dataset with one column per language for each department. We’re ready to join this back onto our original dataset. At this point you could do the join in Jupyter, in Magic ETL or in a Dataset View; it’s really up to you and what your own use case dictates.
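If you did choose the Jupyter route, the join is a single pandas merge. A sketch with toy rows (the column names follow my datasets; the German values are illustrative):

```python
import pandas as pd

# Original data plus the translation output (toy rows for illustration)
sales = pd.DataFrame({'department': ['Sales', 'Finance'], 'amount': [100, 200]})
translated = pd.DataFrame({
    'text': ['Sales', 'Finance'],
    'translated_de': ['Vertrieb', 'Finanzen'],
})

# Left-join the translations onto the original rows, then tidy up the columns
joined = sales.merge(translated, left_on='department', right_on='text', how='left')
joined = joined.drop(columns='text').rename(columns={'translated_de': 'department_de'})
print(joined)
```

A left join keeps every original row even if a translation is somehow missing, which is the safer default here.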


To keep it simple, I’m going to join using a Dataset View, where I can rename my new columns with a suffix of the language code, so I have department, department_es, department_de and so on.

Visualisation

Next we’ll build out some charts and a variable.

Create a new chart. I’m using a bar chart, where I want the series to be the department name, which can be changed by the variable.

I’ll create a new variable called ‘Language’, with English, German, Spanish and French as its options:


Then a beast mode to handle the switch, with the following code, where `Language` is the variable we just created.

case when `Language` in ('English') then `department`
when `Language` in ('German') then `department_de`
when `Language` in ('Spanish') then `department_es`
when `Language` in ('French') then `department_fr` end

Finally, we can add the variable control to the page, and we’re done!

Closing Thoughts

This is just one example of an OpenAI integration for language translation, and it was surprisingly quick and easy to set up and to integrate with Domo. OpenAI provides a range of sample and boilerplate code that you can lift and drop into your own solutions and workflows to quickly get going with some testing and exploration of the technology. I think there are many more uses for this in the data and analytics space that we’ll see cropping up as the technology continues to develop. We’re already seeing integrations with NLP for data analysis, as well as code generation and AI as a pair programmer (see GitHub Copilot) for app development.