Rakuten’s Open LLM Tops Performance Charts in Japanese

Japan’s open-source AI community is taking a major leap forward, thanks to the groundbreaking efforts of Rakuten’s AI engineers and scientists.

Last week, Rakuten’s AI team unveiled a suite of large language models (LLMs) with exceptional performance in Japanese*1. The seven-billion-parameter models are the result of continued training on top of Mistral-7B, an open-source model with high performance in English.

The Rakuten team adapted Mistral-7B to improve its performance in Japanese while retaining its performance in English, and has now released the resulting models for the open-source community to use.

The team, led by Rakuten Group Chief Data Officer Ting Cai, has released a total of three models, all available for commercial use.

The first is the foundation model – it holds the knowledge but hasn’t been fine-tuned in any way, so it can be adapted to serve a wide range of purposes. The second is the instruct model, which users can prompt to generate specific outputs, and the third is the chat model, which is fine-tuned on conversational data to act as a chatbot.
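For readers who want to try one of the variants, a minimal sketch of loading and prompting it with the Hugging Face transformers library might look like the following. The repository ID and prompt here are illustrative assumptions; the official Rakuten Group Hugging Face page lists the actual model names.

```python
# Minimal sketch using the Hugging Face transformers library.
# The repository ID below is an assumed placeholder; check the official
# Rakuten Group Hugging Face page for the actual model names.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Rakuten/RakutenAI-7B-instruct"  # assumed ID; foundation and chat variants follow the same pattern

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "日本の首都はどこですか？"  # "What is the capital of Japan?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```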

High performance in Japanese

The headline feature of the new models is their exceptional Japanese performance.

“Most LLMs in the public domain are focused on English as the primary language – especially the open ones,” explains Lee Xiong, an AI expert and member of Cai’s team. “Some Japanese data definitely leaks in, and most models can do basic Japanese, but Japanese is not the main focus. They’re meant to address the mass market, which still is English.”

Cai hopes his team’s work can unlock new possibilities for Japan’s AI community.

“We’ve taken an open model which was trained for English and carried out further training for Japanese, so that the Japanese community can also reap the benefits of open LLM research.”

Curating the high-quality training data necessary for this was no easy task. “The main secret sauce always boils down to data cleaning – how much effort do you spend on cleaning the data?” Cai reveals. “We also improved our tokenizer to make it more efficient for Japanese.”

LLMs use tokenizers to break words or phrases into tokens, which are then processed by the model. More tokens mean more processing and a higher cost.

“Say you give it one sentence and the model breaks it down into five tokens, so you pay for five tokens. Our model might break it down into, say, three tokens.”

Example of how Japanese words can be broken into fewer tokens.
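A rough way to see this difference in practice is to count the tokens that two tokenizers produce for the same Japanese sentence. The sketch below uses the base Mistral-7B tokenizer and, as an assumption, a Japanese-extended tokenizer under a placeholder repository ID; actual counts depend on the tokenizers and the text.

```python
# Illustrative sketch: compare token counts for the same Japanese sentence
# under two tokenizers. The second repository ID is an assumed placeholder
# for a tokenizer with an extended Japanese vocabulary.
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
extended_tok = AutoTokenizer.from_pretrained("Rakuten/RakutenAI-7B")  # assumed ID

sentence = "楽天は日本語に強い大規模言語モデルを公開しました。"

print("base tokenizer:    ", len(base_tok.tokenize(sentence)), "tokens")
print("extended tokenizer:", len(extended_tok.tokenize(sentence)), "tokens")
```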

When using English-based models, however, single Japanese characters are often fragmented into multiple tokens.

“Ideally, that shouldn’t happen, because that means for Japanese text you will be paying a higher cost,” Cai says. “So we extended the tokenizer. Now the model’s vocabulary is higher – it knows more Japanese tokens, so it will be breaking Japanese characters into fewer, larger chunks.”
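As a general illustration of the technique – not necessarily the team’s exact procedure – extending a tokenizer’s vocabulary with the transformers API follows a pattern like this: add new Japanese tokens, then resize the model’s embedding matrix so the new tokens can be learned during continued training.

```python
# Generic sketch of vocabulary extension (not necessarily the exact procedure
# used by the Rakuten team): add new Japanese tokens to an existing tokenizer
# and resize the model's embedding matrix to match.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Hypothetical list of whole-word Japanese tokens, e.g. learned from a Japanese corpus.
new_japanese_tokens = ["楽天", "東京", "経済", "人工知能"]
num_added = tokenizer.add_tokens(new_japanese_tokens)

# The new embedding rows start out randomly initialized and must be learned
# during continued pretraining on Japanese text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```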

In-house engineering for speedy, scalable development

“Significant effort goes into the engineering to build these models,” Cai reveals.

Graphics processing units (GPUs) are often used for AI workloads like training LLMs. But a single chip has only so much processing power; to accelerate training, multiple GPUs are linked in a cluster.

“We have to scale the training to multiple nodes. To do that reliably without the process crashing, we have to build the right infrastructure.”

But scaling the clusters is not a simple matter of adding more GPUs.

“Once you add more to the cluster, there is a communication and synchronization overhead, because the same model is sitting in each of the cards and the parameters need to be in sync,” Cai explains.
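The general pattern behind that synchronization can be sketched with PyTorch’s DistributedDataParallel. This is a generic illustration, not Rakuten’s actual training stack: each process holds a full copy of the model, and gradients are all-reduced across GPUs after every backward pass.

```python
# Generic multi-GPU data-parallel sketch (PyTorch DistributedDataParallel),
# not a description of Rakuten's actual training infrastructure.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                 # one process per GPU, possibly across nodes
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()      # tiny stand-in for a 7B-parameter model
    model = DDP(model, device_ids=[local_rank])     # keeps parameters in sync across processes
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for _ in range(10):                             # toy training loop
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()                             # gradients are all-reduced here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with e.g.: torchrun --nnodes=N --nproc-per-node=GPUS train.py
```

Launched with torchrun across several nodes, the same script scales from one GPU to a multi-node cluster – which is exactly where the communication overhead Cai describes becomes the bottleneck.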

The multi-node cluster was built entirely in-house.

“Our engineering team helped us build a cluster to train these models in a scalable fashion. Otherwise, instead of two months, training might have taken eight or ten.”

Rakuten’s contribution to the open source community

“Our release is the best performing open LLM in Japanese right now, in terms of certain evaluation benchmarks,” Cai boasts. “We were able to release models that are better in Japanese, while retaining performance in English.”

Rakuten’s LLM performed exceptionally well compared to other open Japanese LLMs.

The achievement emerged in part thanks to a pivot from a completely different foundation model.

“Previously we were using a different model to Mistral. We built up the knowhow of how to train, how to improve, and we were already on top of the leaderboard,” Cai remarks. “But then this improved Mistral model came out. We thought, if we swap these models, using the same data and the same training, could we build something better?”

Cai hopes others will build upon Rakuten’s open work, just as his team built upon Mistral-7B.

“We have high hopes for what people in the Japanese community can do with it. If someone wants to continue improving it further, they can take our foundation model and make a newer one,” Cai says. “Being the highest performing Japanese model, we expect a lot of people to pick this up and build upon it.”

The possibilities are endless for how the community might leverage Rakuten’s models.

“It could be used for pretty much any natural language generation task,” Cai conjectures. “Or one could use it for simple classification. If someone just wants to use our chat model as is, they can do that too.”
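As a hypothetical example of the classification use case, one could simply prompt the instruct model to label the sentiment of a review; the model ID and prompt format below are assumptions, not a documented recipe.

```python
# Hypothetical sketch: prompting the instruct model for a simple
# sentiment-classification task. Model ID and prompt wording are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="Rakuten/RakutenAI-7B-instruct", device_map="auto")

prompt = (
    "次のレビューの感情をポジティブかネガティブで答えてください。\n"   # "Label the review's sentiment as positive or negative."
    "レビュー: この商品は期待以上でした。\n"                          # "Review: This product exceeded my expectations."
    "感情:"                                                           # "Sentiment:"
)
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```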

What’s next for Rakuten AI?

The Rakuten team now has a single mission to rally around.

“That’s the exciting part – we don’t have to worry about specific problems if the base model is good enough. We can just work on the big model and it can solve most of the problems,” Cai says. “Previously, the classical approach in NLP (natural language processing) was to have everyone working on one problem at a time. Now, one team can work on pretty much all the problems by just working on the big model.”

As they continue to improve the model, Cai and the team are broadening their scope to explore potential applications.

“We are determined to constantly improve. We are also focusing on other aspects of the models, like safety and alignment,” Cai says. “If we want to integrate these things into Rakuten services, we cannot compromise on safety.”

Contributing to the Rakuten Ecosystem is one major goal.

“Ultimately we are doing research with the broader goal of serving the Rakuten Ecosystem, services and products,” Cai says. “That’s one reason we want a model that is good in Japanese.” Cai hopes to see a future in which the team’s achievement will bear fruit in myriad ways for Rakuten’s clients and customers.

Whatever the future holds, Rakuten’s contribution to Japan’s open-source AI community has something exciting for everyone.


*1 Results of evaluation tests carried out on LM Evaluation Harness from January to March 2024. The models placed at the top among the open Japanese language LLMs in their respective categories.

The models are released under the Apache 2.0 license and are available from the official Rakuten Group Hugging Face repository.
