AI models can add Navajo and related languages to online translators

Although Google Translate instantly recognizes over 100 languages, it still completely fails to recognize Navajo, the most extensive language in Native American languages. But according to new research from Dartmouth College, how artificial intelligence can accurately identify endangered Indigenous languages with near-perfect precision.
The study, presented at a May 1 conference in Albuquerque, found that researchers can train AI models to identify Navajo with 97-100% accuracy. The results show that major tech companies can easily expand their language recognition tools to include Native American languages and may support the conservation efforts of these endangered cultural treasures.
This study is a critical moment when many indigenous languages face extinction due to technological integration and minimal educational resources.
Close the digital language divide
The Dartmouth team discovered this issue while testing Google’s language identification tool (Langid), which powered Google Translate. When introducing the Navajo text, Langid repeatedly mistakes it for irrelevant languages such as Iceland, lingala or Wolof.
“By building on the ideas behind Langid, we found that classifiers can be developed to identify indigenous languages,” said Ivory Yang, a PhD candidate at Dartmouth, the first author of the study. “From Google’s perspective, adding a new language involves rigorous validation, which makes sense on scale. I hope to show that meaningful advancements are still possible even with limited resources.”
The researchers used a data set of 10,000 Navajo sentences to create a model that correctly identifies the language with significant consistency. More importantly, they found that this approach might extend to relevant languages that could even extend to less available data.
Bridges for related languages
The team’s findings suggest that Navajo can be used as a language bridge to help translation tools identify other relevant languages in the Athabaskan family, including Apache and several native Alaska languages.
When they tested their own models on languages such as Western Apache, Mescalero Apache, Jicarilla Apache, and Lipan Apache, the dataset used sometimes uses 20 sentences, which identifies them as Navajo due to their language similarity.
“We noticed that they are so linguistically similar to Navajo that they can eventually be used to identify these related languages without the same data,” Young explained. “This could mean that higher resource languages can act as bridges to general resource languages.”
This bridge concept may be crucial to preserve the language of remaining speakers or limited written material.
Exceeded translation recognition
This work represents only the first step in making digital tools more inclusive of indigenous languages. Simply being recognized by technology is the basic starting point of any language in the digital age.
“Many indigenous languages even lack the basic dignity of online recognition,” said Soroush Vosoughhi, senior author of the paper and assistant professor of computer science at Dartmouth. “Revitalization begins with visibility, and visibility begins with recognition.”
The current research is specifically focused on language recognition, but the team’s ambitions have further expanded:
- Extended model to identify Native American languages outside the Athabaskan family
- Develop translation capabilities in Navajo and related languages
- Create more comprehensive language tools to support language learning and preservation
- Explore partnerships with indigenous communities to ensure technology respects cultural values
“The next step in the team’s latest model is to translate the original sentences into the Navajo people,” Yang said. “Basically, we want to convert from identification to translation. The ultimate goal is translation, but it’s more difficult. Now, we know we can do identification.”
The foundation of revival
The study is part of a larger Dartmouth initiative that focuses on using AI to help revitalize endangered languages. The team previously created a framework called Nüshurescue, which transformed the Chinese into Nüshu, a script of an endangered century history that was traditionally used by women in the southern Hunan province.
What makes this work particularly important? In a world of increasingly connected connections, digital existence often determines the survival prospects of a language, and technologies that support indigenous languages can help reverse centuries of decline.
For an estimated 350,000 speakers of Navajo and relevant Asabasian language speakers, the language is recognized by mainstream technology platforms more than just convenience – it is a form of cultural verification that can consolidate ongoing conservation efforts and inspire younger generations to maintain their linguistic heritage.
As AI language technology continues to evolve, the question remains whether major tech companies will incorporate these endangered languages into their systems, or if a more localized community-driven approach will eventually prove more effective for language revitalization in the digital age.
If our report has been informed or inspired, please consider donating. No matter how big or small, every contribution allows us to continue to provide accurate, engaging and trustworthy scientific and medical news. Independent news takes time, energy and resources – your support ensures that we can continue to reveal the stories that matter most to you.
Join us to make knowledge accessible and impactful. Thank you for standing with us!