Terminology Extraction

Information on extracting terminology

Introduction

Terminology is the set of terms that identify a specific topic. Terminology extraction is the process of automatically identifying those terms in a text.

The idea is to compare the frequency of words in a given document with their frequency in the language. Words which appear very frequently in the document but rarely in the language are probably terms.
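As a rough illustration of this idea, the sketch below ranks the words of a document by the ratio between their relative frequency in the document and their relative frequency in the language. The function names, the `floor` value for unseen words and the reference frequencies are all made up for the example; this is not the extractor's actual code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def term_candidates(document, reference_freq, floor=1e-6, top=5):
    """Rank words by (relative frequency in document) / (relative frequency in language).

    `reference_freq` maps a word to its relative frequency in a generic corpus.
    Words missing from it get the hypothetical `floor` frequency.
    """
    tokens = tokenize(document)
    counts = Counter(tokens)
    total = len(tokens)
    scores = {
        word: (count / total) / reference_freq.get(word, floor)
        for word, count in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy reference frequencies (invented values, for illustration only).
reference_freq = {
    "the": 0.06, "is": 0.01, "a": 0.02, "of": 0.03, "to": 0.025,
    "part": 0.0005, "opens": 0.0001, "pump": 0.00002, "valve": 0.00001,
}
document = "The valve is a part of the pump. The valve opens to the pump."

print(term_candidates(document, reference_freq, top=2))
```

With these toy numbers, "valve" and "pump" come out on top because they are frequent in the document but rare in the reference frequencies, while "the" scores low despite being the most frequent word in the document.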

Technology

The extractor uses Poisson statistics, Maximum Likelihood Estimation and Inverse Document Frequency to compare the frequency of words in a given document with their frequency in a generic corpus of 100 million words per language. It uses a probabilistic part-of-speech tagger to take into account the probability that a particular sequence of words could be a term. It creates n-grams of words by minimizing the relative entropy.
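To give a feel for the Poisson part, here is a minimal sketch under simple assumptions: the corpus rate of a word is its maximum likelihood estimate (corpus count divided by corpus size), and the occurrences of that word in a document of a given length follow a Poisson distribution with that rate. The score is how unlikely it is to see the word at least as often as observed. The names `poisson_tail` and `surprise` are hypothetical, and how the real system combines this with IDF, part-of-speech tagging and n-grams is not described here.

```python
import math


def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected), with expected > 0.

    The tail is summed directly in log space to avoid the precision loss
    of computing 1 - CDF when the tail is tiny.
    """
    if observed <= 0:
        return 1.0
    # Sum far enough past both the observed count and the mean.
    upper = max(observed, int(expected)) + int(10 * math.sqrt(expected)) + 50
    total = 0.0
    for k in range(observed, upper + 1):
        log_pmf = -expected + k * math.log(expected) - math.lgamma(k + 1)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def surprise(observed, doc_length, corpus_rate):
    """-log10 of the Poisson tail probability: higher means more term-like."""
    expected = doc_length * corpus_rate
    tail = max(poisson_tail(observed, expected), 1e-300)
    return -math.log10(tail)


# Invented numbers: a rare word seen 5 times in a 1,000-word document,
# versus a very common word seen exactly as often as expected.
print(surprise(5, 1000, 0.00001))
print(surprise(60, 1000, 0.06))
```

The rare word gets a much higher score than the common one, even though the common word appears far more often in absolute terms.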

Why have we developed this?

Translated developed this technology to help its translators spot the difficulties in a document and to simplify the process of creating glossaries.

We also use it to improve search results in traditional search engines (e.g. Google) by giving a better estimate of how relevant a keyword is to a document.

I want it!

If you are interested in this technology, please read more on Translated Labs and our services for natural language processing.

I can do better!

We are constantly looking to hire great engineers with a global mindset.
Get in touch if you think you can improve any of these applications.


Explore our experiments

Spoken Language Identifier

The Spoken Language Identifier automatically detects the language of a spoken text. You can use it to classify recordings from 1 second to 1 minute. It currently supports 8 languages.

Learn more or Get API

Terminology Extractor

This tool automatically extracts the terminology of a technical topic from a written text. It can help translators identify the difficulties in a document, and simplify the process of creating glossaries.

Learn more or Get API

Readability Analyzer

Written information, especially on the Internet, must be easy to read and well structured. This application helps you understand if a text is easily readable, or if it needs improvement.

Learn more or Get API

Language Identifier

The Language Identifier automatically detects the language of a written text. It can also be used to identify the topic of a written text in a language you do not understand.

Learn more

Semantic Relationships

What do the words airplane, bird, and helicopter have in common? This application searches for semantic relationships in a text by analyzing the statistical properties of words.

Learn more

Translation Party

What happens when you translate an English sentence into Japanese and then back into English, over and over, as if in an infinite loop? Well, give it a try! And don't forget to share the funniest results with your friends.

Learn more