4-Nov 5-Nov 6-Nov
8:15   Regency Hotel --> UM Campus (8:15)  
8:30 Regency Hotel --> UM Campus (8:30) Regency Hotel --> UM Campus (8:30)
Opening (8:45-9:15)
9:00 Writing Tips for Machine Translation Papers
Invited Talks
Better Translation by Modeling Translation Models
10:00 Best Paper Presentation (10:00-10:30)
Coffee Break (10:15-10:35)
10:30 Coffee Break (10:30-11:00) Coffee Break (10:30-10:50)
Oral Presentation - Machine Translation
Panel Discussion
11:00 Ensemble Learning for Machine Translation
Closing (12:20-13:00)
12:30 Lunch
13:00 Lunch
14:00 Deep Learning for Natural Language Processing and Machine Translation - Part I
14:30 Invited Talks
Macau Heritage Tour
15:30 Coffee Break (15:30-16:00)
16:00 Deep Learning for Natural Language Processing and Machine Translation - Part II
Coffee Break (16:00-16:20)
Oral Presentation - Natural Language Processing on Minority Languages
18:00 Welcome Reception
UM Campus --> Regency Hotel (18:15)
18:30 Banquet
20:30 UM Campus --> Regency Hotel (20:30)
21:00   Regency Hotel --> UM Campus (21:00)

Oral Presentation - Machine Translation

Paper ID: 22 Difficulties and Countermeasures of Structural Alignment Annotation of Chinese-English Discourse Treebank
Authors Wenhe Feng, Yancui Li, Guodong Zhou; Wuhan University
Abstract The Chinese-English Discourse Treebank (CEDT) is a parallel corpus annotated with aligned discourse structure information for Chinese and English. Its core task is alignment annotation under the basic principle of structural and relational alignment. Because of the differences between Chinese and English, there are some difficult issues in the annotation of discourse segmental, structural, relational and central alignment. Based on annotation practice, this paper summarizes the major difficulties at all levels and proposes corresponding solution strategies. Practice shows that these countermeasures can effectively improve the quality and efficiency of corpus annotation.

Paper ID: 24 Common Error Analysis of Machine Translation Output
Authors Hongmei Zhao, Qun Liu; Key Laboratory of Intelligent Information Processing
Abstract Based on the manual evaluation for the 9th China Workshop on Machine Translation, this paper gives a more detailed analysis, with statistics, of several common error types appearing in the generated translations, and identifies the main sources of these errors. This paper also presents the performance differences on each common error type between rule-based MT systems and statistical MT systems.

Paper ID: 25 Research on Parameter Learning of Non-linear Model in Statistic Machine Translation
Authors Huadong Chen, Shujian Huang, Yinggong Zhao, Xinyu Dai, Jiajun Chen; State Key Laboratory for Novel Software Technology, Nanjing University
Abstract Although the log-linear model has achieved great success in SMT, it still suffers from some drawbacks. First, the features used in the model must be linear with respect to the model itself. Second, it cannot further interpret the features to uncover their latent information, which makes feature design hard and often dependent on human knowledge and understanding. The log-linear model combines features with a simple dot product, which cannot make full use of the features and causes some effective features to be ignored. This paper explores the possibility of using a non-linear model in SMT and conducts experiments to determine which objective function and sampling method are suitable for this setting.
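
As an illustration of the dot-product combination mentioned in this abstract, the following is a minimal sketch of log-linear hypothesis scoring. It is not the authors' system: the feature values, the weights and the function names (`log_linear_score`, `best_hypothesis`) are hypothetical and chosen only for the example.

```python
def log_linear_score(features, weights):
    """Combine feature values h_i with weights lambda_i by a plain dot product."""
    return sum(w * h for w, h in zip(weights, features))


def best_hypothesis(candidates, weights):
    """Return the (text, features) pair with the highest log-linear score."""
    return max(candidates, key=lambda cand: log_linear_score(cand[1], weights))


# Hypothetical candidates: each has three log-domain feature values,
# e.g. translation model, language model and length penalty.
candidates = [
    ("translation A", [-2.0, -1.5, -4.0]),
    ("translation B", [-1.0, -3.0, -5.0]),
]
weights = [0.5, 0.3, 0.2]  # hypothetical feature weights

best_text, best_features = best_hypothesis(candidates, weights)
print(best_text, log_linear_score(best_features, weights))
```

Because the score is a weighted sum, each feature contributes independently of the others; a non-linear model of the kind the abstract discusses would replace `log_linear_score` with a function that can also capture interactions between features.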

Paper ID: 28 Data Selection via Semi-Supervised Recursive Autoencoders for SMT Domain Adaptation
Authors Yi Lu, Derek F. Wong, Lidia S. Chao, Longyue Wang; University of Macau
Abstract In this paper, we present a novel data selection approach based on semi-supervised recursive autoencoders. The model is trained to capture domain-specific features and is used to detect sentences that are relevant to a specific domain in a large general-domain corpus. The selected data are used to adapt the language model and the translation model to the target domain. Experiments were conducted on an in-domain corpus (IWSLT2014 Chinese-English TED Talk) and a general-domain corpus (UM-Corpus). We evaluated the proposed data selection model with both intrinsic and extrinsic evaluations to investigate the selection success rate (F-score) on pseudo data, as well as the translation quality (BLEU score) of the adapted SMT systems. Empirical results reveal that the proposed approach outperforms the state-of-the-art selection approach.

Paper ID: 3 Semantic computing in HowNet MT system
Authors Zhendong Dong, Qiang Dong, Changling Hao; HowNet Technologies Inc.
Abstract This paper describes the HowNet English-to-Chinese machine translation system (HowNet MT). HowNet MT is a rule-based system in which HowNet provides the common-sense knowledge support. After giving a comprehensive outline of the system, the paper introduces logical semantic relationships (LSR), which function as the core of HowNet MT. LSR is the goal of the analysis of the input English and the basis of the transfer and synthesis of the output Chinese. Through detailed examples, the paper shows the semantic computing in HowNet MT and its depth. The paper presents the main functions employed in the system in general, and the resolution of sense disambiguation and sense induction in particular.

Paper ID: 4 Sense Colony Testing in HowNet MT System
Authors Qiang Dong, Zhendong Dong, Changling Hao; HowNet Technologies Inc.
Abstract This paper is a companion piece to “Semantic computing in HowNet MT system”. In terms of its relatedness to syntax, ambiguity can be roughly divided into two types: syntax-related and non-syntax-related. The latter is closely associated with discourse and topics. This paper proposes and discusses an original method and technique for resolving the latter type of ambiguity, which is called the sense-colony-related type. The technique, named Sense Colony Testing (SCT), is wholly based on HowNet, can be operated bilingually, and can process text-level discourse. The paper describes SCT’s linguistic principle and technical mechanism, and presents SCT as a tool used in HowNet machine translation. Lastly, in the discussion section, the paper demonstrates the fineness of the features that an NLP system has to employ.

Poster / Demo

Paper ID: 5 A phrase table pruning technique based on fusion strategy
Authors Zhengshan Xue, Dakun Zhang, Qian Zhang, Jie Hao; Toshiba (China) R&D Center, Beijing
Abstract This paper proposes a mixed phrase table pruning method based on a fusion strategy, which combines the relevance pruning method, the significance pruning method and the entropy-based pruning method. The proposed method prunes the phrase table based on phrase hit counts, which reduces the risk, inherent in any single pruning criterion, of removing useful phrases from the phrase table. The experimental results show that our method achieves results as good as or even better than the baseline system while keeping only 3.8% of the phrase table. Compared with any single pruning method, this method achieves the same level of quality with a smaller phrase table.

Paper ID: 7 Reexaminating on Voting for Crowd Sourcing MT Evaluation
Authors Yiming Wang, Muyun Yang; Harbin Institute of Technology
Abstract We describe a model based on the Ranking Support Vector Machine (SVM) for handling crowdsourcing data. Our model focuses on how to use poor-quality crowdsourcing data to obtain high-quality ranked data. The data sets we use for model training and testing contain missing data. We found that our model achieves better results than the voting model in all cases in our experiments, including the ranking of two translations and of four translations.

Paper ID: 8 A Novel Hybrid Approach to Arabic Named Entity Recognition
Authors Mohamed A. Meselhi, Hitham M. Abo Bakr, Ibrahim Ziedan, Khaled Shaalan; Zagazig University
Abstract The Named Entity Recognition (NER) task is an essential preprocessing task for many Natural Language Processing (NLP) applications such as text summarization, document categorization and Information Retrieval, among others. NER systems follow either a rule-based approach or a machine learning approach. In this paper, we introduce a novel NER system for Arabic using a hybrid approach, which combines a rule-based approach and a machine learning approach in order to improve the performance of Arabic NER. The system is able to recognize three types of named entities: Person, Location and Organization. Experimental results on the ANERcorp dataset show that our hybrid approach achieves better performance than either the rule-based approach or the machine learning approach used separately.

Paper ID: 9 An English-Chinese Bi-Directional Hybrid Machine Translation System Guided by RBMT
Authors Hu Xiao-Peng, Yuan Qi, Geng Xin-Hui; China Center for Information Industry Development
Abstract This paper first reviews several typical techniques most commonly used and the most promising ones in the R&D of HMT guided by RBMT. It then gives a more detailed description of the various data-driven statistical approaches adopted by a practical English-Chinese bi-directional HMT system guided by RBMT that integrates linguistic knowledge models and statistical approaches developed by CCID. Such approaches include extracting glossaries, terminologies and translation templates from parallel and comparable corpora and extracting MWEs in native English from three-tuple comparable corpora. This paper also presents a comprehensive performance evaluation of this practical HMT system, illustrates typical applications of the system, and finally provides a vision for the future work.

Paper ID: 15 Local Phrase Reordering Model for Chinese-English Patent Machine Translation
Authors Xiaodie Liu, Yun Zhu, Yaohong Jin; Beijing Normal University, Institute of Chinese Information Processing
Abstract This paper describes a rule-based method to identify and reorder the translation units (the smallest units for reordering) within a long Chinese NP for Chinese-English patent machine translation. By analyzing the features of translation units within a long Chinese NP, we built formalized rules that use boundary words to recognize the boundaries of translation units and thus determine what to reorder. By comparing the orders of translation units within long Chinese and English NPs, we developed a strategy for reordering the translation units to conform to English usage. Finally, we used a rule-based MT system to test our work, and the experimental results showed that our rule-based method and strategy were very efficient.

Paper ID: 16 Distributed Word Representation based Translation Disambiguation for Context Vector
Authors Chunyue Zhang, Tiejun Zhao; China Center for Information Industry Development
Abstract In the task of extracting a bilingual lexicon from comparable corpora, the accuracy of the extracted lexicon is often influenced by the quality of the seed lexicon. Because many words are polysemous, a lot of noise is generated when context vectors are translated using the seed lexicon. This paper proposes a disambiguation method for context vectors based on distributed word representations to strengthen the seed lexicon. In the Chinese-to-English bilingual lexicon extraction task, the experiments show that the accuracy of the extracted lexicon is significantly improved over the standard approach.

Paper ID: 18 Integrating Syntactic Information into Phrase-Based Statistical Machine Translation
Authors Dingxin Song, Degen Huang; Dalian University of Technology
Abstract The phrases in phrase-based statistical machine translation are not grammatical ones, while the phrases in syntax-based statistical machine translation are grammatical phrases obtained from syntactic analysis. This paper proposes to integrate grammatical phrases into phrase-based machine translation in order to improve the performance of a Chinese-English machine translation system. First, the grammatical phrases are extracted from the syntactic trees in the two languages. Then, based on the phrase translation table and the phrase consistency principle, bilingual phrase translation pairs are formed. Finally, the pairs and syntactic features are added to the phrase table of the phrase-based machine translation system. The experimental results show that bilingual grammatical phrases can be used to improve the statistical machine translation system, and that integrating the syntactic features into the system improves its performance significantly (0.56 BLEU over the baseline system).

Paper ID: 19 Analysis of the Chinese - Portuguese Machine Translation of Chinese Localizers Qian and Hou
Authors Chunhui Lu, Ana Leal, Paulo Quaresma, Márcia Schmaltz; University of Macau
Abstract The focus of the present article is the two Chinese localizers qian (front) and hou (back), in their temporal function, in the process of Chinese-Portuguese machine translation. The work is part of the project Autema SynTree (Annotation and Analysis of Bilingual Syntactic Trees for Chinese/Portuguese). The text corpus used in the research is composed of 46 Chinese texts, extracted from The International Chinese Newsweekly and identified as source texts (ST); the target texts (TT) are translations into Portuguese produced by the Portuguese-Chinese Translator (PCT) and by humans. In Portuguese, prepositions of the transversal axis such as antes de and depois de are used to indicate the time before and after, corresponding to qian and hou in Chinese. Nevertheless, inconsistencies in the translation of the localizers are found in the output of the PCT when it is compared with the human translation (HT). On this basis, the present article presents the syntax rules developed to resolve the inconsistencies found in the PCT output. The translations and the proposed rules were evaluated with the BLEU metric.

Paper ID: 29 Effective Hypotheses Re-ranking Model in Statistical Machine Translation
Authors Yiming Wang, Longyue Wang, Derek F. Wong, Lidia S. Chao; University of Macau
Abstract In statistical machine translation, an effective way to improve translation quality is to regularize the posterior probabilities of translation hypotheses according to the information in the N-best list. In this paper, we present a novel approach that improves the final translation result by dynamically augmenting the translation scores of hypotheses derived from the N-best translation candidates. The proposed model was trained on the general-domain UM-Corpus and evaluated on IWSLT Chinese-English TED Talk data under document-level and sentence-level translation configurations. Empirical results reveal that the sentence-level translation model outperforms both the document-level model and the baseline system.

Demo ID: 1 TsinghuaAligner
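Authors Tsinghua Natural Language Processing and Computational Social Science Lab, Tsinghua University
Introduction TsinghuaAligner is a bilingual word alignment system developed by the Tsinghua Natural Language Processing and Computational Social Science Lab that automatically discovers the correspondences between words in two languages. The system has the following features: (1) Language independence: the system trains its alignment model on parallel corpora and can be used for any language; (2) Extensibility: the system is based on a log-linear model and can incorporate arbitrary knowledge sources; (3) Supervised training: model parameters can be optimized on annotated data with the minimum error rate training algorithm; (4) Unsupervised training: model parameters can be optimized quickly and accurately on unannotated data with contrastive learning and top-n sampling; (5) Rich structural constraints: many-to-many, ITG and block ITG constraints are supported; (6) Link posterior probabilities: the system can output a posterior probability for each link.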

Demo ID: 2 HowNet MT System
Authors Qiang Dong, Zhendong Dong, Changling Hao; HowNet Technologies Inc.
Introduction If you have never seen a rule-based MT system, this is a good chance to get to know how such a system is tempered. Please come, try it, and compare it with Google Translate or Baidu Translate yourself!

Demo ID: 3 NiuTrans Server and NiuParser
Authors Tong Xiao and Jingbo Zhu
Introduction NiuTrans Server and NiuParser are a multilingual MT system and a language parsing system, respectively. In this poster we will introduce the techniques, utilities and applications of NiuTrans Server and NiuParser.

Demo ID: 4 CloudTranslation (Yunyi)
Authors Xiaodong Shi, Yidong Chen, Boli Wang, Changxin Wu, Jianxi Zhen
Introduction Yunyi is a multilingual collaborative translation platform that brings Machine Translation, Translation Memory and Term Sharing to freelance translators, translation companies and translation clients using advanced cloud-based technology.

Oral Presentation - Natural Language Processing on Minority Languages

Paper ID: 10 Tibetan Functional Chunks Boundary Recognition Based on Syllables
Authors Tianhang Wang, Shumin Shi, Congjun Long, Heyan Huang; School of Computer Science and Technology, Beijing Institute of Technology
Abstract Tibetan syntactic functional chunk parsing aims at identifying the syntactic constituents in Tibetan sentences. In this paper, based on the Tibetan syntactic functional chunk description system, the authors propose a method that groups syllables instead of relying on word segmentation and labeling, and uses Conditional Random Fields (CRF) to identify the functional chunk boundaries of a sentence. In line with the actual characteristics of the Tibetan language, the method first identifies and extracts, through text preprocessing, the syntactic markers, which are composed of the Sticky written form and the non-Sticky written form, as identification features for syntactic functional chunk boundaries. It then identifies the syntactic functional chunk boundaries with CRF. Experiments were carried out on Tibetan corpora of 46783 syllables, and the precision, recall and F value reach 75.70%, 82.86% and 79.12%, respectively. The experimental results show that the proposed method is effective and can provide infrastructural support for machine translation and other natural language processing applications.
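
The reported F value is consistent with the usual balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition, which the abstract does not state explicitly; the function name `f_measure` is chosen only for this illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f_value = f_measure(75.70, 82.86)
print(round(f_value, 2))  # 79.12
```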

Paper ID: 11 Character Tagging-Based Word Segmentation For Uyghur
Authors Yating Yang, Chenggang Mi, Bo Ma, Rui Dong, Lei Wang, Xiao Li; The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Abstract To effectively obtain the information in Uyghur words, we present a novel method based on character tagging for Uyghur word segmentation. We propose five labels for the characters in a Uyghur word: Su, Bu, Iu, Eu and Au. With these labels, we treat Uyghur word segmentation as a sequence labeling procedure, which uses Conditional Random Fields (CRFs) as the basic labeling model. Experiments show that our method collects more features in Uyghur words and therefore significantly outperforms several traditional word segmentation models.

Paper ID: 13 A Statistical Method for Translating Chinese into Under-Resourced Minority Languages
Authors Lei Chen, Miao Li, Jian Zhang, Zede Zhu, Zhenxin Yang; Institute of Intelligent Machines, Chinese Academy of Sciences
Abstract In order to improve the performance of statistical machine translation between Chinese and minority languages, most of which are under-resourced languages with different word order and rich morphology, the paper proposes a method that incorporates syntactic information on the source side and morphological information on the target side to simultaneously reduce the differences in word order and morphology. First, based on the word alignment and the phrase structure trees of the source language, reordering rules are extracted automatically to adjust the word order on the source side. Then, a morphological segmentation method based on a Hidden Markov Model is adopted to obtain morphological information for the target language. In the experiments, we take Chinese-Mongolian translation as an example. A morpheme-level statistical machine translation system, built on the reordered source side and the segmented target side, achieves an improvement of 2.1 BLEU points over the standard phrase-based system.

Paper ID: 23 Chunk-based Dependency-to-String Model with Japanese Case Frame
Authors Jinan Xu, Peihao Wu, Jun Xie, Yujie Zhang; Beijing Jiaotong University; Beijing Samsung Telecom R&D Center
Abstract This paper proposes integrating Japanese case frames into the chunk-based dependency-to-string model. First, case frames are acquired from Japanese chunk-based dependency analysis results. Then the case frames are used to constrain rule extraction and decoding in the chunk-based dependency-to-string model. Experimental results show that the proposed method performs well on long-distance structural reordering and lexical translation, and achieves better performance than the hierarchical phrase-based model and the word-based dependency-to-string model on Japanese-to-Chinese test sets.

Paper ID: 27 Study on the Effect of Different Granularity on Uyghur Chinese Word Alignment
Authors Mairebaha Aili, Miliwan Xuehelaiti, Maihepureti Maimaiti; Xinjiang University
Abstract We tried to cope with the complex morphology of Uyghur by applying different schemes of morphological word segmentation to refine the word alignment and further improve the Uyghur-Chinese SMT results. With this method, we aimed, firstly, to minimize the effects of data sparseness by stemming Uyghur words; secondly, to produce more refined alignments by aligning Chinese words with meaningful Uyghur affixes; and lastly, to reduce the excessive sentence length that results from treating each affix as an alignment token. We applied these schemes to the training, development and test data. Experimental results show that this method has a positive effect on Uyghur-Chinese word alignment and, in turn, on machine translation.

Best Paper Presentation

Paper ID: 21 Making Language Model as Small as Possible in Statistical Machine Translation
Authors Yang Liu, Jiajun Zhang, Jie Hao, Dakun Zhang; Institute of Automation, Chinese Academy of Sciences; Toshiba (China) R&D Center
Abstract As one of the key components, the n-gram language model is most frequently used in statistical machine translation. Typically, a higher-order language model leads to better translation performance. However, a higher-order n-gram language model requires much more monolingual training data to avoid data sparseness. Furthermore, the model size increases exponentially as the n-gram order grows. In this paper, we investigate language model pruning techniques that aim at making the model as small as possible while preserving translation quality. Based on our investigation, we further propose to replace the higher-order n-grams with a low-order variable-based language model. Extensive experiments show that our method is very effective.
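
The exponential growth mentioned in this abstract can be seen from a simple upper bound: with a vocabulary of V word types there are at most V^n distinct n-grams of order n. The sketch below only evaluates that bound. The vocabulary size is a hypothetical value, the function name `max_distinct_ngrams` is chosen for this illustration, and real language models store only the n-grams observed in the training data, so actual sizes are far below the bound.

```python
def max_distinct_ngrams(vocab_size: int, order: int) -> int:
    """Upper bound on the number of distinct n-grams of a given order."""
    return vocab_size ** order


VOCAB_SIZE = 10_000  # hypothetical vocabulary size, for illustration only

for order in range(1, 6):
    print(order, max_distinct_ngrams(VOCAB_SIZE, order))
```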